public class WARCExtractCDXJob extends WARCBatchJob
A CDX file contains sorted lines of metadata from the WARC files, with each line followed by the file and offset at which the record was found, and optionally a checksum. The timeout of this job is 7 days. For the CDX file format, see http://www.archive.org/web/researcher/cdx_file_format.php
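To make the line layout described above concrete, here is a minimal sketch of a CDX-style line and of the 7-day timeout expressed in milliseconds. The class `CdxLineSketch`, its method names, the space separator, and the choice and order of metadata fields (URL, date, MIME type) are assumptions made for illustration; they are not taken from NetarchiveSuite, and the real job's output format may differ.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.TimeUnit;

/** Illustrative only: a hypothetical helper, not part of NetarchiveSuite. */
final class CdxLineSketch {

    /** Seven days in milliseconds, matching the timeout stated for this job. */
    static long sevenDaysInMillis() {
        return TimeUnit.DAYS.toMillis(7);
    }

    /**
     * Builds one line: record metadata first, then the file name and offset,
     * then the checksum if one is supplied. The field order is an assumption.
     */
    static String cdxLine(String url, String date, String mimeType,
                          String fileName, long offset, String checksum) {
        StringBuilder sb = new StringBuilder();
        sb.append(url).append(' ').append(date).append(' ').append(mimeType);
        sb.append(' ').append(fileName).append(' ').append(offset);
        if (checksum != null) {
            sb.append(' ').append(checksum);
        }
        return sb.toString();
    }

    /** Returns a sorted copy of the lines, since a CDX file holds sorted lines. */
    static List<String> sorted(List<String> lines) {
        List<String> copy = new ArrayList<>(lines);
        Collections.sort(copy);
        return copy;
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        lines.add(cdxLine("http://b.example/", "20160102030405", "text/html",
                "file-1.warc", 2048L, null));
        lines.add(cdxLine("http://a.example/", "20160102030405", "text/html",
                "file-1.warc", 512L, null));
        for (String line : sorted(lines)) {
            System.out.println(line);
        }
        System.out.println(sevenDaysInMillis());
    }
}
```

The sketch sorts the lines itself only to mirror the statement that a CDX file is sorted; whether sorting happens inside the job or in a later step is not stated here.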
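Nested classes/interfaces inherited from class FileBatchJob: FileBatchJob.ExceptionOccurrence
Fields inherited from class WARCBatchJob: noOfRecordsProcessed
Fields inherited from class FileBatchJob: batchJobTimeout, exceptions, filesFailed, noOfFilesProcessed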
Constructor and Description |
---|
WARCExtractCDXJob(): Equivalent to WARCExtractCDXJob(true). |
WARCExtractCDXJob(boolean includeChecksum): Constructs a new job for extracting CDX indexes. |
Modifier and Type | Method and Description |
---|---|
void | finish(OutputStream os): End of the batch job. |
WARCBatchFilter | getFilter(): Filters out the NON-RESPONSE records. |
void | initialize(OutputStream os): Initialize any data needed (none). |
void | processRecord(org.archive.io.warc.WARCRecord sar, OutputStream os): Process this entry, reading metadata into the output stream. |
String | toString() |
Methods inherited from class WARCBatchJob: getExceptionArray, handleException, noOfRecordsProcessed, processFile
Methods inherited from class FileBatchJob: addException, addFinishException, addInitializeException, getBatchJobTimeout, getExceptions, getFilenamePattern, getFilesFailed, getNoOfFilesProcessed, maxExceptionsReached, postProcess, processOnlyFileNamed, processOnlyFilesMatching, processOnlyFilesMatching, processOnlyFilesNamed, setBatchJobTimeout
public WARCExtractCDXJob(boolean includeChecksum)
Parameters:
includeChecksum - If true, an MD5 checksum is also written for each record. If false, it is not.
public WARCExtractCDXJob()
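As a rough illustration of what the `includeChecksum` flag controls, the sketch below computes an MD5 digest with the standard `java.security.MessageDigest` API and appends it to a line only when the flag is true. The class `Md5ChecksumSketch`, its methods, the hexadecimal encoding, and the choice of which bytes are hashed are all assumptions; the page does not say how the job derives or formats its checksum.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Illustrative only: a hypothetical helper, not part of NetarchiveSuite. */
final class Md5ChecksumSketch {

    /** Returns the MD5 digest of the given bytes as lowercase hexadecimal. */
    static String md5Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(data);
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required of every Java platform implementation.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    /** Appends the checksum to the line only when includeChecksum is true. */
    static String withOptionalChecksum(String line, byte[] payload,
                                       boolean includeChecksum) {
        return includeChecksum ? line + " " + md5Hex(payload) : line;
    }

    public static void main(String[] args) {
        byte[] payload = "hello".getBytes(StandardCharsets.UTF_8);
        System.out.println(withOptionalChecksum("record", payload, true));
        System.out.println(withOptionalChecksum("record", payload, false));
    }
}
```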
public WARCBatchFilter getFilter()
getFilter in class WARCBatchJob
See Also:
WARCBatchJob.getFilter()
public void initialize(OutputStream os)
initialize in class WARCBatchJob
Parameters:
os - The OutputStream to which output data is written
See Also:
WARCBatchJob.initialize(OutputStream)
public void processRecord(org.archive.io.warc.WARCRecord sar, OutputStream os)
processRecord in class WARCBatchJob
Parameters:
sar - the object to be processed.
os - The OutputStream to which output data is written
Throws:
IOFailure - on trouble reading WARC record data
See Also:
WARCBatchJob.processRecord(WARCRecord, OutputStream)
public void finish(OutputStream os)
finish in class WARCBatchJob
Parameters:
os - The OutputStream to which output data is written
See Also:
WARCBatchJob.finish(OutputStream)
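The four methods documented above suggest a simple lifecycle: initialize once, pass each record through the filter, process the accepted ones, then finish. The following self-contained sketch mimics that flow with stand-in types. `BatchLifecycleSketch`, its nested `Record` class, and the call order shown in `run` are assumptions for illustration; the real driver lives in the parent batch job classes and is not described on this page.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

/** Illustrative only: a hypothetical stand-in, not part of NetarchiveSuite. */
final class BatchLifecycleSketch {

    /** Hypothetical stand-in for a WARC record: a record type and a URL. */
    static final class Record {
        final String type;
        final String url;

        Record(String type, String url) {
            this.type = type;
            this.url = url;
        }
    }

    /** Stand-in for the filter: only response records are accepted. */
    static boolean accept(Record r) {
        return "response".equals(r.type);
    }

    /** Nothing to set up, mirroring "Initialize any data needed (none)". */
    static void initialize(OutputStream os) {
    }

    /** Writes one line of (simplified) metadata for the record. */
    static void processRecord(Record r, OutputStream os) throws IOException {
        os.write((r.url + "\n").getBytes(StandardCharsets.UTF_8));
    }

    /** Flushing at the end is an assumption of this sketch. */
    static void finish(OutputStream os) throws IOException {
        os.flush();
    }

    /** Assumed call order: initialize, filter and process each record, finish. */
    static String run(List<Record> records) {
        ByteArrayOutputStream os = new ByteArrayOutputStream();
        try {
            initialize(os);
            for (Record r : records) {
                if (accept(r)) {
                    processRecord(r, os);
                }
            }
            finish(os);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(os.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        List<Record> records = Arrays.asList(
                new Record("warcinfo", "urn:info"),
                new Record("response", "http://a.example/"));
        System.out.print(run(records));
    }
}
```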
Copyright © 2005–2016 The Royal Danish Library, the Danish State and University Library, the National Library of France and the Austrian National Library. All rights reserved.