dk.netarkivet.common.utils.cdx
Class WARCExtractCDXJob
java.lang.Object
dk.netarkivet.common.utils.batch.FileBatchJob
dk.netarkivet.common.utils.warc.WARCBatchJob
dk.netarkivet.common.utils.cdx.WARCExtractCDXJob
- All Implemented Interfaces:
- java.io.Serializable
public class WARCExtractCDXJob
- extends WARCBatchJob
Batch job that extracts information to create a CDX file.
A CDX file contains sorted lines of metadata from the WARC files. Each
line is followed by the name of the file and the offset at which the
record was found, and optionally by a checksum.
The timeout of this job is 7 days.
See http://www.archive.org/web/researcher/cdx_file_format.php
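As an illustration of the line layout described above, here is a minimal sketch in Java. It is not the code used by WARCExtractCDXJob: the class CdxLineSketch, its method names, the space separator, and the sample field values are all assumptions made for this example. It only shows metadata fields followed by a file name, an offset, and an optional checksum, with sorting as a separate step.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of NetarchiveSuite.
class CdxLineSketch {

    /**
     * Joins the metadata fields, then appends the file name and offset,
     * then the checksum if one is given (null means "no checksum").
     * The single-space separator is an assumption for this sketch.
     */
    static String formatLine(List<String> metadata, String file,
                             long offset, String checksum) {
        List<String> fields = new ArrayList<>(metadata);
        fields.add(file);
        fields.add(Long.toString(offset));
        if (checksum != null) {
            fields.add(checksum);
        }
        return String.join(" ", fields);
    }

    /** Returns a sorted copy of the lines, leaving the input untouched. */
    static List<String> sortedLines(List<String> lines) {
        List<String> copy = new ArrayList<>(lines);
        Collections.sort(copy);
        return copy;
    }

    public static void main(String[] args) {
        // Sample values are invented for illustration only.
        List<String> lines = Arrays.asList(
            formatLine(Arrays.asList("http://example.org/b", "20240101000000"),
                       "sample.warc", 1024L, null),
            formatLine(Arrays.asList("http://example.org/a", "20240101000000"),
                       "sample.warc", 0L, null));
        for (String line : sortedLines(lines)) {
            System.out.println(line);
        }
    }
}
```

Sorting is kept in its own method because the description above says only that the CDX file contains sorted lines, not which component does the sorting.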
- See Also:
- Serialized Form
Constructor Summary
WARCExtractCDXJob()
Equivalent to WARCExtractCDXJob(true).
WARCExtractCDXJob(boolean includeChecksum)
Constructs a new job for extracting CDX indexes.
Method Summary
void finish(java.io.OutputStream os)
End of the batch job.
WARCBatchFilter getFilter()
Filters out the NON-RESPONSE records.
void initialize(java.io.OutputStream os)
Initialize any data needed (none).
void processRecord(org.archive.io.warc.WARCRecord sar, java.io.OutputStream os)
Process this entry, reading metadata into the output stream.
java.lang.String toString()
Methods inherited from class dk.netarkivet.common.utils.batch.FileBatchJob
addException, addFinishException, addInitializeException, getBatchJobTimeout, getExceptions, getFilenamePattern, getFilesFailed, getNoOfFilesProcessed, maxExceptionsReached, postProcess, processOnlyFileNamed, processOnlyFilesMatching, processOnlyFilesMatching, processOnlyFilesNamed, setBatchJobTimeout
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
WARCExtractCDXJob
public WARCExtractCDXJob(boolean includeChecksum)
- Constructs a new job for extracting CDX indexes.
- Parameters:
includeChecksum
- If true, an MD5 checksum is also
written for each record. If false, it is not.
WARCExtractCDXJob
public WARCExtractCDXJob()
- Equivalent to WARCExtractCDXJob(true).
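The effect of the includeChecksum flag can be pictured with the following sketch, which computes an MD5 digest with the standard java.security.MessageDigest class and appends it to a line only when the flag is true. The class Md5ChecksumSketch and its methods are hypothetical. Which bytes the real job digests, and how it formats the result, are not stated in this documentation, so the hex encoding and the space separator here are assumptions.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of NetarchiveSuite.
class Md5ChecksumSketch {

    /** Returns the MD5 digest of the data as lowercase hexadecimal. */
    static String md5Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(data);
            StringBuilder sb = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                // Mask to 0..255 so negative bytes format as two hex digits.
                sb.append(String.format("%02x", b & 0xff));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on standard Java platforms.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    /** Appends the checksum to the line only when includeChecksum is true. */
    static String lineWithOptionalChecksum(String line, byte[] payload,
                                           boolean includeChecksum) {
        return includeChecksum ? line + " " + md5Hex(payload) : line;
    }

    public static void main(String[] args) {
        byte[] payload = "example payload".getBytes(StandardCharsets.UTF_8);
        System.out.println(lineWithOptionalChecksum("entry", payload, true));
        System.out.println(lineWithOptionalChecksum("entry", payload, false));
    }
}
```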
getFilter
public WARCBatchFilter getFilter()
- Filters out the NON-RESPONSE records.
- Overrides:
getFilter
in class WARCBatchJob
- Returns:
- The filter that defines what WARC records are wanted
in the output CDX file.
- See Also:
WARCBatchJob.getFilter()
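To make the idea of a response-only filter concrete, the sketch below expresses it as a plain java.util.function.Predicate over a map of WARC header fields. This is not the WARCBatchFilter API: the class ResponseOnlyFilterSketch and the use of a Map for headers are assumptions for this example. The WARC-Type header name and its "response" value come from the WARC file format.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical stand-in for a record filter, not the WARCBatchFilter class.
class ResponseOnlyFilterSketch {

    /**
     * Accepts a record only if its WARC-Type header is "response".
     * A missing header yields false, so such records are filtered out.
     */
    static final Predicate<Map<String, String>> RESPONSE_ONLY =
        headers -> "response".equalsIgnoreCase(headers.get("WARC-Type"));

    public static void main(String[] args) {
        Map<String, String> response = new HashMap<>();
        response.put("WARC-Type", "response");

        Map<String, String> request = new HashMap<>();
        request.put("WARC-Type", "request");

        System.out.println(RESPONSE_ONLY.test(response));
        System.out.println(RESPONSE_ONLY.test(request));
    }
}
```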
initialize
public void initialize(java.io.OutputStream os)
- Initialize any data needed (none).
- Specified by:
initialize
in class WARCBatchJob
- Parameters:
os
- The OutputStream to which output data is written
- See Also:
WARCBatchJob.initialize(OutputStream)
processRecord
public void processRecord(org.archive.io.warc.WARCRecord sar,
java.io.OutputStream os)
- Process this entry, reading metadata into the output stream.
- Specified by:
processRecord
in class WARCBatchJob
- Parameters:
sar
- the object to be processed.
os
- The OutputStream to which output data is written
- Throws:
IOFailure
- on trouble reading WARC record data
- See Also:
WARCBatchJob.processRecord(WARCRecord, OutputStream)
finish
public void finish(java.io.OutputStream os)
- End of the batch job.
- Specified by:
finish
in class WARCBatchJob
- Parameters:
os
- The OutputStream to which output data is written
- See Also:
WARCBatchJob.finish(OutputStream)
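The initialize, processRecord, and finish methods documented above all receive the same kind of OutputStream argument. One plausible reading is a simple lifecycle: initialize once, process each record, then finish. The sketch below models that order with stand-in types. The RecordJob interface, the LineJob class, the run method, and the use of String in place of WARCRecord are all hypothetical; the actual calling sequence is determined by the batch framework and is not specified here.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of a batch job lifecycle, not the NetarchiveSuite API.
class BatchLifecycleSketch {

    /** Stand-in for a record-oriented batch job; records are plain strings. */
    interface RecordJob {
        void initialize(OutputStream os);
        void processRecord(String record, OutputStream os);
        void finish(OutputStream os);
    }

    /** Writes one line per record; needs no setup or teardown. */
    static class LineJob implements RecordJob {
        @Override
        public void initialize(OutputStream os) {
            // Nothing to initialize in this sketch.
        }

        @Override
        public void processRecord(String record, OutputStream os) {
            try {
                os.write((record + "\n").getBytes(StandardCharsets.UTF_8));
            } catch (IOException e) {
                // Wrapped in an unchecked exception for this sketch.
                throw new UncheckedIOException(e);
            }
        }

        @Override
        public void finish(OutputStream os) {
            // Nothing to clean up in this sketch.
        }
    }

    /** Assumed order: initialize once, process every record, then finish. */
    static String run(RecordJob job, List<String> records) {
        ByteArrayOutputStream os = new ByteArrayOutputStream();
        job.initialize(os);
        for (String record : records) {
            job.processRecord(record, os);
        }
        job.finish(os);
        return new String(os.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.print(run(new LineJob(), Arrays.asList("first", "second")));
    }
}
```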
toString
public java.lang.String toString()
- Overrides:
toString
in class java.lang.Object
- Returns:
- Human-readable description of this instance.