Package dk.netarkivet.common.utils.cdx
Class WARCExtractCDXJob
- java.lang.Object
- dk.netarkivet.common.utils.batch.FileBatchJob
- dk.netarkivet.common.utils.warc.WARCBatchJob
- dk.netarkivet.common.utils.cdx.WARCExtractCDXJob
- All Implemented Interfaces:
Serializable
public class WARCExtractCDXJob extends WARCBatchJob
Batch job that extracts information to create a CDX file. A CDX file contains sorted lines of metadata from the WARC files; each line is followed by the file and offset at which the record was found, and optionally by a checksum. The timeout of this job is 7 days. See http://www.archive.org/web/researcher/cdx_file_format.php
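The line layout described above can be illustrated with a small, self-contained sketch. It is not taken from NetarchiveSuite: the class CdxLineSketch, its methods, the space separator and the sample metadata fields are all assumptions made for illustration. It only shows metadata fields followed by file name, offset and an optional checksum, and a sort over the resulting lines.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Illustrative only: a hypothetical helper, not part of NetarchiveSuite. */
class CdxLineSketch {

    /**
     * Joins the metadata fields with spaces, then appends the file name,
     * the record offset and, if one is given, a checksum.
     * The separator and field order are assumptions of this sketch.
     */
    static String cdxLine(List<String> metadata, String file, long offset, String checksum) {
        List<String> fields = new ArrayList<>(metadata);
        fields.add(file);
        fields.add(Long.toString(offset));
        if (checksum != null) {
            fields.add(checksum);
        }
        return String.join(" ", fields);
    }

    /** Returns a sorted copy of the lines; the input list is left unchanged. */
    static List<String> sorted(List<String> lines) {
        List<String> copy = new ArrayList<>(lines);
        Collections.sort(copy);
        return copy;
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        lines.add(cdxLine(Arrays.asList("http://example.org/b", "20200101000000", "text/html"),
                "sample.warc", 2048L, null));
        lines.add(cdxLine(Arrays.asList("http://example.org/a", "20200101000000", "text/html"),
                "sample.warc", 512L, null));
        for (String line : sorted(lines)) {
            System.out.println(line);
        }
    }
}
```

The sort is shown here only because a CDX file holds sorted lines; this page does not say whether the job itself or a later step performs it.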
- See Also:
- Serialized Form
-
Nested Class Summary
-
Nested classes/interfaces inherited from class dk.netarkivet.common.utils.batch.FileBatchJob
FileBatchJob.ExceptionOccurrence
-
Field Summary
-
Fields inherited from class dk.netarkivet.common.utils.warc.WARCBatchJob
noOfRecordsProcessed
-
Fields inherited from class dk.netarkivet.common.utils.batch.FileBatchJob
batchJobTimeout, exceptions, filesFailed, noOfFilesProcessed
-
Constructor Summary
Constructors
Constructor / Description
WARCExtractCDXJob()
Equivalent to WARCExtractCDXJob(true).
WARCExtractCDXJob(boolean includeChecksum)
Constructs a new job for extracting CDX indexes.
-
Method Summary
Modifier and Type / Method / Description
void finish(OutputStream os)
End of the batch job.
WARCBatchFilter getFilter()
Filters out the NON-RESPONSE records.
void initialize(OutputStream os)
Initialize any data needed (none).
void processRecord(org.archive.io.warc.WARCRecord sar, OutputStream os)
Process this entry, reading metadata into the output stream.
String toString()
-
Methods inherited from class dk.netarkivet.common.utils.warc.WARCBatchJob
getExceptionArray, handleException, noOfRecordsProcessed, processFile
-
Methods inherited from class dk.netarkivet.common.utils.batch.FileBatchJob
addException, addFinishException, addInitializeException, getBatchJobTimeout, getExceptions, getFilenamePattern, getFilesFailed, getNoOfFilesProcessed, maxExceptionsReached, postProcess, processOnlyFileNamed, processOnlyFilesMatching, processOnlyFilesMatching, processOnlyFilesNamed, setBatchJobTimeout
-
Constructor Detail
-
WARCExtractCDXJob
public WARCExtractCDXJob(boolean includeChecksum)
Constructs a new job for extracting CDX indexes.
- Parameters:
includeChecksum - If true, an MD5 checksum is also written for each record. If false, no checksum is written.
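To make the effect of the includeChecksum flag concrete, the following standalone sketch appends a hex-encoded MD5 digest to a line only when the flag is set. The class ChecksumOptionSketch and its methods are hypothetical. Which bytes the real job digests, and how it encodes the result, is not stated on this page, so hashing a payload byte array and hex-encoding it are assumptions.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Illustrative only: a hypothetical helper, not part of NetarchiveSuite. */
class ChecksumOptionSketch {

    /** Returns the MD5 digest of the data as lowercase hexadecimal. */
    static String md5Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(data);
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                // Mask to an unsigned value so negative bytes format as two hex digits.
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    /** Appends the checksum to the line only when includeChecksum is true. */
    static String withOptionalChecksum(String line, byte[] payload, boolean includeChecksum) {
        return includeChecksum ? line + " " + md5Hex(payload) : line;
    }

    public static void main(String[] args) {
        byte[] payload = "hello".getBytes(StandardCharsets.UTF_8);
        System.out.println(withOptionalChecksum("sample.warc 512", payload, true));
        System.out.println(withOptionalChecksum("sample.warc 512", payload, false));
    }
}
```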
-
WARCExtractCDXJob
public WARCExtractCDXJob()
Equivalent to WARCExtractCDXJob(true).
-
Method Detail
-
getFilter
public WARCBatchFilter getFilter()
Filters out the NON-RESPONSE records.
- Overrides:
getFilter in class WARCBatchJob
- Returns:
- The filter that defines what WARC records are wanted in the output CDX file.
- See Also:
WARCBatchJob.getFilter()
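As a rough illustration of what filtering out non-response records means, the sketch below keeps a record only if its WARC-Type header is "response". It does not use WARCBatchFilter, whose API is not shown on this page. The class ResponseFilterSketch and the representation of a record as a header map are assumptions of this sketch.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Predicate;

/** Illustrative only: a hypothetical stand-in, not the real WARCBatchFilter. */
class ResponseFilterSketch {

    /** Accepts a record only when its WARC-Type header equals "response". */
    static final Predicate<Map<String, String>> RESPONSE_ONLY =
            headers -> "response".equals(headers.get("WARC-Type"));

    public static void main(String[] args) {
        Map<String, String> response = new HashMap<>();
        response.put("WARC-Type", "response");

        Map<String, String> request = new HashMap<>();
        request.put("WARC-Type", "request");

        // Only the response record would be passed on for indexing.
        System.out.println(RESPONSE_ONLY.test(response));
        System.out.println(RESPONSE_ONLY.test(request));
    }
}
```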
-
initialize
public void initialize(OutputStream os)
Initialize any data needed (none).
- Specified by:
initialize in class WARCBatchJob
- Parameters:
os - The OutputStream to which output data is written
- See Also:
WARCBatchJob.initialize(OutputStream)
-
processRecord
public void processRecord(org.archive.io.warc.WARCRecord sar, OutputStream os)
Process this entry, reading metadata into the output stream.
- Specified by:
processRecord in class WARCBatchJob
- Parameters:
sar - the object to be processed.
os - The OutputStream to which output data is written
- Throws:
IOFailure - on trouble reading WARC record data
- See Also:
WARCBatchJob.processRecord(WARCRecord, OutputStream)
-
finish
public void finish(OutputStream os)
End of the batch job.
- Specified by:
finish in class WARCBatchJob
- Parameters:
os - The OutputStream to which output data is written
- See Also:
WARCBatchJob.finish(OutputStream)
-