dk.netarkivet.common.utils.cdx
Class ExtractCDXJob
java.lang.Object
dk.netarkivet.common.utils.batch.FileBatchJob
dk.netarkivet.common.utils.arc.ARCBatchJob
dk.netarkivet.common.utils.cdx.ExtractCDXJob
- All Implemented Interfaces:
- java.io.Serializable
public class ExtractCDXJob
extends ARCBatchJob
Batch job that extracts information to create a CDX file.
A CDX file contains sorted lines of metadata about the records in the
ARC files. Each line ends with the name of the file and the offset at
which the record was found, optionally followed by a checksum.
The timeout of this job is 7 days.
See http://www.archive.org/web/researcher/cdx_file_format.php
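To make the line layout described above concrete, the following is a minimal sketch of how one CDX-style line could be assembled. CdxLineSketch and format are hypothetical names that are not part of this package, and the choice of metadata fields and the space separator are assumptions; only the ordering (metadata, then file, then offset, then an optional checksum) is taken from the description above.

```java
import java.util.List;

/** Hypothetical helper, not part of NetarchiveSuite: shows the general shape of one CDX line. */
final class CdxLineSketch {
    private CdxLineSketch() {}

    /**
     * Joins the metadata fields with spaces, then appends the ARC file name,
     * the record offset and, when it is not null, a checksum.
     * The field selection and separator are assumptions made for illustration.
     */
    static String format(List<String> metadata, String arcFile, long offset, String checksum) {
        StringBuilder line = new StringBuilder(String.join(" ", metadata));
        line.append(' ').append(arcFile).append(' ').append(offset);
        if (checksum != null) {
            line.append(' ').append(checksum);
        }
        return line.toString();
    }
}
```

Passing null as the checksum in this sketch yields a line that ends at the offset, which mirrors the "optionally a checksum" wording above.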
- See Also:
- Serialized Form
Constructor Summary
ExtractCDXJob()
Equivalent to ExtractCDXJob(true).
ExtractCDXJob(boolean includeChecksum)
Constructs a new job for extracting CDX indexes.
Method Summary
void finish(java.io.OutputStream os)
End of the batch job.
ARCBatchFilter getFilter()
Filter out the filedesc: headers.
void initialize(java.io.OutputStream os)
Initialize any data needed (none).
void processRecord(org.archive.io.arc.ARCRecord sar, java.io.OutputStream os)
Process this entry, reading metadata into the output stream.
java.lang.String toString()
Methods inherited from class dk.netarkivet.common.utils.batch.FileBatchJob
addException, addFinishException, addInitializeException, getBatchJobTimeout, getExceptions, getFilenamePattern, getFilesFailed, getNoOfFilesProcessed, maxExceptionsReached, postProcess, processOnlyFileNamed, processOnlyFilesMatching, processOnlyFilesMatching, processOnlyFilesNamed, setBatchJobTimeout
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
ExtractCDXJob
public ExtractCDXJob(boolean includeChecksum)
- Constructs a new job for extracting CDX indexes.
- Parameters:
includeChecksum - If true, an MD5 checksum is also written for each record. If false, it is not.
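As an illustration of what an MD5 checksum for a record looks like, here is a minimal sketch using the standard java.security.MessageDigest class. Md5Sketch and md5Hex are hypothetical names. This page does not say which bytes of the record the job digests or how the result is encoded, so hashing a whole byte array and printing lowercase hex are assumptions.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper, not part of NetarchiveSuite: MD5 of a byte array as lowercase hex. */
final class Md5Sketch {
    private Md5Sketch() {}

    static String md5Hex(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(content);
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                // Mask to 0..255 so negative bytes still print as two hex digits.
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide MD5, so this is not expected.
            throw new IllegalStateException("MD5 not available", e);
        }
    }
}
```

In this sketch, an empty byte array digests to d41d8cd98f00b204e9800998ecf8427e, the well-known MD5 of empty input.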
ExtractCDXJob
public ExtractCDXJob()
- Equivalent to ExtractCDXJob(true).
getFilter
public ARCBatchFilter getFilter()
- Filter out the filedesc: headers.
- Overrides:
getFilter in class ARCBatchJob
- Returns:
The filter that defines what ARC records are wanted in the output CDX file.
- See Also:
ARCBatchJob.getFilter()
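The effect of filtering out filedesc: headers can be sketched with a plain predicate over record URLs. FiledescFilterSketch, NOT_FILEDESC and keepWanted are hypothetical names; the real filter is an ARCBatchFilter whose implementation is not shown on this page, and matching on the "filedesc:" URL prefix is an assumption based on the description above.

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

/** Hypothetical stand-in for an ARCBatchFilter: drops ARC file header records by URL prefix. */
final class FiledescFilterSketch {
    private FiledescFilterSketch() {}

    /** Accepts every record URL except those that start with "filedesc:". */
    static final Predicate<String> NOT_FILEDESC = url -> !url.startsWith("filedesc:");

    /** Returns the URLs that pass the filter, in their original order. */
    static List<String> keepWanted(List<String> urls) {
        return urls.stream().filter(NOT_FILEDESC).collect(Collectors.toList());
    }
}
```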
initialize
public void initialize(java.io.OutputStream os)
- Initialize any data needed (none).
- Specified by:
initialize in class ARCBatchJob
- Parameters:
os - The OutputStream to which output data is written
- See Also:
ARCBatchJob.initialize(OutputStream)
processRecord
public void processRecord(org.archive.io.arc.ARCRecord sar,
java.io.OutputStream os)
- Process this entry, reading metadata into the output stream.
- Specified by:
processRecord in class ARCBatchJob
- Parameters:
sar - the object to be processed.
os - The OutputStream to which output data is written
- Throws:
IOFailure - on trouble reading arc record data
- See Also:
ARCBatchJob.processRecord(ARCRecord, OutputStream)
finish
public void finish(java.io.OutputStream os)
- End of the batch job.
- Specified by:
finish in class ARCBatchJob
- Parameters:
os - The OutputStream to which output data is written
- See Also:
ARCBatchJob.finish(OutputStream)
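The three methods initialize, processRecord and finish suggest a simple lifecycle around a shared OutputStream. The sketch below shows that pattern with stand-in types. BatchLifecycleSketch, RecordJob, lineWriter and run are hypothetical names, records are plain strings rather than ARCRecord objects, and the call order (initialize once, processRecord once per record, finish once) is an assumption inferred from the method names, not a statement of how the batch framework drives the job.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.List;

/** Hypothetical stand-in; not the NetarchiveSuite batch framework. */
final class BatchLifecycleSketch {
    private BatchLifecycleSketch() {}

    /** Simplified job contract: records are plain strings here, not ARCRecord objects. */
    interface RecordJob {
        void initialize(OutputStream os) throws IOException;
        void processRecord(String record, OutputStream os) throws IOException;
        void finish(OutputStream os) throws IOException;
    }

    /** A job that needs no set-up or tear-down and writes one line per record. */
    static RecordJob lineWriter() {
        return new RecordJob() {
            @Override public void initialize(OutputStream os) { /* nothing to prepare */ }
            @Override public void processRecord(String record, OutputStream os) throws IOException {
                os.write((record + "\n").getBytes(StandardCharsets.UTF_8));
            }
            @Override public void finish(OutputStream os) { /* nothing to clean up */ }
        };
    }

    /** Assumed call order: initialize once, processRecord per record, finish once. */
    static String run(RecordJob job, List<String> records) {
        ByteArrayOutputStream os = new ByteArrayOutputStream();
        try {
            job.initialize(os);
            for (String record : records) {
                job.processRecord(record, os);
            }
            job.finish(os);
        } catch (IOException e) {
            // Rethrown unchecked, loosely mirroring the IOFailure documented for processRecord.
            throw new UncheckedIOException(e);
        }
        return new String(os.toByteArray(), StandardCharsets.UTF_8);
    }
}
```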
toString
public java.lang.String toString()
- Overrides:
toString in class java.lang.Object
- Returns:
Human-readable description of this instance.