Package dk.netarkivet.common.utils.cdx
Class ExtractCDXJob
java.lang.Object
  dk.netarkivet.common.utils.batch.FileBatchJob
    dk.netarkivet.common.utils.arc.ARCBatchJob
      dk.netarkivet.common.utils.cdx.ExtractCDXJob
- All Implemented Interfaces:
Serializable
public class ExtractCDXJob extends ARCBatchJob
Batch job that extracts information to create a CDX file. A CDX file contains sorted lines of metadata from the ARC files, with each line followed by the file and the offset at which the record was found, and optionally a checksum. The timeout of this job is 7 days. See http://www.archive.org/web/researcher/cdx_file_format.php
- See Also:
- Serialized Form
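The line layout described above (metadata, then file and offset, then an optional checksum) can be illustrated with a minimal, library-free sketch. The class CdxLineSketch, its method names, the space delimiter, and the sample field values are all hypothetical and chosen for illustration only; the exact fields and ordering that ExtractCDXJob writes are not specified on this page. Sorting is shown as a separate step because this page does not say whether the job itself sorts its output.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration only: not part of NetarchiveSuite.
final class CdxLineSketch {

    /**
     * Joins the metadata fields, then appends the ARC file name and the
     * record offset, and finally the checksum if one is supplied.
     * The space delimiter is an assumption.
     */
    public static String cdxLine(List<String> metadata, String arcFile,
                                 long offset, String checksum) {
        List<String> fields = new ArrayList<>(metadata);
        fields.add(arcFile);
        fields.add(Long.toString(offset));
        if (checksum != null) {
            fields.add(checksum);
        }
        return String.join(" ", fields);
    }

    /** Returns a lexicographically sorted copy of the given lines. */
    public static List<String> sorted(List<String> lines) {
        List<String> copy = new ArrayList<>(lines);
        Collections.sort(copy);
        return copy;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            cdxLine(Arrays.asList("http://example.org/b", "20050101000000"),
                    "sample.arc", 2048L, null),
            cdxLine(Arrays.asList("http://example.org/a", "20050101000000"),
                    "sample.arc", 512L, null));
        for (String line : sorted(lines)) {
            System.out.println(line);
        }
    }
}
```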
-
-
Nested Class Summary
-
Nested classes/interfaces inherited from class dk.netarkivet.common.utils.batch.FileBatchJob
FileBatchJob.ExceptionOccurrence
-
-
Field Summary
-
Fields inherited from class dk.netarkivet.common.utils.arc.ARCBatchJob
noOfRecordsProcessed
-
Fields inherited from class dk.netarkivet.common.utils.batch.FileBatchJob
batchJobTimeout, exceptions, filesFailed, noOfFilesProcessed
-
-
Constructor Summary
Constructors
ExtractCDXJob()
Equivalent to ExtractCDXJob(true).
ExtractCDXJob(boolean includeChecksum)
Constructs a new job for extracting CDX indexes.
-
Method Summary
void finish(OutputStream os)
End of the batch job.
ARCBatchFilter getFilter()
Filter out the filedesc: headers.
void initialize(OutputStream os)
Initialize any data needed (none).
void processRecord(org.archive.io.arc.ARCRecord sar, OutputStream os)
Process this entry, reading metadata into the output stream.
String toString()
-
Methods inherited from class dk.netarkivet.common.utils.arc.ARCBatchJob
getExceptionArray, handleException, noOfRecordsProcessed, processFile
-
Methods inherited from class dk.netarkivet.common.utils.batch.FileBatchJob
addException, addFinishException, addInitializeException, getBatchJobTimeout, getExceptions, getFilenamePattern, getFilesFailed, getNoOfFilesProcessed, maxExceptionsReached, postProcess, processOnlyFileNamed, processOnlyFilesMatching, processOnlyFilesMatching, processOnlyFilesNamed, setBatchJobTimeout
-
-
-
-
Constructor Detail
-
ExtractCDXJob
public ExtractCDXJob(boolean includeChecksum)
Constructs a new job for extracting CDX indexes.
- Parameters:
includeChecksum - If true, an MD5 checksum is also written for each record. If false, no checksum is written.
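For readers unfamiliar with MD5 checksums, the following is a minimal sketch of computing a hex-encoded MD5 digest with the standard java.security.MessageDigest API. The class Md5Sketch and its method are hypothetical; which bytes of a record ExtractCDXJob actually digests, and how it encodes the result, are not stated on this page.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical illustration only: not part of NetarchiveSuite.
final class Md5Sketch {

    /** Returns the MD5 digest of the data as 32 lowercase hex characters. */
    public static String md5Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(data);
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                // Mask to 0..255 so negative bytes format as two hex digits.
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide MD5.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        byte[] payload = "abc".getBytes(StandardCharsets.US_ASCII);
        System.out.println(md5Hex(payload));
    }
}
```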
-
ExtractCDXJob
public ExtractCDXJob()
Equivalent to ExtractCDXJob(true).
-
-
Method Detail
-
getFilter
public ARCBatchFilter getFilter()
Filter out the filedesc: headers.
- Overrides:
getFilter in class ARCBatchJob
- Returns:
- The filter that defines what ARC records are wanted in the output CDX file.
- See Also:
ARCBatchJob.getFilter()
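The idea of excluding filedesc: header records can be sketched without the library as a simple predicate on the record URL. This is an illustration under the assumption that such records are recognized by a URL starting with "filedesc:"; the class FiledescFilterSketch and its members are hypothetical, and the actual ARCBatchFilter returned by this method may be implemented differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical illustration only: not the ARCBatchFilter API.
final class FiledescFilterSketch {

    /** Accepts a record URL unless it is null or starts with "filedesc:". */
    public static final Predicate<String> ACCEPT =
        url -> url != null && !url.startsWith("filedesc:");

    /** Returns only the URLs that the predicate accepts, in input order. */
    public static List<String> wanted(List<String> urls) {
        List<String> kept = new ArrayList<>();
        for (String url : urls) {
            if (ACCEPT.test(url)) {
                kept.add(url);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
            "filedesc://sample.arc",
            "http://example.org/");
        System.out.println(wanted(urls));
    }
}
```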
-
initialize
public void initialize(OutputStream os)
Initialize any data needed (none).
- Specified by:
initialize in class ARCBatchJob
- Parameters:
os - The OutputStream to which output data is written
- See Also:
ARCBatchJob.initialize(OutputStream)
-
processRecord
public void processRecord(org.archive.io.arc.ARCRecord sar, OutputStream os)
Process this entry, reading metadata into the output stream.
- Specified by:
processRecord in class ARCBatchJob
- Parameters:
sar - The ARC record to be processed.
os - The OutputStream to which output data is written
- Throws:
IOFailure - on trouble reading ARC record data
- See Also:
ARCBatchJob.processRecord(ARCRecord, OutputStream)
-
finish
public void finish(OutputStream os)
End of the batch job.
- Specified by:
finish in class ARCBatchJob
- Parameters:
os - The OutputStream to which output data is written
- See Also:
ARCBatchJob.finish(OutputStream)
-
-