dk.netarkivet.common.utils.cdx
Class ExtractCDXJob

java.lang.Object
  extended by dk.netarkivet.common.utils.batch.FileBatchJob
      extended by dk.netarkivet.common.utils.arc.ARCBatchJob
          extended by dk.netarkivet.common.utils.cdx.ExtractCDXJob
All Implemented Interfaces:
java.io.Serializable

public class ExtractCDXJob
extends ARCBatchJob

Batch job that extracts the information needed to create a CDX file. A CDX file contains sorted lines of metadata from ARC files; each line ends with the name of the file and the offset at which the record was found, optionally followed by a checksum. See http://www.archive.org/web/researcher/cdx_file_format.php
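
As a rough illustration of the line layout described above, the sketch below assembles one CDX-style line: record metadata first, then the ARC file name and offset, then an optional checksum. The class CdxLineSketch, the method formatLine, and the particular metadata fields and their order are assumptions made for illustration; the exact fields written by ExtractCDXJob are not specified on this page.

```java
import java.util.StringJoiner;

final class CdxLineSketch {
    /**
     * Builds one space-separated CDX-style line.
     * The field selection and order are hypothetical.
     * A null checksum means no checksum field is appended.
     */
    static String formatLine(String url, String ip, String date, String mimeType,
                             long length, String arcFile, long offset, String checksum) {
        StringJoiner line = new StringJoiner(" ");
        // Metadata about the record itself.
        line.add(url).add(ip).add(date).add(mimeType).add(Long.toString(length));
        // Where the record was found: file name and offset within that file.
        line.add(arcFile).add(Long.toString(offset));
        // Optional trailing checksum.
        if (checksum != null) {
            line.add(checksum);
        }
        return line.toString();
    }

    public static void main(String[] args) {
        System.out.println(formatLine("http://example.org/", "192.0.2.1",
                "20090101120000", "text/html", 1234L, "example.arc", 5678L, null));
    }
}
```

Passing a non-null checksum string appends it as the last field of the line.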

See Also:
Serialized Form

Nested Class Summary
 
Nested classes/interfaces inherited from class dk.netarkivet.common.utils.batch.FileBatchJob
FileBatchJob.ExceptionOccurrence
 
Field Summary
 
Fields inherited from class dk.netarkivet.common.utils.arc.ARCBatchJob
noOfRecordsProcessed
 
Fields inherited from class dk.netarkivet.common.utils.batch.FileBatchJob
batchJobTimeout, exceptions, filesFailed, noOfFilesProcessed
 
Constructor Summary
ExtractCDXJob()
          Equivalent to ExtractCDXJob(true).
ExtractCDXJob(boolean includeChecksum)
          Constructs a new job for extracting CDX indexes.
 
Method Summary
 void finish(java.io.OutputStream os)
          End of the batch job.
 ARCBatchFilter getFilter()
          Returns a filter that excludes the filedesc: header records.
 void initialize(java.io.OutputStream os)
          Initializes any data needed; this job needs none.
 void processRecord(org.archive.io.arc.ARCRecord sar, java.io.OutputStream os)
          Processes this record, writing its metadata to the output stream.
 java.lang.String toString()
          Human-readable description of this instance.
 
Methods inherited from class dk.netarkivet.common.utils.arc.ARCBatchJob
getExceptionArray, handleException, noOfRecordsProcessed, processFile
 
Methods inherited from class dk.netarkivet.common.utils.batch.FileBatchJob
addException, addFinishException, addInitializeException, getBatchJobTimeout, getExceptions, getFilenamePattern, getFilesFailed, getNoOfFilesProcessed, maxExceptionsReached, postProcess, processOnlyFileNamed, processOnlyFilesMatching, processOnlyFilesMatching, processOnlyFilesNamed, setBatchJobTimeout
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Constructor Detail

ExtractCDXJob

public ExtractCDXJob(boolean includeChecksum)
Constructs a new job for extracting CDX indexes.

Parameters:
includeChecksum - If true, an MD5 checksum is also written for each record. If false, it is not.
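
For readers unfamiliar with MD5 digests, the following minimal sketch computes a hex-encoded MD5 checksum with the standard java.security.MessageDigest class. The class Md5Sketch and the method md5Hex are hypothetical names; which bytes of a record ExtractCDXJob digests, and how it encodes the result, are not stated on this page, so treat the hex encoding here as an assumption.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class Md5Sketch {
    /** Returns the MD5 digest of data as a 32-character lowercase hex string. */
    static String md5Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(data);
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                // Mask to an unsigned value so negative bytes format as two hex digits.
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform implementation is required to provide MD5.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(md5Hex("abc".getBytes(StandardCharsets.UTF_8)));
    }
}
```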

ExtractCDXJob

public ExtractCDXJob()
Equivalent to ExtractCDXJob(true).

Method Detail

getFilter

public ARCBatchFilter getFilter()
Returns a filter that excludes the filedesc: header records.

Overrides:
getFilter in class ARCBatchJob
Returns:
The filter that defines what ARC records are wanted in the output CDX file.
See Also:
ARCBatchJob.getFilter()
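
The effect of such a filter can be sketched without the NetarchiveSuite classes. The example below uses a plain java.util.function.Predicate over record URLs as a stand-in for ARCBatchFilter; the class FiledescFilterSketch, the constant EXCLUDE_FILEDESC, and the assumption that header records are recognized by a URL beginning with "filedesc:" are illustrative only and do not reproduce the actual filter implementation.

```java
import java.util.function.Predicate;

final class FiledescFilterSketch {
    /**
     * Hypothetical stand-in for an ARCBatchFilter: accepts a record URL
     * unless it is null or starts with the "filedesc:" scheme.
     */
    static final Predicate<String> EXCLUDE_FILEDESC =
            url -> url != null && !url.startsWith("filedesc:");

    public static void main(String[] args) {
        System.out.println(EXCLUDE_FILEDESC.test("filedesc://example.arc"));
        System.out.println(EXCLUDE_FILEDESC.test("http://example.org/"));
    }
}
```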

initialize

public void initialize(java.io.OutputStream os)
Initializes any data needed; this job needs none.

Specified by:
initialize in class ARCBatchJob
Parameters:
os - The OutputStream to which output data is written
See Also:
ARCBatchJob.initialize(OutputStream)

processRecord

public void processRecord(org.archive.io.arc.ARCRecord sar,
                          java.io.OutputStream os)
Processes this record, writing its metadata to the output stream.

Specified by:
processRecord in class ARCBatchJob
Parameters:
sar - The ARC record to be processed.
os - The OutputStream to which output data is written
Throws:
IOFailure - If the ARC record data cannot be read
See Also:
ARCBatchJob.processRecord(ARCRecord, OutputStream)

finish

public void finish(java.io.OutputStream os)
End of the batch job.

Specified by:
finish in class ARCBatchJob
Parameters:
os - The OutputStream to which output data is written
See Also:
ARCBatchJob.finish(OutputStream)
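
The three callbacks initialize, processRecord and finish all receive the same kind of OutputStream. The sketch below shows one plausible calling sequence, assuming initialize is called once, processRecord once per record, and finish once at the end; that ordering is an assumption, as it is not spelled out on this page. The interface RecordJob and the classes BatchLifecycleSketch and LineWriter are hypothetical stand-ins that use strings in place of ARCRecord objects, not the NetarchiveSuite API.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

final class BatchLifecycleSketch {
    /** Hypothetical stand-in for the three batch-job callbacks. */
    interface RecordJob {
        void initialize(OutputStream os);
        void processRecord(String record, OutputStream os);
        void finish(OutputStream os);
    }

    /** Writes one line per record; initialize and finish have nothing to do. */
    static final class LineWriter implements RecordJob {
        public void initialize(OutputStream os) { }

        public void processRecord(String record, OutputStream os) {
            try {
                os.write((record + "\n").getBytes(StandardCharsets.UTF_8));
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }

        public void finish(OutputStream os) { }
    }

    /** Drives a job over the records and returns everything it wrote. */
    static String run(RecordJob job, List<String> records) {
        ByteArrayOutputStream os = new ByteArrayOutputStream();
        job.initialize(os);
        for (String record : records) {
            job.processRecord(record, os);
        }
        job.finish(os);
        return new String(os.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.print(run(new LineWriter(), Arrays.asList("line-1", "line-2")));
    }
}
```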

toString

public java.lang.String toString()
Human-readable description of this instance.

Overrides:
toString in class java.lang.Object