dk.netarkivet.common.utils.cdx
Class ArchiveExtractCDXJob

java.lang.Object
  extended by dk.netarkivet.common.utils.batch.FileBatchJob
      extended by dk.netarkivet.common.utils.archive.ArchiveBatchJobBase
          extended by dk.netarkivet.common.utils.archive.ArchiveBatchJob
              extended by dk.netarkivet.common.utils.cdx.ArchiveExtractCDXJob
All Implemented Interfaces:
java.io.Serializable

public class ArchiveExtractCDXJob
extends ArchiveBatchJob

Batch job that extracts the information needed to create a CDX file. A CDX file contains sorted lines of metadata from ARC/WARC files; each line also gives the file and offset at which the record was found and, optionally, a checksum. The timeout of this job is 7 days. See http://www.archive.org/web/researcher/cdx_file_format.php
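
As a rough illustration of the layout described above, the sketch below joins a few metadata fields, appends the file name, the offset, and an optional checksum, and then sorts the resulting lines. The field set (URL, timestamp, MIME type), the space separator, and the names CdxLineSketch, cdxLine, and sorted are assumptions made for this example; they are not taken from ArchiveExtractCDXJob and may not match the exact format it writes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class CdxLineSketch {

    /**
     * Hypothetical helper: joins record metadata, then the archive file name
     * and offset, then a checksum if one is given (null means "no checksum").
     */
    public static String cdxLine(String url, String timestamp, String mimeType,
                                 String archiveFile, long offset, String checksum) {
        StringBuilder sb = new StringBuilder();
        sb.append(url).append(' ')
          .append(timestamp).append(' ')
          .append(mimeType).append(' ')
          .append(archiveFile).append(' ')
          .append(offset);
        if (checksum != null) {
            sb.append(' ').append(checksum);
        }
        return sb.toString();
    }

    /** Returns a sorted copy of the given lines, leaving the input untouched. */
    public static List<String> sorted(List<String> lines) {
        List<String> copy = new ArrayList<>(lines);
        Collections.sort(copy);
        return copy;
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        lines.add(cdxLine("http://example.org/b", "20240101000000", "text/html",
                "1-1.warc", 2048L, null));
        lines.add(cdxLine("http://example.org/a", "20240101000000", "text/html",
                "1-1.warc", 512L, null));
        for (String line : sorted(lines)) {
            System.out.println(line);
        }
    }
}
```

Sorting the lines is what lets a consumer look up a URL with a binary search instead of scanning the whole index.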

See Also:
Serialized Form

Nested Class Summary
 
Nested classes/interfaces inherited from class dk.netarkivet.common.utils.batch.FileBatchJob
FileBatchJob.ExceptionOccurrence
 
Field Summary
 
Fields inherited from class dk.netarkivet.common.utils.archive.ArchiveBatchJobBase
noOfRecordsProcessed
 
Fields inherited from class dk.netarkivet.common.utils.batch.FileBatchJob
batchJobTimeout, exceptions, filesFailed, noOfFilesProcessed
 
Constructor Summary
ArchiveExtractCDXJob()
          Equivalent to ArchiveExtractCDXJob(true).
ArchiveExtractCDXJob(boolean includeChecksum)
          Constructs a new job for extracting CDX indexes.
 
Method Summary
 void finish(java.io.OutputStream os)
          End of the batch job.
 ArchiveBatchFilter getFilter()
          Returns a filter that excludes non-response records.
 void initialize(java.io.OutputStream os)
          Initializes any data needed (none for this job).
 void processRecord(ArchiveRecordBase record, java.io.OutputStream os)
          Processes this record, writing its metadata to the output stream.
 java.lang.String toString()
           
 
Methods inherited from class dk.netarkivet.common.utils.archive.ArchiveBatchJob
processFile
 
Methods inherited from class dk.netarkivet.common.utils.archive.ArchiveBatchJobBase
getExceptionArray, handleException, handleOurException, noOfRecordsProcessed
 
Methods inherited from class dk.netarkivet.common.utils.batch.FileBatchJob
addException, addFinishException, addInitializeException, getBatchJobTimeout, getExceptions, getFilenamePattern, getFilesFailed, getNoOfFilesProcessed, maxExceptionsReached, postProcess, processOnlyFileNamed, processOnlyFilesMatching, processOnlyFilesMatching, processOnlyFilesNamed, setBatchJobTimeout
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Constructor Detail

ArchiveExtractCDXJob

public ArchiveExtractCDXJob(boolean includeChecksum)
Constructs a new job for extracting CDX indexes.

Parameters:
includeChecksum - If true, an MD5 checksum is also written for each record. If false, it is not.
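
To make the checksum option concrete, here is a minimal sketch of computing an MD5 digest with the standard java.security.MessageDigest class. Which bytes of a record the job actually digests and how it encodes the result are not stated on this page, so the lowercase hexadecimal encoding and the names Md5Sketch and md5Hex are assumptions for illustration only.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Md5Sketch {

    /** Returns the MD5 digest of the given bytes as lowercase hexadecimal. */
    public static String md5Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(data);
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                // Mask to an unsigned value so negative bytes format as two hex digits.
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] payload = "hello".getBytes(StandardCharsets.UTF_8);
        System.out.println(md5Hex(payload));
    }
}
```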

ArchiveExtractCDXJob

public ArchiveExtractCDXJob()
Equivalent to ArchiveExtractCDXJob(true).

Method Detail

getFilter

public ArchiveBatchFilter getFilter()
Returns a filter that excludes non-response records.

Overrides:
getFilter in class ArchiveBatchJob
Returns:
The filter that defines what ARC/WARC records are wanted in the output CDX file.
See Also:
ArchiveBatchJob.getFilter()

initialize

public void initialize(java.io.OutputStream os)
Initializes any data needed (none for this job).

Specified by:
initialize in class ArchiveBatchJobBase
Parameters:
os - The OutputStream to which output data is written
See Also:
ArchiveBatchJobBase.initialize(OutputStream)

processRecord

public void processRecord(ArchiveRecordBase record,
                          java.io.OutputStream os)
Processes this record, writing its metadata to the output stream.

Specified by:
processRecord in class ArchiveBatchJob
Parameters:
record - The record to be processed.
os - The OutputStream to which output data is written
Throws:
IOFailure - if the ARC/WARC record data cannot be read
See Also:
ArchiveBatchJob.processRecord(ArchiveRecordBase, OutputStream)

finish

public void finish(java.io.OutputStream os)
End of the batch job.

Specified by:
finish in class ArchiveBatchJobBase
Parameters:
os - The OutputStream to which output data is written
See Also:
ARCBatchJob.finish(OutputStream)
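
The three methods initialize, processRecord, and finish suggest a simple lifecycle: set up once, handle each record, then wrap up, all writing to the same OutputStream. The sketch below imitates that calling order with stand-in types. RecordJob, LinePerRecordJob, and run are hypothetical names for this example and are not NetarchiveSuite classes; the real framework also iterates over files and applies the filter, which is omitted here.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

public class BatchLifecycleSketch {

    /** Hypothetical stand-in for a record-level batch job. */
    public interface RecordJob {
        void initialize(OutputStream os) throws IOException;
        void processRecord(String record, OutputStream os) throws IOException;
        void finish(OutputStream os) throws IOException;
    }

    /** Drives a job through the three phases and returns everything it wrote. */
    public static String run(RecordJob job, List<String> records) {
        ByteArrayOutputStream os = new ByteArrayOutputStream();
        try {
            job.initialize(os);
            for (String record : records) {
                job.processRecord(record, os);
            }
            job.finish(os);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(os.toByteArray(), StandardCharsets.UTF_8);
    }

    /** Example job: writes one line per record; initialize and finish do nothing. */
    public static class LinePerRecordJob implements RecordJob {
        @Override
        public void initialize(OutputStream os) {
            // Nothing to set up.
        }

        @Override
        public void processRecord(String record, OutputStream os) throws IOException {
            os.write((record + "\n").getBytes(StandardCharsets.UTF_8));
        }

        @Override
        public void finish(OutputStream os) {
            // Nothing to write at the end.
        }
    }

    public static void main(String[] args) {
        System.out.print(run(new LinePerRecordJob(), Arrays.asList("record-1", "record-2")));
    }
}
```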

toString

public java.lang.String toString()
Overrides:
toString in class java.lang.Object
Returns:
Human-readable description of this instance.