dk.netarkivet.common.utils.cdx
Class WARCExtractCDXJob

java.lang.Object
  extended by dk.netarkivet.common.utils.batch.FileBatchJob
      extended by dk.netarkivet.common.utils.warc.WARCBatchJob
          extended by dk.netarkivet.common.utils.cdx.WARCExtractCDXJob
All Implemented Interfaces:
java.io.Serializable

public class WARCExtractCDXJob
extends WARCBatchJob

Batch job that extracts the information needed to create a CDX file. A CDX file contains sorted lines of metadata from the WARC files; each line ends with the name of the file and the offset at which the record was found, optionally followed by a checksum. The timeout of this job is 7 days. See http://www.archive.org/web/researcher/cdx_file_format.php

See Also:
Serialized Form

Nested Class Summary
 
Nested classes/interfaces inherited from class dk.netarkivet.common.utils.batch.FileBatchJob
FileBatchJob.ExceptionOccurrence
 
Field Summary
 
Fields inherited from class dk.netarkivet.common.utils.warc.WARCBatchJob
noOfRecordsProcessed
 
Fields inherited from class dk.netarkivet.common.utils.batch.FileBatchJob
batchJobTimeout, exceptions, filesFailed, noOfFilesProcessed
 
Constructor Summary
WARCExtractCDXJob()
          Equivalent to WARCExtractCDXJob(true).
WARCExtractCDXJob(boolean includeChecksum)
          Constructs a new job for extracting CDX indexes.
 
Method Summary
 void finish(java.io.OutputStream os)
          End of the batch job.
 WARCBatchFilter getFilter()
          Returns a filter that excludes all non-response records.
 void initialize(java.io.OutputStream os)
          Initializes any data needed (this job needs none).
 void processRecord(org.archive.io.warc.WARCRecord sar, java.io.OutputStream os)
          Processes this record, writing its metadata to the output stream.
 java.lang.String toString()
          Returns a human-readable description of this instance.
 
Methods inherited from class dk.netarkivet.common.utils.warc.WARCBatchJob
getExceptionArray, handleException, noOfRecordsProcessed, processFile
 
Methods inherited from class dk.netarkivet.common.utils.batch.FileBatchJob
addException, addFinishException, addInitializeException, getBatchJobTimeout, getExceptions, getFilenamePattern, getFilesFailed, getNoOfFilesProcessed, maxExceptionsReached, postProcess, processOnlyFileNamed, processOnlyFilesMatching, processOnlyFilesMatching, processOnlyFilesNamed, setBatchJobTimeout
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Constructor Detail

WARCExtractCDXJob

public WARCExtractCDXJob(boolean includeChecksum)
Constructs a new job for extracting CDX indexes.

Parameters:
includeChecksum - If true, an MD5 checksum is also written for each record. If false, it is not.
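
As a rough illustration of what the includeChecksum flag controls, the sketch below computes an MD5 digest over a record's payload bytes using only the JDK and appends it to an output line when requested. The class CdxLineSketch, its method names, and the space-separated field layout are hypothetical and are not taken from the NetarchiveSuite source; the actual job may order or format the fields differently.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper for illustration only; not part of NetarchiveSuite.
final class CdxLineSketch {

    /** Returns the MD5 digest of the given bytes as lowercase hex. */
    static String md5Hex(byte[] payload) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(payload);
            StringBuilder sb = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                // Mask to 0..255 so negative bytes format as two hex digits.
                sb.append(String.format("%02x", b & 0xff));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException(e);
        }
    }

    /**
     * Builds one line: metadata, then file name and offset, then
     * (only if includeChecksum is true) the MD5 of the payload.
     * The field layout is an assumption made for this sketch.
     */
    static String line(String metadata, String file, long offset,
                       byte[] payload, boolean includeChecksum) {
        StringBuilder sb = new StringBuilder(metadata)
                .append(' ').append(file)
                .append(' ').append(offset);
        if (includeChecksum) {
            sb.append(' ').append(md5Hex(payload));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] payload = "abc".getBytes(StandardCharsets.UTF_8);
        // Sample values below are made up for the demonstration.
        System.out.println(line("example-metadata", "sample.warc", 1234L, payload, true));
        System.out.println(line("example-metadata", "sample.warc", 1234L, payload, false));
    }
}
```

With includeChecksum set to false the line simply stops after the offset, which mirrors the parameter description above.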

WARCExtractCDXJob

public WARCExtractCDXJob()
Equivalent to WARCExtractCDXJob(true).

Method Detail

getFilter

public WARCBatchFilter getFilter()
Returns a filter that excludes all non-response records, so that only response records are indexed.

Overrides:
getFilter in class WARCBatchJob
Returns:
The filter that defines what WARC records are wanted in the output CDX file.
See Also:
WARCBatchJob.getFilter()
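
The effect of such a filter can be sketched with plain JDK types. The example below assumes a record's headers are available as a simple map and accepts a record only when its WARC-Type header is "response". The class ResponseFilterSketch and the map-based record representation are hypothetical; the real WARCBatchFilter operates on WARCRecord objects and may decide differently.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical stand-in for a record filter; not the NetarchiveSuite API.
final class ResponseFilterSketch {

    /** Accepts a header map only if its WARC-Type is "response". */
    static final Predicate<Map<String, String>> RESPONSE_ONLY =
            headers -> "response".equalsIgnoreCase(headers.get("WARC-Type"));

    public static void main(String[] args) {
        Map<String, String> response = new HashMap<>();
        response.put("WARC-Type", "response");

        Map<String, String> request = new HashMap<>();
        request.put("WARC-Type", "request");

        // A response record passes the filter; a request record does not.
        System.out.println("response accepted: " + RESPONSE_ONLY.test(response));
        System.out.println("request accepted: " + RESPONSE_ONLY.test(request));
    }
}
```

A record with no WARC-Type header is rejected as well in this sketch, because the lookup yields null.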

initialize

public void initialize(java.io.OutputStream os)
Initializes any data needed (this job needs none).

Specified by:
initialize in class WARCBatchJob
Parameters:
os - The OutputStream to which output data is written
See Also:
WARCBatchJob.initialize(OutputStream)

processRecord

public void processRecord(org.archive.io.warc.WARCRecord sar,
                          java.io.OutputStream os)
Processes this record, writing its metadata to the output stream.

Specified by:
processRecord in class WARCBatchJob
Parameters:
sar - The WARC record to be processed.
os - The OutputStream to which output data is written
Throws:
IOFailure - on trouble reading WARC record data
See Also:
WARCBatchJob.processRecord(WARCRecord, OutputStream)

finish

public void finish(java.io.OutputStream os)
End of the batch job.

Specified by:
finish in class WARCBatchJob
Parameters:
os - The OutputStream to which output data is written
See Also:
WARCBatchJob.finish(OutputStream)
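
The three callbacks documented above (initialize, processRecord, finish) share one output stream. The sketch below shows one plausible calling order, assuming the framework calls initialize once, processRecord once per accepted record, and finish once at the end. The interface RecordJob, the driver method run, and the use of strings in place of WARCRecord objects are all hypothetical simplifications and do not reproduce the NetarchiveSuite batch framework.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical miniature of a record-oriented batch job lifecycle.
final class LifecycleSketch {

    /** Simplified callbacks; strings stand in for WARC records. */
    interface RecordJob {
        void initialize(OutputStream os);
        void processRecord(String record, OutputStream os);
        void finish(OutputStream os);
    }

    /** Assumed order: initialize once, each record in turn, finish once. */
    static void run(RecordJob job, List<String> records, OutputStream os) {
        job.initialize(os);
        for (String record : records) {
            job.processRecord(record, os);
        }
        job.finish(os);
    }

    /** A job that needs no setup or teardown and writes one line per record. */
    static RecordJob lineWriter() {
        return new RecordJob() {
            @Override public void initialize(OutputStream os) { /* nothing needed */ }

            @Override public void processRecord(String record, OutputStream os) {
                try {
                    os.write((record + "\n").getBytes(StandardCharsets.UTF_8));
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            }

            @Override public void finish(OutputStream os) { /* nothing needed */ }
        };
    }

    public static void main(String[] args) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        run(lineWriter(), Arrays.asList("first-record", "second-record"), out);
        System.out.print(new String(out.toByteArray(), StandardCharsets.UTF_8));
    }
}
```

Keeping initialize and finish as no-ops matches the descriptions above, where all output is produced in processRecord.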

toString

public java.lang.String toString()
Overrides:
toString in class java.lang.Object
Returns:
A human-readable description of this instance.