dk.netarkivet.wayback.batch
Class WaybackCDXExtractionWARCBatchJob

java.lang.Object
  extended by dk.netarkivet.common.utils.batch.FileBatchJob
      extended by dk.netarkivet.common.utils.warc.WARCBatchJob
          extended by dk.netarkivet.wayback.batch.WaybackCDXExtractionWARCBatchJob
All Implemented Interfaces:
java.io.Serializable

public class WaybackCDXExtractionWARCBatchJob
extends WARCBatchJob

Batch job that produces a CDX file in the format expected by Wayback, including canonicalisation of URLs. The resulting files are unsorted.
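
Since the output is unsorted, a consumer that needs an ordered index has to sort the lines itself. The following is a minimal sketch, assuming the job output is available as a list of text lines and that a byte-wise ordering (as produced by sort in the C locale) is what the consumer wants. The class CdxLineSorter and the two-field sample lines are hypothetical and are not part of NetarchiveSuite or the real CDX format.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Hypothetical helper: sorts CDX lines by their raw UTF-8 bytes. */
class CdxLineSorter {

    /** Compares two strings byte by byte, treating bytes as unsigned values. */
    static final Comparator<String> BYTE_ORDER = (a, b) -> {
        byte[] x = a.getBytes(StandardCharsets.UTF_8);
        byte[] y = b.getBytes(StandardCharsets.UTF_8);
        int n = Math.min(x.length, y.length);
        for (int i = 0; i < n; i++) {
            int d = (x[i] & 0xff) - (y[i] & 0xff);
            if (d != 0) {
                return d;
            }
        }
        // A string that is a prefix of the other sorts first.
        return x.length - y.length;
    };

    /** Returns a sorted copy; the input list is left untouched. */
    static List<String> sortLines(List<String> lines) {
        List<String> sorted = new ArrayList<>(lines);
        sorted.sort(BYTE_ORDER);
        return sorted;
    }

    public static void main(String[] args) {
        // Made-up, shortened lines used only to show the ordering.
        List<String> unsorted = Arrays.asList(
                "b.example/ 20100102000000",
                "a.example/ 20100101000000",
                "B.example/ 20100103000000");
        for (String line : sortLines(unsorted)) {
            System.out.println(line);
        }
    }
}
```

Byte order places upper-case ASCII before lower-case, so "B.example/" sorts ahead of "a.example/". For output too large to hold in memory, an external sort would be the more realistic choice.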

See Also:
Serialized Form

Nested Class Summary
 
Nested classes/interfaces inherited from class dk.netarkivet.common.utils.batch.FileBatchJob
FileBatchJob.ExceptionOccurrence
 
Field Summary
 
Fields inherited from class dk.netarkivet.common.utils.warc.WARCBatchJob
noOfRecordsProcessed
 
Fields inherited from class dk.netarkivet.common.utils.batch.FileBatchJob
batchJobTimeout, exceptions, filesFailed, noOfFilesProcessed
 
Constructor Summary
WaybackCDXExtractionWARCBatchJob()
          Constructor which sets the timeout to one day.
WaybackCDXExtractionWARCBatchJob(long timeout)
          Alternate constructor, where a timeout can be set.
 
Method Summary
 void finish(java.io.OutputStream os)
          Does nothing except log the end of the job.
 WARCBatchFilter getFilter()
          Returns the filter that restricts processing to response records.
 void initialize(java.io.OutputStream os)
          Initializes the private fields of this class.
 void processRecord(org.archive.io.warc.WARCRecord record, java.io.OutputStream os)
          Writes one CDX line (including the newline) to the output for each response WARCRecord.
 
Methods inherited from class dk.netarkivet.common.utils.warc.WARCBatchJob
getExceptionArray, handleException, noOfRecordsProcessed, processFile
 
Methods inherited from class dk.netarkivet.common.utils.batch.FileBatchJob
addException, addFinishException, addInitializeException, getBatchJobTimeout, getExceptions, getFilenamePattern, getFilesFailed, getNoOfFilesProcessed, maxExceptionsReached, postProcess, processOnlyFileNamed, processOnlyFilesMatching, processOnlyFilesMatching, processOnlyFilesNamed, setBatchJobTimeout
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

WaybackCDXExtractionWARCBatchJob

public WaybackCDXExtractionWARCBatchJob()
Constructor which sets the timeout to one day.


WaybackCDXExtractionWARCBatchJob

public WaybackCDXExtractionWARCBatchJob(long timeout)
Alternate constructor, where a timeout can be set.

Parameters:
timeout - specific timeout period
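
The unit of the timeout is not stated on this page. Assuming it is milliseconds (an assumption that should be checked against FileBatchJob.setBatchJobTimeout), a value can be derived with java.util.concurrent.TimeUnit rather than hand-written arithmetic. The class TimeoutValues is hypothetical and serves only as an illustration.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical helper for building timeout values, assuming milliseconds. */
class TimeoutValues {

    static long days(long n) {
        return TimeUnit.DAYS.toMillis(n);
    }

    static long hours(long n) {
        return TimeUnit.HOURS.toMillis(n);
    }

    public static void main(String[] args) {
        long oneDay = days(1);     // 86400000, if the unit is milliseconds
        long sixHours = hours(6);  // 21600000
        System.out.println(oneDay + " " + sixHours);
        // With NetarchiveSuite on the classpath, the value would be passed as:
        // new WaybackCDXExtractionWARCBatchJob(sixHours);
    }
}
```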
Method Detail

getFilter

public WARCBatchFilter getFilter()
Returns the filter that restricts processing to response records.

Overrides:
getFilter in class WARCBatchJob
Returns:
A filter telling which records should be given to processRecord().

initialize

public void initialize(java.io.OutputStream os)
Initializes the private fields of this class. Some of these are relatively heavy objects, so it is important that they are only initialised once.

Specified by:
initialize in class WARCBatchJob
Parameters:
os - unused argument

finish

public void finish(java.io.OutputStream os)
Does nothing except log the end of the job.

Specified by:
finish in class WARCBatchJob
Parameters:
os - unused argument

processRecord

public void processRecord(org.archive.io.warc.WARCRecord record,
                          java.io.OutputStream os)
Writes one CDX line (including the newline) to the output for each response WARCRecord. If a WARCRecord cannot be converted to a CDX record for any reason, the resulting exception is caught and logged.

Specified by:
processRecord in class WARCBatchJob
Parameters:
record - the WARCRecord to be indexed.
os - the OutputStream to which output is written.
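
The behaviour described above (only response records are accepted, one line is written per record, and a failure on one record is caught and logged without stopping the job) can be sketched without the NetarchiveSuite or Heritrix classes. Everything below is a hypothetical stand-in: Rec replaces WARCRecord, toCdxLine is an invented placeholder for the real conversion, and the two-field output is not the real CDX format.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical, simplified model of the per-record contract. */
class CdxJobSketch {
    private static final Logger LOG =
            Logger.getLogger(CdxJobSketch.class.getName());

    /** Stand-in for a WARC record: record type, URL and timestamp. */
    static final class Rec {
        final String type;
        final String url;
        final String timestamp;

        Rec(String type, String url, String timestamp) {
            this.type = type;
            this.url = url;
            this.timestamp = timestamp;
        }
    }

    /** Plays the role of the filter: only response records pass. */
    boolean accept(Rec r) {
        return "response".equals(r.type);
    }

    /** Invented placeholder conversion; rejects records without a URL. */
    String toCdxLine(Rec r) {
        if (r.url == null || r.url.isEmpty()) {
            throw new IllegalArgumentException("record has no URL");
        }
        return r.url + " " + r.timestamp;
    }

    /** Writes one line plus newline; any failure is caught and logged. */
    void processRecord(Rec r, OutputStream os) {
        try {
            String line = toCdxLine(r) + "\n";
            os.write(line.getBytes(StandardCharsets.UTF_8));
        } catch (Exception e) {
            LOG.log(Level.WARNING, "Could not index record", e);
        }
    }

    /** Runs the filter and processRecord over all records, returns the output. */
    static String run(List<Rec> records) {
        CdxJobSketch job = new CdxJobSketch();
        ByteArrayOutputStream os = new ByteArrayOutputStream();
        for (Rec r : records) {
            if (job.accept(r)) {
                job.processRecord(r, os);
            }
        }
        return new String(os.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String out = run(Arrays.asList(
                new Rec("response", "http://example.org/", "20100101000000"),
                new Rec("request", "http://example.org/", "20100101000000"),
                new Rec("response", "", "20100101000001")));
        System.out.print(out);
    }
}
```

In this sketch the request record is skipped by the filter and the record with an empty URL is logged and dropped, so a single line reaches the output stream.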