dk.netarkivet.wayback.batch
Class ExtractWaybackCDXBatchJob

java.lang.Object
  extended by dk.netarkivet.common.utils.batch.FileBatchJob
      extended by dk.netarkivet.common.utils.arc.ARCBatchJob
          extended by dk.netarkivet.wayback.batch.ExtractWaybackCDXBatchJob
All Implemented Interfaces:
java.io.Serializable

public class ExtractWaybackCDXBatchJob
extends ARCBatchJob

Produces CDX output in the format expected by wayback, including canonicalisation of URLs. The output is unsorted.
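Because the output is unsorted, a consumer that needs ordered CDX lines has to sort them in a separate step. The following is a minimal sketch of such a step, not part of this class. The class name CdxSortSketch, the method sortCdxLines and the sample lines are hypothetical, and whether a given wayback setup expects exactly this ordering is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of NetarchiveSuite.
class CdxSortSketch {

    /**
     * Returns a sorted copy of the given CDX lines.
     * Uses the natural String order, which matches byte order
     * only for plain ASCII content (an assumption of this sketch).
     */
    static List<String> sortCdxLines(List<String> unsortedLines) {
        List<String> sorted = new ArrayList<>(unsortedLines);
        Collections.sort(sorted);
        return sorted;
    }

    public static void main(String[] args) {
        // Made-up sample lines, not real job output.
        List<String> lines = Arrays.asList(
                "b.example/ 20100101000000",
                "a.example/ 20100101000000");
        for (String line : sortCdxLines(lines)) {
            System.out.println(line);
        }
    }
}
```

The input list is copied before sorting, so the caller's list is left unchanged.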

See Also:
Serialized Form

Nested Class Summary
 
Nested classes/interfaces inherited from class dk.netarkivet.common.utils.batch.FileBatchJob
FileBatchJob.ExceptionOccurrence
 
Field Summary
 
Fields inherited from class dk.netarkivet.common.utils.arc.ARCBatchJob
noOfRecordsProcessed
 
Fields inherited from class dk.netarkivet.common.utils.batch.FileBatchJob
batchJobTimeout, exceptions, filesFailed, noOfFilesProcessed
 
Constructor Summary
ExtractWaybackCDXBatchJob()
          Constructor which sets the timeout to one day.
ExtractWaybackCDXBatchJob(long timeout)
          Constructor which sets a caller-specified timeout.
 
Method Summary
 void finish(java.io.OutputStream os)
          Does nothing except log the end of the job.
 void initialize(java.io.OutputStream os)
          Initializes the private fields of this class.
 void processRecord(org.archive.io.arc.ARCRecord record, java.io.OutputStream os)
          Writes one CDX line (including newline) to the output for each ARCRecord.
 
Methods inherited from class dk.netarkivet.common.utils.arc.ARCBatchJob
getExceptionArray, getFilter, handleException, noOfRecordsProcessed, processFile
 
Methods inherited from class dk.netarkivet.common.utils.batch.FileBatchJob
addException, addFinishException, addInitializeException, getBatchJobTimeout, getExceptions, getFilenamePattern, getFilesFailed, getNoOfFilesProcessed, maxExceptionsReached, postProcess, processOnlyFileNamed, processOnlyFilesMatching, processOnlyFilesMatching, processOnlyFilesNamed, setBatchJobTimeout
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

ExtractWaybackCDXBatchJob

public ExtractWaybackCDXBatchJob()
Constructor which sets the timeout to one day.


ExtractWaybackCDXBatchJob

public ExtractWaybackCDXBatchJob(long timeout)
Constructor which sets a caller-specified timeout.

Parameters:
timeout - the timeout period to apply to this job

Method Detail

initialize

public void initialize(java.io.OutputStream os)
Initializes the private fields of this class. Some of these are relatively heavy objects, so it is important that they are only initialised once.

Specified by:
initialize in class ARCBatchJob
Parameters:
os - unused argument
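
The initialise-once pattern described above can be sketched as follows. This is an illustration under stated assumptions, not the actual implementation: InitializeOnceSketch and HeavyHelper are hypothetical names, and lower-casing stands in for real URL canonicalisation.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.util.Locale;

// Hypothetical sketch, not the NetarchiveSuite implementation.
class InitializeOnceSketch {

    /** Stand-in for a costly helper; counts how often it is constructed. */
    static class HeavyHelper {
        static int constructions = 0;

        HeavyHelper() {
            constructions++;
        }

        /** Placeholder only: lower-casing is not real canonicalisation. */
        String canonicalize(String url) {
            return url.toLowerCase(Locale.ROOT);
        }
    }

    private HeavyHelper helper;

    /** Builds the heavy helper once, before any record is processed. */
    void initialize(OutputStream os) {
        helper = new HeavyHelper();
    }

    /** Reuses the helper created in initialize for every record. */
    String processRecord(String url) {
        return helper.canonicalize(url);
    }

    /** Runs one job over three inputs and reports how many helpers were built. */
    static int runDemo() {
        HeavyHelper.constructions = 0;
        InitializeOnceSketch job = new InitializeOnceSketch();
        job.initialize(new ByteArrayOutputStream());
        for (String url : new String[] {"HTTP://A.example/", "http://B.example/", "http://c.example/"}) {
            job.processRecord(url);
        }
        return HeavyHelper.constructions;
    }

    public static void main(String[] args) {
        System.out.println("helpers constructed: " + runDemo());
    }
}
```

Creating the helper in initialize rather than in processRecord keeps the per-record cost independent of how expensive the helper is to build.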

processRecord

public void processRecord(org.archive.io.arc.ARCRecord record,
                          java.io.OutputStream os)
Writes one CDX line (including newline) to the output for each ARCRecord. If an ARCRecord cannot be converted to a CDX record for any reason, the resulting exception is caught and logged.

Specified by:
processRecord in class ARCBatchJob
Parameters:
record - the ARCRecord to be indexed.
os - the OutputStream to which output is written.
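
The write-or-log behaviour described above can be sketched as follows. This is a simplified illustration, not the actual implementation: it uses plain strings in place of ARCRecord, the names ProcessRecordSketch and toCdxLine are hypothetical, and continuing with the next record after a failure is an assumption of the sketch.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical sketch, not the NetarchiveSuite implementation.
class ProcessRecordSketch {

    private static final Logger LOG =
            Logger.getLogger(ProcessRecordSketch.class.getName());

    /** Hypothetical conversion step; throws for input it cannot handle. */
    static String toCdxLine(String record) {
        if (record == null || record.isEmpty()) {
            throw new IllegalArgumentException("cannot convert record");
        }
        return record.trim();
    }

    /** Writes one line (including newline) per record; failures are logged, not rethrown. */
    void processRecord(String record, OutputStream os) {
        try {
            String line = toCdxLine(record) + "\n";
            os.write(line.getBytes(StandardCharsets.UTF_8));
        } catch (Exception e) {
            LOG.log(Level.WARNING, "Skipping record that could not be converted", e);
        }
    }

    /** Processes two good records and one bad one, returning what was written. */
    static String runDemo() {
        ByteArrayOutputStream os = new ByteArrayOutputStream();
        ProcessRecordSketch job = new ProcessRecordSketch();
        job.processRecord("first", os);
        job.processRecord("", os); // fails conversion; logged and skipped
        job.processRecord("second", os);
        return new String(os.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.print(runDemo());
    }
}
```

Catching the exception inside processRecord means a single unconvertible record does not abort the rest of the sketch's run.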

finish

public void finish(java.io.OutputStream os)
Does nothing except log the end of the job.

Specified by:
finish in class ARCBatchJob
Parameters:
os - unused argument.