dk.netarkivet.wayback.batch
Class DeduplicationCDXExtractionBatchJob

java.lang.Object
  extended by dk.netarkivet.common.utils.batch.FileBatchJob
      extended by dk.netarkivet.common.utils.archive.ArchiveBatchJobBase
          extended by dk.netarkivet.common.utils.archive.ArchiveBatchJob
              extended by dk.netarkivet.wayback.batch.DeduplicationCDXExtractionBatchJob
All Implemented Interfaces:
java.io.Serializable

public class DeduplicationCDXExtractionBatchJob
extends ArchiveBatchJob

This batch job takes deduplication records from a crawl log in a metadata ARC file and converts them to CDX records for use in Wayback.

See Also:
Serialized Form

Nested Class Summary
 
Nested classes/interfaces inherited from class dk.netarkivet.common.utils.batch.FileBatchJob
FileBatchJob.ExceptionOccurrence
 
Field Summary
 
Fields inherited from class dk.netarkivet.common.utils.archive.ArchiveBatchJobBase
noOfRecordsProcessed
 
Fields inherited from class dk.netarkivet.common.utils.batch.FileBatchJob
batchJobTimeout, exceptions, filesFailed, noOfFilesProcessed
 
Constructor Summary
DeduplicationCDXExtractionBatchJob()
           
 
Method Summary
 void finish(java.io.OutputStream os)
          Does nothing.
 void initialize(java.io.OutputStream os)
          Initializes various fields of this class.
 void processRecord(ArchiveRecordBase record, java.io.OutputStream os)
          If the ArchiveRecord is a crawl-log entry then any duplicate entries in the crawl log are converted to CDX entries and written to the output.
 
Methods inherited from class dk.netarkivet.common.utils.archive.ArchiveBatchJob
getFilter, processFile
 
Methods inherited from class dk.netarkivet.common.utils.archive.ArchiveBatchJobBase
getExceptionArray, handleException, handleOurException, noOfRecordsProcessed
 
Methods inherited from class dk.netarkivet.common.utils.batch.FileBatchJob
addException, addFinishException, addInitializeException, getBatchJobTimeout, getExceptions, getFilenamePattern, getFilesFailed, getNoOfFilesProcessed, maxExceptionsReached, postProcess, processOnlyFileNamed, processOnlyFilesMatching, processOnlyFilesMatching, processOnlyFilesNamed, setBatchJobTimeout
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

DeduplicationCDXExtractionBatchJob

public DeduplicationCDXExtractionBatchJob()

Method Detail

initialize

public void initialize(java.io.OutputStream os)
Initializes various fields of this class.

Specified by:
initialize in class ArchiveBatchJobBase
Parameters:
os - unused parameter

processRecord

public void processRecord(ArchiveRecordBase record,
                          java.io.OutputStream os)
If the ArchiveRecord is a crawl-log entry, any duplicate entries in the crawl log are converted to CDX entries and written to the output. Otherwise this method returns without doing anything. If the ArchiveRecord is a WarcRecord and is the warcinfo record, it is skipped.

Specified by:
processRecord in class ArchiveBatchJob
Parameters:
record - The ArchiveRecord to be processed
os - the stream to which output is written
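
The conversion described above can be illustrated with a minimal, self-contained sketch. It is not the implementation used by this class. It assumes that a duplicate is marked in the crawl log by an annotation of the form duplicate:"<archive file>,<offset>" and that the whitespace-separated fields of a crawl log line start with timestamp, status code, size, URI, discovery path, referrer and MIME type. The class name DuplicateLineSketch, the method toCdxLike and the order of the output fields are hypothetical and chosen only for illustration.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DuplicateLineSketch {
    // Assumed annotation shape: duplicate:"<archive file>,<offset>"
    private static final Pattern DUPLICATE =
            Pattern.compile("duplicate:\"([^\",]+),(\\d+)\"");

    /**
     * Returns a simplified CDX-like line for a crawl log line that carries a
     * duplicate annotation, or an empty Optional for any other line.
     */
    static Optional<String> toCdxLike(String crawlLogLine) {
        Matcher m = DUPLICATE.matcher(crawlLogLine);
        if (!m.find()) {
            return Optional.empty(); // not a duplicate entry: nothing to emit
        }
        String[] fields = crawlLogLine.trim().split("\\s+");
        if (fields.length < 7) {
            return Optional.empty(); // too few fields to read URI and MIME type
        }
        String status = fields[1];
        String uri = fields[3];
        String mime = fields[6];
        String file = m.group(1);   // archive file holding the original capture
        String offset = m.group(2); // offset of the original record in that file
        // Illustrative field order only; real CDX formats define their own.
        return Optional.of(String.join(" ", uri, mime, status, offset, file));
    }

    public static void main(String[] args) {
        String line = "2011-01-01T12:00:00.000Z 200 1234 http://example.org/ L "
                + "http://example.org/ref text/html #001 20110101120000000+12 "
                + "sha1:ABC - duplicate:\"orig.arc,5678\",content-size:1234";
        toCdxLike(line).ifPresent(System.out::println);
    }
}
```

Lines without the annotation yield an empty result, which mirrors the documented behaviour of returning without writing anything for records that are not duplicates.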

finish

public void finish(java.io.OutputStream os)
Does nothing.

Specified by:
finish in class ArchiveBatchJobBase
Parameters:
os - an output stream
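
The three methods documented above read as a lifecycle: initialize, then processRecord for each record, then finish. The following sketch shows that pattern under the assumption that a driver calls the methods in this order. RecordJob, EchoDuplicatesJob and run are hypothetical stand-ins written for this example. They are not NetarchiveSuite types, and records are modelled as plain strings rather than ArchiveRecordBase objects.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

public class LifecycleSketch {
    /** Hypothetical stand-in for the three callbacks; not the NetarchiveSuite type. */
    interface RecordJob {
        void initialize(OutputStream os) throws IOException;
        void processRecord(String record, OutputStream os) throws IOException;
        void finish(OutputStream os) throws IOException;
    }

    /** Counts every record and echoes only those that mention a duplicate. */
    static class EchoDuplicatesJob implements RecordJob {
        int seen;

        @Override
        public void initialize(OutputStream os) {
            seen = 0; // reset state; the stream is unused here
        }

        @Override
        public void processRecord(String record, OutputStream os) throws IOException {
            seen++;
            if (record.contains("duplicate:")) {
                os.write((record + "\n").getBytes(StandardCharsets.UTF_8));
            }
        }

        @Override
        public void finish(OutputStream os) {
            // nothing to flush or summarise in this sketch
        }
    }

    /** Assumed driver order: initialize, processRecord per record, finish. */
    static String run(RecordJob job, List<String> records) throws IOException {
        ByteArrayOutputStream os = new ByteArrayOutputStream();
        job.initialize(os);
        for (String record : records) {
            job.processRecord(record, os);
        }
        job.finish(os);
        return os.toString(StandardCharsets.UTF_8.name());
    }

    public static void main(String[] args) throws IOException {
        List<String> records = Arrays.asList("a duplicate:\"f.arc,1\"", "b", "c");
        System.out.print(run(new EchoDuplicatesJob(), records));
    }
}
```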