Package dk.netarkivet.wayback.batch
Class WaybackCDXExtractionWARCBatchJob
- java.lang.Object
  - dk.netarkivet.common.utils.batch.FileBatchJob
    - dk.netarkivet.common.utils.warc.WARCBatchJob
      - dk.netarkivet.wayback.batch.WaybackCDXExtractionWARCBatchJob
- All Implemented Interfaces:
Serializable
public class WaybackCDXExtractionWARCBatchJob extends WARCBatchJob
Returns a CDX file in the format appropriate for Wayback, including canonicalisation of URLs. The returned files are unsorted.
- See Also:
- Serialized Form
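Since the returned files are unsorted, a consumer that needs ordered CDX lines has to sort them after the job has run. The following is a minimal sketch of that step, assuming the job output has already been read into memory as one string per line and that plain lexicographic string ordering is what the consumer wants. The class name `CdxSortSketch` and method `sortLines` are hypothetical and not part of this package.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class CdxSortSketch {

    /**
     * Returns a lexicographically sorted copy of the given lines.
     * The input list is left untouched.
     * Hypothetical helper, not part of dk.netarkivet.wayback.batch.
     */
    public static List<String> sortLines(List<String> unsortedLines) {
        List<String> sorted = new ArrayList<>(unsortedLines);
        // Natural String ordering; for pure ASCII lines this matches byte order.
        Collections.sort(sorted);
        return sorted;
    }
}
```

For output too large to hold in memory, an external sort would be the more realistic choice; the in-memory version above only illustrates the ordering step.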
-
-
Nested Class Summary
-
Nested classes/interfaces inherited from class dk.netarkivet.common.utils.batch.FileBatchJob
FileBatchJob.ExceptionOccurrence
-
-
Field Summary
-
Fields inherited from class dk.netarkivet.common.utils.warc.WARCBatchJob
noOfRecordsProcessed
-
Fields inherited from class dk.netarkivet.common.utils.batch.FileBatchJob
batchJobTimeout, exceptions, filesFailed, noOfFilesProcessed
-
-
Constructor Summary
Constructor Description
WaybackCDXExtractionWARCBatchJob()
Constructor which sets the timeout to one day.
WaybackCDXExtractionWARCBatchJob(long timeout)
Alternate constructor, where a timeout can be set.
-
Method Summary
Modifier and Type Method Description
void finish(OutputStream os)
Does nothing except log the end of the job.
WARCBatchFilter getFilter()
Gets the filter, which is set so that only response records are processed.
void initialize(OutputStream os)
Initializes the private fields of this class.
void processRecord(org.archive.io.warc.WARCRecord record, OutputStream os)
For each response WARCRecord, writes one CDX line (including newline) to the output.
-
Methods inherited from class dk.netarkivet.common.utils.warc.WARCBatchJob
getExceptionArray, handleException, noOfRecordsProcessed, processFile
-
Methods inherited from class dk.netarkivet.common.utils.batch.FileBatchJob
addException, addFinishException, addInitializeException, getBatchJobTimeout, getExceptions, getFilenamePattern, getFilesFailed, getNoOfFilesProcessed, maxExceptionsReached, postProcess, processOnlyFileNamed, processOnlyFilesMatching, processOnlyFilesMatching, processOnlyFilesNamed, setBatchJobTimeout
-
-
-
-
Constructor Detail
-
WaybackCDXExtractionWARCBatchJob
public WaybackCDXExtractionWARCBatchJob()
Constructor which sets the timeout to one day.
-
WaybackCDXExtractionWARCBatchJob
public WaybackCDXExtractionWARCBatchJob(long timeout)
Alternate constructor, where a timeout can be set.
- Parameters:
timeout - specific timeout period
-
-
Method Detail
-
getFilter
public WARCBatchFilter getFilter()
Gets the filter, which is set so that only response records are processed.
- Overrides:
getFilter in class WARCBatchJob
- Returns:
- A filter telling which records should be given to processRecord().
-
initialize
public void initialize(OutputStream os)
Initializes the private fields of this class. Some of these are relatively heavy objects, so it is important that they are only initialised once.
- Specified by:
initialize in class WARCBatchJob
- Parameters:
os - unused argument
-
finish
public void finish(OutputStream os)
Does nothing except log the end of the job.
- Specified by:
finish in class WARCBatchJob
- Parameters:
os - unused argument
-
processRecord
public void processRecord(org.archive.io.warc.WARCRecord record, OutputStream os)
For each response WARCRecord, writes one CDX line (including newline) to the output. If a WARCRecord cannot be converted to a CDX record for any reason, the resulting exception is caught and logged.
- Specified by:
processRecord in class WARCBatchJob
- Parameters:
record - the WARCRecord to be indexed.
os - the OutputStream to which output is written.
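The contract described above (one line per record, with conversion failures caught and logged rather than propagated) can be sketched in isolation. This is not the class's actual implementation: the class `RecordLineWriterSketch`, the method `writeLine`, the generic record type `R`, and the `toLine` converter are all hypothetical stand-ins, and `java.util.logging` is used only to keep the example self-contained.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.function.Function;
import java.util.logging.Level;
import java.util.logging.Logger;

public class RecordLineWriterSketch {

    private static final Logger LOG =
            Logger.getLogger(RecordLineWriterSketch.class.getName());

    /**
     * Converts one record to a text line and writes it, newline included.
     * Any failure is logged and swallowed so that later records can still
     * be processed. Returns true if a line was written.
     * Hypothetical helper; toLine stands in for the real CDX conversion.
     */
    public static <R> boolean writeLine(R record, Function<R, String> toLine,
                                        OutputStream os) {
        try {
            String line = toLine.apply(record);
            os.write((line + "\n").getBytes(StandardCharsets.UTF_8));
            return true;
        } catch (IOException | RuntimeException e) {
            // Log and continue: one bad record should not abort the whole job.
            LOG.log(Level.WARNING, "Could not convert record to an output line", e);
            return false;
        }
    }
}
```

Returning a boolean is a choice made for this sketch so that callers and tests can observe skipped records; the documented method itself returns void.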
-
-