dk.netarkivet.wayback.batch
Class WaybackCDXExtractionWARCBatchJob
java.lang.Object
dk.netarkivet.common.utils.batch.FileBatchJob
dk.netarkivet.common.utils.warc.WARCBatchJob
dk.netarkivet.wayback.batch.WaybackCDXExtractionWARCBatchJob
- All Implemented Interfaces:
- java.io.Serializable
public class WaybackCDXExtractionWARCBatchJob
- extends WARCBatchJob
Produces a CDX file in the format expected by Wayback, including
canonicalisation of URLs. The returned files are unsorted.
- See Also:
- Serialized Form
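Because the returned files are unsorted, a consumer will usually need to sort the CDX lines before using them as a lookup index. The following is a minimal sketch of one way to do that, assuming the lines fit in memory and that plain byte order (as produced by `sort` under `LC_ALL=C`) is the desired ordering. The class `CdxSortSketch`, the method `sortCdxLines` and the sample lines are hypothetical and are not part of this API.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, not part of NetarchiveSuite.
final class CdxSortSketch {

    // Orders strings by their unsigned UTF-8 bytes (assumed target ordering).
    static final Comparator<String> BYTE_ORDER = (a, b) -> Arrays.compareUnsigned(
            a.getBytes(StandardCharsets.UTF_8),
            b.getBytes(StandardCharsets.UTF_8));

    // Returns a sorted copy of the given lines; the input list is not modified.
    static List<String> sortCdxLines(List<String> lines) {
        List<String> sorted = new ArrayList<>(lines);
        sorted.sort(BYTE_ORDER);
        return sorted;
    }

    public static void main(String[] args) {
        // Made-up, abbreviated lines standing in for real CDX output.
        List<String> unsorted = Arrays.asList(
                "b.example/ 20100101000000",
                "a.example/ 20100101000000",
                "B.example/ 20100101000000");
        for (String line : sortCdxLines(unsorted)) {
            System.out.println(line);
        }
    }
}
```

For output too large to hold in memory, an external sort over the concatenated batch results would be the more realistic choice; the comparator above only illustrates the ordering.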
Method Summary |
void |
finish(java.io.OutputStream os)
Does nothing except log the end of the job. |
WARCBatchFilter |
getFilter()
Returns the filter, so that only response records are
currently processed. |
void |
initialize(java.io.OutputStream os)
Initializes the private fields of this class. |
void |
processRecord(org.archive.io.warc.WARCRecord record,
java.io.OutputStream os)
For each response WARCRecord it writes one CDX line (including newline) to the output. |
Methods inherited from class dk.netarkivet.common.utils.batch.FileBatchJob |
addException, addFinishException, addInitializeException, getBatchJobTimeout, getExceptions, getFilenamePattern, getFilesFailed, getNoOfFilesProcessed, maxExceptionsReached, postProcess, processOnlyFileNamed, processOnlyFilesMatching, processOnlyFilesMatching, processOnlyFilesNamed, setBatchJobTimeout |
Methods inherited from class java.lang.Object |
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
WaybackCDXExtractionWARCBatchJob
public WaybackCDXExtractionWARCBatchJob()
- Constructor which sets the timeout to one day.
WaybackCDXExtractionWARCBatchJob
public WaybackCDXExtractionWARCBatchJob(long timeout)
- Alternate constructor, where a timeout can be set.
- Parameters:
timeout
- specific timeout period
getFilter
public WARCBatchFilter getFilter()
- Returns the filter, so that only response records are
currently processed.
- Overrides:
getFilter
in class WARCBatchJob
- Returns:
- A filter telling which records should be given to
processRecord().
initialize
public void initialize(java.io.OutputStream os)
- Initializes the private fields of this class. Some of these are
relatively heavy objects, so it is important that they are only
initialised once.
- Specified by:
initialize
in class WARCBatchJob
- Parameters:
os
- unused argument
finish
public void finish(java.io.OutputStream os)
- Does nothing except log the end of the job.
- Specified by:
finish
in class WARCBatchJob
- Parameters:
os
- unused argument.
processRecord
public void processRecord(org.archive.io.warc.WARCRecord record,
java.io.OutputStream os)
- For each response WARCRecord it writes one CDX line (including newline) to the output.
If a WARCRecord cannot be converted to a CDX record for any reason,
any resulting exception is caught and logged.
- Specified by:
processRecord
in class WARCBatchJob
- Parameters:
record
- the WARCRecord to be indexed.
os
- the OutputStream to which output is written.
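The contract described for processRecord (one newline-terminated line per record, with conversion failures caught and logged rather than propagated) can be illustrated with a small stand-alone sketch. It does not use the real WARCRecord type or the actual CDX conversion. The class `RecordToLineSketch`, the method `writeLine` and the `toCdxLine` function are hypothetical stand-ins chosen for illustration only.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.function.Function;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical illustration of the "write one line, log failures" pattern.
final class RecordToLineSketch {

    private static final Logger LOG =
            Logger.getLogger(RecordToLineSketch.class.getName());

    // Converts the record to a line and writes it followed by a newline.
    // Any exception from conversion or writing is logged, not rethrown.
    // Returns true if a line was written, false if the record was skipped.
    static boolean writeLine(String record,
                             Function<String, String> toCdxLine,
                             OutputStream os) {
        try {
            String line = toCdxLine.apply(record);
            os.write((line + "\n").getBytes(StandardCharsets.UTF_8));
            return true;
        } catch (Exception e) {
            LOG.log(Level.WARNING, "Could not convert record; skipping", e);
            return false;
        }
    }

    public static void main(String[] args) {
        ByteArrayOutputStream os = new ByteArrayOutputStream();
        // The first record converts; the second simulates a failure.
        writeLine("record-1", r -> "cdx line for " + r, os);
        writeLine("record-2", r -> {
            throw new IllegalStateException("simulated failure");
        }, os);
        System.out.print(new String(os.toByteArray(), StandardCharsets.UTF_8));
    }
}
```

The point of the pattern is that a single unconvertible record does not abort the batch job: the failure is recorded in the log and processing continues with the next record.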