Package dk.netarkivet.common.utils.warc
Class WARCBatchJob
java.lang.Object
  dk.netarkivet.common.utils.batch.FileBatchJob
    dk.netarkivet.common.utils.warc.WARCBatchJob

All Implemented Interfaces:
Serializable
Direct Known Subclasses:
WARCExtractCDXJob, WaybackCDXExtractionWARCBatchJob
public abstract class WARCBatchJob extends FileBatchJob
Abstract class defining a batch job to run on a set of WARC files. Each implementation must define the initialize(), processRecord() and finish() methods. The bitarchive application then ensures that the batch job runs initialize(), runs processRecord() on each record in each file in the archive, and then runs finish().
See Also:
Serialized Form
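The call order described above can be illustrated with a small, self-contained sketch. It does not use the NetarchiveSuite classes: SketchJob, CountingJob and run() are hypothetical stand-ins, and records are plain strings rather than WARCRecord objects. Only the initialize(), processRecord(), finish() sequence is taken from the description.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.io.PrintStream;
import java.util.List;

public class LifecycleSketch {
    /** Hypothetical stand-in for WARCBatchJob; a record is a plain String here. */
    abstract static class SketchJob {
        abstract void initialize(OutputStream os);
        abstract void processRecord(String record, OutputStream os);
        abstract void finish(OutputStream os);
    }

    /** Example job: counts records and writes the total when finishing. */
    static class CountingJob extends SketchJob {
        private int count;

        @Override
        void initialize(OutputStream os) {
            count = 0;
        }

        @Override
        void processRecord(String record, OutputStream os) {
            count++;
        }

        @Override
        void finish(OutputStream os) {
            new PrintStream(os, true).println("records=" + count);
        }
    }

    /** Hypothetical driver playing the role of the bitarchive application. */
    static String run(SketchJob job, List<List<String>> files) {
        ByteArrayOutputStream os = new ByteArrayOutputStream();
        job.initialize(os);                     // once, before any record
        for (List<String> file : files) {       // each file in the archive
            for (String record : file) {        // each record in the file
                job.processRecord(record, os);
            }
        }
        job.finish(os);                         // once, after the last record
        return os.toString().trim();
    }

    public static void main(String[] args) {
        List<List<String>> files = List.of(List.of("a", "b"), List.of("c"));
        System.out.println(run(new CountingJob(), files));
    }
}
```

A real subclass would extend WARCBatchJob instead and receive org.archive.io.warc.WARCRecord objects; the driver loop is supplied by the bitarchive application, not by the job.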
-
-
Nested Class Summary
-
Nested classes/interfaces inherited from class dk.netarkivet.common.utils.batch.FileBatchJob
FileBatchJob.ExceptionOccurrence
-
-
Field Summary
Fields
protected int noOfRecordsProcessed
The total number of records processed.
Fields inherited from class dk.netarkivet.common.utils.batch.FileBatchJob
batchJobTimeout, exceptions, filesFailed, noOfFilesProcessed
-
-
Constructor Summary
Constructors
WARCBatchJob()
-
Method Summary
abstract void finish(OutputStream os)
Finish up the job.
Exception[] getExceptionArray()
Returns a representation of the list of Exceptions recorded for this WARC batch job.
WARCBatchFilter getFilter()
Returns a BatchFilter object that restricts the set of WARC records in the archive on which this batch job is performed.
void handleException(Exception e, File warcfile, long index)
When the org.archive.io.warc classes throw IOExceptions while reading, this is where they go.
abstract void initialize(OutputStream os)
Initialize the job before running.
int noOfRecordsProcessed()
boolean processFile(File warcFile, OutputStream os)
Accepts only WARC and WARCGZ files.
abstract void processRecord(org.archive.io.warc.WARCRecord record, OutputStream os)
Exceptions should be handled with the handleException() method.
Methods inherited from class dk.netarkivet.common.utils.batch.FileBatchJob
addException, addFinishException, addInitializeException, getBatchJobTimeout, getExceptions, getFilenamePattern, getFilesFailed, getNoOfFilesProcessed, maxExceptionsReached, postProcess, processOnlyFileNamed, processOnlyFilesMatching, processOnlyFilesMatching, processOnlyFilesNamed, setBatchJobTimeout
-
-
-
-
Method Detail
-
initialize
public abstract void initialize(OutputStream os)
Initialize the job before running. This is called before the processRecord() calls start coming.
Specified by:
initialize in class FileBatchJob
Parameters:
os - The OutputStream to which output data is written
-
processRecord
public abstract void processRecord(org.archive.io.warc.WARCRecord record, OutputStream os)
Exceptions should be handled with the handleException() method.
Parameters:
record - the object to be processed.
os - The OutputStream to which output data is written
-
finish
public abstract void finish(OutputStream os)
Finish up the job. This is called after the last processRecord() call.
Specified by:
finish in class FileBatchJob
Parameters:
os - The OutputStream to which output data is written
-
getFilter
public WARCBatchFilter getFilter()
Returns a BatchFilter object that restricts the set of WARC records in the archive on which this batch job is performed. The default value is a neutral filter that allows all records.
Returns:
A filter telling which records should be given to processRecord().
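As a rough illustration of how overriding getFilter() narrows the records a job sees, the sketch below uses java.util.function.Predicate as a hypothetical stand-in for WARCBatchFilter and URL strings as stand-ins for records. SketchJob, HttpOnlyJob and processAll() are invented for this example and are not part of the library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

public class FilterSketch {
    /** Hypothetical stand-in for WARCBatchJob; a record is just its URL string. */
    abstract static class SketchJob {
        final List<String> seen = new ArrayList<>();

        /** Default: a neutral filter that allows all records. */
        Predicate<String> getFilter() {
            return record -> true;
        }

        abstract void processRecord(String record);

        /** Hands only the records allowed by getFilter() to processRecord(). */
        final void processAll(List<String> records) {
            Predicate<String> filter = getFilter();
            for (String record : records) {
                if (filter.test(record)) {
                    processRecord(record);
                }
            }
        }
    }

    /** Overrides getFilter() so that only http and https records are processed. */
    static class HttpOnlyJob extends SketchJob {
        @Override
        Predicate<String> getFilter() {
            return record -> record.startsWith("http");
        }

        @Override
        void processRecord(String record) {
            seen.add(record);
        }
    }

    static List<String> demo() {
        HttpOnlyJob job = new HttpOnlyJob();
        job.processAll(List.of(
                "http://example.org/",
                "dns:example.org",
                "https://example.org/a"));
        return job.seen;
    }

    public static void main(String[] args) {
        // The dns: record is rejected by the filter and never reaches processRecord().
        System.out.println(demo());
    }
}
```

Keeping the selection logic in the filter rather than inside processRecord() leaves processRecord() free to assume that every record it receives is relevant.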
-
processFile
public final boolean processFile(File warcFile, OutputStream os) throws ArgumentNotValid
Accepts only WARC and WARCGZ files. Runs through all records and calls processRecord() on every record that is allowed by getFilter(). Does nothing on a non-WARC file.
Specified by:
processFile in class FileBatchJob
Parameters:
warcFile - The WARC or WARCGZ file to be processed.
os - The OutputStream to which output is to be written
Returns:
true if the file was processed successfully, otherwise false
Throws:
ArgumentNotValid - if either argument is null
-
handleException
public void handleException(Exception e, File warcfile, long index) throws ArgumentNotValid
When the org.archive.io.warc classes throw IOExceptions while reading, this is where they go. Subclasses are welcome to override the default functionality, which simply logs the exceptions and records them in a list.
TODO: Actually use the warcfile/index entries in the exception list
Parameters:
e - An Exception thrown by the org.archive.io.warc classes.
warcfile - The WARC file that was being processed when the Exception was thrown
index - The index (in the WARC file) at which the Exception was thrown
Throws:
ArgumentNotValid - if e is null
-
getExceptionArray
public Exception[] getExceptionArray()
Returns a representation of the list of Exceptions recorded for this WARC batch job. If a subclass relies on this method, any method that overrides handleException() should always call super.handleException() so that the exceptions are still recorded.
Returns:
All Exceptions passed to handleException so far.
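The relationship between handleException() and getExceptionArray() can be sketched as follows. SketchJob is a hypothetical stand-in that only mimics the documented default behaviour (record each exception in a list); it uses IllegalArgumentException where the real class throws ArgumentNotValid, and it omits logging. TallyingJob and its ioFailures counter are likewise invented for illustration.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class ExceptionSketch {
    /** Hypothetical stand-in for the default behaviour: record every exception. */
    static class SketchJob {
        private final List<Exception> recorded = new ArrayList<>();

        public void handleException(Exception e, File warcfile, long index) {
            if (e == null) {
                // Stand-in for ArgumentNotValid in the real class.
                throw new IllegalArgumentException("e must not be null");
            }
            recorded.add(e);
        }

        public Exception[] getExceptionArray() {
            return recorded.toArray(new Exception[0]);
        }
    }

    /** Adds its own bookkeeping but still delegates to the superclass. */
    static class TallyingJob extends SketchJob {
        int ioFailures = 0;

        @Override
        public void handleException(Exception e, File warcfile, long index) {
            if (e instanceof IOException) {
                ioFailures++;
            }
            // Without this call, getExceptionArray() would stay empty.
            super.handleException(e, warcfile, index);
        }
    }

    static TallyingJob demo() {
        TallyingJob job = new TallyingJob();
        File file = new File("example.warc"); // hypothetical file name, never opened
        job.handleException(new IOException("truncated record"), file, 42L);
        job.handleException(new IllegalStateException("bad header"), file, 0L);
        return job;
    }

    public static void main(String[] args) {
        TallyingJob job = demo();
        System.out.println(job.getExceptionArray().length + " recorded, "
                + job.ioFailures + " I/O failures");
    }
}
```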
-
noOfRecordsProcessed
public int noOfRecordsProcessed()
Returns:
the number of records processed.
-
-