dk.netarkivet.common.utils.archive
Class ArchiveBatchJob
java.lang.Object
dk.netarkivet.common.utils.batch.FileBatchJob
dk.netarkivet.common.utils.archive.ArchiveBatchJobBase
dk.netarkivet.common.utils.archive.ArchiveBatchJob
- All Implemented Interfaces:
- java.io.Serializable
- Direct Known Subclasses:
- ArchiveExtractCDXJob, CrawlLogLinesMatchingRegexp, DeduplicationCDXExtractionBatchJob, GetMetadataArchiveBatchJob, HarvestedUrlsForDomainBatchJob
public abstract class ArchiveBatchJob
- extends ArchiveBatchJobBase
Abstract class defining a batch job to run on a set of ARC/WARC files.
Each implementation is required to define initialize(), processRecord() and
finish() methods. The bitarchive application then ensures that the batch
job runs initialize(), runs processRecord() on each record in each file in
the archive, and then runs finish().
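The call order described above can be sketched as follows. This is a minimal, self-contained illustration and does not use the NetarchiveSuite classes: SketchBatchJob, RecordCounter and run() are hypothetical stand-ins, records are modelled as plain strings, and the OutputStream parameter on initialize() and finish() is an assumption not shown on this page.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

public class BatchLifecycleSketch {

    /** Hypothetical stand-in for a batch job; not the real ArchiveBatchJob. */
    abstract static class SketchBatchJob {
        abstract void initialize(OutputStream os) throws IOException;
        abstract void processRecord(String record, OutputStream os) throws IOException;
        abstract void finish(OutputStream os) throws IOException;
    }

    /** Example job: counts the records it sees and reports the total in finish(). */
    static class RecordCounter extends SketchBatchJob {
        private int count;

        @Override
        void initialize(OutputStream os) {
            count = 0;
        }

        @Override
        void processRecord(String record, OutputStream os) {
            count++;
        }

        @Override
        void finish(OutputStream os) throws IOException {
            os.write(("records=" + count + "\n").getBytes(StandardCharsets.UTF_8));
        }
    }

    /** Plays the role of the bitarchive application: initialize, each record of each file, finish. */
    static String run(SketchBatchJob job, List<List<String>> files) {
        ByteArrayOutputStream os = new ByteArrayOutputStream();
        try {
            job.initialize(os);
            for (List<String> file : files) {
                for (String record : file) {
                    job.processRecord(record, os);
                }
            }
            job.finish(os);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(os.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Two "files" holding two records and one record respectively.
        List<List<String>> files = Arrays.asList(
                Arrays.asList("r1", "r2"),
                Arrays.asList("r3"));
        System.out.print(run(new RecordCounter(), files));
    }
}
```

In a real subclass, per-job state such as the counter would typically be reset in initialize() and reported in finish(), since the same job object sees every record of every file in between.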
- See Also:
- Serialized Form
Method Summary
ArchiveBatchFilter getFilter()
- Returns an ArchiveBatchFilter object which restricts the set of records in the
archive on which this batch-job is performed.
boolean processFile(java.io.File archiveFile, java.io.OutputStream os)
- Accepts only arc(.gz) and warc(.gz) files.
abstract void processRecord(ArchiveRecordBase record, java.io.OutputStream os)
- Exceptions should be handled with the handleException() method.
Methods inherited from class dk.netarkivet.common.utils.batch.FileBatchJob
- addException, addFinishException, addInitializeException, getBatchJobTimeout, getExceptions, getFilenamePattern, getFilesFailed, getNoOfFilesProcessed, maxExceptionsReached, postProcess, processOnlyFileNamed, processOnlyFilesMatching, processOnlyFilesMatching, processOnlyFilesNamed, setBatchJobTimeout
Methods inherited from class java.lang.Object
- clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
ArchiveBatchJob
public ArchiveBatchJob()
processRecord
public abstract void processRecord(ArchiveRecordBase record,
java.io.OutputStream os)
- Exceptions should be handled with the handleException() method.
- Parameters:
record
- The object to be processed.
os
- The OutputStream to which output data is written.
getFilter
public ArchiveBatchFilter getFilter()
- Returns an ArchiveBatchFilter object which restricts the set of records in the
archive on which this batch-job is performed. The default value is
a neutral filter which allows all records.
- Returns:
- A filter telling which records should be given to processRecord().
processFile
public final boolean processFile(java.io.File archiveFile,
java.io.OutputStream os)
throws ArgumentNotValid
- Accepts only arc(.gz) and warc(.gz) files. Runs through all records and calls
processRecord() on every record that is allowed by getFilter().
Does nothing on a non-(w)arc file.
- Specified by:
processFile
in class FileBatchJob
- Parameters:
archiveFile
- The arc(.gz) or warc(.gz) file to be processed.
os
- The OutputStream to which output is to be written.
- Returns:
- true if the file was processed successfully, otherwise false
- Throws:
ArgumentNotValid
- if either argument is null
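The behaviour described for processFile() (accept only arc(.gz) and warc(.gz) files, pass each record allowed by the filter to processRecord(), reject null arguments) can be illustrated with a small stand-alone sketch. Everything here is hypothetical: ProcessFileSketch, isArchiveFileName() and the string-based records are illustrative stand-ins, the case-insensitive suffix check is an assumption about how file names are matched, and IllegalArgumentException stands in for ArgumentNotValid. The sketch returns the list of records that would have been handed to processRecord() rather than the real boolean result.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.function.Predicate;

public class ProcessFileSketch {

    /** Assumed name check: true for *.arc, *.arc.gz, *.warc and *.warc.gz (case-insensitive). */
    static boolean isArchiveFileName(String name) {
        String lower = name.toLowerCase(Locale.ROOT);
        return lower.endsWith(".arc") || lower.endsWith(".arc.gz")
                || lower.endsWith(".warc") || lower.endsWith(".warc.gz");
    }

    /**
     * Returns the records that would be passed on to processRecord().
     * The Predicate plays the role of the filter returned by getFilter().
     */
    static List<String> processFile(String fileName, List<String> records,
                                    Predicate<String> filter) {
        if (fileName == null || records == null || filter == null) {
            // Stand-in for ArgumentNotValid in the real class.
            throw new IllegalArgumentException("arguments must not be null");
        }
        List<String> processed = new ArrayList<>();
        if (!isArchiveFileName(fileName)) {
            return processed; // does nothing on a non-(w)arc file
        }
        for (String record : records) {
            if (filter.test(record)) {
                processed.add(record); // stands in for processRecord(record, os)
            }
        }
        return processed;
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList("http://a/", "dns:a", "http://b/");
        Predicate<String> httpOnly = r -> r.startsWith("http");
        System.out.println(processFile("1-2-3.warc.gz", records, httpOnly));
        System.out.println(processFile("notes.txt", records, httpOnly));
    }
}
```

A neutral filter, as returned by the default getFilter(), would correspond to a predicate that accepts every record (r -> true in this sketch).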