dk.netarkivet.common.utils.archive
Class ArchiveBatchJob

java.lang.Object
  extended by dk.netarkivet.common.utils.batch.FileBatchJob
      extended by dk.netarkivet.common.utils.archive.ArchiveBatchJobBase
          extended by dk.netarkivet.common.utils.archive.ArchiveBatchJob
All Implemented Interfaces:
java.io.Serializable
Direct Known Subclasses:
ArchiveExtractCDXJob, CrawlLogLinesMatchingRegexp, DeduplicationCDXExtractionBatchJob, GetMetadataArchiveBatchJob, HarvestedUrlsForDomainBatchJob

public abstract class ArchiveBatchJob
extends ArchiveBatchJobBase

Abstract class defining a batch job to run on a set of ARC/WARC files. Each implementation must define the initialize(), processRecord() and finish() methods. The bitarchive application runs initialize() first, then runs processRecord() on each record in each file in the archive, and finally runs finish().
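
The lifecycle described above can be illustrated with a minimal, self-contained sketch. It does not use the NetarchiveSuite classes: SketchBatchJob, RecordCountingJob and LifecycleSketch are hypothetical stand-ins, records are modeled as plain strings, and the method signatures are assumptions chosen for illustration only.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.io.PrintStream;
import java.util.List;

// Hypothetical stand-in for ArchiveBatchJob: only the three lifecycle hooks are modeled.
abstract class SketchBatchJob {
    abstract void initialize(OutputStream os);
    abstract void processRecord(String record, OutputStream os);
    abstract void finish(OutputStream os);
}

// Example job that counts records and writes the total when the batch ends.
class RecordCountingJob extends SketchBatchJob {
    private int count;

    @Override
    void initialize(OutputStream os) {
        count = 0;
    }

    @Override
    void processRecord(String record, OutputStream os) {
        count++;
    }

    @Override
    void finish(OutputStream os) {
        PrintStream out = new PrintStream(os, true);
        out.print("records=" + count + "\n");
        out.flush();
    }
}

class LifecycleSketch {
    // Hypothetical driver playing the role of the bitarchive application:
    // initialize once, processRecord per record per file, finish once.
    static String run(SketchBatchJob job, List<List<String>> files) {
        ByteArrayOutputStream os = new ByteArrayOutputStream();
        job.initialize(os);
        for (List<String> file : files) {
            for (String record : file) {
                job.processRecord(record, os);
            }
        }
        job.finish(os);
        return os.toString();
    }

    public static void main(String[] args) {
        // Two "files" holding three records in total.
        System.out.print(run(new RecordCountingJob(),
                List.of(List.of("r1", "r2"), List.of("r3"))));
    }
}
```

A real subclass would extend ArchiveBatchJob and receive ArchiveRecordBase objects instead of strings; the ordering of the three calls is the only part this sketch is meant to show.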

See Also:
Serialized Form

Nested Class Summary
 
Nested classes/interfaces inherited from class dk.netarkivet.common.utils.batch.FileBatchJob
FileBatchJob.ExceptionOccurrence
 
Field Summary
 
Fields inherited from class dk.netarkivet.common.utils.archive.ArchiveBatchJobBase
noOfRecordsProcessed
 
Fields inherited from class dk.netarkivet.common.utils.batch.FileBatchJob
batchJobTimeout, exceptions, filesFailed, noOfFilesProcessed
 
Constructor Summary
ArchiveBatchJob()
           
 
Method Summary
 ArchiveBatchFilter getFilter()
          Returns an ArchiveBatchFilter object which restricts the set of records in the archive on which this batch job is performed.
 boolean processFile(java.io.File archiveFile, java.io.OutputStream os)
          Accepts only arc(.gz) and warc(.gz) files.
abstract  void processRecord(ArchiveRecordBase record, java.io.OutputStream os)
          Processes a single archive record. Exceptions should be handled with the handleException() method.
 
Methods inherited from class dk.netarkivet.common.utils.archive.ArchiveBatchJobBase
finish, getExceptionArray, handleException, handleOurException, initialize, noOfRecordsProcessed
 
Methods inherited from class dk.netarkivet.common.utils.batch.FileBatchJob
addException, addFinishException, addInitializeException, getBatchJobTimeout, getExceptions, getFilenamePattern, getFilesFailed, getNoOfFilesProcessed, maxExceptionsReached, postProcess, processOnlyFileNamed, processOnlyFilesMatching, processOnlyFilesMatching, processOnlyFilesNamed, setBatchJobTimeout
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

ArchiveBatchJob

public ArchiveBatchJob()
Method Detail

processRecord

public abstract void processRecord(ArchiveRecordBase record,
                                   java.io.OutputStream os)
Processes a single archive record. Exceptions should be handled with the handleException() method.

Parameters:
record - the archive record to be processed
os - the OutputStream to which output data is written
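
The advice to route exceptions through handleException() can be sketched as a catch-and-delegate pattern. The class below is a hypothetical stand-in, not the NetarchiveSuite implementation: the handleException(Exception) signature, the recorded list and the integer-parsing workload are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class ExceptionHandlingSketch {
    // Hypothetical stand-in for the job's list of recorded exceptions.
    final List<Exception> recorded = new ArrayList<>();
    int parsed = 0;

    // Hypothetical stand-in for handleException(); the real signature may differ.
    void handleException(Exception e) {
        recorded.add(e);
    }

    // Parses each record as an integer. A failure is handed to
    // handleException() instead of being rethrown, so later records still run.
    void processRecord(String record) {
        try {
            Integer.parseInt(record);
            parsed++;
        } catch (NumberFormatException e) {
            handleException(e);
        }
    }
}
```

The point of the pattern is that one bad record does not abort the whole batch; the failure is recorded and processing continues with the next record.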

getFilter

public ArchiveBatchFilter getFilter()
Returns an ArchiveBatchFilter object which restricts the set of records in the archive on which this batch job is performed. The default value is a neutral filter which allows all records.

Returns:
A filter telling which records should be given to processRecord().
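
The effect of a record filter can be sketched with a plain java.util.function.Predicate standing in for ArchiveBatchFilter. Everything here is hypothetical: FilterSketch, NO_FILTER and HTTP_ONLY are illustrative names, and records are reduced to their URL strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

class FilterSketch {
    // Neutral filter: accepts every record, mirroring the documented default.
    static final Predicate<String> NO_FILTER = url -> true;

    // Hypothetical restrictive filter: only records whose URL starts with "http".
    static final Predicate<String> HTTP_ONLY = url -> url.startsWith("http");

    // Returns the records that would be handed on to processRecord().
    static List<String> accepted(List<String> urls, Predicate<String> filter) {
        List<String> out = new ArrayList<>();
        for (String url : urls) {
            if (filter.test(url)) {
                out.add(url);
            }
        }
        return out;
    }
}
```

A subclass that overrides getFilter() would return its own ArchiveBatchFilter in place of the neutral default; records rejected by the filter never reach processRecord().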

processFile

public final boolean processFile(java.io.File archiveFile,
                                 java.io.OutputStream os)
                          throws ArgumentNotValid
Accepts only arc(.gz) and warc(.gz) files. Runs through all records and calls processRecord() on every record that is allowed by getFilter(). Does nothing on a non-(w)arc file.

Specified by:
processFile in class FileBatchJob
Parameters:
archiveFile - The arc(.gz) or warc(.gz) file to be processed.
os - the OutputStream to which output is to be written
Returns:
true if the file was processed successfully, otherwise false
Throws:
ArgumentNotValid - if either argument is null
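
The restriction to arc(.gz) and warc(.gz) files can be approximated by a check on the file name. This is a sketch under stated assumptions, not the check the class actually performs: ArchiveNameSketch and looksLikeArchive are hypothetical names, and the pattern matches lower-case extensions only, since the source does not say how case is treated.

```java
import java.io.File;
import java.util.regex.Pattern;

class ArchiveNameSketch {
    // Assumed pattern: any name ending in .arc, .warc, .arc.gz or .warc.gz.
    private static final Pattern ARCHIVE_NAME =
            Pattern.compile(".+\\.w?arc(\\.gz)?");

    // True when the file name has one of the four accepted endings.
    static boolean looksLikeArchive(File file) {
        return ARCHIVE_NAME.matcher(file.getName()).matches();
    }
}
```

For example, looksLikeArchive(new File("job-1.warc.gz")) is true, while a file named notes.txt is rejected and would be skipped without any record being processed.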