dk.netarkivet.common.utils.warc
Class WARCBatchJob

java.lang.Object
  extended by dk.netarkivet.common.utils.batch.FileBatchJob
      extended by dk.netarkivet.common.utils.warc.WARCBatchJob
All Implemented Interfaces:
java.io.Serializable
Direct Known Subclasses:
WARCExtractCDXJob, WaybackCDXExtractionWARCBatchJob

public abstract class WARCBatchJob
extends FileBatchJob

Abstract class defining a batch job to run on a set of WARC files. Each implementation is required to define initialize(), processRecord() and finish() methods. The bitarchive application then ensures that the batch job runs initialize(), runs processRecord() on each record in each file in the archive, and then runs finish().
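The call order described above can be sketched with a small self-contained stand-in. This is not the NetarchiveSuite class itself: SketchJob, CountingJob and runOn() are hypothetical names, records are plain strings rather than WARCRecord objects, and runOn() only plays the role of the bitarchive application.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.List;

class LifecycleSketch {
    /** Hypothetical stand-in for WARCBatchJob; a record is just a String here. */
    abstract static class SketchJob {
        protected int noOfRecordsProcessed = 0;

        abstract void initialize(OutputStream os);
        abstract void processRecord(String record, OutputStream os);
        abstract void finish(OutputStream os);

        /** Plays the role of the bitarchive application (assumed driver, not a real API). */
        final void runOn(List<List<String>> files, OutputStream os) {
            initialize(os);                       // once, before any record
            for (List<String> file : files) {
                for (String record : file) {
                    processRecord(record, os);    // once per record in each file
                    noOfRecordsProcessed++;
                }
            }
            finish(os);                           // once, after the last record
        }
    }

    /** Example job: counts records and writes the total when the job finishes. */
    static class CountingJob extends SketchJob {
        private int count;

        @Override void initialize(OutputStream os) { count = 0; }

        @Override void processRecord(String record, OutputStream os) { count++; }

        @Override void finish(OutputStream os) {
            try {
                os.write(("records=" + count + "\n").getBytes(StandardCharsets.UTF_8));
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }

    /** Runs the counting job over in-memory "files" and returns what it wrote. */
    static String run(List<List<String>> files) {
        ByteArrayOutputStream os = new ByteArrayOutputStream();
        new CountingJob().runOn(files, os);
        return new String(os.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.print(run(List.of(List.of("r1", "r2"), List.of("r3"))));
    }
}
```

A real subclass would extend WARCBatchJob and receive org.archive.io.warc.WARCRecord objects instead of strings; the driver loop is shown only to make the ordering explicit.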

See Also:
Serialized Form

Nested Class Summary
 
Nested classes/interfaces inherited from class dk.netarkivet.common.utils.batch.FileBatchJob
FileBatchJob.ExceptionOccurrence
 
Field Summary
protected  int noOfRecordsProcessed
          The total number of records processed.
 
Fields inherited from class dk.netarkivet.common.utils.batch.FileBatchJob
batchJobTimeout, exceptions, filesFailed, noOfFilesProcessed
 
Constructor Summary
WARCBatchJob()
           
 
Method Summary
abstract  void finish(java.io.OutputStream os)
          Finish up the job.
 java.lang.Exception[] getExceptionArray()
          Returns a representation of the list of Exceptions recorded for this WARC batch job.
 WARCBatchFilter getFilter()
          Returns a BatchFilter object that restricts the set of WARC records in the archive on which this batch job is performed.
 void handleException(java.lang.Exception e, java.io.File warcfile, long index)
          When the org.archive.io.warc classes throw IOExceptions while reading, this is where they go.
abstract  void initialize(java.io.OutputStream os)
          Initialize the job before running.
 int noOfRecordsProcessed()
          Returns the number of records processed.
 boolean processFile(java.io.File warcFile, java.io.OutputStream os)
          Accepts only WARC and WARCGZ files.
abstract  void processRecord(org.archive.io.warc.WARCRecord record, java.io.OutputStream os)
          Exceptions should be handled with the handleException() method.
 
Methods inherited from class dk.netarkivet.common.utils.batch.FileBatchJob
addException, addFinishException, addInitializeException, getBatchJobTimeout, getExceptions, getFilenamePattern, getFilesFailed, getNoOfFilesProcessed, maxExceptionsReached, postProcess, processOnlyFileNamed, processOnlyFilesMatching, processOnlyFilesMatching, processOnlyFilesNamed, setBatchJobTimeout
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

noOfRecordsProcessed

protected int noOfRecordsProcessed
The total number of records processed.

Constructor Detail

WARCBatchJob

public WARCBatchJob()
Method Detail

initialize

public abstract void initialize(java.io.OutputStream os)
Initialize the job before running. This is called before the processRecord() calls start coming.

Specified by:
initialize in class FileBatchJob
Parameters:
os - The OutputStream to which output data is written

processRecord

public abstract void processRecord(org.archive.io.warc.WARCRecord record,
                                   java.io.OutputStream os)
Exceptions should be handled with the handleException() method.

Parameters:
record - The WARC record to be processed.
os - The OutputStream to which output data is written

finish

public abstract void finish(java.io.OutputStream os)
Finish up the job. This is called after the last processRecord() call.

Specified by:
finish in class FileBatchJob
Parameters:
os - The OutputStream to which output data is written

getFilter

public WARCBatchFilter getFilter()
Returns a BatchFilter object that restricts the set of WARC records in the archive on which this batch job is performed. The default value is a neutral filter that allows all records.

Returns:
A filter telling which records should be given to processRecord().
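
How a job narrows its input by overriding getFilter() can be illustrated with a self-contained stand-in. Everything here is an assumption made for illustration: RecordFilter, ALLOW_ALL, Job and ResponseOnlyJob are hypothetical names, records are strings with a made-up "type:payload" shape, and output goes to a List rather than an OutputStream. It does not reproduce the real WARCBatchFilter API.

```java
import java.util.ArrayList;
import java.util.List;

class FilterSketch {
    /** Hypothetical stand-in for a batch filter. */
    interface RecordFilter {
        boolean accept(String record);
    }

    /** Neutral filter: lets every record through, like the documented default. */
    static final RecordFilter ALLOW_ALL = record -> true;

    /** Hypothetical stand-in for the batch job base class. */
    abstract static class Job {
        RecordFilter getFilter() {
            return ALLOW_ALL;
        }

        abstract void processRecord(String record, List<String> out);

        /** Passes only the records accepted by getFilter() to processRecord(). */
        final void processFile(List<String> records, List<String> out) {
            RecordFilter filter = getFilter();
            for (String record : records) {
                if (filter.accept(record)) {
                    processRecord(record, out);
                }
            }
        }
    }

    /** Example job that only wants records whose (made-up) type prefix is "response". */
    static class ResponseOnlyJob extends Job {
        @Override
        RecordFilter getFilter() {
            return record -> record.startsWith("response:");
        }

        @Override
        void processRecord(String record, List<String> out) {
            out.add(record);
        }
    }

    /** Runs the example job over one in-memory "file" and returns what it collected. */
    static List<String> run(List<String> records) {
        List<String> out = new ArrayList<>();
        new ResponseOnlyJob().processFile(records, out);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("warcinfo:a", "response:b", "request:c", "response:d")));
    }
}
```

Keeping the selection logic in the filter rather than inside processRecord() means the processing code never sees records it does not care about.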

processFile

public final boolean processFile(java.io.File warcFile,
                                 java.io.OutputStream os)
                          throws ArgumentNotValid
Accepts only WARC and WARCGZ files. Runs through all records and calls processRecord() on every record that is allowed by getFilter(). Does nothing on a non-WARC file.

Specified by:
processFile in class FileBatchJob
Parameters:
warcFile - The WARC or WARCGZ file to be processed.
os - the OutputStream to which output is to be written
Returns:
true if the file was processed successfully, otherwise false
Throws:
ArgumentNotValid - if either argument is null

handleException

public void handleException(java.lang.Exception e,
                            java.io.File warcfile,
                            long index)
                     throws ArgumentNotValid
When the org.archive.io.warc classes throw IOExceptions while reading, this is where they go. Subclasses are welcome to override the default functionality, which simply logs the exceptions and records them in a list. TODO: Actually use the warcfile/index entries in the exception list.

Parameters:
e - An Exception thrown by the org.archive.io.warc classes.
warcfile - The WARC file that was being processed when the Exception was thrown
index - The index (in the WARC file) at which the Exception was thrown
Throws:
ArgumentNotValid - if e is null

getExceptionArray

public java.lang.Exception[] getExceptionArray()
Returns a representation of the list of Exceptions recorded for this WARC batch job. If a subclass relies on this method, any method overriding handleException() should always call super.handleException(), so that the exceptions are still recorded.

Returns:
All Exceptions passed to handleException so far.
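
The interplay between an overridden handleException() and getExceptionArray() can be sketched as follows. This is a self-contained stand-in, not the NetarchiveSuite implementation: Job and IoCountingJob are hypothetical, the stand-in base class only records exceptions (the real one also logs them), and it throws IllegalArgumentException where the real class throws ArgumentNotValid.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

class ExceptionSketch {
    /** Hypothetical stand-in for the base class: records exceptions in a list. */
    static class Job {
        private final List<Exception> exceptions = new ArrayList<>();

        void handleException(Exception e, File warcfile, long index) {
            if (e == null) {
                // The real class throws ArgumentNotValid; a JDK exception is used here.
                throw new IllegalArgumentException("e must not be null");
            }
            exceptions.add(e);
        }

        Exception[] getExceptionArray() {
            return exceptions.toArray(new Exception[0]);
        }
    }

    /** Adds its own bookkeeping, then delegates so the base list stays complete. */
    static class IoCountingJob extends Job {
        int ioFailures = 0;

        @Override
        void handleException(Exception e, File warcfile, long index) {
            if (e instanceof IOException) {
                ioFailures++;
            }
            super.handleException(e, warcfile, index);
        }
    }

    /** Returns { number of recorded exceptions, number of IOExceptions seen }. */
    static int[] run() {
        IoCountingJob job = new IoCountingJob();
        File f = new File("example.warc"); // hypothetical file name, never opened
        job.handleException(new IOException("truncated record"), f, 128L);
        job.handleException(new IllegalStateException("bad header"), f, 4096L);
        return new int[] { job.getExceptionArray().length, job.ioFailures };
    }

    public static void main(String[] args) {
        int[] r = run();
        System.out.println("recorded=" + r[0] + " ioFailures=" + r[1]);
    }
}
```

If the override skipped the super call, ioFailures would still be updated but getExceptionArray() would come back empty, which is the situation the note above warns about.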

noOfRecordsProcessed

public int noOfRecordsProcessed()
Returns:
the number of records processed.