dk.netarkivet.common.utils.arc
Class ARCBatchJob

java.lang.Object
  extended by dk.netarkivet.common.utils.batch.FileBatchJob
      extended by dk.netarkivet.common.utils.arc.ARCBatchJob
All Implemented Interfaces:
java.io.Serializable
Direct Known Subclasses:
CrawlLogLinesMatchingRegexp, ExtractCDXJob, ExtractDeduplicateCDXBatchJob, ExtractWaybackCDXBatchJob, GetCDXRecordsBatchJob, HarvestedUrlsForDomainBatchJob

public abstract class ARCBatchJob
extends FileBatchJob

Abstract class defining a batch job to run on a set of ARC files. Each implementation is required to define the initialize(), processRecord() and finish() methods. The bitarchive application then ensures that the batch job runs initialize(), runs processRecord() on each record in each file in the archive, and then runs finish().
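
The call order described above can be illustrated with a minimal, self-contained sketch. It does not use the real NetarchiveSuite classes: LifecycleSketch, RecordJob, CountingJob and run() are hypothetical stand-ins, records are plain strings rather than ARCRecord objects, and each "file" is a list of such strings.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

/** Hypothetical, simplified stand-in for the ARCBatchJob lifecycle; not the real class. */
class LifecycleSketch {

    /** Stand-in for the three methods every implementation must define. */
    abstract static class RecordJob {
        abstract void initialize(OutputStream os);
        abstract void processRecord(String record, OutputStream os);
        abstract void finish(OutputStream os);
    }

    /** Example job: counts the records it sees and writes the total in finish(). */
    static class CountingJob extends RecordJob {
        private int count;

        @Override void initialize(OutputStream os) { count = 0; }

        @Override void processRecord(String record, OutputStream os) { count++; }

        @Override void finish(OutputStream os) {
            try {
                os.write(("records=" + count + "\n").getBytes(StandardCharsets.UTF_8));
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        }
    }

    /** Plays the role of the bitarchive application: initialize, every record of every file, finish. */
    static String run(RecordJob job, List<List<String>> files) {
        ByteArrayOutputStream os = new ByteArrayOutputStream();
        job.initialize(os);
        for (List<String> file : files) {
            for (String record : file) {
                job.processRecord(record, os);
            }
        }
        job.finish(os);
        return new String(os.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        List<List<String>> files = Arrays.asList(Arrays.asList("a", "b"), Arrays.asList("c"));
        System.out.print(run(new CountingJob(), files)); // prints "records=3"
    }
}
```

In the sketch the job accumulates state across processRecord() calls and only writes to the stream in finish(); whether a real job writes per record or only at the end is a design choice left to the subclass.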

See Also:
Serialized Form

Nested Class Summary
 
Nested classes/interfaces inherited from class dk.netarkivet.common.utils.batch.FileBatchJob
FileBatchJob.ExceptionOccurrence
 
Field Summary
protected  int noOfRecordsProcessed
          The total number of records processed.
 
Fields inherited from class dk.netarkivet.common.utils.batch.FileBatchJob
batchJobTimeout, exceptions, filesFailed, noOfFilesProcessed
 
Constructor Summary
ARCBatchJob()
           
 
Method Summary
abstract  void finish(java.io.OutputStream os)
          Finish up the job.
 java.lang.Exception[] getExceptionArray()
          Returns a representation of the list of Exceptions recorded for this ARC batch job.
 ARCBatchFilter getFilter()
          Returns an ARCBatchFilter object that restricts the set of ARC records in the archive on which this batch job is performed.
 void handleException(java.lang.Exception e, java.io.File arcfile, long index)
          When the org.archive.io.arc classes throw IOExceptions while reading, this is where they go.
abstract  void initialize(java.io.OutputStream os)
          Initialize the job before running.
 int noOfRecordsProcessed()
           
 boolean processFile(java.io.File arcFile, java.io.OutputStream os)
          Accepts only ARC and ARCGZ files.
abstract  void processRecord(org.archive.io.arc.ARCRecord record, java.io.OutputStream os)
          Exceptions should be handled with the handleException() method.
 
Methods inherited from class dk.netarkivet.common.utils.batch.FileBatchJob
addException, addFinishException, addInitializeException, getBatchJobTimeout, getExceptions, getFilenamePattern, getFilesFailed, getNoOfFilesProcessed, maxExceptionsReached, postProcess, processOnlyFileNamed, processOnlyFilesMatching, processOnlyFilesMatching, processOnlyFilesNamed, setBatchJobTimeout
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

noOfRecordsProcessed

protected int noOfRecordsProcessed
The total number of records processed.

Constructor Detail

ARCBatchJob

public ARCBatchJob()
Method Detail

initialize

public abstract void initialize(java.io.OutputStream os)
Initialize the job before running. This is called before the first processRecord() call.

Specified by:
initialize in class FileBatchJob
Parameters:
os - The OutputStream to which output data is written

processRecord

public abstract void processRecord(org.archive.io.arc.ARCRecord record,
                                   java.io.OutputStream os)
Processes a single ARC record. Exceptions should be handled with the handleException() method.

Parameters:
record - The ARC record to be processed.
os - The OutputStream to which output data is written

finish

public abstract void finish(java.io.OutputStream os)
Finish up the job. This is called after the last processRecord() call.

Specified by:
finish in class FileBatchJob
Parameters:
os - The OutputStream to which output data is written

getFilter

public ARCBatchFilter getFilter()
Returns an ARCBatchFilter object that restricts the set of ARC records in the archive on which this batch job is performed. The default value is a neutral filter which allows all records.

Returns:
A filter telling which records should be given to processRecord().

processFile

public final boolean processFile(java.io.File arcFile,
                                 java.io.OutputStream os)
                          throws ArgumentNotValid
Accepts only ARC and ARCGZ files. Runs through all records and calls processRecord() on every record that is allowed by getFilter(). Does nothing for a non-ARC file.

Specified by:
processFile in class FileBatchJob
Parameters:
arcFile - The ARC or ARCGZ file to be processed.
os - the OutputStream to which output is to be written
Returns:
true if the file was processed successfully, otherwise false
Throws:
ArgumentNotValid - if either argument is null
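
The interplay between getFilter() and processFile() can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: FilterSketch, Job and HttpOnlyJob are hypothetical names, a java.util.function.Predicate over record URLs stands in for ARCBatchFilter, and the file-type check and error handling of the real method are left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical stand-in showing filter-gated record dispatch; not the real ARCBatchJob. */
class FilterSketch {

    /** Stand-in job whose getFilter() defaults to a neutral filter that allows every record. */
    static class Job {
        final List<String> seen = new ArrayList<>();

        Predicate<String> getFilter() {
            return record -> true; // neutral default: nothing is excluded
        }

        void processRecord(String record) {
            seen.add(record);
        }

        /** Walks all records and hands only the accepted ones to processRecord(). */
        final boolean processFile(List<String> records) {
            Predicate<String> filter = getFilter();
            for (String record : records) {
                if (filter.test(record)) {
                    processRecord(record);
                }
            }
            return true; // assumption: the sketch has no failure path
        }
    }

    /** A job that restricts processing to records whose URL starts with "http:". */
    static class HttpOnlyJob extends Job {
        @Override
        Predicate<String> getFilter() {
            return record -> record.startsWith("http:");
        }
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList("filedesc://x.arc", "http://a/", "dns:a", "http://b/");
        Job all = new Job();
        all.processFile(records);
        Job httpOnly = new HttpOnlyJob();
        httpOnly.processFile(records);
        System.out.println(all.seen.size() + " " + httpOnly.seen.size()); // prints "4 2"
    }
}
```

Because processFile() is final, overriding getFilter() is the way a subclass narrows the set of records it receives; the iteration itself stays in the base class.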

handleException

public void handleException(java.lang.Exception e,
                            java.io.File arcfile,
                            long index)
                     throws ArgumentNotValid
When the org.archive.io.arc classes throw IOExceptions while reading, this is where they go. Subclasses are welcome to override the default behaviour, which simply logs the exceptions and records them in a list. TODO: Actually use the arcfile/index entries in the exception list.

Parameters:
e - An Exception thrown by the org.archive.io.arc classes.
arcfile - The ARC file that was being processed when the Exception was thrown
index - The index (in the ARC file) at which the Exception was thrown
Throws:
ArgumentNotValid - if e is null

getExceptionArray

public java.lang.Exception[] getExceptionArray()
Returns a representation of the list of Exceptions recorded for this ARC batch job. For this list to be complete in a subclass, any method overriding handleException() should always call super.handleException().

Returns:
All Exceptions passed to handleException so far.
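
A minimal sketch of why an overriding handleException() should delegate to the superclass. The names ExceptionSketch, BaseJob and VerboseJob are hypothetical; the base class here only records exceptions in a list (the real default also logs them), and IllegalArgumentException stands in for ArgumentNotValid.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for exception recording; not the real ARCBatchJob. */
class ExceptionSketch {

    /** Stand-in base class: records every exception passed to handleException(). */
    static class BaseJob {
        private final List<Exception> recorded = new ArrayList<>();

        void handleException(Exception e, File arcfile, long index) {
            if (e == null) {
                // Stand-in for the ArgumentNotValid thrown by the real class.
                throw new IllegalArgumentException("e must not be null");
            }
            recorded.add(e);
        }

        Exception[] getExceptionArray() {
            return recorded.toArray(new Exception[0]);
        }
    }

    /** Adds its own bookkeeping but still delegates, so the recorded list stays complete. */
    static class VerboseJob extends BaseJob {
        final List<String> locations = new ArrayList<>();

        @Override
        void handleException(Exception e, File arcfile, long index) {
            super.handleException(e, arcfile, index); // keeps getExceptionArray() accurate
            locations.add(arcfile.getName() + "@" + index);
        }
    }

    public static void main(String[] args) {
        VerboseJob job = new VerboseJob();
        job.handleException(new IOException("truncated record"), new File("sample.arc"), 42L);
        System.out.println(job.getExceptionArray().length + " " + job.locations.get(0));
        // prints "1 sample.arc@42"
    }
}
```

If VerboseJob omitted the super call, getExceptionArray() in this sketch would return an empty array even though an exception had been handled.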

noOfRecordsProcessed

public int noOfRecordsProcessed()
Returns:
the number of records processed.