Class ARCBatchJob

  • All Implemented Interfaces:
    Serializable
    Direct Known Subclasses:
    ExtractCDXJob, GetCDXRecordsBatchJob, WaybackCDXExtractionARCBatchJob

    public abstract class ARCBatchJob
    extends FileBatchJob
    Abstract class defining a batch job to run on a set of ARC files. Each implementation is required to define the initialize(), processRecord() and finish() methods. The bitarchive application then ensures that the batch job runs initialize(), runs processRecord() on each record in each file in the archive, and then runs finish().
    See Also:
    Serialized Form
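    The call order described above can be illustrated with a minimal, self-contained sketch. It does not use the real NetarchiveSuite or Heritrix classes: DemoRecord, DemoJob and the run() driver are hypothetical stand-ins, and the point at which the record counter is incremented is an assumption.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

final class LifecycleSketch {
    // Hypothetical stand-in for org.archive.io.arc.ARCRecord; only a URL is modelled.
    static final class DemoRecord {
        final String url;
        DemoRecord(String url) { this.url = url; }
    }

    // Hypothetical stand-in mirroring the three abstract methods of ARCBatchJob.
    abstract static class DemoJob {
        protected int noOfRecordsProcessed = 0;

        abstract void initialize(OutputStream os);
        abstract void processRecord(DemoRecord record, OutputStream os);
        abstract void finish(OutputStream os);

        // Assumed driver; in the real system the bitarchive application does this work.
        final void run(List<DemoRecord> records, OutputStream os) {
            initialize(os);
            for (DemoRecord r : records) {
                processRecord(r, os);
                noOfRecordsProcessed++; // assumption: counted once per processed record
            }
            finish(os);
        }
    }

    // Example job: one line per record, framed by a header and a record count.
    static final class UrlListJob extends DemoJob {
        private static void write(OutputStream os, String s) {
            try {
                os.write(s.getBytes(StandardCharsets.UTF_8));
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        @Override void initialize(OutputStream os) { write(os, "BEGIN\n"); }
        @Override void processRecord(DemoRecord record, OutputStream os) { write(os, record.url + "\n"); }
        @Override void finish(OutputStream os) { write(os, "END " + noOfRecordsProcessed + "\n"); }
    }

    // Runs the example job over two records and returns everything it wrote.
    static String runDemo() {
        ByteArrayOutputStream os = new ByteArrayOutputStream();
        List<DemoRecord> records = Arrays.asList(
                new DemoRecord("http://example.org/a"),
                new DemoRecord("http://example.org/b"));
        new UrlListJob().run(records, os);
        return new String(os.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.print(runDemo());
    }
}
```

    A real subclass would extend ARCBatchJob and receive ARCRecord objects; the sketch only shows that all output flows through the single OutputStream handed to each of the three methods.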
    • Field Detail

      • noOfRecordsProcessed

        protected int noOfRecordsProcessed
        The total number of records processed.
    • Constructor Detail

      • ARCBatchJob

        public ARCBatchJob()
    • Method Detail

      • initialize

        public abstract void initialize(OutputStream os)
        Initialize the job before running. This is called before the first processRecord() call.
        Specified by:
        initialize in class FileBatchJob
        Parameters:
        os - The OutputStream to which output data is written
      • processRecord

        public abstract void processRecord(org.archive.io.arc.ARCRecord record,
                                           OutputStream os)
        Process a single ARC record. Exceptions should be handled with the handleException() method.
        Parameters:
        record - The ARCRecord to be processed
        os - The OutputStream to which output data is written
      • finish

        public abstract void finish(OutputStream os)
        Finish up the job. This is called after the last processRecord() call.
        Specified by:
        finish in class FileBatchJob
        Parameters:
        os - The OutputStream to which output data is written
      • getFilter

        public ARCBatchFilter getFilter()
        Returns an ARCBatchFilter object that restricts the set of ARC records in the archive on which this batch job is performed. The default value is a neutral filter that allows all records.
        Returns:
        A filter telling which records should be given to processRecord().
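        As a rough illustration of how such a filter gates records, the sketch below uses hypothetical stand-ins (DemoRecord, DemoFilter) rather than the real ARCBatchFilter and ARCRecord types. Skipping records whose URL starts with "filedesc:" is just an example restriction chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class FilterSketch {
    // Hypothetical stand-in for an ARC record; only the URL is modelled.
    static final class DemoRecord {
        final String url;
        DemoRecord(String url) { this.url = url; }
    }

    // Hypothetical stand-in for ARCBatchFilter: decides whether a record is processed.
    interface DemoFilter {
        boolean accept(DemoRecord record);
    }

    // Neutral filter: lets every record through, like the default described above.
    static final DemoFilter ALLOW_ALL = record -> true;

    // Example restriction: skip the "filedesc:" header record of an ARC file.
    static final DemoFilter SKIP_FILE_HEADERS = record -> !record.url.startsWith("filedesc:");

    // Returns the URLs of the records that the filter would hand to processRecord().
    static List<String> accepted(List<DemoRecord> records, DemoFilter filter) {
        List<String> urls = new ArrayList<>();
        for (DemoRecord r : records) {
            if (filter.accept(r)) {
                urls.add(r.url);
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        List<DemoRecord> records = Arrays.asList(
                new DemoRecord("filedesc://example.arc"),
                new DemoRecord("http://example.org/"));
        System.out.println(accepted(records, ALLOW_ALL));
        System.out.println(accepted(records, SKIP_FILE_HEADERS));
    }
}
```

        In a real job, the equivalent step would be overriding getFilter() to return a more restrictive ARCBatchFilter.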
      • processFile

        public final boolean processFile(File arcFile,
                                         OutputStream os)
                                  throws ArgumentNotValid
        Accepts only ARC and ARCGZ files. Runs through all records and calls processRecord() on every record that is allowed by getFilter(). Does nothing on a non-ARC file.
        Specified by:
        processFile in class FileBatchJob
        Parameters:
        arcFile - The ARC or ARCGZ file to be processed.
        os - the OutputStream to which output is to be written
        Returns:
        true if the file was processed successfully, otherwise false
        Throws:
        ArgumentNotValid - if either argument is null
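        How the method recognizes ARC and ARCGZ files is not stated here. The sketch below shows one plausible file-name check, assuming that ARC files end in ".arc" and ARCGZ files in ".arc.gz". The helper looksLikeArc is hypothetical and the real check may differ.

```java
import java.util.Locale;

final class ArcNameSketch {
    // Assumption: ".arc" marks an ARC file and ".arc.gz" an ARCGZ file.
    static boolean looksLikeArc(String fileName) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        return lower.endsWith(".arc") || lower.endsWith(".arc.gz");
    }

    public static void main(String[] args) {
        System.out.println(looksLikeArc("example.arc.gz"));
        System.out.println(looksLikeArc("notes.txt"));
    }
}
```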
      • handleException

        public void handleException(Exception e,
                                    File arcfile,
                                    long index)
                             throws ArgumentNotValid
        Receives the IOExceptions thrown by the org.archive.io.arc classes while reading. Subclasses may override the default behavior, which simply logs the exceptions and records them in a list. TODO: Actually use the arcfile/index entries in the exception list.
        Parameters:
        e - An Exception thrown by the org.archive.io.arc classes.
        arcfile - The ARC file being processed when the Exception was thrown
        index - The index (in the ARC file) at which the Exception was thrown
        Throws:
        ArgumentNotValid - if e is null
      • getExceptionArray

        public Exception[] getExceptionArray()
        Returns the Exceptions recorded for this ARC batch job as an array. For this method to report every exception, a subclass that overrides handleException() should always call super.handleException().
        Returns:
        All Exceptions passed to handleException so far.
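        The interplay between handleException() and getExceptionArray() can be sketched as follows. BaseJob is a hypothetical stand-in for the bookkeeping in ARCBatchJob: logging is omitted and IllegalArgumentException stands in for ArgumentNotValid. CountingJob shows an override that adds its own handling and still delegates to the superclass.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

final class ExceptionSketch {
    // Hypothetical stand-in for the exception bookkeeping described above.
    static class BaseJob {
        private final List<Exception> exceptions = new ArrayList<>();

        // Default behavior: record the exception in a list (logging omitted here).
        public void handleException(Exception e, File arcfile, long index) {
            if (e == null) {
                // Stands in for ArgumentNotValid in this sketch.
                throw new IllegalArgumentException("e must not be null");
            }
            exceptions.add(e);
        }

        // Returns all exceptions passed to handleException() so far.
        public Exception[] getExceptionArray() {
            return exceptions.toArray(new Exception[0]);
        }
    }

    // Override that counts IOExceptions but keeps the default recording intact.
    static final class CountingJob extends BaseJob {
        int ioFailures = 0;

        @Override
        public void handleException(Exception e, File arcfile, long index) {
            if (e instanceof IOException) {
                ioFailures++;
            }
            // Without this call, getExceptionArray() would miss the exception.
            super.handleException(e, arcfile, index);
        }
    }

    public static void main(String[] args) {
        CountingJob job = new CountingJob();
        job.handleException(new IOException("truncated record"), new File("example.arc"), 128L);
        System.out.println(job.getExceptionArray().length + " recorded, " + job.ioFailures + " I/O failures");
    }
}
```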
      • noOfRecordsProcessed

        public int noOfRecordsProcessed()
        Returns:
        the number of records processed.