Class WARCBatchJob

  • All Implemented Interfaces:
    Serializable
    Direct Known Subclasses:
    WARCExtractCDXJob, WaybackCDXExtractionWARCBatchJob

    public abstract class WARCBatchJob
    extends FileBatchJob
    Abstract class defining a batch job to run on a set of WARC files. Each implementation is required to define initialize(), processRecord() and finish() methods. The bitarchive application then ensures that the batch job runs initialize(), runs processRecord() on each record in each file in the archive, and then runs finish().
    See Also:
    Serialized Form
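
The call order described above can be illustrated with a minimal, self-contained sketch. It does not use the real classes: SketchBatchJob, CountingJob and LifecycleSketch are hypothetical stand-ins, records are modelled as plain strings rather than WARCRecord objects, and the driver loop only approximates what the bitarchive application does.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.io.PrintStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for WARCBatchJob: the same three hooks, with String records.
abstract class SketchBatchJob {
    abstract void initialize(OutputStream os);
    abstract void processRecord(String record, OutputStream os);
    abstract void finish(OutputStream os);
}

// Example job: counts the records it sees and writes a one-line summary in finish().
class CountingJob extends SketchBatchJob {
    private int count;

    @Override
    void initialize(OutputStream os) {
        count = 0;
    }

    @Override
    void processRecord(String record, OutputStream os) {
        count++;
    }

    @Override
    void finish(OutputStream os) {
        PrintStream out = new PrintStream(os, true);
        out.print("records=" + count + "\n");
    }
}

class LifecycleSketch {
    // Approximates the driver: initialize once, processRecord per record per file, finish once.
    static String run(SketchBatchJob job, List<List<String>> files) {
        ByteArrayOutputStream os = new ByteArrayOutputStream();
        job.initialize(os);
        for (List<String> file : files) {
            for (String record : file) {
                job.processRecord(record, os);
            }
        }
        job.finish(os);
        return new String(os.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        List<List<String>> files = Arrays.asList(
                Arrays.asList("record-1", "record-2"),
                Arrays.asList("record-3"));
        System.out.print(run(new CountingJob(), files));
    }
}
```

In a real subclass the same shape applies, with the hooks taking the signatures listed under Method Detail.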
    • Field Detail

      • noOfRecordsProcessed

        protected int noOfRecordsProcessed
        The total number of records processed.
    • Constructor Detail

      • WARCBatchJob

        public WARCBatchJob()
    • Method Detail

      • initialize

        public abstract void initialize​(OutputStream os)
        Initialize the job before running. This is called before the first processRecord() call.
        Specified by:
        initialize in class FileBatchJob
        Parameters:
        os - The OutputStream to which output data is written
      • processRecord

        public abstract void processRecord​(org.archive.io.warc.WARCRecord record,
                                           OutputStream os)
        Process one record. Exceptions should be handled with the handleException() method.
        Parameters:
        record - The record to be processed
        os - The OutputStream to which output data is written
      • finish

        public abstract void finish​(OutputStream os)
        Finish up the job. This is called after the last processRecord() call.
        Specified by:
        finish in class FileBatchJob
        Parameters:
        os - The OutputStream to which output data is written
      • getFilter

        public WARCBatchFilter getFilter()
        Returns a WARCBatchFilter object which restricts the set of WARC records in the archive on which this batch job is performed. The default value is a neutral filter which allows all records.
        Returns:
        A filter telling which records should be given to processRecord().
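
How a filter narrows the records that reach processRecord() can be sketched as follows. This is an illustration only: SketchFilter and FilterSketch are hypothetical names, and records are reduced to a type string, whereas the real WARCBatchFilter operates on WARCRecord objects.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for WARCBatchFilter; decides per record whether it is processed.
interface SketchFilter {
    boolean accept(String recordType);

    // Neutral filter that lets every record through, like the documented default.
    SketchFilter ALLOW_ALL = recordType -> true;
}

class FilterSketch {
    // Returns the records that would be handed to processRecord() under the given filter.
    static List<String> select(List<String> recordTypes, SketchFilter filter) {
        List<String> accepted = new ArrayList<>();
        for (String type : recordTypes) {
            if (filter.accept(type)) {
                accepted.add(type);
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        List<String> types = Arrays.asList("warcinfo", "response", "request", "response");
        // Default behaviour: nothing is filtered out.
        System.out.println(select(types, SketchFilter.ALLOW_ALL));
        // A job overriding getFilter() might restrict itself to response records.
        System.out.println(select(types, t -> t.equals("response")));
    }
}
```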
      • processFile

        public final boolean processFile​(File warcFile,
                                         OutputStream os)
                                  throws ArgumentNotValid
        Accepts only WARC and WARCGZ files. Runs through all records and calls processRecord() on every record that is allowed by getFilter(). Does nothing on a non-WARC file.
        Specified by:
        processFile in class FileBatchJob
        Parameters:
        warcFile - The WARC or WARCGZ file to be processed.
        os - the OutputStream to which output is to be written
        Returns:
        true if the file was processed successfully, otherwise false
        Throws:
        ArgumentNotValid - if either argument is null
      • handleException

        public void handleException​(Exception e,
                                    File warcfile,
                                    long index)
                             throws ArgumentNotValid
        When the org.archive.io.warc classes throw IOExceptions while reading, this is where they go. Subclasses may override the default behaviour, which simply logs the exceptions and records them in a list. TODO: Actually use the warcfile/index entries in the exception list.
        Parameters:
        e - An Exception thrown by the org.archive.io.warc classes.
        warcfile - The WARC file that was being processed when the Exception was thrown
        index - The index (in the WARC file) at which the Exception was thrown
        Throws:
        ArgumentNotValid - if e is null
      • getExceptionArray

        public Exception[] getExceptionArray()
        Returns the list of Exceptions recorded for this WARC batch job, as an array. For this list to be complete, a subclass method overriding handleException() should always call super.handleException().
        Returns:
        All Exceptions passed to handleException so far.
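
The interaction between handleException() and getExceptionArray() can be sketched as below. The classes are hypothetical stand-ins, not the real implementation: SketchJobBase keeps only the list bookkeeping (logging is omitted), and IllegalArgumentException stands in for ArgumentNotValid.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the exception bookkeeping described for WARCBatchJob.
class SketchJobBase {
    private final List<Exception> exceptions = new ArrayList<>();

    // Default behaviour as described: record the exception in a list.
    public void handleException(Exception e, File warcfile, long index) {
        if (e == null) {
            // Stand-in for ArgumentNotValid, which is not available in this sketch.
            throw new IllegalArgumentException("e must not be null");
        }
        exceptions.add(e);
    }

    // Snapshot of everything passed to handleException() so far.
    public Exception[] getExceptionArray() {
        return exceptions.toArray(new Exception[0]);
    }
}

// Override that adds its own handling but still delegates, so the list stays complete.
class ReportingJob extends SketchJobBase {
    int reported = 0;

    @Override
    public void handleException(Exception e, File warcfile, long index) {
        super.handleException(e, warcfile, index);
        reported++;
    }
}

class ExceptionSketch {
    public static void main(String[] args) {
        ReportingJob job = new ReportingJob();
        job.handleException(new IOException("truncated record"), new File("example.warc"), 1024L);
        System.out.println(job.getExceptionArray().length + " recorded, " + job.reported + " reported");
    }
}
```

If ReportingJob skipped the super call, getExceptionArray() would return an empty array even though an exception had been handled.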
      • noOfRecordsProcessed

        public int noOfRecordsProcessed()
        Returns:
        the number of records processed.