Class WaybackCDXExtractionARCBatchJob

  • All Implemented Interfaces:
    Serializable

    public class WaybackCDXExtractionARCBatchJob
    extends ARCBatchJob
    Batch job that produces a CDX file in the format expected by Wayback, including canonicalisation of URLs. The output is unsorted.
    See Also:
    Serialized Form
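Because the output is unsorted, it usually has to be sorted before it can serve as a lookup index. The following is a minimal sketch of that step, not part of this class: it assumes the job output has already been collected into a list of lines and that plain byte-order sorting (as with `LC_ALL=C sort`) is what the consumer expects. The class name `CdxSortSketch` and the sample lines are hypothetical placeholders, not real CDX records.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of NetarchiveSuite.
final class CdxSortSketch {

    /** Returns a new list with the lines ordered by their unsigned UTF-8 bytes. */
    static List<String> sortLines(List<String> lines) {
        List<String> sorted = new ArrayList<>(lines);
        // Compare the encoded bytes rather than the Strings themselves, so the
        // order matches a byte-wise sort even for non-ASCII characters.
        sorted.sort((a, b) -> Arrays.compareUnsigned(
                a.getBytes(StandardCharsets.UTF_8),
                b.getBytes(StandardCharsets.UTF_8)));
        return sorted;
    }

    public static void main(String[] args) {
        // Placeholder lines: "url timestamp" only, standing in for full CDX lines.
        List<String> unsorted = List.of(
                "example.org/b 20100101000000",
                "example.org/a 20100102000000",
                "example.org/a 20100101000000");
        for (String line : sortLines(unsorted)) {
            System.out.println(line);
        }
    }
}
```

For large outputs an external sort on the file itself is likely a better fit than sorting in memory.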
    • Constructor Detail

      • WaybackCDXExtractionARCBatchJob

        public WaybackCDXExtractionARCBatchJob()
        Constructor which sets the timeout to one day.
      • WaybackCDXExtractionARCBatchJob

        public WaybackCDXExtractionARCBatchJob(long timeout)
        Constructor which sets the timeout to the given value.
        Parameters:
        timeout - specific timeout period
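As a sketch of how a timeout value might be computed: the documentation above does not state the unit of `timeout`, so the example assumes milliseconds. The class name `TimeoutSketch` is hypothetical, and the constructor call is shown only as a comment because it needs NetarchiveSuite on the classpath.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of NetarchiveSuite.
final class TimeoutSketch {

    /** Converts a number of days to milliseconds (assumed unit of the timeout). */
    static long daysToMillis(long days) {
        return TimeUnit.DAYS.toMillis(days);
    }

    public static void main(String[] args) {
        long twoDays = daysToMillis(2);
        // With NetarchiveSuite available, the value could be passed on like this:
        // new WaybackCDXExtractionARCBatchJob(twoDays);
        System.out.println(twoDays);
    }
}
```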
    • Method Detail

      • initialize

        public void initialize(OutputStream os)
        Initializes the private fields of this class. Some of these are relatively heavy objects, so it is important that they are initialized only once.
        Specified by:
        initialize in class ARCBatchJob
        Parameters:
        os - unused argument
      • finish

        public void finish(OutputStream os)
        Does nothing except log the end of the job.
        Specified by:
        finish in class ARCBatchJob
        Parameters:
        os - unused argument.
      • getFilter

        public ARCBatchFilter getFilter()
        Description copied from class: ARCBatchJob
        Returns a BatchFilter object which restricts the set of ARC records in the archive on which this batch job is performed. The default value is a neutral filter which allows all records.
        Overrides:
        getFilter in class ARCBatchJob
        Returns:
        A filter telling which records should be given to processRecord().
      • processRecord

        public void processRecord(org.archive.io.arc.ARCRecord record,
                                  OutputStream os)
        Writes one CDX line (including the trailing newline) to the output for each ARCRecord. If an ARC record cannot be converted to a CDX record for any reason, the resulting exception is caught and logged.
        Specified by:
        processRecord in class ARCBatchJob
        Parameters:
        record - the ARCRecord to be indexed.
        os - the OutputStream to which output is written.
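The write-one-line, catch-and-log behaviour described for processRecord can be illustrated with a self-contained sketch. This is not the class's actual implementation: the record type is generic, the converter function `toCdxLine` stands in for the real ARC-to-CDX conversion, and the class name `RecordLineWriterSketch` is hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.function.Function;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical illustration, not part of NetarchiveSuite.
final class RecordLineWriterSketch {
    private static final Logger LOG =
            Logger.getLogger(RecordLineWriterSketch.class.getName());

    /**
     * Converts one record to a line and writes it, newline included.
     * Any exception is logged and swallowed, so one bad record does not
     * stop the records that follow from being processed.
     */
    static <R> void processRecord(R record, Function<R, String> toCdxLine,
                                  OutputStream os) {
        try {
            String line = toCdxLine.apply(record);
            os.write((line + "\n").getBytes(StandardCharsets.UTF_8));
        } catch (Exception e) {
            LOG.log(Level.WARNING, "Could not index record: " + record, e);
        }
    }

    public static void main(String[] args) {
        ByteArrayOutputStream os = new ByteArrayOutputStream();
        // Stand-in converter: fails on empty input to simulate a bad record.
        Function<String, String> converter = r -> {
            if (r.isEmpty()) {
                throw new IllegalArgumentException("empty record");
            }
            return "cdx:" + r;
        };
        processRecord("first", converter, os);
        processRecord("", converter, os);   // logged, nothing written
        processRecord("second", converter, os);
        System.out.print(new String(os.toByteArray(), StandardCharsets.UTF_8));
    }
}
```

Catching the exception per record keeps a single malformed record from aborting a batch run over a whole archive file, at the cost of silently missing index entries unless the log is checked.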