Class WaybackCDXExtractionWARCBatchJob

  • All Implemented Interfaces:
    Serializable

    public class WaybackCDXExtractionWARCBatchJob
    extends WARCBatchJob
    Produces a CDX file in the format expected by Wayback, including canonicalisation of URLs. The output is unsorted.
    See Also:
    Serialized Form
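Because the output is unsorted, it usually has to be sorted before it is used as a Wayback index, since Wayback-style CDX indexes are generally expected to be in sorted order. The following is a minimal sketch of that post-processing step, not part of this class. The class name CdxSortSketch and its methods are hypothetical. It assumes the whole file fits in memory and that Java's natural String ordering is acceptable (it matches byte order for ASCII-only lines); large indexes would need an external sort instead.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of NetarchiveSuite.
final class CdxSortSketch {

    /** Returns a sorted copy of the given CDX lines, leaving the input untouched. */
    static List<String> sortLines(List<String> unsorted) {
        List<String> sorted = new ArrayList<>(unsorted);
        Collections.sort(sorted); // natural String order
        return sorted;
    }

    /** Reads an unsorted CDX file fully into memory and writes a sorted copy. */
    static void sortFile(Path unsortedCdx, Path sortedCdx) throws IOException {
        List<String> lines = Files.readAllLines(unsortedCdx, StandardCharsets.UTF_8);
        Files.write(sortedCdx, sortLines(lines), StandardCharsets.UTF_8);
    }
}
```

A call such as CdxSortSketch.sortFile(in, out) would be run on the batch job's result file once the job has completed.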
    • Constructor Detail

      • WaybackCDXExtractionWARCBatchJob

        public WaybackCDXExtractionWARCBatchJob()
        Constructor which sets the timeout to one day.
      • WaybackCDXExtractionWARCBatchJob

        public WaybackCDXExtractionWARCBatchJob(long timeout)
        Alternate constructor, where a timeout can be set.
        Parameters:
        timeout - specific timeout period
    • Method Detail

      • getFilter

        public WARCBatchFilter getFilter()
        Returns the filter for this job; currently, only response records are processed.
        Overrides:
        getFilter in class WARCBatchJob
        Returns:
        A filter telling which records should be given to processRecord().
      • initialize

        public void initialize(OutputStream os)
        Initializes the private fields of this class. Some of these are relatively heavy objects, so it is important that they are only initialised once.
        Specified by:
        initialize in class WARCBatchJob
        Parameters:
        os - unused argument
      • finish

        public void finish(OutputStream os)
        Does nothing except log the end of the job.
        Specified by:
        finish in class WARCBatchJob
        Parameters:
        os - unused argument
      • processRecord

        public void processRecord(org.archive.io.warc.WARCRecord record,
                                  OutputStream os)
        For each response WARCRecord, writes one CDX line (including the trailing newline) to the output. If a WARCRecord cannot be converted to a CDX record for any reason, the resulting exception is caught and logged.
        Specified by:
        processRecord in class WARCBatchJob
        Parameters:
        record - the WARCRecord to be indexed.
        os - the OutputStream to which output is written.
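The behaviour described for processRecord (one line per record, with conversion failures caught and logged rather than propagated) can be illustrated with a self-contained sketch. This is not the actual implementation. The class ProcessRecordSketch, the generic record type R and the toCdxLine converter are hypothetical stand-ins for the WARCRecord and the Wayback conversion logic, and java.util.logging is used only to keep the example free of external dependencies.

```java
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.function.Function;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical illustration of the catch-and-log pattern; not NetarchiveSuite code.
final class ProcessRecordSketch {

    private static final Logger LOG =
            Logger.getLogger(ProcessRecordSketch.class.getName());

    /**
     * Converts one record to a CDX line and writes it, newline included.
     * Any failure is logged and the record is skipped, so one bad record
     * does not abort the rest of the job.
     */
    static <R> void processRecord(R record,
                                  Function<R, String> toCdxLine,
                                  OutputStream os) {
        try {
            String line = toCdxLine.apply(record);
            os.write((line + "\n").getBytes(StandardCharsets.UTF_8));
        } catch (Exception e) {
            // Covers both conversion errors and I/O errors from the stream.
            LOG.log(Level.WARNING, "Could not convert record to CDX: " + record, e);
        }
    }
}
```

With this pattern, a record whose conversion throws produces a log entry and no output line, while the records before and after it are still written.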