Class WARCExtractCDXJob

  • All Implemented Interfaces:
    Serializable

    public class WARCExtractCDXJob
    extends WARCBatchJob
    Batch job that extracts information to create a CDX file.

    A CDX file contains sorted lines of metadata extracted from WARC files. Each line ends with the name of the file and the offset at which the record was found, optionally followed by a checksum. This job times out after 7 days. For the file format, see http://www.archive.org/web/researcher/cdx_file_format.php
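
    The line layout described above can be sketched as follows. This is a minimal illustration and not the job's actual implementation: the class CdxLineSketch, its methods, the space separator, and the sample values are all hypothetical, and the real metadata columns are defined by the CDX format, not by this sketch.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical stand-in; not part of the documented API.
public class CdxLineSketch {

    // Returns the MD5 digest of the payload as lowercase hex.
    static String md5Hex(byte[] payload) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(payload)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required on every Java platform, so this is unexpected.
            throw new IllegalStateException(e);
        }
    }

    // Appends file name and offset to the metadata, then the checksum
    // only when includeChecksum is true. The separator is an assumption.
    static String cdxLine(String metadata, String fileName, long offset,
                          byte[] payload, boolean includeChecksum) {
        StringBuilder line = new StringBuilder(metadata);
        line.append(' ').append(fileName).append(' ').append(offset);
        if (includeChecksum) {
            line.append(' ').append(md5Hex(payload));
        }
        return line.toString();
    }

    public static void main(String[] args) {
        byte[] payload = "abc".getBytes(StandardCharsets.UTF_8);
        // Sample values are made up for illustration.
        System.out.println(cdxLine("http://example.org/", "sample.warc", 1234L, payload, true));
        System.out.println(cdxLine("http://example.org/", "sample.warc", 1234L, payload, false));
    }
}
```

    With the flag set to false, the line simply ends at the offset, which mirrors the "optionally a checksum" wording above.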

    See Also:
    Serialized Form
    • Constructor Detail

      • WARCExtractCDXJob

        public WARCExtractCDXJob(boolean includeChecksum)
        Constructs a new job for extracting CDX indexes.
        Parameters:
        includeChecksum - If true, an MD5 checksum is written for each record; if false, no checksum is written.
      • WARCExtractCDXJob

        public WARCExtractCDXJob()
        Equivalent to WARCExtractCDXJob(true).