Class DeduplicationCDXExtractionBatchJob

  • All Implemented Interfaces:
    Serializable

    public class DeduplicationCDXExtractionBatchJob
    extends ArchiveBatchJob
    This batch job takes deduplication records from a crawl log in a metadata arcfile and converts them to CDX records for use in Wayback.
    See Also:
    Serialized Form
    • Constructor Detail

      • DeduplicationCDXExtractionBatchJob

        public DeduplicationCDXExtractionBatchJob()
    • Method Detail

      • processRecord

        public void processRecord(ArchiveRecordBase record,
                                  OutputStream os)
        If the record is a crawl-log entry, each duplicate entry in the crawl log is converted to a CDX entry and written to the output; otherwise this method returns without doing anything. If the record is a WarcRecord and it is the warcinfo record, it is skipped.
        Specified by:
        processRecord in class ArchiveBatchJob
        Parameters:
        record - The ArchiveRecord to be processed
        os - the stream to which output is written
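
The conversion described for processRecord can be illustrated with a minimal, self-contained sketch. It is not the NetarchiveSuite implementation. The class name DuplicateLineToCdxSketch, the method toCdxLine, the whitespace-separated crawl-log field order, the duplicate:"<file>,<offset>" annotation shape, and the simplified output field order are all assumptions made for illustration. URL canonicalisation is omitted.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch only; names and formats are assumptions, not the library's API. */
public class DuplicateLineToCdxSketch {

    // Assumed annotation shape: duplicate:"<archive file>,<offset>"
    private static final Pattern DUPLICATE =
            Pattern.compile("duplicate:\"([^,\"]+),(\\d+)\"");

    /**
     * Converts one crawl-log line to a simplified CDX-like line.
     * Returns empty if the line has no duplicate annotation or is malformed.
     */
    public static Optional<String> toCdxLine(String crawlLogLine) {
        Matcher m = DUPLICATE.matcher(crawlLogLine);
        if (!m.find()) {
            return Optional.empty(); // not a deduplicated entry: nothing to emit
        }
        // Assumed field order: 0 timestamp, 1 status, 2 size, 3 URL,
        // 4 discovery path, 5 referrer, 6 MIME type, 7 thread,
        // 8 fetch time, 9 digest, then source tag and annotations.
        String[] f = crawlLogLine.trim().split("\\s+");
        if (f.length < 10) {
            return Optional.empty();
        }
        // Reduce an ISO-style timestamp to the 14-digit yyyyMMddHHmmss form.
        String digits = f[0].replaceAll("\\D", "");
        if (digits.length() < 14) {
            return Optional.empty();
        }
        String timestamp = digits.substring(0, 14);
        String url = f[3];
        // Simplified field order: url, date, original url, MIME type,
        // status, digest, redirect ("-"), offset, file name.
        String cdx = String.join(" ",
                url, timestamp, url, f[6], f[1], f[9], "-", m.group(2), m.group(1));
        return Optional.of(cdx);
    }

    public static void main(String[] args) {
        String line = "2010-01-01T12:00:00.000Z 200 1234 http://example.org/ L "
                + "http://example.org/ref text/html #001 20100101120000000+12 "
                + "sha1:ABC - duplicate:\"example-1.arc,5678\"";
        // Prints the converted line if the entry is a duplicate.
        toCdxLine(line).ifPresent(System.out::println);
    }
}
```

Returning an Optional mirrors the documented behaviour that non-matching records produce no output: a caller writes the line to the output stream only when a value is present.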