Class WaybackCDXExtractionWARCBatchJob

    • Method Detail

      • getFilter

        public WARCBatchFilter getFilter()
        Returns the filter that restricts processing to response records only.
        Overrides:
        getFilter in class WARCBatchJob
        Returns:
        A filter telling which records should be given to processRecord().
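        The idea of a response-only filter can be sketched as follows. This is a minimal illustration, not the NetarchiveSuite implementation: the class name ResponseOnlyFilterSketch, the RESPONSE_ONLY predicate, and the use of a plain header map in place of a WARCRecord are all assumptions made for the example. Only the WARC-Type header name and its "response" value come from the WARC format itself.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Predicate;

public class ResponseOnlyFilterSketch {
    // Hypothetical stand-in for a record filter: it inspects a record's
    // named header fields (modelled here as a plain map) and accepts the
    // record only when its WARC-Type is "response".
    static final Predicate<Map<String, String>> RESPONSE_ONLY =
            headers -> "response".equalsIgnoreCase(headers.get("WARC-Type"));

    public static void main(String[] args) {
        Map<String, String> response = new HashMap<>();
        response.put("WARC-Type", "response");

        Map<String, String> request = new HashMap<>();
        request.put("WARC-Type", "request");

        System.out.println(RESPONSE_ONLY.test(response)); // prints "true"
        System.out.println(RESPONSE_ONLY.test(request));  // prints "false"
    }
}
```

        Records of any other type (request, metadata, warcinfo and so on) would never reach processRecord() under such a filter.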
      • initialize

        public void initialize(java.io.OutputStream os)
        Initializes the private fields of this class. Some of these are relatively heavy objects, so it is important that they are initialized only once.
        Specified by:
        initialize in class WARCBatchJob
        Parameters:
        os - unused argument.
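        The "initialize once" requirement can be illustrated with a small sketch. Everything here is hypothetical: InitializeOnceSketch, HeavyHelper, and the constructions counter are invented for the example, and whether the real class guards against repeated initialization is not stated in this documentation.

```java
import java.io.OutputStream;

public class InitializeOnceSketch {
    // Hypothetical "heavy" object: expensive to build, cheap to reuse.
    static final class HeavyHelper {
        final StringBuilder buffer = new StringBuilder(1 << 16);
    }

    private HeavyHelper helper;
    int constructions = 0; // counts how often the heavy object was built

    // The stream argument is unused, mirroring the documented signature.
    public void initialize(OutputStream os) {
        if (helper == null) {
            helper = new HeavyHelper();
            constructions++;
        }
    }

    // Reuses the helper built in initialize() instead of creating a new one
    // for every record.
    public int processRecord(String record) {
        helper.buffer.setLength(0);
        helper.buffer.append(record);
        return helper.buffer.length();
    }

    public static void main(String[] args) {
        InitializeOnceSketch job = new InitializeOnceSketch();
        job.initialize(null);
        job.processRecord("first");
        job.processRecord("second");
        job.processRecord("third");
        System.out.println(job.constructions); // prints "1"
    }
}
```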
      • finish

        public void finish(java.io.OutputStream os)
        Does nothing except log the end of the job.
        Specified by:
        finish in class WARCBatchJob
        Parameters:
        os - unused argument.
      • processRecord

        public void processRecord(org.archive.io.warc.WARCRecord record,
                                  java.io.OutputStream os)
        For each response WARCRecord, writes one CDX line (including the trailing newline) to the output. If a WARCRecord cannot be converted to a CDX record for any reason, the resulting exception is caught and logged.
        Specified by:
        processRecord in class WARCBatchJob
        Parameters:
        record - the WARCRecord to be indexed.
        os - the OutputStream to which output is written.
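        The behaviour described for processRecord (one line per record, with failures caught and logged rather than propagated) can be sketched as below. This is an illustration under stated assumptions, not the actual implementation: CdxLineWriterSketch and toCdxLine are hypothetical, records are modelled as plain strings, and the two-field output line is a deliberate simplification rather than a real CDX line.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.logging.Level;
import java.util.logging.Logger;

public class CdxLineWriterSketch {
    private static final Logger LOG =
            Logger.getLogger(CdxLineWriterSketch.class.getName());

    // Hypothetical converter: expects "url timestamp" and throws on anything
    // else. A real CDX line carries more fields than this.
    static String toCdxLine(String record) {
        String[] parts = record.trim().split("\\s+");
        if (parts.length != 2) {
            throw new IllegalArgumentException("Malformed record: " + record);
        }
        return parts[0] + " " + parts[1];
    }

    // Writes one line (plus newline) per convertible record. Any exception
    // is caught and logged so that one bad record does not stop the job.
    static boolean processRecord(String record, OutputStream os) {
        try {
            String line = toCdxLine(record);
            os.write((line + "\n").getBytes(StandardCharsets.UTF_8));
            return true;
        } catch (Exception e) {
            LOG.log(Level.WARNING, "Could not index record; skipping it", e);
            return false;
        }
    }

    public static void main(String[] args) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        processRecord("http://example.org/ 20240101000000", out);
        processRecord("broken", out); // logged and skipped, nothing written
        System.out.print(new String(out.toByteArray(), StandardCharsets.UTF_8));
    }
}
```

        Catching the exception per record keeps the output limited to the records that could be indexed, while the log retains the details of the ones that could not.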