Class CDXOriginCrawlLogIterator


  • public class CDXOriginCrawlLogIterator
    extends CrawlLogIterator
    This subclass of CrawlLogIterator additionally looks up an origin of the form "arcfile,offset" for each entry in a corresponding CDX index. Crawl log entries for which no origin can be found are skipped. The crawl log and the CDX index are read in parallel.
    • Field Detail

      • reader

        protected java.io.BufferedReader reader
        The reader of the (sorted) CDX index.
      • lastRecord

        protected CDXRecord lastRecord
        The last record read from the reader. Reading the CDX index may overshoot when the crawl.log contains entries that are not in the CDX index, so this record is kept until the reading of the crawl.log catches up.
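
The lookahead behaviour described for lastRecord can be illustrated with a minimal, self-contained sketch. It is not the implementation of this class: the class name SortedCdxLookahead, the method matchesFor, and the simplified line layout (URL as the first space-separated field) are all assumptions made for illustration. It also assumes that String.compareTo agrees with the sort order of the index, which holds for plain ASCII URLs sorted in the C locale.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of the documented API. */
public class SortedCdxLookahead {
    private final BufferedReader reader;
    /** The line read most recently; kept when it lies past the requested URL. */
    private String lastLine;

    public SortedCdxLookahead(BufferedReader reader) {
        this.reader = reader;
        this.lastLine = readLine();
    }

    /** Reads one line, wrapping IOException so callers need no checked handling. */
    private String readLine() {
        try {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Assumption: the URL is the first space-separated field of an index line. */
    private static String urlOf(String cdxLine) {
        int space = cdxLine.indexOf(' ');
        return space < 0 ? cdxLine : cdxLine.substring(0, space);
    }

    /**
     * Returns every index line for the given URL, or an empty list if the
     * URL is not indexed. URLs must be requested in sorted order.
     */
    public List<String> matchesFor(String url) {
        List<String> matches = new ArrayList<>();
        while (lastLine != null) {
            int cmp = urlOf(lastLine).compareTo(url);
            if (cmp > 0) {
                break; // overshoot: keep lastLine for a later, larger URL
            }
            if (cmp == 0) {
                matches.add(lastLine);
            }
            lastLine = readLine(); // line was smaller or consumed; move on
        }
        return matches;
    }

    public static void main(String[] args) {
        String cdx = "http://a/ 20010101000000 a.arc 10\n"
                   + "http://c/ 20020101000000 c.arc 30\n";
        SortedCdxLookahead index =
                new SortedCdxLookahead(new BufferedReader(new StringReader(cdx)));
        System.out.println(index.matchesFor("http://a/")); // one match
        System.out.println(index.matchesFor("http://b/")); // not indexed: empty list
        System.out.println(index.matchesFor("http://c/")); // held line is still found
    }
}
```

The second lookup finds nothing, yet the line for http://c/ that was already read is not lost, because it stays in lastLine until a request reaches it.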
    • Constructor Detail

      • CDXOriginCrawlLogIterator

        public CDXOriginCrawlLogIterator(java.io.File source,
                                         java.io.BufferedReader cdx)
                                  throws java.io.IOException
        Create a new CDXOriginCrawlLogIterator from crawl.log and CDX sources.
        Parameters:
        source - File containing a crawl.log sorted by URL (LANG=C sort -k 4b)
        cdx - A reader of a sorted CDX file. It is passed as a reader so that the caller can close it after use (CrawlLogIterator provides no close()).
        Throws:
        java.io.IOException - If the underlying CrawlLogIterator fails, e.g. due to missing files.
    • Method Detail

      • parseLine

        protected CrawlDataItem parseLine(java.lang.String line)
                                   throws IOFailure
        Parse a crawl.log line into a valid CrawlDataItem.

        If CrawlLogIterator accepts this line, we make sure that the resulting item has an origin, looking up a missing one in the CDX file. If multiple origins are found in the CDX file, the one that was harvested last is chosen. If no origin can be found, the item is rejected.

        We assume that super.parseLine() delivers the items in the crawl.log in the given (sorted) order with non-null URLs, although it may throw undeclared exceptions.

        Overrides:
        parseLine in class CrawlLogIterator
        Parameters:
        line - A crawl.log line to parse.
        Returns:
        A CrawlDataItem with a valid origin field, or null if we could not determine an appropriate origin.
        Throws:
        IOFailure - if there is an error reading the files.
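
The origin selection rule described for parseLine (latest harvest wins, null when nothing matches) might look roughly like the following sketch. The names OriginChooser, CdxEntry and chooseOrigin are hypothetical and do not belong to this API. The sketch also assumes that harvest dates are fixed-width timestamp strings such as yyyyMMddHHmmss, so that string comparison matches chronological order.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of picking an origin; not the documented implementation. */
public class OriginChooser {

    /** Simplified stand-in for an index entry; the field names are assumptions. */
    public static class CdxEntry {
        final String date;    // assumed fixed-width timestamp, e.g. yyyyMMddHHmmss
        final String arcfile;
        final long offset;

        public CdxEntry(String date, String arcfile, long offset) {
            this.date = date;
            this.arcfile = arcfile;
            this.offset = offset;
        }
    }

    /**
     * Returns "arcfile,offset" for the candidate with the latest date,
     * or null if there are no candidates.
     */
    public static String chooseOrigin(List<CdxEntry> candidates) {
        CdxEntry latest = null;
        for (CdxEntry entry : candidates) {
            // Fixed-width timestamps compare chronologically as strings.
            if (latest == null || entry.date.compareTo(latest.date) > 0) {
                latest = entry;
            }
        }
        return latest == null ? null : latest.arcfile + "," + latest.offset;
    }

    public static void main(String[] args) {
        List<CdxEntry> candidates = new ArrayList<>();
        candidates.add(new CdxEntry("20050101000000", "first.arc", 10));
        candidates.add(new CdxEntry("20060101000000", "second.arc", 20));
        System.out.println(chooseOrigin(candidates));
        System.out.println(chooseOrigin(new ArrayList<CdxEntry>()));
    }
}
```

Returning null for an empty candidate list mirrors the documented contract, under which an item without a determinable origin is rejected rather than passed on with a guessed value.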