dk.netarkivet.archive.indexserver
Class CDXOriginCrawlLogIterator

java.lang.Object
  extended by is.hi.bok.deduplicator.CrawlDataIterator
      extended by is.hi.bok.deduplicator.CrawlLogIterator
          extended by dk.netarkivet.archive.indexserver.CDXOriginCrawlLogIterator

public class CDXOriginCrawlLogIterator
extends is.hi.bok.deduplicator.CrawlLogIterator

This subclass of CrawlLogIterator adds a layer that extracts an origin of the form "arcfile,offset" from a corresponding CDX index. Crawl-log entries with no matching CDX entry may be skipped as a result. The two files are read in parallel.
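Since both files are sorted by URL, the parallel read amounts to a single-pass merge join. The following self-contained sketch illustrates that idea with plain string records; the class, method, and record layout here are illustrative assumptions, not the actual NetarchiveSuite types:

```java
import java.util.*;

// Illustrative sketch of the parallel read over two URL-sorted inputs.
// Not the real implementation: real code works on crawl.log lines and
// CDXRecord objects, not String arrays.
public class OriginMergeSketch {
    /**
     * Map each crawl-log URL to an "arcfile,offset" origin found in the
     * CDX entries. Crawl-log URLs with no CDX entry are skipped.
     *
     * @param crawlLogUrls URLs from the crawl log, sorted.
     * @param cdx CDX entries as {url, arcfile, offset}, sorted by url.
     */
    public static Map<String, String> findOrigins(List<String> crawlLogUrls,
                                                  List<String[]> cdx) {
        Map<String, String> origins = new LinkedHashMap<>();
        int i = 0;
        for (String url : crawlLogUrls) {
            // Advance the CDX cursor until it catches up with the crawl log.
            while (i < cdx.size() && cdx.get(i)[0].compareTo(url) < 0) {
                i++;
            }
            if (i < cdx.size() && cdx.get(i)[0].equals(url)) {
                origins.put(url, cdx.get(i)[1] + "," + cdx.get(i)[2]);
            }
            // else: no origin in the CDX; this crawl-log entry is skipped.
        }
        return origins;
    }
}
```

Because each cursor only moves forward, both inputs are read exactly once, which is why the class requires the crawl.log to be pre-sorted by URL.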


Field Summary
protected  CDXRecord lastRecord
          The last record we read from the reader.
protected  java.io.BufferedReader reader
          The reader of the (sorted) CDX index.
 
Fields inherited from class is.hi.bok.deduplicator.CrawlLogIterator
crawlDataItemFormat, crawlDateFormat, in, next
 
Constructor Summary
CDXOriginCrawlLogIterator(java.io.File source, java.io.BufferedReader cdx)
          Create a new CDXOriginCrawlLogIterator from crawl.log and CDX sources.
 
Method Summary
protected  is.hi.bok.deduplicator.CrawlDataItem parseLine(java.lang.String line)
          Parse a crawl.log line into a valid CrawlDataItem.
 
Methods inherited from class is.hi.bok.deduplicator.CrawlLogIterator
close, getSourceType, hasNext, next, prepareNext
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

reader

protected java.io.BufferedReader reader
The reader of the (sorted) CDX index.


lastRecord

protected CDXRecord lastRecord
The last record read from the reader. Because the crawl.log may contain entries with no CDX counterpart, the CDX reading can overshoot; this record is held on to until the crawl.log reading catches up.

Constructor Detail

CDXOriginCrawlLogIterator

public CDXOriginCrawlLogIterator(java.io.File source,
                                 java.io.BufferedReader cdx)
                          throws java.io.IOException
Create a new CDXOriginCrawlLogIterator from crawl.log and CDX sources.

Parameters:
source - File containing a crawl.log sorted by URL (LANG=C sort -k 4b)
cdx - A reader of a sorted CDX file. This is given as a reader so that the caller may close it after use (CrawlLogIterator provides no close()).
Throws:
java.io.IOException - If the underlying CrawlLogIterator fails, e.g. due to missing files.
Method Detail

parseLine

protected is.hi.bok.deduplicator.CrawlDataItem parseLine(java.lang.String line)
                                                  throws IOFailure
Parse a crawl.log line into a valid CrawlDataItem. If CrawlLogIterator accepts the line, we ensure the resulting item has an origin by looking up missing ones in the CDX file. If multiple origins are found in the CDX file, the one that was harvested last is chosen. If no origin can be found, the item is rejected. We assume that super.parseLine() delivers the crawl.log items in the given (sorted) order with non-null URLs, though it may throw undeclared exceptions.

Overrides:
parseLine in class is.hi.bok.deduplicator.CrawlLogIterator
Parameters:
line - A crawl.log line to parse.
Returns:
A CrawlDataItem with a valid origin field, or null if we could not determine an appropriate origin.
Throws:
IOFailure - if there is an error reading the files.
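The selection rule above, "latest harvest wins, reject if none", can be sketched as follows. The Entry record and its fields are hypothetical stand-ins for the real CDXRecord; the only assumption carried over from CDX is the sortable 14-digit yyyyMMddHHmmss timestamp:

```java
import java.util.*;

// Illustrative selection of the latest-harvested origin among CDX entries
// for the same URL. Entry and its fields are hypothetical; the real code
// works with dk.netarkivet CDXRecord objects.
public class LatestOriginSketch {
    /** A CDX entry reduced to the fields needed here: a 14-digit
     *  yyyyMMddHHmmss timestamp plus the "arcfile,offset" origin parts. */
    public static final class Entry {
        final String timestamp; // e.g. "20240131120000"
        final String arcfile;
        final long offset;
        public Entry(String timestamp, String arcfile, long offset) {
            this.timestamp = timestamp;
            this.arcfile = arcfile;
            this.offset = offset;
        }
    }

    /**
     * Return the "arcfile,offset" origin of the entry harvested last,
     * or null when no entry exists (the item is then rejected, mirroring
     * parseLine returning null).
     */
    public static String latestOrigin(List<Entry> entriesForUrl) {
        Entry best = null;
        for (Entry e : entriesForUrl) {
            // Fixed-width yyyyMMddHHmmss timestamps compare correctly
            // as plain strings.
            if (best == null || e.timestamp.compareTo(best.timestamp) > 0) {
                best = e;
            }
        }
        return best == null ? null : best.arcfile + "," + best.offset;
    }
}
```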