dk.netarkivet.archive.indexserver
Class CDXOriginCrawlLogIterator
java.lang.Object
is.hi.bok.deduplicator.CrawlDataIterator
is.hi.bok.deduplicator.CrawlLogIterator
dk.netarkivet.archive.indexserver.CDXOriginCrawlLogIterator
public class CDXOriginCrawlLogIterator
extends is.hi.bok.deduplicator.CrawlLogIterator
This subclass of CrawlLogIterator adds a layer that digs an origin of
the form "arcfile,offset" out of a corresponding CDX index. This may
cause some of the entries in the crawl log to be skipped. The two files,
both sorted by URL, are read in parallel.
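The parallel read amounts to a merge-join over the two URL-sorted inputs: the CDX reader is advanced until it reaches or passes the current crawl-log URL, and unmatched log entries are skipped. A minimal, self-contained sketch of that idea (not the real implementation; the space-separated "url arcfile offset" CDX layout here is a simplified assumption):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of the merge-join behind CDXOriginCrawlLogIterator: both inputs are
// sorted by URL, so one forward pass over the CDX lines suffices. Crawl-log
// entries with no CDX match are skipped, as the class description says.
public class OriginMergeJoin {

    /** cdxLines are "url arcfile offset"; both lists are sorted by URL. */
    public static Map<String, String> attachOrigins(List<String> logUrls,
                                                    List<String> cdxLines) {
        Map<String, String> origins = new LinkedHashMap<>();
        int i = 0;
        String lastUrl = null;      // last CDX record read: we may overshoot,
        String lastOrigin = null;   // so hang onto it until the log catches up
        for (String url : logUrls) {
            while (i < cdxLines.size()) {
                String[] f = cdxLines.get(i).split(" ");
                if (f[0].compareTo(url) > 0) {
                    break;          // overshot: keep this record for later
                }
                lastUrl = f[0];
                lastOrigin = f[1] + "," + f[2];
                i++;
            }
            if (url.equals(lastUrl)) {
                origins.put(url, lastOrigin);   // found an "arcfile,offset"
            }                                   // else: the entry is skipped
        }
        return origins;
    }

    public static void main(String[] args) {
        List<String> log = List.of("http://a/", "http://b/", "http://c/");
        List<String> cdx = List.of("http://a/ one.arc 10",
                                   "http://c/ two.arc 20");
        // http://b/ has no CDX record and is skipped
        System.out.println(attachOrigins(log, cdx));
        // prints {http://a/=one.arc,10, http://c/=two.arc,20}
    }
}
```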
Field Summary
protected CDXRecord lastRecord
          The last record we read from the reader.
protected java.io.BufferedReader reader
          The reader of the (sorted) CDX index.
Fields inherited from class is.hi.bok.deduplicator.CrawlLogIterator
crawlDataItemFormat, crawlDateFormat, in, next
Constructor Summary
CDXOriginCrawlLogIterator(java.io.File source, java.io.BufferedReader cdx)
          Create a new CDXOriginCrawlLogIterator from crawl.log and CDX sources.
Method Summary
protected is.hi.bok.deduplicator.CrawlDataItem parseLine(java.lang.String line)
          Parse a crawl.log line into a valid CrawlDataItem.
Methods inherited from class is.hi.bok.deduplicator.CrawlLogIterator
close, getSourceType, hasNext, next, prepareNext
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
reader
protected java.io.BufferedReader reader
- The reader of the (sorted) CDX index.
lastRecord
protected CDXRecord lastRecord
- The last record we read from the reader. We may overshoot on the
CDX reading if there are entries not in CDX, so we hang onto this
until the reading of the crawl.log catches up.
CDXOriginCrawlLogIterator
public CDXOriginCrawlLogIterator(java.io.File source,
java.io.BufferedReader cdx)
throws java.io.IOException
- Create a new CDXOriginCrawlLogIterator from crawl.log and CDX sources.
- Parameters:
source
- File containing a crawl.log sorted by URL (LANG=C sort -k 4b)
cdx
- A reader of a sorted CDX file. This is given as a reader
so that it may be closed after use (CrawlLogIterator provides no close()).
- Throws:
java.io.IOException
- If the underlying CrawlLogIterator fails, e.g.
due to missing files.
parseLine
protected is.hi.bok.deduplicator.CrawlDataItem parseLine(java.lang.String line)
throws IOFailure
- Parse a crawl.log line into a valid CrawlDataItem.
If CrawlLogIterator accepts the line, we make sure the item has an
origin, finding any missing one in the CDX file.
If multiple origins are found in the CDX file, the one that was
harvested last is chosen.
If no origin can be found, the item is rejected.
We assume that super.parseLine() delivers the crawl.log items
in the given (sorted) order with non-null URLs, though it may
throw undeclared exceptions.
- Overrides:
parseLine
in class is.hi.bok.deduplicator.CrawlLogIterator
- Parameters:
line
- A crawl.log line to parse.
- Returns:
- A CrawlDataItem with a valid origin field, or null if we could
not determine an appropriate origin.
- Throws:
IOFailure
- if there is an error reading the files.
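The "harvested last wins" rule above can be sketched with 14-digit CDX timestamps (yyyyMMddHHmmss), which sort chronologically when compared as strings. The "timestamp arcfile offset" record layout below is an illustrative assumption, not the real CDXRecord format:

```java
import java.util.List;

// Sketch of the multiple-origins rule: among several CDX records for one URL,
// pick the origin of the record harvested last; with no record at all, the
// method's caller rejects the item (here signalled by returning null).
public class LatestOrigin {

    public static String latestOrigin(List<String> records) {
        String bestStamp = null;
        String bestOrigin = null;
        for (String rec : records) {
            String[] f = rec.split(" ");
            // 14-digit timestamps compare chronologically as plain strings
            if (bestStamp == null || f[0].compareTo(bestStamp) > 0) {
                bestStamp = f[0];
                bestOrigin = f[1] + "," + f[2];
            }
        }
        return bestOrigin;  // null when no record exists: item is rejected
    }

    public static void main(String[] args) {
        String origin = latestOrigin(List.of(
                "20201101120000 old.arc 100",
                "20210315080000 new.arc 200"));
        System.out.println(origin);  // the 2021 capture wins: new.arc,200
    }
}
```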