Class CDXOriginCrawlLogIterator
- java.lang.Object
  - is.hi.bok.deduplicator.CrawlDataIterator
    - is.hi.bok.deduplicator.CrawlLogIterator
      - dk.netarkivet.harvester.indexserver.CDXOriginCrawlLogIterator
public class CDXOriginCrawlLogIterator extends CrawlLogIterator
This subclass of CrawlLogIterator adds a lookup step: for each crawl log entry it retrieves an origin of the form "arcfile,offset" from a corresponding CDX index. Crawl log entries for which no origin can be found may be skipped. The crawl log and the CDX index are read in parallel.
Field Summary
Fields
Modifier and Type | Field | Description
protected CDXRecord | lastRecord | The last record we read from the reader.
protected java.io.BufferedReader | reader | The reader of the (sorted) CDX index.
Fields inherited from class is.hi.bok.deduplicator.CrawlLogIterator
crawlDataItemFormat, crawlDateFormat, crawlDateFormatStr, fallbackCrawlDateFormat, fallbackCrawlDateFormatStr, in, next
Constructor Summary
Constructors
Constructor | Description
CDXOriginCrawlLogIterator(java.io.File source, java.io.BufferedReader cdx) | Create a new CDXOriginCrawlLogIterator from crawl.log and CDX sources.
Method Summary
Modifier and Type | Method | Description
protected CrawlDataItem | parseLine(java.lang.String line) | Parse a crawl.log line into a valid CrawlDataItem.
Methods inherited from class is.hi.bok.deduplicator.CrawlLogIterator
close, getSourceType, hasNext, next, prepareNext
Field Detail
reader
protected java.io.BufferedReader reader
The reader of the (sorted) CDX index.
lastRecord
protected CDXRecord lastRecord
The last record we read from the reader. We may overshoot while reading the CDX index if the crawl.log contains entries that are not in the CDX, so we hold on to this record until the reading of the crawl.log catches up.
Constructor Detail
CDXOriginCrawlLogIterator
public CDXOriginCrawlLogIterator(java.io.File source, java.io.BufferedReader cdx) throws java.io.IOException
Create a new CDXOriginCrawlLogIterator from crawl.log and CDX sources.
Parameters:
source - File containing a crawl.log sorted by URL (LANG=C sort -k 4b)
cdx - A reader of a sorted CDX file. This is given as a reader so that it may be closed after use (CrawlLogIterator provides no close())
Throws:
java.io.IOException - If the underlying CrawlLogIterator fails, e.g. due to missing files.
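The sort requirement on the crawl.log can be checked before the file is handed to the constructor. The following is a minimal sketch, not part of the library: the class CrawlLogSortCheck and its methods are hypothetical, and it approximates the key of `LANG=C sort -k 4b` as the rest of the line starting at the fourth blank-separated field, with leading blanks dropped. It assumes ASCII lines, for which Java's String.compareTo agrees with the byte order used by the C locale.

```java
import java.util.List;

// Hypothetical helper, not part of the documented API.
class CrawlLogSortCheck {

    static boolean isBlank(char c) {
        return c == ' ' || c == '\t';
    }

    /** Approximates the "-k 4b" key: the line from its 4th field onward, leading blanks dropped. */
    static String sortKey(String line) {
        int i = 0;
        int n = line.length();
        for (int field = 0; field < 3; field++) {
            while (i < n && isBlank(line.charAt(i))) i++;   // blanks before the field
            while (i < n && !isBlank(line.charAt(i))) i++;  // the field itself
        }
        while (i < n && isBlank(line.charAt(i))) i++;       // 'b': ignore leading blanks of the key
        return line.substring(i);
    }

    /** True if the keys of consecutive lines never decrease. */
    static boolean isSorted(List<String> lines) {
        for (int k = 1; k < lines.size(); k++) {
            if (sortKey(lines.get(k - 1)).compareTo(sortKey(lines.get(k))) > 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "t2 200 1 http://a/",
                "t1 200 1 http://b/");
        System.out.println(isSorted(lines));
    }
}
```

The check only requires non-decreasing keys, which is what a parallel read of two sorted inputs depends on; it does not reproduce every detail of sort's tie-breaking.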
Method Detail
parseLine
protected CrawlDataItem parseLine(java.lang.String line) throws IOFailure
Parse a crawl.log line into a valid CrawlDataItem.
If CrawlLogIterator accepts this line, we must make sure that the resulting item has an origin, looking up a missing origin in the CDX file. If multiple origins are found in the CDX file, the one that was harvested last is chosen. If no origin can be found, the item is rejected.
We assume that super.parseLine() delivers the items in the crawl.log in the given (sorted) order with non-null URLs, although it may throw undeclared exceptions.
Overrides:
parseLine in class CrawlLogIterator
Parameters:
line - A crawl.log line to parse.
Returns:
A CrawlDataItem with a valid origin field, or null if we could not determine an appropriate origin.
Throws:
IOFailure - if there is an error reading the files.
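The lookup described for parseLine can be pictured as a merge over two sorted inputs with a single held-back record. The following is a minimal sketch under stated assumptions, not the library's implementation: the classes OriginMergeSketch and Cdx are hypothetical stand-ins (Cdx for CDXRecord), the CDX data is an in-memory list instead of a reader, and "harvested last" is decided by comparing a timestamp string.

```java
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch of the parallel read; not the documented class.
class OriginMergeSketch {

    /** Hypothetical stand-in for a CDX record. */
    static final class Cdx {
        final String url;
        final String date;    // harvest timestamp, e.g. "20200101"
        final String arcfile;
        final long offset;

        Cdx(String url, String date, String arcfile, long offset) {
            this.url = url;
            this.date = date;
            this.arcfile = arcfile;
            this.offset = offset;
        }
    }

    private final Iterator<Cdx> cdx;
    private Cdx lastRecord; // record read ahead of the crawl log, kept until it catches up

    OriginMergeSketch(List<Cdx> sortedCdx) {
        this.cdx = sortedCdx.iterator();
        advance();
    }

    private void advance() {
        lastRecord = cdx.hasNext() ? cdx.next() : null;
    }

    /**
     * Returns "arcfile,offset" for the URL, or null if the CDX has no entry.
     * URLs must be passed in ascending order, matching the sorted crawl log.
     */
    String originFor(String url) {
        // Skip CDX records that sort before the requested URL.
        while (lastRecord != null && lastRecord.url.compareTo(url) < 0) {
            advance();
        }
        // Among the records for this URL, keep the one harvested last.
        Cdx best = null;
        while (lastRecord != null && lastRecord.url.equals(url)) {
            if (best == null || lastRecord.date.compareTo(best.date) >= 0) {
                best = lastRecord;
            }
            advance();
        }
        return best == null ? null : best.arcfile + "," + best.offset;
    }

    public static void main(String[] args) {
        OriginMergeSketch m = new OriginMergeSketch(List.of(
                new Cdx("http://a/", "20200101", "1.arc", 10),
                new Cdx("http://a/", "20200301", "2.arc", 20),
                new Cdx("http://c/", "20200201", "3.arc", 30)));
        System.out.println(m.originFor("http://a/")); // 2.arc,20
        System.out.println(m.originFor("http://b/")); // null
        System.out.println(m.originFor("http://c/")); // 3.arc,30
    }
}
```

In the sketch, the lookup for "http://b/" leaves the record for "http://c/" in lastRecord, which illustrates why the last CDX record is kept between calls: the CDX side may already be ahead of the crawl log.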