public class CDXOriginCrawlLogIterator extends CrawlLogIterator
Modifier and Type | Field and Description |
---|---|
protected CDXRecord | lastRecord - The last record we read from the reader. |
protected BufferedReader | reader - The reader of the (sorted) CDX index. |
Fields inherited from class CrawlLogIterator: crawlDataItemFormat, crawlDateFormat, in, next
Constructor and Description |
---|
CDXOriginCrawlLogIterator(File source, BufferedReader cdx) - Create a new CDXOriginCrawlLogIterator from crawl.log and CDX sources. |
Modifier and Type | Method and Description |
---|---|
protected CrawlDataItem | parseLine(String line) - Parse a crawl.log line into a valid CrawlDataItem. |
Methods inherited from class CrawlLogIterator: close, getSourceType, hasNext, next, prepareNext
protected BufferedReader reader
protected CDXRecord lastRecord
public CDXOriginCrawlLogIterator(File source, BufferedReader cdx) throws IOException
source - File containing a crawl.log sorted by URL (LANG=C sort -k 4b)
cdx - A reader of a sorted CDX file. This is given as a reader so that it may be closed after use (CrawlLogIterator provides no close())
IOException - If the underlying CrawlLogIterator fails, e.g. due to missing files.
protected CrawlDataItem parseLine(String line) throws IOFailure
If CrawlLogIterator accepts this line, we make sure that the resulting item has an origin, looking up a missing origin in the CDX file. If multiple origins are found in the CDX file, the one that was harvested last is chosen. If no origin can be found, the item is rejected.
We assume that super.parseLine() delivers the crawl.log items in the given (sorted) order with non-null URLs, although it may throw undeclared exceptions.
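The lookup described above can be sketched as a single forward pass over the sorted index. This is only an illustration under stated assumptions, not the NetarchiveSuite implementation: the class CdxOriginSketch, its nested Rec type, the method findOrigin, and the simplified four-column record layout (url, 14-digit date, arcfile, offset) are all hypothetical stand-ins for CDXRecord and the real CDX format. The "arcfile,offset" shape of the returned origin is likewise an assumption made for the example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

/** Hypothetical sketch of a merge-style origin lookup over a sorted index. */
final class CdxOriginSketch {
    /** Simplified stand-in for CDXRecord (assumed layout: url date arcfile offset). */
    static final class Rec {
        final String url, date, arcfile, offset;
        Rec(String line) {
            String[] p = line.trim().split("\\s+");
            url = p[0]; date = p[1]; arcfile = p[2]; offset = p[3];
        }
    }

    private final BufferedReader reader; // reader of the sorted index
    private Rec lastRecord;              // last record read; may lie beyond the current URL

    CdxOriginSketch(BufferedReader reader) {
        this.reader = reader;
    }

    /**
     * Returns "arcfile,offset" of the most recently harvested record for url,
     * or null if the index has no record for it. URLs must be queried in the
     * same sorted order as the index, because the reader only moves forward.
     */
    String findOrigin(String url) {
        Rec best = null;
        while (true) {
            if (lastRecord != null) {
                int cmp = lastRecord.url.compareTo(url);
                if (cmp > 0) {
                    break; // index is past url; keep lastRecord for the next query
                }
                // Fixed-width timestamps compare correctly as strings.
                if (cmp == 0 && (best == null || lastRecord.date.compareTo(best.date) >= 0)) {
                    best = lastRecord;
                }
            }
            String line;
            try {
                line = reader.readLine();
            } catch (IOException e) {
                throw new UncheckedIOException("Error reading index record", e);
            }
            if (line == null) {
                break; // end of index
            }
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            lastRecord = new Rec(line);
        }
        return best == null ? null : best.arcfile + "," + best.offset;
    }
}
```

Because both inputs are sorted the same way, each index line is read at most once across all queries. The sketch also shows why the sort order matters: a query issued out of order would find the reader already past its URL and come back without an origin.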
Overrides: parseLine in class CrawlLogIterator
line - A crawl.log line to parse.
IOFailure - If there is an error reading the files.
Copyright © 2005–2015 The Royal Danish Library, the Danish State and University Library, the National Library of France and the Austrian National Library. All rights reserved.