CDXOriginCrawlLogIterator (NetarchiveSuite 5.0 API)

java.lang.Object
- is.hi.bok.deduplicator.CrawlDataIterator
  - is.hi.bok.deduplicator.CrawlLogIterator
    - dk.netarkivet.harvester.indexserver.CDXOriginCrawlLogIterator

```
public class CDXOriginCrawlLogIterator
extends CrawlLogIterator
```
This subclass of CrawlLogIterator adds a lookup step: for each crawl.log entry it digs an origin of the form "arcfile,offset" out of the corresponding CDX index. Entries for which no origin can be found may be skipped. The crawl.log and the CDX index are read in parallel.
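
The parallel read can be pictured as a merge join over two inputs that are both sorted by URL. The following is a minimal sketch under that assumption, using simplified in-memory lists instead of the real crawl.log and CDX formats. `ParallelReadSketch`, `CdxEntry` and `match` are hypothetical names for illustration and are not part of NetarchiveSuite.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical illustration only; not the NetarchiveSuite implementation.
class ParallelReadSketch {

    /** Simplified stand-in for a CDX record: a URL plus an "arcfile,offset" origin. */
    static final class CdxEntry {
        final String url;
        final String origin;

        CdxEntry(String url, String origin) {
            this.url = url;
            this.origin = origin;
        }
    }

    /**
     * Walks both sorted inputs once. Returns "url -> origin" strings for the
     * crawl-log URLs that have a CDX entry; URLs without one are skipped.
     */
    static List<String> match(List<String> sortedLogUrls, List<CdxEntry> sortedCdx) {
        List<String> result = new ArrayList<>();
        Iterator<CdxEntry> cdx = sortedCdx.iterator();
        // Lookahead record: kept between log lines because the CDX side may
        // already have read past the current crawl-log URL.
        CdxEntry last = cdx.hasNext() ? cdx.next() : null;
        for (String url : sortedLogUrls) {
            // Advance the CDX side until it reaches or passes the log URL.
            while (last != null && last.url.compareTo(url) < 0) {
                last = cdx.hasNext() ? cdx.next() : null;
            }
            if (last != null && last.url.equals(url)) {
                result.add(url + " -> " + last.origin);
            }
            // Otherwise the log entry has no origin and is skipped.
        }
        return result;
    }
}
```

Because each input is consumed front to back exactly once, neither file has to be held in memory, which is presumably why both inputs must be sorted the same way.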

Field Summary

Fields
Modifier and Type	Field and Description
`protected CDXRecord`	`lastRecord` The last record we read from the reader.
`protected BufferedReader`	`reader` The reader of the (sorted) CDX index.
- Fields inherited from class is.hi.bok.deduplicator.CrawlLogIterator
  crawlDataItemFormat, crawlDateFormat, in, next

Constructor Summary

Constructors
Constructor and Description
`CDXOriginCrawlLogIterator(File source, BufferedReader cdx)` Create a new CDXOriginCrawlLogIterator from crawl.log and CDX sources.

Method Summary

All Methods Instance Methods Concrete Methods
Modifier and Type	Method and Description
`protected CrawlDataItem`	`parseLine(String line)` Parse a crawl.log line into a valid CrawlDataItem.
- Methods inherited from class is.hi.bok.deduplicator.CrawlLogIterator
  close, getSourceType, hasNext, next, prepareNext
- Methods inherited from class java.lang.Object
  clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Field Detail
  - reader
```
protected BufferedReader reader
```
    The reader of the (sorted) CDX index.
  - lastRecord
```
protected CDXRecord lastRecord
```
    The last record we read from the reader. If the crawl.log contains entries that are not in the CDX index, reading the CDX may overshoot, so this record is kept until the reading of the crawl.log catches up.
- Constructor Detail
  - CDXOriginCrawlLogIterator
```
public CDXOriginCrawlLogIterator(File source,
                                 BufferedReader cdx)
                          throws IOException
```
    Create a new CDXOriginCrawlLogIterator from crawl.log and CDX sources.
    
    Parameters:
    
    source - File containing a crawl.log sorted by URL (LANG=C sort -k 4b)
    
    cdx - A reader of a sorted CDX file. This is given as a reader so that it may be closed after use (CrawlLogIterator provides no close())
    
    Throws:
    
    IOException - If the underlying CrawlLogIterator fails, e.g. due to missing files.
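
    The ordering required of `source` can be approximated in Java when shelling out to `sort` is not convenient. This is a minimal sketch, assuming ASCII-only lines whose fourth whitespace-separated field is the URL; `CrawlLogSortSketch`, `urlKey` and `sortByUrl` are hypothetical helpers, and the sketch is not guaranteed to match `LANG=C sort -k 4b` on every input.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper; the documented route is the external sort command.
class CrawlLogSortSketch {

    /** Returns the sort key: everything from the fourth whitespace-separated field on. */
    static String urlKey(String line) {
        String[] parts = line.trim().split("\\s+", 4);
        return parts.length == 4 ? parts[3] : "";
    }

    /** Returns a copy of the lines ordered by urlKey, with ties broken by the whole line. */
    static List<String> sortByUrl(List<String> lines) {
        List<String> sorted = new ArrayList<>(lines);
        // String.compareTo orders by UTF-16 code unit, which matches C-locale
        // byte order for ASCII input (assumed here).
        Comparator<String> byKey = Comparator.comparing(CrawlLogSortSketch::urlKey);
        sorted.sort(byKey.thenComparing(Comparator.<String>naturalOrder()));
        return sorted;
    }
}
```

    The sorted lines would then be written to a file and passed as `source`, with a CDX reader sorted in the same order.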
- Method Detail
  - parseLine
```
protected CrawlDataItem parseLine(String line)
                           throws IOFailure
```
    Parse a crawl.log line into a valid CrawlDataItem.
    If CrawlLogIterator accepts the line, we must make sure the resulting item has an origin, looking up any missing origin in the CDX file. If multiple origins are found in the CDX files, the one that was harvested last is chosen. If no origin can be found, the item is rejected.
    We assume that super.parseLine() delivers the crawl.log items in the given (sorted) order with non-null URLs, although it may throw undeclared exceptions.
    
    Overrides:
    
    parseLine in class CrawlLogIterator
    
    Parameters:
    
    line - A crawl.log line to parse.
    
    Returns:
    
    A CrawlDataItem with a valid origin field, or null if we could not determine an appropriate origin.
    
    Throws:
    
    IOFailure - if there is an error reading the files.
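
    The selection rule described above (latest harvest wins, no candidate means rejection) can be sketched as follows. This is an illustration under assumptions: `OriginChoiceSketch`, `Candidate` and `chooseOrigin` are hypothetical names, and the 14-digit timestamp is an assumed stand-in for whatever date information the real CDX records carry.

```java
import java.util.List;

// Hypothetical illustration; the real selection logic lives in parseLine.
class OriginChoiceSketch {

    /** Simplified candidate: an "arcfile,offset" origin and a harvest timestamp. */
    static final class Candidate {
        final String origin;
        final String timestamp; // assumed format: yyyyMMddHHmmss

        Candidate(String origin, String timestamp) {
            this.origin = origin;
            this.timestamp = timestamp;
        }
    }

    /** Returns the origin of the most recently harvested candidate, or null if there is none. */
    static String chooseOrigin(List<Candidate> candidates) {
        Candidate best = null;
        for (Candidate c : candidates) {
            // Fixed-width timestamps sort chronologically as plain strings.
            if (best == null || c.timestamp.compareTo(best.timestamp) > 0) {
                best = c;
            }
        }
        return best == null ? null : best.origin;
    }
}
```

    Returning null for an empty candidate list mirrors the documented return value when no appropriate origin can be determined.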
