dk.netarkivet.archive.indexserver
Class CDXOriginCrawlLogIterator

java.lang.Object
  extended by is.hi.bok.deduplicator.CrawlDataIterator
      extended by is.hi.bok.deduplicator.CrawlLogIterator
          extended by dk.netarkivet.archive.indexserver.CDXOriginCrawlLogIterator

public class CDXOriginCrawlLogIterator
extends is.hi.bok.deduplicator.CrawlLogIterator

This subclass of CrawlLogIterator adds a layer that extracts an origin of the form "arcfile,offset" from a corresponding CDX index. Crawl-log entries with no matching CDX entry may be skipped as a result. The two files are read in parallel.
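Since both files are sorted by URL, the parallel read amounts to a single-pass merge join. The following self-contained sketch illustrates that idea with plain string records; the class, method, and record layout here are illustrative assumptions, not the actual NetarchiveSuite types:

```java
import java.util.*;

// Illustrative sketch of the parallel read over two URL-sorted inputs.
// Not the real implementation: real code works on crawl.log lines and
// CDXRecord objects, not String arrays.
public class OriginMergeSketch {
    /**
     * Map each crawl-log URL to an "arcfile,offset" origin found in the
     * CDX entries. Crawl-log URLs with no CDX entry are skipped.
     *
     * @param crawlLogUrls URLs from the crawl log, sorted.
     * @param cdx CDX entries as {url, arcfile, offset}, sorted by url.
     */
    public static Map<String, String> findOrigins(List<String> crawlLogUrls,
                                                  List<String[]> cdx) {
        Map<String, String> origins = new LinkedHashMap<>();
        int i = 0;
        for (String url : crawlLogUrls) {
            // Advance the CDX cursor until it catches up with the crawl log.
            while (i < cdx.size() && cdx.get(i)[0].compareTo(url) < 0) {
                i++;
            }
            if (i < cdx.size() && cdx.get(i)[0].equals(url)) {
                origins.put(url, cdx.get(i)[1] + "," + cdx.get(i)[2]);
            }
            // else: no origin in the CDX; this crawl-log entry is skipped.
        }
        return origins;
    }
}
```

Because each cursor only moves forward, both inputs are read exactly once, which is why the class requires the crawl.log to be pre-sorted by URL.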


Field Summary
protected  CDXRecord lastRecord
          The last record we read from the reader.
protected  java.io.BufferedReader reader
          The reader of the (sorted) CDX index.
 
Fields inherited from class is.hi.bok.deduplicator.CrawlLogIterator
crawlDataItemFormat, crawlDateFormat, in, next
 
Constructor Summary
CDXOriginCrawlLogIterator(java.io.File source, java.io.BufferedReader cdx)
          Create a new CDXOriginCrawlLogIterator from crawl.log and CDX sources.
 
Method Summary
protected  is.hi.bok.deduplicator.CrawlDataItem parseLine(java.lang.String line)
          Parse a crawl.log line into a valid CrawlDataItem.
 
Methods inherited from class is.hi.bok.deduplicator.CrawlLogIterator
close, getSourceType, hasNext, next, prepareNext
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

reader

protected java.io.BufferedReader reader
The reader of the (sorted) CDX index.


lastRecord

protected CDXRecord lastRecord
The last record read from the reader. Because the crawl.log may contain entries with no CDX counterpart, the CDX reading can overshoot; this record is held on to until the crawl.log reading catches up.

Constructor Detail

CDXOriginCrawlLogIterator

public CDXOriginCrawlLogIterator(java.io.File source,
                                 java.io.BufferedReader cdx)
                          throws java.io.IOException
Create a new CDXOriginCrawlLogIterator from crawl.log and CDX sources.

Parameters:
source - File containing a crawl.log sorted by URL (LANG=C sort -k 4b)
cdx - A reader of a sorted CDX file. This is given as a reader so that the caller may close it after use (CrawlLogIterator provides no close()).
Throws:
java.io.IOException - If the underlying CrawlLogIterator fails, e.g. due to missing files.
Method Detail

parseLine

protected is.hi.bok.deduplicator.CrawlDataItem parseLine(java.lang.String line)
                                                  throws IOFailure
Parse a crawl.log line into a valid CrawlDataItem. If CrawlLogIterator accepts the line, we ensure the resulting item has an origin by looking up missing ones in the CDX file. If multiple origins are found in the CDX file, the one that was harvested last is chosen. If no origin can be found, the item is rejected. We assume that super.parseLine() delivers the crawl.log items in the given (sorted) order with non-null URLs, though it may throw undeclared exceptions.

Overrides:
parseLine in class is.hi.bok.deduplicator.CrawlLogIterator
Parameters:
line - A crawl.log line to parse.
Returns:
A CrawlDataItem with a valid origin field, or null if we could not determine an appropriate origin.
Throws:
IOFailure - if there is an error reading the files.
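The selection rule above, "latest harvest wins, reject if none", can be sketched as follows. The Entry record and its fields are hypothetical stand-ins for the real CDXRecord; the only assumption carried over from CDX is the sortable 14-digit yyyyMMddHHmmss timestamp:

```java
import java.util.*;

// Illustrative selection of the latest-harvested origin among CDX entries
// for the same URL. Entry and its fields are hypothetical; the real code
// works with dk.netarkivet CDXRecord objects.
public class LatestOriginSketch {
    /** A CDX entry reduced to the fields needed here: a 14-digit
     *  yyyyMMddHHmmss timestamp plus the "arcfile,offset" origin parts. */
    public static final class Entry {
        final String timestamp; // e.g. "20240131120000"
        final String arcfile;
        final long offset;
        public Entry(String timestamp, String arcfile, long offset) {
            this.timestamp = timestamp;
            this.arcfile = arcfile;
            this.offset = offset;
        }
    }

    /**
     * Return the "arcfile,offset" origin of the entry harvested last,
     * or null when no entry exists (the item is then rejected, mirroring
     * parseLine returning null).
     */
    public static String latestOrigin(List<Entry> entriesForUrl) {
        Entry best = null;
        for (Entry e : entriesForUrl) {
            // Fixed-width yyyyMMddHHmmss timestamps compare correctly
            // as plain strings.
            if (best == null || e.timestamp.compareTo(best.timestamp) > 0) {
                best = e;
            }
        }
        return best == null ? null : best.arcfile + "," + best.offset;
    }
}
```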