Package is.hi.bok.deduplicator
Class CrawlLogIterator
- java.lang.Object
  - is.hi.bok.deduplicator.CrawlDataIterator
    - is.hi.bok.deduplicator.CrawlLogIterator
- Direct Known Subclasses:
CDXOriginCrawlLogIterator
public class CrawlLogIterator extends CrawlDataIterator
An implementation of a CrawlDataIterator capable of iterating over a Heritrix-style crawl.log.
- Author:
- Kristinn Sigurðsson, Lars Clausen
-
Field Summary
- protected SimpleDateFormat crawlDataItemFormat
  The date format specified by the CrawlDataItem for dates entered into it (and eventually into the index).
- protected SimpleDateFormat crawlDateFormat
  The date format used in crawl.log files.
- protected String crawlDateFormatStr
- protected SimpleDateFormat fallbackCrawlDateFormat
- protected String fallbackCrawlDateFormatStr
- protected BufferedReader in
  A reader for the crawl.log file being processed.
- protected CrawlDataItem next
  The next item to be issued (if ready), or null if the next item has not been prepared or there are no more elements.
-
Constructor Summary
- CrawlLogIterator(String source)
  Create a new CrawlLogIterator that reads items from a Heritrix crawl.log.
-
Method Summary
- void close()
  Closes the crawl.log file.
- String getSourceType()
  A short, human-readable string describing what source this iterator uses.
- boolean hasNext()
  Returns true if there are more items available.
- CrawlDataItem next()
  Returns the next valid item from the crawl log.
- protected CrawlDataItem parseLine(String line)
  Parse a line in the crawl log.
- protected void prepareNext()
  Ready the next item.
-
-
-
Field Detail
-
crawlDateFormatStr
protected final String crawlDateFormatStr
- See Also:
- Constant Field Values
-
fallbackCrawlDateFormatStr
protected final String fallbackCrawlDateFormatStr
- See Also:
- Constant Field Values
-
crawlDateFormat
protected final SimpleDateFormat crawlDateFormat
The date format used in crawl.log files.
-
fallbackCrawlDateFormat
protected final SimpleDateFormat fallbackCrawlDateFormat
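The three SimpleDateFormat fields suggest a parse-with-fallback step followed by re-formatting for the CrawlDataItem. The following is a minimal sketch of that idea using only the JDK, not the class's actual code. The class name, method names, and all three pattern strings are assumptions made for illustration; the real patterns are the values listed under Constant Field Values.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

// Hypothetical helper, not part of is.hi.bok.deduplicator.
class CrawlDateParsing {
    // ASSUMED patterns; the real ones are crawlDateFormatStr,
    // fallbackCrawlDateFormatStr and the format defined by CrawlDataItem.
    static final String PRIMARY_PATTERN = "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'";
    static final String FALLBACK_PATTERN = "yyyyMMddHHmmssSSS";
    static final String ITEM_PATTERN = "yyyyMMddHHmmssSSS";

    // Build a fresh formatter pinned to UTC for the given pattern.
    static SimpleDateFormat utc(String pattern) {
        SimpleDateFormat format = new SimpleDateFormat(pattern);
        format.setTimeZone(TimeZone.getTimeZone("UTC"));
        return format;
    }

    // Try the primary pattern first; if it does not match, try the fallback.
    // A ParseException from the fallback propagates to the caller.
    static Date parseTimestamp(String text) throws ParseException {
        try {
            return utc(PRIMARY_PATTERN).parse(text);
        } catch (ParseException primaryFailed) {
            return utc(FALLBACK_PATTERN).parse(text);
        }
    }

    // Re-format a crawl.log timestamp into the (assumed) item format.
    static String toItemFormat(String text) throws ParseException {
        return utc(ITEM_PATTERN).format(parseTimestamp(text));
    }

    public static void main(String[] args) throws ParseException {
        System.out.println(toItemFormat("2011-06-23T17:12:08.802Z"));
    }
}
```

SimpleDateFormat instances are not thread-safe, which is why the sketch creates a new formatter per call. A class that holds formatters in instance fields, as this one does, should not be shared between threads without synchronization.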
-
crawlDataItemFormat
protected final SimpleDateFormat crawlDataItemFormat
The date format specified by the CrawlDataItem for dates entered into it (and eventually into the index).
-
in
protected BufferedReader in
A reader for the crawl.log file being processed
-
next
protected CrawlDataItem next
The next item to be issued (if ready) or null if the next item has not been prepared or there are no more elements
-
-
Constructor Detail
-
CrawlLogIterator
public CrawlLogIterator(String source) throws IOException
Create a new CrawlLogIterator that reads items from a Heritrix crawl.log.
- Parameters:
source - The path of a Heritrix crawl.log file.
- Throws:
IOException - If errors were found reading the log.
-
-
Method Detail
-
hasNext
public boolean hasNext() throws IOException
Returns true if there are more items available.
- Specified by:
hasNext in class CrawlDataIterator
- Returns:
- True if at least one more item can be fetched with next().
- Throws:
IOException - If an error occurs accessing the crawl data.
-
next
public CrawlDataItem next() throws IOException
Returns the next valid item from the crawl log.
- Specified by:
next in class CrawlDataIterator
- Returns:
- An item from the crawl log. Note that unlike the Iterator interface, this method returns null if there are no more items to fetch.
- Throws:
IOException - If there is an error reading the item *after* the item to be returned from the crawl.log.
NoSuchElementException - If there are no more items.
-
prepareNext
protected void prepareNext() throws IOException
Ready the next item. This method skips over lines that parseLine(String) rejects. When the method returns, either next is non-null or there are no more items in the crawl log.
Note: This method should only be called when next==null.
- Throws:
IOException
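The interplay of hasNext(), next(), prepareNext() and the next field described above is a look-ahead iterator. The sketch below illustrates that pattern with plain strings instead of CrawlDataItem objects and is not the class's actual source. The class LookAheadLineIterator, its accept/reject rule (skip blank lines and lines starting with '#'), and the readAll helper are hypothetical. It returns null at the end of input, following the Returns note on next().

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in that mirrors the documented method names.
class LookAheadLineIterator {
    private final BufferedReader in;
    private String next; // prepared item, or null if none is ready

    LookAheadLineIterator(BufferedReader in) {
        this.in = in;
    }

    // Stand-in for parseLine: returns a usable item or null to reject.
    // The rule used here is an assumption made for the example.
    protected String parseLine(String line) {
        String trimmed = line.trim();
        return (trimmed.isEmpty() || trimmed.startsWith("#")) ? null : trimmed;
    }

    // Read lines until one is accepted or the input is exhausted.
    // Only meaningful when next == null.
    protected void prepareNext() throws IOException {
        String line;
        while (next == null && (line = in.readLine()) != null) {
            next = parseLine(line);
        }
    }

    boolean hasNext() throws IOException {
        if (next == null) {
            prepareNext();
        }
        return next != null;
    }

    // Returns the prepared item and clears the slot; null when exhausted.
    String next() throws IOException {
        if (!hasNext()) {
            return null;
        }
        String item = next;
        next = null;
        return item;
    }

    void close() throws IOException {
        in.close();
    }

    // Usage: drain all accepted items, always closing the reader.
    static List<String> readAll(String text) throws IOException {
        LookAheadLineIterator it =
                new LookAheadLineIterator(new BufferedReader(new StringReader(text)));
        List<String> items = new ArrayList<>();
        try {
            while (it.hasNext()) {
                items.add(it.next());
            }
        } finally {
            it.close();
        }
        return items;
    }
}
```

Because the look-ahead read happens inside hasNext(), an I/O failure surfaces one item early, which matches the documented IOException on next() for an error reading the item after the one being returned.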
-
parseLine
protected CrawlDataItem parseLine(String line)
Parse a line in the crawl log.
Override this method to change how individual crawl log items are processed and accepted/rejected. This method is called from within the loop in prepareNext().
- Parameters:
line - A line from the crawl log. Must not be null.
- Returns:
- A CrawlDataItem if the line yielded a usable item, null otherwise.
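To make the accept/reject contract concrete, here is a rough sketch of what parsing one line might look like. It is not the library's implementation. It assumes the whitespace-separated field order of a Heritrix crawl.log (timestamp, status code, size, URI, discovery path, referrer, MIME type, thread, fetch time, digest, source tag, annotations). The Entry type and the rejection rules (too few fields, non-numeric or non-positive status) are hypothetical.

```java
// Hypothetical sketch; Entry stands in for CrawlDataItem.
class CrawlLogLineSketch {
    static final class Entry {
        final String timestamp;
        final int status;
        final String url;
        final String mimeType;
        final String digest;

        Entry(String timestamp, int status, String url, String mimeType, String digest) {
            this.timestamp = timestamp;
            this.status = status;
            this.url = url;
            this.mimeType = mimeType;
            this.digest = digest;
        }
    }

    // Returns an Entry for a usable line, or null to reject it.
    static Entry parseLine(String line) {
        // Assumed layout: at most 12 whitespace-separated fields, the last
        // (annotations) possibly containing further whitespace.
        String[] fields = line.trim().split("\\s+", 12);
        if (fields.length < 10) {
            return null; // too short to carry a digest column
        }
        int status;
        try {
            status = Integer.parseInt(fields[1]);
        } catch (NumberFormatException e) {
            return null; // status column is not a number
        }
        if (status <= 0) {
            return null; // assumed rule: skip failed or unfetched URIs
        }
        return new Entry(fields[0], status, fields[3], fields[6], fields[9]);
    }

    public static void main(String[] args) {
        Entry e = parseLine("2011-06-23T17:12:08.802Z 200 1234 http://example.org/ L "
                + "http://example.org/index.html text/html #042 20110623171208500+302 "
                + "sha1:AAAABBBBCCCCDDDDEEEEFFFFGGGGHHHH - -");
        System.out.println(e.url + " " + e.digest);
    }
}
```

A subclass that overrides parseLine(String) can apply a different rule in the same way, returning null for any line it wants prepareNext() to skip.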
-
close
public void close() throws IOException
Closes the crawl.log file.
- Specified by:
close in class CrawlDataIterator
- Throws:
IOException - If an error occurs closing access to crawl data.
-
getSourceType
public String getSourceType()
Description copied from class: CrawlDataIterator
A short, human-readable string describing what source this iterator uses, e.g. "Iterator for Heritrix style crawl.log".
- Specified by:
getSourceType in class CrawlDataIterator
- Returns:
- A short, human-readable string describing what source this iterator uses.
-
-