Class CrawlLogIterator

    • Field Detail

      • crawlDateFormat

        protected final SimpleDateFormat crawlDateFormat
        The date format used in crawl.log files.
      • fallbackCrawlDateFormat

        protected final SimpleDateFormat fallbackCrawlDateFormat
      • crawlDataItemFormat

        protected final SimpleDateFormat crawlDataItemFormat
        The date format specified by the CrawlDataItem for dates entered into it (and eventually into the index)
      • in

        protected BufferedReader in
        A reader for the crawl.log file being processed
      • next

        protected CrawlDataItem next
        The next item to be issued, if one is ready. It is null if the next item has not been prepared yet or there are no more elements.
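The three date format fields suggest a parse-then-reformat flow: read a crawl.log timestamp with the primary format, try the fallback if that fails, and write the result in the item format. The sketch below illustrates that idea only. The class name, the method names, and all three patterns are assumptions made for this example; the patterns the real fields use are not stated on this page.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical stand-in; not part of the documented API.
class CrawlDateFallbackSketch {
    // Assumed patterns, chosen for illustration only.
    static final String PRIMARY = "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'";
    static final String FALLBACK = "yyyyMMddHHmmssSSS";
    static final String ITEM = "yyyy-MM-dd'T'HH:mm:ss'Z'";

    // SimpleDateFormat is not thread-safe, so a fresh instance is built per call.
    static SimpleDateFormat utc(String pattern) {
        SimpleDateFormat format = new SimpleDateFormat(pattern, Locale.ROOT);
        format.setTimeZone(TimeZone.getTimeZone("UTC"));
        format.setLenient(false);
        return format;
    }

    // Returns the timestamp rewritten in the item format,
    // or null if neither the primary nor the fallback pattern matches.
    static String normalize(String logTimestamp) {
        Date parsed;
        try {
            parsed = utc(PRIMARY).parse(logTimestamp);
        } catch (ParseException primaryFailure) {
            try {
                parsed = utc(FALLBACK).parse(logTimestamp);
            } catch (ParseException fallbackFailure) {
                return null;
            }
        }
        return utc(ITEM).format(parsed);
    }

    public static void main(String[] args) {
        System.out.println(normalize("2011-06-23T17:12:08.802Z"));
        System.out.println(normalize("20110623171208802"));
    }
}
```

Trying the stricter, delimiter-based pattern first means a compact all-digit timestamp fails fast on the first literal and falls through to the fallback.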
    • Constructor Detail

      • CrawlLogIterator

        public CrawlLogIterator(String source)
                         throws IOException
        Creates a new CrawlLogIterator that reads items from a Heritrix crawl.log file.
        Parameters:
        source - The path of a Heritrix crawl.log file.
        Throws:
        IOException - If errors were found reading the log.
    • Method Detail

      • hasNext

        public boolean hasNext()
                        throws IOException
        Returns true if there are more items available.
        Specified by:
        hasNext in class CrawlDataIterator
        Returns:
        True if at least one more item can be fetched with next().
        Throws:
        IOException - If an error occurs accessing the crawl data.
      • next

        public CrawlDataItem next()
                           throws IOException
        Returns the next valid item from the crawl log.
        Specified by:
        next in class CrawlDataIterator
        Returns:
        An item from the crawl log. Note that unlike the Iterator interface, this method returns null if there are no more items to fetch.
        Throws:
        IOException - If there is an error reading the item *after* the item to be returned from the crawl.log.
        NoSuchElementException - If there are no more items
      • prepareNext

        protected void prepareNext()
                            throws IOException
        Ready the next item. This method skips over items that parseLine(String) rejects. When the method returns, either next is non-null or there are no more items in the crawl log.

        Note: This method should only be called when next == null.

        Throws:
        IOException
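The interplay of hasNext(), next() and prepareNext() described above is a look-ahead pattern: one item is buffered in the next field, and rejected lines are skipped until an item is accepted or the input ends. The sketch below is a self-contained illustration, not the real class. LookAheadSketch, its String items, the blank-line filter and the collect helper are all hypothetical, and it follows the "returns null when exhausted" behaviour from the Returns note of next().

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the look-ahead pattern; not the real CrawlLogIterator.
class LookAheadSketch {
    private final BufferedReader in;
    private String next; // null until prepared, or when the input is exhausted

    LookAheadSketch(BufferedReader in) {
        this.in = in;
    }

    boolean hasNext() throws IOException {
        if (next == null) {
            prepareNext();
        }
        return next != null;
    }

    // Returns the buffered item and clears the buffer; null when nothing is left.
    String next() throws IOException {
        if (!hasNext()) {
            return null;
        }
        String item = next;
        next = null;
        return item;
    }

    // Only called when next == null. Reads until a line is accepted or input ends.
    private void prepareNext() throws IOException {
        String line;
        while (next == null && (line = in.readLine()) != null) {
            next = parseLine(line);
        }
    }

    // Stand-in filter (an assumption): accepts any non-blank line.
    String parseLine(String line) {
        String trimmed = line.trim();
        return trimmed.isEmpty() ? null : trimmed;
    }

    // Convenience helper for the example: drains all accepted items from a string.
    static List<String> collect(String logText) {
        LookAheadSketch it = new LookAheadSketch(new BufferedReader(new StringReader(logText)));
        List<String> items = new ArrayList<>();
        try {
            String item;
            while ((item = it.next()) != null) {
                items.add(item);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return items;
    }

    public static void main(String[] args) {
        System.out.println(collect("first\n\n   \nsecond\n"));
    }
}
```

Because the buffer is refilled lazily inside hasNext(), a read error surfaces on the call that tries to look ahead, which matches the IOException note on next().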
      • parseLine

        protected CrawlDataItem parseLine(String line)
        Parse a line in the crawl log.

        Override this method to change how individual crawl log items are processed and accepted/rejected. This method is called from within the loop in prepareNext().

        Parameters:
        line - A line from the crawl log. Must not be null.
        Returns:
        A CrawlDataItem if the next line in the crawl log yielded a usable item, null otherwise.
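The accept/reject contract of parseLine can be sketched as a function that returns an item for usable lines and null otherwise. The example below is a minimal, standalone illustration under stated assumptions: it assumes whitespace-separated fields with the timestamp first, the status code second and the URI fourth, and it uses a hypothetical Item class in place of CrawlDataItem. The rejection rules (too few fields, non-numeric or non-positive status) are illustrative choices, not the rules the real method applies.

```java
// Hypothetical stand-in for a parseLine implementation; not the real method.
class CrawlLineSketch {
    // Simplified placeholder for CrawlDataItem.
    static final class Item {
        final String timestamp;
        final int status;
        final String url;

        Item(String timestamp, int status, String url) {
            this.timestamp = timestamp;
            this.status = status;
            this.url = url;
        }
    }

    // Returns an Item for a usable line, or null if the line is rejected.
    static Item parseLine(String line) {
        String[] fields = line.trim().split("\\s+");
        if (fields.length < 4) {
            return null; // assumed minimum: timestamp, status, size, URI
        }
        int status;
        try {
            status = Integer.parseInt(fields[1]);
        } catch (NumberFormatException e) {
            return null; // status column is not a number
        }
        if (status <= 0) {
            return null; // illustrative rule: skip non-positive status codes
        }
        return new Item(fields[0], status, fields[3]);
    }

    public static void main(String[] args) {
        Item item = parseLine("2011-06-23T17:12:08.802Z   200   1299 http://example.org/ - - text/html");
        System.out.println(item.url);
    }
}
```

Returning null rather than throwing keeps the calling loop simple: it reads the next line and tries again.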
      • getSourceType

        public String getSourceType()
        Description copied from class: CrawlDataIterator
        A short, human-readable string describing the source this iterator uses, e.g. "Iterator for Heritrix style crawl.log".
        Specified by:
        getSourceType in class CrawlDataIterator
        Returns:
        A short, human-readable string describing the source this iterator uses.