Class CrawlLogIterator

    • Field Summary

      Fields 
      Modifier and Type Field Description
      protected java.text.SimpleDateFormat crawlDataItemFormat
      The date format specified by the CrawlDataItem for dates entered into it (and eventually into the index)
      protected java.text.SimpleDateFormat crawlDateFormat
      The date format used in crawl.log files.
      protected java.lang.String crawlDateFormatStr  
      protected java.text.SimpleDateFormat fallbackCrawlDateFormat  
      protected java.lang.String fallbackCrawlDateFormatStr  
      protected java.io.BufferedReader in
      A reader for the crawl.log file being processed
      protected CrawlDataItem next
      The next item to be issued (if ready) or null if the next item has not been prepared or there are no more elements
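
      The three date-format fields suggest a parse-then-reformat flow: read the crawl.log timestamp with the primary format, retry with the fallback, then write it out in the CrawlDataItem format. A minimal sketch of that flow follows. The class name, the method name and all three pattern strings are assumptions for illustration; they are not values read from CrawlLogIterator.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Illustrative only: every pattern below is an assumed value.
class DateFallbackSketch {
    static final String PRIMARY = "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"; // assumed crawl.log format
    static final String FALLBACK = "yyyyMMddHHmmssSSS";            // assumed older format
    static final String OUTPUT = "yyyyMMddHHmmssSSS";              // assumed item format

    // SimpleDateFormat is not thread-safe, so a fresh instance is built per call.
    static SimpleDateFormat utc(String pattern) {
        SimpleDateFormat format = new SimpleDateFormat(pattern, Locale.ROOT);
        format.setTimeZone(TimeZone.getTimeZone("UTC"));
        format.setLenient(false);
        return format;
    }

    // Returns the timestamp in the output format, or null if neither pattern matches.
    static String normalize(String logDate) {
        Date parsed;
        try {
            parsed = utc(PRIMARY).parse(logDate);
        } catch (ParseException primaryFailure) {
            try {
                parsed = utc(FALLBACK).parse(logDate);
            } catch (ParseException fallbackFailure) {
                return null;
            }
        }
        return utc(OUTPUT).format(parsed);
    }

    public static void main(String[] args) {
        System.out.println(normalize("2008-06-17T14:07:14.467Z"));
        System.out.println(normalize("20080617140714467"));
    }
}
```

      Returning null for an unparseable date is a design choice of this sketch; it lets a caller treat the line as rejected instead of aborting the whole read.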
    • Constructor Summary

      Constructors 
      Constructor Description
      CrawlLogIterator(java.lang.String source)
      Creates a new CrawlLogIterator that reads items from a Heritrix crawl.log file.
    • Method Summary

      Modifier and Type Method Description
      void close()
      Closes the crawl.log file.
      java.lang.String getSourceType()
      A short, human-readable string describing the source this iterator uses.
      boolean hasNext()
      Returns true if there are more items available.
      CrawlDataItem next()
      Returns the next valid item from the crawl log.
      protected CrawlDataItem parseLine(java.lang.String line)
      Parses a line from the crawl log.
      protected void prepareNext()
      Ready the next item.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Constructor Detail

      • CrawlLogIterator

        public CrawlLogIterator(java.lang.String source)
                         throws java.io.IOException
        Creates a new CrawlLogIterator that reads items from a Heritrix crawl.log file.
        Parameters:
        source - The path of a Heritrix crawl.log file.
        Throws:
        java.io.IOException - If errors were found reading the log.
    • Method Detail

      • hasNext

        public boolean hasNext()
                        throws java.io.IOException
        Returns true if there are more items available.
        Specified by:
        hasNext in class CrawlDataIterator
        Returns:
        True if at least one more item can be fetched with next().
        Throws:
        java.io.IOException - If an error occurs accessing the crawl data.
      • next

        public CrawlDataItem next()
                           throws java.io.IOException
        Returns the next valid item from the crawl log.
        Specified by:
        next in class CrawlDataIterator
        Returns:
        An item from the crawl log. Note that unlike the Iterator interface, this method returns null if there are no more items to fetch.
        Throws:
        java.io.IOException - If there is an error reading the item *after* the item to be returned from the crawl.log.
        java.util.NoSuchElementException - If there are no more items
      • prepareNext

        protected void prepareNext()
                            throws java.io.IOException
        Readies the next item. This method skips over lines that parseLine() rejects. When it returns, either next is non-null or there are no more items in the crawl log.

        Note: This method should only be called when next == null.

        Throws:
        java.io.IOException
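
        The look-ahead behaviour described above can be sketched as follows. This is an illustrative stand-in and not the actual CrawlLogIterator source: the class name LookAheadSketch, the readAll helper and the acceptance rule inside parseLine are all hypothetical, and items are plain strings instead of CrawlDataItem objects.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Illustrative stand-in, NOT the real CrawlLogIterator.
class LookAheadSketch {
    private final BufferedReader in;
    private String next; // prepared item, or null if none is ready

    LookAheadSketch(BufferedReader in) {
        this.in = in;
    }

    // Hypothetical acceptance rule: reject blank lines and '#' comments.
    protected String parseLine(String line) {
        String trimmed = line.trim();
        return (trimmed.isEmpty() || trimmed.startsWith("#")) ? null : trimmed;
    }

    // Reads ahead until a line is accepted or the input is exhausted.
    protected void prepareNext() throws IOException {
        String line;
        while (next == null && (line = in.readLine()) != null) {
            next = parseLine(line);
        }
    }

    public boolean hasNext() throws IOException {
        if (next == null) {
            prepareNext();
        }
        return next != null;
    }

    // Returns null when exhausted, mirroring the documented return value.
    public String next() throws IOException {
        if (!hasNext()) {
            return null;
        }
        String item = next;
        next = null; // forces prepareNext() on the following call
        return item;
    }

    public void close() throws IOException {
        in.close();
    }

    // Drains the given text, always closing the reader afterwards.
    static List<String> readAll(String text) {
        LookAheadSketch it = new LookAheadSketch(new BufferedReader(new StringReader(text)));
        List<String> items = new ArrayList<>();
        try {
            try {
                while (it.hasNext()) {
                    items.add(it.next());
                }
            } finally {
                it.close();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return items;
    }

    public static void main(String[] args) {
        // prints [first, second]
        System.out.println(readAll("first\n\n# skipped\nsecond\n"));
    }
}
```

        Because hasNext() does the reading, an I/O failure can surface there or in next(); the try/finally in readAll shows one way to make sure close() still runs.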
      • parseLine

        protected CrawlDataItem parseLine(java.lang.String line)
        Parses a line from the crawl log.

        Override this method to change how individual crawl log items are processed and accepted/rejected. This method is called from within the loop in prepareNext().

        Parameters:
        line - A line from the crawl log. Must not be null.
        Returns:
        A CrawlDataItem if the next line in the crawl log yielded a usable item, null otherwise.
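
        To illustrate the accept-or-return-null contract, here is a minimal, self-contained sketch of a line parser. It assumes the crawl.log line is whitespace-separated with the timestamp, fetch status, size and URI as its first four fields, and it assumes that non-positive status codes should be rejected. ParsedEntry and ParseLineSketch are hypothetical names; the real method returns a CrawlDataItem and may apply different rules.

```java
// Hypothetical value type for the demo; it stands in for CrawlDataItem.
class ParsedEntry {
    final String timestamp;
    final int status;
    final String url;

    ParsedEntry(String timestamp, int status, String url) {
        this.timestamp = timestamp;
        this.status = status;
        this.url = url;
    }
}

class ParseLineSketch {
    // Returns a ParsedEntry for usable lines and null for rejected ones.
    static ParsedEntry parseLine(String line) {
        String[] fields = line.trim().split("\\s+");
        if (fields.length < 4) {
            return null; // too few fields to hold timestamp, status, size and URI
        }
        int status;
        try {
            status = Integer.parseInt(fields[1]);
        } catch (NumberFormatException e) {
            return null; // status column is not a number
        }
        if (status <= 0) {
            return null; // assumed rule: non-positive codes mark failed fetches
        }
        return new ParsedEntry(fields[0], status, fields[3]);
    }

    public static void main(String[] args) {
        ParsedEntry entry = parseLine(
                "2008-06-17T14:07:14.467Z   200       1234 http://example.org/ L - text/html");
        System.out.println(entry.status + " " + entry.url);
    }
}
```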
      • close

        public void close()
                   throws java.io.IOException
        Closes the crawl.log file.
        Specified by:
        close in class CrawlDataIterator
        Throws:
        java.io.IOException - If an error occurs closing access to crawl data.
      • getSourceType

        public java.lang.String getSourceType()
        Description copied from class: CrawlDataIterator
        A short, human-readable string describing the source this iterator uses, e.g. "Iterator for Heritrix style crawl.log".
        Specified by:
        getSourceType in class CrawlDataIterator
        Returns:
        A short, human-readable string describing the source this iterator uses.