Class HeritrixLauncherAbstract

  • Direct Known Subclasses:
    HeritrixLauncher

    public abstract class HeritrixLauncherAbstract
    extends Object
    A HeritrixLauncher object wraps around an instance of the web crawler Heritrix3. The object is constructed with the necessary information to do a crawl. The crawl is performed when doCrawl() is called. doCrawl() monitors progress and returns when the crawl is finished or must be stopped because it has stalled.
    • Field Detail

      • CRAWL_CONTROL_WAIT_PERIOD

        protected static final int CRAWL_CONTROL_WAIT_PERIOD
        The period to wait in seconds before checking if Heritrix3 has done anything.
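The wait period suggests a polling loop: sleep, check whether the crawler has made progress, and give up once it has been idle for too long. The following is only a minimal sketch of that idea, not the NetarchiveSuite implementation. The class CrawlMonitorSketch, the Outcome enum, the two supplier parameters, and the maxIdleChecks threshold are all hypothetical names introduced for illustration. The sketch takes its wait in milliseconds, whereas CRAWL_CONTROL_WAIT_PERIOD is documented in seconds.

```java
import java.util.function.BooleanSupplier;
import java.util.function.LongSupplier;

/** Hypothetical stand-in for a crawl monitor; not part of NetarchiveSuite or Heritrix3. */
final class CrawlMonitorSketch {
    enum Outcome { FINISHED, STALLED }

    /**
     * Polls until the crawl reports that it is finished, or until the progress
     * counter has not changed for maxIdleChecks consecutive checks.
     *
     * @param finished      assumed callback: true once the crawl has ended
     * @param progress      assumed callback: any counter that changes while the crawler works
     * @param waitMillis    pause between two checks, in milliseconds
     * @param maxIdleChecks number of unchanged checks after which the crawl counts as stalled
     */
    static Outcome monitor(BooleanSupplier finished, LongSupplier progress,
                           long waitMillis, int maxIdleChecks) {
        long last = progress.getAsLong();
        int idleChecks = 0;
        while (!finished.getAsBoolean()) {
            try {
                Thread.sleep(waitMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // keep the interrupt flag set for callers
                throw new IllegalStateException("Interrupted while monitoring the crawl", e);
            }
            long now = progress.getAsLong();
            if (now != last) {
                last = now;      // the crawler did something: reset the idle counter
                idleChecks = 0;
            } else if (++idleChecks >= maxIdleChecks) {
                return Outcome.STALLED;
            }
        }
        return Outcome.FINISHED;
    }
}
```

A caller would pass its own callbacks, for example CrawlMonitorSketch.monitor(isDone, bytesWritten, 20_000L, 30), where isDone and bytesWritten are whatever status queries the chosen controller offers.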
    • Constructor Detail

      • HeritrixLauncherAbstract

        protected HeritrixLauncherAbstract​(Heritrix3Files files)
                                    throws ArgumentNotValid
        Protected constructor. Sets up the launcher from the order file and seeds file referenced by the given Heritrix3Files object.
        Parameters:
        files - Object encapsulating location of Heritrix3 crawldir and configuration files.
        Throws:
        ArgumentNotValid - If either seedsfile or orderfile does not exist.
      • HeritrixLauncherAbstract

        public HeritrixLauncherAbstract​(Object... args)
        Generic constructor to allow HeritrixLauncher to use any implementation of HeritrixController.
        Parameters:
        args - the arguments to be passed to the constructor or non-static factory method of the HeritrixController class specified in settings
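Passing an Object... array through to a class named in settings is typically done with reflection. The sketch below shows one way this could work under stated assumptions: it looks for a public constructor whose parameter types accept the given arguments. ControllerFactorySketch and instantiate are hypothetical names, and how the settings are read and how NetarchiveSuite resolves the controller class in practice are not shown here. The matching is deliberately simple and does not handle primitive parameter types or factory methods.

```java
import java.lang.reflect.Constructor;

/** Hypothetical illustration of reflective construction; not the NetarchiveSuite factory. */
final class ControllerFactorySketch {

    /** Creates an instance of className using the first public constructor that fits args. */
    static Object instantiate(String className, Object... args) {
        try {
            Class<?> type = Class.forName(className);
            for (Constructor<?> ctor : type.getConstructors()) {
                if (accepts(ctor.getParameterTypes(), args)) {
                    return ctor.newInstance(args);
                }
            }
            throw new IllegalArgumentException(
                    "No public constructor of " + className + " accepts the given arguments");
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("Could not instantiate " + className, e);
        }
    }

    /** True if every argument is null or an instance of the corresponding parameter type. */
    private static boolean accepts(Class<?>[] parameterTypes, Object[] args) {
        if (parameterTypes.length != args.length) {
            return false;
        }
        for (int i = 0; i < args.length; i++) {
            // Primitive parameter types never match here; a fuller version would unbox.
            if (args[i] != null && !parameterTypes[i].isInstance(args[i])) {
                return false;
            }
        }
        return true;
    }
}
```

For example, ControllerFactorySketch.instantiate("java.lang.StringBuilder", "abc") builds a StringBuilder through one of its single-argument constructors.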
    • Method Detail

      • doCrawl

        public abstract void doCrawl()
                              throws IOFailure
        Launches the crawl and monitors its progress.
        Throws:
        IOFailure
      • getHeritrixFiles

        protected Heritrix3Files getHeritrixFiles()
        Returns:
        an instance of the wrapper class for Heritrix files.
      • getControllerArguments

        protected Object[] getControllerArguments()
        Returns:
        the optional arguments used to initialize the chosen Heritrix controller implementation.
      • makeTemplateReadyForHeritrix3

        public static void makeTemplateReadyForHeritrix3​(String jobName,
                                                         Heritrix3Files files)
                                                  throws IOFailure
        This method prepares the crawler-beans.cxml file (the orderfile) used by the Heritrix3 crawler.

        1. alters the crawler-beans.cxml in the following way (overriding whatever is in the crawler-beans.cxml):
           a. sets the prefix of the archive files to the unique prefix defined in Heritrix3Files
           b. if deduplication is enabled, sets the node pointing to the index directory for deduplication (see step 3)
        2. saves the orderfile back to disk
        3. if deduplication is enabled in the orderfile, writes the absolute path of the Lucene index used by the deduplication processor

        Throws:
        IOFailure - When the orderfile could not be saved to disk
        IllegalState - When the orderfile is not a H3 template
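The preparation steps can be pictured as a read, substitute, write-back cycle on the template file. The sketch below illustrates that cycle only. It assumes the template marks the values to fill in with placeholder tokens. The token strings, the class TemplatePrepSketch, and its methods are hypothetical, and this page does not say how NetarchiveSuite locates the nodes it alters.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical illustration of template preparation; not the NetarchiveSuite code. */
final class TemplatePrepSketch {
    // Assumed placeholder tokens, invented for this example.
    static final String PREFIX_TOKEN = "@ARCHIVE_PREFIX@";
    static final String DEDUP_INDEX_TOKEN = "@DEDUP_INDEX_DIR@";

    /** Step 1: fill in the archive prefix and, if given, the absolute index directory. */
    static String fillTemplate(String template, String archivePrefix, Path dedupIndexDir) {
        if (!template.contains(PREFIX_TOKEN)) {
            // Mirrors the documented IllegalState case: the file is not a usable template.
            throw new IllegalStateException("Template lacks " + PREFIX_TOKEN);
        }
        String result = template.replace(PREFIX_TOKEN, archivePrefix);
        if (dedupIndexDir != null) { // null stands for "deduplication disabled" in this sketch
            result = result.replace(DEDUP_INDEX_TOKEN, dedupIndexDir.toAbsolutePath().toString());
        }
        return result;
    }

    /** Steps 1 and 2: read the file, fill it in, and save it back to the same path. */
    static void prepare(Path templateFile, String archivePrefix, Path dedupIndexDir) {
        try {
            String template = new String(Files.readAllBytes(templateFile), StandardCharsets.UTF_8);
            String filled = fillTemplate(template, archivePrefix, dedupIndexDir);
            Files.write(templateFile, filled.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            // Plays the role of the documented IOFailure when the file cannot be saved.
            throw new UncheckedIOException("Could not update " + templateFile, e);
        }
    }
}
```

Keeping the substitution in a pure function (fillTemplate) separate from the file handling (prepare) makes the rewrite logic easy to check without touching the disk.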