dk.netarkivet.harvester.harvesting
Class HeritrixLauncher

java.lang.Object
  extended by dk.netarkivet.harvester.harvesting.HeritrixLauncher
Direct Known Subclasses:
BnfHeritrixLauncher, DefaultHeritrixLauncher

public abstract class HeritrixLauncher
extends java.lang.Object

A HeritrixLauncher object wraps an instance of the web crawler Heritrix. The object is constructed with the information necessary to do a crawl. The crawl is performed when doCrawl() is called. doCrawl() monitors progress and returns when the crawl is finished or must be stopped because it has stalled.


Field Summary
protected static int CRAWL_CONTROL_WAIT_PERIOD
          The period to wait in seconds before checking if Heritrix has done anything.
(package private) static java.lang.String DEDUPLICATOR_ENABLED
          XPath for the boolean telling if the deduplicator is enabled in order.xml documents.
(package private) static java.lang.String DEDUPLICATOR_INDEX_LOCATION_XPATH
          XPath for the deduplicator index directory node in order.xml documents.
(package private)  org.apache.commons.logging.Log log
          The class logger.
 
Constructor Summary
protected HeritrixLauncher(HeritrixFiles files)
          Protected HeritrixLauncher constructor.
  HeritrixLauncher(java.lang.Object... args)
          Generic constructor to allow HeritrixLauncher to use any implementation of HeritrixController.
 
Method Summary
abstract  void doCrawl()
          Launches the crawl and monitors its progress.
protected  java.lang.Object[] getControllerArguments()
           
protected  HeritrixFiles getHeritrixFiles()
           
 JMSConnection getJMSConnection()
           
static boolean isDeduplicationEnabledInTemplate(org.dom4j.Document doc)
          Return true if the given order.xml file has deduplication enabled.
 void setupOrderfile()
          This method prepares the orderfile used by the Heritrix crawler.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

CRAWL_CONTROL_WAIT_PERIOD

protected static final int CRAWL_CONTROL_WAIT_PERIOD
The period to wait in seconds before checking if Heritrix has done anything.


DEDUPLICATOR_INDEX_LOCATION_XPATH

static final java.lang.String DEDUPLICATOR_INDEX_LOCATION_XPATH
XPath for the deduplicator index directory node in order.xml documents.

See Also:
Constant Field Values

DEDUPLICATOR_ENABLED

static final java.lang.String DEDUPLICATOR_ENABLED
XPath for the boolean telling if the deduplicator is enabled in order.xml documents.

See Also:
Constant Field Values

log

final org.apache.commons.logging.Log log
The class logger.

Constructor Detail

HeritrixLauncher

protected HeritrixLauncher(HeritrixFiles files)
                    throws ArgumentNotValid
Protected HeritrixLauncher constructor. Sets up the HeritrixLauncher from the given order file and seedsfile.

Parameters:
files - Object encapsulating location of Heritrix crawldir and configuration files.
Throws:
ArgumentNotValid - If either seedsfile or orderfile does not exist.

HeritrixLauncher

public HeritrixLauncher(java.lang.Object... args)
Generic constructor to allow HeritrixLauncher to use any implementation of HeritrixController.

Parameters:
args - the arguments to be passed to the constructor or non-static factory method of the HeritrixController class specified in settings
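The forwarding of arbitrary arguments to a configurable controller class can be illustrated with plain Java reflection. This is a minimal sketch, not the NetarchiveSuite implementation: the Controller interface, the DemoController class and the create method are hypothetical stand-ins, and the real class is looked up from settings rather than passed in directly.

```java
import java.lang.reflect.Constructor;

public class ControllerFactorySketch {

    /** Hypothetical stand-in for the HeritrixController interface. */
    public interface Controller {
        String describe();
    }

    /** Hypothetical implementation whose constructor receives the forwarded arguments. */
    public static class DemoController implements Controller {
        private final String crawlDir;
        private final Integer port;

        public DemoController(String crawlDir, Integer port) {
            this.crawlDir = crawlDir;
            this.port = port;
        }

        @Override
        public String describe() {
            return crawlDir + ":" + port;
        }
    }

    /**
     * Finds a public constructor whose parameter types accept the given
     * arguments and invokes it. Only reference-typed parameters are matched;
     * constructors with primitive parameters are skipped in this sketch.
     */
    public static Controller create(String className, Object... args) throws Exception {
        Class<?> cls = Class.forName(className);
        for (Constructor<?> candidate : cls.getConstructors()) {
            Class<?>[] types = candidate.getParameterTypes();
            if (types.length != args.length) {
                continue;
            }
            boolean matches = true;
            for (int i = 0; i < types.length; i++) {
                if (args[i] != null && !types[i].isInstance(args[i])) {
                    matches = false;
                    break;
                }
            }
            if (matches) {
                return (Controller) candidate.newInstance(args);
            }
        }
        throw new IllegalArgumentException("No matching constructor in " + className);
    }

    public static void main(String[] unused) throws Exception {
        Controller c = create(DemoController.class.getName(), "/tmp/crawl", 8192);
        System.out.println(c.describe()); // prints "/tmp/crawl:8192"
    }
}
```

Passing the arguments as Object... keeps the launcher independent of any one controller's constructor signature, at the cost of deferring type errors to run time.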
Method Detail

doCrawl

public abstract void doCrawl()
                      throws IOFailure
Launches the crawl and monitors its progress.

Throws:
IOFailure
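
A subclass might implement the monitoring part of doCrawl() as a polling loop that wakes up at a fixed interval (compare CRAWL_CONTROL_WAIT_PERIOD) and gives up when no progress is seen. The sketch below is only an illustration under assumptions: the monitor method, the Outcome enum, the progress counter and the maxIdleChecks threshold are all hypothetical and do not come from the actual subclasses.

```java
import java.util.function.BooleanSupplier;
import java.util.function.LongSupplier;

public class CrawlMonitorSketch {

    /** Hypothetical result of monitoring a crawl. */
    public enum Outcome { FINISHED, STALLED }

    /**
     * Polls until the crawl reports that it is finished, or until the
     * progress counter has not increased for maxIdleChecks consecutive checks.
     *
     * @param finished      tells whether the crawl has ended
     * @param progress      a counter that grows while the crawler is active
     * @param waitMillis    time to sleep between checks
     * @param maxIdleChecks number of checks without progress that counts as a stall
     */
    public static Outcome monitor(BooleanSupplier finished, LongSupplier progress,
                                  long waitMillis, int maxIdleChecks)
            throws InterruptedException {
        long lastSeen = progress.getAsLong();
        int idleChecks = 0;
        while (!finished.getAsBoolean()) {
            Thread.sleep(waitMillis);
            long current = progress.getAsLong();
            if (current > lastSeen) {
                // The crawler did something since the last check.
                lastSeen = current;
                idleChecks = 0;
            } else if (++idleChecks >= maxIdleChecks) {
                return Outcome.STALLED;
            }
        }
        return Outcome.FINISHED;
    }

    public static void main(String[] args) throws InterruptedException {
        // A crawl that never finishes and never progresses is reported as stalled.
        System.out.println(monitor(() -> false, () -> 0L, 10L, 3)); // prints "STALLED"
    }
}
```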

setupOrderfile

public void setupOrderfile()
                    throws IOFailure
This method prepares the orderfile used by the Heritrix crawler.

1. Alters the orderfile in the following ways, overriding whatever is already in the orderfile:
  1. sets the disk-path to the outputdir specified in HeritrixFiles.
  2. sets the seedsfile to the seedsfile specified in HeritrixFiles.
  3. sets the prefix of the arcfiles to the unique prefix defined in HeritrixFiles.
  4. checks that the arcs-file dir is 'arcs', to ensure that we know where the arc-files are when the crawl finishes.
  5. if deduplication is enabled, sets the node pointing to the index directory for deduplication (see step 3).
2. Saves the orderfile back to disk.
3. If deduplication is enabled in the order.xml, writes the absolute path of the Lucene index used by the deduplication processor.

Throws:
IOFailure - When the orderfile could not be saved to disk; when a specific node is not found in the XML document; or when the SAXReader cannot parse the XML.
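
The "locate a node by XPath, overwrite it, save the document" pattern described above can be sketched with the XML APIs bundled with the JDK. Note the assumptions: the real method works on a dom4j Document and reports errors as IOFailure, whereas this sketch swaps in org.w3c.dom and IllegalStateException, and the element names and XPaths are simplified placeholders rather than the real order.xml layout.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class OrderFileSketch {

    /** Parses an XML string into a DOM document. */
    public static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    /** Overwrites the text of the node selected by xpathExpr; fails if it is missing. */
    public static void setNode(Document doc, String xpathExpr, String value) throws Exception {
        Node node = (Node) XPathFactory.newInstance().newXPath()
                .evaluate(xpathExpr, doc, XPathConstants.NODE);
        if (node == null) {
            // The real method signals this case with an IOFailure.
            throw new IllegalStateException("Node not found: " + xpathExpr);
        }
        node.setTextContent(value);
    }

    /** Serializes the document; a real implementation would write to the orderfile. */
    public static String serialize(Document doc) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Simplified placeholder document, not a real Heritrix order.xml.
        Document doc = parse("<order><disk-path>old</disk-path>"
                + "<seedsfile>old.txt</seedsfile></order>");
        setNode(doc, "/order/disk-path", "/data/crawl");
        setNode(doc, "/order/seedsfile", "/data/crawl/seeds.txt");
        System.out.println(serialize(doc));
    }
}
```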

isDeduplicationEnabledInTemplate

public static boolean isDeduplicationEnabledInTemplate(org.dom4j.Document doc)
Return true if the given order.xml file has deduplication enabled.

Parameters:
doc - An order.xml document
Returns:
True if Deduplicator is enabled.
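
A check of this kind amounts to reading one boolean-valued node from the document. The sketch below uses the JDK's XPath API on an org.w3c.dom document instead of dom4j, and the XPath "/order/deduplicator/enabled" is a hypothetical placeholder; the actual expression is the value of DEDUPLICATOR_ENABLED, which this page does not show.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class DedupCheckSketch {

    /** Hypothetical XPath; the real one is defined by DEDUPLICATOR_ENABLED. */
    static final String ENABLED_XPATH = "/order/deduplicator/enabled";

    /** Parses an XML string into a DOM document. */
    public static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    /**
     * Returns true only if the node exists and its text is "true"
     * (case-insensitive). A missing node evaluates to the empty string
     * and therefore counts as disabled.
     */
    public static boolean isDeduplicationEnabled(Document doc) throws Exception {
        String text = XPathFactory.newInstance().newXPath().evaluate(ENABLED_XPATH, doc);
        return Boolean.parseBoolean(text.trim());
    }

    public static void main(String[] args) throws Exception {
        Document on = parse("<order><deduplicator><enabled>true</enabled></deduplicator></order>");
        Document off = parse("<order/>");
        System.out.println(isDeduplicationEnabled(on));  // prints "true"
        System.out.println(isDeduplicationEnabled(off)); // prints "false"
    }
}
```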

getHeritrixFiles

protected HeritrixFiles getHeritrixFiles()
Returns:
an instance of the wrapper class for Heritrix files.

getControllerArguments

protected java.lang.Object[] getControllerArguments()
Returns:
the optional arguments used to initialize the chosen Heritrix controller implementation.

getJMSConnection

public JMSConnection getJMSConnection()
Returns:
the JMS connection used to send messages.