dk.netarkivet.harvester.harvesting
Class HeritrixLauncher

java.lang.Object
  extended by dk.netarkivet.harvester.harvesting.HeritrixLauncher

public class HeritrixLauncher
extends java.lang.Object

A HeritrixLauncher object wraps an instance of the web crawler Heritrix. The object is constructed with the information needed to do a crawl. The crawl is performed when doCrawl() is called. doCrawl() monitors progress and returns when the crawl has finished or has been stopped because it stalled.


Field Summary
(package private) static java.lang.String DEDUPLICATOR_ENABLED
          XPath for the boolean that tells whether the deduplicator is enabled in order.xml documents.
(package private) static java.lang.String DEDUPLICATOR_INDEX_LOCATION_XPATH
          XPath for the deduplicator index directory node in order.xml documents.
(package private) static java.lang.String DEDUPLICATOR_XPATH
          XPath for the deduplicator node in order.xml documents.
(package private)  org.apache.commons.logging.Log log
          The class logger.
 
Method Summary
 void doCrawl()
          Launches Heritrix and runs the crawl until it finishes or is stopped due to inactivity.
static HeritrixLauncher getInstance(HeritrixFiles files)
          Get instance of this class.
static boolean isDeduplicationEnabled(org.dom4j.Document doc)
          Return true if the given order.xml file has deduplication enabled.
 void setupOrderfile()
          This method prepares the orderfile used by the Heritrix crawler.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

DEDUPLICATOR_XPATH

static final java.lang.String DEDUPLICATOR_XPATH
XPath for the deduplicator node in order.xml documents.

See Also:
Constant Field Values

DEDUPLICATOR_INDEX_LOCATION_XPATH

static final java.lang.String DEDUPLICATOR_INDEX_LOCATION_XPATH
XPath for the deduplicator index directory node in order.xml documents.

See Also:
Constant Field Values

DEDUPLICATOR_ENABLED

static final java.lang.String DEDUPLICATOR_ENABLED
XPath for the boolean that tells whether the deduplicator is enabled in order.xml documents.

See Also:
Constant Field Values

log

final org.apache.commons.logging.Log log
The class logger.

Method Detail

getInstance

public static HeritrixLauncher getInstance(HeritrixFiles files)
                                    throws ArgumentNotValid
Get instance of this class.

Parameters:
files - Object encapsulating location of Heritrix crawldir and configuration files
Returns:
HeritrixLauncher object
Throws:
ArgumentNotValid - If either order.xml or seeds.txt does not exist, or argument files is null.
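
The argument checks described above can be illustrated with a minimal sketch. It does not use the NetarchiveSuite classes: CrawlFilesSketch and LauncherFactorySketch are hypothetical stand-ins for HeritrixFiles and HeritrixLauncher, IllegalArgumentException stands in for ArgumentNotValid, and the assumption that order.xml and seeds.txt sit directly in the crawldir is made only for illustration.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

// Hypothetical stand-in for HeritrixFiles: holds only the two file locations.
final class CrawlFilesSketch {
    final File orderXml;
    final File seedsTxt;

    CrawlFilesSketch(File crawlDir) {
        // Assumption: both files live directly in the crawldir.
        this.orderXml = new File(crawlDir, "order.xml");
        this.seedsTxt = new File(crawlDir, "seeds.txt");
    }
}

// Hypothetical stand-in for HeritrixLauncher, showing only the factory checks.
final class LauncherFactorySketch {
    final CrawlFilesSketch files;

    private LauncherFactorySketch(CrawlFilesSketch files) {
        this.files = files;
    }

    // IllegalArgumentException stands in for ArgumentNotValid.
    static LauncherFactorySketch getInstance(CrawlFilesSketch files) {
        if (files == null) {
            throw new IllegalArgumentException("files must not be null");
        }
        if (!files.orderXml.isFile()) {
            throw new IllegalArgumentException("Missing file: " + files.orderXml);
        }
        if (!files.seedsTxt.isFile()) {
            throw new IllegalArgumentException("Missing file: " + files.seedsTxt);
        }
        return new LauncherFactorySketch(files);
    }

    public static void main(String[] args) throws IOException {
        File dir = Files.createTempDirectory("crawldir").toFile();
        new File(dir, "order.xml").createNewFile();
        new File(dir, "seeds.txt").createNewFile();
        getInstance(new CrawlFilesSketch(dir));
        System.out.println("Launcher created for " + dir);
    }
}
```

Failing early in the factory means a launcher object never exists for a crawldir that cannot be crawled.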

doCrawl

public void doCrawl()
             throws IOFailure
This method launches Heritrix in the following way:
1. copies the orderfile and the seedsfile to the current working directory.
2. sets up the newly created copy of the orderfile.
3. starts the crawler.
4. stops the crawler (either when Heritrix has finished crawling, or when Heritrix is forcefully stopped due to inactivity).

Exit from the method's while-loop depends on Heritrix calling the crawlEnded() method when crawling is finished. This method (doCrawl()) is called from the HarvestControllerServer.onDoOneCrawl() method.

Throws:
IOFailure - if the order.xml is invalid, if the Heritrix CrawlController cannot be initialized, or if the Heritrix process is interrupted
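
The wait-until-finished-or-stalled behaviour described for this method can be sketched roughly as follows. This is not the NetarchiveSuite implementation: the class CrawlMonitorSketch, the recordActivity() hook, the polling interval and the inactivity timeout are all assumptions made for illustration, and IllegalStateException stands in for IOFailure. Only the crawlEnded() callback name is taken from the description above.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical crawl monitor; a sketch of the loop, not the real launcher.
final class CrawlMonitorSketch {
    enum Outcome { FINISHED, STOPPED_FOR_INACTIVITY }

    private final AtomicBoolean ended = new AtomicBoolean(false);
    private final AtomicLong lastActivityMillis =
            new AtomicLong(System.currentTimeMillis());

    // Called by the crawler when crawling is finished.
    void crawlEnded() {
        ended.set(true);
    }

    // Hypothetical hook: called by the crawler whenever it makes progress.
    void recordActivity() {
        lastActivityMillis.set(System.currentTimeMillis());
    }

    // Polls until crawlEnded() has been called or no activity has been
    // recorded for longer than inactivityTimeoutMillis.
    Outcome waitForCrawl(long pollMillis, long inactivityTimeoutMillis) {
        while (!ended.get()) {
            long idle = System.currentTimeMillis() - lastActivityMillis.get();
            if (idle > inactivityTimeoutMillis) {
                return Outcome.STOPPED_FOR_INACTIVITY;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                // IllegalStateException stands in for IOFailure.
                throw new IllegalStateException("Crawl monitoring interrupted", e);
            }
        }
        return Outcome.FINISHED;
    }

    public static void main(String[] args) {
        CrawlMonitorSketch monitor = new CrawlMonitorSketch();
        // Simulate a crawler that reports completion from another thread.
        new Thread(monitor::crawlEnded).start();
        System.out.println(monitor.waitForCrawl(10, 5_000));
    }
}
```

Polling a flag keeps the monitor independent of how the crawler signals completion; a real launcher would also have to shut the crawler down when the inactivity branch is taken.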

setupOrderfile

public void setupOrderfile()
                    throws IOFailure
This method prepares the orderfile used by the Heritrix crawler.

1. alters the orderfile in the following way (overriding whatever is in the orderfile):
  1. sets the disk-path to the outputdir specified in HeritrixFiles.
  2. sets the seedsfile to the seedsfile specified in HeritrixFiles.
  3. sets the prefix of the arcfiles to the unique prefix defined in HeritrixFiles.
  4. checks that the arcs-file dir is 'arcs', to ensure that we know where the arc-files are when the crawl finishes.
  5. if deduplication is enabled, sets the node pointing to the index directory for deduplication (see step 3 below).
2. saves the orderfile back to disk.
3. if deduplication is enabled in order.xml, fetches the Lucene index over the crawl.logs of the jobs used for deduplication from the index server, and writes it to a directory.

Throws:
IOFailure - when the orderfile could not be saved to disk, when a specific node is not found in the XML document, or when the SAXReader cannot parse the XML
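
The "override a node, then save" pattern in step 1 and step 2 can be sketched as below. The sketch swaps in the JDK's built-in DOM and XPath APIs instead of dom4j so that it runs without extra libraries, serializes to a string rather than to disk, and uses IllegalStateException in place of IOFailure. The class OrderFileSketch and the XPath in DISK_PATH_XPATH are hypothetical; the real order.xml paths are not given on this page.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper; uses JDK XML APIs as a stand-in for dom4j.
final class OrderFileSketch {
    // Hypothetical XPath, chosen only for this example.
    static final String DISK_PATH_XPATH =
            "/crawl-order/controller/string[@name='disk-path']";

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Cannot parse order XML", e);
        }
    }

    // Overrides the text of the single node selected by xpath.
    static void setText(Document doc, String xpath, String value) {
        Node node;
        try {
            node = (Node) XPathFactory.newInstance().newXPath()
                    .evaluate(xpath, doc, XPathConstants.NODE);
        } catch (Exception e) {
            throw new IllegalStateException("Bad XPath: " + xpath, e);
        }
        if (node == null) {
            throw new IllegalStateException("No node found at " + xpath);
        }
        node.setTextContent(value);
    }

    // Serializes the document; a real launcher would write to a file.
    static String serialize(Document doc) {
        try {
            StringWriter out = new StringWriter();
            TransformerFactory.newInstance().newTransformer()
                    .transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Cannot serialize order XML", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<crawl-order><controller>"
                + "<string name=\"disk-path\">old</string>"
                + "</controller></crawl-order>");
        setText(doc, DISK_PATH_XPATH, "/tmp/crawl");
        System.out.println(serialize(doc));
    }
}
```

Treating a missing node as an error mirrors the documented behaviour that the method fails when a specific node is not found in the XML document.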

isDeduplicationEnabled

public static boolean isDeduplicationEnabled(org.dom4j.Document doc)
Return true if the given order.xml file has deduplication enabled.

Parameters:
doc - An order.xml document
Returns:
True if Deduplicator is enabled.
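
A check of this kind might look like the following sketch. It again uses the JDK's DOM and XPath APIs rather than dom4j, and the XPath in ENABLED_XPATH is a hypothetical placeholder for the value of DEDUPLICATOR_ENABLED, which this page does not list. Treating a missing node as "disabled" is also an assumption.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper; uses JDK XML APIs as a stand-in for dom4j.
final class DedupCheckSketch {
    // Hypothetical XPath standing in for DEDUPLICATOR_ENABLED.
    static final String ENABLED_XPATH =
            "/crawl-order/controller/map[@name='write-processors']"
            + "/newObject[@name='DeDuplicator']/boolean[@name='enabled']";

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Cannot parse order XML", e);
        }
    }

    // Returns true only if the node exists and its text is "true".
    // Assumption: a missing node means deduplication is disabled.
    static boolean isDeduplicationEnabled(Document doc) {
        try {
            // evaluate() returns the empty string when no node matches.
            String text = XPathFactory.newInstance().newXPath()
                    .evaluate(ENABLED_XPATH, doc);
            return Boolean.parseBoolean(text.trim());
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("Bad XPath: " + ENABLED_XPATH, e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<crawl-order><controller>"
                + "<map name=\"write-processors\">"
                + "<newObject name=\"DeDuplicator\">"
                + "<boolean name=\"enabled\">true</boolean>"
                + "</newObject></map></controller></crawl-order>");
        System.out.println(isDeduplicationEnabled(doc));
    }
}
```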