java.lang.Object dk.netarkivet.harvester.harvesting.HeritrixLauncher
public abstract class HeritrixLauncher
A HeritrixLauncher object wraps an instance of the web crawler Heritrix. The object is constructed with the information needed to do a crawl. The crawl is performed when doCrawl() is called. doCrawl() monitors progress and returns when the crawl has finished or must be stopped because it has stalled.
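The monitoring behaviour described above can be illustrated with a minimal, self-contained sketch. It is not the NetarchiveSuite implementation: `CrawlMonitorSketch`, `CrawlStatus`, `Outcome` and the `maxIdleChecks` parameter are hypothetical names introduced for illustration, and the wait period is given in milliseconds here (the CRAWL_CONTROL_WAIT_PERIOD field is documented in seconds) only to keep the example fast.

```java
/** Hypothetical stand-in for a crawl monitor; not part of NetarchiveSuite or Heritrix. */
final class CrawlMonitorSketch {

    /** Possible results of monitoring a crawl (assumed for this sketch). */
    enum Outcome { FINISHED, STALLED, INTERRUPTED }

    /** Assumed minimal view of a running crawl. */
    interface CrawlStatus {
        boolean isFinished();
        long processedCount();
    }

    /**
     * Polls the crawl every waitMillis milliseconds. Returns FINISHED when the
     * crawl reports completion, or STALLED after maxIdleChecks consecutive
     * polls without any increase in the processed count.
     */
    static Outcome monitor(CrawlStatus status, long waitMillis, int maxIdleChecks) {
        long last = status.processedCount();
        int idleChecks = 0;
        while (!status.isFinished()) {
            if (idleChecks >= maxIdleChecks) {
                return Outcome.STALLED;
            }
            try {
                Thread.sleep(waitMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return Outcome.INTERRUPTED;
            }
            long now = status.processedCount();
            if (now > last) {
                last = now;       // progress was made, reset the idle counter
                idleChecks = 0;
            } else {
                idleChecks++;     // nothing happened since the previous check
            }
        }
        return Outcome.FINISHED;
    }

    public static void main(String[] args) {
        // A crawl that never finishes and never makes progress.
        CrawlStatus neverMoves = new CrawlStatus() {
            public boolean isFinished() { return false; }
            public long processedCount() { return 0L; }
        };
        System.out.println(monitor(neverMoves, 10, 3)); // prints STALLED
    }
}
```

Checking for completion before checking the idle counter means a crawl that finishes during the last wait is still reported as finished rather than stalled.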
Field Summary | |
---|---|
protected static int |
CRAWL_CONTROL_WAIT_PERIOD
The period to wait in seconds before checking if Heritrix has done anything. |
(package private) static java.lang.String |
DEDUPLICATOR_ENABLED
XPath for the boolean indicating whether the deduplicator is enabled in order.xml documents. |
(package private) static java.lang.String |
DEDUPLICATOR_INDEX_LOCATION_XPATH
XPath for the deduplicator index directory node in order.xml documents. |
(package private) org.apache.commons.logging.Log |
log
The class logger. |
Constructor Summary | |
---|---|
protected |
HeritrixLauncher(HeritrixFiles files)
Protected HeritrixLauncher constructor. |
|
HeritrixLauncher(java.lang.Object... args)
Generic constructor to allow HeritrixLauncher to use any implementation of HeritrixController. |
Method Summary | |
---|---|
abstract void |
doCrawl()
Launches the crawl and monitors its progress. |
protected java.lang.Object[] |
getControllerArguments()
|
protected HeritrixFiles |
getHeritrixFiles()
|
JMSConnection |
getJMSConnection()
|
static boolean |
isDeduplicationEnabledInTemplate(org.dom4j.Document doc)
Return true if the given order.xml file has deduplication enabled. |
void |
setupOrderfile()
This method prepares the orderfile used by the Heritrix crawler. |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
protected static final int CRAWL_CONTROL_WAIT_PERIOD
static final java.lang.String DEDUPLICATOR_INDEX_LOCATION_XPATH
static final java.lang.String DEDUPLICATOR_ENABLED
final org.apache.commons.logging.Log log
Constructor Detail |
---|
protected HeritrixLauncher(HeritrixFiles files) throws ArgumentNotValid
files - Object encapsulating the location of the Heritrix crawldir and configuration files.
ArgumentNotValid - If either the seedsfile or the orderfile does not exist.
public HeritrixLauncher(java.lang.Object... args)
args - the arguments to be passed to the constructor or non-static factory method of the HeritrixController class specified in settings
Method Detail |
---|
public abstract void doCrawl() throws IOFailure
IOFailure
public void setupOrderfile() throws IOFailure
IOFailure - When the orderfile could not be saved to disk, when a specific node is not found in the XML document, or when the SAXReader cannot parse the XML.
public static boolean isDeduplicationEnabledInTemplate(org.dom4j.Document doc)
doc - An order.xml document
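As an illustration of the kind of check isDeduplicationEnabledInTemplate performs, the following sketch reads a boolean from an XML document via an XPath expression. It swaps in the JDK's `javax.xml.xpath` API for dom4j so that it runs without extra libraries. The class name `DedupFlagSketch`, the XPath `/order/deduplicator/enabled` and the sample XML are all assumptions made for the example; the real value of DEDUPLICATOR_ENABLED and the layout of a real order.xml are not given on this page.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

/** Hypothetical sketch; not the NetarchiveSuite implementation. */
final class DedupFlagSketch {

    // Assumed XPath for illustration only; the real order.xml path differs.
    static final String ENABLED_XPATH = "/order/deduplicator/enabled";

    /**
     * Returns true only if the node selected by ENABLED_XPATH contains "true"
     * (case-insensitive). A missing node yields an empty string and thus false.
     */
    static boolean isEnabled(String xml) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            String value = xpath.evaluate(
                    ENABLED_XPATH, new InputSource(new StringReader(xml)));
            return Boolean.parseBoolean(value.trim());
        } catch (XPathExpressionException e) {
            // Raised for an invalid expression or an unparsable document.
            throw new IllegalArgumentException("Could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<order><deduplicator><enabled>true</enabled></deduplicator></order>";
        System.out.println(isEnabled(xml)); // prints true
    }
}
```

Treating a missing node as "disabled" is a design choice of this sketch; a caller that must distinguish "absent" from "false" would need to select the node itself rather than its string value.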
protected HeritrixFiles getHeritrixFiles()
protected java.lang.Object[] getControllerArguments()
public JMSConnection getJMSConnection()
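The generic constructor and getControllerArguments() pass an argument array on to whichever HeritrixController implementation is configured in settings. One way such an argument array could be used is sketched below with plain Java reflection. This is an assumption-laden illustration, not the library's code: `ControllerFactorySketch`, `Controller` and `FileController` are hypothetical stand-ins, and only the constructor route is shown, not the non-static factory method route.

```java
import java.lang.reflect.Constructor;

/** Hypothetical sketch of instantiating a configured class from Object... args. */
final class ControllerFactorySketch {

    /** Stand-in for a controller interface (assumed for this example). */
    interface Controller {
        String describe();
    }

    /** Stand-in implementation taking a single String argument. */
    static final class FileController implements Controller {
        private final String crawlDir;
        public FileController(String crawlDir) { this.crawlDir = crawlDir; }
        public String describe() { return "FileController(" + crawlDir + ")"; }
    }

    /** Instantiates className using the first public constructor that accepts args. */
    static Controller create(String className, Object... args) {
        try {
            Class<?> cls = Class.forName(className);
            if (!Controller.class.isAssignableFrom(cls)) {
                throw new IllegalArgumentException(className + " is not a Controller");
            }
            for (Constructor<?> c : cls.getConstructors()) {
                if (accepts(c.getParameterTypes(), args)) {
                    return (Controller) c.newInstance(args);
                }
            }
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("Cannot instantiate " + className, e);
        }
        throw new IllegalArgumentException("No matching public constructor in " + className);
    }

    // Simple matching: same arity and each non-null argument is an instance of
    // the declared type. Primitive parameter types are not handled here.
    private static boolean accepts(Class<?>[] types, Object[] args) {
        if (types.length != args.length) {
            return false;
        }
        for (int i = 0; i < types.length; i++) {
            if (args[i] != null && !types[i].isInstance(args[i])) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Controller c = create(FileController.class.getName(), "/tmp/crawl");
        System.out.println(c.describe()); // prints FileController(/tmp/crawl)
    }
}
```

Resolving the class by name keeps the caller independent of any particular implementation, at the cost of moving argument mismatches from compile time to run time.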