java.lang.Object
  dk.netarkivet.harvester.harvesting.HeritrixLauncher
public class HeritrixLauncher
A HeritrixLauncher object wraps an instance of the web crawler Heritrix. The object is constructed with the information needed to do a crawl. The crawl is performed when doCrawl() is called. doCrawl() monitors progress and returns when the crawl has finished or must be stopped because it has stalled.
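Instances are obtained through getInstance(HeritrixFiles), which according to its method detail rejects a null argument and a missing order.xml or seeds.txt. The following is a minimal sketch of that kind of precondition check, not the class's actual implementation. CrawlFiles is a hypothetical stand-in for HeritrixFiles, the assumption that both files sit directly in the crawl directory is ours, and IllegalArgumentException stands in for ArgumentNotValid.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class LauncherPreconditionSketch {

    /** Hypothetical stand-in for HeritrixFiles; assumes both files live in the crawl directory. */
    public static class CrawlFiles {
        private final Path crawlDir;

        public CrawlFiles(Path crawlDir) {
            this.crawlDir = crawlDir;
        }

        public Path orderXml() {
            return crawlDir.resolve("order.xml");
        }

        public Path seedsTxt() {
            return crawlDir.resolve("seeds.txt");
        }
    }

    /** Rejects a null argument or a missing order.xml / seeds.txt; returns the argument otherwise. */
    public static CrawlFiles requireValid(CrawlFiles files) {
        if (files == null) {
            // Stand-in for ArgumentNotValid
            throw new IllegalArgumentException("files must not be null");
        }
        for (Path required : new Path[] {files.orderXml(), files.seedsTxt()}) {
            if (!Files.isRegularFile(required)) {
                throw new IllegalArgumentException("Missing required file: " + required);
            }
        }
        return files;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("crawldir");
        CrawlFiles files = new CrawlFiles(dir);

        // The empty directory has neither file, so the check fails.
        try {
            requireValid(files);
        } catch (IllegalArgumentException e) {
            System.out.println("Rejected: " + e.getMessage());
        }

        // Once both files exist, the same check passes.
        Files.createFile(files.orderXml());
        Files.createFile(files.seedsTxt());
        requireValid(files);
        System.out.println("Accepted: " + dir);
    }
}
```

Failing early in the factory method means a crawl is never started against an incomplete crawl directory.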
Field Summary | |
---|---|
(package private) static java.lang.String | DEDUPLICATOR_ENABLED: XPath for the boolean telling whether the deduplicator is enabled in order.xml documents. |
(package private) static java.lang.String | DEDUPLICATOR_INDEX_LOCATION_XPATH: XPath for the deduplicator index directory node in order.xml documents. |
(package private) org.apache.commons.logging.Log | log: The class logger. |
Constructor Summary | |
---|---|
HeritrixLauncher(java.lang.Object... args)
Generic constructor to allow HeritrixLauncher to use any implementation of HeritrixController. |
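The generic constructor forwards its arguments to whichever HeritrixController implementation is named in settings. One way such forwarding can be done is reflection; the sketch below illustrates the idea under stated assumptions and is not taken from the NetarchiveSuite source. Controller and FileBackedController are hypothetical stand-ins, and the class name is passed in directly rather than read from settings.

```java
import java.lang.reflect.Constructor;
import java.util.Arrays;

public class ReflectiveFactorySketch {

    /** Hypothetical interface standing in for HeritrixController. */
    public interface Controller {
        String describe();
    }

    /** Hypothetical implementation with a single-argument constructor. */
    public static class FileBackedController implements Controller {
        private final String crawlDir;

        public FileBackedController(String crawlDir) {
            this.crawlDir = crawlDir;
        }

        @Override
        public String describe() {
            return "controller for " + crawlDir;
        }
    }

    /** Instantiates the named class using the first public constructor that accepts args. */
    public static Controller create(String className, Object... args) {
        try {
            Class<?> cls = Class.forName(className);
            for (Constructor<?> candidate : cls.getConstructors()) {
                if (accepts(candidate.getParameterTypes(), args)) {
                    return Controller.class.cast(candidate.newInstance(args));
                }
            }
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("Cannot instantiate " + className, e);
        }
        throw new IllegalArgumentException(
                "No constructor of " + className + " accepts " + Arrays.toString(args));
    }

    /** Simple matching: same arity, each non-null argument an instance of the parameter type. */
    private static boolean accepts(Class<?>[] types, Object[] args) {
        if (types.length != args.length) {
            return false;
        }
        for (int i = 0; i < types.length; i++) {
            // Note: primitive parameter types never match here; a fuller version would unbox.
            if (args[i] != null && !types[i].isInstance(args[i])) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Controller c = create(FileBackedController.class.getName(), "/tmp/crawl");
        System.out.println(c.describe()); // prints "controller for /tmp/crawl"
    }
}
```

The Object... signature keeps the launcher independent of any one controller's constructor, at the cost of moving argument checking from compile time to run time.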
Method Summary | |
---|---|
void | doCrawl(): This method launches Heritrix and performs the crawl. |
static HeritrixLauncher | getInstance(HeritrixFiles files): Get instance of this class. |
static boolean | isDeduplicationEnabledInTemplate(org.dom4j.Document doc): Return true if the given order.xml file has deduplication enabled. |
void | setupOrderfile(): This method prepares the orderfile used by the Heritrix crawler. |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
static final java.lang.String DEDUPLICATOR_INDEX_LOCATION_XPATH
static final java.lang.String DEDUPLICATOR_ENABLED
final org.apache.commons.logging.Log log
Constructor Detail |
---|
public HeritrixLauncher(java.lang.Object... args)
Parameters:
args - the arguments to be passed to the constructor or non-static factory method of the HeritrixController class specified in settings
Method Detail |
---|
public static HeritrixLauncher getInstance(HeritrixFiles files) throws ArgumentNotValid
Parameters:
files - Object encapsulating the location of the Heritrix crawldir and configuration files
Throws:
ArgumentNotValid - If either order.xml or seeds.txt does not exist, or if the argument files is null.
public void doCrawl() throws IOFailure
Throws:
IOFailure - If the order.xml is invalid, if the Heritrix CrawlController cannot be initialized, or if the Heritrix process is interrupted.
public void setupOrderfile() throws IOFailure
Throws:
IOFailure - When the orderfile could not be saved to disk, when a specific node is not found in the XML document, or when the SAXReader cannot parse the XML.
public static boolean isDeduplicationEnabledInTemplate(org.dom4j.Document doc)
Parameters:
doc - An order.xml document
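A check such as isDeduplicationEnabledInTemplate amounts to evaluating an XPath expression against the order.xml document and reading a boolean from the selected node. The sketch below shows that idea using the JDK's built-in javax.xml.xpath API in place of dom4j, so that it runs without extra libraries. The XPath expression and the sample documents are hypothetical; the actual value of DEDUPLICATOR_ENABLED is not shown on this page.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class DedupCheckSketch {

    // Hypothetical XPath, chosen for illustration only.
    static final String ENABLED_XPATH =
            "/crawl-order/deduplicator/boolean[@name='enabled']";

    /** Returns true only if the node exists and its text is "true". */
    public static boolean isDeduplicationEnabled(String orderXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(orderXml)));
            Node node = (Node) XPathFactory.newInstance().newXPath()
                    .evaluate(ENABLED_XPATH, doc, XPathConstants.NODE);
            return node != null && Boolean.parseBoolean(node.getTextContent().trim());
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate order.xml", e);
        }
    }

    public static void main(String[] args) {
        String enabled = "<crawl-order><deduplicator>"
                + "<boolean name=\"enabled\">true</boolean>"
                + "</deduplicator></crawl-order>";
        String missing = "<crawl-order/>";

        System.out.println(isDeduplicationEnabled(enabled)); // prints "true"
        System.out.println(isDeduplicationEnabled(missing)); // prints "false"
    }
}
```

In this sketch a missing node is treated as "deduplication disabled" rather than as an error; whether the real method behaves the same way is not stated in the documentation above.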