java.lang.Object
  dk.netarkivet.harvester.harvesting.HeritrixLauncher
public class HeritrixLauncher
A HeritrixLauncher object wraps an instance of the web crawler Heritrix. The object is constructed with the information needed to do a crawl. The crawl is performed when doCrawl() is called. doCrawl() monitors progress and returns when the crawl has finished or must be stopped because it has stalled.
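As a rough illustration of the lifecycle described above, the sketch below imitates the factory-then-crawl pattern with stand-in types. `LauncherLifecycleSketch`, `CrawlFiles`, and `ArgumentNotValidSketch` are hypothetical names that only mimic `HeritrixLauncher`, `HeritrixFiles`, and `ArgumentNotValid`. The file checks follow the documented behaviour of `getInstance`; the crawl body is a placeholder and does not reproduce the real monitoring logic.

```java
import java.io.File;

class LauncherLifecycleSketch {
    /** Hypothetical stand-in for ArgumentNotValid (unchecked here for brevity). */
    static class ArgumentNotValidSketch extends RuntimeException {
        ArgumentNotValidSketch(String message) {
            super(message);
        }
    }

    /** Hypothetical stand-in for HeritrixFiles: just the two files the factory checks. */
    static class CrawlFiles {
        final File orderXml;
        final File seedsTxt;

        CrawlFiles(File orderXml, File seedsTxt) {
            this.orderXml = orderXml;
            this.seedsTxt = seedsTxt;
        }
    }

    private final CrawlFiles files;
    private boolean crawled = false;

    private LauncherLifecycleSketch(CrawlFiles files) {
        this.files = files;
    }

    /** Validates the inputs up front, so a launcher never exists without its files. */
    static LauncherLifecycleSketch getInstance(CrawlFiles files) {
        if (files == null) {
            throw new ArgumentNotValidSketch("files must not be null");
        }
        if (!files.orderXml.isFile()) {
            throw new ArgumentNotValidSketch("Missing order file: " + files.orderXml);
        }
        if (!files.seedsTxt.isFile()) {
            throw new ArgumentNotValidSketch("Missing seeds file: " + files.seedsTxt);
        }
        return new LauncherLifecycleSketch(files);
    }

    /** Placeholder: the real method launches Heritrix and monitors progress. */
    void doCrawl() {
        crawled = true;
    }

    boolean hasCrawled() {
        return crawled;
    }

    public static void main(String[] args) {
        CrawlFiles files = new CrawlFiles(new File("order.xml"), new File("seeds.txt"));
        try {
            LauncherLifecycleSketch launcher = getInstance(files);
            launcher.doCrawl();
            System.out.println("Crawl finished: " + launcher.hasCrawled());
        } catch (ArgumentNotValidSketch e) {
            System.out.println("Cannot start crawl: " + e.getMessage());
        }
    }
}
```

Checking the files in the factory method rather than in `doCrawl()` means a misconfigured crawl fails before any crawler state is created.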
Field Summary | |
---|---|
(package private) static java.lang.String |
DEDUPLICATOR_ENABLED
XPath for the boolean that tells whether the deduplicator is enabled in order.xml documents |
(package private) static java.lang.String |
DEDUPLICATOR_INDEX_LOCATION_XPATH
XPath for the deduplicator index directory node in order.xml documents |
(package private) static java.lang.String |
DEDUPLICATOR_XPATH
XPath for the deduplicator node in order.xml documents |
static java.lang.String |
FROM_XPATH
XPath for the HTTP "from" header field in order.xml documents |
(package private) org.apache.commons.logging.Log |
log
The class logger. |
static java.lang.String |
USER_AGENT_XPATH
XPath for the HTTP "user-agent" header field in order.xml documents |
Method Summary | |
---|---|
void |
doCrawl()
This method launches Heritrix. |
static HeritrixLauncher |
getInstance(HeritrixFiles files)
Get instance of this class. |
static boolean |
isDeduplicationEnabled(org.dom4j.Document doc)
Return true if the given order.xml file has deduplication enabled. |
void |
setupOrderfile()
This method prepares the orderfile used by the Heritrix crawler. |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
static final java.lang.String DEDUPLICATOR_XPATH
static final java.lang.String DEDUPLICATOR_INDEX_LOCATION_XPATH
static final java.lang.String DEDUPLICATOR_ENABLED
final org.apache.commons.logging.Log log
public static final java.lang.String USER_AGENT_XPATH
public static final java.lang.String FROM_XPATH
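The XPath constants above each locate a single node in order.xml. The following is a minimal sketch of how such a constant might be used to overwrite a header value. It uses the JDK's built-in DOM and XPath APIs instead of dom4j, and the expression in `USER_AGENT_XPATH_GUESS` and the sample document are assumptions made for illustration; the actual constant values are not shown on this page.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class HeaderXpathSketch {
    // Assumed expression; not the documented value of USER_AGENT_XPATH.
    static final String USER_AGENT_XPATH_GUESS =
        "/crawl-order/controller/map[@name='http-headers']/string[@name='user-agent']";

    /** Parses an XML string into a DOM document. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    /** Returns the single node selected by the expression, or null if none matches. */
    static Node selectNode(Document doc, String xpath) {
        try {
            return (Node) XPathFactory.newInstance().newXPath()
                .evaluate(xpath, doc, XPathConstants.NODE);
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("Bad XPath: " + xpath, e);
        }
    }

    /** Replaces the text of the selected node; fails loudly if the node is missing. */
    static void setNodeText(Document doc, String xpath, String value) {
        Node node = selectNode(doc, xpath);
        if (node == null) {
            throw new IllegalStateException("No node found at " + xpath);
        }
        node.setTextContent(value);
    }

    /** Reads the text of the selected node, or null if the node is missing. */
    static String getNodeText(Document doc, String xpath) {
        Node node = selectNode(doc, xpath);
        return node == null ? null : node.getTextContent();
    }

    public static void main(String[] args) {
        Document doc = parse(
            "<crawl-order><controller><map name='http-headers'>"
            + "<string name='user-agent'>placeholder</string>"
            + "</map></controller></crawl-order>");
        setNodeText(doc, USER_AGENT_XPATH_GUESS, "ExampleBot/1.0 (+http://example.org)");
        System.out.println(getNodeText(doc, USER_AGENT_XPATH_GUESS));
    }
}
```

Failing when the node is absent mirrors the documented `setupOrderfile()` behaviour of reporting an error when a specific node is not found in the XML document.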
Method Detail |
---|
public static HeritrixLauncher getInstance(HeritrixFiles files) throws ArgumentNotValid
Parameters:
files - Object encapsulating the location of the Heritrix crawldir and configuration files
Throws:
ArgumentNotValid - If either order.xml or seeds.txt does not exist

public void doCrawl() throws IOFailure
Throws:
IOFailure - If the order.xml is invalid, if unable to initialize the Heritrix CrawlController, or if the Heritrix process is interrupted

public void setupOrderfile() throws IOFailure
Throws:
IOFailure - When the orderfile could not be saved to disk, when a specific node is not found in the XML document, or when the SAXReader cannot parse the XML

public static boolean isDeduplicationEnabled(org.dom4j.Document doc)
Parameters:
doc - An order.xml document
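
To make the deduplication check concrete, here is a hedged sketch of how a boolean flag could be read from an order.xml document by XPath. It substitutes the JDK's `org.w3c.dom.Document` and `javax.xml.xpath` for dom4j, and both `ENABLED_XPATH_GUESS` and the sample XML are assumptions; the real value of `DEDUPLICATOR_ENABLED` and the exact handling of a missing node are not documented on this page.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class DedupCheckSketch {
    // Assumed expression; not the documented value of DEDUPLICATOR_ENABLED.
    static final String ENABLED_XPATH_GUESS =
        "/crawl-order/controller/map[@name='write-processors']"
        + "/newObject[@name='DeDuplicator']/boolean[@name='enabled']";

    /** Parses an XML string into a DOM document. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    /**
     * Returns true only if the flag node exists and its text is "true".
     * Treating a missing node as "disabled" is a choice made for this sketch.
     */
    static boolean isDeduplicationEnabled(Document doc) {
        try {
            // evaluate(String, Object) yields an empty string when nothing matches.
            String value = XPathFactory.newInstance().newXPath()
                .evaluate(ENABLED_XPATH_GUESS, doc);
            return "true".equalsIgnoreCase(value.trim());
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("Bad XPath: " + ENABLED_XPATH_GUESS, e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse(
            "<crawl-order><controller><map name='write-processors'>"
            + "<newObject name='DeDuplicator'>"
            + "<boolean name='enabled'>true</boolean>"
            + "</newObject></map></controller></crawl-order>");
        System.out.println("Deduplication enabled: " + isDeduplicationEnabled(doc));
    }
}
```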