Class HeritrixLauncherAbstract
- java.lang.Object
  - dk.netarkivet.harvester.heritrix3.HeritrixLauncherAbstract
- Direct Known Subclasses:
HeritrixLauncher
public abstract class HeritrixLauncherAbstract extends Object
A HeritrixLauncher object wraps around an instance of the web crawler Heritrix3. The object is constructed with the information necessary to do a crawl. The crawl is performed when doCrawl() is called. doCrawl() monitors progress and returns when the crawl has finished or must be stopped because it has stalled.
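The monitoring behaviour described above (poll the crawler at a fixed interval, return when the crawl has finished or has stalled) can be sketched roughly as follows. This is a minimal, self-contained illustration and not the NetarchiveSuite implementation: CrawlMonitorSketch, CrawlStatus, Outcome, and the maxIdleChecks threshold are hypothetical names introduced here, and the stall criterion (a progress counter that does not change over several consecutive checks) is an assumption.

```java
/** Hypothetical sketch; none of these types exist in NetarchiveSuite. */
final class CrawlMonitorSketch {

    /** Hypothetical view of a running crawl. */
    interface CrawlStatus {
        /** True once the crawler reports that the crawl has ended. */
        boolean isFinished();

        /** Any counter that changes while the crawler is doing work. */
        long progress();
    }

    enum Outcome { FINISHED, STALLED, INTERRUPTED }

    /**
     * Polls the crawl every waitMillis milliseconds. Returns FINISHED when
     * the crawl ends, or STALLED after maxIdleChecks consecutive polls in
     * which the progress counter did not change (assumed stall criterion).
     */
    static Outcome monitor(CrawlStatus status, long waitMillis, int maxIdleChecks) {
        long lastProgress = status.progress();
        int idleChecks = 0;
        while (!status.isFinished()) {
            if (idleChecks >= maxIdleChecks) {
                return Outcome.STALLED;
            }
            try {
                Thread.sleep(waitMillis);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and stop monitoring.
                Thread.currentThread().interrupt();
                return Outcome.INTERRUPTED;
            }
            long current = status.progress();
            if (current != lastProgress) {
                lastProgress = current;
                idleChecks = 0;
            } else {
                idleChecks++;
            }
        }
        return Outcome.FINISHED;
    }
}
```

In a real launcher the wait interval would presumably be derived from CRAWL_CONTROL_WAIT_PERIOD, which is expressed in seconds and would need converting to milliseconds before being passed to a method like monitor.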
-
Field Summary
Fields (modifier and type, field, description):
- protected static int CRAWL_CONTROL_WAIT_PERIOD: The period to wait in seconds before checking if Heritrix3 has done anything.
-
Constructor Summary
Constructors (modifier, constructor, description):
- protected HeritrixLauncherAbstract(Heritrix3Files files): Protected HeritrixLauncher constructor.
- public HeritrixLauncherAbstract(Object... args): Generic constructor to allow HeritrixLauncher to use any implementation of HeritrixController.
-
Method Summary
Methods (modifier and type, method, description):
- abstract void doCrawl(): Launches the crawl and monitors its progress.
- protected Object[] getControllerArguments()
- protected Heritrix3Files getHeritrixFiles()
- static void makeTemplateReadyForHeritrix3(String jobName, Heritrix3Files files): This method prepares the crawler-beans.cxml file used by the Heritrix3 crawler.
- void setupOrderfile(String jobName, Heritrix3Files files)
-
Constructor Detail
-
HeritrixLauncherAbstract
protected HeritrixLauncherAbstract(Heritrix3Files files) throws ArgumentNotValid
Protected HeritrixLauncher constructor. Sets up the HeritrixLauncher from the given order file and seeds file.
- Parameters:
files - Object encapsulating the location of the Heritrix3 crawldir and configuration files.
- Throws:
ArgumentNotValid - If either the seeds file or the order file does not exist.
-
HeritrixLauncherAbstract
public HeritrixLauncherAbstract(Object... args)
Generic constructor to allow HeritrixLauncher to use any implementation of HeritrixController.
- Parameters:
args - the arguments to be passed to the constructor or non-static factory method of the HeritrixController class specified in settings
-
Method Detail
-
doCrawl
public abstract void doCrawl() throws IOFailure
Launches the crawl and monitors its progress.
- Throws:
IOFailure
-
getHeritrixFiles
protected Heritrix3Files getHeritrixFiles()
- Returns:
- an instance of the wrapper class for Heritrix files.
-
getControllerArguments
protected Object[] getControllerArguments()
- Returns:
- the optional arguments used to initialize the chosen Heritrix controller implementation.
-
setupOrderfile
public void setupOrderfile(String jobName, Heritrix3Files files)
-
makeTemplateReadyForHeritrix3
public static void makeTemplateReadyForHeritrix3(String jobName, Heritrix3Files files) throws IOFailure
This method prepares the crawler-beans.cxml file used by the Heritrix3 crawler. It alters crawler-beans.cxml as follows, overriding whatever the file already contains:
- Sets the prefix of the archive files to the unique prefix defined in Heritrix3Files.
- If deduplication is enabled in the order file, sets the node pointing to the index directory for deduplication by writing the absolute path of the Lucene index used by the deduplication processor.
- Throws:
IOFailure - When the order file could not be saved to disk.
IllegalState - When the order file is not an H3 template.
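The template preparation described above can be illustrated with a small, self-contained sketch. It is not the NetarchiveSuite code. The class CrawlerBeansSketch and the two placeholder tokens are hypothetical, since the real markers in crawler-beans.cxml are not shown in this document. The sketch also assumes that deduplication counts as enabled when the index token is present, and it uses java.lang.IllegalStateException as a stand-in for the IllegalState exception named above.

```java
import java.nio.file.Path;

/** Hypothetical sketch of preparing a crawler-beans.cxml template. */
final class CrawlerBeansSketch {

    // Hypothetical placeholder tokens; the real template markers may differ.
    static final String PREFIX_TOKEN = "@ARCHIVE_PREFIX@";
    static final String INDEX_TOKEN = "@DEDUP_INDEX_DIR@";

    /**
     * Returns the template text with the archive file prefix filled in and,
     * if the template contains the deduplication index token (assumed to
     * mean deduplication is enabled), with the absolute index path filled in.
     * The path is inserted verbatim; XML escaping is left out of this sketch.
     */
    static String prepare(String template, String archivePrefix, Path indexDir) {
        if (!template.contains(PREFIX_TOKEN)) {
            // Stand-in for "the order file is not an H3 template".
            throw new IllegalStateException("Template lacks the archive prefix token");
        }
        String result = template.replace(PREFIX_TOKEN, archivePrefix);
        if (result.contains(INDEX_TOKEN)) {
            result = result.replace(INDEX_TOKEN, indexDir.toAbsolutePath().toString());
        }
        return result;
    }
}
```

The sketch works on a string so that it stays independent of any file layout. Saving the result to disk, the step that can fail with IOFailure in the method documented above, is deliberately omitted.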
-