Class HeritrixController
java.lang.Object
  dk.netarkivet.harvester.heritrix3.controller.AbstractRestHeritrixController
    dk.netarkivet.harvester.heritrix3.controller.HeritrixController
All Implemented Interfaces:
IHeritrixController
public class HeritrixController extends AbstractRestHeritrixController
This implementation of the HeritrixController interface starts Heritrix3 as a separate process and uses JMX to communicate with it. Each instance executes exactly one process that runs exactly one crawl job.
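The lifecycle described above (one controller instance, one process, one crawl job) could be driven roughly as follows. This is a minimal sketch, not NetarchiveSuite code: `Controller` is a hypothetical stand-in interface that declares only four of the methods documented on this page, and `runCrawl`, `FakeController`, and the polling interval are illustrative assumptions.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class CrawlLifecycleSketch {

    /** Hypothetical stand-in for the subset of IHeritrixController used here. */
    public interface Controller {
        void initialize();
        void requestCrawlStart();
        boolean crawlIsEnded();
        void cleanup(File crawlDir);
    }

    /** Runs one crawl job and always cleans up, even if start-up fails. */
    public static void runCrawl(Controller controller, File crawlDir, long pollMillis) {
        try {
            controller.initialize();
            controller.requestCrawlStart();
            // Crawling starts some time after requestCrawlStart() returns,
            // so poll until the controller reports that the crawl has ended.
            while (!controller.crawlIsEnded()) {
                Thread.sleep(pollMillis);
            }
        } catch (InterruptedException e) {
            // Preserve the interrupt status and fall through to cleanup.
            Thread.currentThread().interrupt();
        } finally {
            controller.cleanup(crawlDir);
        }
    }

    /** In-memory fake that records calls and ends after a fixed number of polls. */
    public static class FakeController implements Controller {
        public final List<String> calls = new ArrayList<>();
        private int pollsLeft;

        public FakeController(int polls) { this.pollsLeft = polls; }

        @Override public void initialize() { calls.add("initialize"); }
        @Override public void requestCrawlStart() { calls.add("requestCrawlStart"); }
        @Override public boolean crawlIsEnded() { return pollsLeft-- <= 0; }
        @Override public void cleanup(File crawlDir) { calls.add("cleanup"); }
    }

    public static void main(String[] args) {
        FakeController fake = new FakeController(2);
        runCrawl(fake, new File("crawldir"), 1L);
        System.out.println(fake.calls); // prints [initialize, requestCrawlStart, cleanup]
    }
}
```

Placing cleanup in a `finally` block is a design choice of this sketch: it keeps the external process from being left running if initialisation or start-up throws.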
-
-
Nested Class Summary
-
Nested classes/interfaces inherited from class dk.netarkivet.harvester.heritrix3.controller.AbstractRestHeritrixController
AbstractRestHeritrixController.LaunchResultHandler
-
-
Field Summary
-
Fields inherited from class dk.netarkivet.harvester.heritrix3.controller.AbstractRestHeritrixController
errorPrinter, files, h3handler, h3launcher, h3wrapper, heritrixBaseDir, outputPrinter
-
-
Constructor Summary
Constructor Description
HeritrixController(Heritrix3Files files, java.lang.String jobName)
Create a HeritrixController object.
-
Method Summary
Modifier and Type, Method, Description
boolean atFinish()
Query whether Heritrix is in a state where it can finish crawling.
void beginCrawlStop()
Tell Heritrix to stop crawling.
void cleanup()
Release any resources kept by the class.
void cleanup(java.io.File crawlDir)
Clean up after a Heritrix3 process.
boolean crawlIsEnded()
Returns true if the crawl has ended, either because Heritrix finished or because we terminated it.
int getActiveToeCount()
Get the number of currently active ToeThreads (crawler threads).
java.lang.String getAdminInterfaceUrl()
Return the URL for monitoring this instance.
CrawlProgressMessage getCrawlProgress()
Gets a message that stores the information summarizing the crawl progress.
int getCurrentProcessedKBPerSec()
Get an estimate of the rate, in KB per second, at which documents are currently being processed by the crawler.
FullFrontierReport getFullFrontierReport()
Generates a full frontier report from H3 using a REST call (Groovy script).
java.lang.String getHarvestInformation()
Get harvest information.
java.lang.String getHeritrixConsoleURL()
Return the URL for monitoring this instance.
java.lang.String getHeritrixJobConsoleURL()
Return the URL for monitoring the job of this instance.
java.lang.String getProgressStats()
Get a human-readable set of statistics on the progress of the crawl.
long getQueuedUriCount()
Get the number of URIs currently on the queue to be processed.
void initialize()
Initialize the JMX connection to Heritrix3.
boolean isPaused()
Returns true if the crawler has been paused, and is thus not supposed to fetch anything.
void requestCrawlStart()
Request that Heritrix start crawling.
void requestCrawlStop(java.lang.String reason)
Request that the crawler stops.
void stopHeritrix()
Stop the Heritrix process.
Methods inherited from class dk.netarkivet.harvester.heritrix3.controller.AbstractRestHeritrixController
getFiles, getGuiPort, getHeritrixAdminName, getHeritrixAdminPassword, getHeritrixFiles, getHostName, getJobDescription, toString
-
-
-
-
Constructor Detail
-
HeritrixController
public HeritrixController(Heritrix3Files files, java.lang.String jobName)
Create a HeritrixController object.
Parameters:
files - Files that are used to set up Heritrix3.
-
-
Method Detail
-
initialize
public void initialize()
Initialize the JMX connection to Heritrix3.
Throws:
IOFailure - If Heritrix3 dies before initialisation, or we encounter any problems during the initialisation.
See Also:
IHeritrixController.initialize()
-
requestCrawlStart
public void requestCrawlStart()
Description copied from interface: IHeritrixController
Request that Heritrix start crawling. When this method returns, either Heritrix has failed in the early stages, or the crawl job has been successfully created. Actual crawling will commence at some point after that.
-
requestCrawlStop
public void requestCrawlStop(java.lang.String reason)
Description copied from interface: IHeritrixController
Request that the crawler stops. This calls beginCrawlStop() unless the crawler is already shutting down, in which case it does nothing.
Parameters:
reason - A human-readable reason the crawl is being stopped.
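The "do nothing if already shutting down" behaviour can be illustrated with an atomic flag. This is a sketch under assumptions, not the actual implementation: the `shuttingDown` field, the call counter, and the log line are hypothetical, and the real class may track its shutdown state differently.

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class StopRequestSketch {

    // Hypothetical flag; set once the first stop request has been accepted.
    private final AtomicBoolean shuttingDown = new AtomicBoolean(false);

    // Counts calls so that the guard's effect is observable in this sketch.
    private int beginCrawlStopCalls = 0;

    /** Stand-in for the documented beginCrawlStop(). */
    public void beginCrawlStop() {
        beginCrawlStopCalls++;
    }

    /** Forwards to beginCrawlStop() only for the first request. */
    public void requestCrawlStop(String reason) {
        // compareAndSet succeeds for exactly one caller, even under concurrency.
        if (shuttingDown.compareAndSet(false, true)) {
            System.out.println("Stopping crawl: " + reason);
            beginCrawlStop();
        }
    }

    public int getBeginCrawlStopCalls() {
        return beginCrawlStopCalls;
    }

    public static void main(String[] args) {
        StopRequestSketch controller = new StopRequestSketch();
        controller.requestCrawlStop("time limit reached");
        controller.requestCrawlStop("operator request"); // ignored: already shutting down
        System.out.println(controller.getBeginCrawlStopCalls()); // prints 1
    }
}
```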
-
stopHeritrix
public void stopHeritrix()
Description copied from interface: IHeritrixController
Stop the Heritrix process.
-
getHeritrixConsoleURL
public java.lang.String getHeritrixConsoleURL()
Return the URL for monitoring this instance.
Returns:
the URL for monitoring this instance.
-
getHeritrixJobConsoleURL
public java.lang.String getHeritrixJobConsoleURL()
Return the URL for monitoring the job of this instance.
Returns:
the URL for monitoring the job of this instance.
-
cleanup
public void cleanup(java.io.File crawlDir)
Clean up after a Heritrix3 process. This entails sending the shutdown command to the Heritrix3 process and, if it is still alive after the period of time specified by the CommonSettings.PROCESS_TIMEOUT setting, killing it forcefully.
Parameters:
crawlDir - the crawl directory to clean up (this argument is currently not used)
See Also:
IHeritrixController.cleanup()
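The shutdown-then-kill sequence described above follows a common pattern: ask politely, wait up to a timeout, then force termination. The sketch below shows only that control flow. `ManagedProcess`, `FakeProcess`, and `shutDown` are hypothetical names introduced for illustration; how the real class sends the shutdown command and reads CommonSettings.PROCESS_TIMEOUT is not shown on this page.

```java
public class GracefulShutdownSketch {

    /** Hypothetical abstraction over an external crawler process. */
    public interface ManagedProcess {
        void requestShutdown();                // send the shutdown command
        boolean awaitExit(long timeoutMillis); // true if the process exited in time
        void killForcefully();                 // last resort
    }

    /**
     * Shuts the process down gracefully if possible.
     *
     * @return true if a forced kill was needed, false if the process exited on its own
     */
    public static boolean shutDown(ManagedProcess process, long timeoutMillis) {
        process.requestShutdown();
        if (process.awaitExit(timeoutMillis)) {
            return false;
        }
        process.killForcefully();
        return true;
    }

    /** Fake process that either honours or ignores the shutdown command. */
    public static class FakeProcess implements ManagedProcess {
        private final boolean exitsOnShutdown;
        private boolean shutdownRequested = false;
        public boolean killed = false;

        public FakeProcess(boolean exitsOnShutdown) {
            this.exitsOnShutdown = exitsOnShutdown;
        }

        @Override public void requestShutdown() { shutdownRequested = true; }

        // The fake answers immediately instead of actually waiting.
        @Override public boolean awaitExit(long timeoutMillis) {
            return shutdownRequested && exitsOnShutdown;
        }

        @Override public void killForcefully() { killed = true; }
    }

    public static void main(String[] args) {
        System.out.println(shutDown(new FakeProcess(true), 10L));  // prints false
        System.out.println(shutDown(new FakeProcess(false), 10L)); // prints true
    }
}
```

For a process started through `java.lang.Process`, the last two steps would typically map to `waitFor(timeout, unit)` and `destroyForcibly()`.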
-
getAdminInterfaceUrl
public java.lang.String getAdminInterfaceUrl()
Return the URL for monitoring this instance.
Returns:
the URL for monitoring this instance.
-
getCrawlProgress
public CrawlProgressMessage getCrawlProgress()
Gets a message that stores the information summarizing the crawl progress.
Returns:
a message that stores the information summarizing the crawl progress.
-
getFullFrontierReport
public FullFrontierReport getFullFrontierReport()
Generates a full frontier report from H3 using a REST call (Groovy script).
Returns:
a full frontier report.
-
atFinish
public boolean atFinish()
Description copied from interface: IHeritrixController
Query whether Heritrix is in a state where it can finish crawling. Returns true if no URIs remain to be harvested, or if the current harvest has reached its maxbytes limit, its document limit, or its time limit.
Returns:
True if Heritrix thinks it is time to stop crawling.
-
beginCrawlStop
public void beginCrawlStop()
Description copied from interface: IHeritrixController
Tell Heritrix to stop crawling. Heritrix may take a while to actually stop, so you cannot assume that crawling is stopped when this method returns.
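Because crawling may continue after beginCrawlStop() returns, a caller that needs to know when the crawl has really stopped has to poll for it. The following is a minimal sketch of one way to do that with a deadline. `stopAndAwait` and its parameters are hypothetical; the two callbacks stand in for the documented beginCrawlStop() and crawlIsEnded() methods.

```java
import java.util.function.BooleanSupplier;

public class StopAndWaitSketch {

    /**
     * Requests a stop, then polls until the crawl has ended or the timeout expires.
     *
     * @return true if the crawl ended within the timeout, false otherwise
     */
    public static boolean stopAndAwait(Runnable beginCrawlStop, BooleanSupplier crawlIsEnded,
                                       long timeoutMillis, long pollMillis) {
        beginCrawlStop.run();
        // nanoTime is monotonic, so the deadline is unaffected by wall-clock changes.
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (!crawlIsEnded.getAsBoolean()) {
            if (System.nanoTime() - deadline >= 0) {
                return false; // timed out while the crawl was still running
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        int[] polls = {0};
        // Simulated crawl that reports "ended" on the third poll.
        boolean ended = stopAndAwait(() -> { }, () -> ++polls[0] >= 3, 1000L, 1L);
        System.out.println(ended); // prints true
    }
}
```

With a real controller instance, the callbacks could be passed as method references, for example `controller::beginCrawlStop` and `controller::crawlIsEnded`.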
-
cleanup
public void cleanup()
Description copied from interface: IHeritrixController
Release any resources kept by the class.
-
crawlIsEnded
public boolean crawlIsEnded()
Description copied from interface: IHeritrixController
Returns true if the crawl has ended, either because Heritrix finished or because we terminated it.
Returns:
True if the CrawlEnded event has happened in Heritrix, indicating that all crawls have stopped.
-
getActiveToeCount
public int getActiveToeCount()
Description copied from interface: IHeritrixController
Get the number of currently active ToeThreads (crawler threads).
Returns:
Number of ToeThreads currently active within Heritrix.
-
getCurrentProcessedKBPerSec
public int getCurrentProcessedKBPerSec()
Description copied from interface: IHeritrixController
Get an estimate of the rate, in KB per second, at which documents are currently being processed by the crawler.
Returns:
Number of KB of data downloaded by Heritrix over an undefined interval up to now.
See Also:
org.archive.crawler.framework.StatisticsTracking#currentProcessedKBPerSec()
-
getHarvestInformation
public java.lang.String getHarvestInformation()
Description copied from interface: IHeritrixController
Get harvest information. An example of this is a URL pointing to the GUI of a running Heritrix process.
Returns:
information about the harvest process.
-
getProgressStats
public java.lang.String getProgressStats()
Description copied from interface: IHeritrixController
Get a human-readable set of statistics on the progress of the crawl. The statistics are discovered uris, queued uris, downloaded uris, doc/s(avg), KB/s(avg), dl-failures, busy-thread, mem-use-KB, heap-size-KB, congestion, max-depth, avg-depth. If no statistics are available, the string "No statistics available" is returned. Note: this method may disappear in the future.
Returns:
Some ASCII-formatted statistics on the progress of the crawl.
-
getQueuedUriCount
public long getQueuedUriCount()
Description copied from interface: IHeritrixController
Get the number of URIs currently on the queue to be processed. This number may not be exact and should only be used in informal texts.
Returns:
How many URIs Heritrix has lined up for processing.
-
isPaused
public boolean isPaused()
Description copied from interface: IHeritrixController
Returns true if the crawler has been paused, and is thus not supposed to fetch anything. Heritrix may still be fetching stuff, as it takes some time for it to go into full pause mode. This method can be used as an indicator that we should not be worried if Heritrix appears to be idle.
Returns:
True if the crawler has been paused, e.g. by using the Heritrix GUI.
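One way a monitoring component might use the pause flag is to suppress "crawler looks stuck" warnings while the crawl is paused. The heuristic below is an illustrative assumption, not part of NetarchiveSuite: `looksStalled` is a hypothetical helper whose inputs correspond to the values returned by isPaused(), getActiveToeCount(), and getQueuedUriCount().

```java
public class IdleCheckSketch {

    /**
     * Heuristic: the crawl looks stalled if no crawler threads are active and
     * work is still queued, unless the crawler has been paused on purpose.
     */
    public static boolean looksStalled(boolean paused, int activeToeCount, long queuedUriCount) {
        boolean idle = activeToeCount == 0;
        boolean workRemaining = queuedUriCount > 0;
        return idle && workRemaining && !paused;
    }

    public static void main(String[] args) {
        System.out.println(looksStalled(false, 0, 120L)); // prints true: idle with work queued
        System.out.println(looksStalled(true, 0, 120L));  // prints false: idle because paused
        System.out.println(looksStalled(false, 5, 120L)); // prints false: threads are busy
    }
}
```

Since the queued URI count is documented as inexact, a check like this is better suited to raising a warning than to triggering an automatic stop.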
-
-