Interface IHeritrixController
-
- All Known Implementing Classes:
AbstractRestHeritrixController, HeritrixController
public interface IHeritrixController
This interface encapsulates direct access to Heritrix, so that Heritrix can be reached in various ways (direct class access or JMX). Each instance of an implementing class is expected to perform one Heritrix crawl.
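The one-crawl-per-instance contract suggests a simple call order. The following is a minimal sketch of that order, not the project's actual launcher: `CrawlLifecycle`, `LifecycleRunner` and `RecordingCrawl` are hypothetical names, and `CrawlLifecycle` is a local stand-in that repeats only four of the interface's methods so the example compiles on its own. Calling `cleanup()` in a `finally` block is an assumption about sensible usage, not something the interface prescribes.

```java
import java.util.ArrayList;
import java.util.List;

// Local stand-in declaring only the methods this sketch calls; real code
// would depend on IHeritrixController itself.
interface CrawlLifecycle {
    void initialize();
    void requestCrawlStart();
    boolean crawlIsEnded();
    void cleanup();
}

final class LifecycleRunner {
    private LifecycleRunner() {}

    // Hypothetical helper: drives one crawl and returns how many times it polled.
    static int runOnce(CrawlLifecycle controller, long pollMillis) {
        int polls = 0;
        controller.initialize(); // prepares the crawl but does not start it
        try {
            controller.requestCrawlStart();
            while (!controller.crawlIsEnded()) {
                polls++;
                try {
                    Thread.sleep(pollMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve the interrupt
                    break;
                }
            }
        } finally {
            controller.cleanup(); // assumed: release resources even if startup failed
        }
        return polls;
    }
}

// Hypothetical in-memory fake that reports "ended" after a fixed number of polls.
final class RecordingCrawl implements CrawlLifecycle {
    final List<String> calls = new ArrayList<>();
    private int pollsLeft;

    RecordingCrawl(int pollsUntilEnded) { this.pollsLeft = pollsUntilEnded; }

    @Override public void initialize() { calls.add("initialize"); }
    @Override public void requestCrawlStart() { calls.add("requestCrawlStart"); }
    @Override public boolean crawlIsEnded() { return pollsLeft-- <= 0; }
    @Override public void cleanup() { calls.add("cleanup"); }
}
```

A new controller instance would be created for each further crawl, since an instance is expected to handle only one.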
-
Method Summary
boolean atFinish()
Query whether Heritrix is in a state where it can finish crawling.
void beginCrawlStop()
Tell Heritrix to stop crawling.
void cleanup()
Release any resources kept by the class.
boolean crawlIsEnded()
Returns true if the crawl has ended, either because Heritrix finished or because we terminated it.
int getActiveToeCount()
Get the number of currently active ToeThreads (crawler threads).
int getCurrentProcessedKBPerSec()
Get an estimate of the rate, in KB per second, at which documents are currently being processed by the crawler.
String getHarvestInformation()
Get harvest information.
String getProgressStats()
Get a human-readable set of statistics on the progress of the crawl.
long getQueuedUriCount()
Get the number of URIs currently on the queue to be processed.
void initialize()
Initialize a new CrawlController for executing a Heritrix crawl.
boolean isPaused()
Returns true if the crawler has been paused and is thus not supposed to fetch anything.
void requestCrawlStart()
Request that Heritrix start crawling.
void requestCrawlStop(String reason)
Request that the crawler stop.
void stopHeritrix()
Stop the Heritrix process.
-
Method Detail
-
initialize
void initialize()
Initialize a new CrawlController for executing a Heritrix crawl. This does not start the crawl.
-
requestCrawlStart
void requestCrawlStart() throws IOFailure
Request that Heritrix start crawling. When this method returns, either Heritrix has failed in the early stages or the crawl job has been successfully created. Actual crawling will commence at some point thereafter.
- Throws:
IOFailure - If something goes wrong during startup.
-
beginCrawlStop
void beginCrawlStop()
Tell Heritrix to stop crawling. Heritrix may take a while to actually stop, so you cannot assume that crawling is stopped when this method returns.
-
requestCrawlStop
void requestCrawlStop(String reason)
Request that the crawler stop. This calls beginCrawlStop() unless the crawler is already shutting down, in which case it does nothing.
- Parameters:
reason - A human-readable reason the crawl is being stopped.
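One way an implementation could honour the "does nothing if already shutting down" rule is an atomic flag. This is a sketch under that assumption only; `StopOnceController`, `CountingController` and the `shuttingDown` and `stopReason` fields are hypothetical and are not taken from the implementing classes.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical base class showing the guard; not the real implementation.
abstract class StopOnceController {
    private final AtomicBoolean shuttingDown = new AtomicBoolean(false);
    private volatile String stopReason;

    abstract void beginCrawlStop();

    void requestCrawlStop(String reason) {
        // compareAndSet succeeds only for the first caller, so a second
        // request while shutting down falls through and does nothing.
        if (shuttingDown.compareAndSet(false, true)) {
            stopReason = reason;
            beginCrawlStop();
        }
    }

    String getStopReason() { return stopReason; }
}

// Hypothetical fake that counts how often the stop was actually begun.
final class CountingController extends StopOnceController {
    int beginCalls = 0;

    @Override void beginCrawlStop() { beginCalls++; }
}
```

With this guard, only the reason given by the first request is recorded; whether the real classes keep or log later reasons is not stated here.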
-
atFinish
boolean atFinish()
Query whether Heritrix is in a state where it can finish crawling. Returns true if no URIs remain to be harvested, or if the current harvest has reached its maxbytes limit, its document limit, or its time limit.
- Returns:
- True if Heritrix thinks it is time to stop crawling.
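A caller might combine atFinish() with crawlIsEnded() in a polling step like the one below. This is an illustrative sketch, not the project's monitoring loop: `FinishAwareCrawl` is a local stand-in for the three methods used, and `FinishMonitor` and `ScriptedCrawl` are hypothetical.

```java
// Local stand-in for the three IHeritrixController methods used here.
interface FinishAwareCrawl {
    boolean atFinish();
    boolean crawlIsEnded();
    void requestCrawlStop(String reason);
}

final class FinishMonitor {
    private FinishMonitor() {}

    // One polling step; returns true while the caller should keep polling.
    static boolean step(FinishAwareCrawl crawl) {
        if (crawl.crawlIsEnded()) {
            return false; // nothing left to monitor
        }
        if (crawl.atFinish()) {
            // Stopping is asynchronous, so keep polling until crawlIsEnded().
            crawl.requestCrawlStop("Heritrix reports that the crawl can finish");
        }
        return true;
    }
}

// Hypothetical fake whose state the caller sets directly.
final class ScriptedCrawl implements FinishAwareCrawl {
    boolean finishReached;
    boolean ended;
    int stopRequests = 0;

    @Override public boolean atFinish() { return finishReached; }
    @Override public boolean crawlIsEnded() { return ended; }
    @Override public void requestCrawlStop(String reason) { stopRequests++; }
}
```

Calling requestCrawlStop on every step is safe only because it is documented to do nothing once the crawler is already shutting down.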
-
crawlIsEnded
boolean crawlIsEnded()
Returns true if the crawl has ended, either because Heritrix finished or because we terminated it.
- Returns:
- True if the CrawlEnded event has happened in Heritrix, indicating that all crawls have stopped.
-
getActiveToeCount
int getActiveToeCount()
Get the number of currently active ToeThreads (crawler threads).
- Returns:
- Number of ToeThreads currently active within Heritrix.
-
getQueuedUriCount
long getQueuedUriCount()
Get the number of URIs currently on the queue to be processed. This number may not be exact and should only be used for informational purposes.
- Returns:
- How many URIs Heritrix has lined up for processing.
-
getCurrentProcessedKBPerSec
int getCurrentProcessedKBPerSec()
Get an estimate of the rate, in KB per second, at which documents are currently being processed by the crawler.
- Returns:
- Number of KB of data per second downloaded by Heritrix, measured over an unspecified interval ending now.
- See Also:
org.archive.crawler.framework.StatisticsTracking#currentProcessedKBPerSec()
-
getProgressStats
String getProgressStats()
Get a human-readable set of statistics on the progress of the crawl. The statistics are: discovered uris, queued uris, downloaded uris, doc/s(avg), KB/s(avg), dl-failures, busy-thread, mem-use-KB, heap-size-KB, congestion, max-depth, avg-depth. If no statistics are available, the string "No statistics available" is returned. Note: this method may disappear in the future.
- Returns:
- Some ASCII-formatted statistics on the progress of the crawl.
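Because the method signals missing data with a sentinel string rather than null or an exception, a caller may want to filter that value before logging. A minimal sketch, assuming the sentinel is exactly the text quoted above; `ProgressStatsFilter` and `usable` are hypothetical names, and treating null or blank input as "no statistics" is an extra assumption.

```java
import java.util.Optional;

final class ProgressStatsFilter {
    // Sentinel text quoted in the getProgressStats() description.
    static final String NO_STATS = "No statistics available";

    private ProgressStatsFilter() {}

    // Returns the statistics text, or empty if there is nothing worth logging.
    static Optional<String> usable(String raw) {
        if (raw == null) {
            return Optional.empty();
        }
        String trimmed = raw.trim();
        if (trimmed.isEmpty() || NO_STATS.equals(trimmed)) {
            return Optional.empty();
        }
        return Optional.of(raw);
    }
}
```

A caller could then write `ProgressStatsFilter.usable(controller.getProgressStats()).ifPresent(log::info)`, where `controller` and `log` are whatever instances the surrounding code already has.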
-
isPaused
boolean isPaused()
Returns true if the crawler has been paused and is thus not supposed to fetch anything. Heritrix may still be fetching, as it takes some time to enter full pause mode. This method can be used as an indicator that we should not be worried if Heritrix appears to be idle.
- Returns:
- True if the crawler has been paused, e.g. by using the Heritrix GUI.
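The "do not worry if paused" hint can be folded into an idleness check. The sketch below is only one possible heuristic: the stall criterion (no active ToeThreads while URIs are still queued) is an assumption, `CrawlActivity` is a local stand-in for the three methods used, and `StallCheck` and `FixedActivity` are hypothetical.

```java
// Local stand-in for the three IHeritrixController methods used here.
interface CrawlActivity {
    boolean isPaused();
    int getActiveToeCount();
    long getQueuedUriCount();
}

final class StallCheck {
    private StallCheck() {}

    // Assumed heuristic: idle threads with queued work, and not paused.
    static boolean looksStalled(CrawlActivity crawl) {
        if (crawl.isPaused()) {
            return false; // a paused crawler is expected to be idle
        }
        return crawl.getActiveToeCount() == 0 && crawl.getQueuedUriCount() > 0;
    }
}

// Hypothetical fake returning fixed values.
final class FixedActivity implements CrawlActivity {
    private final boolean paused;
    private final int activeToes;
    private final long queuedUris;

    FixedActivity(boolean paused, int activeToes, long queuedUris) {
        this.paused = paused;
        this.activeToes = activeToes;
        this.queuedUris = queuedUris;
    }

    @Override public boolean isPaused() { return paused; }
    @Override public int getActiveToeCount() { return activeToes; }
    @Override public long getQueuedUriCount() { return queuedUris; }
}
```

Since the queued URI count may not be exact, a result from a single check like this is better treated as a hint than as a trigger for terminating the crawl.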
-
cleanup
void cleanup()
Release any resources kept by the class.
-
getHarvestInformation
String getHarvestInformation()
Get harvest information. One example is a URL pointing to the GUI of a running Heritrix process.
- Returns:
- Information about the harvest process.
-
stopHeritrix
void stopHeritrix()
Stop the Heritrix process.
-