dk.netarkivet.harvester.harvesting
Class DirectHeritrixController

java.lang.Object
  extended by dk.netarkivet.harvester.harvesting.DirectHeritrixController
All Implemented Interfaces:
HeritrixController

Deprecated. The JMXHeritrixController offers an implementation that's better on almost all counts.

public class DirectHeritrixController
extends java.lang.Object
implements HeritrixController

This class encapsulates one full run of Heritrix by holding on to a CrawlController object. It implements the HeritrixController interface.


Nested Class Summary
(package private)  class DirectHeritrixController.SimpleCrawlStatusListener
          Deprecated. Class for handling callbacks from Heritrix.
 
Field Summary
(package private)  org.archive.crawler.framework.CrawlController myController
          Deprecated. The controller object, which initializes, starts, and stops a Heritrix crawl job.
 
Constructor Summary
protected DirectHeritrixController(HeritrixFiles files)
          Deprecated. Create a new DirectHeritrixController object with a given set of files.
 
Method Summary
 void addCrawlStatusListener(org.archive.crawler.event.CrawlStatusListener listener)
          Deprecated. Add a listener to this crawlController.
 boolean atFinish()
          Deprecated. Query whether Heritrix is in a state where it can finish crawling.
 void beginCrawlStop()
          Deprecated. Tell Heritrix to stop crawling.
 void cleanup()
          Deprecated. Release any resources kept by the class.
 boolean crawlIsEnded()
          Deprecated. Returns true if the crawl has ended, either because Heritrix finished or because we terminated it.
 int getActiveToeCount()
          Deprecated. Get the number of currently active ToeThreads (crawler threads).
 int getCurrentProcessedKBPerSec()
          Deprecated. Get an estimate of the rate, in KB per second, at which documents are currently being processed by the crawler.
 java.lang.String getProgressStats()
          Deprecated. Get a human-readable set of statistics on the progress of the crawl.
 long getQueuedUriCount()
          Deprecated. Get the number of URIs currently on the queue to be processed.
 void initialize()
          Deprecated. Initialize a new CrawlController for executing a Heritrix crawl.
 boolean isPaused()
          Deprecated. Returns true if the crawler has been paused and is thus not supposed to fetch anything.
 void requestCrawlStart()
          Deprecated. Request that Heritrix start crawling.
 void requestCrawlStop(java.lang.String reason)
          Deprecated. Request that crawling stops.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

myController

org.archive.crawler.framework.CrawlController myController
Deprecated. 
The controller object, which initializes, starts, and stops a Heritrix crawl job.

Constructor Detail

DirectHeritrixController

protected DirectHeritrixController(HeritrixFiles files)
Deprecated. 
Create a new DirectHeritrixController object with a given set of files.

Parameters:
files - Files for Heritrix to use.
Method Detail

initialize

public void initialize()
Deprecated. 
Description copied from interface: HeritrixController
Initialize a new CrawlController for executing a Heritrix crawl. This does not start the crawl.

Specified by:
initialize in interface HeritrixController
See Also:
HeritrixController.initialize()

requestCrawlStart

public void requestCrawlStart()
Deprecated. 
Description copied from interface: HeritrixController
Request that Heritrix start crawling. When this method returns, either Heritrix has failed in the early stages, or the crawl job has been successfully created. Actual crawling will commence at some point after that.

Specified by:
requestCrawlStart in interface HeritrixController
See Also:
HeritrixController.requestCrawlStart()

atFinish

public boolean atFinish()
Deprecated. 
Description copied from interface: HeritrixController
Query whether Heritrix is in a state where it can finish crawling. Returns true if no URIs remain to be harvested, or if the current harvest has reached its maxbytes limit, its document limit, or its time limit.

Specified by:
atFinish in interface HeritrixController
Returns:
True if Heritrix thinks it is time to stop crawling.
See Also:
HeritrixController.atFinish()

beginCrawlStop

public void beginCrawlStop()
Deprecated. 
Description copied from interface: HeritrixController
Tell Heritrix to stop crawling. Heritrix may take a while to actually stop, so you cannot assume that crawling is stopped when this method returns.

Specified by:
beginCrawlStop in interface HeritrixController
See Also:
HeritrixController.beginCrawlStop()

getActiveToeCount

public int getActiveToeCount()
Deprecated. 
Description copied from interface: HeritrixController
Get the number of currently active ToeThreads (crawler threads).

Specified by:
getActiveToeCount in interface HeritrixController
Returns:
Number of ToeThreads currently active within Heritrix.
See Also:
HeritrixController.getActiveToeCount()

requestCrawlStop

public void requestCrawlStop(java.lang.String reason)
Deprecated. 
Description copied from interface: HeritrixController
Request that crawling stops. This makes a call to beginCrawlStop(), unless the crawler is already shutting down. In that case it does nothing.

Specified by:
requestCrawlStop in interface HeritrixController
Parameters:
reason - A human-readable reason the crawl is being stopped.
See Also:
HeritrixController.requestCrawlStop(String)
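
The "does nothing if already shutting down" behaviour described above can be sketched as follows. This is a minimal illustration, not the NetarchiveSuite code: the class StopOnceSketch, its AtomicBoolean flag, and the counting beginCrawlStop() stand-in are all assumptions made for the example.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical stand-in; NOT the real DirectHeritrixController. */
public class StopOnceSketch {
    /** Set once the first stop request has been accepted. */
    private final AtomicBoolean stopping = new AtomicBoolean(false);
    private int beginCrawlStopCalls = 0;
    private String stopReason = null;

    /** Stand-in for beginCrawlStop(): here it only counts invocations. */
    public void beginCrawlStop() {
        beginCrawlStopCalls++;
    }

    /** Forwards to beginCrawlStop() on the first call; later calls do nothing. */
    public void requestCrawlStop(String reason) {
        if (stopping.compareAndSet(false, true)) {
            stopReason = reason;
            beginCrawlStop();
        }
    }

    public int getBeginCrawlStopCalls() {
        return beginCrawlStopCalls;
    }

    public String getStopReason() {
        return stopReason;
    }

    public static void main(String[] args) {
        StopOnceSketch c = new StopOnceSketch();
        c.requestCrawlStop("time limit reached");
        c.requestCrawlStop("operator request"); // ignored: already stopping
        // prints "1 time limit reached"
        System.out.println(c.getBeginCrawlStopCalls() + " " + c.getStopReason());
    }
}
```

Using compareAndSet keeps the check and the state change atomic, so two threads requesting a stop at the same time cannot both reach beginCrawlStop().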

addCrawlStatusListener

public void addCrawlStatusListener(org.archive.crawler.event.CrawlStatusListener listener)
Deprecated. 
Add a listener to this crawlController. This is currently only needed to know when the crawler has finished.

Parameters:
listener - The listener for crawlstatus messages.
See Also:
HeritrixController.crawlIsEnded()

getQueuedUriCount

public long getQueuedUriCount()
Deprecated. 
Description copied from interface: HeritrixController
Get the number of URIs currently on the queue to be processed. This number may not be exact and should only be used for informational purposes.

Specified by:
getQueuedUriCount in interface HeritrixController
Returns:
How many URIs Heritrix has lined up for processing.

getCurrentProcessedKBPerSec

public int getCurrentProcessedKBPerSec()
Deprecated. 
Description copied from interface: HeritrixController
Get an estimate of the rate, in KB per second, at which documents are currently being processed by the crawler.

Specified by:
getCurrentProcessedKBPerSec in interface HeritrixController
Returns:
An estimate of the number of KB of data per second processed by Heritrix, measured over an unspecified interval up to now.
See Also:
HeritrixController.getCurrentProcessedKBPerSec()

getProgressStats

public java.lang.String getProgressStats()
Deprecated. 
Description copied from interface: HeritrixController
Get a human-readable set of statistics on the progress of the crawl. The statistics are: discovered uris, queued uris, downloaded uris, doc/s(avg), KB/s(avg), dl-failures, busy-thread, mem-use-KB, heap-size-KB, congestion, max-depth, avg-depth. If no statistics are available, the string "No statistics available" is returned. Note: this method may disappear in the future.

Specified by:
getProgressStats in interface HeritrixController
Returns:
Some ascii-formatted statistics on the progress of the crawl.
See Also:
HeritrixController.getProgressStats()
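
A caller will usually want to tell real statistics apart from the "No statistics available" placeholder. The sketch below shows one way to do that. The class ProgressStatsSketch and its methods are hypothetical helpers, and the code deliberately makes no assumption about the layout of the statistics string beyond the placeholder text quoted above.

```java
import java.util.Optional;

/** Hypothetical helper; not part of NetarchiveSuite. */
public class ProgressStatsSketch {
    /** Placeholder text documented for getProgressStats(). */
    static final String NO_STATS = "No statistics available";

    /** Returns the trimmed statistics, or empty if there is nothing usable. */
    public static Optional<String> usableStats(String raw) {
        if (raw == null) {
            return Optional.empty();
        }
        String trimmed = raw.trim();
        if (trimmed.isEmpty() || NO_STATS.equals(trimmed)) {
            return Optional.empty();
        }
        return Optional.of(trimmed);
    }

    /** Builds a log line that is safe to emit whether or not statistics exist. */
    public static String logLine(String raw) {
        return usableStats(raw)
                .map(s -> "Heritrix progress: " + s)
                .orElse("Heritrix progress: (none yet)");
    }

    public static void main(String[] args) {
        System.out.println(logLine("No statistics available"));
        System.out.println(logLine(null));
    }
}
```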

isPaused

public boolean isPaused()
Deprecated. 
Description copied from interface: HeritrixController
Returns true if the crawler has been paused and is thus not supposed to fetch anything. Heritrix may still be fetching, as it takes some time for it to go into full pause mode. This method can be used as an indicator that we should not be worried if Heritrix appears to be idle.

Specified by:
isPaused in interface HeritrixController
Returns:
True if the crawler has been paused, e.g. by using the Heritrix GUI.
See Also:
HeritrixController.isPaused()

crawlIsEnded

public boolean crawlIsEnded()
Deprecated. 
Returns true if the crawl has ended, either because Heritrix finished or because we terminated it. This implementation starts returning true once the CrawlController has ended a crawl and is about to exit, that is, when it sends crawlEnded(String sExitMessage) to all listeners.

Specified by:
crawlIsEnded in interface HeritrixController
Returns:
True if Heritrix is entirely done and cleanup can start.
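
The listener-driven flag described above can be sketched as follows. This is an illustration under assumptions: EndListener is a hypothetical cut-down interface with only the crawlEnded(String) callback, and fireCrawlEnded() merely simulates what the Heritrix CrawlController would do on exit.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; NOT the real DirectHeritrixController. */
public class CrawlEndedSketch {
    /** Hypothetical cut-down listener with a single callback. */
    public interface EndListener {
        void crawlEnded(String sExitMessage);
    }

    private final List<EndListener> listeners = new ArrayList<>();
    /** volatile: written by the crawler thread, read by the polling thread. */
    private volatile boolean ended = false;

    public CrawlEndedSketch() {
        // Internal listener that flips the flag when the crawl ends.
        listeners.add(msg -> ended = true);
    }

    public void addListener(EndListener listener) {
        listeners.add(listener);
    }

    public boolean crawlIsEnded() {
        return ended;
    }

    /** Simulates the crawler notifying all listeners just before it exits. */
    public void fireCrawlEnded(String sExitMessage) {
        for (EndListener listener : listeners) {
            listener.crawlEnded(sExitMessage);
        }
    }

    public static void main(String[] args) {
        CrawlEndedSketch c = new CrawlEndedSketch();
        System.out.println(c.crawlIsEnded()); // false before the callback
        c.fireCrawlEnded("Finished");
        System.out.println(c.crawlIsEnded()); // true after the callback
    }
}
```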

cleanup

public void cleanup()
Deprecated. 
Description copied from interface: HeritrixController
Release any resources kept by the class.

Specified by:
cleanup in interface HeritrixController
See Also:
HeritrixController.cleanup()
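
Taken together, the methods above suggest a calling order: initialize, start, poll until finished, stop, wait for the end, clean up. The sketch below illustrates that order only. The Controller interface is a hypothetical subset of the documented method names, FakeController is a test double, and runCrawl with its polling interval is an assumption rather than the way NetarchiveSuite drives Heritrix.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical lifecycle sketch; not NetarchiveSuite code. */
public class LifecycleSketch {
    /** Hypothetical subset of the methods documented on this page. */
    public interface Controller {
        void initialize();
        void requestCrawlStart();
        boolean atFinish();
        void beginCrawlStop();
        boolean crawlIsEnded();
        void cleanup();
    }

    /** Test double: atFinish() is true on the third poll; the crawl ends two polls after a stop. */
    public static class FakeController implements Controller {
        public final List<String> calls = new ArrayList<>();
        private int finishPolls = 0;
        private int endPolls = 0;
        private boolean stopping = false;

        public void initialize() { calls.add("initialize"); }
        public void requestCrawlStart() { calls.add("requestCrawlStart"); }
        public boolean atFinish() { return ++finishPolls >= 3; }
        public void beginCrawlStop() { stopping = true; calls.add("beginCrawlStop"); }
        public boolean crawlIsEnded() { return stopping && ++endPolls >= 2; }
        public void cleanup() { calls.add("cleanup"); }
    }

    /** Sleeps for the given time; returns false if the thread was interrupted. */
    private static boolean pause(long millis) {
        try {
            Thread.sleep(millis);
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    /** Drives one crawl from initialize() to cleanup(). */
    public static void runCrawl(Controller c, long pollMillis) {
        c.initialize();
        try {
            c.requestCrawlStart();
            // Poll until Heritrix reports that it can finish, or has ended by itself.
            while (!c.crawlIsEnded()) {
                if (c.atFinish()) {
                    c.beginCrawlStop();
                    break;
                }
                if (!pause(pollMillis)) {
                    return;
                }
            }
            // Stopping is asynchronous, so wait for the crawl to actually end.
            while (!c.crawlIsEnded()) {
                if (!pause(pollMillis)) {
                    return;
                }
            }
        } finally {
            c.cleanup(); // always release resources
        }
    }

    public static void main(String[] args) {
        FakeController fake = new FakeController();
        runCrawl(fake, 1L);
        // prints [initialize, requestCrawlStart, beginCrawlStop, cleanup]
        System.out.println(fake.calls);
    }
}
```

The second loop reflects the note on beginCrawlStop(): crawling cannot be assumed to have stopped when that method returns, so cleanup() is delayed until crawlIsEnded() reports true.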