BnfHeritrixController

dk.netarkivet.harvester.harvesting.controller
Class BnfHeritrixController

java.lang.Object
  dk.netarkivet.harvester.harvesting.controller.AbstractJMXHeritrixController
      dk.netarkivet.harvester.harvesting.controller.BnfHeritrixController

All Implemented Interfaces: HeritrixController

public class BnfHeritrixController
extends AbstractJMXHeritrixController

This implementation of the HeritrixController interface starts Heritrix as a separate process and uses JMX to communicate with it. Each instance executes exactly one process that runs exactly one crawl job.
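The one-process, one-job lifecycle can be sketched as follows. This is a minimal illustration, not NetarchiveSuite code: `CrawlLifecycle` is a hypothetical stand-in that mirrors a subset of the method names documented on this page, `FakeCrawl` replaces the real Heritrix process, and the poll interval is an assumption. The real class additionally needs a HeritrixFiles instance, which is not modeled here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in mirroring a subset of the documented method names.
// It is NOT the real dk.netarkivet HeritrixController interface.
interface CrawlLifecycle {
    void initialize();
    void requestCrawlStart();
    boolean crawlIsEnded();
    void cleanup();
}

// Fake controller that records calls and reports "ended" after a few polls.
class FakeCrawl implements CrawlLifecycle {
    final List<String> calls = new ArrayList<>();
    private int pollsUntilEnd;

    FakeCrawl(int pollsUntilEnd) { this.pollsUntilEnd = pollsUntilEnd; }

    public void initialize() { calls.add("initialize"); }
    public void requestCrawlStart() { calls.add("requestCrawlStart"); }
    public boolean crawlIsEnded() { return pollsUntilEnd-- <= 0; }
    public void cleanup() { calls.add("cleanup"); }
}

class LifecycleSketch {
    // Runs one crawl job; cleanup is attempted even if start or polling fails.
    static void runOneJob(CrawlLifecycle c, long pollMillis) {
        c.initialize();
        try {
            c.requestCrawlStart();
            while (!c.crawlIsEnded()) {
                try {
                    Thread.sleep(pollMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    break; // stop waiting, but still clean up
                }
            }
        } finally {
            c.cleanup();
        }
    }
}
```

Because each instance is tied to exactly one process and one job, a caller would create a fresh controller per job rather than reuse one.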

Constructor Summary
`BnfHeritrixController(HeritrixFiles files)` Create a BnfHeritrixController object.

Method Summary
`boolean`	`atFinish()` Query whether Heritrix is in a state where it can finish crawling.
`void`	`beginCrawlStop()` Tell Heritrix to stop crawling.
`void`	`cleanup()` Release any resources kept by the class.
`void`	`cleanup(java.io.File crawlDir)` Clean up after a Heritrix process.
`boolean`	`crawlIsEnded()` Returns true if the crawl has ended, either because Heritrix finished or because we terminated it.
`int`	`getActiveToeCount()` Get the number of currently active ToeThreads (crawler threads).
`java.lang.String`	`getAdminInterfaceUrl()` Return the URL for monitoring this instance.
`CrawlProgressMessage`	`getCrawlProgress()` Gets a message that stores the information summarizing the crawl progress.
`int`	`getCurrentProcessedKBPerSec()` Get an estimate of the rate, in KB per second, at which documents are currently being processed by the crawler.
`FullFrontierReport`	`getFullFrontierReport()` Generates a full frontier report.
`java.lang.String`	`getHarvestInformation()` Get harvest information.
`java.lang.String`	`getHeritrixConsoleURL()` Return the URL for monitoring this instance.
`java.lang.String`	`getProgressStats()` Get a human-readable set of statistics on the progress of the crawl.
`long`	`getQueuedUriCount()` Get the number of URIs currently on the queue to be processed.
`void`	`initialize()` Initialize a new CrawlController for executing a Heritrix crawl.
`boolean`	`isPaused()` Returns true if the crawler has been paused, and is thus not supposed to fetch anything.
`void`	`requestCrawlStart()` Request that Heritrix start crawling.
`void`	`requestCrawlStop(java.lang.String reason)` Request that crawling stops.

Methods inherited from class dk.netarkivet.harvester.harvesting.controller.AbstractJMXHeritrixController
`getGuiPort, getHeritrixFiles, getHostName, getJmxPort, getJobDescription, processHasExited, toString, waitForHeritrixProcessExit`

Methods inherited from class java.lang.Object
`clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait`

Constructor Detail

BnfHeritrixController

public BnfHeritrixController(HeritrixFiles files)

Create a BnfHeritrixController object.

Parameters: files - Files that are used to set up Heritrix.

Method Detail

initialize

public void initialize()

Description copied from interface: HeritrixController

Initialize a new CrawlController for executing a Heritrix crawl. This does not start the crawl.

Throws: IOFailure - If Heritrix dies before initialization, or we encounter any problems during the initialization.
See Also: HeritrixController.initialize()

requestCrawlStart

public void requestCrawlStart()

Description copied from interface: HeritrixController

Request that Heritrix start crawling. When this method returns, either Heritrix has failed in the early stages, or the crawl job has been successfully created. Actual crawling will commence at some point thereafter.

Throws: IOFailure - If unable to communicate with Heritrix.
See Also: HeritrixController.requestCrawlStart()

requestCrawlStop

public void requestCrawlStop(java.lang.String reason)

Description copied from interface: HeritrixController

Request that crawling stops. This makes a call to beginCrawlStop(), unless the crawler is already shutting down. In that case it does nothing.

Parameters: reason - A human-readable reason the crawl is being stopped.
See Also: HeritrixController.requestCrawlStop(String)
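The "stop once, then do nothing" behavior described above can be sketched as below. This is an illustration under assumptions, not the actual implementation: the class name, the `shuttingDown` flag, and the call counter are all hypothetical, and `beginCrawlStop()` here is only a stand-in for the documented method.

```java
// Hypothetical sketch; names and fields are not taken from NetarchiveSuite.
class StopOnceSketch {
    private boolean shuttingDown = false;
    int beginCrawlStopCalls = 0;
    String lastReason = null;

    // Stand-in for the documented beginCrawlStop().
    void beginCrawlStop() {
        beginCrawlStopCalls++;
    }

    // Delegates to beginCrawlStop() on the first call; later calls are no-ops.
    synchronized void requestCrawlStop(String reason) {
        if (shuttingDown) {
            return;
        }
        shuttingDown = true;
        lastReason = reason;
        beginCrawlStop();
    }
}
```

With this shape, several components can request a stop without coordinating; only the first reason is recorded.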

getHeritrixConsoleURL

public java.lang.String getHeritrixConsoleURL()

Return the URL for monitoring this instance.

Returns: the URL for monitoring this instance.

cleanup

public void cleanup(java.io.File crawlDir)

Clean up after a Heritrix process. This entails sending the shutdown command to the Heritrix process and killing it forcefully if it is still alive after the period of time specified by the CommonSettings.PROCESS_TIMEOUT setting.

See Also: HeritrixController.cleanup()
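The shutdown-then-kill sequence can be sketched as follows. `ManagedProcess`, `FakeProcess`, and `ShutdownSketch` are hypothetical names introduced for illustration, and the timeout is passed in as a plain millisecond value rather than read from CommonSettings.PROCESS_TIMEOUT. With a real `java.lang.Process`, `waitFor(long, TimeUnit)` and `destroyForcibly()` could play the roles of `waitForExit` and `killForcibly`.

```java
// Hypothetical abstraction over a child process; not part of NetarchiveSuite.
interface ManagedProcess {
    void sendShutdownCommand();
    boolean waitForExit(long timeoutMillis); // true if the process exited in time
    void killForcibly();
}

// Fake process used only to exercise the sketch.
class FakeProcess implements ManagedProcess {
    private final boolean exitsOnRequest;
    boolean shutdownSent = false;
    boolean killed = false;

    FakeProcess(boolean exitsOnRequest) { this.exitsOnRequest = exitsOnRequest; }

    public void sendShutdownCommand() { shutdownSent = true; }
    public boolean waitForExit(long timeoutMillis) { return shutdownSent && exitsOnRequest; }
    public void killForcibly() { killed = true; }
}

class ShutdownSketch {
    // Returns true if the process exited on its own, false if it had to be killed.
    static boolean shutDown(ManagedProcess p, long timeoutMillis) {
        p.sendShutdownCommand();
        if (p.waitForExit(timeoutMillis)) {
            return true;
        }
        p.killForcibly();
        return false;
    }
}
```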

getAdminInterfaceUrl

public java.lang.String getAdminInterfaceUrl()

Return the URL for monitoring this instance.

Returns: the URL for monitoring this instance.

getCrawlProgress

public CrawlProgressMessage getCrawlProgress()

Gets a message that stores the information summarizing the crawl progress.

Returns: a message that stores the information summarizing the crawl progress.

getFullFrontierReport

public FullFrontierReport getFullFrontierReport()

Generates a full frontier report.

atFinish

public boolean atFinish()

Description copied from interface: HeritrixController

Query whether Heritrix is in a state where it can finish crawling. Returns true if no URIs remain to be harvested, or if it has met the maxbytes limit, the document limit, or the time limit for the current harvest.

Returns: True if Heritrix thinks it is time to stop crawling.

beginCrawlStop

public void beginCrawlStop()

Description copied from interface: HeritrixController

Tell Heritrix to stop crawling. Heritrix may take a while to actually stop, so you cannot assume that crawling is stopped when this method returns.
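Since the stop is asynchronous, a caller typically has to poll for completion. The helper below is a generic sketch, assuming a bounded wait is acceptable; `StopWaitSketch`, `awaitEnd`, and both timing parameters are hypothetical. A method reference to a `boolean crawlIsEnded()` method would fit the `BooleanSupplier` parameter.

```java
import java.util.function.BooleanSupplier;

class StopWaitSketch {
    // Polls `ended` until it returns true or `timeoutMillis` has elapsed.
    // Returns true if the condition was observed, false on timeout or interrupt.
    static boolean awaitEnd(BooleanSupplier ended, long timeoutMillis, long pollMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (!ended.getAsBoolean()) {
            if (System.nanoTime() - deadline >= 0) {
                return false; // gave up waiting
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;
    }
}
```

A false result does not mean the crawl is still running forever, only that it had not ended within the chosen window.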

cleanup

public void cleanup()

Description copied from interface: HeritrixController

Release any resources kept by the class.

crawlIsEnded

public boolean crawlIsEnded()

Description copied from interface: HeritrixController

Returns true if the crawl has ended, either because Heritrix finished or because we terminated it.

Returns: True if the CrawlEnded event has happened in Heritrix, indicating that all crawls have stopped.

getActiveToeCount

public int getActiveToeCount()

Description copied from interface: HeritrixController

Get the number of currently active ToeThreads (crawler threads).

Returns: Number of ToeThreads currently active within Heritrix.

getCurrentProcessedKBPerSec

public int getCurrentProcessedKBPerSec()

Description copied from interface: HeritrixController

Get an estimate of the rate, in KB per second, at which documents are currently being processed by the crawler.

Returns: Number of KB of data downloaded by Heritrix over an undefined interval up to now.
See Also: StatisticsTracking.currentProcessedKBPerSec()

getHarvestInformation

public java.lang.String getHarvestInformation()

Description copied from interface: HeritrixController

Get harvest information. An example of this can be a URL pointing to the GUI of a running Heritrix process.

Returns: information about the harvest process.

getProgressStats

public java.lang.String getProgressStats()

Description copied from interface: HeritrixController

Get a human-readable set of statistics on the progress of the crawl. The statistics are: discovered uris, queued uris, downloaded uris, doc/s(avg), KB/s(avg), dl-failures, busy-thread, mem-use-KB, heap-size-KB, congestion, max-depth, avg-depth. If no statistics are available, the string "No statistics available" is returned. Note: this method may disappear in the future.

Returns: ASCII-formatted statistics on the progress of the crawl.
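Callers that log or display this string may want to treat the documented sentinel separately from real statistics. The helper below is a small sketch; `ProgressStatsSketch` and `usableStats` are hypothetical names, and the null check is a defensive assumption rather than documented behavior. The layout of the statistics themselves is not parsed, since it is not specified here.

```java
import java.util.Optional;

class ProgressStatsSketch {
    // Sentinel string quoted in the method description above.
    static final String NO_STATS = "No statistics available";

    // Returns empty when no statistics were reported (sentinel or null input).
    static Optional<String> usableStats(String raw) {
        if (raw == null || raw.trim().equals(NO_STATS)) {
            return Optional.empty();
        }
        return Optional.of(raw);
    }
}
```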

getQueuedUriCount

public long getQueuedUriCount()

Description copied from interface: HeritrixController

Get the number of URIs currently on the queue to be processed. This number may not be exact and should only be used for informational purposes.

Returns: How many URIs Heritrix has lined up for processing.

isPaused

public boolean isPaused()

Description copied from interface: HeritrixController

Returns true if the crawler has been paused, and is thus not supposed to fetch anything. Heritrix may still be fetching for a short while, as it takes some time for it to go into full pause mode. This method can be used as an indicator that we should not be worried if Heritrix appears to be idle.

Returns: True if the crawler has been paused, e.g. by using the Heritrix GUI.
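One way to use the pause flag as described is to combine it with the active thread count when deciding whether an idle crawler deserves attention. This is a sketch of that idea only; `IdleCheckSketch` and `looksStalled` are hypothetical, and treating zero active ToeThreads as "idle" is an assumption.

```java
class IdleCheckSketch {
    // An idle crawler is only suspicious when it has not been paused on purpose.
    static boolean looksStalled(int activeToeCount, boolean paused) {
        boolean idle = activeToeCount == 0;
        return idle && !paused;
    }
}
```

The two arguments would come from calls such as getActiveToeCount() and isPaused() on the controller.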
