dk.netarkivet.harvester.heritrix3 (NetarchiveSuite 7.1 API)

This module handles the definition, scheduling, and execution of harvests.

Harvesting uses the Heritrix crawler developed by the Internet Archive. The harvesting module allows flexible, automated definition of harvests. Given adequate knowledge of Heritrix, the system gives access to the full power of the crawler. NetarchiveSuite wraps the crawler in an easy-to-use interface that handles the scheduling and configuration of crawls and distributes them to several crawling servers.
The harvester module supports de-duplication: it uses an index of URLs already crawled and stored in the archive to avoid storing the same content more than once. This function uses the de-duplicator module from Kristinn Sigurdsson.
The harvester module supports packaging metadata about the harvest together with the harvested data.
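
The de-duplication idea described above can be illustrated with a minimal sketch. It is not the de-duplicator module's actual API: the class DedupIndexSketch, its method names, and the choice to compare content digests per URL are all assumptions made for illustration. The only point it shows is that a lookup in an index of previously archived URLs decides whether newly fetched content needs to be stored again.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical illustration only; not part of NetarchiveSuite or the
 * de-duplicator module. Assumes the index maps each archived URL to a
 * digest of the content that was stored for it.
 */
class DedupIndexSketch {
    // URL -> digest of the content already stored in the archive (assumed layout).
    private final Map<String, String> digestByUrl = new HashMap<>();

    /** Records that the given URL was archived with the given content digest. */
    void recordArchived(String url, String digest) {
        digestByUrl.put(url, digest);
    }

    /** Returns true if this URL is already archived with identical content. */
    boolean isDuplicate(String url, String digest) {
        return digest.equals(digestByUrl.get(url));
    }
}
```

In this sketch a crawler would call isDuplicate before writing a fetched document and skip the write when it returns true; a changed digest or an unseen URL is stored as usual.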

Class Summary
Class	Description
BlockingCommandLauncher
Constants	Constants for heritrix3-controller module.
HarvestControllerApplication	This application controls the Heritrix3 harvester which does the actual harvesting, and is also responsible for uploading the harvested data to the ArcRepository.
HarvestControllerServer	This class responds to JMS doOneCrawl messages from the HarvestScheduler and launches a Heritrix crawl with the received job description.
HarvestDocumentation	This class contains code for documenting an H3 harvest.
HarvestJob
Heritrix3Files	This class encapsulates the information generated by Heritrix3 or delivered to Heritrix3 before a crawl.
Heritrix3Settings	Settings specific to the heritrix3 harvester module of NetarchiveSuite.
HeritrixLauncherAbstract	A HeritrixLauncher object wraps around an instance of the web crawler Heritrix3.
HeritrixLauncherFactory	Factory class for instantiating a specific implementation of `HeritrixLauncherAbstract`.
IngestableFiles	Encapsulation of files to be ingested into the archive.
PostProcessing
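
The factory entry in the table above describes choosing a concrete launcher implementation at runtime. The following is a generic sketch of that pattern, not the actual HeritrixLauncherFactory: the names LauncherFactorySketch, Launcher, DefaultLauncher, DryRunLauncher, and the string keys are hypothetical, and the real class may select its implementation differently (for example from settings).

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical illustration of a factory that picks a launcher implementation by key. */
class LauncherFactorySketch {
    /** Hypothetical stand-in for an abstract launcher base class. */
    abstract static class Launcher {
        abstract String describe();
    }

    /** Hypothetical implementation that would start a real crawl. */
    static class DefaultLauncher extends Launcher {
        @Override
        String describe() {
            return "default";
        }
    }

    /** Hypothetical implementation that would only simulate a crawl. */
    static class DryRunLauncher extends Launcher {
        @Override
        String describe() {
            return "dry-run";
        }
    }

    // Registry of known implementations; the keys are assumptions for this sketch.
    private static final Map<String, Supplier<Launcher>> REGISTRY = new HashMap<>();

    static {
        REGISTRY.put("default", DefaultLauncher::new);
        REGISTRY.put("dry-run", DryRunLauncher::new);
    }

    /** Creates a new launcher for the given key, or fails if the key is unknown. */
    static Launcher create(String key) {
        Supplier<Launcher> supplier = REGISTRY.get(key);
        if (supplier == null) {
            throw new IllegalArgumentException("Unknown launcher: " + key);
        }
        return supplier.get();
    }
}
```

Callers depend only on the abstract type, so a different implementation can be swapped in without changing the code that starts a crawl.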
