Package dk.netarkivet.harvester.heritrix3
This module handles the definition, scheduling, and execution of harvests.
- Harvesting uses the Heritrix crawler developed by the Internet Archive. The harvesting module allows flexible, automated definition of harvests. The system exposes the full power of the Heritrix crawler to users who have adequate knowledge of it. NetarchiveSuite wraps the crawler in an easy-to-use interface that handles the scheduling and configuration of crawls and distributes them to several crawling servers.
- The harvester module supports de-duplication: it uses an index of URLs already crawled and stored in the archive to avoid storing the same content more than once. This function uses the de-duplicator module from Kristinn Sigurdsson.
- The harvester module supports packaging metadata about the harvest together with the harvested data.
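The de-duplication idea described above can be sketched in a few lines. This is only an illustration under stated assumptions: the class `DedupSketch`, its methods, and the in-memory map are hypothetical and are not the API of NetarchiveSuite or of the de-duplicator module. Comparing a content digest alongside the URL is also an assumption made here so that changed content at a known URL is still stored.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration; not part of NetarchiveSuite or the de-duplicator module. */
public class DedupSketch {
    /** Maps a previously archived URL to the digest of the content stored for it. */
    private final Map<String, String> index = new HashMap<>();

    /** Records that the given URL is already in the archive with the given content digest. */
    public void recordArchived(String url, String digest) {
        index.put(url, digest);
    }

    /**
     * Returns true if freshly fetched content should be written to the archive,
     * i.e. the URL is unknown or its content digest differs from the indexed one.
     */
    public boolean shouldStore(String url, String digest) {
        String known = index.get(url);
        return known == null || !known.equals(digest);
    }

    public static void main(String[] args) {
        DedupSketch dedup = new DedupSketch();
        dedup.recordArchived("http://example.org/logo.png", "sha1:AAA");

        // Same URL, same digest: a duplicate, so it is skipped.
        System.out.println(dedup.shouldStore("http://example.org/logo.png", "sha1:AAA"));
        // Same URL, different digest: the content changed, so it is stored.
        System.out.println(dedup.shouldStore("http://example.org/logo.png", "sha1:BBB"));
        // Unknown URL: stored.
        System.out.println(dedup.shouldStore("http://example.org/new.html", "sha1:CCC"));
    }
}
```

A production index would be persistent and built from earlier harvests rather than held in memory; the map here only stands in for that lookup.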
Class Summary
- BlockingCommandLauncher
- Constants: Constants for heritrix3-controller module.
- HarvestControllerApplication: This application controls the Heritrix3 harvester which does the actual harvesting, and is also responsible for uploading the harvested data to the ArcRepository.
- HarvestControllerServer: This class responds to JMS doOneCrawl messages from the HarvestScheduler and launches a Heritrix crawl with the received job description.
- HarvestDocumentation: This class contains code for documenting a H3 harvest.
- HarvestJob
- Heritrix3Files: This class encapsulates the information generated by Heritrix3 or delivered to Heritrix3 before a crawl.
- Heritrix3Settings: Settings specific to the heritrix3 harvester module of NetarchiveSuite.
- HeritrixLauncherAbstract: A HeritrixLauncher object wraps around an instance of the web crawler Heritrix3.
- HeritrixLauncherFactory: Factory class for instantiating a specific implementation of HeritrixLauncherAbstract.
- IngestableFiles: Encapsulation of files to be ingested into the archive.
- PostProcessing
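To illustrate the relationship between an abstract launcher and a factory that picks a concrete implementation, here is a generic sketch of the factory pattern. Every name in it (`LauncherFactorySketch`, `Launcher`, `DefaultLauncher`, `DryRunLauncher`, and the string keys) is hypothetical; it does not reproduce the signatures or the selection mechanism of HeritrixLauncherFactory or HeritrixLauncherAbstract.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical illustration of a factory over an abstract launcher type. */
public class LauncherFactorySketch {

    /** Hypothetical stand-in for an abstract launcher base class. */
    abstract static class Launcher {
        abstract String describe();
    }

    /** First hypothetical implementation. */
    static class DefaultLauncher extends Launcher {
        @Override
        String describe() {
            return "default";
        }
    }

    /** Second hypothetical implementation. */
    static class DryRunLauncher extends Launcher {
        @Override
        String describe() {
            return "dry-run";
        }
    }

    /** Registry from a configuration key to a constructor for the matching implementation. */
    private static final Map<String, Supplier<Launcher>> REGISTRY = new HashMap<>();

    static {
        REGISTRY.put("default", DefaultLauncher::new);
        REGISTRY.put("dry-run", DryRunLauncher::new);
    }

    /** Creates a new launcher for the given key, or fails if the key is not registered. */
    static Launcher create(String kind) {
        Supplier<Launcher> constructor = REGISTRY.get(kind);
        if (constructor == null) {
            throw new IllegalArgumentException("Unknown launcher kind: " + kind);
        }
        return constructor.get();
    }

    public static void main(String[] args) {
        System.out.println(create("default").describe());
        System.out.println(create("dry-run").describe());
    }
}
```

Callers depend only on the abstract type, so the concrete implementation can be changed through configuration without touching the calling code.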