Netarchive Suite

The NetarchiveSuite is a toolset for managing web harvests of subparts of the Internet using the Heritrix web crawler (http://crawler.archive.org). For further information about crawling the web, see http://en.wikipedia.org/wiki/Web_crawling. We use Heritrix 1.10.1 as our web crawler; other versions of Heritrix are presently not supported. In the future, other crawlers may be supported by the NetarchiveSuite.

The NetarchiveSuite can organize three different kinds of harvests:

The crawling itself is divided into CrawlJobs, each of which is performed by an instance of Heritrix running locally or remotely. After the crawling is done, the statistics are sent back to the central machine, and the data (the ARC files) are sent to the archive.
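The flow described above can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not NetarchiveSuite code: the class and method names (CrawlFlowSketch, CrawlResult, Crawler, CentralMachine, Archive, harvest) are hypothetical, and the crawler is a fake that stands in for a Heritrix instance.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; none of these types exist in NetarchiveSuite.
public class CrawlFlowSketch {

    /** Hypothetical outcome of one crawl job: statistics plus the ARC files written. */
    static final class CrawlResult {
        final String jobId;
        final long documentsHarvested;
        final List<String> arcFiles;

        CrawlResult(String jobId, long documentsHarvested, List<String> arcFiles) {
            this.jobId = jobId;
            this.documentsHarvested = documentsHarvested;
            this.arcFiles = arcFiles;
        }
    }

    /** Stands in for a Heritrix instance, running locally or remotely. */
    interface Crawler {
        CrawlResult run(String jobId, List<String> seeds);
    }

    /** Stands in for the central machine that collects crawl statistics. */
    static final class CentralMachine {
        final List<String> reports = new ArrayList<>();

        void receiveStatistics(CrawlResult result) {
            reports.add(result.jobId + ": " + result.documentsHarvested + " documents");
        }
    }

    /** Stands in for the archive that stores the harvested ARC files. */
    static final class Archive {
        final List<String> storedFiles = new ArrayList<>();

        void store(String arcFile) {
            storedFiles.add(arcFile);
        }
    }

    /** Runs one job, then reports statistics centrally and sends the data to the archive. */
    static void harvest(String jobId, List<String> seeds, Crawler crawler,
                        CentralMachine central, Archive archive) {
        CrawlResult result = crawler.run(jobId, seeds);
        central.receiveStatistics(result);
        for (String arcFile : result.arcFiles) {
            archive.store(arcFile);
        }
    }

    public static void main(String[] args) {
        // Fake crawler: counts one document per seed and writes a single ARC file per job.
        Crawler fake = (jobId, seeds) ->
                new CrawlResult(jobId, seeds.size(), List.of(jobId + ".arc"));
        CentralMachine central = new CentralMachine();
        Archive archive = new Archive();

        harvest("job-1", List.of("http://example.org/"), fake, central, archive);

        System.out.println(central.reports);
        System.out.println(archive.storedFiles);
    }
}
```

The point of the sketch is the separation of concerns: the crawler only produces a result, while reporting statistics and storing data are separate steps handled by different parties.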

The suite consists of a number of packages: dk.netarkivet.common, dk.netarkivet.harvester, dk.netarkivet.archive, dk.netarkivet.viewerproxy

The common package ....

The harvester package ....

The archive package ....

The viewerproxy package ....

There are two more packages that do not belong to the NetarchiveSuite as such: the dk.netarkivet.deploy and dk.netarkivet.monitor packages.

Prerequisites