The primary function of the NetarchiveSuite is to plan, schedule and archive web harvests of parts of the internet. We use Heritrix3 as our web-crawler.
NetarchiveSuite was released in July 2007 as Open Source under the LGPL license and is used by the Danish organisation Netarkivet.dk, which has been using it since July 2005 to harvest Danish websites as authorized by the latest Danish Legal Deposit Act. A number of other national libraries also use NetarchiveSuite to harvest their national web domains, and the software is maintained as an Open Source partnership between these organisations.
NetarchiveSuite can organize three different kinds of web harvest:
- Event harvesting (harvests of a set of domains related to a specific event, e.g. 9/11 or an election).
- Selective harvesting (recurrent harvests of a set of domains).
- Snapshot harvesting (a complete snapshot of all known domains).
Key features:
- Friendly to non-technicians - designed to be usable by librarians and curators with a minimum of technical supervision.
- Low maintenance - easy setup of automated harvests, automated bit-integrity checks, and simple curator tools.
- High bit-preservation security - replication and active integrity tests of large data contents using the built-in ArcRepository module (although some organisations prefer to use their own bit-storage solutions).
- Loosely coupled - the suite consists of modules that can be used individually, or be used as one large web-archiving system.
The Common Module
This module contains the framework and utilities used by the whole suite, such as exceptions, settings, messaging, file transfer (RemoteFile), and logging. It also defines the Java interfaces used to communicate between the different modules, so that alternative implementations can be plugged in. The Common Module includes the web front-end through which curators and managers can define harvests, monitor running harvests, and perform quality assurance on completed harvests.
The Harvester Module
This module handles defining, scheduling, and performing harvests.
- The harvester module uses the Heritrix crawler developed by the Internet Archive and allows for flexible, automated definitions of harvests. The system gives access to the full power of the Heritrix crawler, given adequate knowledge of the crawler itself. NetarchiveSuite wraps the crawler in an easy-to-use interface that handles scheduling and configuring of the crawls, and distributes them to several crawling servers.
- The harvester module allows for de-duplication, using an index of URLs already crawled and stored in the archive to avoid storing the same content more than once. (This function uses the de-duplicator module from Kristinn Sigurdsson of the National Library of Iceland.)
- The harvester module supports packaging metadata about the harvest together with the harvested data.
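The de-duplication idea described above can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the class `DedupIndexSketch` and its methods are hypothetical names, the in-memory map stands in for the real index, and comparing SHA-1 content digests per URL is an assumed way of deciding what counts as a duplicate. It is not the API of NetarchiveSuite or of the de-duplicator module.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not the NetarchiveSuite or de-duplicator API.
class DedupIndexSketch {
    // Assumed index layout: URL -> digest of the content last stored for it.
    private final Map<String, String> digestByUrl = new HashMap<>();

    /** Returns the SHA-1 digest of the content as a lowercase hex string. */
    static String sha1Hex(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(content)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 must be provided by every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }

    /**
     * Returns true if the content should be stored, or false if the index
     * already holds identical content for this URL.
     */
    boolean shouldStore(String url, byte[] content) {
        String digest = sha1Hex(content);
        if (digest.equals(digestByUrl.get(url))) {
            return false; // unchanged since it was last stored
        }
        digestByUrl.put(url, digest); // new URL or changed content
        return true;
    }
}
```

In this sketch a harvester would call `shouldStore` for every fetched document and write only those for which it returns true; a production index would be persistent and built from the archive rather than held in memory.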
The Archive Module
- The archiving component offers a secure environment for storing harvested material. It is designed to give strong guarantees on bit preservation.
- It allows for replication of data at different locations, and distribution of content across several servers at each location. It supports different software and hardware platforms (Linux and Windows).
- The module allows for distributed batch jobs, i.e. running the same jobs on all servers at a location in parallel, and merging the results.
- An index of data in the archive allows fast access to the harvested materials.
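The distributed batch jobs mentioned above follow a run-everywhere-then-merge pattern, which the following minimal Java sketch illustrates. The names `BatchSketch` and `runOnAllServers` are hypothetical, and local threads stand in for the servers at a location; the real suite distributes jobs across machines, which this sketch does not attempt to reproduce.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical illustration only; not the NetarchiveSuite batch API.
class BatchSketch {
    /**
     * Runs the same job on every server's files in parallel (one thread per
     * simulated server) and merges the partial results, keyed by file name.
     */
    static <R> Map<String, R> runOnAllServers(
            Map<String, List<String>> filesByServer, Function<String, R> job) {
        ExecutorService pool =
                Executors.newFixedThreadPool(Math.max(1, filesByServer.size()));
        try {
            List<Future<Map<String, R>>> partials = new ArrayList<>();
            for (List<String> files : filesByServer.values()) {
                Callable<Map<String, R>> task = () -> {
                    Map<String, R> result = new TreeMap<>();
                    for (String file : files) {
                        result.put(file, job.apply(file));
                    }
                    return result;
                };
                partials.add(pool.submit(task));
            }
            // Merge step: combine the per-server maps into one result.
            Map<String, R> merged = new TreeMap<>();
            for (Future<Map<String, R>> partial : partials) {
                merged.putAll(partial.get());
            }
            return merged;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Batch job interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("Batch job failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

A job here is any function from a file name to a result, for example one that computes a checksum or counts records; the merged map then gives one entry per file across all servers.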