The primary function of the NetarchiveSuite is to plan, schedule and
archive web harvests of parts of the internet. We use Heritrix as our
web-crawler. NetarchiveSuite was released in July 2007 as open source
under the LGPL license and is used by the Danish organization
Netarkivet.dk (http://netarkivet.dk). This organization has been using
NetarchiveSuite since July 2005 to harvest Danish websites as authorized
by the latest Danish Legal Deposit Act.
The NetarchiveSuite can organize three different kinds of
harvests:
- Event harvesting (harvests of a set of domains
related to a specific event, e.g. 9/11 or an election).
- Selective harvesting (recurrent harvests of a set of
domains).
- Snapshot harvesting (a complete snapshot of all
known domains).
The software has been designed with the following in mind:
- Friendly to non-technicians - designed to be usable by
librarians and curators with a minimum of technical supervision.
- Low maintenance - easy setup of automated harvests, automated
bit-integrity checks, and simple curator tools.
- High bit-preservation security - replication and active
integrity tests of large amounts of data.
- Loosely coupled - the suite consists of modules that can be
used individually or together as one large web-archiving system.
See the NetarchiveSuite wiki for further details.