NetarchiveSuite system overview

Packages
dk.netarkivet.archive  
dk.netarkivet.archive.arcrepository  
dk.netarkivet.archive.arcrepository.bitpreservation  
dk.netarkivet.archive.arcrepository.distribute  
dk.netarkivet.archive.arcrepositoryadmin  
dk.netarkivet.archive.bitarchive  
dk.netarkivet.archive.bitarchive.distribute  
dk.netarkivet.archive.checksum  
dk.netarkivet.archive.checksum.distribute  
dk.netarkivet.archive.distribute  
dk.netarkivet.archive.indexserver  
dk.netarkivet.archive.indexserver.distribute  
dk.netarkivet.archive.tools  
dk.netarkivet.archive.webinterface  
dk.netarkivet.common  
dk.netarkivet.common.distribute  
dk.netarkivet.common.distribute.arcrepository  
dk.netarkivet.common.distribute.indexserver  
dk.netarkivet.common.distribute.monitorregistry  
dk.netarkivet.common.exceptions  
dk.netarkivet.common.lifecycle  
dk.netarkivet.common.management  
dk.netarkivet.common.tools  
dk.netarkivet.common.utils  
dk.netarkivet.common.utils.arc  
dk.netarkivet.common.utils.batch  
dk.netarkivet.common.utils.cdx  
dk.netarkivet.common.webinterface  
dk.netarkivet.deploy  
dk.netarkivet.harvester  
dk.netarkivet.harvester.datamodel  
dk.netarkivet.harvester.distribute  
dk.netarkivet.harvester.harvesting  
dk.netarkivet.harvester.harvesting.controller  
dk.netarkivet.harvester.harvesting.distribute  
dk.netarkivet.harvester.harvesting.frontier  
dk.netarkivet.harvester.harvesting.monitor  
dk.netarkivet.harvester.scheduler  
dk.netarkivet.harvester.tools  
dk.netarkivet.harvester.webinterface  
dk.netarkivet.monitor  
dk.netarkivet.monitor.distribute  
dk.netarkivet.monitor.jmx  
dk.netarkivet.monitor.logging  
dk.netarkivet.monitor.registry  
dk.netarkivet.monitor.registry.distribute  
dk.netarkivet.monitor.tools  
dk.netarkivet.monitor.webinterface  
dk.netarkivet.viewerproxy  
dk.netarkivet.viewerproxy.distribute  
dk.netarkivet.viewerproxy.reporting  
dk.netarkivet.viewerproxy.webinterface  
dk.netarkivet.wayback  
dk.netarkivet.wayback.aggregator The Aggregator sorts the raw index files generated by the indexer and merges them into larger index files usable by Wayback.
dk.netarkivet.wayback.batch  
dk.netarkivet.wayback.batch.copycode  
dk.netarkivet.wayback.indexer Retrieves indexes of the ARC files in the repository which are needed by Wayback.

 

Introduction

The primary function of the NetarchiveSuite is to plan, schedule, and archive web harvests of parts of the internet. We use Heritrix as our web crawler. NetarchiveSuite was released in July 2007 as open source under the LGPL license and is used by the Danish organization Netarkivet.dk (http://netarkivet.dk). This organization has used NetarchiveSuite since July 2005 to harvest Danish websites as authorized by the latest Danish Legal Deposit Act.

The NetarchiveSuite can organize three different kinds of harvests:

The software has been designed with the following in mind:

The modules in the NetarchiveSuite

The NetarchiveSuite is split into four main modules: one module with common functionality and three modules corresponding to the processes of harvesting, archiving, and accessing, respectively.

Figure: System overview

The Common Module

This module contains the framework and utilities used by the whole suite, such as exceptions, settings, messaging, file transfer (RemoteFile), and logging. It also defines the Java interfaces used to communicate between the different modules, so that alternative implementations can be supported.

The Harvester Module

This module handles defining, scheduling, and performing harvests.

The Archive Module

This module makes it possible to set up and run a repository with replication, active bit consistency checks for bit preservation, and support for distributed batch jobs on the archive.
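
As a rough illustration of what a bit consistency check across replicas can look like, the sketch below compares checksums of the copies of one file and reports the replicas that disagree with the majority. It is not taken from NetarchiveSuite: the class ReplicaChecksumSketch, its methods, the replica names, and the choice of MD5 are all assumptions made for this example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not part of the NetarchiveSuite API.
public class ReplicaChecksumSketch {

    // Returns the MD5 checksum of the data as a lowercase hex string.
    // MD5 is an assumption here; any stable digest would serve the sketch.
    static String md5Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(data)) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Given the content of one file as stored on each replica, returns the
    // names of replicas whose checksum differs from the strict majority.
    // If no checksum has a strict majority, every replica is reported.
    static List<String> findSuspectReplicas(Map<String, byte[]> copies) {
        Map<String, String> checksumByReplica = new LinkedHashMap<>();
        Map<String, Integer> votes = new HashMap<>();
        for (Map.Entry<String, byte[]> e : copies.entrySet()) {
            String sum = md5Hex(e.getValue());
            checksumByReplica.put(e.getKey(), sum);
            votes.merge(sum, 1, Integer::sum);
        }
        String majority = null;
        for (Map.Entry<String, Integer> v : votes.entrySet()) {
            if (v.getValue() * 2 > copies.size()) {
                majority = v.getKey();
            }
        }
        List<String> suspects = new ArrayList<>();
        for (Map.Entry<String, String> e : checksumByReplica.entrySet()) {
            if (majority == null || !e.getValue().equals(majority)) {
                suspects.add(e.getKey());
            }
        }
        return suspects;
    }

    public static void main(String[] args) {
        Map<String, byte[]> copies = new LinkedHashMap<>();
        copies.put("replicaA", "hello".getBytes(StandardCharsets.UTF_8));
        copies.put("replicaB", "hello".getBytes(StandardCharsets.UTF_8));
        copies.put("replicaC", "hellp".getBytes(StandardCharsets.UTF_8));
        System.out.println("Suspect replicas: " + findSuspectReplicas(copies));
    }
}
```

A real repository would compare checksums reported by each replica rather than load file contents into memory, and would need a policy for the case where no majority exists; the sketch only shows the comparison step.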

The Access (Viewerproxy) Module

This module gives access to previously harvested material through a proxy solution.

For developers