If the deploy software is not adequate for your installation, this section gives some hints on how to distribute and install the NetarchiveSuite software manually on a number of machines.
In the examples below, we assume that $deployInstallDir is set to the directory in which the NetarchiveSuite code is to be installed.
We assume that all machines in the chosen scenario are unix/linux servers. The procedure below may not work on other platforms. After having created the new settings to be used in the deployment of the software, zip together the NetarchiveSuite files including the new settings and copy the modified NetarchiveSuite.zip to all machines taking part in the deployment:
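A minimal sketch of the copy step, written as a dry run that only prints the commands. It assumes the modified NetarchiveSuite.zip is already in the current directory. The host names, the `netarkiv` account, and the helper `print_distribution_commands` are hypothetical and not taken from this manual.

```shell
# Dry run: print the commands that would copy and unpack NetarchiveSuite.zip.
# HOSTS and REMOTE_USER are hypothetical placeholders; replace them with your own.
HOSTS="admin.example.org harvester1.example.org bitarchive1.example.org"
REMOTE_USER="netarkiv"

# Usage: print_distribution_commands INSTALL_DIR
# INSTALL_DIR corresponds to $deployInstallDir in the text.
print_distribution_commands() {
    install_dir=$1
    for host in $HOSTS; do
        echo "ssh $REMOTE_USER@$host mkdir -p $install_dir"
        echo "scp NetarchiveSuite.zip $REMOTE_USER@$host:$install_dir"
        echo "ssh $REMOTE_USER@$host unzip -q -o $install_dir/NetarchiveSuite.zip -d $install_dir"
    done
}
```

Calling `print_distribution_commands "$deployInstallDir"` lists three commands per machine, which can be reviewed before they are run for real.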
The NetarchiveSuite settings can be set for applications in three different ways:
- by relying on the default settings
- in a settings file
- on the command line
Using NetarchiveSuite default settings
If a setting is not set explicitly, its default value is used. Please refer to the [Configuration Manual 3.16#DefaultSettings] for more information on the defaults.
Setting NetarchiveSuite settings on the command line
To set the value of a setting on the command line, add "-Dkey=value" to your java command line, for instance:
will override the setting for the http port to be 8076.
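A command of this kind can be sketched as follows. The helper `override_command` is hypothetical and only prints the command line. The key `settings.common.http.port` and the class name `dk.netarkivet.common.webinterface.GUIApplication` are assumptions about this NetarchiveSuite version and should be checked against the Configuration Manual.

```shell
# Build a java command line that overrides one setting with -Dkey=value.
# Usage: override_command KEY VALUE MAIN_CLASS
override_command() {
    echo "java -D$1=$2 $3"
}

# Assumed key and class name (not confirmed by this section).
override_command settings.common.http.port 8076 \
    dk.netarkivet.common.webinterface.GUIApplication
```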
Setting NetarchiveSuite settings with settings files
To set the values using a configuration file, save the settings in an XML file as described above. By default, NetarchiveSuite will look for the settings file in conf/settings.xml, that is, the file settings.xml under the directory conf relative to the current working directory. You can override this by specifying -Ddk.netarkivet.settings.file=path/to/settings.file.xml on the command line, for instance:
will read settings from the file
You can also specify multiple configuration files. You do this by separating the paths with ':' on Unix/Linux/MacOS or ';' on Windows. For instance:
will read settings from both
basicsettings.xml in the current directory.
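How the separator is used to combine several settings files can be sketched as below. The helper `settings_option` and the file name conf/localsettings.xml are hypothetical; only the option name `dk.netarkivet.settings.file` and the separators come from the text above.

```shell
# Join settings file paths with the platform's path separator and
# print the resulting -D option. Usage: settings_option SEPARATOR FILE...
settings_option() {
    separator=$1
    shift
    joined=""
    for file in "$@"; do
        if [ -z "$joined" ]; then
            joined=$file
        else
            joined="$joined$separator$file"
        fi
    done
    echo "-Ddk.netarkivet.settings.file=$joined"
}

# Unix/Linux/MacOS use ':' as separator; the first file name is hypothetical.
settings_option : conf/localsettings.xml basicsettings.xml
```

On Windows the first argument would be ';' instead (quoted, so the shell does not treat it as a command separator).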
The order of resolving NetarchiveSuite settings
If a setting is set on both command line and in settings files, or if it is set in multiple settings files, the setting is resolved as follows:
- If the setting is set as a system property (i.e. set on the command line), use that value
- Else if the setting is specified in configuration files, use the '''first''' specified value
- Else use the default value
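The resolution order above amounts to "the first source that sets a value wins". A minimal sketch of that rule, not the NetarchiveSuite implementation: the function `resolve_setting`, the marker word `UNSET` for a source that does not set the value, and the port numbers are all hypothetical.

```shell
# Usage: resolve_setting SYSTEM_PROPERTY FILE_VALUE... DEFAULT
# Arguments are given in priority order; a source that does not set the
# value is passed as the literal word UNSET (a convention of this sketch).
resolve_setting() {
    for candidate in "$@"; do
        if [ "$candidate" != "UNSET" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1   # nothing set and no default given
}

# System property set: it wins over both settings files.
resolve_setting 9003 9001 9002 ""
# No system property: the first settings file that sets the value wins.
resolve_setting UNSET UNSET 9002 ""
```

When neither the command line nor any settings file sets the value, the last argument (the default, here the empty string) is returned.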
As an example, consider the resulting value for http-port (knowing that the default value is empty) when using the following two configuration files:
The following command will use the empty string as http-port:
The following command will use the value 8078 as http-port:
The following command will use the value 8076 as http-port:
The following command will use the value 8077 as http-port:
Standard commandline settings
The CLASSPATH needed to start and run the Java applications in NetarchiveSuite consists of five jar files: dk.netarkivet.harvester.jar, dk.netarkivet.archive.jar, dk.netarkivet.viewerproxy.jar, dk.netarkivet.wayback.jar, and dk.netarkivet.monitor.jar. The dk.netarkivet.common.jar file and all our third-party dependencies need not be added explicitly to the CLASSPATH, as they are referenced indirectly by these jar files.
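Building such a CLASSPATH can be sketched as follows. The helper `build_classpath` is hypothetical, and the assumption that the jar files live in a lib directory under $deployInstallDir is not confirmed by this section.

```shell
# Print a CLASSPATH containing the five application jar files.
# Usage: build_classpath LIB_DIR
build_classpath() {
    lib_dir=$1
    classpath=""
    for jar in dk.netarkivet.harvester.jar dk.netarkivet.archive.jar \
               dk.netarkivet.viewerproxy.jar dk.netarkivet.wayback.jar \
               dk.netarkivet.monitor.jar; do
        # Append with ':' between entries, but no leading ':'.
        classpath="${classpath:+$classpath:}$lib_dir/$jar"
    done
    echo "$classpath"
}
```

A start script could then use `CLASSPATH=$(build_classpath "$deployInstallDir/lib")` followed by `export CLASSPATH`.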
We use the Apache Commons Logging framework, so we need to point to the wanted logger class (e.g. org.apache.commons.logging.impl.Jdk14Logger) as well as to the logging configuration file. You may want to use different logging properties for different applications, especially when more than one application logs to the same logging directory. For example, you may want to change the line java.util.logging.FileHandler.pattern=./log/APPID%u.log in the conf/log.prop file to something different.
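The two logging options can be sketched as below. `org.apache.commons.logging.Log` and `java.util.logging.config.file` are the standard system properties of Commons Logging and java.util.logging; the helper `logging_options` and the per-application file name used in the example are hypothetical.

```shell
# Print the JVM options that select the logger class and the logging
# configuration file. Usage: logging_options LOG_PROPERTIES_FILE
logging_options() {
    echo "-Dorg.apache.commons.logging.Log=org.apache.commons.logging.impl.Jdk14Logger -Djava.util.logging.config.file=$1"
}

# One properties file per application keeps the log files apart
# (the file name is a made-up example).
logging_options conf/log_GUIApplication.prop
```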
Note that if you use the MonitorSiteSection, your logging properties file must contain the handler
Each application instance has its own JMX and RMI port. For example, the JMX port could be 8100 and the associated RMI port 8200 for the first application instance on the machine, as in the example below, then 8101/8201 for the second application instance, and so on. JMX also uses a password file, which is the same throughout the installation ($deployInstallDir/conf/jmxremote.password).
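The port numbering scheme can be sketched as a small helper. The function `jmx_options` is hypothetical, and the setting names `settings.common.jmx.port`, `settings.common.jmx.rmiPort`, and `settings.common.jmx.passwordFile` are assumptions that should be verified against the Configuration Manual.

```shell
# Print the JMX-related options for the Nth application instance on a
# machine (N starts at 0), following the 8100/8200 numbering in the text.
# Usage: jmx_options INDEX PASSWORD_FILE
jmx_options() {
    index=$1
    password_file=$2
    jmx_port=$((8100 + $index))
    rmi_port=$((8200 + $index))
    # Setting names below are assumed, not confirmed by this section.
    echo "-Dsettings.common.jmx.port=$jmx_port -Dsettings.common.jmx.rmiPort=$rmi_port -Dsettings.common.jmx.passwordFile=$password_file"
}

# First and second application instance on one machine.
jmx_options 0 conf/jmxremote.password
jmx_options 1 conf/jmxremote.password
```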
Note: For the StatusSiteSection to work, your logging must be configured to use java.util.logging with the dk.netarkivet.monitor.logging.CachingLogHandler enabled; see the Command Line Logging section. (This is done automatically if the NetarchiveSuite deploy software is used to configure and install your NetarchiveSuite installation.)
Select the appropriate settings.file for the application
The conf/settings.xml file (the new one configured for your environment) is probably fine for most applications, but you may need special-purpose settings files for some applications, e.g. BitarchiveApplications (since you cannot specify more than one baseFileDir on the command line). The settings file used by an application can be specified by:
We need to set the maximum Java heap size to 1.5 Gbytes. You may use this to change that or add other JVM options.
On the admin machine, we have to start the following 5 applications:
- 1 GUIApplication
- 1 HarvestJobManagerApplication (handles the scheduling of jobs)
- 2 instances of BitarchiveMonitorApplication (each controls the access to a single bitarchive replica), one for each bitarchive replica (e.g. EAST and WEST)
- 1 ARCRepositoryApplication (handles access to the bitarchive replicas)
Starting the GUIApplication
Before we can start the GUIApplication, the external database needs to be started (the deploy software does this for you if the external database is a Derby database).
We also need to prepare the JSP-pages. You can unzip the war-files in the webpages directory as below:
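One way to unpack the war-files is sketched below as a dry run. It assumes that each X.war should be unpacked into a directory X next to it, which this section does not confirm; the helper `print_unzip_commands` is hypothetical.

```shell
# Print an unzip command for every war file in a directory (dry run).
# Each X.war is unpacked into a directory X next to it.
# Usage: print_unzip_commands WEBPAGES_DIR
print_unzip_commands() {
    for war in "$1"/*.war; do
        [ -e "$war" ] || continue      # skip if no war files were found
        echo "unzip -q -o $war -d ${war%.war}"
    done
}
```

Running `print_unzip_commands "$deployInstallDir/webpages"` lists the commands; piping the output to `sh` would execute them.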
Or you can update your settings.xml file to refer to the war-files instead of the unpacked directories, for instance
and similar for other sitesections.
Now we are ready to start the application:
Starting the BitarchiveMonitorApplication instances
In the general set-up with two distributed bitarchive replicas, we have a BitarchiveMonitorApplication associated with each replica. Here the replicas are
ReplicaOne (with replicaId
ReplicaTwo (with replicaId
To distinguish the two instances from each other, we use the '''settings.common.applicationInstanceId''' setting, which is used as an identifier (here we use BMONE and BMTWO as the two identifiers).
Start the monitor for bitarchive at
BMONE as identifier thus:
Start the monitor for the bitarchive at
BMTWO as identifier thus:
- one ARCRepository (this application handles all access to the bitarchives).
On each harvester machine, we have one or more HarvestControllerApplications. Settings related to the HarvestControllerApplication are:
- settings.common.applicationInstanceId (to distinguish between HarvestControllerApplications running on the same machine)
- settings.harvester.harvesting.queuePriority (to select which of two queues to accept jobs from: HIGHPRIORITY (jobs that are part of a selective harvest) or LOWPRIORITY (jobs that are part of a snapshot harvest))
- settings.harvester.harvesting.minSpaceLeft (how many bytes ''must'' be available in the serverdir to accept crawl jobs). The default is 400000000 (~400 Mbytes).
In the following, a low-priority HarvestControllerApplication is started with application instance id=SEL
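A start command along these lines can be sketched as follows. The helper `harvest_controller_command` is hypothetical and only prints the command. The class name `dk.netarkivet.harvester.harvesting.HarvestControllerApplication` is an assumption, and `-Xmx1536m` reflects the 1.5 Gbytes heap size mentioned earlier.

```shell
# Print a start command for a HarvestControllerApplication.
# Usage: harvest_controller_command INSTANCE_ID QUEUE_PRIORITY
harvest_controller_command() {
    instance_id=$1
    priority=$2
    # Only the two queue names mentioned in the text are accepted.
    case "$priority" in
        HIGHPRIORITY|LOWPRIORITY) ;;
        *) echo "unknown queue priority: $priority" >&2; return 1 ;;
    esac
    # The main class name is assumed, not confirmed by this section.
    echo "java -Xmx1536m -Dsettings.common.applicationInstanceId=$instance_id -Dsettings.harvester.harvesting.queuePriority=$priority dk.netarkivet.harvester.harvesting.HarvestControllerApplication"
}

harvest_controller_command SEL LOWPRIORITY
```

The settings file, logging, and JMX options described earlier would be added to the same command line.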
For each Replica, you can have BitarchiveServers installed on one or more machines. We suggest using just one BitarchiveServer per machine, though it is possible to use more than one.
Each BitarchiveServer can have storage on several filesystems, so if the archive storage is spread over more than one filesystem, you need to modify the settings file like this:
Starting a BitarchiveServer requires knowing which Replica it resides on, and the credentials required for correcting the data stored in the bitarchive. For ReplicaOne with id ONE this would be:
On the access servers, we deploy any number of ViewerProxyApplication instances, and possibly one IndexServerApplication (only one in total) used to generate the indices needed by the harvesters and the ViewerProxyApplication instances.
Each ViewerProxyApplication instance uses an application instance id (settings.common.applicationInstanceId) and its own distinct base directory (settings.viewerproxy.baseDir). Each instance also belongs to a Replica (settings.archive.bitarchive.useReplicaId). In the start sample below, the instance uses application instance id "first" and 'viewerproxy_first' as base directory, and belongs to
ReplicaOne with id
For the NetarchiveSuite support for wayback, see the Additional Tools Manual.