This page describes the high-level functionality of the NetarchiveSuite deploy software and the hierarchical scoping of deployment parameters.
The main function of deploy is to install and configure NetarchiveSuite on a distributed system. This is done through scripts to install, start and stop the applications of NetarchiveSuite based on a configuration file for the system. A sample file is provided with NetarchiveSuite in the file examples/deploy_distributed_example.xml.
The figure below shows the hierarchy of the instances in the deploy configuration file.
installDir: The directory on a machine where the installation is done. This is the directory environmentName relative to the initial ssh directory: /home/machineUser/environmentName/ on Linux, and C:\Documents and Settings\machineUser\environmentName or C:\Users\machineUser\environmentName on Windows, depending on the Windows version.
The Deploy module has to be run from a Linux/Unix machine, since the scripts for handling the physical locations use the bash shell. The figure below shows what happens when the deploy application is run.
Deploy takes the following arguments:
The Deploy application requires the following libraries in the classpath:
Note that you only need to reference the netarchivesuite-deploy-core.jar file explicitly in the classpath, because the others are referenced inside the netarchivesuite-deploy-core.jar file.
The complete call (without optionals) for running deploy will therefore be something like the following (with lib/ being the directory for the libraries):
export JAVA_HOME=/usr/java/jdk1.8.0
export PATH=$JAVA_HOME/bin:$PATH
java -cp lib/netarchivesuite-deploy-core.jar dk.netarkivet.deploy.DeployApplication -Cdeploy_config.xml -ZNetarchiveSuite.zip -Ssecurity.policy -Llog.prop -Bheritrix-bundler.zip
deploy_config.xml is the name and path to the configuration file,
NetarchiveSuite.zip is the path of the NetarchiveSuite package,
security.policy is the path of the security policy file and
log.prop is the path of the property file for logging.
When deploy is run a number of files are created in the output directory. These include scripts to install, start and kill the applications on the distributed platform. Also the NetarchiveSuite package file is copied to this location (unless it already exists in the output directory).
In addition to a NetarchiveSuite settings file, the following configuration files are also created on a per-machine or per-application basis:
This file is created from scratch for each machine. A large instructional header for the use of jmxremote.password is first written to the file, then the JMX username and JMX password for the monitor and for Heritrix are appended. Only the JMX logins (username and password) are used by the applications.
The login variables for the monitor are found through the paths in the settings for any of the applications:
The login variables for heritrix are found through the paths in any of the application settings:
If any application has a monitor defined in the settings file, the monitor must have a jmx login defined. The monitor jmx logins must be the same for all applications on a machine. This also applies for heritrix jmx logins, though the monitor jmx login and heritrix jmx login do not have to be the same as each other.
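The layout of the generated file can be illustrated with a minimal Python sketch. It assumes the usual JMX password file format of one "username password" pair per line; the function name, the header text and the logins are placeholders invented for illustration and are not taken from the Deploy source.

```python
def build_jmx_password_file(header, monitor_login, heritrix_login):
    """Return the text of a jmxremote.password file.

    header: instructional comment text placed at the top of the file.
    monitor_login, heritrix_login: (username, password) tuples.
    """
    lines = [header.rstrip("\n")]
    for username, password in (monitor_login, heritrix_login):
        # Assumed format: one "username password" pair per line.
        lines.append(f"{username} {password}")
    return "\n".join(lines) + "\n"


# Placeholder values, not real credentials or the real header text.
content = build_jmx_password_file(
    "# Instructional header (placeholder)",
    ("monitorRole", "monitorSecret"),
    ("controlRole", "heritrixSecret"),
)
print(content, end="")
```

Because the logins must be identical for all applications on a machine, a single pair per role is enough for the whole file.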
A log property file for each application is created. This file is given as input and it is changed to fit the application.
The only change in the log property file is replacing the tag APPID with the identification of the application (the applicationName followed by "_" + applicationInstanceId). The "_" + applicationInstanceId is only appended to the applicationName if the application has one.
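The APPID substitution can be sketched in a few lines of Python. This is an illustration under assumptions: the function names, the application name "ExampleApplication" and the template line are hypothetical, and the real log property file contains more than one line.

```python
def application_identification(application_name, application_instance_id=None):
    """Build the identification: the name plus an optional "_" + instance id."""
    if application_instance_id:
        return f"{application_name}_{application_instance_id}"
    return application_name


def customise_log_properties(template_text, application_name,
                             application_instance_id=None):
    """Replace every APPID tag in the template with the identification."""
    app_id = application_identification(application_name, application_instance_id)
    return template_text.replace("APPID", app_id)


# Hypothetical one-line template; only the APPID tag matters here.
template = "<file>./log/APPID.log</file>"
customised = customise_log_properties(template, "ExampleApplication", "one")
print(customised)
```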
The name of this application-specific log property file is "logback_" + applicationIdentification + ".xml", where applicationIdentification is the applicationName followed by the optional "_" + applicationInstanceId, as described above.
The security policy file for a bitarchive machine is initially a copy of the security policy file given as argument. This machine-specific security policy file is then modified to suit the needs of the machine and its applications.
The tag ROLE is replaced by the monitor.jmxUsername for the machine. This has to be defined on the machine level in the deploy configuration file.
Permission to read the baseFileDir under bitarchive is granted for all applications. The paths to these directories are changed to fit the syntax of the security policy file.
It is possible to evaluate the content of the configuration file when deploying, by giving the '-E' parameter with argument either 'y' or 'yes'. This is a tool for finding bugs within a configuration file (e.g. a misspelled name or wrongly placed branch).
This checks whether all the branches in the configuration file can be found within the default settings, and issues a warning for those it cannot find. It does not check whether the content of these branches is correct (e.g. http-port = -1); it only checks whether the branches also exist in the default settings.
Deploy does not abort the program when unknown branches are found. It only generates warnings about each unknown branch and then continues with the deployment.
Some modules have plugins which use values within the settings that are not part of the default settings, and these will therefore be noted as unknown. Such plugin-specific branches should not be considered errors, even though warnings are issued for them.
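The kind of check described above can be sketched as follows. This is not the Deploy implementation: it is a simplified Python illustration that compares element paths in two small, made-up XML documents, and the function names and sample values are assumptions.

```python
import xml.etree.ElementTree as ET


def branch_paths(element, prefix=""):
    """Yield the dotted path of an element and of all its descendants."""
    path = f"{prefix}.{element.tag}" if prefix else element.tag
    yield path
    for child in element:
        yield from branch_paths(child, path)


def unknown_branches(config_xml, default_xml):
    """Return the config branches that do not exist in the default settings."""
    known = set(branch_paths(ET.fromstring(default_xml)))
    unknown = []
    for path in branch_paths(ET.fromstring(config_xml)):
        if path not in known and path not in unknown:
            unknown.append(path)
    return unknown


# Made-up sample data: "prot" is a misspelled branch, and the value -1 is
# deliberately wrong to show that values are not checked.
DEFAULTS = "<settings><common><http><port>0</port></http></common></settings>"
CONFIG = ("<settings><common><http><port>-1</port><prot>x</prot>"
          "</http></common></settings>")

for path in unknown_branches(CONFIG, DEFAULTS):
    # Only a warning is issued; processing continues afterwards.
    print(f"WARNING: unknown branch {path}")
```

In this sketch the misspelled branch is reported, while the invalid port value passes unnoticed, which mirrors the limitation described above.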
When test arguments are given, a new configuration file is created with _test appended to the name (e.g. deploy_config.xml will have the test instance configuration file deploy_config_test.xml).
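The naming rule can be expressed as a small Python helper. The function name is hypothetical, and the sketch assumes that _test is inserted before the file extension, as in the example above.

```python
from pathlib import PurePosixPath


def derive_test_config_path(config_path):
    """Insert "_test" before the extension of the configuration file name."""
    path = PurePosixPath(config_path)
    return str(path.with_name(path.stem + "_test" + path.suffix))


test_config = derive_test_config_path("deploy_config.xml")
print(test_config)
```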
The following test arguments are given:
test_Mailreceivers.
These arguments are given without spaces between them in the above order. An Offset variable is calculated as the difference between the test_HttpPort and the test_HttpOffsetPort. The value of this Offset must be between 0 and 9.
The test arguments are applied to the test configuration file (e.g. deploy_config_test.xml), where the following changes are made:
test_HttpPort replaces the value in the settings path: settings.common.http.port.
test_Mailreceiver replaces the value in the settings path: settings.common.notification.receiver.
Offset replaces a single digit in some four-digit ports under settings. This is seen in the table below.
For example, Offset = 7 and a settings.common.jmx.port = 1234 will yield a new settings.common.jmx.port = 1274 for the test instance, whereas a settings.harvester.harvesting.heritrix.jmxPort = 1234 will yield a new settings.harvester.harvesting.heritrix.jmxPort = 1734.
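The Offset arithmetic can be reproduced with a short Python sketch. The function names and the HTTP port numbers are hypothetical; the digit positions (third digit for settings.common.jmx.port, second digit for settings.harvester.harvesting.heritrix.jmxPort) are inferred only from the two examples above.

```python
def compute_offset(test_http_port, test_http_offset_port):
    """Offset is the difference between the two ports and must be 0..9."""
    offset = test_http_port - test_http_offset_port
    if not 0 <= offset <= 9:
        raise ValueError(f"Offset must be between 0 and 9, got {offset}")
    return offset


def replace_digit(port, index, offset):
    """Replace the digit at the zero-based index of a four-digit port."""
    digits = list(str(port))
    if len(digits) != 4:
        raise ValueError(f"Expected a four-digit port, got {port}")
    digits[index] = str(offset)
    return int("".join(digits))


# Hypothetical ports that give Offset = 7.
offset = compute_offset(8077, 8070)

new_jmx_port = replace_digit(1234, 2, offset)           # third digit replaced
new_heritrix_jmx_port = replace_digit(1234, 1, offset)  # second digit replaced
print(offset, new_jmx_port, new_heritrix_jmx_port)
```

Restricting Offset to a single digit is what makes the substitution well defined: a larger difference would not fit into one digit position of the port.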
An installation script is created for each physical location. This script contains the commands for making the installation on all the machines of the physical location as described in the pseudo code.
The figure below shows the pattern of installation.
The install script for a physical location has the following procedure:
The NetarchiveSuite file is copied to the machine using scp (secure copy). Then the file is unzipped in the installation directory, which is created as a subdirectory in the local user directory.
In the config file a number of directories are defined, and these directories have to be created during the installation on a machine. The following table shows which directories are created based on the main branch where they are defined, and their path from this branch. The branch level represents where the applications have to be defined before they can be applied. They can also be defined in a prior instance and then be inherited by the given branch level.
where $/ in Directory is the value of the path. All the directories along this path will be created, if they do not exist already. A directory is only created if the path is defined under settings for the branch level (or inherited at the branch level) and it contains a non-empty value.
The installation of the directories will be executed from the installDir. The directories will only be created if they do not already exist, with the optional exception of the tempDir, which will be removed before creation if the -R argument is set to 'yes'. Only the directory at the end of the path has its content removed, not all the directories along the path. E.g. a tempDir with the path myPath/myEndDir will only clean the directory 'myEndDir', and not its parent directory.
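The difference between creating a directory and resetting the tempDir can be sketched in Python. This is an illustration, not the generated install script: the function name and the file names are assumptions, and a temporary directory stands in for the installDir.

```python
import shutil
import tempfile
from pathlib import Path


def prepare_directory(install_dir, relative_path, reset=False):
    """Create relative_path below install_dir, including parent directories.

    With reset=True only the directory at the end of the path is removed
    (with its content) before it is created again.
    """
    target = Path(install_dir) / relative_path
    if reset and target.is_dir():
        shutil.rmtree(target)
    target.mkdir(parents=True, exist_ok=True)
    return target


with tempfile.TemporaryDirectory() as install_dir:
    end_dir = prepare_directory(install_dir, "myPath/myEndDir")
    (end_dir / "old.tmp").write_text("stale")
    (end_dir.parent / "keep.txt").write_text("kept")

    # Corresponds to the tempDir being reset during installation.
    prepare_directory(install_dir, "myPath/myEndDir", reset=True)
    end_dir_is_empty = not any(end_dir.iterdir())
    parent_file_kept = (end_dir.parent / "keep.txt").exists()

print(end_dir_is_empty, parent_file_kept)
```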
On Linux/Unix machines directories are created directly through ssh, while Windows machines use a batch program, which is installed, run and then deleted.
(This is because only a single command line can be run through ssh, and this command line is run as bash on Linux/Unix and as batch on Windows. Since bash can take many commands on a single command line, it is possible to install all the directories through ssh on Linux/Unix. Batch, on the other hand, can only handle a single command per command line, and the directories can therefore not be installed through a single ssh call. The batch commands to install the directories are therefore combined in a batch program, which is installed on the Windows machine, then run and afterwards deleted.)
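How several directory creations fit into one ssh call on Linux/Unix can be sketched by building the command line in Python. The helper names, the host name and the directory names are hypothetical, and the commands in the generated install script may differ; mkdir -p is used here because it creates missing directories along the path and leaves existing ones untouched.

```python
import shlex


def mkdir_command_line(install_dir, directories):
    """Chain all directory creations into one bash command line."""
    parts = [f"cd {shlex.quote(install_dir)}"]
    parts += [f"mkdir -p {shlex.quote(directory)}" for directory in directories]
    return " && ".join(parts)


def ssh_call(user, host, command_line):
    """Build the argument list for a single ssh invocation (not executed)."""
    return ["ssh", f"{user}@{host}", command_line]


command_line = mkdir_command_line(
    "/home/machineUser/environmentName", ["myPath/myEndDir", "logs"]
)
call = ssh_call("machineUser", "machine.example.org", command_line)
print(call)
```

The sketch only constructs the call; passing the list to a process launcher would execute it on the remote machine.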
The jmxremote.password file must be read-only while the applications are running, which means that the file has to be made writable again before it can be reinstalled.
Then all the script and settings files are copied from the local directory with the machine name to the 'conf/' directory in the installation directory on the machine.
Then the optional database is handled, though only on the machines with a specified database directory. This database overrides the existing standard database in the NetarchiveSuite package. The database is then unzipped to the database directory, but only if that directory is empty.
Then the scripts are made executable and the jmxremote.password is made read-only.
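The permission handling on a POSIX machine can be sketched as follows. The function names, the ".sh" suffix used to recognise scripts and the file names are assumptions for illustration; the generated install script may use different commands.

```python
import stat
import tempfile
from pathlib import Path


def finalise_conf_dir(conf_dir):
    """Make scripts executable and the JMX password file read-only."""
    for path in Path(conf_dir).iterdir():
        if path.suffix == ".sh":
            path.chmod(path.stat().st_mode | stat.S_IXUSR)
        elif path.name == "jmxremote.password":
            path.chmod(stat.S_IRUSR)  # owner may read, nobody may write


def make_writable(password_file):
    """Allow the owner to write again, as needed before a reinstallation."""
    path = Path(password_file)
    path.chmod(path.stat().st_mode | stat.S_IWUSR)


with tempfile.TemporaryDirectory() as conf_dir:
    script = Path(conf_dir) / "start_example.sh"
    password = Path(conf_dir) / "jmxremote.password"
    script.write_text("#!/bin/bash\n")
    password.write_text("monitorRole placeholder\n")

    finalise_conf_dir(conf_dir)
    script_is_executable = bool(script.stat().st_mode & stat.S_IXUSR)
    password_mode = stat.S_IMODE(password.stat().st_mode)

    make_writable(password)
    password_mode_after = stat.S_IMODE(password.stat().st_mode)

print(script_is_executable, oct(password_mode), oct(password_mode_after))
```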
The figure below shows how the applications are started, and the same pattern is used for killing the applications again (replace start with kill in the figure).
Note that an application cannot be started if it is already running. How this is checked differs between the two supported platforms, Linux and Windows, as described below.
The restart script can be used for restarting the running applications. It starts by calling the killall script, then waits 5 seconds for the applications to terminate completely, and finally runs the startall script. This script can be used for Windows Services (automatic execution during startup).
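The order of operations in the restart script can be sketched in Python. This is only a model of the sequence: the function name is hypothetical, and the kill and start steps are passed in as callables instead of invoking the real killall and startall scripts.

```python
import time


def restart(kill_all, start_all, wait_seconds=5, sleep=time.sleep):
    """Kill all applications, wait for them to terminate, then start them."""
    kill_all()
    sleep(wait_seconds)
    start_all()


# Record the sequence instead of running real scripts or really sleeping.
events = []
restart(
    kill_all=lambda: events.append("killall"),
    start_all=lambda: events.append("startall"),
    sleep=lambda seconds: events.append(f"wait {seconds}s"),
)
print(events)
```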
On the Linux platform an application is only started if no instances of this application are found among the running processes. Likewise an application is only killed if it can be found in the process list.
An instance of a specific application is found in the list of running processes by looking for any process that has the same name and uses the same settings file.
When killing an application of the instance dk.netarkivet.harvester.heritrix3.HarvestControllerApplication, any corresponding Heritrix application is also killed.
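The process lookup on Linux can be illustrated with a Python sketch that filters a process list. The function name, the "pid command" line layout, the class name and the way the settings file appears on the command line are all assumptions made for the example; the generated scripts work on the real process list of the machine.

```python
def find_running_instances(process_lines, application_name, settings_file):
    """Return the pids of processes matching both the name and settings file."""
    pids = []
    for line in process_lines:
        fields = line.split(None, 1)  # assumed layout: "<pid> <command line>"
        if len(fields) != 2:
            continue
        pid, command = fields
        if application_name in command and settings_file in command:
            pids.append(int(pid))
    return pids


# Made-up process list; the second process uses another settings file.
PROCESS_LIST = [
    " 4711 java -Dsettings=conf/settings_ExampleApplication.xml "
    "dk.netarkivet.example.ExampleApplication",
    " 4712 java -Dsettings=conf/settings_OtherApplication.xml "
    "dk.netarkivet.example.ExampleApplication",
    " 4713 sshd: machineUser",
]

running = find_running_instances(
    PROCESS_LIST,
    "dk.netarkivet.example.ExampleApplication",
    "conf/settings_ExampleApplication.xml",
)
print(running)
```

An application would be started only if the returned list is empty, and killed only if it is not.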
On Windows, several files are required to run an application and to ensure that at most one instance of the application is running: two scripts for killing it, two scripts for starting it, and one temporary file indicating whether an instance is running.
The application can only be started if the temporary run-file does not exist. It is done by calling a VBS script for running the application. This script starts the application as a process and saves the method for killing this process in a kill-process file.
The application can only be killed if the temporary run-file exists. The kill-process file is called for killing the process of the application. Then the temporary run-file is removed, thus telling that the application is not running and can be started again.
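The run-file logic can be modelled in Python as follows. This is a sketch of the guard only: the function names and the run-file name are hypothetical, the launch and kill steps are passed in as callables in place of the VBS script and the kill-process file, and it is assumed that the run-file is created when the application is started.

```python
import tempfile
from pathlib import Path


def start_application(run_file, launch):
    """Start only if the temporary run-file does not exist."""
    if run_file.exists():
        return False
    launch()
    run_file.touch()  # marks the application as running
    return True


def kill_application(run_file, kill):
    """Kill only if the temporary run-file exists, then remove it."""
    if not run_file.exists():
        return False
    kill()
    run_file.unlink()  # the application can now be started again
    return True


with tempfile.TemporaryDirectory() as conf_dir:
    run_file = Path(conf_dir) / "running_ExampleApplication"
    log = []
    results = [
        start_application(run_file, lambda: log.append("start")),
        start_application(run_file, lambda: log.append("start")),  # refused
        kill_application(run_file, lambda: log.append("kill")),
        kill_application(run_file, lambda: log.append("kill")),    # refused
    ]

print(results, log)
```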
The Heritrix application is not killed when an application of the instance dk.netarkivet.harvester.heritrix3.HarvestControllerApplication is killed. This is because Heritrix is not thoroughly tested on Windows, and might not be supported.