Functionality of the Deploy Software
The main function of deploy is to install and configure NetarchiveSuite on a distributed system. This is done through scripts to install, start and stop the applications of NetarchiveSuite based on a configuration file for the system. A sample file is provided with NetarchiveSuite in the file examples/deploy_distributed_example.xml.
The figure below shows the hierarchy of the instances in the deploy configuration file.
- environmentName: The required value in the deploy configuration file.
- machineUser: The login for the machine.
- installDir: The directory on a machine where the installation is done. This is the directory environmentName relative to the initial ssh directory. On Linux the path is /home/machineUser/environmentName/, and most versions of Windows use the path C:\Documents and Settings\machineUser\environmentName, except Windows Vista (and the newest equivalent server version), which has the path:
Performing a deploy
The Deploy module has to be run from a Linux/Unix machine, since the scripts for handling the physical locations use the Linux bash shell. BitarchiveApplications are supported on Windows, so some machines running Windows can be used in the distributed system.
The figure below shows what happens when the deploy application is run.
Deploy takes the following arguments:
- -C - The configuration file for deploy, has to have the '.xml' suffix.
- The required structure of this file is described in the Configuration file section. It has to be XML parseable.
- -Z - The NetarchiveSuite file, has to be '.zip'.
- This is the NetarchiveSuite package file, which is unzipped on all the machines during installation. This contains the libraries which are used when applications are run. The NetarchiveSuite package file is copied to the output directory when deploy is run.
- -L - The log property file, has to be '.prop'.
- This file contains the basic properties for logging. A copy of this file is made for each machine, where it is changed to fit purposes of the machine. See the Log property file section under Files.
- -S - The security policy file, has to be '.policy'.
- The security policy file defines where the applications are allowed to operate. A copy of this file is made for each machine, where the required security properties for the applications are granted. See the Security Policy file section under Files.
- -O [OPTIONAL] - The output directory.
- This is the directory on the root machine (the machine where deploy is run from) where the scripts and setting files are created by deploy (the environmentName is used as default name for the output directory).
- -D [OPTIONAL] - The harvesting (derby) database, has to be either '.zip' or '.jar'.
- The database where the harvesting information is to be located. If the database is not given as an argument, the default database in NetarchiveSuite package file is used. The database has to be placed in an unzippable file ('.zip' or '.jar'), and it is only unzipped on machines where a database directory has been defined. Currently databases are only supported on Linux machines.
- -R [OPTIONAL] - Whether the temporary file directory should be reset. Any argument different from 'y' or 'yes' will be considered a 'no'.
- During installation some directories are created, if they do not already exist. This argument defines whether the temporary directory should be cleared during installation (or reinstallation).
- -T [OPTIONAL] - For creating a test instance.
- The argument is required to have the following format: 'HttpOffsetPort,HttpPort,EnvironmentName,MailReceivers' (no spaces between them). A new config file is created based on these inputs and the given config file (this file has the same name, just with the extension '_test.xml' instead of '.xml'). See the Test instance section.
- -E [OPTIONAL] - For evaluating the config file. Any argument different from 'y' or 'yes' will be considered a 'no'.
- This evaluates whether the settings in the deploy configuration file are compatible with the standard settings. See the Evaluation section below.
- -A [OPTIONAL] - The archive (derby) database, has to be either '.zip' or '.jar'.
- This database will be used for both the ArcRepository and the DatabaseBasedActiveBitPreservation. If the database is not given as an argument, a default empty archive database in the NetarchiveSuite package file is used. The database has to be placed in an unzippable file ('.zip' or '.jar'), and it is only unzipped on machines where the <globalArchiveDatabaseDir> parameter is defined in the configuration. This is currently only supported on Linux machines.
- -B [OPTIONAL] - The bundled Heritrix harvester to use.
- If not specified as a deploy parameter, the harvester bundle needs to be defined in the deploy configuration XML file for each (Heritrix3) harvester.
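The suffix rules in the argument list above can be summarised in a small check. The following is a minimal sketch in Python, not the validation code used by deploy itself; the function names and the dictionary-based input are assumptions made for illustration, and the 'y'/'yes' comparison is assumed to be case-sensitive.

```python
# Required file suffixes per option, as stated in the argument list above.
REQUIRED_SUFFIXES = {
    "-C": (".xml",),
    "-Z": (".zip",),
    "-L": (".prop",),
    "-S": (".policy",),
    "-D": (".zip", ".jar"),
    "-A": (".zip", ".jar"),
}
# Options that are not marked [OPTIONAL] in the list above.
MANDATORY = ("-C", "-Z", "-L", "-S")


def is_yes(value):
    """For -R and -E: only 'y' or 'yes' count as yes (assumed case-sensitive)."""
    return value in ("y", "yes")


def check_arguments(args):
    """Return a list of problems for a dict mapping option -> value (hypothetical helper)."""
    problems = []
    for option in MANDATORY:
        if option not in args:
            problems.append(f"missing required argument {option}")
    for option, suffixes in REQUIRED_SUFFIXES.items():
        value = args.get(option)
        if value is not None and not value.endswith(suffixes):
            problems.append(f"{option} must end with {' or '.join(suffixes)}")
    return problems
```

An empty list means that all mandatory arguments are present and every file argument carries an accepted suffix.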
The Deploy application requires the following libraries in the classpath:
Note that you only need to reference the netarchivesuite-deploy-core.jar file explicitly in the classpath, because the others are referenced inside the netarchivesuite-deploy-core.jar file.
The complete call (without optionals) for running deploy will therefore be something like the following (with lib/ being the directory for the libraries):
Here deploy_config.xml is the name and path of the configuration file, NetarchiveSuite.zip is the path of the NetarchiveSuite package, security.policy is the path of the security policy file and log.prop is the path of the property file for logging.
When deploy is run, a number of files are created in the output directory. These include scripts to install, start and kill the applications on the distributed platform. The NetarchiveSuite package file is also copied to this location (unless it already exists in the output directory).
In addition to a NetarchiveSuite settings file, the following configuration files are also created on a per-machine or per-application basis:
Jmxremote password file
This file is created from scratch for each machine. A large instructional header for the use of the jmxremote.password file is written first, then the jmx username and jmx password for the monitor and for heritrix are appended. Only the jmx logins (username and password) are used by the applications.
The login variables for the monitor are found through the paths in the settings for any of the applications:
The login variables for heritrix are found through the paths in any of the application settings:
If any application has a monitor defined in the settings file, the monitor must have a jmx login defined. The monitor jmx logins must be the same for all applications on a machine. This also applies for heritrix jmx logins, though the monitor jmx login and heritrix jmx login do not have to be the same as each other.
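The consistency rule above can be sketched as follows. This is an illustrative Python sketch, not deploy's own code: the function names are hypothetical, the logins are passed in as plain (username, password) tuples rather than read from the settings, and skipping an empty login list is an assumption. The one-login-per-line "username password" layout is the usual format of a Java jmxremote.password file.

```python
def single_login(app_logins, kind):
    """Return the one (username, password) pair shared by all applications.

    app_logins holds one (username, password) tuple per application on the
    machine. A ValueError is raised if the applications disagree.
    """
    distinct = set(app_logins)
    if len(distinct) != 1:
        raise ValueError(
            f"the {kind} jmx login must be the same for all applications on a machine"
        )
    return distinct.pop()


def password_file_body(header, monitor_logins, heritrix_logins):
    """Build the file content: the header first, then one login per line."""
    lines = [header.rstrip("\n")]
    for kind, logins in (("monitor", monitor_logins), ("heritrix", heritrix_logins)):
        if logins:  # assumption: a machine may have no login of this kind
            username, password = single_login(logins, kind)
            lines.append(f"{username} {password}")
    return "\n".join(lines) + "\n"
```

The monitor and heritrix logins are checked separately, which mirrors the rule that each kind must agree within a machine while the two kinds may differ from each other.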
Log property file
A log property file is created for each application. The file given as input is copied and changed to fit the application.
The only change in the log property file is replacing the tag APPID with the identification of the application (applicationName + "_" + applicationInstanceId). The "_" + applicationInstanceId part is only appended to the applicationName if the application has an applicationInstanceId.
The name of this application-specific log property file is "logback_" + applicationIdentification + ".xml", where the applicationIdentification is given as applicationName + "_" + applicationInstanceId, as described above.
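The naming rule can be written out as a short sketch. The Python below only illustrates the rule as described in this section; the function names and the sample application names are hypothetical.

```python
def application_identification(application_name, application_instance_id=None):
    """The application name, with "_" + instance id appended only when an id is set."""
    if application_instance_id:
        return application_name + "_" + application_instance_id
    return application_name


def log_file_name(identification):
    """Name of the application-specific log property file."""
    return "logback_" + identification + ".xml"


def customise_log_properties(template_text, identification):
    """Replace every occurrence of the APPID tag with the application identification."""
    return template_text.replace("APPID", identification)
```

For example, a hypothetical BitarchiveApplication with the instance id BA1 gets the identification BitarchiveApplication_BA1, while an application without an instance id keeps its plain name.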
Security policy file
The security policy file for a bitarchive machine is initially a copy of the security policy file given as argument. This machine-specific security policy file is then modified to suit the needs of the machine and its applications.
The tag ROLE is replaced by the monitor.jmxUsername for the machine. This has to be defined on the machine level in the deploy configuration file.
Permission to read the baseFileDir under bitarchive is granted for all applications. The paths to these directories are changed to fit the syntax of the security policy file.
Evaluation
It is possible to evaluate the content of the configuration file when deploying, by giving the '-E' parameter with the argument 'y' or 'yes'. This is a tool for finding bugs within a configuration file (e.g. a misspelled name or a wrongly placed branch).
The evaluation checks whether all the branches in the configuration file can be found within the default settings, and issues a warning for each one it cannot find. It does not check whether the content of these branches is correct (e.g. http-port = -1); it only checks whether the branches also exist in the default settings.
Deploy does not abort the program when unknown branches are found. It only generates warnings about each unknown branch and then continues with the deployment.
Some modules have plugins which use values within the settings that are not part of the default settings, and these will therefore be reported as unknown. Such plugin-specific branches should not be considered errors, even though warnings are issued for them.
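The branch check described above can be illustrated with a small sketch. The Python below is not the evaluation code in deploy; it assumes that both inputs are plain XML strings without namespaces and that a branch is identified by its dotted path of element names. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def branch_paths(element, prefix=""):
    """Yield the dotted path of the element and of every element below it."""
    path = f"{prefix}.{element.tag}" if prefix else element.tag
    yield path
    for child in element:
        yield from branch_paths(child, path)


def evaluate(config_settings_xml, default_settings_xml):
    """Return the branches of the configuration that are unknown in the defaults.

    Only the existence of a branch is checked, never its content, and
    nothing is aborted: the caller just receives the list of warnings.
    """
    known = set(branch_paths(ET.fromstring(default_settings_xml)))
    warnings = []
    for path in branch_paths(ET.fromstring(config_settings_xml)):
        if path not in known and path not in warnings:
            warnings.append(path)
    return warnings
```

A misspelled element such as a hypothetical `<prot>` next to `<port>` is reported, whereas an implausible value inside a known branch, such as a port of -1, passes unnoticed.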
Test instance
When test arguments are given, a new configuration file is created with _test appended to the name (e.g. deploy_config.xml will have the test instance configuration file deploy_config_test.xml).
The test arguments are, in order: the HTTP offset port (test_HttpOffsetPort), the HTTP port (test_HttpPort), the environment name and test_Mailreceivers. These arguments are separated by commas and given without spaces between them, in the above order. An Offset variable is calculated as the difference between the test_HttpPort and the test_HttpOffsetPort. The value of this Offset must be between 0 and 9.
The test arguments are applied to the deploy_config_test.xml file, where the following changes are made:
- The environmentName is changed to the environment name given in the test arguments.
- For every level, test_HttpPort replaces the value in the settings path settings.common.http.port.
- For every level, test_Mailreceivers replaces the value in the settings path settings.common.notification.receiver.
- For every level, Offset replaces a single digit in some four-digit ports under settings. This is seen in the table below.
For example, Offset = 7 and settings.common.jmx.port = 1234 will yield a new settings.common.jmx.port = 1274 for the test instance, whereas settings.harvester.harvesting.heritrix.jmxPort = 1234 will yield a new settings.harvester.harvesting.heritrix.jmxPort = 1734.
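The Offset arithmetic can be reproduced with a short sketch. The Python below is illustrative only: the function names and the sample HTTP ports are hypothetical, and which digit is replaced for a given setting is passed in explicitly, since only the two examples above are known here.

```python
def compute_offset(test_http_port, test_http_offset_port):
    """Offset is the difference between the two ports and must be 0..9."""
    offset = test_http_port - test_http_offset_port
    if not 0 <= offset <= 9:
        raise ValueError(f"Offset must be between 0 and 9, got {offset}")
    return offset


def replace_digit(port, position, offset):
    """Replace the digit at the zero-based position of a four-digit port."""
    digits = str(port)
    if len(digits) != 4:
        raise ValueError("only four-digit ports are handled")
    return int(digits[:position] + str(offset) + digits[position + 1:])


offset = compute_offset(8077, 8070)      # hypothetical ports giving Offset = 7
jmx_port = replace_digit(1234, 2, offset)       # third digit, as for settings.common.jmx.port
heritrix_port = replace_digit(1234, 1, offset)  # second digit, as for the heritrix jmxPort
```

With these inputs `jmx_port` is 1274 and `heritrix_port` is 1734, matching the two examples given above.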
An installation script is created for each physical location. This script contains the commands for performing the installation on all the machines of the physical location, as described in the pseudo code.
The figure below shows the pattern of installation.
Install script pseudo code
The install script for a physical location has the following procedure:
- For each machine do the following:
- Install the NetarchiveSuite file.
- Install the necessary directories.
- Install scripts, settings and database.
Install the NetarchiveSuite file
The NetarchiveSuite file is copied to the machine using scp (secure copy). Then the file is unzipped in the installation directory, which is created as a subdirectory in the local user directory.
Install necessary directories
In the config file a number of directories are defined, and these directories have to be created during the installation on a machine. The following table shows which directories are created, based on the main branch where they are defined and their path from this branch. The branch level represents where the applications have to be defined before they can be applied. They can easily be defined in a prior instance, and then be inherited by the given branch level.
where $/ in Directory is the value of the path. All the directories along this path will be created, if they do not exist already. A directory is only created if the path is defined under settings for the branch level (or inherited at the branch level) and it contains a non-empty value.
The installation of the directories is executed from the installDir. The directories are only created if they do not already exist, with the optional exception of the tempDir, which is removed before creation if the -R argument is set to 'yes'. Only the directory at the end of the path has its content removed, not all the directories along the path. E.g. a tempDir with the path myPath/myEndDir will only clean the directory 'myEndDir', and not the directory 'myPath'.
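The directory rule, including the tempDir reset, can be sketched locally. The Python below imitates the described behaviour on the local file system rather than through ssh; the function name and its parameters are hypothetical.

```python
import os
import shutil


def install_directory(install_dir, relative_path, is_temp_dir=False, reset_temp=False):
    """Create relative_path below install_dir, including all directories along the path.

    An existing directory is left alone, except for the tempDir when a reset
    is requested: then only the directory at the end of the path is removed
    and created again, so its parent directories keep their content.
    """
    target = os.path.join(install_dir, relative_path)
    if is_temp_dir and reset_temp and os.path.isdir(target):
        shutil.rmtree(target)  # removes the end directory and its content only
    os.makedirs(target, exist_ok=True)
    return target
```

Calling it for myPath/myEndDir with a reset empties myEndDir, while any other files stored directly in myPath are untouched.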
On Linux/Unix machines directories are created directly through ssh, while Windows machines use a batch program, which is installed, run and then deleted.
(This is because only a single command line can be run through ssh, and this command line is run as bash on Linux/Unix and as batch on Windows. Since bash can take many commands on a single command line, it is possible to install all the directories through ssh on Linux/Unix. batch on the other hand can only handle a single command per command line, and the directories can therefore not be installed through a single ssh call. The batch commands to install the directories are therefore combined in a batch program, which is installed on the Windows machine, then run and afterwards deleted.)
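The difference between the two platforms can be illustrated by generating the commands. The Python below is a sketch under assumptions: the exact commands written by deploy are not shown in this section, so the shell and batch lines here are plausible stand-ins, and the function names and the login string are hypothetical.

```python
import shlex


def linux_mkdir_command(login, install_dir, directories):
    """Build one ssh call: bash accepts many commands on a single command line."""
    steps = [f"cd {shlex.quote(install_dir)}"]
    for directory in directories:
        quoted = shlex.quote(directory)
        steps.append(f"if [ ! -d {quoted} ]; then mkdir -p {quoted}; fi")
    return ["ssh", login, "; ".join(steps)]


def windows_batch_lines(install_dir, directories):
    """Build the lines of a batch program: one command per line.

    The program would be copied to the Windows machine, run and then deleted.
    """
    lines = [f'cd "{install_dir}"']
    for directory in directories:
        lines.append(f'if not exist "{directory}" mkdir "{directory}"')
    return lines
```

On Linux/Unix everything fits in the single string handed to ssh, whereas the Windows variant yields several lines that have to be shipped as a file.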
Install scripts, settings and database
The jmxremote.password file must not be writable while the applications are running, which means that this file cannot be reinstalled before it has been made writable again.
Then all the script and setting files are copied from the local directory with the machine name to the 'conf/' directory in the installation directory on the machine.
Then the optional database is handled, though only on the machines with a specified database directory. This database overrides the existing standard database in the NetarchiveSuite package. The database is then unzipped to the database directory, but only if that directory is empty.
Then the scripts are made executable and the jmxremote.password is made read-only.
Start, Restart and Kill
The figure below shows how the applications are started. The same pattern is used for killing the applications again (replace start with kill in the figure).
Note that an application cannot be started if it is already running. How this is checked differs between the two supported platforms, Linux and Windows, as described below.
The restart script can be used for restarting the running applications. It starts by calling the killall script, then waits 5 seconds for the applications to terminate completely, and finally runs the startall script. This script can be used for Windows Services (automatic execution during startup).
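The restart sequence is simple enough to sketch directly. The Python below is an illustration, not the generated restart script: the two callables stand in for running the killall and startall scripts, and the injectable sleep function is an assumption added so the sketch can be exercised without waiting.

```python
import time


def restart(kill_all, start_all, wait_seconds=5, sleep=time.sleep):
    """Kill all applications, wait for them to terminate, then start them again."""
    kill_all()
    sleep(wait_seconds)  # gives the applications time to terminate completely
    start_all()
```

The fixed pause reflects the 5 seconds mentioned above; the sketch does not verify that the applications have actually stopped before starting them again.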
On the Linux platform an application is only started if no instances of this application are found among the running processes. Likewise an application is only killed if it can be found in the process list.
An instance of a specific application is found in the list of running processes by looking for any process with the same name that uses the same settings file.
When killing an application of the instance dk.netarkivet.harvester.heritrix3.HarvestControllerApplication, any corresponding Heritrix application is also killed.
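The Linux lookup can be sketched as a filter over a process listing. The Python below is an illustration under assumptions: each input line is taken to be a process id followed by the command line, and the `-Dsettings.file=` property and the settings file names in the usage example are hypothetical. The generated scripts may identify processes differently.

```python
def find_running(process_lines, application_class, settings_file):
    """Return the PIDs of processes running application_class with settings_file."""
    pids = []
    for line in process_lines:
        fields = line.split()
        if not fields:
            continue
        same_name = application_class in fields[1:]
        same_settings = any(settings_file in field for field in fields[1:])
        if same_name and same_settings:
            pids.append(int(fields[0]))
    return pids


def may_start(process_lines, application_class, settings_file):
    """An application is only started if no matching process is found."""
    return not find_running(process_lines, application_class, settings_file)


listing = [
    "101 java -Dsettings.file=conf/settings_A.xml dk.netarkivet.harvester.heritrix3.HarvestControllerApplication",
    "102 java -Dsettings.file=conf/settings_B.xml dk.netarkivet.harvester.heritrix3.HarvestControllerApplication",
]
running = find_running(
    listing,
    "dk.netarkivet.harvester.heritrix3.HarvestControllerApplication",
    "conf/settings_A.xml",
)
```

Matching on both the name and the settings file is what lets two instances of the same application class coexist on one machine: here only process 101 is selected.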
On Windows, several files are required to run an application and to make sure that at most one instance of it is running: two scripts for killing it, two scripts for starting it and one temporary file for telling whether an instance is running.
The application can only be started if the temporary run-file does not exist. It is done by calling a VBS script for running the application. This script starts the application as a process and saves the method for killing this process in a kill-process file.
The application can only be killed if the temporary run-file exists. The kill-process file is called for killing the process of the application. Then the temporary run-file is removed, thus telling that the application is not running and can be started again.
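The role of the temporary run-file can be shown with a small sketch. The Python below models only the gating logic, not the VBS and batch scripts themselves; the function names are hypothetical, the callables stand in for launching and killing the process, and having the start step create the run-file is an assumption.

```python
import os


def start(run_file, launch):
    """Start the application only if the run-file does not exist."""
    if os.path.exists(run_file):
        return False  # an instance is considered to be running already
    launch()
    with open(run_file, "w") as handle:
        handle.write("running\n")
    return True


def kill(run_file, kill_process):
    """Kill the application only if the run-file exists, then remove the file."""
    if not os.path.exists(run_file):
        return False  # nothing is considered to be running
    kill_process()
    os.remove(run_file)  # marks the application as stopped, so it can start again
    return True
```

A second start is refused until kill has removed the run-file. A leftover run-file, for instance after a crash, would also block a new start in this sketch.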
The Heritrix application is not killed when an application of the instance dk.netarkivet.harvester.heritrix3.HarvestControllerApplication is killed. This is because Heritrix has not been thoroughly tested on Windows, and might not be supported.