
dk.netarkivet.archive.tools.ReestablishAdminDatabase

This tool is run only when converting from the file-based administration of ArcRepository to the database-based administration of ArcRepository. It reads the admin.data file and enters its data into the database.

Prerequisites

The external database must be running, the admin.data file must exist, and the tool must be run from the installation directory.

Usage

java -Ddk.netarkivet.settings.file=conf/settings_ArcRepositoryApplication.xml \
dk.netarkivet.archive.tools.ReestablishAdminDatabase [admin.data]

The optional argument admin.data is the path to the admin.data file. By default, the file is assumed to be called 'admin.data' and to be located in the directory where the tool is run. The argument is therefore only necessary if the file is in another directory or has another name (e.g. backups/admin.data or admin.data.backup).
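The default-path rule can be mirrored in a small wrapper that checks for the file before the tool is started. This is only a minimal sketch: the function name resolve_admin_data is hypothetical and not part of NetarchiveSuite, and it does nothing more than check that the file exists and print the path that would be passed to the tool.

```shell
# resolve_admin_data is a hypothetical helper, not part of NetarchiveSuite.
# It prints the admin.data path to use: the first argument if one is given,
# otherwise 'admin.data' in the current directory (the tool's default).
# It returns non-zero if that file does not exist.
resolve_admin_data() {
    admin_data="${1:-admin.data}"
    if [ ! -f "$admin_data" ]; then
        echo "admin.data file not found: $admin_data" >&2
        return 1
    fi
    printf '%s\n' "$admin_data"
}

# Possible use before the java command shown above:
#   ADMIN_DATA=$(resolve_admin_data backups/admin.data) || exit 1
```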

dk.netarkivet.archive.tools.CreateIndex

This tool forces the IndexServer to create indices preemptively. It can be used to retrieve logs and CDX files for previously completed harvestjobs before they are actually needed. This can be helpful if you want to reduce the time it takes to generate deduplication indices.

Prerequisites

You need to have an IndexServerApplication online. If you use HTTP as the file transport method, you probably also need to override the setting settings.common.remoteFile.port in order to avoid port conflicts (in the example below, the port number is set to 5000).

Furthermore, all harvestjobs referred to in the CreateIndex command must have metadata-1.arc files stored in the archive.

Usage

export INSTALLDIR=/fullpath/to/installdir
export CLASSPATH=$INSTALLDIR/lib/dk.netarkivet.archive.jar
export OPTS="-Dsettings.common.cacheDir=/tmp/cache \
-Dsettings.common.environmentName=QUICKSTART -Dsettings.common.remoteFile.port=5000"
java $OPTS dk.netarkivet.archive.tools.CreateIndex -t dedup -l 1,2
ctrl-c

This requests a deduplication index based on the harvestjobs with IDs 1 and 2, and stores the index in /tmp/cache/DEDUP_CRAWL_LOG/1-2-cache.
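To check from a script whether the index has arrived, the expected directory can be derived from the cache directory and the job list. The sketch below only generalizes the single example above; the function name expected_dedup_index_dir is hypothetical, and the naming that NetarchiveSuite actually uses for long job lists may differ.

```shell
# expected_dedup_index_dir is a hypothetical helper for illustration only.
# It follows the example above: job IDs 1,2 with cacheDir /tmp/cache give
# /tmp/cache/DEDUP_CRAWL_LOG/1-2-cache.
# Assumption: the same pattern holds for other short job lists.
expected_dedup_index_dir() {
    cache_dir="$1"   # value of settings.common.cacheDir
    job_ids="$2"     # comma-separated list as given to -l, e.g. 1,2
    printf '%s/DEDUP_CRAWL_LOG/%s-cache\n' \
        "$cache_dir" "$(printf '%s' "$job_ids" | tr ',' '-')"
}

expected_dedup_index_dir /tmp/cache 1,2
```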

dk.netarkivet.archive.tools.GetFile

With this tool you can retrieve a file from your archive.

Prerequisites

If you want to use an arcrepository client other than the default (dk.netarkivet.archive.arcrepository.distribute.JMSArcRepositoryClient), you need to override the setting

settings.common.arcrepositoryClient.class

If you use the default, you need to set the environmentName correctly so that your ArcRepositoryApplication receives your GetFile request. You also need to define your replicas and the replicaId of the replica from which you want to get the data. All this is most easily put into a local settings.xml:

<settings>
    <common>
        <environmentName>QUICKSTART</environmentName>
        <replicas>
            <replica>
                <replicaId>SH</replicaId>
                <replicaType>bitarchive</replicaType>
                <replicaName>SHB</replicaName>
            </replica>
        </replicas>
        <useReplicaId>SH</useReplicaId>
    </common>
</settings>

In the settings.xml above, the environment name has been set to QUICKSTART, a single replica with replicaId "SH" has been defined, and the ID of the replica from which you want to get the data is also "SH".

Usage

export INSTALLDIR=/fullpath/to/installdir
export CLASSPATH=$INSTALLDIR/lib/dk.netarkivet.archive.jar
export SETTINGSFILE=/home/user/conf/settings_ArcRepositoryApplication.xml
export OPTS="-Ddk.netarkivet.settings.file=$SETTINGSFILE \
-Dsettings.common.remoteFile.port=5000"
java $OPTS dk.netarkivet.archive.tools.GetFile 3-metadata-1.arc

If the file 3-metadata-1.arc exists in your SH replica, it is downloaded from the archive and written to the current working directory. If it does not exist, the tool waits until the arcrepository client times out, which can take a long time. The tool has an optional second argument, which is a destination file:

export INSTALLDIR=/fullpath/to/installdir
export CLASSPATH=$INSTALLDIR/lib/dk.netarkivet.archive.jar
export SETTINGSFILE=/home/user/conf/settings_ArcRepositoryApplication.xml
export OPTS="-Ddk.netarkivet.settings.file=$SETTINGSFILE \
-Dsettings.common.remoteFile.port=5000"
java $OPTS dk.netarkivet.archive.tools.GetFile 3-metadata-1.arc destination-file.arc

dk.netarkivet.archive.tools.Upload

The tool "dk.netarkivet.archive.tools.Upload" allows you to upload ARC files to a repository of your choice.

The type of arcrepository you are uploading your files to is defined by the setting

settings.common.arcrepositoryClient.class

The default is dk.netarkivet.archive.arcrepository.distribute.JMSArcRepositoryClient. This client uses JMS messages to communicate with a repository.

Prerequisites

If you use the client dk.netarkivet.archive.arcrepository.distribute.JMSArcRepositoryClient, you need to ensure that you send upload requests to the correct JMS queue and that you receive the responses from the client. This is done by setting

settings.common.environmentName

to the proper value (e.g. PROD or DEV). The same holds for the setting

settings.common.applicationName

(e.g. Upload), and finally for "settings.common.applicationInstanceId" (e.g. ONE or TWO). If you intend to override any of the settings mentioned above, you can either do the overrides on the command line or write them to a settings file.
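As a sketch of the command-line variant, the three settings can be passed as -D system properties, in the same way that other examples in this manual pass -Dsettings.common.environmentName. The values DEV, Upload and ONE are the example values from the text; the helper name jms_identity_opts is hypothetical.

```shell
# jms_identity_opts is a hypothetical helper that builds the -D overrides
# for the three settings discussed above.
jms_identity_opts() {
    printf '%s %s %s\n' \
        "-Dsettings.common.environmentName=$1" \
        "-Dsettings.common.applicationName=$2" \
        "-Dsettings.common.applicationInstanceId=$3"
}

OPTS="$(jms_identity_opts DEV Upload ONE)"
echo "$OPTS"
# The result could then be used as:
#   java $OPTS -cp lib/dk.netarkivet.archive.jar dk.netarkivet.archive.tools.Upload file1.arc
```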

Using the tool

This tool will upload a number of local files to all replicas in the archive. An example of an execution command is:

export SETTINGSFILENAME=settings_ArcRepositoryApplication.xml
java -Ddk.netarkivet.settings.file=/home/user/conf/$SETTINGSFILENAME \
        -cp lib/dk.netarkivet.archive.jar \
        dk.netarkivet.archive.tools.Upload \
        file1.arc [file2.arc ...]

where file1.arc file2.arc ... are the files to be uploaded.

This will cause the files to be uploaded. Furthermore, the default client (JMSArcRepositoryClient) deletes a file locally once it has been uploaded successfully. This means that any files left after Upload has finished have probably not been stored safely.
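Since leftover files indicate a probable upload failure, a script can list them once Upload has finished. The following is a minimal sketch; report_leftovers is a hypothetical helper, and it relies only on the local-deletion behaviour of the default client described above.

```shell
# report_leftovers is a hypothetical helper, not part of NetarchiveSuite.
# Given the same file list that was passed to Upload, it prints every file
# that still exists locally and returns non-zero if any were found.
report_leftovers() {
    leftover=0
    for f in "$@"; do
        if [ -e "$f" ]; then
            printf '%s\n' "$f"
            leftover=1
        fi
    done
    return "$leftover"
}

# Possible use after the Upload command:
#   report_leftovers file1.arc file2.arc || echo "some files were not uploaded" >&2
```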

dk.netarkivet.archive.tools.GetRecord

This tool takes a CDX-based Lucene index and a URI, retrieves the corresponding ARC record from the archive, and dumps it to stdout.

Prerequisites

The same as for GetFile.

Usage

export INSTALLDIR=/fullpath/to/installdir
export CLASSPATH=$INSTALLDIR/lib/dk.netarkivet.archive.jar
export SETTINGSFILE=/home/user/conf/settings_ArcRepositoryApplication.xml
export LUCENE_INDEX=/tmp/cache/DEDUP_CRAWL_LOG/1-cache
export URI=http://www.netarkivet.dk
export OPTS="-Ddk.netarkivet.settings.file=$SETTINGSFILE \
-Dsettings.common.remoteFile.port=5000"
java $OPTS dk.netarkivet.archive.tools.GetRecord $LUCENE_INDEX $URI

If the URI is not in the given index, an exception is sent to stdout with the message "Resource missing in index or repository for URI".
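Because both the record and the error are written to stdout, a script that calls GetRecord may want to check the captured output for the error message before using it. The following is a minimal sketch; the function name is_missing_record is hypothetical, and it assumes that the message text quoted above appears verbatim in the output.

```shell
# is_missing_record is a hypothetical helper, not part of NetarchiveSuite.
# It succeeds if the file holding the captured stdout of GetRecord contains
# the error message quoted above (assumed to appear verbatim).
is_missing_record() {
    grep -q 'Resource missing in index or repository for URI' "$1"
}

# Possible use, after redirecting the GetRecord output to record.out:
#   if is_missing_record record.out; then echo "no record found" >&2; fi
```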

TODO: Mention how to make a Lucene index for your stored ARC files.

dk.netarkivet.archive.tools.RunBatch

The bitarchives are designed to receive batch programs to run on all the ARC files stored in the bitarchive. This holds whether the bitarchive is installed as a local arc-repository or as a distributed repository with several bitarchives. Batch programs are also used internally by the NetarchiveSuite software for specific tasks such as getting the CDX files for a specific job, checksums of ARC files stored in the bitarchive, or lists of ARC files in the bitarchive.

The RunBatch program is used to send your own batchjobs to the bitarchives.

Note that a batchjob will only be sent to one bitarchive replica!

It is not possible to send batchjobs to checksum replicas, as only bitarchive replicas can handle batchjobs.

Prerequisites for running a batch job

A number of prerequisites must be taken care of before a batch job can be executed. These are:

Channel settings to be able to make channel names to communicate with running system:

Other settings related to communication, where the running system's settings differ from the defaults.

If the batch program is given in a single class file, this must be specified in the parameter:

export SETTINGSFILENAME=settings_ArcRepositoryApplication.xml
java -Ddk.netarkivet.settings.file=/home/user/conf/$SETTINGSFILENAME \
  -cp lib/dk.netarkivet.archive.jar \
   dk.netarkivet.archive.tools.RunBatch \
   -CFindMime.class -R10-*.arc -BReplicaOne -Oresfile

This puts lib/dk.netarkivet.archive.jar on the class path and executes the general NetarchiveSuite program dk.netarkivet.archive.tools.RunBatch with the settings from the file /home/user/conf/settings_ArcRepositoryApplication.xml. The result is that the batch program FindMime.class is run on the bitarchive replica named ReplicaOne, but only on files with names matching the pattern

 10-*.arc 

The results written by the batch program are concatenated and placed in the output file named resfile.

Example of packing and executing a batch job

To package the files do the following:

jar -cvf batch.jar PATH/batchProgram.class

where PATH is the path to the directory where the batch class files are placed. In an Eclipse project, this is under the bin/ directory. The file batchProgram.class is the compiled class file of your batch program.

The call to run this batch job is then:

export SETTINGSFILENAME=settings_ArcRepositoryApplication.xml
java -Ddk.netarkivet.settings.file=conf/$SETTINGSFILENAME \
        -cp lib/dk.netarkivet.archive.jar \
        dk.netarkivet.archive.tools.RunBatch \
        -Jbatch.jar -Npath.batchProgram

where path in the -N argument has all '/' changed to '.'.

For example, to run the batch job from the file myBatchJobs/arc/MyArcBatchJob.java, which extends the ARCBatchJob class (dk/netarkivet/common/utils/arc/ARCBatchJob), do the following:

1. Place yourself in the bin/ folder under your project:

cd bin/

2. Package the compiled Java binaries into a .jar file:

 jar -cvf batch.jar myBatchJobs/arc/*

3. Move the packaged batch job to your NetarchiveSuite directory.

 mv batch.jar ~/NetarchiveSuite/

4. Run the following command to execute the batch job:

export SETTINGSFILENAME=settings_ArcRepositoryApplication.xml
java -Ddk.netarkivet.settings.file=conf/$SETTINGSFILENAME \
        -cp lib/dk.netarkivet.archive.jar:lib/dk.netarkivet.common.jar \
        dk.netarkivet.archive.tools.RunBatch -Jbatch.jar -NmyBatchJobs.arc.MyArcBatchJob

The lib/dk.netarkivet.common.jar library needs to be included in the classpath, since the batch job (myBatchJobs/arc/MyArcBatchJob) inherits from a class within this library (dk/netarkivet/common/utils/arc/ARCBatchJob).
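The value given to -N in step 4 follows the rule that every '/' in the path of the class inside the jar is changed to '.'. This conversion can be sketched as a small shell function; the name class_name_for_N is hypothetical, and dropping a trailing .class suffix is an assumption based on the examples in this section.

```shell
# class_name_for_N is a hypothetical helper for illustration.
# It turns the path of a class file inside the jar into the value for -N:
# a trailing .class is dropped and every '/' becomes '.'.
class_name_for_N() {
    printf '%s\n' "${1%.class}" | tr '/' '.'
}

class_name_for_N myBatchJobs/arc/MyArcBatchJob.class
```

For the class file used in the steps above, this prints myBatchJobs.arc.MyArcBatchJob, which is the value passed to -N in step 4.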

Security

If the security properties for the bitarchive (independent of this execution) are set as described in the Configuration Manual, the batch program will not be allowed to: