Test integration of NetarchiveSuite and Wayback

 

Goals

Prerequisites

1) All NetarchiveSuite applications on devel@kb-test-way-001.kb.dk now use the same Derby server, listening on port 50002. If this server is down, start it with the command

cd derbyDB; bash start_derby.sh

2) This test is to be run with OpenWayback. IA wayback is no longer supported.

Procedure

Clean TEST12 derby database

On devel@kb-test-way-001.kb.dk

cd derbyDB
./stop_derby.sh
rm -r wayback_indexertest12_db
./start_derby.sh 

Note: some instances on devel@kb-test-way-001.kb.dk may need to be restarted after this operation (specifically, SystemTest and StressTest instances).

Prepare Installation

Run a standard devel setup as described in "Setup DK test environment".

Upload a Small Bitarchive

On devel@kb-prod-udv-001.kb.dk:

 

scp -r ${HOME}/bitarchive_testdata kb-test-adm-001:
ssh kb-test-adm-001 chmod 755  bitarchive_testdata/upload.sh
ssh kb-test-adm-001 bitarchive_testdata/upload.sh $TESTX arcfiles
ssh kb-test-adm-001 bitarchive_testdata/upload.sh $TESTX warcfiles
ssh kb-test-adm-001 bitarchive_testdata/upload.sh $TESTX warcgzfiles

Build Netarkivets Fork of OpenWayback

Clone the repository, then build it and install it in your local Maven repository:

git clone https://github.com/csrster/openwayback-csrdev
cd openwayback-csrdev
mvn -DskipTests clean install

Alternatively use "mvn -DskipTests clean deploy" to deploy a new snapshot to nexus.

Build Netarkivets OpenWayback Overlay 

Clone the repository 

git clone https://github.com/netarchivesuite/netarkivet-openwayback-overlay.git

Edit pom.xml to refer to the latest NetarchiveSuite snapshot version and to the same OpenWayback version installed in the previous step (currently 2.4.0-NAS-SNAPSHOT), then build the package:

cd netarkivet-openwayback-overlay
mvn clean package

This builds the warfile target/netarkivet-openwayback.war which should be renamed to "wayback.war" for the next step.

Construct A Clean Wayback Environment

Check out the deploy template from ssh://git@sbprojects.statsbiblioteket.dk:7999/nark/openwayback-config.git (possibly by running git clone ssh://git@sbprojects.statsbiblioteket.dk:7999/nark/openwayback-config.git directly on kb-test-way-001.kb.dk). If you checked it out elsewhere, copy the entire tree to kb-test-way-001.

Follow the instructions in the Readme.md file in the wayback_deploytemplate directory. Note the following:

  1. The name of the directory should normally be wayback_test12
  2. The procedure for building a warfile is described above
  3. The Tomcat version is 6.0.26
  4. The NAS settings file contained in the git repository can be used unchanged
  5. The default ports for the proxy endpoint in settings.conf should be changed to your assigned tester port
  6. If the redirect port 8443 in conf/tomcat_conf/server.xml is not available, change it to 8444
  7. Drop netarkivet-openwayback.war, renamed to wayback.war, into the wars directory of the installation.

Now start wayback/tomcat with the start script in wayback_test12/bin. 

Check the log for error messages.

First do a sanity test that wayback is running and that the configuration is sane.

After this you can try accessing the proxy endpoint via ssh port forwarding (see details below).

Redeploying to an existing installation

To redeploy to an existing wayback installation

  1. Drop the warfile wayback.war in the wars directory
  2. Touch the context-descriptor file 

    touch tomcat/conf/Catalina/localhost/ROOT.xml


  3. Wait a few seconds, then restart wayback with the provided script 

    bin/start_wayback.sh
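
The three redeploy steps above can be combined into one function. This is a minimal sketch, not a script shipped with the deploy template: the name redeploy_wayback, its arguments and the default delay of 5 seconds are assumptions made for illustration. The paths inside the installation are the ones named in the steps above.

```shell
# Hypothetical helper: redeploy a new warfile into an existing wayback installation.
# Arguments: path to the new warfile, path to the installation directory,
# optional delay in seconds ("a few seconds"; the default of 5 is an assumption).
redeploy_wayback() (
    war="$1"
    install_dir="$2"
    delay="${3:-5}"
    # 1. Drop the warfile in the wars directory, named wayback.war.
    cp "$war" "$install_dir/wars/wayback.war" || exit 1
    cd "$install_dir" || exit 1
    # 2. Touch the context-descriptor file.
    touch tomcat/conf/Catalina/localhost/ROOT.xml || exit 1
    # 3. Wait, then restart wayback with the provided script.
    sleep "$delay"
    bin/start_wayback.sh
)
```

For example, redeploy_wayback target/netarkivet-openwayback.war ~/wayback_test12 would redeploy into an installation in ~/wayback_test12 (assuming that is where the installation lives).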


Check That Wayback Proxy Endpoint Is Working

On devel@kb-prod-udv-001

ssh -g -N -L$PORT:kb-test-way-001.kb.dk:$PORT kb-test-way-001.kb.dk &

Now, in a browser of your choice, set the connection settings to use kb-prod-udv-001.kb.dk, port $PORT, as proxy. In Firefox, a good approach is to run firefox -P --no-remote and create a new profile which uses this proxy setting and has wayback as its start page.

Go to http://kb-test-way-001.kb.dk:8080/ (or whichever port you set up as the wayback endpoint in settings.conf) and check that you can see the wayback search box.

Wait for Indexing to Complete

On kb-prod-udv-001, watch for the indexer application to run by executing:

 [devel@kb-prod-udv-001 ~]$ watch -n 10 'ssh  devel@kb-test-way-001 tail -n 30 $TESTX/log/WaybackIndexerApplication0.log.0'

The indexer runs every five minutes. If you are impatient, just log onto kb-test-way-001 and in the directory $TESTX/conf kill and restart the indexer. It will run right away.

You can follow the progress of indexing with the following two commands

[devel@kb-prod-udv-001 ~]$ ssh  devel@kb-test-way-001 grep \'Creating object\' $TESTX/log/WaybackIndexerApplication0.log| wc -l

[devel@kb-prod-udv-001 ~]$ ssh  devel@kb-test-way-001 grep \'Received\' $TESTX/log/WaybackIndexerApplication0.log| grep arc|wc -l

The first gives the number of files discovered by the indexer, and the second gives the number of files indexed. When these are equal, indexing is done.
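
The comparison of the two counts can also be done in one step. The following is a minimal sketch, not part of NetarchiveSuite: the function name indexing_progress is hypothetical, and it has to be run where the log file is readable (for example on kb-test-way-001 itself). It uses the same grep patterns as the two commands above. As an extra assumption, it only reports completion when at least one file has been discovered, so that an empty log is not mistaken for a finished run.

```shell
# Hypothetical helper: compare discovered and indexed file counts in an indexer log.
# Exit status is 0 only when indexing looks complete.
indexing_progress() (
    log="$1"
    # Files discovered by the indexer.
    discovered=$(grep -c 'Creating object' "$log" || true)
    # Files indexed: "Received" lines that also mention "arc" (arc and warc files).
    indexed=$(grep 'Received' "$log" | grep -c 'arc' || true)
    echo "discovered=$discovered indexed=$indexed"
    # Done when the counts are equal; the "greater than zero" check is an assumption.
    [ "$discovered" -gt 0 ] && [ "$discovered" -eq "$indexed" ]
)
```

A possible use on kb-test-way-001 would be: indexing_progress $TESTX/log/WaybackIndexerApplication0.log && echo "indexing done"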

Wait for Aggregator

After the indexer has run, wait for the aggregator to run by watching for the creation of the index file:

[devel@kb-prod-udv-001 ~]$ watch -n 10 'ssh  devel@kb-test-way-001 ls /home/devel/$TESTX/indexDir/'

until the file wayback_intermediate.index appears. This will take at most ten minutes. If you are impatient, just log onto kb-test-way-001 and in the directory $TESTX/conf kill and restart the aggregator. It will run right away.
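
As an alternative to watching the directory listing by hand, the wait can be scripted. This is a minimal sketch under stated assumptions: the function name wait_for_file is hypothetical, the default timeout of 600 seconds reflects the "at most ten minutes" above, and the default polling interval of 10 seconds mirrors the watch command. It must run on the machine where the index file is created.

```shell
# Hypothetical helper: poll until a file exists or a timeout expires.
# Arguments: file path, optional timeout in seconds, optional polling interval in seconds.
wait_for_file() (
    file="$1"
    timeout="${2:-600}"
    interval="${3:-10}"
    waited=0
    while [ ! -e "$file" ]; do
        # Give up once the timeout has been reached.
        if [ "$waited" -ge "$timeout" ]; then
            echo "timed out waiting for $file" >&2
            exit 1
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
    echo "found $file"
)
```

On kb-test-way-001 this could be called as: wait_for_file /home/devel/$TESTX/indexDir/wayback_intermediate.index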

Move The Index File

Move the index file to the place where wayback expects to read it. [I think this is now unnecessary - CSR]

[devel@kb-prod-udv-001 ~]$ ssh  devel@kb-test-way-001 mv /home/devel/$TESTX/indexDir/wayback_intermediate.index /home/devel/wayback_cdx/index.cdx

Browse Repository

In the proxied browser you should now be able to search and browse in the repository. The following standard domains are present in the arcfiles:

www.netarkivet.dk, www.kaarefc.dk, www.oernhoej.dk, www.pligtaflevering.dk, www.drive-badmintonklub.dk, www.dbc.dk,

www.kb.dk, www.bs.dk, www.sulnudu.dk, www.kum.dk, www.trinekc.dk, www.slothchristensen.dk, www.trineogkaare.dk, www.sy-jonna.dk,

www.kaareogtrine.dk, www.raeder.dk, www.statsbiblioteket.dk

In addition, the following domains are present in the warcfiles:

news.dk, , jp.dk

The warc.gz files contain a single harvest each of honda.dk, toyota.dk, mazda.dk and sa.dk from 2016-10-31. sa.dk is an example of an https site which renders badly in the current version of wayback used by Netarkivet.

Test Exclusions

  1. Use the wayback advanced search page to list all the URLs harvested from a particular domain.
  2. Choose some of them that you would like to block, and write regular expressions matching them.
  3. On devel@kb-test-way-001, add these regular expressions (one per line) to the file conf/wayback_regexps.txt under the wayback installation folder.
  4. On devel@kb-test-way-001, restart tomcat by executing the script bin/start_wayback.sh under the wayback installation folder.
  5. Check that the blocked URLs are no longer visible in advanced search.
  6. Check that if you try to visit one of the blocked URLs, wayback shows you a page informing you that the content has been blocked.
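
Before restarting tomcat it can be useful to get a rough idea of which URLs a set of candidate expressions would hit. The following is only a sketch: the function name preview_blocked and both input files are hypothetical, and grep -E uses POSIX extended regular expressions with substring matching, which may differ from how wayback itself interprets the expressions. Treat the output as a preview, not as a guarantee of what wayback will block.

```shell
# Hypothetical helper: list the URLs that a set of candidate patterns would match.
# NOTE: grep -E semantics (POSIX ERE, substring match) are an approximation;
# wayback's own matching may behave differently.
# An empty line in the patterns file matches every URL, so avoid blank lines.
preview_blocked() (
    patterns="$1"   # candidate regular expressions, one per line
    urls="$2"       # URLs copied from the advanced search page, one per line
    grep -E -f "$patterns" "$urls"
)
```

For example, preview_blocked candidate_regexps.txt harvested_urls.txt prints every line of harvested_urls.txt that at least one expression matches.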

Test NetarchiveCacheResourceStore

  1. On devel@kb-test-way-001 stop the wayback tomcat server using the stop_wayback.sh script in the bin folder of the wayback installation

  2. Edit conf/wayback/wayback.xml to use NetarchiveCacheResourceStore instead of NetarchiveResourceStore.
  3. Make sure that the NAS settings file in the conf directory includes a block with the following settings 

    <resourcestore>
      <maxfiles>10</maxfiles>
      <cachedir>/tmp</cachedir>
    </resourcestore>


  4. Start the tomcat again.

  5. Check that you can still browse in the material.
  6. Shut down the wayback server.

Shut Down the Test

  1. On devel@kb-prod-udv-001 execute

    cleanup_all_test.sh


  2. If you have a background ssh port-forwarding process running a proxy to wayback then you should also kill this at this stage.