Test integration of NetarchiveSuite and Wayback

 

Goals

  • Test search and retrieval of harvested material through wayback.

Prerequisites

1) All NetarchiveSuite apps on devel@kb-test-way-001.kb.dk now use the same Derby server, listening on port 50002. If this server is down, start it with the command

cd derbyDB; bash start_derby.sh

2) This test can be run either with OpenWayback or with Netarkivet's fork of IA Wayback. The procedure for building the relevant war package is described below.
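
For prerequisite 1, a quick way to see whether the Derby server is up is to probe its port before starting it again. The following is a minimal sketch, assuming bash and the coreutils timeout command are available on the host; the helper name port_is_open is hypothetical and not part of the test scripts.

```shell
# Return 0 if a TCP connection to host:port succeeds within 2 seconds.
# Relies on bash's built-in /dev/tcp redirection (an assumption: bash is installed).
port_is_open() {
    local host="$1" port="$2"
    timeout 2 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

# Port 50002 is the Derby port named in prerequisite 1.
if port_is_open localhost 50002; then
    echo "Derby is listening on port 50002"
else
    echo "Derby is not reachable on port 50002; start it with start_derby.sh"
fi
```

Run this on kb-test-way-001 itself, since it probes localhost.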

Procedure

Clean TEST12 derby database

On devel@kb-test-way-001.kb.dk

cd derbyDB
./stop_derby.sh
rm -r wayback_indexertest12_db
./start_derby.sh 

Note: some instances on devel@kb-test-way-001.kb.dk may need to be restarted after this operation (specifically, SystemTest and StressTest instances).

Prepare Installation

Run a standard devel setup as described in "Setup DK test environment".

Upload a Small Bitarchive

 

scp -r ${HOME}/bitarchive_testdata kb-test-adm-001:
ssh kb-test-adm-001 chmod 755  bitarchive_testdata/upload.sh
ssh kb-test-adm-001 bitarchive_testdata/upload.sh $TESTX arcfiles
ssh kb-test-adm-001 bitarchive_testdata/upload.sh $TESTX warcfiles
ssh kb-test-adm-001 bitarchive_testdata/upload.sh $TESTX warcgzfiles

Install OpenWayback

The procedure here is for a clean install of OpenWayback. To upgrade an existing installation it should only be necessary to prepare a new wayback.war file and drop it into the wars directory of the preexisting installation.

Check out the deploy template from ssh://git@sbprojects.statsbiblioteket.dk:7999/nark/openwayback-config.git (for example with the command git clone ssh://git@sbprojects.statsbiblioteket.dk:7999/nark/openwayback-config.git on kb-prod-udv-001.kb.dk).

Follow the instructions in the Readme.md file in the wayback_deploytemplate directory. Note the following:

  1. The name of the directory should normally be wayback_test12
  2. The procedure for building a warfile is described below
  3. tomcat version is 6.0.26
  4. The NAS settings file from the wayback indexer can be used unchanged
  5. The default ports 8080/8090 are not usually available, so change these in settings.conf to 8081 and 8091.
  6. conf/tomcat_conf/server.xml also specifies a redirect port 8443, which is not available. Change this to 8444.

To build an OpenWayback for NAS

  1. Check out the git repository https://github.com/netarchivesuite/netarkivet-openwayback-overlay
  2. The pom.xml specifies version numbers for both NAS and OpenWayback. Check that these are the values we want.
  3. Build the overlay with 

    mvn clean package
  4. This should build a war file netarkivet-openwayback.war which can be renamed to wayback.war and dropped in the wars directory in the installation.

Now start wayback/tomcat with the start script in wayback_test12/bin. 

Check the log for error messages.

First do a sanity test that wayback is running and that the configuration is sane:

  • Use X-forwarding and start a firefox running directly on kb-test-way-001.kb.dk
  • Check that the browser is not set to use a proxy
  • Browse to localhost:8091 and check that you can reach wayback

After this you can try accessing the proxy endpoint (port 8081) via ssh port forwarding (see details below).

Alternatively Install IA-Wayback

The procedure for installing IA Wayback is identical to that given above except that the warfile is prepared differently.

  1. Clone the git project https://github.com/netarchivesuite/wayback-netarchivesuite
  2. Build the project with

    mvn clean package -DskipTests
  3. Copy the file wayback-1.8.0-SNAPSHOT.war to the wars directory in the wayback_test12 folder.

Check That Wayback Proxy Endpoint Is Working

On devel@kb-prod-udv-001

ssh -g -N -L$PORT:kb-test-way-001.kb.dk:8080 kb-test-way-001.kb.dk &

Now, in a browser of your choice, set the internet connection settings to use kb-prod-udv-001.kb.dk port $PORT as proxy. In Firefox, it is a good idea to execute firefox -P --no-remote and create a new profile which uses this proxy setting and points to wayback as its start page.

Go to http://kb-test-way-001.kb.dk:8080/ (or whichever port you set up as the wayback endpoint in settings.conf) and check that you can see the wayback search box.

Wait for Indexing to Complete

On kb-prod-udv-001, watch for the indexer application to run by executing:

 [devel@kb-prod-udv-001 ~]$ watch -n 10 'ssh  devel@kb-test-way-001 tail -n 30 $TESTX/log/WaybackIndexerApplication0.log.0'

The indexer runs every five minutes. If you are impatient, just log onto kb-test-way-001 and in the directory $TESTX/conf kill and restart the indexer. It will run right away.

You can follow the progress of indexing with the following two commands:

[devel@kb-prod-udv-001 ~]$ ssh  devel@kb-test-way-001 grep \'Creating object\' $TESTX/log/WaybackIndexerApplication0.log| wc -l

[devel@kb-prod-udv-001 ~]$ ssh  devel@kb-test-way-001 grep \'Received\' $TESTX/log/WaybackIndexerApplication0.log| grep arc|wc -l

The first gives the number of files discovered by the indexer, and the second gives the number of files indexed. When these are equal, indexing is done.
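
The comparison of the two counts can be scripted. The following is a minimal sketch that applies the same grep patterns as the commands above to a local copy of the log; the function name indexing_progress is hypothetical, and it assumes the script runs where the log file is readable (for example on kb-test-way-001).

```shell
# Print the discovered/indexed counts for an indexer log and return 0
# only when at least one file was discovered and both counts are equal.
indexing_progress() {
    local log="$1" discovered indexed
    # Number of files discovered by the indexer.
    discovered=$(grep -c 'Creating object' "$log" || true)
    # Number of files indexed (same filter as: grep Received | grep arc | wc -l).
    indexed=$(grep 'Received' "$log" | grep -c 'arc' || true)
    echo "discovered=${discovered} indexed=${indexed}"
    [ "$discovered" -gt 0 ] && [ "$discovered" -eq "$indexed" ]
}

# Example call (path taken from the commands above):
#   indexing_progress "$TESTX/log/WaybackIndexerApplication0.log" && echo "indexing is done"
```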

Wait for Aggregator

After the indexer is run, wait for the aggregator to run by watching for the creation of the index file:

[devel@kb-prod-udv-001 ~]$ watch -n 10 'ssh  devel@kb-test-way-001 ls /home/devel/$TESTX/indexDir/'

until the file wayback_intermediate.index appears. This will take at most ten minutes. If you are impatient, just log onto kb-test-way-001 and in the directory $TESTX/conf kill and restart the aggregator. It will run right away.
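
As an alternative to watching the directory by hand, the wait can be expressed as a polling loop with a timeout. This is a minimal sketch; the function name wait_for_file is hypothetical, and the default of 600 seconds is simply the ten-minute upper bound mentioned above.

```shell
# Poll until a file exists or the timeout (in seconds) expires.
# Returns 0 if the file appeared, 1 on timeout.
wait_for_file() {
    local path="$1" timeout="${2:-600}" interval="${3:-10}" waited=0
    while [ ! -e "$path" ]; do
        [ "$waited" -ge "$timeout" ] && return 1
        sleep "$interval"
        waited=$((waited + interval))
    done
    return 0
}

# Example call, to be run on kb-test-way-001:
#   wait_for_file "/home/devel/$TESTX/indexDir/wayback_intermediate.index" || echo "aggregator has not run yet"
```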

Move The Index File

Move the index file to the place where wayback expects to read it. [I think this is now unnecessary - CSR]

[devel@kb-prod-udv-001 ~]$ ssh  devel@kb-test-way-001 mv /home/devel/$TESTX/indexDir/wayback_intermediate.index /home/devel/wayback_cdx/index.cdx

Browse Repository

In the proxied browser you should now be able to search and browse in the repository. The following standard domains are present in the arcfiles:

www.netarkivet.dk, www.kaarefc.dk, www.oernhoej.dk, www.pligtaflevering.dk, www.drive-badmintonklub.dk, www.dbc.dk, www.kb.dk, www.bs.dk, www.sulnudu.dk, www.kum.dk, www.trinekc.dk, www.slothchristensen.dk, www.trineogkaare.dk, www.sy-jonna.dk, www.kaareogtrine.dk, www.raeder.dk, www.statsbiblioteket.dk

In addition, the following domains are present in the warcfiles:

news.dk, , jp.dk

The warc.gz files contain a single harvest each of honda.dk, toyota.dk, mazda.dk and sa.dk from 2016-10-31. sa.dk is an example of an https site which renders badly in the current version of wayback used by Netarkivet.

Test Exclusions

  1. Use the wayback advanced search page to list all the URLs harvested from a particular domain.
  2. Choose some of them you would like to block by regular expressions.
  3. On devel@kb-test-way-001 add these regular expressions (one per line) to the file conf/wayback_regexps.txt under the wayback installation folder.
  4. On devel@kb-test-way-001, restart tomcat by executing the script bin/start_wayback.sh under the wayback installation folder.
  5. Check that the blocked URLs are no longer visible in advanced search.
  6. Check that if you try to visit one of the blocked URLs, wayback shows you a page informing you that the content has been blocked.
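
Before adding patterns in step 3, it can help to preview which of the listed URLs they would match. The following is a rough sketch using grep -E, which implements POSIX extended regular expressions; wayback may interpret the patterns with a different regex dialect and may match against a different form of the URL, so treat the output as an approximation only. The function name preview_blocked and the idea of saving the URL list to a file are assumptions.

```shell
# Print the URLs in url_file that match at least one pattern in regex_file.
# Both files hold one entry per line. Note that an empty line in regex_file
# matches every URL, so keep the pattern file free of blank lines.
preview_blocked() {
    local regex_file="$1" url_file="$2"
    grep -E -f "$regex_file" -- "$url_file"
}

# Example call, with urls.txt being a hypothetical file of URLs copied
# from the advanced search results:
#   preview_blocked conf/wayback_regexps.txt urls.txt
```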

Test NetarchiveCacheResourceStore

  1. On devel@kb-test-way-001 stop the wayback tomcat server using the stop_wayback.sh script in the bin folder of the wayback installation

  2. Edit tomcat.TEST12/webapps/ROOT/WEB-INF/CDXCollection.xml to use NetarchiveCacheResourceStore instead of NetarchiveResourceStore.
  3. Start the tomcat again.

  4. Check that you can still browse in the material.
  5. Shut down the wayback server.
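
The edit in step 2 can also be made non-interactively. The following is a minimal sketch, assuming the store is referenced in CDXCollection.xml by a name containing the literal string NetarchiveResourceStore; check the file before relying on it. The function name use_cache_resource_store is hypothetical. A backup copy is kept next to the original so the change is easy to revert.

```shell
# Replace NetarchiveResourceStore with NetarchiveCacheResourceStore in the
# given file, keeping the original as <file>.bak.
# Running it twice is harmless: the old name is not a substring of the new one.
use_cache_resource_store() {
    local xml="$1"
    cp -p "$xml" "$xml.bak" || return 1
    sed 's/NetarchiveResourceStore/NetarchiveCacheResourceStore/g' "$xml.bak" > "$xml"
}

# Example call, to be run with the wayback tomcat stopped:
#   use_cache_resource_store tomcat.TEST12/webapps/ROOT/WEB-INF/CDXCollection.xml
```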

Shut Down the Test

  1. On devel@kb-prod-udv-001 execute

    cleanup_all_test.sh
  2. If you have a background ssh port-forwarding process running a proxy to wayback then you should also kill this at this stage.