QA-1 (Secure viewerproxy, secure adm-machine, excludes, missing links, password-protected material, caching, index-generation)

Templates used by test: default_orderxml


1. Prepare Installation



## Replace version as needed
export VERSION=5.4-RC1
export H3ZIP=/home/devel/nas_versions/bundler/NetarchiveSuite-heritrix3-bundler-$
export TESTX=TEST4
export PORT=8077

Check that the GUI is available and that the System Status does not show any start-up problems.

2. Set up Apache Proxies for Adm and Acs

Login as root on


Ask or for the password.

Create a backup of the Apache proxy configuration (proxy.conf) and then edit it to reflect your assigned test PORT.

cp /etc/httpd/conf/proxy.conf ./proxy.conf.bak
nano /etc/httpd/conf/proxy.conf 

There are two VirtualHosts which need to be edited: one for adm and one for acs. The relevant lines look like

# This virtualhost
# Used in TEST4 as part of the releasetest when using PORT=8077
# normally assigned to developer svc
<VirtualHost *:8081>
        ErrorLog logs/proxy8081-error_log
        CustomLog logs/proxy8081-access_log combined
<IfModule mod_proxy.c>
        ProxyPass / http://kb-test-adm-001:8077/
        ProxyPassReverse / http://kb-test-adm-001:8077/


##### Added proxy used in releasetest TEST4
<VirtualHost *:8090>
        ErrorLog logs/proxy8090-error_log
        CustomLog logs/proxy8090-access_log combined
<IfModule mod_proxy.c>
        ProxyRequests On
        ProxyRemote *
        <Proxy *>
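
The port edit described above can also be previewed non-interactively before touching the live file. The following is a minimal sketch, not part of the official procedure: the helper name set_backend_port is hypothetical, and it assumes the backend URL has the form http://HOST:PORT/ as in the excerpt. It only writes to stdout, so the original file is left unchanged.

```shell
#!/bin/sh
# Hypothetical helper (not part of NetarchiveSuite or Apache):
# rewrite the backend port on ProxyPass/ProxyPassReverse lines that point
# at HOST, leaving every other line (including <VirtualHost *:NNNN>) as-is.
# Usage: set_backend_port FILE HOST NEW_PORT   (result goes to stdout)
# Note: HOST is used inside a regular expression, so a dot in it matches
# any character; that is acceptable for a preview but worth knowing.
set_backend_port() {
    sed -E "/^[[:space:]]*ProxyPass(Reverse)?[[:space:]]/ s#(http://$2:)[0-9]+/#\\1$3/#" "$1"
}
```

For example, `set_backend_port ./proxy.conf.bak kb-test-adm-001 "$PORT" | diff ./proxy.conf.bak -` shows which lines would change for your assigned PORT.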

Now restart the Apache server:

[root@kb-test-adm-001 ~]# /etc/rc.d/init.d/httpd restart

8081 is now the port number for the admin-gui and 8090 is the port number for the viewerproxy.

3. Set Browser Up To Use ADM Proxy

There are several ways to do this, but the following is the best. Start the Firefox profile manager with

firefox -P --no-remote

and create a new profile. Call the new profile TEST4 so you can remember what it's for in the future.

Under Edit -> Preferences -> Advanced -> Network -> Settings, set a manual HTTP proxy configuration to port 8090 with no proxy for localhost,,, .

Browse to (login test/test123) . You should see the admin GUI. You can set it as your start page for the profile you just created.

4. Set Up Harvesting of

  • Edit domain '' to use maxhops=1 in the defaultconfig, still using default_orderxml as template
  • Add ^$ to the crawlertraps for
  • Add to the seedlist for

5. Harvest

Create a selective harvest of using the definitions defined in the previous step. Wait for it to complete.

6. Browse in the Job and Start Collecting Urls

  • In the GUI, select the completed job
  • Click on "Select this job for QA with viewerproxy" and wait for indexing to complete
  • Click on "Start collecting URLs"

(If prompted for a password, enter test/test123.)

Now browse in the website, being sure to go sufficiently deep that you collect URLs for some missing pages. Also be sure to click on the link "English".

7. Stop Collecting URLs

Go back to the Viewerproxy Status webpage and click on "Stop collecting URLs" then "Show collected URLs". Your list should look something like

Note that it should include the "in-english" page and several others from The google-analytics links can be ignored.

8. Add the Collected URLs as Seeds and Re-harvest

  • Edit the default seedlist for to include the gathered URLs.
  • Define and start a new harvest, or just edit the previous harvest definition to have a next-run time of now.
  • When it is finished, browse in the new harvest as before. The added URLs should be browsable, with the exception of the "in-english" URL which is still blocked by the crawlertrap.

9. Test Authentication

  • If you saved the password in Firefox, go to Preferences -> Security -> Saved Passwords and click on "Remove All".
  • Close the browser
  • Restart the browser and browse to the GUI:
  • Enter an incorrect password and confirm that it is not accepted

10. Test Logging of Failed Login

On devel@kb-prod-udv-001:

[devel@kb-prod-udv-001 ~]$ ssh grep Mismatch /etc/httpd/logs/proxy8081-error_log's password: 
[Tue Aug 06 15:23:23 2013] [error] [client] user tlr: authentication failure for "/HarvestDefinition/": Password Mismatch

Confirm that you can see the username for the failed login attempt.
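
If the log contains many entries, the usernames can be pulled out of the matching lines rather than read by eye. This is a small sketch assuming the error-log line format shown in the sample above; the function name failed_login_users is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read Apache error-log lines on stdin and print each
# distinct username that appears in a "Password Mismatch" failure line.
# Assumes the "user NAME: authentication failure ... Password Mismatch"
# layout shown in the sample line; other layouts are silently ignored.
failed_login_users() {
    sed -n 's/.*user \([^:]*\): authentication failure.*Password Mismatch.*/\1/p' | sort -u
}
```

It is meant to be fed from the same grep as above, for example `grep Mismatch /etc/httpd/logs/proxy8081-error_log | failed_login_users`.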

11. Set Different Domains to Use Different Templates

In the Admin GUI, set the following domains to use the following order templates and max-hops settings by default (i.e. in their defaultconfig configuration):

Domain       Order template     max-hops
kaarefc.dk   default_orderxml   3
trinekc.dk   default_orderxml   4
sulnudu.dk   default_orderxml   1

12. Define a Multi-Domain Selective Harvest

Define a selective harvest for the domains,,,,and Activate it and wait for it to complete.

The harvest should generate 4 jobs - for example with job numbers 3,4,5,6. The first three domains are harvested separately, while and are harvested together, as they have the same max-hops (1).

13. Create an Index for these Jobs

Browse to the harvest history for the multi-domain selective harvest and click on "Select these jobs for QA with viewerproxy". Wait for the index to finish generating and redirect you to the "Viewerproxy Status" page.

14. Mess with a Crawl-log File to Create an Error

Log in to 

[devel@kb-test-acs-001 ~]$ cd TEST4/cache
[devel@kb-test-acs-001 cache]$ rm -rf ./fullcrawllogindex/*  ./FULL_CRAWL_LOG/*
[devel@kb-test-acs-001 cache]$ find .
./crawllog/crawllog-3-cache (etc.)

Now choose one of the jobs from the multi-harvest run - e.g. job number 5. Edit ./crawllog/crawllog-5-cache by adding the text duplicate:"foo (with no closing quotation mark) to one of the crawllog lines.
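
The edit can be made by hand in any editor. As an alternative, here is a hedged sketch that injects the broken text into one chosen line. The function name corrupt_crawllog_line is hypothetical, and placing the text at the start of the last whitespace-separated field is an assumption modelled on the sample log line shown in the next step; it is not a documented requirement.

```shell
#!/bin/sh
# Hypothetical helper: print FILE to stdout with the text duplicate:"foo,
# inserted at the start of the last whitespace-separated field of line N.
# All other lines, and the spacing inside line N, are left unchanged.
# Usage: corrupt_crawllog_line FILE N
# N is passed straight to sed as a line address, so it must be a number.
corrupt_crawllog_line() {
    sed "$2"'s/[^[:space:]]*$/duplicate:"foo,&/' "$1"
}
```

Because the result goes to stdout, redirect it to a temporary file, inspect it, and only then move it over ./crawllog/crawllog-5-cache.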

15. Regenerate the Index

Now check that logback_IndexServerApplication.xml is configured to log at DEBUG level. Restart IndexServerApplication if the log level needed to be changed.

Now browse back to the Harvest Status for the multi-job harvest and again click on "Select these jobs for QA with viewerproxy". Wait for the index to be generated. On kb-test-acs-001 execute

[devel@kb-test-acs-001 ~]$ grep Skipping TEST4/log/IndexServerApplication.log
13:45:04.093 DEBUG d.n.h.i.CDXOriginCrawlLogIterator - Skipping over bad crawl-log line '2015-06-02T11:10:17.396Z   200       4238 LEREXE image/png #004 20150602111017059+336 sha1:LOTTTOPPPPZ5KHVXZ6ATPONHIUI5HVIV - duplicate:"foo, content-size:4489'
[devel@kb-test-acs-001 ~]$ 

and confirm that the line you edited is shown as having been skipped over.

16. Check Index Caching

On kb-test-acs-001, delete a crawl log for a single harvest job:

[devel@kb-test-acs-001 ~]$ rm TEST4/cache/crawllog/crawllog-5-cache

Now regenerate the index for the multi-domain harvest in the GUI. The index isn't really regenerated, as the correct index already exists. Confirm that the file you deleted is not recreated. (It is not needed because there is a cached index for the full crawl log of the entire harvest.)

17. Check Behaviour When Metadata File is Missing

From go into (basedir in c:\bitarkiv\TEST4) or (basedirs in e:\bitarchive_1\TEST4, f:\bitarchive_2\TEST4, g:\bitarchive_3\TEST4) and find one of the metadata files generated by the multi-job harvest. Move it away.

C:\Users\ba-devel.BITARKIV>move d:\bitarkiv_1\TEST4\filedir\4-metadata-1.warc .                                                                          

If in doubt, check the file /home/devel/prepared_software/TEST4/settings/deploy_config_test.xml for locations of bitarchive folders on each application machine.

18. Remove the Previously Generated Crawl Index

[devel@kb-test-acs-001 ~]$ cd TEST4/cache/
[devel@kb-test-acs-001 cache]$ rm -rf cdxdata/*
[devel@kb-test-acs-001 cache]$ rm -rf crawllog/*
[devel@kb-test-acs-001 cache]$ rm -rf FULL_CRAWL_LOG/*
[devel@kb-test-acs-001 cache]$ rm -rf fullcrawllogindex/*

Now regenerate the index. The name of the generated index should still include the job number "4". Specifically it is of the form


consisting of the job numbers of the four jobs in the index. If there are more than 4 jobs in the index, the index is named: <job1>-<job2>-<job3>-<job4>-<checksum>.cache

For the missing job number (i.e. 4 in this case) confirm that

  • There is no cdxdata-4-cache in the directory cdxdata
  • There is no crawllog-4-cache in the crawllog directory
  • There is a file ./crawllog/crawllog-4-cache.working but it is empty
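
The three checks above can be scripted so they are easy to repeat. This is a minimal sketch using only the file names given in the list; the function name check_missing_job_cache and its messages are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: verify the cache state expected for a job whose
# metadata file was moved away. Returns 0 and prints a confirmation if all
# three conditions hold, otherwise prints the first failure and returns 1.
# Usage: check_missing_job_cache CACHE_DIR JOB_NUMBER
check_missing_job_cache() {
    cache_dir=$1
    job=$2
    working="$cache_dir/crawllog/crawllog-$job-cache.working"

    # There must be no cdxdata cache file for the job.
    if [ -e "$cache_dir/cdxdata/cdxdata-$job-cache" ]; then
        echo "unexpected file: cdxdata/cdxdata-$job-cache"; return 1
    fi
    # There must be no crawllog cache file for the job.
    if [ -e "$cache_dir/crawllog/crawllog-$job-cache" ]; then
        echo "unexpected file: crawllog/crawllog-$job-cache"; return 1
    fi
    # The .working file must exist ...
    if [ ! -f "$working" ]; then
        echo "missing file: crawllog/crawllog-$job-cache.working"; return 1
    fi
    # ... and must be empty.
    if [ -s "$working" ]; then
        echo "not empty: crawllog/crawllog-$job-cache.working"; return 1
    fi
    echo "cache state for job $job is as expected"
}
```

On kb-test-acs-001 it would be called as `check_missing_job_cache ~/TEST4/cache 4`, with the job number adjusted to the metadata file you moved.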

19. Shut Down the Test and Clean Up

On devel@kb-prod-udv-001
  1. Removed from the list of sites to be crawled in step 12. And replaced it with

  2. The harvesting of was terminated, as it took too long to harvest