QA-1 (Secure viewerproxy, secure adm-machine, excludes, missing links, password-protected material, caching, index-generation)

Templates used by test: default_order_xml

Procedure


1. Prepare Installation

On devel@kb-prod-udv-001.kb.dk:


cd prepared_software/
export VERSION=5.0-27-05-2015
export TESTX=TEST4
export PORT=8077
export MAILRECEIVERS=svc@kb.dk
all_test.sh

Check that the GUI is available and that the System Status does not show any start-up problems.


2. Set up Apache Proxies for Adm and Acs

Login as root on kb-test-adm-001.kb.dk:

ssh root@kb-test-adm-001.kb.dk

Ask csr@statsbiblioteket.dk or tlr@kb.dk for the password.

Create a backup of proxy.conf and then edit it to reflect your assigned test PORT.

cp /etc/httpd/conf/proxy.conf ./proxy.conf.bak
nano /etc/httpd/conf/proxy.conf 

There are two VirtualHosts which need to be edited: one for adm and one for acs. The relevant lines look like

# This virtualhost
# Used in TEST4 as part of the releasetest when using PORT=8077
# normally assigned to developer svc
<VirtualHost *:8081>
        ServerAdmin helpdesk@kb.dk
        ErrorLog logs/proxy8081-error_log
        CustomLog logs/proxy8081-access_log combined
<IfModule mod_proxy.c>
        ProxyPass / http://kb-test-adm-001:8077/
        ProxyPassReverse / http://kb-test-adm-001:8077/

and

#############################################
##### Added proxy used in releasetest TEST4
#############################################
<VirtualHost *:8090>
        ServerAdmin helpdesk@kb.dk
        ErrorLog logs/proxy8090-error_log
        CustomLog logs/proxy8090-access_log combined
<IfModule mod_proxy.c>
        ProxyRequests On
        ProxyRemote * http://kb-test-acs-001.kb.dk:8077
        <Proxy *>

Now restart the apache server:

[root@kb-test-adm-001 ~]# /etc/rc.d/init.d/httpd restart

8081 is now the port number for the admin-gui and 8090 is the port number for the viewerproxy.
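The two port edits above can also be scripted. A minimal sketch against a sample file, assuming the backend ports are the only four-digit :NNNN tokens on those lines (the file name and old port 8076 are illustrative, not taken from the procedure):

```shell
# Illustrative sample standing in for the relevant lines of /etc/httpd/conf/proxy.conf
# (an old PORT of 8076 is assumed here).
CONF=proxy.conf.sample
cat > "$CONF" <<'EOF'
        ProxyPass / http://kb-test-adm-001:8076/
        ProxyPassReverse / http://kb-test-adm-001:8076/
        ProxyRemote * http://kb-test-acs-001.kb.dk:8076
EOF

NEWPORT=8077
# Rewrite the backend port in ProxyPass/ProxyPassReverse (port followed by '/')...
sed -i "s|:[0-9]\{4\}/|:${NEWPORT}/|" "$CONF"
# ...and in ProxyRemote (port at end of line).
sed -i "s|:[0-9]\{4\}\$|:${NEWPORT}|" "$CONF"

# All three proxy lines should now point at the new port.
grep -c ":${NEWPORT}" "$CONF"
```

Remember that the changes only take effect after the Apache restart shown above.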

3. Set Browser Up To Use ADM Proxy

There are several ways to do this; the following is recommended. Start the Firefox profile manager with

firefox -P --no-remote

and create a new profile. Call the new profile TEST4 so you can remember what it's for in the future.

Under Edit -> Preferences -> Advanced -> Network -> Settings, set a manual HTTP proxy configuration to kb-test-adm-001.kb.dk port 8090, with no proxy for localhost, 127.0.0.1, kb-prod-udv-001.kb.dk, kb-test-adm-001.kb.dk.

Browse to http://kb-test-adm-001.kb.dk:8081/HarvestDefinition/ (login test/test123). You should see the admin GUI. You can set it as the start page for the profile you just created.

4. Set Up Harvesting of Netarkivet.dk

5. Harvest netarkivet.dk

Create a selective harvest of netarkivet.dk using the definitions defined in the previous step. Wait for it to complete.

6. Browse in the Job and Start Collecting Urls

(If prompted for a password, enter test/test123.)

Now browse the website, going sufficiently deep that you collect URLs for some missing pages. Also be sure to click on the link "English".

7. Stop Collecting URLs

Go back to the Viewerproxy Status webpage and click on "Stop collecting URLs" then "Show collected URLs". Your list should look something like

http://netarkivet.dk/?page_id=123
http://netarkivet.dk/in-english/
http://netarkivet.dk/wp-content/uploads/Retningslinjer-for-adgang-til-Netarkivet.pdf
http://netarkivet.dk/wp-content/uploads/ansoegererklaering.pdf
http://www.google-analytics.com/__utm.gif?utmwv=5.4.4&utms=1&utmn=1182267154&utmhn=netarkivet.dk&utmcs=UTF-8&utmsr=1920x1200&utmvp=1421x783&utmsc=24-bit&utmul=en-us&utmje=1&utmfl=11.2%20r202&utmdt=Netarkivet&utmhid=1151260713&utmr=-&utmp=%2F&utmht=1375792372429&utmac=UA-16233002-5&utmcc=__utma%3D71594380.2107439604.1375792372.1375792372.1375792372.1%3B%2B__utmz%3D71594380.1375792372.1.1.utmcsr%3D(direct)%7Cutmccn%3D(direct)%7Cutmcmd%3D(none)%3B&utmu=q~

Note that it should include the "in-english" page and several others from netarkivet.dk. The google-analytics links can be ignored.
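If the collected list is long, the analytics noise can be filtered out before reviewing it. A small sketch, using an illustrative sample of the list:

```shell
# Illustrative sample of a collected-URL list (urls.txt is a stand-in name).
cat > urls.txt <<'EOF'
http://netarkivet.dk/?page_id=123
http://netarkivet.dk/in-english/
http://www.google-analytics.com/__utm.gif?utmwv=5.4.4
EOF

# Keep only the URLs of interest; google-analytics lines are tracking noise.
grep -v 'google-analytics' urls.txt
```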

8. Add the Collected URLs as Seeds and Re-harvest

9. Test Authentication

10. Test Logging of Failed Login

On devel@kb-prod-udv-001:

[devel@kb-prod-udv-001 ~]$ ssh root@kb-test-adm-001.kb.dk grep Mismatch /etc/httpd/logs/proxy8081-error_log
root@kb-test-adm-001.kb.dk's password: 
[Tue Aug 06 15:23:23 2013] [error] [client 130.225.26.33] user tlr: authentication failure for "/HarvestDefinition/": Password Mismatch

Confirm that you can see the username for the failed login attempt.
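The username can also be pulled out of the log line mechanically. A sketch using the sample "Password Mismatch" line shown above:

```shell
# Sample line from the Apache error log, as shown in the grep output above.
LINE='[Tue Aug 06 15:23:23 2013] [error] [client 130.225.26.33] user tlr: authentication failure for "/HarvestDefinition/": Password Mismatch'

# Extract the username: the text between 'user ' and the following ':'.
echo "$LINE" | sed -n 's/.*user \([^:]*\):.*/\1/p'
```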

11. Set Different Domains to Use Different Templates

In the Admin GUI, set the following domains to use different order templates by default (i.e. in their defaultconfig configuration):

kaarefc.dk: default_orderxml, max-hops=3
trinekc.dk: default_orderxml, max-hops=4
sulnudu.dk: default_orderxml, max-hops=1

12. Define a Multi-Domain Selective Harvest

Define a selective harvest for the domains trinekc.dk, kaarefc.dk, sulnudu.dk, raeder.dk, and netarkivet.dk. Activate it and wait for it to complete.

The harvest should generate five jobs - for example with job numbers 3,4,5,6,7.

13. Create an Index for these Jobs

Browse to the harvest history for the multi-domain selective harvest and click on "Select these jobs for QA with viewerproxy". Wait for the index to finish generating and redirect you to the "Viewerproxy Status" page.

14. Mess with a Crawl-log File to Create an Error

Log in to devel@kb-test-acs-001.kb.dk. 

[devel@kb-test-acs-001 ~]$ cd TEST4/cache
[devel@kb-test-acs-001 cache]$ rm -rf ./fullcrawllogindex/*  ./FULL_CRAWL_LOG/*
[devel@kb-test-acs-001 cache]$ find .
.
./dedupcrawllogindex
./dedupcrawllogindex/1-cache
./dedupcrawllogindex/1-cache/segments.gen.gz
./dedupcrawllogindex/1-cache/_0.cfs.gz
./dedupcrawllogindex/1-cache/_0.si.gz
./dedupcrawllogindex/1-cache/_0.cfe.gz
./dedupcrawllogindex/1-cache/segments_1.gz
./dedupcrawllogindex/empty-cache
./dedupcrawllogindex/empty-cache/segments.gen.gz
./dedupcrawllogindex/empty-cache/segments_1.gz
./dedupcrawllogindex/1-cache.working
./dedupcrawllogindex/empty-cache.working
./fullcrawllogindex
./cdxindex
./cdxindex/empty-cache
./cdxindex/empty-cache.working
./FULL_CRAWL_LOG
./crawllog
./crawllog/crawllog-6-cache
./crawllog/crawllog-4-cache
./crawllog/crawllog-1-cache.working
./crawllog/crawllog-3-cache
./crawllog/crawllog-7-cache (etc.)

Now choose one of the jobs from the multi-harvest run - e.g. job number 5. Edit ./crawllog/crawllog-5-cache by adding the text duplicate:"foo (with no closing quotation mark) to one of the crawl-log lines.
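The edit can be done with sed instead of an editor. A sketch against a stand-in file (the file name and the sample crawl-log line are illustrative):

```shell
# Stand-in for TEST4/cache/crawllog/crawllog-5-cache, with one sample crawl-log line.
LOG=crawllog-5-cache.demo
printf '%s\n' '2015-06-02T11:10:17.396Z 200 4238 http://netarkivet.dk/ - - text/html #004 - sha1:ABCDEF - -' > "$LOG"

# Append the malformed token (opening quote, no closing quote) to the first line.
sed -i '1s/$/ duplicate:"foo/' "$LOG"

# Confirm the malformed token is now present.
grep -c 'duplicate:"foo' "$LOG"
```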

15. Regenerate the Index

Now check that logback_IndexServerApplication.xml sets netarkivet.dk to log at DEBUG level. Restart IndexServerApplication if the log level needed to be changed.

Now browse back to the Harvest Status for the multi-job harvest and again click on "Select these jobs for QA with viewerproxy". Wait for the index to be generated. On kb-test-acs-001 execute

[devel@kb-test-acs-001 ~]$ grep Skipping TEST4/log/IndexServerApplication.log
13:45:04.093 DEBUG d.n.h.i.CDXOriginCrawlLogIterator - Skipping over bad crawl-log line '2015-06-02T11:10:17.396Z   200       4238 http://twiki.org/p/pub/TWiki05x00/TopMenuSkin/menu-reverse-bg.png LEREXE http://twiki.org/ image/png #004 20150602111017059+336 sha1:LOTTTOPPPPZ5KHVXZ6ATPONHIUI5HVIV - duplicate:"foo, content-size:4489'
[devel@kb-test-acs-001 ~]$ 

and confirm that the line you edited is shown as having been skipped over.

16. Check Index Caching

On kb-test-acs-001, delete a crawl log for a single harvest job:

[devel@kb-test-acs-001 ~]$ rm TEST4/cache/crawllog/crawllog-5-cache

Now regenerate the index in the GUI. Confirm that the file you deleted is not recreated. (It is not needed because there is a cached index for the full crawl log of the entire harvest.)
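The expectation reduces to a file-existence check after reindexing. A sketch against a stand-in cache directory (all paths illustrative; job 5 is the example deleted above):

```shell
# Stand-in for the TEST4/cache layout used in the steps above.
CACHE=./cache-demo
mkdir -p "$CACHE/crawllog" "$CACHE/fullcrawllogindex"
touch "$CACHE/fullcrawllogindex/3-4-5-6-cache"   # a cached full-crawl-log index exists

rm -f "$CACHE/crawllog/crawllog-5-cache"         # the per-job crawl log was deleted

# After regenerating the index in the GUI, the per-job file should still be absent,
# because the cached full-crawl-log index makes it unnecessary.
[ ! -e "$CACHE/crawllog/crawllog-5-cache" ] && echo "crawllog-5-cache not recreated"
```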

17. Check Behaviour When Metadata File is Missing

Go into ba-devel@kb-test-bar-014.bitarkiv or ba-devel@kb-test-bar-015.bitarkiv (from kb-prod-udv-001) and find one of the metadata files generated by the multi-job harvest. Move it away.

C:\Users\ba-devel.BITARKIV>move d:\bitarkiv_1\TEST4\filedir\4-metadata-1.warc .                                                                          

If in doubt, check the "/home/devel/prepared_software/$TESTX/settings/deploy_config_database.xml" file for the locations of bitarchive folders on each application machine.

18. Remove the Previously Generated Crawl Index

[devel@kb-test-acs-001 ~]$ cd TEST4/cache/
[devel@kb-test-acs-001 cache]$ rm -rf cdxindex/*
[devel@kb-test-acs-001 cache]$ rm -rf crawllog/*
[devel@kb-test-acs-001 cache]$ rm -rf FULL_CRAWL_LOG/*
[devel@kb-test-acs-001 cache]$ rm -rf fullcrawllogindex/*

Now regenerate the index. The name of the generated index should still include the job number "4". Specifically it is of the form

./fullcrawllogindex/3-4-5-6-afaf49bfb7ea74961294d6b8b896e81f-cache

consisting of the job numbers of the first four jobs in the harvest followed by a checksum.

For the missing job number (i.e. 4 in this case) confirm that

19. Shutdown the Test and Clean Up

On devel@kb-prod-udv-001

cleanup_all_test.sh