Templates used by test: default_orderxml
1. Prepare Installation
## Replace version as needed
2. Set up Apache Proxies for Adm and Acs
Login as root on kb-test-adm-001.kb.dk:
Create a backup of httpd.conf and then edit it to reflect your assigned test PORT.
There are two VirtualHosts which need to be edited: one for adm and one for acs. The relevant lines look like
Now restart the apache server:
8081 is now the port number for the admin-gui and 8090 is the port number for the viewerproxy.
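After the restart, the two ports can be probed from a workstation as a quick sanity check. The following is a minimal sketch and not part of the original procedure: the helper name port_is_open is hypothetical, and it only confirms that something accepts TCP connections on the port, not that Apache is proxying to the right application.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host name and opens the socket.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False
```

For example, port_is_open("kb-test-adm-001.kb.dk", 8081) and port_is_open("kb-test-adm-001.kb.dk", 8090) should both return True once the proxies are up, assuming you kept the port numbers mentioned above.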
3. Set Browser Up To Use ADM Proxy
There are several ways to do this, but the following is the best. Start the Firefox profile manager with
and create a new profile. Call the new profile TEST4 so you can remember what it's for in the future.
Under Edit -> Preferences -> Advanced -> Network -> Settings, set a manual HTTP proxy configuration to kb-test-adm-001.kb.dk port 8090, with no proxy for
localhost, 127.0.0.1, kb-prod-udv-001.kb.dk, kb-test-adm-001.kb.dk
Browse to http://kb-test-adm-001.kb.dk:8081/HarvestDefinition/ (login test/test123). You should see the admin GUI. You can set it as your start page for the profile you just created.
4. Set Up Harvesting of Netarkivet.dk
- Edit domain 'netarkivet.dk' to use maxhops=1 in the defaultconfig, still using default_orderxml as template
- Add ^http://netarkivet.dk/in-english/$ to the crawlertraps for netarkivet.dk.
to the seedlist for netarkivet.dk.
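To see what the crawler trap above does and does not block, the expression can be tried against a few URLs. This is a minimal sketch using Python's re module; the harvester evaluates crawler traps with its own (Java) regular expression engine, and full-string matching is assumed here, so treat it as an illustration only. The helper name is_trapped is hypothetical.

```python
import re

# The crawler trap from the step above, copied verbatim.
TRAP = r"^http://netarkivet.dk/in-english/$"


def is_trapped(url: str, trap: str = TRAP) -> bool:
    """Return True if the whole URL matches the crawler-trap expression."""
    return re.fullmatch(trap, url) is not None


if __name__ == "__main__":
    for url in (
        "http://netarkivet.dk/in-english/",       # the page the trap targets
        "http://netarkivet.dk/",                  # front page, not trapped
        "http://netarkivet.dk/in-english/more/",  # deeper path, not trapped
    ):
        print(url, "->", "blocked" if is_trapped(url) else "allowed")
```

Because the expression is anchored with ^ and $, only the exact "in-english" URL is blocked, which is why that page is expected to stay missing later in the test. Note that the dots are unescaped, so each one matches any single character.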
5. Harvest netarkivet.dk
Create a selective harvest of netarkivet.dk using the definitions defined in the previous step. Wait for it to complete.
6. Browse in the Job and Start Collecting Urls
- In the GUI, select the completed job
- Click on "Select this job for QA with viewerproxy" and wait for indexing to complete
- Click on "Start collecting URLs"
(If prompted for a password, enter test/test123.)
Now browse in the http://netarkivet.dk website, being sure to go sufficiently deep that you collect URLs for some missing pages. Also be sure to click on the link "English".
7. Stop Collecting URLs
Go back to the Viewerproxy Status webpage and click on "Stop collecting URLs" then "Show collected URLs". Your list should look something like
Note that it should include the "in-english" page and several others from netarkivet.dk. The google-analytics links can be ignored.
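If the collected list is long, the netarkivet.dk entries can be separated from the google-analytics noise with a few lines of code. The sketch below is only an illustration: the helper name relevant_urls is hypothetical, and it assumes you have pasted the collected URLs into a Python list of strings.

```python
from urllib.parse import urlsplit


def relevant_urls(collected, domain="netarkivet.dk"):
    """Keep URLs on the given domain (or a subdomain), in order, without duplicates."""
    kept = []
    seen = set()
    for raw in collected:
        url = raw.strip()
        host = (urlsplit(url).hostname or "").lower()
        on_domain = host == domain or host.endswith("." + domain)
        if on_domain and url not in seen:
            seen.add(url)
            kept.append(url)
    return kept
```

The filtered list should still contain the "in-english" URL; if it does not, the browsing step did not collect it.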
8. Add the Collected URLs as Seeds and Re-harvest
- Edit the default seedlist for netarkivet.dk to include the gathered URLs.
- Define and start a new harvest, or just edit the previous harvest definition to have a next-run time of now.
- When it is finished, browse in the new harvest as before. The added URLs should be browsable, with the exception of the "in-english" URL which is still blocked by the crawlertrap.
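Merging the gathered URLs into the seed list by hand is error-prone, so the new list can be prepared before pasting it into the GUI. This is a minimal sketch under the assumption that the seed list is plain text with one URL per line; the helper name merge_seeds is hypothetical.

```python
from typing import Iterable


def merge_seeds(existing: str, collected: Iterable[str]) -> str:
    """Append collected URLs to an existing seed list, skipping blanks and duplicates."""
    seeds = []
    seen = set()
    for line in existing.splitlines() + list(collected):
        url = line.strip()
        if url and url not in seen:
            seen.add(url)
            seeds.append(url)
    # One seed per line, existing seeds first, in their original order.
    return "\n".join(seeds)
```

The returned text can be pasted over the contents of the default seedlist for netarkivet.dk.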
9. Test Authentication
- If you saved the password in Firefox, go to Preferences -> Security -> Saved Passwords and click on "Remove All".
- Close the browser
- Restart the browser and browse to the GUI: http://kb-test-adm-001.kb.dk:8081/HarvestDefinition/
- Enter an incorrect password and confirm that it is not accepted
10. Test Logging of Failed Login
Confirm that you can see the username for the failed login attempt.
11. Set Different Domains to Use Different Templates
In the Admin GUI, set the following domains to use different order templates by default (i.e. in their defaultconfig configuration):
12. Define a Multi-Domain Selective Harvest
Define a selective harvest for the domains
raeder.dk, . Activate it and wait for it to complete.
The harvest should generate 4 jobs - for example with job numbers 3, 4, 5, 6. The first three domains are harvested separately, while sulnudu.dk and netarkivet.dk are harvested together, as they have the same max-hops (1).
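The expected job count can be reasoned about as a grouping problem. The sketch below is an illustration only and not the scheduler's actual algorithm: it assumes that domains end up in the same job only when their default configurations agree on order template and max-hops. The template names and the three placeholder domain names are hypothetical; only sulnudu.dk and netarkivet.dk come from the text above.

```python
from collections import defaultdict


def group_into_jobs(configs):
    """Group domains by their (template, max_hops) pair; one group per job."""
    jobs = defaultdict(list)
    for domain, key in configs.items():
        jobs[key].append(domain)
    return list(jobs.values())


if __name__ == "__main__":
    # Hypothetical configuration values, for illustration only.
    configs = {
        "domain-a.example": ("template_a", 10),
        "domain-b.example": ("template_b", 10),
        "domain-c.example": ("template_c", 10),
        "sulnudu.dk": ("shared_template", 1),
        "netarkivet.dk": ("shared_template", 1),
    }
    for job in group_into_jobs(configs):
        print(job)
```

With five domains of which two share a configuration, the grouping yields four jobs, matching the count expected in this step.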
13. Create an Index for these Jobs
Browse to the harvest history for the multi-domain selective harvest and click on "Select these jobs for QA with viewerproxy". Wait for the index to finish generating and redirect you to the "Viewerproxy Status" page.
14. Mess with a Crawl-log File to Create an Error
Log in to firstname.lastname@example.org.
Now choose one of the jobs from the multi-harvest run - e.g. job number 5. Edit ./crawllog/crawllog-5-cache by adding the text duplicate:"foo with no closing parenthesis to one of the crawllog lines.
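The edit can be done in any text editor; if you prefer to script it, the following sketch appends the text to one line of the file. It assumes the cached crawl log is an uncompressed text file and that appending at the end of a line is an acceptable way of "adding the text"; the helper name corrupt_line is hypothetical. Make a copy of the file first if you want to restore it later.

```python
from pathlib import Path


def corrupt_line(path, line_number=0, text=' duplicate:"foo'):
    """Append text to one line (0-based) of a text file and return the new line."""
    p = Path(path)
    lines = p.read_text(encoding="utf-8").splitlines()
    lines[line_number] = lines[line_number] + text
    # Write the file back with one line per entry and a trailing newline.
    p.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines[line_number]
```

For example, corrupt_line("./crawllog/crawllog-5-cache", 2) changes the third line of the job 5 cache file. Note which line you edited, since the next step checks that it is reported as skipped.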
15. Regenerate the Index
Now check that logback_IndexServerApplication.xml sets netarkivet.dk to log at DEBUG level. Restart IndexServerApplication if the log level had to be changed.
Now browse back to the Harvest Status for the multi-job harvest and again click on "Select these jobs for QA with viewerproxy". Wait for the index to be generated. On kb-test-acs-001 execute
and confirm that the line you edited is shown as having been skipped over.
16. Check Index Caching
On kb-test-acs-001, delete a crawl log for a single harvest job:
Now regenerate the index for the multi-domain harvest in the GUI. The index isn't really regenerated, as the correct index already exists. Confirm that the file you deleted is not recreated. (It is not needed because there is a cached index for the full crawl log of the entire harvest.)
17. Check Behaviour When Metadata File is Missing
From email@example.com go into ba-devel@KB-test-bar-01.bitarkiv.kb.dk (basedir in c:\bitarkiv\TEST4) or ba-devel@KB-TEST-BAR-016.bitarkiv.kb.dk (basedirs in e:\bitarchive_1\TEST4, f:\bitarchive_2\TEST4, g:\bitarchive_3\TEST4) and find one of the metadata files generated by the multi-job harvest. Move it away.
If in doubt, check the file /home/devel/prepared_software/TEST4/settings/deploy_config_test.xml for locations of bitarchive folders on each application machine.
18. Remove the Previously Generated Crawl Index
Now regenerate the index. The name of the generated index should still include the job number "4". Specifically it is of the form
consisting of the job numbers of the four jobs in the index. If there are more than 4 jobs in the index, the index is named <job1>-<job2>-<job3>-<job4>-<checksum>.cache
For the missing job number (i.e. 4 in this case) confirm that
- There is no cdxdata-4-cache in the directory cdxdata
- There is no crawllog-4-cache in the crawllog directory
- There is a file ./crawllog/crawllog-4-cache.working but it is empty
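The three conditions above can be checked in one go. This is a minimal sketch, assuming the cdxdata and crawllog directories sit side by side under a common cache directory; the helper name check_missing_job and the cache_root argument are hypothetical.

```python
from pathlib import Path


def check_missing_job(cache_root, job):
    """Report the three expected conditions for a job whose metadata file is missing."""
    root = Path(cache_root)
    working = root / "crawllog" / f"crawllog-{job}-cache.working"
    return {
        "no_cdxdata_cache": not (root / "cdxdata" / f"cdxdata-{job}-cache").exists(),
        "no_crawllog_cache": not (root / "crawllog" / f"crawllog-{job}-cache").exists(),
        # The .working file must exist and contain zero bytes.
        "empty_working_file": working.is_file() and working.stat().st_size == 0,
    }
```

Running check_missing_job(".", 4) from the cache directory should return True for all three keys if the index generation behaved as described.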
19. Shutdown the Test and Clean Up