Prerequisites

This test requires restart of infrastructure components (database and network). These steps must be coordinated with the other testers.

This test covers: resubmitting jobs after a restart, restarting failed jobs, uploading old files when a harvester restarts, and checking that the scheduler skips old jobs.

 

Uses the Heritrix3 template default_orderxml.

Install and Start System

On devel@kb-prod-udv-001.kb.dk:

 

export TESTX=TEST6
export PORT=807?
export MAILRECEIVERS=foo@bar.dk
export VERSION=????????????????
all_test.sh
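
Because PORT and VERSION are shown above with "?" placeholders, it can help to check the variables before starting the installation. The following is a minimal sketch, not part of all_test.sh; the function name check_test_env is hypothetical, and it only assumes the four variable names exported above.

```shell
# Hypothetical pre-flight check; not part of all_test.sh.
# Reports every variable that is unset, empty, or still contains a "?" placeholder.
check_test_env() {
    missing=0
    for name in TESTX PORT MAILRECEIVERS VERSION; do
        # Indirect lookup of the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        case "$value" in
            "")   echo "unset: $name"; missing=1 ;;
            *\?*) echo "placeholder: $name"; missing=1 ;;
        esac
    done
    # Returns 0 when all variables look filled in, 1 otherwise.
    return "$missing"
}
```

A possible use is `check_test_env && all_test.sh`, so that the installation only starts when every value has been filled in.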

Check that the GUI is available and that the System Status does not show any startup problems.

Start a selective harvest

Start an hourly selective harvest for the 'netarkivet.dk' domain.

Create a new template:

  • Download the template "default_orderxml" 
  • Edit the template so that max-size-bytes is 5000 in the WARCWriterProcessor:
    • In the overrides section

      metadata.jobName=default_orderxml_smallwarcs

      metadata.description=Default Profile generating small warc-files (5000 bytes)

      warcWriter.maxFileSizeBytes = 5000
      disposition.maxPerHostBandwidthUsageKbSec=30
  • Upload the template with a new name, for example "default_orderxml_smallwarcs"

Modify domain templates 

  • Configure the defaultconfig for kum.dk to use template 'default_orderxml_smallwarcs'.
  • Configure the defaultconfig for dbc.dk to use template 'default_orderxml', and max-hops=0.
  • Configure the defaultconfig for bs.dk to use template 'default_orderxml', and max-hops=10.

Make a new snapshot harvest definition with a name you can remember

  • Create a new snapshot harvest and set 'Max number of bytes per domain' to 1.000.000 bytes (1 MB).
  • Check that the job is started correctly under 'Harvest status' -> 'All Jobs' in the left menu and that no errors or warnings are present in the system overview.

Stop the Test Automatically During Upload

  • Using the GUI, find the job number and the name of the harvest machine for the job in which kum.dk is being harvested.
  • Log on to the Heritrix3 GUI on the machine harvesting kum.dk (use the link under running jobs), open the page for the running job, and click pause. Keep the job paused until the next two steps are done.
  • Download the attached script and modify it to point at the correct harvester and job number.
  • Copy the script to kb-prod-udv-001.kb.dk and run it. It monitors the "warcs" directory, and as soon as the first warcfile is uploaded it detects that uploading has started and shuts down the test instance.
  • Log on to the Heritrix3 GUI and unpause the job (no explicit logout is necessary).
  • Wait for the job to complete, after which the TEST6 instance is stopped, starting with the apps on the machine harvesting kum.dk.
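
The attached script is not reproduced on this page. As an illustration of the polling pattern it relies on, here is a minimal sketch of a loop that watches a directory until a WARC file shows up. The function name wait_for_first_warc, its arguments, and the idea of checking only for the presence of a file are assumptions; how the real script decides that uploading has started, and how it reaches the harvester and stops the instance, is not described here.

```shell
# Hypothetical stand-in for the polling part of the attached script.
# Usage: wait_for_first_warc DIR [INTERVAL_SECONDS] [MAX_POLLS]
# Prints the first WARC file found and returns 0; returns 1 if MAX_POLLS is reached.
wait_for_first_warc() {
    dir=$1
    interval=${2:-5}
    max_polls=${3:-0}   # 0 means poll until a file appears
    polls=0
    while :; do
        # Look for both plain and gzipped WARC files.
        for f in "$dir"/*.warc "$dir"/*.warc.gz; do
            if [ -e "$f" ]; then
                echo "$f"
                return 0
            fi
        done
        polls=$((polls + 1))
        if [ "$max_polls" -gt 0 ] && [ "$polls" -ge "$max_polls" ]; then
            return 1
        fi
        sleep "$interval"
    done
}
```

The caller would follow a successful return with whatever command stops the test instance; that command is deliberately left out because it is specific to the installation.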

Save the Metadata Warcfile

  • Log into the harvester where kum.dk was being harvested
  • Find the crawldir in TEST6/harvester_low
  • Find the metadata warcfile in the metadata subdirectory and copy it to TEST6/

Create a Fake Crawl Dir

 

ssh netarkdv@sb-test-har-001.statsbiblioteket.dk
cd TEST6/harvester_high 
cp -r ~netarkdv/testdata-h3/TEST6/23-fakejobdir .
mkdir 23-fakejobdir/heritrix3/jobs/23-fakejobdir/logs
touch 23-fakejobdir/heritrix3/jobs/23-fakejobdir/logs/crawl.log
touch 23-fakejobdir/heritrix3/jobs/23-fakejobdir/logs/progress-statistics.log

 

Wait 3 Hours then Restart the System

Verify the restarted system on devel@kb-test-adm-001:

  1. Check the log for warnings and errors.

    cd /home/devel/$TESTX/log/
    grep ERROR *.log | grep -v COMMON_ERROR
    grep WARN *.log

    The following entries are normal: 

    arcrepositoryapplication0.log.0:WARNING: AdminDataFile (./admin.data) was not found.
    guiapplication0.log.0:WARNING: Refusing to schedule harvest definition 'netarkivet' in the past. Skipped 18 events. Old nextDate was Mon Dec 18 14:29:30 CET 2006 new nextDate is Tue Dec 19 09:29:30 CET 2006
    GUIApplication0.log.0:WARNING: Job 2 failed: HarvestErrors = dk.netarkivet.common.exceptions.IOFailure: Crawl probably interrupted by shutdown of HarvestController

    The following warning may occur after a while: 

    WARNING: Error processing message '
    Class:                  com.sun.messaging.jmq.jmsclient.ObjectMessageImpl
    getJMSMessageID():      ID:40-130.225.27.140(d2:1:3:b1:10:de)-46478-1197902260630
    getJMSTimestamp():      1197902260630
    getJMSCorrelationID():  null
    JMSReplyTo:             null
    JMSDestination:         TEST6_COMMON_THE_SCHED
    getJMSDeliveryMode():   PERSISTENT
    getJMSRedelivered():    false
    getJMSType():           null
    getJMSExpiration():     0
    getJMSPriority():       4
    Properties:             null'
    dk.netarkivet.common.exceptions.UnknownID: Job id 23 is not known in persistent storage
            at dk.netarkivet.harvester.datamodel.JobDBDAO.read(JobDBDAO.java:294)
            at dk.netarkivet.harvester.scheduler.HarvestSchedulerMonitorServer.processCrawlStatusMessage(HarvestSchedulerMonitorServer.java:103)
            at dk.netarkivet.harvester.scheduler.HarvestSchedulerMonitorServer.visit(HarvestSchedulerMonitorServer.java:285)
            at dk.netarkivet.harvester.harvesting.distribute.CrawlStatusMessage.accept(CrawlStatusMessage.java:133)
            at dk.netarkivet.harvester.distribute.HarvesterMessageHandler.onMessage(HarvesterMessageHandler.java:67)
            at com.sun.messaging.jmq.jmsclient.MessageConsumerImpl.deliverAndAcknowledge(MessageConsumerImpl.java:330)
            at com.sun.messaging.jmq.jmsclient.MessageConsumerImpl.onMessage(MessageConsumerImpl.java:265)
            at com.sun.messaging.jmq.jmsclient.SessionReader.deliver(SessionReader.java:102)
            at com.sun.messaging.jmq.jmsclient.ConsumerReader.run(ConsumerReader.java:174)
            at java.lang.Thread.run(Thread.java:595)
  2. Go to the system overview page and check that all the expected applications are listening and are up without warnings or errors.
  3. Check that the scheduler schedules only one job for the hourly selective harvest.
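
When reviewing the grep output from step 1, the entries listed there as normal can be filtered out so that only unexpected warnings remain. The sketch below is one way to do that; the function name filter_expected_warnings is hypothetical, and the three fixed strings are taken from the normal entries quoted in step 1. It does not cover the multi-line JMS warning, which still has to be checked by eye.

```shell
# Hypothetical helper: reads log lines on stdin and drops the warnings
# that step 1 lists as normal. -F treats each pattern as a fixed string.
filter_expected_warnings() {
    grep -v -F \
        -e 'AdminDataFile (./admin.data) was not found' \
        -e 'Refusing to schedule harvest definition' \
        -e 'Crawl probably interrupted by shutdown of HarvestController'
}
```

For example, `grep WARN *.log | filter_expected_warnings` run in the log directory prints only the warnings that are not on the list. Note that grep exits with status 1 when nothing is left to print, which here means no unexpected warnings were found.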

Check that a job can be resubmitted

  1. Check that you can reject a job for resubmission using the "Reject?" button so that it is no longer visible when you list failed jobs.
  2. Check that you can see the rejected job when you now list all jobs.
  3. Click on one or more "Genstart"/"Resubmit" buttons. Note that you can only resubmit jobs that failed due to harvesting errors, not due to upload errors.
  4. Check that the job status changes to "resubmitted" and that a new job is made from the same harvest definition with the same configurations.
  5. Check that resubmitted jobs contain information about which job they were resubmitted from (NAS-1466).

Check Report Generation

Use a browser set up as a viewerproxy connection for this test. Select any completed job and click on the "Browse reports for jobs" link.

You should see a list like:

metadata://netarkivet.dk/crawl/setup/duplicatereductionjobs?majorversion=1&minorversion=0&harvestid=1&harvestnum=10&jobid=14
metadata://netarkivet.dk/crawl/setup/crawler-beans.cxml?heritrixVersion=3.3.0-LBS-2016-02&harvestid=1&jobid=14
metadata://netarkivet.dk/crawl/setup/harvestInfo.xml?heritrixVersion=3.3.0-LBS-2016-02&harvestid=1&jobid=14
metadata://netarkivet.dk/crawl/setup/seeds.txt?heritrixVersion=3.3.0-LBS-2016-02&harvestid=1&jobid=14
metadata://netarkivet.dk/crawl/reports/archivefiles-report.txt?heritrixVersion=3.3.0-LBS-2016-02&harvestid=1&jobid=14
metadata://netarkivet.dk/crawl/reports/crawl-report.txt?heritrixVersion=3.3.0-LBS-2016-02&harvestid=1&jobid=14
metadata://netarkivet.dk/crawl/reports/frontier-summary-report.txt?heritrixVersion=3.3.0-LBS-2016-02&harvestid=1&jobid=14
metadata://netarkivet.dk/crawl/reports/hosts-report.txt?heritrixVersion=3.3.0-LBS-2016-02&harvestid=1&jobid=14
metadata://netarkivet.dk/crawl/reports/mimetype-report.txt?heritrixVersion=3.3.0-LBS-2016-02&harvestid=1&jobid=14
metadata://netarkivet.dk/crawl/reports/processors-report.txt?heritrixVersion=3.3.0-LBS-2016-02&harvestid=1&jobid=14
metadata://netarkivet.dk/crawl/reports/responsecode-report.txt?heritrixVersion=3.3.0-LBS-2016-02&harvestid=1&jobid=14
metadata://netarkivet.dk/crawl/reports/seeds-report.txt?heritrixVersion=3.3.0-LBS-2016-02&harvestid=1&jobid=14
metadata://netarkivet.dk/crawl/reports/source-report.txt?heritrixVersion=3.3.0-LBS-2016-02&harvestid=1&jobid=14
metadata://netarkivet.dk/crawl/reports/threads-report.txt?heritrixVersion=3.3.0-LBS-2016-02&harvestid=1&jobid=14
metadata://netarkivet.dk/crawl/logs/alerts.log?heritrixVersion=3.3.0-LBS-2016-02&harvestid=1&jobid=14
metadata://netarkivet.dk/crawl/logs/crawl.log?heritrixVersion=3.3.0-LBS-2016-02&harvestid=1&jobid=14
metadata://netarkivet.dk/crawl/logs/heritrix3_err.log?heritrixVersion=3.3.0-LBS-2016-02&harvestid=1&jobid=14
metadata://netarkivet.dk/crawl/logs/heritrix3_out.log?heritrixVersion=3.3.0-LBS-2016-02&harvestid=1&jobid=14
metadata://netarkivet.dk/crawl/logs/heritrix_out.log?heritrixVersion=3.3.0-LBS-2016-02&harvestid=1&jobid=14
metadata://netarkivet.dk/crawl/logs/job.log?heritrixVersion=3.3.0-LBS-2016-02&harvestid=1&jobid=14
metadata://netarkivet.dk/crawl/logs/nonfatal-errors.log?heritrixVersion=3.3.0-LBS-2016-02&harvestid=1&jobid=14
metadata://netarkivet.dk/crawl/logs/progress-statistics.log?heritrixVersion=3.3.0-LBS-2016-02&harvestid=1&jobid=14
metadata://netarkivet.dk/crawl/logs/runtime-errors.log?heritrixVersion=3.3.0-LBS-2016-02&harvestid=1&jobid=14
metadata://netarkivet.dk/crawl/logs/scope.log?heritrixVersion=3.3.0-LBS-2016-02&harvestid=1&jobid=14
metadata://netarkivet.dk/crawl/logs/uri-errors.log?heritrixVersion=3.3.0-LBS-2016-02&harvestid=1&jobid=14
metadata://netarkivet.dk/crawl/index/cdx?majorversion=2&minorversion=0&harvestid=1&jobid=14&filename=14-1-20161101215537865-00000-ciblee_2015_sb-test-har-001.statsbiblioteket.dk.warc

Check that all the entries are present and browse each in turn. (Note that the heritrixVersion, harvestid, and jobid values will differ.) Some of the entries might be empty.
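
If the list shown in the browser is copied into a text file, the presence check can be done mechanically. The following is a minimal sketch under that assumption; the function name missing_reports is hypothetical, and it only compares the file-name part of each metadata:// URL, so it works regardless of the version and id values.

```shell
# Hypothetical helper: reads metadata:// URLs on stdin and prints every
# expected entry name (given as arguments) that does not occur in the list.
missing_reports() {
    listing=$(cat)
    for name in "$@"; do
        # An entry counts as present if "/<name>?" occurs in some URL.
        if ! printf '%s\n' "$listing" | grep -q -F "/$name?"; then
            echo "$name"
        fi
    done
}
```

For example, `missing_reports crawl.log seeds.txt hosts-report.txt < listing.txt` prints nothing when all three entries are present, where listing.txt is the saved copy of the list.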

Database crash test

Tests that the system can survive a database crash/stop and resume operation after the database is restarted.

Network recovery test

Tests that the system can survive a network crash/stop and resume operation after the network becomes available again.

Shut down the system