[NAS-2686] Always create a metadata-warcfile even if Heritrix3 doesn't create any (w)arc files Created: 23/Nov/17  Updated: 25/Apr/18  Resolved: 02/Feb/18

Status: Closed
Project: NetarchiveSuite
Component/s: Harvester Controller Server
Affects Version/s: 5.2.2, 5.3, 5.3.1
Fix Version/s: 5.2.3, 5.4

Type: New Feature Priority: Minor
Reporter: Søren Vejrup Carlsen (Inactive) Assignee: Søren Vejrup Carlsen (Inactive)
Resolution: Fixed  
Labels: None
Remaining Estimate: Not Specified
Time Spent: 17m
Original Estimate: Not Specified

Issue Links:
Related
related to WEBDAN-282 NetarchiveSuite shouldn't fail the po... Resolved
Sprint: NAS 5.4
Verification:

To test:

  1. Upload a bad template (e.g. invalid xml)
  2. Create a harvest using the template
  3. Wait for the job to fail
  4. Check that a metadata file is created and uploaded. Browse the file and look for e.g. heritrix output complaining about the bad xml syntax.

 Description   

You typically receive two mail notifications whenever a harvest fails:
1)

Host: narcana-webdanica01.statsbiblioteket.dk
Date: Thu Nov 23 06:51:04 CET 2017
dk.netarkivet.harvester.heritrix3.PostProcessing.storeFiles(PostProcessing.java:269)
Probable error in Heritrix job setup. No arcfiles or warcfiles generated by Heritrix for job 1204

2)

Host: narcana-webdanica01.statsbiblioteket.dk
Date: Thu Nov 23 06:51:04 CET 2017
dk.netarkivet.harvester.heritrix3.PostProcessing.doPostProcessing(PostProcessing.java:165)
Trouble during postprocessing of files in '/opt/webdanica/WEBDANICA/harvester_focused/1204_1511416193560'. Errors accumulated during the postprocessing: Metadata file /opt/webdanica/WEBDANICA/harvester_focused/1204_1511416193560/metadata/1204-metadata-1.warc does not exist

dk.netarkivet.common.exceptions.IllegalState: Metadata file /opt/webdanica/WEBDANICA/harvester_focused/1204_1511416193560/metadata/1204-metadata-1.warc does not exist
        at dk.netarkivet.harvester.heritrix3.IngestableFiles.getMetadataArcFiles(IngestableFiles.java:183)
        at dk.netarkivet.harvester.heritrix3.PostProcessing.storeFiles(PostProcessing.java:281)
        at dk.netarkivet.harvester.heritrix3.PostProcessing.doPostProcessing(PostProcessing.java:159)
        at dk.netarkivet.harvester.heritrix3.HarvestControllerServer$HarvesterThread.run(HarvestControllerServer.java:457)

The problem is that if Heritrix3 doesn't create any (w)arc files, no metadata-warc is created.
One really should be created, as Heritrix still writes its reports, and they contain valuable information.



 Comments   
Comment by Søren Vejrup Carlsen (Inactive) [ 10/Jan/18 ]

https://sbforge.org/fisheye/changelog/NetarchiveSuite-Github?cs=057523ef5f40c4c41b3a69eb4ef560b602bb9442

Comment by Søren Vejrup Carlsen (Inactive) [ 23/Nov/17 ]

A valid fix should simply be to change

if (cdxGenerationSucceeded) {
    // This indicates, that either the files in the arcsdir or in the warcsdir
    // have now been CDX-processed.
    //
    // TODO refactor, as this call has too many sideeffects
    ingestables.setMetadataGenerationSucceeded(true);
} else {
    log.warn("Found no archive directory with ARC og WARC files. Looked for dirs '{}' and '{}'.",
            arcFilesDir.getAbsolutePath(), warcFilesDir.getAbsolutePath());
}
 

to

if (!cdxGenerationSucceeded) {
    log.warn("Found no archive directory with ARC og WARC files. Looked for dirs '{}' and '{}'.",
            arcFilesDir.getAbsolutePath(), warcFilesDir.getAbsolutePath());
}
ingestables.setMetadataGenerationSucceeded(true);

Comment by Søren Vejrup Carlsen (Inactive) [ 23/Nov/17 ]

This code in HarvestDocumentation.documentHarvest() is to blame

boolean cdxGenerationSucceeded = false;

// Try to create CDXes over ARC and WARC files.
File arcFilesDir = ingestables.getArcsDir();
File warcFilesDir = ingestables.getWarcsDir();

if (arcFilesDir.isDirectory() && FileUtils.hasFiles(arcFilesDir)) {
    addCDXes(ingestables, arcFilesDir, mdfw, ArchiveProfile.ARC_PROFILE);
    cdxGenerationSucceeded = true;
}
if (warcFilesDir.isDirectory() && FileUtils.hasFiles(warcFilesDir)) {
    addCDXes(ingestables, warcFilesDir, mdfw, ArchiveProfile.WARC_PROFILE);
    cdxGenerationSucceeded = true;
}

if (cdxGenerationSucceeded) {
    // This indicates, that either the files in the arcsdir or in the warcsdir
    // have now been CDX-processed.
    //
    // TODO refactor, as this call has too many sideeffects
    ingestables.setMetadataGenerationSucceeded(true);
} else {
    log.warn("Found no archive directory with ARC og WARC files. Looked for dirs '{}' and '{}'.",
            arcFilesDir.getAbsolutePath(), warcFilesDir.getAbsolutePath());
}

If no arcs or warcs are found, cdxGenerationSucceeded is still false at the end, and then

 ingestables.setMetadataGenerationSucceeded(true);

is never executed, so the file JOBID-metadata-1.warc.open is never closed and renamed to JOBID-metadata-1.warc
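
The difference between the two control flows can be illustrated with a small stand-alone sketch. It does not use the real NetarchiveSuite classes: IngestablesSketch and DocumentHarvestSketch are hypothetical stand-ins, and the only behaviour modelled is the assumption stated above, that setMetadataGenerationSucceeded(true) is what closes and renames the .open file.

```java
/** Hypothetical stand-in for IngestableFiles; not the real NetarchiveSuite class. */
class IngestablesSketch {
    private boolean metadataClosed = false;

    /** Models the side effect described in this issue: closing and renaming the .open file. */
    void setMetadataGenerationSucceeded(boolean succeeded) {
        if (succeeded) {
            metadataClosed = true;
        }
    }

    boolean isMetadataClosed() {
        return metadataClosed;
    }
}

/** Hypothetical reduction of documentHarvest() to the branch under discussion. */
class DocumentHarvestSketch {

    /** Original flow: the metadata file is finalized only if CDX generation ran. */
    static boolean originalFlow(boolean hasArcFiles, boolean hasWarcFiles) {
        IngestablesSketch ingestables = new IngestablesSketch();
        boolean cdxGenerationSucceeded = hasArcFiles || hasWarcFiles;
        if (cdxGenerationSucceeded) {
            ingestables.setMetadataGenerationSucceeded(true);
        }
        // In the else branch the real code only logs a warning.
        return ingestables.isMetadataClosed();
    }

    /** Proposed flow: warn when nothing was harvested, but always finalize. */
    static boolean proposedFlow(boolean hasArcFiles, boolean hasWarcFiles) {
        IngestablesSketch ingestables = new IngestablesSketch();
        boolean cdxGenerationSucceeded = hasArcFiles || hasWarcFiles;
        if (!cdxGenerationSucceeded) {
            System.out.println("WARN: found no ARC or WARC files");
        }
        ingestables.setMetadataGenerationSucceeded(true);
        return ingestables.isMetadataClosed();
    }

    public static void main(String[] args) {
        // A job like 1204: Heritrix produced neither ARC nor WARC files.
        System.out.println("original closes metadata file: " + originalFlow(false, false));
        System.out.println("proposed closes metadata file: " + proposedFlow(false, false));
    }
}
```

With no archive files, the original flow leaves the metadata file open, while the proposed flow closes it after logging the warning. Both flows behave the same when at least one ARC or WARC file exists.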

Comment by Søren Vejrup Carlsen (Inactive) [ 23/Nov/17 ]

In the case of job 1204, an oldjobs/1204_1511416193560/tmp-meta/1204-metadata-1.warc.open file exists with the available h3 reports already added:

WARC-Target-URI: metadata://netarkivet.dk/crawl/setup/crawler-beans.cxml?heritrixVersion=3.3.0-LBS-2016-02&harvestid=1232&jobid=1204
WARC-Target-URI: metadata://netarkivet.dk/crawl/setup/harvestInfo.xml?heritrixVersion=3.3.0-LBS-2016-02&harvestid=1232&jobid=1204
WARC-Target-URI: metadata://netarkivet.dk/crawl/setup/seeds.txt?heritrixVersion=3.3.0-LBS-2016-02&harvestid=1232&jobid=1204
WARC-Target-URI: metadata://netarkivet.dk/crawl/reports/archivefiles-report.txt?heritrixVersion=3.3.0-LBS-2016-02&harvestid=1232&jobid=1204
WARC-Target-URI: metadata://netarkivet.dk/crawl/reports/crawl-report.txt?heritrixVersion=3.3.0-LBS-2016-02&harvestid=1232&jobid=1204
WARC-Target-URI: metadata://netarkivet.dk/crawl/reports/frontier-summary-report.txt?heritrixVersion=3.3.0-LBS-2016-02&harvestid=1232&jobid=1204
WARC-Target-URI: metadata://netarkivet.dk/crawl/reports/hosts-report.txt?heritrixVersion=3.3.0-LBS-2016-02&harvestid=1232&jobid=1204
WARC-Target-URI: metadata://netarkivet.dk/crawl/reports/mimetype-report.txt?heritrixVersion=3.3.0-LBS-2016-02&harvestid=1232&jobid=1204
WARC-Target-URI: metadata://netarkivet.dk/crawl/reports/processors-report.txt?heritrixVersion=3.3.0-LBS-2016-02&harvestid=1232&jobid=1204
WARC-Target-URI: metadata://netarkivet.dk/crawl/reports/responsecode-report.txt?heritrixVersion=3.3.0-LBS-2016-02&harvestid=1232&jobid=1204
WARC-Target-URI: metadata://netarkivet.dk/crawl/reports/seeds-report.txt?heritrixVersion=3.3.0-LBS-2016-02&harvestid=1232&jobid=1204
WARC-Target-URI: metadata://netarkivet.dk/crawl/reports/source-report.txt?heritrixVersion=3.3.0-LBS-2016-02&harvestid=1232&jobid=1204
WARC-Target-URI: metadata://netarkivet.dk/crawl/reports/threads-report.txt?heritrixVersion=3.3.0-LBS-2016-02&harvestid=1232&jobid=1204
WARC-Target-URI: metadata://netarkivet.dk/crawl/logs/alerts.log?heritrixVersion=3.3.0-LBS-2016-02&harvestid=1232&jobid=1204
WARC-Target-URI: metadata://netarkivet.dk/crawl/logs/crawl.log?heritrixVersion=3.3.0-LBS-2016-02&harvestid=1232&jobid=1204
WARC-Target-URI: metadata://netarkivet.dk/crawl/logs/heritrix3_err.log?heritrixVersion=3.3.0-LBS-2016-02&harvestid=1232&jobid=1204
WARC-Target-URI: metadata://netarkivet.dk/crawl/logs/heritrix3_out.log?heritrixVersion=3.3.0-LBS-2016-02&harvestid=1232&jobid=1204
WARC-Target-URI: metadata://netarkivet.dk/crawl/logs/heritrix_out.log?heritrixVersion=3.3.0-LBS-2016-02&harvestid=1232&jobid=1204
WARC-Target-URI: metadata://netarkivet.dk/crawl/logs/job.log?heritrixVersion=3.3.0-LBS-2016-02&harvestid=1232&jobid=1204
WARC-Target-URI: metadata://netarkivet.dk/crawl/logs/nonfatal-errors.log?heritrixVersion=3.3.0-LBS-2016-02&harvestid=1232&jobid=1204
WARC-Target-URI: metadata://netarkivet.dk/crawl/logs/progress-statistics.log?heritrixVersion=3.3.0-LBS-2016-02&harvestid=1232&jobid=1204
WARC-Target-URI: metadata://netarkivet.dk/crawl/logs/runtime-errors.log?heritrixVersion=3.3.0-LBS-2016-02&harvestid=1232&jobid=1204
WARC-Target-URI: metadata://netarkivet.dk/crawl/logs/uri-errors.log?heritrixVersion=3.3.0-LBS-2016-02&harvestid=1232&jobid=1204
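
A listing like the one above can be produced with a few lines of Java. This is a minimal sketch, assuming an uncompressed metadata file small enough to read into memory; WarcTargetUriLister is a hypothetical helper, not part of NetarchiveSuite. It scans line by line rather than parsing WARC records, so a payload line that happens to start with the header name would also be reported.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper that extracts WARC-Target-URI values from the lines of a WARC file. */
class WarcTargetUriLister {
    private static final String HEADER = "WARC-Target-URI:";

    /** Returns the value of every line that starts with the WARC-Target-URI header. */
    static List<String> listTargetUris(List<String> lines) {
        List<String> uris = new ArrayList<>();
        for (String line : lines) {
            if (line.startsWith(HEADER)) {
                uris.add(line.substring(HEADER.length()).trim());
            }
        }
        return uris;
    }

    public static void main(String[] args) throws IOException {
        List<String> lines;
        if (args.length == 1) {
            // ISO-8859-1 maps every byte to a character, so binary payloads cannot break decoding.
            lines = Files.readAllLines(Paths.get(args[0]), StandardCharsets.ISO_8859_1);
        } else {
            // Fallback sample using one of the URIs listed in this comment.
            lines = Arrays.asList(
                    "WARC/1.0",
                    "WARC-Type: metadata",
                    "WARC-Target-URI: metadata://netarkivet.dk/crawl/logs/crawl.log"
                            + "?heritrixVersion=3.3.0-LBS-2016-02&harvestid=1232&jobid=1204");
        }
        for (String uri : listTargetUris(lines)) {
            System.out.println(uri);
        }
    }
}
```

Running it with the path of the .warc.open file (or the finished metadata warc) as the only argument prints one URI per record, which is a quick way to confirm that the Heritrix logs and reports made it into the file.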