[NAS-2686] Always create a metadata-warcfile even if Heritrix3 doesn't create any (w)arc files Created: 23/Nov/17 Updated: 25/Apr/18 Resolved: 02/Feb/18 |
|
Status: | Closed |
Project: | NetarchiveSuite |
Component/s: | Harvester Controller Server |
Affects Version/s: | 5.2.2, 5.3, 5.3.1 |
Fix Version/s: | 5.2.3, 5.4 |
Type: | New Feature | Priority: | Minor |
Reporter: | Søren Vejrup Carlsen (Inactive) | Assignee: | Søren Vejrup Carlsen (Inactive) |
Resolution: | Fixed | ||
Labels: | None | ||
Remaining Estimate: | Not Specified | ||
Time Spent: | 17m | ||
Original Estimate: | Not Specified |
Issue Links: |
Sprint: | NAS 5.4 | ||||||||
Verification: | To test:
|
Description |
You typically receive two mail notifications whenever a harvest fails:
1)
Host: narcana-webdanica01.statsbiblioteket.dk
Date: Thu Nov 23 06:51:04 CET 2017
dk.netarkivet.harvester.heritrix3.PostProcessing.storeFiles(PostProcessing.java:269)
Probable error in Heritrix job setup. No arcfiles or warcfiles generated by Heritrix for job 1204
2)
Host: narcana-webdanica01.statsbiblioteket.dk
Date: Thu Nov 23 06:51:04 CET 2017
dk.netarkivet.harvester.heritrix3.PostProcessing.doPostProcessing(PostProcessing.java:165)
Trouble during postprocessing of files in '/opt/webdanica/WEBDANICA/harvester_focused/1204_1511416193560'. Errors accumulated during the postprocessing: Metadata file /opt/webdanica/WEBDANICA/harvester_focused/1204_1511416193560/metadata/1204-metadata-1.warc does not exist
dk.netarkivet.common.exceptions.IllegalState: Metadata file /opt/webdanica/WEBDANICA/harvester_focused/1204_1511416193560/metadata/1204-metadata-1.warc does not exist
at dk.netarkivet.harvester.heritrix3.IngestableFiles.getMetadataArcFiles(IngestableFiles.java:183)
at dk.netarkivet.harvester.heritrix3.PostProcessing.storeFiles(PostProcessing.java:281)
at dk.netarkivet.harvester.heritrix3.PostProcessing.doPostProcessing(PostProcessing.java:159)
at dk.netarkivet.harvester.heritrix3.HarvestControllerServer$HarvesterThread.run(HarvestControllerServer.java:457)
The problem is that if Heritrix3 doesn't create any (w)arc files, no metadata-warc is created |
Comments |
Comment by Søren Vejrup Carlsen (Inactive) [ 10/Jan/18 ] |
Comment by Søren Vejrup Carlsen (Inactive) [ 23/Nov/17 ] |
A valid fix should just be changing

    if (cdxGenerationSucceeded) {
        // This indicates, that either the files in the arcsdir or in the warcsdir
        // have now been CDX-processed.
        //
        // TODO refactor, as this call has too many sideeffects
        ingestables.setMetadataGenerationSucceeded(true);
    } else {
        log.warn("Found no archive directory with ARC og WARC files. Looked for dirs '{}' and '{}'.",
                arcFilesDir.getAbsolutePath(), warcFilesDir.getAbsolutePath());
    }

to

    if (!cdxGenerationSucceeded) {
        log.warn("Found no archive directory with ARC og WARC files. Looked for dirs '{}' and '{}'.",
                arcFilesDir.getAbsolutePath(), warcFilesDir.getAbsolutePath());
    }
    ingestables.setMetadataGenerationSucceeded(true);
|
Comment by Søren Vejrup Carlsen (Inactive) [ 23/Nov/17 ] |
This code in HarvestDocumentation.documentHarvest() is to blame:

    boolean cdxGenerationSucceeded = false;

    // Try to create CDXes over ARC and WARC files.
    File arcFilesDir = ingestables.getArcsDir();
    File warcFilesDir = ingestables.getWarcsDir();

    if (arcFilesDir.isDirectory() && FileUtils.hasFiles(arcFilesDir)) {
        addCDXes(ingestables, arcFilesDir, mdfw, ArchiveProfile.ARC_PROFILE);
        cdxGenerationSucceeded = true;
    }
    if (warcFilesDir.isDirectory() && FileUtils.hasFiles(warcFilesDir)) {
        addCDXes(ingestables, warcFilesDir, mdfw, ArchiveProfile.WARC_PROFILE);
        cdxGenerationSucceeded = true;
    }

    if (cdxGenerationSucceeded) {
        // This indicates, that either the files in the arcsdir or in the warcsdir
        // have now been CDX-processed.
        //
        // TODO refactor, as this call has too many sideeffects
        ingestables.setMetadataGenerationSucceeded(true);
    } else {
        log.warn("Found no archive directory with ARC og WARC files. Looked for dirs '{}' and '{}'.",
                arcFilesDir.getAbsolutePath(), warcFilesDir.getAbsolutePath());
    }

If no arcs or warcs are found, cdxGenerationSucceeded is still false at the end, so
ingestables.setMetadataGenerationSucceeded(true);
is never executed. As a result, the file JOBID-metadata-1.warc.open is never closed and renamed to JOBID-metadata-1.warc |
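The effect of the flag can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions, not NetarchiveSuite code: the class MetadataFileModel and the method finalFileExistsAfter are hypothetical, and the only behaviour taken from this issue is that the .warc.open file is renamed to .warc when setMetadataGenerationSucceeded(true) is called.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical model of the behaviour described in this issue; NOT NetarchiveSuite code. */
final class MetadataFileModel {
    private final Path openFile;   // JOBID-metadata-1.warc.open
    private final Path finalFile;  // JOBID-metadata-1.warc

    MetadataFileModel(Path dir, long jobId) throws IOException {
        openFile = dir.resolve(jobId + "-metadata-1.warc.open");
        finalFile = dir.resolve(jobId + "-metadata-1.warc");
        Files.createFile(openFile); // the metadata file is being written
    }

    /** Assumed side effect: marking success renames the .open file to its final name. */
    void setMetadataGenerationSucceeded(boolean succeeded) throws IOException {
        if (succeeded) {
            Files.move(openFile, finalFile);
        }
    }

    boolean finalFileExists() {
        return Files.exists(finalFile);
    }

    /**
     * Runs the flag logic once in a temporary directory.
     * fixedLogic == false models the original if/else, true models the proposed fix.
     */
    static boolean finalFileExistsAfter(boolean archiveFilesFound, boolean fixedLogic) {
        try {
            Path dir = Files.createTempDirectory("metadata-model");
            MetadataFileModel model = new MetadataFileModel(dir, 1204L);
            boolean cdxGenerationSucceeded = archiveFilesFound;
            if (fixedLogic || cdxGenerationSucceeded) {
                model.setMetadataGenerationSucceeded(true);
            }
            boolean result = model.finalFileExists();
            // Clean up the temporary files created by this run.
            Files.deleteIfExists(model.openFile);
            Files.deleteIfExists(model.finalFile);
            Files.deleteIfExists(dir);
            return result;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("original logic, no (w)arc files: " + finalFileExistsAfter(false, false));
        System.out.println("fixed logic, no (w)arc files:    " + finalFileExistsAfter(false, true));
    }
}
```

In this model, the original logic leaves only the .open file behind when no (w)arc files were found, which matches the "Metadata file ... does not exist" error reported during postprocessing.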
Comment by Søren Vejrup Carlsen (Inactive) [ 23/Nov/17 ] |
In the case of job 1204, an oldjobs/1204_1511416193560/tmp-meta/1204-metadata-1.warc.open file exists with the available h3 reports already added:
WARC-Target-URI: metadata://netarkivet.dk/crawl/setup/crawler-beans.cxml?heritrixVersion=3.3.0-LBS-2016-02&harvestid=1232&jobid=1204
WARC-Target-URI: metadata://netarkivet.dk/crawl/setup/harvestInfo.xml?heritrixVersion=3.3.0-LBS-2016-02&harvestid=1232&jobid=1204
WARC-Target-URI: metadata://netarkivet.dk/crawl/setup/seeds.txt?heritrixVersion=3.3.0-LBS-2016-02&harvestid=1232&jobid=1204
WARC-Target-URI: metadata://netarkivet.dk/crawl/reports/archivefiles-report.txt?heritrixVersion=3.3.0-LBS-2016-02&harvestid=1232&jobid=1204
WARC-Target-URI: metadata://netarkivet.dk/crawl/reports/crawl-report.txt?heritrixVersion=3.3.0-LBS-2016-02&harvestid=1232&jobid=1204
WARC-Target-URI: metadata://netarkivet.dk/crawl/reports/frontier-summary-report.txt?heritrixVersion=3.3.0-LBS-2016-02&harvestid=1232&jobid=1204
WARC-Target-URI: metadata://netarkivet.dk/crawl/reports/hosts-report.txt?heritrixVersion=3.3.0-LBS-2016-02&harvestid=1232&jobid=1204
WARC-Target-URI: metadata://netarkivet.dk/crawl/reports/mimetype-report.txt?heritrixVersion=3.3.0-LBS-2016-02&harvestid=1232&jobid=1204
WARC-Target-URI: metadata://netarkivet.dk/crawl/reports/processors-report.txt?heritrixVersion=3.3.0-LBS-2016-02&harvestid=1232&jobid=1204
WARC-Target-URI: metadata://netarkivet.dk/crawl/reports/responsecode-report.txt?heritrixVersion=3.3.0-LBS-2016-02&harvestid=1232&jobid=1204
WARC-Target-URI: metadata://netarkivet.dk/crawl/reports/seeds-report.txt?heritrixVersion=3.3.0-LBS-2016-02&harvestid=1232&jobid=1204
WARC-Target-URI: metadata://netarkivet.dk/crawl/reports/source-report.txt?heritrixVersion=3.3.0-LBS-2016-02&harvestid=1232&jobid=1204
WARC-Target-URI: metadata://netarkivet.dk/crawl/reports/threads-report.txt?heritrixVersion=3.3.0-LBS-2016-02&harvestid=1232&jobid=1204
WARC-Target-URI: metadata://netarkivet.dk/crawl/logs/alerts.log?heritrixVersion=3.3.0-LBS-2016-02&harvestid=1232&jobid=1204
WARC-Target-URI: metadata://netarkivet.dk/crawl/logs/crawl.log?heritrixVersion=3.3.0-LBS-2016-02&harvestid=1232&jobid=1204
WARC-Target-URI: metadata://netarkivet.dk/crawl/logs/heritrix3_err.log?heritrixVersion=3.3.0-LBS-2016-02&harvestid=1232&jobid=1204
WARC-Target-URI: metadata://netarkivet.dk/crawl/logs/heritrix3_out.log?heritrixVersion=3.3.0-LBS-2016-02&harvestid=1232&jobid=1204
WARC-Target-URI: metadata://netarkivet.dk/crawl/logs/heritrix_out.log?heritrixVersion=3.3.0-LBS-2016-02&harvestid=1232&jobid=1204
WARC-Target-URI: metadata://netarkivet.dk/crawl/logs/job.log?heritrixVersion=3.3.0-LBS-2016-02&harvestid=1232&jobid=1204
WARC-Target-URI: metadata://netarkivet.dk/crawl/logs/nonfatal-errors.log?heritrixVersion=3.3.0-LBS-2016-02&harvestid=1232&jobid=1204
WARC-Target-URI: metadata://netarkivet.dk/crawl/logs/progress-statistics.log?heritrixVersion=3.3.0-LBS-2016-02&harvestid=1232&jobid=1204
WARC-Target-URI: metadata://netarkivet.dk/crawl/logs/runtime-errors.log?heritrixVersion=3.3.0-LBS-2016-02&harvestid=1232&jobid=1204
WARC-Target-URI: metadata://netarkivet.dk/crawl/logs/uri-errors.log?heritrixVersion=3.3.0-LBS-2016-02&harvestid=1232&jobid=1204 |
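A listing like the one above could be produced with a small helper along the following lines. This is a hedged sketch, not part of NetarchiveSuite: the class WarcTargetUriLister and its methods are hypothetical, and the scan is deliberately naive. It looks at every line of an uncompressed file, so a payload line that happens to start with the header name would also be reported.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for inspecting an uncompressed (unfinished) metadata WARC file. */
final class WarcTargetUriLister {
    private static final String HEADER = "WARC-Target-URI:";

    /** Returns the value of every line that starts with the WARC-Target-URI header name. */
    static List<String> targetUris(List<String> lines) {
        List<String> uris = new ArrayList<>();
        for (String line : lines) {
            // Header field names are matched case-insensitively.
            if (line.regionMatches(true, 0, HEADER, 0, HEADER.length())) {
                uris.add(line.substring(HEADER.length()).trim());
            }
        }
        return uris;
    }

    /** Convenience wrapper; ISO-8859-1 is used so that arbitrary payload bytes decode without errors. */
    static List<String> targetUrisInFile(Path warcFile) throws IOException {
        return targetUris(Files.readAllLines(warcFile, StandardCharsets.ISO_8859_1));
    }

    public static void main(String[] args) throws IOException {
        // Usage: java WarcTargetUriLister <path to a .warc or .warc.open file>
        for (String uri : targetUrisInFile(Path.of(args[0]))) {
            System.out.println("WARC-Target-URI: " + uri);
        }
    }
}
```

Reading the whole file into memory is acceptable for a small metadata file; for large files a streaming reader would be the safer choice.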