[NAS-2501] 10 minutes waitstate between harvest is finished, and postprocessing begins Created: 11/Feb/16  Updated: 24/Feb/16  Resolved: 24/Feb/16

Status: Resolved
Project: NetarchiveSuite
Component/s: Harvester Controller Server
Affects Version/s: None
Fix Version/s: 5.1

Type: Bug Priority: Minor
Reporter: Søren Vejrup Carlsen (Inactive) Assignee: Søren Vejrup Carlsen (Inactive)
Resolution: Fixed  
Labels: None
Remaining Estimate: Not Specified
Time Spent: 0.1h
Original Estimate: Not Specified

Issue Links:
Related
related to NAS-1790 Update deploy for quickstart to the h... Resolved

 Description   

There is seemingly a constant 10-minute wait state between the time a harvest finishes and the time postprocessing begins:

2016-02-11 14:08:56.262 [pool-2-thread-1] INFO dk.netarkivet.harvester.heritrix3.controller.HeritrixLauncher.run - Job ID 4: crawl is finished.

2016-02-11 14:18:56.221 [Thread-9] INFO dk.netarkivet.harvester.heritrix3.controller.HeritrixLauncher.doCrawl - CrawlJob is now over
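A plausible reconstruction of the delay, sketched with illustrative names (these are not the actual NetarchiveSuite classes): the monitoring task notices almost immediately that the crawl has finished and sets a flag, but the thread waiting for the crawl only re-checks that flag on its own, much coarser timer, so detection lags by up to one full wait interval (10 minutes in production; milliseconds here for brevity).

```java
import java.util.concurrent.*;

// Minimal sketch (hypothetical names) of the suspected structure: a monitor
// task sets a "crawl is over" flag, but the waiting thread only re-checks the
// flag on its own, longer timer, so it observes the flag one full interval late.
public class LagSketch {
    static volatile boolean crawlIsOver = false;

    public static void main(String[] args) throws Exception {
        ScheduledExecutorService monitor = Executors.newScheduledThreadPool(1);
        long start = System.currentTimeMillis();
        // The monitor notices the crawl finishing almost immediately (50 ms here).
        monitor.schedule(() -> { crawlIsOver = true; }, 50, TimeUnit.MILLISECONDS);

        // The waiting thread sleeps for a long fixed interval between checks,
        // so it only sees the flag after a full interval has elapsed.
        while (!crawlIsOver) {
            Thread.sleep(500); // production analogue: a 10-minute wait
        }
        monitor.shutdown();
        System.out.println("detected after ~" + (System.currentTimeMillis() - start) + " ms");
    }
}
```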



 Comments   
Comment by Søren Vejrup Carlsen (Inactive) [ 17/Feb/16 ]

Replaced the rather complex thread structure with a simple wait loop.
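A hedged sketch of the fix direction described above (names are illustrative, not the actual NetarchiveSuite code): instead of a scheduled monitoring task plus a separately timed waiter, a single loop polls crawl progress at a short interval and returns as soon as the crawl is reported finished, so the worst-case extra delay before postprocessing is one poll interval rather than 10 minutes.

```java
import java.util.concurrent.TimeUnit;

// Illustrative wait-loop sketch; Controller is a stand-in for the part of
// HeritrixController that reports crawl progress.
public class WaitLoopSketch {
    interface Controller { boolean crawlIsFinished(); }

    static void waitForCrawlToEnd(Controller controller, long pollMillis)
            throws InterruptedException {
        while (!controller.crawlIsFinished()) {
            TimeUnit.MILLISECONDS.sleep(pollMillis);
        }
        // Postprocessing can start immediately after the loop exits.
    }

    public static void main(String[] args) throws InterruptedException {
        // Simulate a crawl that finishes 100 ms from now, polled every 20 ms.
        long finishAt = System.currentTimeMillis() + 100;
        waitForCrawlToEnd(() -> System.currentTimeMillis() >= finishAt, 20);
        System.out.println("CrawlJob is now over");
    }
}
```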

Comment by Søren Vejrup Carlsen (Inactive) [ 12/Feb/16 ]

After inserting extra logging in the first if-statement:

private class CrawlControl implements Runnable {

    @Override
    public void run() {
        if (crawlIsOver) { // Don't check again; we are already done
            log.warn("Why do you check me again. we're done already");
            return;
        }
        CrawlProgressMessage cpm = null;
        try {
            cpm = heritrixController.getCrawlProgress();
        } catch (IOFailure e) {
            // Log a warning and retry
            log.warn("IOFailure while getting crawl progress", e);
            return;
        } catch (HarvestingAbort e) {
            log.warn("Got HarvestingAbort exception while getting crawl progress. Means crawl is over", e);
            crawlIsOver = true;
            return;
        }

        JMSConnectionFactory.getInstance().send(cpm);

        Heritrix3Files files = getHeritrixFiles();
        if (cpm.crawlIsFinished()) {
            log.info("Job ID {}: crawl is finished.", files.getJobID());
            crawlIsOver = true;
            return;
        }

        log.info("Job ID: " + files.getJobID() + ", Harvest ID: " + files.getHarvestID() + ", " + cpm.getHostUrl()
                + "\n" + cpm.getProgressStatisticsLegend() + "\n" + cpm.getJobStatus().getStatus() + " "
                + cpm.getJobStatus().getProgressStatistics());
    }
}

I get the following sequence, which indicates that the task runs four more times than it should:

2016-02-12 11:35:09.264 [pool-2-thread-1] INFO  dk.netarkivet.harvester.heritrix3.controller.HeritrixLauncher.run - Job ID 13: crawl is finished.

2016-02-12 11:36:09.215 [pool-2-thread-1] WARN  dk.netarkivet.harvester.heritrix3.controller.HeritrixLauncher.run - Why do you check me again. we're done already

2016-02-12 11:37:09.215 [pool-2-thread-1] WARN  dk.netarkivet.harvester.heritrix3.controller.HeritrixLauncher.run - Why do you check me again. we're done already

2016-02-12 11:38:09.215 [pool-2-thread-1] WARN  dk.netarkivet.harvester.heritrix3.controller.HeritrixLauncher.run - Why do you check me again. we're done already

2016-02-12 11:39:09.215 [pool-2-thread-1] WARN  dk.netarkivet.harvester.heritrix3.controller.HeritrixLauncher.run - Why do you check me again. we're done already

2016-02-12 11:39:09.216 [Thread-8] INFO dk.netarkivet.harvester.heritrix3.controller.HeritrixLauncher.doCrawl - CrawlJob is now over

Generated at Sat Apr 27 04:20:22 CEST 2024 using Jira 9.4.15#940015-sha1:bdaa9cbecfb6791ea579749728cab771f0dfe90b.