Details
- Type: Improvement
- Resolution: Unresolved
- Priority: Minor
Description
In version 3.4.0 of NetarchiveSuite, Heritrix was automatically shut down by the Harvester Controller Server due to inactivity, if nothing had been harvested for some time.
This feature was controlled by the following settings in HarvesterSettings:
    /**
     * <b>settings.harvester.harvesting.heritrix.inactivityTimeout</b>: <br>
     * The timeout setting for aborting a crawl based on crawler-inactivity. If the crawler is inactive for this amount
     * of seconds the crawl will be aborted. The inactivity is measured on the crawlController.activeToeCount().
     */
    public static String INACTIVITY_TIMEOUT_IN_SECS = "settings.harvester.harvesting.heritrix.inactivityTimeout";

    /**
     * <b>settings.harvester.harvesting.heritrix.noresponseTimeout</b>: <br>
     * The timeout value (in seconds) used in HeritrixLauncher for aborting crawl when no bytes are being received from
     * web servers.
     */
    public static String CRAWLER_TIMEOUT_NON_RESPONDING = "settings.harvester.harvesting.heritrix.noresponseTimeout";
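To illustrate the semantics of these two settings, here is a minimal, self-contained sketch of the abort decision they expressed: one timer on `activeToeCount()` becoming zero, another on the byte counter standing still. The class and method names are assumptions for illustration, not existing NetarchiveSuite code.

```java
/**
 * Hypothetical sketch of the two timeout checks. Feed it periodic crawler
 * status samples; it reports when either timeout from HarvesterSettings
 * (inactivityTimeout / noresponseTimeout) would abort the crawl.
 */
public class InactivityTimeoutTracker {
    private final long inactivityTimeoutMillis;  // from ...heritrix.inactivityTimeout
    private final long noResponseTimeoutMillis;  // from ...heritrix.noresponseTimeout
    private long lastActiveToeMillis;
    private long lastBytesMillis;
    private long lastBytesSeen = -1;

    public InactivityTimeoutTracker(long inactivityTimeoutSecs, long noResponseTimeoutSecs, long nowMillis) {
        this.inactivityTimeoutMillis = inactivityTimeoutSecs * 1000L;
        this.noResponseTimeoutMillis = noResponseTimeoutSecs * 1000L;
        this.lastActiveToeMillis = nowMillis;
        this.lastBytesMillis = nowMillis;
    }

    /** Record the latest crawler status; returns true if the crawl should be aborted. */
    public boolean shouldAbort(int activeToeCount, long bytesDownloaded, long nowMillis) {
        if (activeToeCount > 0) {
            lastActiveToeMillis = nowMillis;  // crawler still has active threads
        }
        if (bytesDownloaded != lastBytesSeen) {
            lastBytesSeen = bytesDownloaded;  // bytes are still arriving
            lastBytesMillis = nowMillis;
        }
        boolean inactive = nowMillis - lastActiveToeMillis >= inactivityTimeoutMillis;
        boolean noResponse = nowMillis - lastBytesMillis >= noResponseTimeoutMillis;
        return inactive || noResponse;
    }
}
```

In 3.4.0 the equivalent checks lived in the old HeritrixLauncher polling loop; the point of this issue is that the Heritrix 3 controller no longer performs them.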
In the current crawl workflow, this check should be re-added to the HeritrixLauncher#doCrawl() method, or rather to the CrawlControl#run() method:
(harvester/heritrix3/heritrix3-controller/src/main/java/dk/netarkivet/harvester/heritrix3/controller/HeritrixLauncher.java)
    public void doCrawl() throws IOFailure {
        setupOrderfile(getHeritrixFiles());
        heritrixController = new HeritrixController(getHeritrixFiles(), jobName);
        try {
            // Initialize Heritrix settings according to the crawler-beans.cxml file.
            heritrixController.initialize();
            log.debug("Setup and start new h3 crawl");
            heritrixController.requestCrawlStart();
            log.info("Starting periodic CrawlControl with CRAWL_CONTROL_WAIT_PERIOD={} seconds",
                    CRAWL_CONTROL_WAIT_PERIOD);
            while (!crawlIsOver) {
                CrawlControl cc = new CrawlControl();
                cc.run();
                FrontierReportAnalyzer fra = new FrontierReportAnalyzer(heritrixController);
                fra.run();
                if (!crawlIsOver) {
                    try {
                        Thread.sleep(CRAWL_CONTROL_WAIT_PERIOD * 1000L);
                    } catch (InterruptedException e) {
                        log.warn("Wait interrupted: " + e);
                    }
                }
            }
            log.info("CrawlJob is now over");
        } catch (IOFailure e) {
            log.warn("Error during initialisation of crawl", e);
            throw (e);
        } catch (Exception e) {
            log.warn("Exception during crawl", e);
            throw new RuntimeException("Exception during crawl", e);
        } finally {
            if (heritrixController != null) {
                heritrixController.cleanup(getHeritrixFiles().getCrawlDir());
            }
        }
        log.debug("Heritrix3 has finished crawling...");
    }

    /**
     * This class executes a crawl control task, e.g. queries the crawler for progress summary, sends the adequate JMS
     * message to the monitor, and checks whether the crawl is finished, in which case crawl control will be ended.
     */
    private class CrawlControl implements Runnable {
        @Override
        public void run() {
            CrawlProgressMessage cpm = null;
            try {
                cpm = heritrixController.getCrawlProgress();
            } catch (IOFailure e) {
                // Log a warning and retry
                log.warn("IOFailure while getting crawl progress", e);
                return;
            } catch (HarvestingAbort e) {
                log.warn("Got HarvestingAbort exception while getting crawl progress. Means crawl is over", e);
                crawlIsOver = true;
                return;
            }
            JMSConnectionFactory.getInstance().send(cpm);
            Heritrix3Files files = getHeritrixFiles();
            if (cpm.crawlIsFinished()) {
                log.info("Job ID {}: crawl is finished.", files.getJobID());
                crawlIsOver = true;
                return;
            }
            log.info("Job ID: " + files.getJobID() + ", Harvest ID: " + files.getHarvestID() + ", "
                    + cpm.getHostUrl() + "\n" + cpm.getProgressStatisticsLegend() + "\n"
                    + cpm.getJobStatus().getStatus() + " " + cpm.getJobStatus().getProgressStatistics());
        }
    }
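Since CrawlControl#run() is invoked once per CRAWL_CONTROL_WAIT_PERIOD, the simplest re-implementation would accumulate idle time across polls and set crawlIsOver (and request a crawl abort) once the accumulated idle time reaches the configured inactivity timeout. The following is a self-contained simulation of that per-poll accumulation, detached from the NetarchiveSuite classes; the class and method names are hypothetical:

```java
/**
 * Hypothetical sketch: simulates a CrawlControl-style polling loop that aborts
 * after the crawler has reported zero active toe threads for long enough.
 */
public class InactivityAbortLoop {
    /**
     * @param activeToeCounts one activeToeCount() sample per poll
     * @param pollPeriodSecs seconds between polls (cf. CRAWL_CONTROL_WAIT_PERIOD)
     * @param inactivityTimeoutSecs the configured inactivity timeout
     * @return the index of the poll at which the crawl would be aborted,
     *         or -1 if the samples end without the timeout firing
     */
    public static int runUntilAbort(int[] activeToeCounts, long pollPeriodSecs, long inactivityTimeoutSecs) {
        long idleSecs = 0;
        for (int i = 0; i < activeToeCounts.length; i++) {
            if (activeToeCounts[i] > 0) {
                idleSecs = 0;               // activity seen: reset the idle clock
            } else {
                idleSecs += pollPeriodSecs; // one more silent poll period
            }
            if (idleSecs >= inactivityTimeoutSecs) {
                return i;                   // here the real code would abort the crawl
            }
        }
        return -1;
    }
}
```

In the real CrawlControl#run() the sample would come from the controller's progress message rather than an array, and hitting the timeout would trigger the same teardown path as a finished crawl (crawlIsOver = true plus an explicit abort request to Heritrix).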