  1. NetarchiveSuite
  2. NAS-2780

Re-enable automatic shutdown of Heritrix due to inactivity


Details

    • Improvement
    • Status: Open
    • Minor
    • Resolution: Unresolved

    Description

      In version 3.4.0 of NetarchiveSuite, the Harvester Controller Server automatically shut down Heritrix due to inactivity if nothing had been harvested for some time.

      This feature was controlled by these settings in HarvesterSettings:

          /**
           * <b>settings.harvester.harvesting.heritrix.inactivityTimeout</b>: <br>
           * The timeout setting for aborting a crawl based on crawler-inactivity. If the crawler is inactive for this amount
           * of seconds the crawl will be aborted. The inactivity is measured on the crawlController.activeToeCount().
           */
          public static String INACTIVITY_TIMEOUT_IN_SECS = "settings.harvester.harvesting.heritrix.inactivityTimeout";

          /**
           * <b>settings.harvester.harvesting.heritrix.noresponseTimeout</b>: <br>
           * The timeout value (in seconds) used in HeritrixLauncher for aborting crawl when no bytes are being received from
           * web servers.
           */
          public static String CRAWLER_TIMEOUT_NON_RESPONDING = "settings.harvester.harvesting.heritrix.noresponseTimeout";
       

      The current crawl workflow is shown below. The inactivity check should be added either to the HeritrixLauncher#doCrawl() method or, rather, to the CrawlControl#run() method.

      (harvester/heritrix3/heritrix3-controller/src/main/java/dk/netarkivet/harvester/heritrix3/controller/HeritrixLauncher.java)

          public void doCrawl() throws IOFailure {
              setupOrderfile(getHeritrixFiles());
              heritrixController = new HeritrixController(getHeritrixFiles(), jobName);
              
              try {
                  // Initialize Heritrix settings according to the crawler-beans.cxml file.
                  heritrixController.initialize();
                  log.debug("Setup and start new h3 crawl");
                  heritrixController.requestCrawlStart();
                      
                  log.info("Starting periodic CrawlControl with CRAWL_CONTROL_WAIT_PERIOD={} seconds", CRAWL_CONTROL_WAIT_PERIOD);            
                
                  while (!crawlIsOver) {
                      CrawlControl cc = new CrawlControl();
                      cc.run();
                      FrontierReportAnalyzer fra = new FrontierReportAnalyzer(heritrixController);
                      fra.run();
                      if (!crawlIsOver) {
                          try {
                              Thread.sleep(CRAWL_CONTROL_WAIT_PERIOD*1000L);
                          } catch (InterruptedException e) {
                              log.warn("Wait interrupted: " + e);
                          }
                      }
                  }
                  log.info("CrawlJob is now over");
              } catch (IOFailure e) {
                  log.warn("Error during initialisation of crawl", e);
                  throw (e);
              } catch (Exception e) {
                  log.warn("Exception during crawl", e);
                  throw new RuntimeException("Exception during crawl", e);
              } finally {
                  if (heritrixController != null) {
                      heritrixController.cleanup(getHeritrixFiles().getCrawlDir());
                  }
              }
              log.debug("Heritrix3 has finished crawling...");
          }
      
          /**
           * This class executes a crawl control task, e.g. queries the crawler for progress summary, sends the adequate JMS
           * message to the monitor, and checks whether the crawl is finished, in which case crawl control will be ended.
           * <p>
           */
          private class CrawlControl implements Runnable {
             
              @Override
              public void run() {
                  CrawlProgressMessage cpm = null;
                  try {
                      cpm = heritrixController.getCrawlProgress();
                  } catch (IOFailure e) {
                      // Log a warning and retry
                      log.warn("IOFailure while getting crawl progress", e);
                      return;
                  } catch (HarvestingAbort e) {
                      log.warn("Got HarvestingAbort exception while getting crawl progress. Means crawl is over", e);
                      crawlIsOver = true;
                      return;
                  }
                  JMSConnectionFactory.getInstance().send(cpm);
                  Heritrix3Files files = getHeritrixFiles();
                  if (cpm.crawlIsFinished()) {
                      log.info("Job ID {}: crawl is finished.", files.getJobID());
                      crawlIsOver = true;
                      return;
                  }
                  
                  log.info("Job ID: " + files.getJobID() + ", Harvest ID: " + files.getHarvestID() + ", " + cpm.getHostUrl()
                          + "\n" + cpm.getProgressStatisticsLegend() + "\n" + cpm.getJobStatus().getStatus() + " "
                          + cpm.getJobStatus().getProgressStatistics());
              }
          }
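
To make the request concrete, here is a minimal sketch of how the two timeouts described above could be tracked between CrawlControl runs. It is not taken from NetarchiveSuite: the class InactivityMonitor, its constructor, and the update() method are hypothetical names invented for illustration. It assumes the caller can obtain an active toe-thread count and a cumulative received-byte count on each poll; which accessors on CrawlProgressMessage or HeritrixController would supply those values is not shown in this ticket.

```java
/**
 * Hypothetical helper (not part of NetarchiveSuite) that tracks the two
 * timeouts described by the inactivityTimeout and noresponseTimeout settings.
 */
final class InactivityMonitor {
    private final long inactivityTimeoutMillis;
    private final long noResponseTimeoutMillis;
    private long lastActiveMillis;      // last time at least one toe thread was active
    private long lastBytesChangeMillis; // last time the received-byte counter changed
    private long lastBytesSeen = -1;    // -1 means no sample has been recorded yet

    InactivityMonitor(long inactivityTimeoutSecs, long noResponseTimeoutSecs, long nowMillis) {
        this.inactivityTimeoutMillis = inactivityTimeoutSecs * 1000L;
        this.noResponseTimeoutMillis = noResponseTimeoutSecs * 1000L;
        this.lastActiveMillis = nowMillis;
        this.lastBytesChangeMillis = nowMillis;
    }

    /**
     * Records one progress sample.
     *
     * @return true if either timeout has been exceeded and the crawl should be aborted
     */
    boolean update(int activeToeCount, long bytesReceived, long nowMillis) {
        if (activeToeCount > 0) {
            lastActiveMillis = nowMillis;
        }
        if (bytesReceived != lastBytesSeen) {
            lastBytesSeen = bytesReceived;
            lastBytesChangeMillis = nowMillis;
        }
        boolean inactive = nowMillis - lastActiveMillis > inactivityTimeoutMillis;
        boolean noResponse = nowMillis - lastBytesChangeMillis > noResponseTimeoutMillis;
        return inactive || noResponse;
    }
}
```

Under these assumptions, CrawlControl#run() could call update() once per poll, after the progress message has been fetched, and treat a true result as a signal to stop the crawl and set crawlIsOver. Passing the clock value in as a parameter keeps the check independent of CRAWL_CONTROL_WAIT_PERIOD and easy to test.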
      
       

       

       


          People

            Unassigned
            svc Søren Vejrup Carlsen (Inactive)
            Watchers: 2

            Dates

              Created:
              Updated: