NetarchiveSuite / NAS-2885

Jobs killed by HarvestController as a consequence of a bug in Heritrix calculating job data size


Details

    • Bug
    • Resolution: Unresolved
    • Critical
    • None
    • 7.4
    • Heritrix 3
    • None

    Description

       

      At BnF, after launching a test broad crawl, we ran into an issue where the HarvestController kills H3 jobs.

      Here are traces from the HarvestController log:

      00:05:27.061 INFO d.n.h.h.c.FrontierReportAnalyzer - Will generate full Heritrix frontier report, 0d 00:00:00 elapsed since last generation started.
      00:05:32.859 INFO d.n.h.h.c.FrontierReportAnalyzer - Generated full Heritrix frontier report in 00d 00:00:05.
      00:05:32.886 INFO d.n.h.h.c.FrontierReportAnalyzer - Applied filter dk.netarkivet.harvester.harvesting.frontier.TopTotalEnqueuesFilter to full frontier report, this took 7 ms.
      00:05:33.055 INFO d.n.h.h.c.FrontierReportAnalyzer - Applied filter dk.netarkivet.harvester.harvesting.frontier.RetiredQueuesFilter to full frontier report, this took 164 ms.
      00:07:34.256 ERROR o.n.h.xmlutils.XmlErrorHandler - SAX parsing error!
      org.xml.sax.SAXParseException: Premature end of file.
      at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
      at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
      at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
      at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
      at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
      at org.apache.xerces.impl.XMLVersionDetector.determineDocVersion(Unknown Source)
      at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
      at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
      at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
      at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
      at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
      at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:121)
      at org.netarchivesuite.heritrix3wrapper.xmlutils.XmlValidator.testStructuralValidity(XmlValidator.java:233)
      at org.netarchivesuite.heritrix3wrapper.JobResult.parse(JobResult.java:22)
      at org.netarchivesuite.heritrix3wrapper.Heritrix3Wrapper.jobResult(Heritrix3Wrapper.java:408)
      at org.netarchivesuite.heritrix3wrapper.Heritrix3Wrapper.job(Heritrix3Wrapper.java:454)
      at dk.netarkivet.harvester.heritrix3.controller.HeritrixController.getCrawlProgress(HeritrixController.java:411)
      at dk.netarkivet.harvester.heritrix3.controller.HeritrixLauncher$CrawlControl.run(HeritrixLauncher.java:146)
      at dk.netarkivet.harvester.heritrix3.controller.HeritrixLauncher.doCrawl(HeritrixLauncher.java:109)
      at dk.netarkivet.harvester.heritrix3.HarvestJob.runHarvest(HarvestJob.java:102)
      at dk.netarkivet.harvester.heritrix3.HarvestControllerServer$HarvesterThread.run(HarvestControllerServer.java:485)
      [...]
      00:07:34.265 WARN d.n.h.h.controller.HeritrixLauncher - Exception during crawl
      java.lang.NullPointerException: null
      at dk.netarkivet.harvester.heritrix3.controller.HeritrixController.getCrawlServiceAttributes(HeritrixController.java:438)
      at dk.netarkivet.harvester.heritrix3.controller.HeritrixController.getCrawlProgress(HeritrixController.java:413)
      at dk.netarkivet.harvester.heritrix3.controller.HeritrixLauncher$CrawlControl.run(HeritrixLauncher.java:146)
      at dk.netarkivet.harvester.heritrix3.controller.HeritrixLauncher.doCrawl(HeritrixLauncher.java:109)
      at dk.netarkivet.harvester.heritrix3.HarvestJob.runHarvest(HarvestJob.java:102)
      at dk.netarkivet.harvester.heritrix3.HarvestControllerServer$HarvesterThread.run(HarvestControllerServer.java:485)
      00:07:34.339 WARN d.n.h.h.c.HeritrixController - Should be one job but there is 3 jobs: 41005_1662658807658 CrawlRSS-Sample-Profile CrawlRSS-Sample-Profile-DB-conf
      00:07:34.509 INFO d.n.h.h.c.HeritrixController - Tearing down h3 job 41005_1662658807658
      00:07:47.217 WARN d.n.h.h.c.HeritrixController - The job 41005_1662658807658 is still lurking about. Shutdown heritrix3 and ignore the job
      00:07:48.595 WARN d.n.h.h.HarvestControllerServer - Error during crawling. The crawl may have been only partially completed.
      java.lang.RuntimeException: Exception during crawl
      at dk.netarkivet.harvester.heritrix3.controller.HeritrixLauncher.doCrawl(HeritrixLauncher.java:126)
      at dk.netarkivet.harvester.heritrix3.HarvestJob.runHarvest(HarvestJob.java:102)
      at dk.netarkivet.harvester.heritrix3.HarvestControllerServer$HarvesterThread.run(HarvestControllerServer.java:485)

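      Using debug mode in H3, we noticed that we occasionally get empty responses from H3 while it calculates the job data size (sizeOnDisk). The error occurs when open files are closed while the calculation is in progress. An empty response is consistent with the "Premature end of file" SAX error in the trace above, which is then followed by the NullPointerException in getCrawlServiceAttributes.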

       

      We need a fix that takes file closing into account and makes the calculation more robust.
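
Until the calculation itself is fixed in Heritrix, one possible mitigation on the caller side is to treat an empty or unparsable response as "no result for this poll" and retry, instead of letting it surface as a SAX error and a later NullPointerException. The following is only a minimal sketch under that assumption. The class and method names (JobStatusPollSketch, parseOrEmpty, pollWithRetry) are hypothetical and do not exist in NetarchiveSuite or heritrix3-wrapper; the fetch step is abstracted as a Supplier because the real HTTP call is not shown in this report. Only standard JAXP calls are used.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.function.Supplier;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for illustration; not part of NetarchiveSuite or heritrix3-wrapper.
final class JobStatusPollSketch {

    /** Returns the parsed document, or empty if the body is missing, blank or not well-formed XML. */
    static Optional<Document> parseOrEmpty(byte[] body) {
        if (body == null || new String(body, StandardCharsets.UTF_8).trim().isEmpty()) {
            return Optional.empty(); // empty response: nothing to parse
        }
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors and keeps the parser from printing to stderr.
            builder.setErrorHandler(new DefaultHandler());
            return Optional.of(builder.parse(new ByteArrayInputStream(body)));
        } catch (SAXException | IOException | ParserConfigurationException e) {
            return Optional.empty(); // malformed or truncated response
        }
    }

    /** Calls fetch up to maxAttempts times, pausing between attempts, until a response parses. */
    static Optional<Document> pollWithRetry(Supplier<byte[]> fetch, int maxAttempts, long pauseMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            Optional<Document> doc = parseOrEmpty(fetch.get());
            if (doc.isPresent()) {
                return doc;
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(pauseMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                    return Optional.empty();
                }
            }
        }
        return Optional.empty(); // caller decides what to do after repeated empty responses
    }
}
```

With such a wrapper, the caller would check the returned Optional before reading any attributes, and could skip a single progress update rather than tear down the job. How many attempts and how long a pause are appropriate would need to be determined against a real crawl.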

       

       

          People

            Unassigned
            Sara Aubry
            Clara Wiatrowski
            Watchers: 1
