  1. NetarchiveSuite
  2. NAS-1881

Give the ability to use Heritrix recover.gz to resubmit a failed job and let it continue where it stopped


Details

    • BNF
    • Confident

      0) Install NAS

      1) Make sure that the recovery log is enabled in Heritrix. This is done by setting the "recovery-log-enabled" property of the frontier to true:

       <boolean name="recovery-log-enabled">true</boolean> 

      1a) Make sure that the recover.gz file is included in the metadata arc by adding
      <logFilePattern>.*(\.log|\.out|\.gz)</logFilePattern> to harvester.harvesting.metadata in the HarvestController settings file.

      2) Enable recovery in NetarchiveSuite for just one of the HIGH-priority harvesters by setting "settings.harvester.harvesting.continuationFromHeritrixRecoverlogEnabled" to true in that harvester's settings file, like this:

      <harvesting>
      <continuationFromHeritrixRecoverlogEnabled>true</continuationFromHeritrixRecoverlogEnabled>
      ..
      

      3) Restart the HIGH-priority harvester whose settings you modified, and shut down the rest of the HIGH-priority harvesters in your installation.

      4) Start a rather large selective harvest, so that the recovery mechanism has time to work. After about 15 minutes, kill the harvester (kill -9), and afterwards restart the harvesting processor as well. If it is not restarted, it takes a long time (6 hours+) before NAS gives up on the dead Heritrix process, continues with the data processing, and uploads the metadata (including the recover.gz file) to the archive.

      5) Restart the killed job.

      6) Check that NAS successfully extracts the recover.gz from the previous job and points to it in the order.xml (recover-path attribute).

      7) After 15 minutes, terminate the job in the Heritrix GUI.

      8) Examine the CDXes from the two metadata-arc files, and verify that the downloads from the killed harvest are not harvested again in the next harvest.
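
The checks in steps 6 and 8 could be scripted roughly as follows. This is only a sketch under stated assumptions, not part of NAS: the function names are hypothetical, the recover-path setting is assumed to appear in order.xml as an element carrying name="recover-path" (mirroring the "recovery-log-enabled" property in step 1), and each CDX line is assumed to be whitespace-separated with the URL as its first field.

```python
import xml.etree.ElementTree as ET


def find_recover_path(order_xml_text):
    """Return the text of the first element with name="recover-path", or None.

    Assumption: order.xml stores the setting as an element with a
    name="recover-path" attribute, like the boolean property in step 1.
    """
    root = ET.fromstring(order_xml_text)
    for element in root.iter():
        if element.get("name") == "recover-path":
            return (element.text or "").strip()
    return None


def cdx_urls(cdx_lines):
    """Collect the first field of each CDX line (assumed to be the URL)."""
    urls = set()
    for line in cdx_lines:
        # Skip blank lines and a possible " CDX ..." header line.
        if not line.strip() or line.lstrip().startswith("CDX"):
            continue
        urls.add(line.split()[0])
    return urls


def reharvested_urls(first_cdx_lines, second_cdx_lines):
    """Return the sorted URLs that occur in the CDX data of both jobs."""
    return sorted(cdx_urls(first_cdx_lines) & cdx_urls(second_cdx_lines))
```

For step 6, find_recover_path should return a non-empty path for the restarted job. For step 8, reharvested_urls applied to the CDX lines of the killed job and of the restarted job lists the URLs that were fetched twice, so a short or empty result is what the test hopes to see.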


    Description

      When Heritrix crashes, or a job fails after having run for a long time, it would be useful to be able to restart the failed Heritrix job from the point of failure using the recover.gz file.
      See section 9.3, "Recovery of Frontier State and recover.gz", in the Heritrix user manual:
      http://crawler.archive.org/articles/user_manual/outside.html
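
To get a rough idea of how much frontier state a given recover.gz holds before resubmitting a job, the file can be summarised with a few lines of Python. This is a minimal sketch, not something NAS or Heritrix provides: the function name is hypothetical, and the code makes no assumption about which line prefixes Heritrix writes; it simply counts lines by their first whitespace-separated token. Because the file comes from a process that was killed, the gzip stream may be cut short, so the sketch keeps whatever could be read.

```python
import gzip
from collections import Counter


def tally_recover_log(path):
    """Count the lines of a gzip-compressed recover log by leading token."""
    counts = Counter()
    try:
        with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
            for line in handle:
                fields = line.split()
                if fields:
                    counts[fields[0]] += 1
    except EOFError:
        # A killed crawler can leave a truncated gzip stream behind;
        # keep the counts gathered up to that point.
        pass
    return counts
```

Calling tally_recover_log on the recover.gz extracted from the previous job's metadata returns a Counter keyed by whatever line prefixes the file contains, which can be compared between the killed job and the restarted one.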

            People

              svc Søren Vejrup Carlsen (Inactive)
              svc Søren Vejrup Carlsen (Inactive)
              Colin Rosenthal
              Watchers: 1

                Time Tracking

                  Estimated: 35h
                  Remaining: 31h
                  Logged: 4h