  1. NetarchiveSuite
  2. NAS-2799

Implement "Force Pause" feature for Heritrix



    • Bug
    • Resolution: Unresolved
    • Critical
    • None
    • 5.4.2
    • Harvest Monitor
    • None


      See also https://sbprojects.statsbiblioteket.dk/jira/projects/NARK/issues/NARK-1704

      Background: There is a long-running problem that some Heritrix jobs hang, usually because specific queues/URLs in the Heritrix frontier hang. We would like to pause Heritrix, kill the relevant queues/URLs, and unpause Heritrix so it can finish normally. Unfortunately, when asked to pause, Heritrix first waits for all queues to enter a well-defined state, which means that it won't pause until the hanging queues stop hanging - Catch-22!

      One wrong workaround is to kill the queues without pausing Heritrix first. This puts the frontier in an inconsistent state, which may cause it to hang when closing.

      The other wrong workaround is to terminate Heritrix. This results in all domains in the current job getting a "harvest aborted" status, which means they will all be included in the next snapshot-harvest step or restarted job, even if they have already been fully harvested.

      The best-known workaround is the one described in https://sbprojects.statsbiblioteket.dk/jira/projects/NARK/issues/NARK-868, as tested by both csr and Lauren Ko. It uses the kill-with-replacement method in the Heritrix ToePool. Replacing the killed thread with a new one seems to prevent the hang at close.

      This task is to implement functionality in H3 Monitor, for example a new button, which automatically identifies hanging ToeThreads and kills them. The button should probably only be available after one has attempted a normal Pause (i.e. in state PAUSING), and should leave Heritrix in a state where it is still pausing but will hopefully soon reach PAUSED.
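
      A rough sketch of the intended control flow, to make the proposal concrete. Everything in it is a hypothetical stand-in and not Heritrix or NetarchiveSuite code: `CrawlState`, `ToeSnapshot`, `ToeKiller`, `findHanging` and `forcePause` are invented names. It also assumes that "hanging" means "has spent longer than some threshold in its current step", which this issue does not define. `ToeKiller.killAndReplace` stands in for the kill-with-replacement operation on the ToePool mentioned above.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch only: every type here is a hypothetical stand-in, not a Heritrix class. */
final class ForcePauseSketch {

    /** Hypothetical subset of crawl states needed for the guard. */
    enum CrawlState { RUNNING, PAUSING, PAUSED }

    /** Hypothetical snapshot of one ToeThread: serial number and start time of its current step. */
    static final class ToeSnapshot {
        final int serialNumber;
        final long stepStartedMillis;

        ToeSnapshot(int serialNumber, long stepStartedMillis) {
            this.serialNumber = serialNumber;
            this.stepStartedMillis = stepStartedMillis;
        }
    }

    /** Stand-in for killing a ToeThread and replacing it with a fresh one. */
    interface ToeKiller {
        void killAndReplace(int serialNumber);
    }

    /** Assumption: a thread counts as hanging if its current step has run longer than the threshold. */
    static List<Integer> findHanging(List<ToeSnapshot> toes, long nowMillis, Duration threshold) {
        List<Integer> hanging = new ArrayList<>();
        for (ToeSnapshot toe : toes) {
            if (nowMillis - toe.stepStartedMillis > threshold.toMillis()) {
                hanging.add(toe.serialNumber);
            }
        }
        return hanging;
    }

    /** Kills (with replacement) all hanging threads, but only if a normal pause was already requested. */
    static List<Integer> forcePause(CrawlState state, List<ToeSnapshot> toes, long nowMillis,
                                    Duration threshold, ToeKiller killer) {
        if (state != CrawlState.PAUSING) {
            throw new IllegalStateException("Force pause is only allowed in state PAUSING, not " + state);
        }
        List<Integer> hanging = findHanging(toes, nowMillis, threshold);
        for (int serialNumber : hanging) {
            killer.killAndReplace(serialNumber);
        }
        return hanging; // the job stays in PAUSING; the caller keeps polling for PAUSED
    }

    public static void main(String[] args) {
        List<ToeSnapshot> toes = Arrays.asList(
                new ToeSnapshot(1, 0L),        // stuck for 10 minutes
                new ToeSnapshot(2, 590_000L)); // busy for 10 seconds
        List<Integer> killed = forcePause(CrawlState.PAUSING, toes, 600_000L,
                Duration.ofMinutes(5), n -> System.out.println("kill-with-replacement: toe " + n));
        System.out.println("Killed and replaced: " + killed); // prints "Killed and replaced: [1]"
    }
}
```

      The guard on PAUSING mirrors the suggestion that the button only be offered after a normal Pause has been attempted. The threshold is left as a parameter because a sensible value would need to be chosen from experience with real hanging jobs.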




            Assignee: Unassigned
            Reporter: Colin Rosenthal (csr)