  1. NetarchiveSuite
  2. NAS-2852

Umbra Needs to Have its State Reset for Each Job


Details

    Description

      If HarvestController/Heritrix dies for any reason (e.g. quota reached, power outage), any URLs remaining on the umbra input queue will be picked up by umbra when it is restarted, so it will not take new URLs until all of these have been processed. Our solution is to create a hook that runs a cleanup script before Heritrix is launched. The script could do anything, but the most obvious use is to drain the umbra queue and optionally restart umbra. NB: this means that any given umbra instance must be used by only one HarvestController instance.
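      A minimal sketch of what such a cleanup script might do, written in Python. Everything named here is hypothetical: `drain_queue`, the `fetch_next` callable and the in-memory stand-in for the queue are illustrative assumptions, not part of NetarchiveSuite or umbra. In a real deployment `fetch_next` would wrap whatever client call fetches and acknowledges one message from the umbra input queue.

```python
from collections import deque
from typing import Callable, Optional


def drain_queue(fetch_next: Callable[[], Optional[str]], max_items: int = 100_000) -> int:
    """Discard queued URLs until the queue is empty; return how many were removed.

    fetch_next (hypothetical) removes and returns one queued URL, or None
    when the queue is empty. max_items is a safety cap so the hook cannot
    loop forever if something keeps publishing to the queue.
    """
    removed = 0
    while removed < max_items:
        if fetch_next() is None:
            break
        removed += 1
    return removed


# In-memory stand-in for the umbra input queue, for illustration only.
stale = deque(["http://example.org/a", "http://example.org/b"])
removed = drain_queue(lambda: stale.popleft() if stale else None)
print(f"drained {removed} stale URLs")
```

      The cap on `max_items` is a design choice for a pre-launch hook: it is probably better to start the job with a partly drained queue than to block the launch indefinitely.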

      A future refinement of the script might allow one to inspect the items in the queue before removing them and only remove those belonging to the given HarvestController instance.
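      The selective variant could look roughly like the sketch below. It assumes each queued item carries some marker identifying the HarvestController instance that submitted it; the `owner` field, the instance names and `remove_own_items` are invented for illustration, and the real umbra message format may not contain such a field.

```python
from collections import deque
from typing import Callable, Deque, Dict, List

Item = Dict[str, str]


def remove_own_items(queue: Deque[Item], is_ours: Callable[[Item], bool]) -> List[Item]:
    """Remove the items for which is_ours returns True and return them.

    Each item is taken off the front once; items that belong to other
    instances are put back at the end, so after one full pass they are
    still in their original relative order.
    """
    removed: List[Item] = []
    for _ in range(len(queue)):
        item = queue.popleft()
        if is_ours(item):
            removed.append(item)
        else:
            queue.append(item)
    return removed


# Hypothetical queue contents: "owner" is an assumed field, not a known umbra one.
pending = deque([
    {"url": "http://example.org/a", "owner": "hc1"},
    {"url": "http://example.org/b", "owner": "hc2"},
    {"url": "http://example.org/c", "owner": "hc1"},
])
dropped = remove_own_items(pending, lambda item: item["owner"] == "hc1")
print(f"removed {len(dropped)} items, {len(pending)} left on the queue")
```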


          People

            csr Colin Rosenthal
            csr Colin Rosenthal
            Tue Hejlskov Larsen
            Watchers: 1
