Details
Bug
Resolution: Unresolved
Major
Description
If HarvestController/Heritrix dies for any reason (e.g. quota reached, power outage), any URLs remaining on the umbra input queue will be picked up by umbra when it is restarted, so it will not take new URLs until all of these have been processed. Our solution is to create a hook that runs a cleanup script before Heritrix is launched. The script could do anything, but the most obvious use is to drain the umbra queue and optionally restart umbra. NB: because the script drains the whole queue, any given umbra instance must be used by only one HarvestController instance.
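The drain-then-launch order could be sketched roughly as follows. This is an illustration only: an in-memory `queue.Queue` stands in for the umbra input queue (a real script would talk to whatever message broker umbra is configured with), and `drain` and `launch_with_cleanup` are hypothetical names, not part of HarvestController or umbra.

```python
import queue
from typing import Any, Callable, List


def drain(pending: "queue.Queue[Any]") -> List[Any]:
    """Remove and return every item currently waiting on the queue."""
    removed: List[Any] = []
    while True:
        try:
            removed.append(pending.get_nowait())
        except queue.Empty:
            # Nothing left: the queue is clean for the next harvest.
            return removed


def launch_with_cleanup(pending: "queue.Queue[Any]",
                        launch: Callable[[], None]) -> List[Any]:
    """Run the cleanup hook first, then start the crawler.

    `launch` is a placeholder for whatever actually starts Heritrix.
    """
    stale = drain(pending)
    launch()
    return stale
```

Returning the drained items lets the caller log what was discarded, which may help when diagnosing why an earlier harvest died.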
A future refinement of the script might allow one to inspect the items in the queue before removing them and only remove those belonging to the given HarvestController instance.
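A minimal sketch of that refinement, assuming each queued item carries some field identifying the HarvestController instance that submitted it. The `owner` key used here is purely hypothetical, and a `queue.Queue` again stands in for the real umbra queue.

```python
import queue
from typing import Dict, List


def drain_owned(pending: "queue.Queue[Dict[str, str]]",
                owner_id: str) -> List[Dict[str, str]]:
    """Remove only the items submitted by `owner_id`; keep the rest."""
    removed: List[Dict[str, str]] = []
    kept: List[Dict[str, str]] = []
    while True:
        try:
            item = pending.get_nowait()
        except queue.Empty:
            break
        # "owner" is an assumed field name, not a documented one.
        if item.get("owner") == owner_id:
            removed.append(item)
        else:
            kept.append(item)
    # Put back the other instances' items in their original order.
    for item in kept:
        pending.put(item)
    return removed
```

Note that taking items off and putting them back is not atomic: with a real broker, another consumer could see the queue while it is partly empty, so the single-instance restriction would still be the safer default.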