...

We will need to modify the Selective Harvest definition part of the GUI to enable selection/deselection of Umbra, for example via a drop-down of known channel names, a tickbox, or checkboxes. How do we know which selective channel is the default and which is the "other one", i.e. Umbra? By convention (e.g. the channel name contains "umbra") or by configuration (e.g. a startup setting for HarvestJobManager)?
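To make the convention-vs-configuration question concrete, here is a minimal sketch in Python (the real implementation would live in the Java code base). The function name, the `configured_name` parameter and the example channel names are all hypothetical. It assumes an explicit setting wins when present and that the naming convention is only a fallback.

```python
from typing import Optional, Sequence


def find_umbra_channel(channel_names: Sequence[str],
                       configured_name: Optional[str] = None) -> Optional[str]:
    """Return the name of the Umbra channel, or None if there is none.

    configured_name stands in for a hypothetical startup setting; when it is
    absent we fall back to the "name contains umbra" convention.
    """
    if configured_name is not None:
        if configured_name not in channel_names:
            raise ValueError(f"configured Umbra channel {configured_name!r} is unknown")
        return configured_name

    # Convention: any channel whose name contains "umbra" (case-insensitive).
    matches = [name for name in channel_names if "umbra" in name.lower()]
    if len(matches) > 1:
        raise ValueError(f"ambiguous Umbra channels: {matches}")
    return matches[0] if matches else None
```

One point in favour of configuration is that the convention becomes ambiguous as soon as two channel names contain "umbra"; the sketch raises an error in that case rather than guessing.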

HarvestController Settings

...

Here we may choose to use our current templating approach, whereby we push values into the cxml file. Alternatively, we could use XML DOM processing, e.g. with XPath. The test cxml https://github.com/netarchivesuite/netarchivesuite-umbra-docker/blob/master/heritrix/umbra.cxml provides a good template for which beans and parameters we need to add, and where. IIRC (please check!) the current logic is that HarvestJobManager pushes all the template values and then deletes any unused placeholders. That won't work if we want the HarvestController to fill in some of the placeholders later. So either we delegate deletion of unused placeholders to HarvestController, or we use the DOM-processing approach.
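The DOM-processing option could look roughly like the sketch below. It is written in Python for brevity; the same XPath expressions would carry over to Java. The bean id `umbraPublisher`, its class, the property names `amqpUri` and `clientId`, and the AMQP URI are illustrative assumptions, not copied from umbra.cxml, and should be checked against that file. The sketch only assumes that the cxml is a Spring beans document.

```python
import xml.etree.ElementTree as ET

BEANS_NS = "http://www.springframework.org/schema/beans"
NS = {"b": BEANS_NS}

# Cut-down stand-in for a cxml file. Bean id, class and property names are
# hypothetical and only serve to show the mechanics.
SAMPLE_CXML = """<beans xmlns="http://www.springframework.org/schema/beans">
  <bean id="umbraPublisher" class="example.HypotheticalPublisher">
    <property name="amqpUri" value="TO_BE_SET"/>
    <property name="clientId" value="TO_BE_SET"/>
  </bean>
</beans>
"""


def set_bean_property(root: ET.Element, bean_id: str, prop_name: str, value: str) -> None:
    """Set the value attribute of one <property> inside one <bean>."""
    xpath = f".//b:bean[@id='{bean_id}']/b:property[@name='{prop_name}']"
    prop = root.find(xpath, NS)
    if prop is None:
        raise KeyError(f"no property {prop_name!r} on bean {bean_id!r}")
    prop.set("value", value)


def apply_job_settings(cxml_text: str, amqp_uri: str, channel: str) -> str:
    """Return the cxml text with the job-specific values filled in."""
    # Keep the beans namespace as the default namespace when serialising.
    ET.register_namespace("", BEANS_NS)
    root = ET.fromstring(cxml_text)
    set_bean_property(root, "umbraPublisher", "amqpUri", amqp_uri)
    set_bean_property(root, "umbraPublisher", "clientId", channel)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(apply_job_settings(SAMPLE_CXML, "amqp://guest:guest@localhost:5672/%2f", "job-42"))
```

With this approach nothing in the file needs to survive as a placeholder between HarvestJobManager and HarvestController: each component locates the attributes it owns by XPath and overwrites them, and a missing bean or property fails loudly instead of being silently dropped.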

Job Isolation and Cleaning-up of Umbra/RabbitMQ

Job isolation means making sure that URLs discovered by a specific Umbra harvest are only returned to the job which initiated them. The test profile https://github.com/netarchivesuite/netarchivesuite-umbra-docker/blob/master/heritrix/umbra.cxml shows how to define a specific RabbitMQ channel. If we make this channel unique per job (e.g. by just using the harvest job number) then we should be safe. But how do we prevent the broker from accumulating old queues, possibly with data still on them? This is maybe not something to worry about for now.
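As a sketch of how per-job naming and clean-up might fit together: RabbitMQ can delete queues that have been unused for a given time, either through the `x-expires` queue argument or through a broker policy with an `expires` key. The policy route has the attraction that neither Heritrix nor Umbra needs to change how it declares queues. The prefix `nas-umbra-job-` and both function names below are hypothetical, and whether a broker policy suits our setup is an assumption that still needs checking.

```python
import re

# Hypothetical naming convention: one channel/queue per harvest job.
QUEUE_PREFIX = "nas-umbra-job-"


def job_channel_name(job_id: int) -> str:
    """Derive a channel name that is unique per harvest job."""
    if job_id < 0:
        raise ValueError("job id must be non-negative")
    return f"{QUEUE_PREFIX}{job_id}"


def expiry_policy(idle_ms: int) -> dict:
    """Describe a RabbitMQ policy that expires idle per-job queues.

    The pattern matches only queues following the naming convention above;
    "expires" is the idle time in milliseconds after which the broker
    deletes an unused queue.
    """
    return {
        "pattern": f"^{QUEUE_PREFIX}\\d+$",
        "definition": {"expires": idle_ms},
        "apply-to": "queues",
    }


if __name__ == "__main__":
    print(job_channel_name(42))
    print(expiry_policy(24 * 60 * 60 * 1000))  # one day of inactivity
```

Such a policy would be set once on the broker (for example with `rabbitmqctl set_policy`), so old queues, including any data left on them, disappear without either client having to tidy up.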

Umbra Deployment Architecture

Now that we know how to do job isolation, running one Umbra instance per harvest machine has some advantages. In particular, every HarvestController can use the same Umbra endpoint configuration, "localhost:5672".

Umbra Configuration

The question of native vs. Docker is less critical if we are just going to have one long-running Umbra instance per machine. But Docker has the advantage of letting us easily manage and deploy identical configurations to multiple machines.