All the configuration is in the NetarchiveSuite settings for the HarvestControllerApplication. A typical example might look like this:
<settings>
  <common> . . </common>
  <harvester>
    <harvesting>
      <channel>UMBRA</channel>
      <heritrix> . .
      <metadata> . . </metadata>
      <serverDir>harvester_umbra_1</serverDir>
      <umbra>
        <isEnabled>true</isEnabled>
        <rabbitmqUrl>amqp://guest:guest@localhost:8998/%2f</rabbitmqUrl>
        <hopsShouldProcess>^$|.*L</hopsShouldProcess>
      </umbra>
    </harvesting>
  </harvester>
</settings>
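For orientation, the rabbitmqUrl value in the example follows the usual AMQP URI layout, in which the trailing %2f is the URL-encoded name of the default RabbitMQ virtual host "/". The following is a minimal Python sketch, using only the standard library, that splits such a URL into its parts. The helper name describe_amqp_url is invented for illustration and is not part of NetarchiveSuite or Umbra.

```python
from urllib.parse import urlsplit, unquote


def describe_amqp_url(url):
    """Split an AMQP URL into its components (illustrative helper, not NAS code)."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "username": parts.username,
        "password": parts.password,
        "host": parts.hostname,
        "port": parts.port,
        # Everything after the leading "/" is the percent-encoded virtual host.
        "vhost": unquote(parts.path[1:]),
    }


info = describe_amqp_url("amqp://guest:guest@localhost:8998/%2f")
print(info)
```

A check like this can help confirm that the host, port and credentials in the setting are the ones the broker actually expects before starting a harvest.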
The complete list of relevant settings is:
|channel|There needs to be a specific harvest channel for Umbra harvests defined in the NetarchiveSuite GUI. This channel name should then be used by any HarvestControllerApplication which is intended to receive jobs for Umbra.|
|isEnabled|A flag indicating whether this HarvestControllerApplication instance is intended for Umbra harvesting. The default value is |
|rabbitmqUrl|The connection URL for the TCP socket connection to the RabbitMQ broker to which Umbra listens. The URL should include the username and password for the broker. Only basic authentication is supported.|
|hopsShouldProcess|This parameter is a regex applied to the Heritrix Discovery Path to limit which URLs should be sent from Heritrix to Umbra. The default value "^$|.*L" limits this to the empty string (harvest seeds) or any string ending in "L", that is, any link. With this choice, Umbra is only used for actual webpages such as an end user might load by clicking on a hyperlink.|
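The effect of the default hopsShouldProcess value can be tried out on a few sample discovery paths. The sketch below is in Python rather than the Java used by Heritrix, and it assumes that the pattern has to match the whole discovery path, which is what the "ending in L" description implies. The function name should_send_to_umbra is made up for this example.

```python
import re

# Default value of hopsShouldProcess, as given in the settings table.
HOPS_SHOULD_PROCESS = re.compile(r"^$|.*L")


def should_send_to_umbra(discovery_path):
    """Return True if the whole discovery path matches the pattern.

    Assumption: the regex is applied as a full match, not a substring search.
    """
    return HOPS_SHOULD_PROCESS.fullmatch(discovery_path) is not None


# "" is a seed; L = link, E = embed, X = speculative embed, R = redirect.
for path in ["", "L", "LLL", "LE", "LLX", "R"]:
    print(repr(path), should_send_to_umbra(path))
```

Under that assumption, seeds and paths whose last hop is a link are passed on, while embeds, speculative embeds and redirects are not.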
Putting It All Together
To make it all work, one needs to:
Choose an Installation Architecture for Umbra Itself
The Umbra documentation describes how to use Python pip to install Umbra on a single server. (There is some discussion of early experiments with this at the Danish Royal Library here.) An alternative approach is a deployment based on Docker Compose: https://github.com/netarchivesuite/netarchivesuite-umbra-docker/blob/master/umbra/Dockerfile. Although not formally "supported", it seems to work well. (There have also been some experiments with deploying Umbra in the cloud using Elastic Beanstalk; see https://github.com/netarchivesuite/netarchivesuite-umbra-docker/tree/elastic_beanstalk and A Novice Learns About Amazon Web Services.) Here is how we installed the basic software in DK.
The choice of Umbra architecture will depend on your system requirements. Some possibilities are
- A single instance of Umbra used by all umbra-enabled harvesters
- One Umbra per HarvestControllerServer instance
- One Umbra per harvesting machine (possibly running several instances of HarvestControllerServer)
At the Danish Royal Library we have tested with a One-Umbra-Per-Machine setup.
Add an Umbra queue to the NetarchiveSuite GUI
This uses longstanding functionality in the NAS GUI. Just add a new channel for Umbra harvesting, for example one named UMBRA as in the settings example above.
Add one or more umbra-enabled HarvestControllerServer instances to your NAS distribution
As described above, configure one or more of your existing or new HarvestControllerServer instances to listen to the channel you created, and give each the necessary connection information for one of your Umbra instances.
Add the four umbra-related placeholders to any harvest templates to be used in Umbra harvests
The placeholders are documented in Appendix B2: Managing Heritrix 3 Crawler-Beans. Note that for non-umbra-enabled HarvestControllerServers these new placeholders will be silently removed before starting the harvest. Therefore you can safely add these placeholders to all your harvest templates, whether or not you are currently planning to use them all for Umbra harvests.
Map some or all of your harvests to the Umbra channel
Use the existing Harvest Channel Mapping section of the NAS GUI to send specific harvests to the Umbra channel.
To check that Umbra is functioning, look in the crawl log of an umbra-enabled job for the strings "sentToAMQP" and "receivedFromAMQP" which indicate which URLs were sent to Umbra and which were received by Heritrix after being found by Umbra.
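This check can be scripted. The following is a minimal Python sketch that counts how many log lines contain each of the two strings. The sample lines are invented stand-ins, since real crawl log entries carry many more fields, and the function name count_umbra_markers is hypothetical.

```python
MARKERS = ("sentToAMQP", "receivedFromAMQP")


def count_umbra_markers(lines):
    """Count how many lines contain each Umbra-related marker string."""
    counts = {marker: 0 for marker in MARKERS}
    for line in lines:
        for marker in MARKERS:
            if marker in line:
                counts[marker] += 1
    return counts


# Made-up stand-in lines, not real Heritrix crawl log output.
sample = [
    "http://example.org/ ... sentToAMQP",
    "http://example.org/a.js ... receivedFromAMQP",
    "http://example.org/b.png ...",
]
print(count_umbra_markers(sample))
```

For a real job, an open file object for the job's crawl log can be passed in place of the sample list; where that file lives depends on your installation. If both counts are zero for an umbra-enabled job, Heritrix and Umbra are probably not exchanging URLs.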
- Currently it is only possible to map an entire HarvestDefinition to Umbra. Ideally one would have a more fine-grained approach whereby specific HarvestConfigurations in a given HarvestDefinition would be sent to Umbra, i.e. there should be a mapping from the pair (HarvestConfiguration, HarvestDefinition) to Harvest Channel, and this mapping would be configurable in the page for editing the HarvestDefinition. This could be implemented as an override to the current behaviour, i.e. the more specific mapping would "win" over the per-HarvestDefinition mapping.
- The hopsShouldProcess string is currently defined by the HarvestControllerServer settings, so it is the same for all harvests on a given HarvestControllerServer. This too should ideally be definable for each (HarvestConfiguration, HarvestDefinition) pair, perhaps implemented as an override to a default value.
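The override behaviour proposed in these two points could be sketched as follows. This is not existing NetarchiveSuite functionality; it only illustrates the suggested lookup order, in Python rather than Java. The function name resolve_setting and all harvest, configuration and channel names are invented.

```python
def resolve_setting(definition, configuration, per_definition, per_pair, default=None):
    """Return the most specific value available.

    Lookup order: (configuration, definition) override, then the
    per-definition value, then the default.
    """
    if (configuration, definition) in per_pair:
        return per_pair[(configuration, definition)]
    return per_definition.get(definition, default)


# Hypothetical channel mappings.
channel_by_definition = {"news-sites": "UMBRA"}
channel_by_pair = {("static-config", "news-sites"): "FOCUSED"}

# No pair override: falls back to the per-definition channel.
print(resolve_setting("news-sites", "dynamic-config", channel_by_definition, channel_by_pair))
# Pair override present: the more specific mapping wins.
print(resolve_setting("news-sites", "static-config", channel_by_definition, channel_by_pair))

# The same lookup applied to hopsShouldProcess, with a server-wide default.
hops_by_pair = {("dynamic-config", "news-sites"): ".*"}
print(resolve_setting("news-sites", "static-config", {}, hops_by_pair, default="^$|.*L"))
```

Using one lookup for both the channel and the hopsShouldProcess value keeps the two overrides consistent with each other.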
As usual, the NetarchiveSuite development team welcomes external contributions to the codebase!