The Issue: Suppose we have a HarvestDefinition (HD) based on some logical curatorial category (e.g. newspaper front pages). The curators may wish to send only some of the HarvestConfigurations (HC) in this definition to a particular JobQueue (JQ); for example, some especially difficult domains should go to a browser-based harvester. Currently this isn't possible because the only mapping available is HD<->JQ. [The workaround is to split the HD into two HDs.]
The Solution: add a new set of mappings HD<->HC<->JQ, stored in a new three-column database table. When a HarvestJobManager schedules an HD, it groups the HCs according to their JQ mappings and, if necessary, creates multiple jobs to send to different JQs. For each HC, the triple mapping (HD, HC, JQ) takes precedence over the double mapping (HD, JQ); if both are null, the default JQ is used.
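The precedence rule and the grouping step could be sketched roughly as follows. This is only an illustration, not existing code: the class JobQueueResolver, its methods, the DEFAULT_QUEUE constant and the use of plain strings for HC and JQ identifiers are all hypothetical names chosen for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch; none of these names come from the existing codebase. */
final class JobQueueResolver {

    /** Assumed placeholder name for the default JQ. */
    static final String DEFAULT_QUEUE = "DEFAULT";

    /**
     * Picks the queue for a single HC.
     * tripleQueue is the JQ from the (HD, HC, JQ) mapping, or null if unset.
     * definitionQueue is the JQ from the (HD, JQ) mapping, or null if unset.
     */
    static String resolveQueue(String tripleQueue, String definitionQueue) {
        if (tripleQueue != null) {
            return tripleQueue;      // triple mapping takes precedence
        }
        if (definitionQueue != null) {
            return definitionQueue;  // fall back to the HD-level mapping
        }
        return DEFAULT_QUEUE;        // both null: use the default JQ
    }

    /**
     * Groups the HCs of one HD by resolved queue. Each group would become
     * (at least) one job sent to that queue.
     * tripleMappings maps an HC identifier to its JQ for this HD.
     */
    static Map<String, List<String>> groupByQueue(List<String> configurations,
                                                  Map<String, String> tripleMappings,
                                                  String definitionQueue) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (String hc : configurations) {
            String queue = resolveQueue(tripleMappings.get(hc), definitionQueue);
            groups.computeIfAbsent(queue, k -> new ArrayList<>()).add(hc);
        }
        return groups;
    }
}
```

With an HD-level mapping to one queue and a single triple sending one difficult domain to a browser-based queue, groupByQueue would return two groups, so the HD would be split into two jobs.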
The solution will require:
- a database schema modification
- new Entity/DAO classes to implement CRUD functionality for the triples
- new GUI elements on the Selective Harvest Definition page to enable setting/unsetting the triples (and also the HD<->JQ mapping which currently is set on a separate page)
- altered logic in HarvestJobManager to split HDs into multiple jobs (perhaps using the current mechanism which splits on various other fields)
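
As a rough illustration of the set/unset operations the new DAO would need, here is an in-memory stand-in. Everything in it is an assumption made for the example: the class name InMemoryQueueMappingDao, the numeric HD and HC identifiers, and in particular the idea that each (HD, HC) pair maps to at most one JQ. A real implementation would go through the database table rather than a map.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

/** Hypothetical in-memory stand-in for the triple-mapping DAO. */
final class InMemoryQueueMappingDao {

    /** Composite key (HD id, HC id); assumes one JQ per pair. */
    private static final class Key {
        final long definitionId;
        final long configurationId;

        Key(long definitionId, long configurationId) {
            this.definitionId = definitionId;
            this.configurationId = configurationId;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Key)) {
                return false;
            }
            Key other = (Key) o;
            return definitionId == other.definitionId
                    && configurationId == other.configurationId;
        }

        @Override
        public int hashCode() {
            return Objects.hash(definitionId, configurationId);
        }
    }

    private final Map<Key, String> rows = new HashMap<>();

    /** Creates or updates the triple (HD, HC, JQ). */
    void set(long definitionId, long configurationId, String queue) {
        rows.put(new Key(definitionId, configurationId), Objects.requireNonNull(queue));
    }

    /** Returns the mapped JQ, or null if no triple exists for this pair. */
    String get(long definitionId, long configurationId) {
        return rows.get(new Key(definitionId, configurationId));
    }

    /** Removes the triple; returns true if one was present. */
    boolean unset(long definitionId, long configurationId) {
        return rows.remove(new Key(definitionId, configurationId)) != null;
    }
}
```

Returning null from get for an unmapped pair matches the "if both are null" wording of the precedence rule, so the scheduler can fall through to the HD<->JQ mapping without a separate existence check.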