Page tree
Skip to end of metadata
Go to start of metadata

The current HARVEST_CONTROLLER_PRIORITY setting implies that the harvest controller architecture is based on a number of harvest controller instances, each with a specific priority. This does not reflect the current setup, where there only exists one 'HarvestControllerClient' which dispatches HarvestJob to the harvester via. 2 message queues. The HarvestControllerClient choses which queue to use based on the JobPriority of the concrete job. This means that the HARVEST_CONTROLLER_PRIORITY setting isn't used on the client side. On the server side the setting is used to find the relevant queue to consume messages from. So if we wish to use the name of the setting to describe the settings purpose, is should be called something like: HARVEST_SERVER_JOB_CHANNEL and the type should change to a ChannelID.

The current code is ridelled with code like:

String priority = Settings.get(
                HarvesterSettings.HARVEST_CONTROLLER_PRIORITY);
  ChannelID result;
  if (priority.equals(JobPriority.LOWPRIORITY.toString())) {
    result = Channels.getAnyLowpriorityHaco();
  } 
  else {
    if (priority.equals(JobPriority.HIGHPRIORITY.toString())) {
      result = Channels.getAnyHighpriorityHaco();
    } 
    else throw new UnknownID(priority + " is not a valid priority");
  }
NetarkivetMessage naMsg = new DoOneCrawlMessage(theJob, result, TestInfo.emptyMetadata);

Which could be replaced by the much shorter and clearer:

NetarkivetMessage naMsg = new DoOneCrawlMessage(
    theJob, Settings.getChannel(HARVEST_SERVER_JOB_CHANNEL), TestInfo.emptyMetadata);

Another renaming we might consider is renaming the HarvestControllerClient to HarvestJobDispatcher to better describe what this class is doing. It really isn't controlling anything, and using 'Client' as a marker in a distributed architecture doesn't say much, many components are acting as clients and servers, often both at the same time. The HarvestControllerServer could then be renamed to just HarvestServer. Colin mentioned that the Controller part may have come from the classes responsibility of controlling a Heritrix instance, but in my opinion this is just a ordinary delegation, which shouldn't be reflected in the classes public name (a classes name should primarily reflect the services it exposes, not how it implements them, eg. controls Heritrix to do the actual crawls).

  • No labels

4 Comments

  1. The inspiration for doing this surfaced during the NARC-17 Refactor ChannelId access work

  2. I dont't like the change from HARVEST_CONTROLLER_PRIORITY to HARVEST_SERVER_JOB_CHANNEL.

    I would rather that we changed it to HARVEST_SERVER_JOB_TYPE.

    1. I agree somewhat, the HARVEST_SERVER_JOB_TYPE name is much closer to the actual usage of the setting.

      But, as far as I can see the setting is only used on the server side to identify the channel to fetch HarvestJob messages from. The Harvest server doesn't use the setting to assume anything about the type of jobs it's suppose to handle.

      If the intended purpose of the JOB_TYPE naming is to label the HarvestServer as processing Job of a particular type, wouldn't it be better to use a dedicated option for this. This approach would avoid a defining an assumed coupling between the message channel and job type on the server side, localizing the Job -> Harvest server mapping business logic to the HarvestScheduler?

  3. Note, that this complaint has been somewhat addressed in  NAS-2212 - Getting issue details... STATUS .