Class HarvestControllerServer

  • All Implemented Interfaces:
    CleanupIF, HarvesterMessageVisitor, javax.jms.MessageListener

    public class HarvestControllerServer
    extends HarvesterMessageHandler
    implements CleanupIF
    This class responds to JMS doOneCrawl messages from the HarvestScheduler and launches a Heritrix crawl with the received job description. Once a harvest job has completed, the generated ARC files are uploaded to the bitarchives. Initially, the HarvestControllerServer registers its channel with the Scheduler by sending a HarvesterRegistrationRequest and waits for a positive HarvesterRegistrationResponse confirming that the channel is recognized. If the Scheduler does not recognize the channel, the HarvestControllerServer sends a notification to that effect and then shuts down the application.
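The registration handshake described above can be sketched as follows. This is a simplified illustration, not the real NetarchiveSuite API: RegistrationResponse and the scheduler callback are hypothetical stand-ins for HarvesterRegistrationRequest/HarvesterRegistrationResponse and the JMS round trip.

```java
import java.util.function.Function;

// Hypothetical stand-in for a positive/negative HarvesterRegistrationResponse.
class RegistrationResponse {
    final boolean channelRecognized;

    RegistrationResponse(boolean channelRecognized) {
        this.channelRecognized = channelRecognized;
    }
}

class RegistrationHandshake {
    /**
     * Registers the given channel with the scheduler and reports whether it
     * was recognized. On false, the caller is expected to send a
     * notification and then shut the application down.
     */
    static boolean register(String channelName,
                            Function<String, RegistrationResponse> scheduler) {
        return scheduler.apply(channelName).channelRecognized;
    }
}
```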

    During its operation, CrawlStatus messages are sent to the HarvestSchedulerMonitorServer. When the actual harvesting starts, a message with status 'STARTED' is sent. When the harvesting has finished, a message with either status 'DONE' or 'FAILED' is sent. A 'DONE' or 'FAILED' message with the result should ALWAYS be sent if at all possible, but only ever one such message per job. While the HarvestControllerServer is waiting for the harvesting to finish, it sends HarvesterReadyMessages to the scheduler. The interval between consecutive HarvesterReadyMessages is defined by the setting 'settings.harvester.harvesting.sendReadyDelay'.
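The periodic ready signal can be sketched with a ScheduledExecutorService. The ReadySignaller name and the Runnable message transport are illustrative assumptions; only the fixed-interval behaviour driven by the sendReadyDelay setting mirrors the text.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

class ReadySignaller {
    private final ScheduledExecutorService timer =
            Executors.newSingleThreadScheduledExecutor();
    private ScheduledFuture<?> task;

    /** Start sending the ready signal every sendReadyDelay milliseconds. */
    void start(Runnable sendReadyMessage, long sendReadyDelayMillis) {
        task = timer.scheduleAtFixedRate(sendReadyMessage, 0,
                sendReadyDelayMillis, TimeUnit.MILLISECONDS);
    }

    /** Stop signalling, e.g. once the harvesting has finished. */
    void stop() {
        if (task != null) task.cancel(false);
        timer.shutdown();
    }
}
```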

    It is necessary to be able to run the Heritrix harvester on several machines and several processes on each machine. Each instance of Heritrix is started and monitored by a HarvestControllerServer.

    Initially, all directories under serverdir are scanned for harvestinfo files. If any are found, they are parsed for information, and an upload of all remaining files to the bitarchive is attempted. A CrawlStatusMessage with status FAILED is then sent back.
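The startup scan can be sketched as below, assuming a plain directory walk; the marker file name "harvestInfo.xml" is an illustrative assumption, and parsing, uploading and status reporting are left as a comment.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

class OldJobsScanner {
    /**
     * Scans every subdirectory of serverDir and collects those that contain
     * a harvestinfo marker, i.e. leftovers from unfinished harvests.
     */
    static List<File> findUnfinishedCrawlDirs(File serverDir) {
        List<File> result = new ArrayList<>();
        File[] subDirs = serverDir.listFiles(File::isDirectory);
        if (subDirs == null) return result;
        for (File dir : subDirs) {
            // "harvestInfo.xml" is an assumed file name for this sketch.
            if (new File(dir, "harvestInfo.xml").exists()) {
                // Parse the info, attempt upload of remaining files,
                // then send a FAILED CrawlStatusMessage.
                result.add(dir);
            }
        }
        return result;
    }
}
```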

    A new thread is started for each actual crawl, and the JMS listener is removed within that thread. Threading is required because JMS does not allow the thread currently handling a message to remove its own listener.
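The pattern can be sketched as follows, with the listener plumbing and the crawl itself reduced to Runnables (CrawlStarter is an illustrative name, not the real class):

```java
class CrawlStarter {
    /**
     * Starts the crawl in its own thread. The listener is detached inside
     * the new thread, which is safe because that thread is not the one
     * currently executing the JMS onMessage callback.
     */
    static Thread startCrawl(Runnable removeListener, Runnable doCrawl) {
        Thread t = new Thread(() -> {
            removeListener.run();
            doCrawl.run();
        }, "CrawlThread");
        t.start();
        return t;
    }
}
```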

    After a harvest job has terminated, whether successfully or not, the serverdir is scanned again for harvestInfo files so that any files not yet uploaded can be uploaded. The server then resumes listening for new jobs, provided there is enough room available on the machine. If there is not, it logs a warning, which is also sent as a notification.

    • Method Detail

      • getInstance

        public static HarvestControllerServer getInstance()
                                                   throws IOFailure
        Returns or creates the unique instance of this singleton. The server creates an instance of the HarvestController, uploads ARC files from unfinished harvests, and starts listening for JMS messages on the incoming JMS queues.
        Returns:
        The instance
        Throws:
        PermissionDenied - If the serverdir or oldjobsdir can't be created
        IOFailure - if data from old harvests exist, but contain illegal data
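Stripped of the upload work and JMS wiring, the lazily created singleton can be sketched as below; the Server name and empty constructor body are simplifications of what getInstance actually does.

```java
class Server {
    private static Server instance;

    private Server() {
        // Create serverdir/oldjobsdir, scan for unfinished harvests,
        // upload leftover ARC files, attach JMS listeners...
    }

    /** Returns the unique instance, creating it on first call. */
    static synchronized Server getInstance() {
        if (instance == null) instance = new Server();
        return instance;
    }
}
```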
      • close

        public void close()
        Releases all JMS connections and closes the controller.
      • visit

        public void visit​(DoOneCrawlMessage msg)
                   throws IOFailure,
                          UnknownID,
                          ArgumentNotValid,
                          PermissionDenied
        Checks that we're available to do a crawl, and if so, marks us as unavailable, checks that the job message is well-formed, and starts the thread that the crawl happens in. If an error occurs starting the crawl, we will start listening for messages again.

        The sequence of actions involved in a crawl are:
        1. If we are already running, resend the job to the queue and return
        2. Check the job for validity
        3. Send a CrawlStatus message that crawl has STARTED
        In a separate thread:
        4. Unregister this HACO as listener
        5. Create a new crawldir (based on the JobID and a timestamp)
        6. Write a harvestInfoFile (using JobID and crawldir) and metadata
        7. Instantiate a new HeritrixLauncher
        8. Start a crawl
        9. Store the generated arc-files and metadata in the known bit-archives
        10. _Always_ send CrawlStatus DONE or FAILED
        11. Move crawldir into oldJobs dir
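The in-thread part of the sequence (steps 4-11) can be sketched as a run() method whose finally block guarantees that exactly one DONE or FAILED status is sent, mirroring step 10. All collaborators are reduced to Runnables and a Consumer; the names are illustrative, not the real NetarchiveSuite API.

```java
import java.util.function.Consumer;

class CrawlTask implements Runnable {
    final Runnable unregisterListener, crawl, storeResults, moveToOldJobs;
    final Consumer<String> sendStatus;

    CrawlTask(Runnable unregisterListener, Runnable crawl, Runnable storeResults,
              Runnable moveToOldJobs, Consumer<String> sendStatus) {
        this.unregisterListener = unregisterListener;
        this.crawl = crawl;
        this.storeResults = storeResults;
        this.moveToOldJobs = moveToOldJobs;
        this.sendStatus = sendStatus;
    }

    public void run() {
        String status = "FAILED";
        unregisterListener.run();       // step 4
        try {
            crawl.run();                // steps 5-8, condensed into one call
            storeResults.run();         // step 9
            status = "DONE";
        } finally {
            sendStatus.accept(status);  // step 10: always sent, exactly once
            moveToOldJobs.run();        // step 11
        }
    }
}
```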

        Specified by:
        visit in interface HarvesterMessageVisitor
        Overrides:
        visit in class HarvesterMessageHandler
        Parameters:
        msg - The crawl job
        Throws:
        IOFailure - On trouble harvesting, uploading or processing harvestInfo
        UnknownID - if jobID is null in the message
        ArgumentNotValid - if the status of the job is not valid - must be SUBMITTED
        PermissionDenied - if the crawldir can't be created
      • sendErrorMessage

        public void sendErrorMessage​(long jobID,
                                     String message,
                                     String detailedMessage)
        Sends a CrawlStatusMessage for a failed job with the given short message and detailed message.
        Parameters:
        jobID - ID of the job that failed
        message - A short message indicating what went wrong
        detailedMessage - A longer message explaining in detail why it went wrong.
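How such a failure report might be assembled can be sketched as below; FailedStatus is a hypothetical stand-in for a CrawlStatusMessage carrying status FAILED, not the real message class.

```java
// Hypothetical stand-in for a FAILED CrawlStatusMessage.
class FailedStatus {
    final long jobID;
    final String message;
    final String detailedMessage;

    FailedStatus(long jobID, String message, String detailedMessage) {
        this.jobID = jobID;
        this.message = message;
        this.detailedMessage = detailedMessage;
    }

    /** One-line summary suitable for a log or notification. */
    String summary() {
        return "Job " + jobID + " FAILED: " + message;
    }
}
```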