Class HarvestControllerServer
- java.lang.Object
  - dk.netarkivet.harvester.distribute.HarvesterMessageHandler
    - dk.netarkivet.harvester.heritrix3.HarvestControllerServer

- All Implemented Interfaces:
CleanupIF, HarvesterMessageVisitor, javax.jms.MessageListener
public class HarvestControllerServer extends HarvesterMessageHandler implements CleanupIF
This class responds to JMS doOneCrawl messages from the HarvestScheduler and launches a Heritrix crawl with the received job description. The generated ARC files are uploaded to the bitarchives once a harvest job has been completed.

Initially, the HarvestControllerServer registers its channel with the Scheduler by sending a HarvesterRegistrationRequest and waits for a positive HarvesterRegistrationResponse confirming that its channel is recognized. If the channel is not recognized by the Scheduler, the HarvestControllerServer sends a notification about this and then closes down the application.

During its operation, CrawlStatus messages are sent to the HarvestSchedulerMonitorServer. When the actual harvesting starts, a message is sent with status 'STARTED'. When the harvesting has finished, a message is sent with either status 'DONE' or 'FAILED'. A 'DONE' or 'FAILED' message with the result should ALWAYS be sent if at all possible, but only ever one such message per job.

While the HarvestControllerServer is waiting for the harvesting to finish, it sends HarvesterReadyMessages to the scheduler. The interval between each HarvesterReadyMessage is defined by the setting 'settings.harvester.harvesting.sendReadyDelay'.
It is necessary to be able to run the Heritrix harvester on several machines and several processes on each machine. Each instance of Heritrix is started and monitored by a HarvestControllerServer.
Initially, all directories under serverdir are scanned for harvestinfo files. If any are found, they are parsed for information, and an upload of all remaining files to the bitarchive is attempted. The server then sends back a CrawlStatusMessage with status FAILED.
A new thread is started for each actual crawl, in which the JMS listener is removed. Threading is required since JMS will not let the called thread remove the listener that's being handled.
After a harvest job has terminated, successfully or not, the serverdir is again scanned for harvestInfo files so that files not yet uploaded can be uploaded. The server then resumes listening for new jobs, provided there is enough room available on the machine. If not, it logs a warning, which is also sent as a notification.
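The ready-signalling described above can be sketched as a simple heartbeat loop. The class below (ReadyBeacon) and its methods are hypothetical stand-ins, not the actual NetarchiveSuite API; in the real application the interval comes from 'settings.harvester.harvesting.sendReadyDelay' and the message is a real HarvesterReadyMessage on a JMS channel.

```java
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Hypothetical sketch of the HarvesterReadyMessage heartbeat: while waiting,
 * a ready message is sent to the scheduler at a fixed interval
 * (settings.harvester.harvesting.sendReadyDelay in the real application).
 */
public class ReadyBeacon {
    private final long sendReadyDelayMillis;
    private final AtomicInteger sentCount = new AtomicInteger();
    private volatile boolean active = true;

    public ReadyBeacon(long sendReadyDelayMillis) {
        this.sendReadyDelayMillis = sendReadyDelayMillis;
    }

    /** Stand-in for sending one HarvesterReadyMessage on the JMS channel. */
    private void sendReady() {
        sentCount.incrementAndGet();
    }

    /** Sends up to maxBeats ready messages, pausing sendReadyDelay between them. */
    public void run(int maxBeats) {
        for (int i = 0; i < maxBeats && active; i++) {
            sendReady();
            try {
                Thread.sleep(sendReadyDelayMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
    }

    /** Called when the harvest finishes, so no further ready messages are sent. */
    public void stop() {
        active = false;
    }

    public int sent() {
        return sentCount.get();
    }
}
```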
-
-
Field Summary
static ChannelID HARVEST_CHAN_VALID_RESP_ID
The JMS channel on which to listen for HarvesterRegistrationResponses.
-
Method Summary
void cleanup()
Will be called on shutdown.
void close()
Releases all JMS connections.
static HarvestControllerServer getInstance()
Returns or creates the unique instance of this singleton. The server creates an instance of the HarvestController, uploads arc-files from unfinished harvests, and starts to listen for JMS messages on the incoming JMS queues.
void sendErrorMessage(long jobID, String message, String detailedMessage)
Sends a CrawlStatusMessage for a failed job with the given short message and detailed message.
void visit(DoOneCrawlMessage msg)
Checks that we're available to do a crawl, and if so, marks us as unavailable, checks that the job message is well-formed, and starts the thread that the crawl happens in.
void visit(HarvesterRegistrationResponse msg)
This method should be overridden and implemented by a subclass if message handling is wanted.
Field Detail
-
HARVEST_CHAN_VALID_RESP_ID
public static final ChannelID HARVEST_CHAN_VALID_RESP_ID
The JMS channel on which to listen for HarvesterRegistrationResponses.
-
-
Method Detail
-
getInstance
public static HarvestControllerServer getInstance() throws IOFailure
Returns or creates the unique instance of this singleton. The server creates an instance of the HarvestController, uploads arc-files from unfinished harvests, and starts to listen for JMS messages on the incoming JMS queues.
- Returns:
The instance
- Throws:
PermissionDenied - If the serverdir or oldjobsdir can't be created
IOFailure - If data from old harvests exist, but contain illegal data
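The lazy-singleton pattern behind getInstance() can be sketched as follows. The class below is a simplified stand-in, not the real HarvestControllerServer; the real constructor additionally scans serverdir, uploads leftover arc-files, and starts JMS listening, as described above.

```java
/**
 * Minimal sketch of the singleton pattern used by getInstance(): the first
 * call constructs the server; every later call returns the same instance.
 */
public class SingletonSketch {
    private static SingletonSketch instance;

    private SingletonSketch() {
        // Real constructor: create HarvestController, upload arc-files from
        // unfinished harvests, start listening on the incoming JMS queues.
    }

    /** Returns or creates the unique instance; synchronized for thread safety. */
    public static synchronized SingletonSketch getInstance() {
        if (instance == null) {
            instance = new SingletonSketch();
        }
        return instance;
    }
}
```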
-
close
public void close()
Releases all JMS connections and closes the Controller.
-
cleanup
public void cleanup()
Will be called on shutdown.
- Specified by:
cleanup in interface CleanupIF
- See Also:
CleanupIF.cleanup()
-
visit
public void visit(HarvesterRegistrationResponse msg)
Description copied from class: HarvesterMessageHandler
This method should be overridden and implemented by a subclass if message handling is wanted.
- Specified by:
visit in interface HarvesterMessageVisitor
- Overrides:
visit in class HarvesterMessageHandler
- Parameters:
msg - a HarvesterRegistrationResponse
-
visit
public void visit(DoOneCrawlMessage msg) throws IOFailure, UnknownID, ArgumentNotValid, PermissionDenied
Checks that we're available to do a crawl, and if so, marks us as unavailable, checks that the job message is well-formed, and starts the thread that the crawl happens in. If an error occurs starting the crawl, we will start listening for messages again.

The sequence of actions involved in a crawl is:
1. If we are already running, resend the job to the queue and return
2. Check the job for validity
3. Send a CrawlStatus message that the crawl has STARTED
In a separate thread:
4. Unregister this HACO as listener
5. Create a new crawldir (based on the JobID and a timestamp)
6. Write a harvestInfoFile (using JobID and crawldir) and metadata
7. Instantiate a new HeritrixLauncher
8. Start a crawl
9. Store the generated arc-files and metadata in the known bit-archives
10. _Always_ send CrawlStatus DONE or FAILED
11. Move crawldir into oldJobs dir
- Specified by:
visit in interface HarvesterMessageVisitor
- Overrides:
visit in class HarvesterMessageHandler
- Parameters:
msg - The crawl job
- Throws:
IOFailure - On trouble harvesting, uploading or processing harvestInfo
UnknownID - If jobID is null in the message
ArgumentNotValid - If the status of the job is not valid - must be SUBMITTED
PermissionDenied - If the crawldir can't be created
- See Also:
for more details
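The control flow above, in particular the guarantee of exactly one terminal 'DONE' or 'FAILED' status per job, can be sketched as below. All names are simplified stand-ins for the real NetarchiveSuite classes (the real code sends CrawlStatusMessages over JMS and runs the crawl via HeritrixLauncher).

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Sketch of the crawl sequence in visit(DoOneCrawlMessage): STARTED is sent
 * up front, the crawl runs, and exactly one terminal status (DONE or FAILED)
 * is sent afterwards, regardless of how the crawl ends.
 */
public class CrawlSketch {
    enum Status { STARTED, DONE, FAILED }

    final List<Status> sentStatuses = new ArrayList<>();

    /** Stand-in for sending a CrawlStatusMessage to the monitor server. */
    void sendStatus(Status s) {
        sentStatuses.add(s);
    }

    /** Runs one crawl job; the Runnable stands in for the Heritrix launch. */
    void doOneCrawl(Runnable crawl) {
        sendStatus(Status.STARTED);   // step 3
        boolean failed = false;
        try {
            // Steps 4-9: unregister listener, create crawldir, write
            // harvestInfo and metadata, launch Heritrix, store arc-files.
            crawl.run();
        } catch (RuntimeException e) {
            failed = true;
        }
        // Step 10: always send exactly one DONE or FAILED.
        sendStatus(failed ? Status.FAILED : Status.DONE);
        // Step 11: move crawldir into oldJobs dir (omitted in this sketch).
    }
}
```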
-
sendErrorMessage
public void sendErrorMessage(long jobID, String message, String detailedMessage)
Sends a CrawlStatusMessage for a failed job with the given short message and detailed message.
- Parameters:
jobID - ID of the job that failed
message - A short message indicating what went wrong
detailedMessage - A more detailed message explaining why it went wrong
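The shape of the failed-status report can be illustrated with a stand-in builder. ErrorReportSketch and buildFailedStatus are hypothetical names, not the real NetarchiveSuite implementation, which constructs an actual CrawlStatusMessage and sends it over JMS.

```java
/**
 * Hypothetical stand-in showing the information sendErrorMessage packs into
 * a FAILED CrawlStatusMessage: the job ID, a short reason, and a detailed
 * reason.
 */
public class ErrorReportSketch {
    /** Builds a textual stand-in for a FAILED CrawlStatusMessage. */
    static String buildFailedStatus(long jobID, String message, String detailedMessage) {
        return "CrawlStatusMessage[job=" + jobID + ", status=FAILED, message="
                + message + ", detail=" + detailedMessage + "]";
    }
}
```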
-
-