dk.netarkivet.harvester.harvesting
Class HarvestController

java.lang.Object
  extended by dk.netarkivet.harvester.harvesting.HarvestController

public class HarvestController
extends java.lang.Object

This class handles all the things in a single harvest that are not directly related either to launching Heritrix or to handling JMS messages.


Method Summary
 void cleanup()
          Clean up this singleton, releasing the ArcRepositoryClient and removing the instance.
static HarvestController getInstance()
          Get the instance of the singleton HarvestController.
 void runHarvest(HeritrixFiles files)
          Creates the actual HeritrixLauncher instance and runs it, after the various setup files have been written.
 HarvestReport storeFiles(HeritrixFiles files, java.lang.StringBuilder errorMessage, java.util.List<java.io.File> failedFiles)
          Controls storing all files involved in a job.
 HeritrixFiles writeHarvestFiles(java.io.File crawldir, Job job, PersistentJobData.HarvestDefinitionInfo hdi, java.util.List<MetadataEntry> metadataEntries)
          Writes the files involved with a harvest.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Method Detail

getInstance

public static HarvestController getInstance()
Get the instance of the singleton HarvestController.

Returns:
The singleton instance.

cleanup

public void cleanup()
Clean up this singleton, releasing the ArcRepositoryClient and removing the instance. This instance should not be used after this method has been called. After this has been called, new calls to getInstance will return a new instance.
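
The lifecycle described above (one shared instance, discarded by cleanup() and recreated on the next getInstance() call) can be sketched as follows. This is a minimal illustration only: ResettableSingleton and its closed flag are hypothetical stand-ins, not the HarvestController source, and the flag merely takes the place of releasing the ArcRepositoryClient.

```java
// Hypothetical stand-in class; it only mirrors the documented contract.
final class ResettableSingleton {
    // The single shared instance, or null when none exists.
    private static ResettableSingleton instance;

    // Assumed flag standing in for "resources have been released".
    private boolean closed = false;

    private ResettableSingleton() { }

    /** Returns the shared instance, creating it on first use. */
    static synchronized ResettableSingleton getInstance() {
        if (instance == null) {
            instance = new ResettableSingleton();
        }
        return instance;
    }

    /** Releases resources and forgets the shared instance. */
    void cleanup() {
        synchronized (ResettableSingleton.class) {
            closed = true;
            // Only clear the slot if this object is still the registered one.
            if (instance == this) {
                instance = null;
            }
        }
    }

    boolean isClosed() {
        return closed;
    }
}
```

A caller that keeps a reference across cleanup() ends up holding a closed object, which is why the instance should be fetched again through getInstance() rather than cached.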


writeHarvestFiles

public HeritrixFiles writeHarvestFiles(java.io.File crawldir,
                                       Job job,
                                       PersistentJobData.HarvestDefinitionInfo hdi,
                                       java.util.List<MetadataEntry> metadataEntries)
Writes the files involved with a harvest. Creates the Heritrix arcs directory to ensure that this directory exists in advance.

Parameters:
crawldir - The directory that the crawl should take place in.
job - The Job object containing various harvest setup data.
hdi - The object encapsulating documentary information about the harvest.
metadataEntries - Any metadata entries sent along with the job that should be stored for later use.
Returns:
An object encapsulating where these files have been written.

runHarvest

public void runHarvest(HeritrixFiles files)
                throws ArgumentNotValid
Creates the actual HeritrixLauncher instance and runs it, after the various setup files have been written.

Parameters:
files - Description of files involved in running Heritrix. Not Null.
Throws:
ArgumentNotValid - if an argument isn't valid.
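
The "Not Null" contract above amounts to validating the argument before any work starts. The sketch below shows that pattern with plain JDK types. NotNullSketch, checkNotNull and runHarvestSketch are hypothetical names, and IllegalArgumentException stands in for ArgumentNotValid, which is not reproduced here.

```java
// Hypothetical illustration of a fail-fast null check.
final class NotNullSketch {

    /** Returns the value unchanged, or throws if it is null. */
    static <T> T checkNotNull(T value, String name) {
        if (value == null) {
            // Stand-in for ArgumentNotValid in the documented API.
            throw new IllegalArgumentException(
                    "The value of the variable '" + name + "' must not be null.");
        }
        return value;
    }

    /** Validates its argument first, then pretends to launch a crawl. */
    static String runHarvestSketch(Object files) {
        checkNotNull(files, "files");
        return "launched with " + files;
    }
}
```

Checking at the top of the method means a bad argument is reported before any crawler state has been created.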

storeFiles

public HarvestReport storeFiles(HeritrixFiles files,
                                java.lang.StringBuilder errorMessage,
                                java.util.List<java.io.File> failedFiles)
                         throws ArgumentNotValid
Controls storing all files involved in a job. The files are 1) the actual ARC files and 2) the metadata files. The crawl.log is parsed, and information for each domain is generated and stored in an AbstractHarvestReport object, which is sent along in the crawlstatusmessage. Additionally, any leftover open ARC files are closed and harvest documentation is extracted before upload starts.

Parameters:
files - The HeritrixFiles object for this crawl. Not Null.
errorMessage - A place where error messages accumulate. Not Null.
failedFiles - List of files that failed to upload. Not Null.
Returns:
An object containing info about the domains harvested.
Throws:
ArgumentNotValid - if an argument isn't valid.
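
The errorMessage and failedFiles parameters act as accumulators that the caller supplies and inspects afterwards. The sketch below illustrates that calling convention only. StoreFilesSketch, uploadAll and the Predicate-based uploader are assumptions made for the example and say nothing about how storeFiles actually uploads files.

```java
import java.io.File;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical illustration of caller-supplied accumulator parameters.
final class StoreFilesSketch {

    /**
     * Tries to upload every candidate file.
     *
     * @param candidates   files to upload
     * @param uploader     assumed upload step; returns true on success
     * @param errorMessage receives one line per failed upload
     * @param failedFiles  receives every file that could not be uploaded
     * @return the number of files uploaded successfully
     */
    static int uploadAll(List<File> candidates,
                         Predicate<File> uploader,
                         StringBuilder errorMessage,
                         List<File> failedFiles) {
        int stored = 0;
        for (File f : candidates) {
            if (uploader.test(f)) {
                stored++;
            } else {
                // Record the failure and keep going with the next file.
                errorMessage.append("Could not upload ")
                            .append(f.getName())
                            .append('\n');
                failedFiles.add(f);
            }
        }
        return stored;
    }
}
```

Because failures are collected rather than thrown, one bad file does not stop the remaining uploads, and the caller can decide afterwards how to report or retry the failed ones.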