dk.netarkivet.harvester.harvesting
Class HarvestController

java.lang.Object
  extended by dk.netarkivet.harvester.harvesting.HarvestController

public class HarvestController
extends java.lang.Object

This class handles all the things in a single harvest that are not directly related either to launching Heritrix or to handling JMS messages.


Method Summary
 void cleanup()
          Clean up this singleton, releasing the ArcRepositoryClient and removing the instance.
static StopReason findDefaultStopReason(java.io.File logFile)
          Determine from the progress statistics log whether the crawl stopped normally.
static HarvestController getInstance()
          Get the instance of the singleton HarvestController.
 void runHarvest(HeritrixFiles files)
          Creates the actual HeritrixLauncher instance and runs it, after the various setup files have been written.
 DomainHarvestReport storeFiles(HeritrixFiles files, java.lang.StringBuilder errorMessage, java.util.List<java.io.File> failedFiles)
          Controls storing all files involved in a job.
 HeritrixFiles writeHarvestFiles(java.io.File crawldir, Job job, java.util.List<MetadataEntry> metadataEntries)
          Writes the files involved in a harvest.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Method Detail

getInstance

public static HarvestController getInstance()
Get the instance of the singleton HarvestController.

Returns:
The singleton instance.

cleanup

public void cleanup()
Clean up this singleton, releasing the ArcRepositoryClient and removing the instance. This instance should not be used after this method has been called. After this has been called, new calls to getInstance will return a new instance.
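
The getInstance/cleanup contract described above can be illustrated with a minimal sketch. SingletonSketch and its isClosed method are hypothetical stand-ins, not the NetarchiveSuite class, and the real cleanup also releases the ArcRepositoryClient, which is omitted here.

```java
// Hypothetical stand-in for illustration only; not the NetarchiveSuite class.
final class SingletonSketch {
    // The single shared instance, or null if none exists yet.
    private static SingletonSketch instance;
    // Set once cleanup() has run; a closed instance should not be reused.
    private boolean closed = false;

    private SingletonSketch() { }

    /** Returns the current instance, creating one if none exists. */
    static synchronized SingletonSketch getInstance() {
        if (instance == null) {
            instance = new SingletonSketch();
        }
        return instance;
    }

    /** Marks this instance as closed and clears the shared reference. */
    void cleanup() {
        synchronized (SingletonSketch.class) {
            closed = true;
            if (instance == this) {
                instance = null;
            }
        }
    }

    /** Assumed helper so callers can see that an instance was cleaned up. */
    boolean isClosed() {
        synchronized (SingletonSketch.class) {
            return closed;
        }
    }
}
```

In this sketch, a reference obtained before cleanup() stays closed, while the next getInstance() call hands out a fresh object.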


writeHarvestFiles

public HeritrixFiles writeHarvestFiles(java.io.File crawldir,
                                       Job job,
                                       java.util.List<MetadataEntry> metadataEntries)
Writes the files involved in a harvest. Creates the Heritrix arcs directory to ensure that this directory exists in advance.

Parameters:
crawldir - The directory that the crawl should take place in.
job - The Job object containing various harvest setup data.
metadataEntries - Any metadata entries sent along with the job that should be stored for later use.
Returns:
An object encapsulating where these files have been written.

runHarvest

public void runHarvest(HeritrixFiles files)
Creates the actual HeritrixLauncher instance and runs it, after the various setup files have been written.

Parameters:
files - Description of files involved in running Heritrix.

storeFiles

public DomainHarvestReport storeFiles(HeritrixFiles files,
                                      java.lang.StringBuilder errorMessage,
                                      java.util.List<java.io.File> failedFiles)
Controls storing all files involved in a job. The files are 1) the actual ARC files and 2) the metadata files. The crawl.log is parsed, and information for each domain is generated and stored in a DomainHarvestReport object, which is sent along in the crawl status message. Additionally, any leftover open ARC files are closed and harvest documentation is extracted before upload starts.

Parameters:
files - The HeritrixFiles object for this crawl.
errorMessage - A place where error messages accumulate.
failedFiles - List of files that failed to upload.
Returns:
An object containing info about the domains harvested.
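
The errorMessage and failedFiles parameters act as accumulators that the caller supplies and inspects afterwards. The following is a minimal sketch of that calling pattern only; StoreFilesSketch, Uploader, and storeAll are hypothetical names, and nothing here reflects how the real method uploads ARC or metadata files.

```java
import java.io.File;
import java.util.List;

// Hypothetical illustration of the accumulator-parameter pattern.
final class StoreFilesSketch {
    /** Assumed stand-in for whatever performs the actual upload. */
    interface Uploader {
        void upload(File file) throws Exception;
    }

    /**
     * Tries to upload every file. Failures do not abort the loop; they are
     * recorded in errorMessage and failedFiles. Returns the success count.
     */
    static int storeAll(List<File> files, Uploader uploader,
                        StringBuilder errorMessage, List<File> failedFiles) {
        int stored = 0;
        for (File file : files) {
            try {
                uploader.upload(file);
                stored++;
            } catch (Exception e) {
                errorMessage.append("Failed to upload ")
                        .append(file.getName())
                        .append(": ")
                        .append(e.getMessage())
                        .append('\n');
                failedFiles.add(file);
            }
        }
        return stored;
    }
}
```

A caller would typically pass an empty StringBuilder and an empty list, then check after the call whether either of them has received content.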

findDefaultStopReason

public static StopReason findDefaultStopReason(java.io.File logFile)
Determine from the progress statistics log whether the crawl stopped normally.

Parameters:
logFile - A progress-statistics.log file
Returns:
StopReason.DOWNLOAD_COMPLETE for progress statistics ending with CRAWL ENDED; StopReason.DOWNLOAD_UNFINISHED otherwise, or if the file does not exist.
Throws:
ArgumentNotValid - on null argument.
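
The documented behaviour can be approximated with a short sketch. It assumes that "ending with CRAWL ENDED" means the last non-blank line of the log contains that marker, and it treats an unreadable file as unfinished. StopReasonSketch and StopReasonFinder are hypothetical stand-ins, and IllegalArgumentException replaces ArgumentNotValid so that the example needs no external library.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.List;

// Hypothetical stand-in for the two StopReason values named above.
enum StopReasonSketch { DOWNLOAD_COMPLETE, DOWNLOAD_UNFINISHED }

final class StopReasonFinder {
    // Marker text taken from the description above.
    private static final String CRAWL_ENDED = "CRAWL ENDED";

    static StopReasonSketch findDefaultStopReason(File logFile) {
        if (logFile == null) {
            // The real method throws ArgumentNotValid; this is a stand-in.
            throw new IllegalArgumentException("logFile must not be null");
        }
        if (!logFile.isFile()) {
            return StopReasonSketch.DOWNLOAD_UNFINISHED;
        }
        try {
            List<String> lines =
                    Files.readAllLines(logFile.toPath(), StandardCharsets.UTF_8);
            // Walk backwards to the last non-blank line and test it.
            for (int i = lines.size() - 1; i >= 0; i--) {
                String line = lines.get(i).trim();
                if (!line.isEmpty()) {
                    return line.contains(CRAWL_ENDED)
                            ? StopReasonSketch.DOWNLOAD_COMPLETE
                            : StopReasonSketch.DOWNLOAD_UNFINISHED;
                }
            }
            return StopReasonSketch.DOWNLOAD_UNFINISHED;
        } catch (IOException e) {
            // Assumption: an unreadable log counts as an unfinished crawl.
            return StopReasonSketch.DOWNLOAD_UNFINISHED;
        }
    }
}
```

Reading the whole file is acceptable for a sketch; for very large logs, reading only the tail of the file would be the more economical choice.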