dk.netarkivet.harvester.harvesting.report
Class AbstractHarvestReport

java.lang.Object
  extended by dk.netarkivet.harvester.harvesting.report.AbstractHarvestReport
All Implemented Interfaces:
HarvestReport, java.io.Serializable
Direct Known Subclasses:
BnfHarvestReport, LegacyHarvestReport

public abstract class AbstractHarvestReport
extends java.lang.Object
implements HarvestReport

Base implementation for a harvest report.

See Also:
Serialized Form

Nested Class Summary
static class AbstractHarvestReport.ProgressStatisticsConstants
          Strings found in the progress-statistics.log, used to devise the default stop reason for domains.
 
Constructor Summary
AbstractHarvestReport()
          Default constructor that does nothing.
AbstractHarvestReport(HeritrixFiles files)
          Constructor from Heritrix report files.
 
Method Summary
static StopReason findDefaultStopReason(java.io.File logFile)
          Determines from the progress-statistics log whether the crawl stopped normally.
 java.lang.Long getByteCount(java.lang.String domainName)
          Get the number of bytes downloaded for the given domain.
 StopReason getDefaultStopReason()
          Returns the default stop reason initially assigned to every domain.
 java.util.Set<java.lang.String> getDomainNames()
          Returns the set of domain names that are contained in hosts-report.txt (i.e. host names mapped to domains).
protected  HeritrixFiles getHeritrixFiles()
           
 java.lang.Long getObjectCount(java.lang.String domainName)
          Get the number of objects found for the given domain.
protected  DomainStats getOrCreateDomainStats(java.lang.String domainName)
          Returns the existing DomainStats object for the given domain or, if none is found, creates one with zero values.
 StopReason getStopReason(java.lang.String domainName)
          Get the StopReason for the given domain.
abstract  void postProcess(Job job)
          Post-processing happens on the scheduler side when ARC files have been uploaded.
 void preProcess(HeritrixFiles files)
          Pre-processing happens when the report is built at the end of the crawl, before the ARC files are uploaded.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

AbstractHarvestReport

public AbstractHarvestReport()
Default constructor that does nothing. The real construction is supposed to be done in the subclasses by filling out the domainStats map with crawl results.


AbstractHarvestReport

public AbstractHarvestReport(HeritrixFiles files)
Constructor from Heritrix report files. Subclasses might use a different set of Heritrix reports.

Parameters:
files - the set of Heritrix reports.
Method Detail

preProcess

public void preProcess(HeritrixFiles files)
Pre-processing happens when the report is built at the end of the crawl, before the ARC files are uploaded.

Specified by:
preProcess in interface HarvestReport

postProcess

public abstract void postProcess(Job job)
Post-processing happens on the scheduler side when ARC files have been uploaded.

Specified by:
postProcess in interface HarvestReport

getDefaultStopReason

public StopReason getDefaultStopReason()
Description copied from interface: HarvestReport
Returns the default stop reason initially assigned to every domain.

Specified by:
getDefaultStopReason in interface HarvestReport

getDomainNames

public final java.util.Set<java.lang.String> getDomainNames()
Returns the set of domain names that are contained in hosts-report.txt (i.e. host names mapped to domains).

Specified by:
getDomainNames in interface HarvestReport
Returns:
a Set of Strings

getObjectCount

public final java.lang.Long getObjectCount(java.lang.String domainName)
Get the number of objects found for the given domain.

Specified by:
getObjectCount in interface HarvestReport
Parameters:
domainName - A domain name (as given by getDomainNames())
Returns:
How many objects were collected for that domain
Throws:
ArgumentNotValid - if domainName is null or empty

getByteCount

public final java.lang.Long getByteCount(java.lang.String domainName)
Get the number of bytes downloaded for the given domain.

Specified by:
getByteCount in interface HarvestReport
Parameters:
domainName - A domain name (as given by getDomainNames())
Returns:
How many bytes were collected for that domain
Throws:
ArgumentNotValid - if domainName is null or empty

getStopReason

public final StopReason getStopReason(java.lang.String domainName)
Get the StopReason for the given domain.

Specified by:
getStopReason in interface HarvestReport
Parameters:
domainName - A domain name (as given by getDomainNames())
Returns:
the StopReason for the given domain.
Throws:
ArgumentNotValid - if domainName is null or empty
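
The per-domain accessors above (getDomainNames, getObjectCount, getByteCount) are typically used together: iterate over the domain names and query each one. The following is only a sketch of that pattern. DomainReport is a hypothetical stand-in that mirrors the documented signatures, and fromMaps is an illustrative helper, not part of the API. The null checks are defensive; this page does not state whether the accessors can return null.

```java
import java.util.Map;
import java.util.Set;

class ReportSummarySketch {
    /** Hypothetical stand-in mirroring the accessor signatures documented above. */
    interface DomainReport {
        Set<String> getDomainNames();
        Long getObjectCount(String domainName);
        Long getByteCount(String domainName);
    }

    /** Illustrative helper (an assumption): builds a report backed by two maps. */
    static DomainReport fromMaps(Map<String, Long> objects, Map<String, Long> bytes) {
        return new DomainReport() {
            public Set<String> getDomainNames() { return objects.keySet(); }
            public Long getObjectCount(String domainName) { return objects.get(domainName); }
            public Long getByteCount(String domainName) { return bytes.get(domainName); }
        };
    }

    /** Sums the byte counts of every domain in the report. */
    static long totalBytes(DomainReport report) {
        long total = 0;
        for (String domain : report.getDomainNames()) {
            Long bytes = report.getByteCount(domain);
            if (bytes != null) { // defensive: null handling is not documented
                total += bytes;
            }
        }
        return total;
    }

    /** Sums the object counts of every domain in the report. */
    static long totalObjects(DomainReport report) {
        long total = 0;
        for (String domain : report.getDomainNames()) {
            Long objects = report.getObjectCount(domain);
            if (objects != null) {
                total += objects;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        DomainReport report = fromMaps(
                Map.of("example.dk", 3L, "example.org", 2L),
                Map.of("example.dk", 2048L, "example.org", 512L));
        System.out.println(totalObjects(report) + " objects, " + totalBytes(report) + " bytes");
    }
}
```

Only names returned by getDomainNames() are passed to the accessors, which matches the parameter description "as given by getDomainNames()" and avoids the ArgumentNotValid case for null or empty names.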

getHeritrixFiles

protected HeritrixFiles getHeritrixFiles()
Returns:
the heritrixFiles

getOrCreateDomainStats

protected DomainStats getOrCreateDomainStats(java.lang.String domainName)
Returns the existing DomainStats object for the given domain or, if none is found, creates one with zero values.
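
The get-or-create behaviour described here can be sketched with a plain map. This is a minimal illustration, not the actual implementation: Stats is a hypothetical stand-in for DomainStats, its field names are assumptions, and storing the newly created object in the map is also an assumption suggested by the domainStats map mentioned in the constructor description.

```java
import java.util.HashMap;
import java.util.Map;

class DomainStatsSketch {
    /** Hypothetical stand-in for DomainStats; the field names are assumptions. */
    static final class Stats {
        long objectCount; // defaults to zero
        long byteCount;   // defaults to zero
    }

    /** Per-domain statistics, keyed by domain name. */
    private final Map<String, Stats> domainStats = new HashMap<>();

    /** Returns the stats for the domain, creating a zero-valued entry if absent. */
    Stats getOrCreate(String domainName) {
        return domainStats.computeIfAbsent(domainName, name -> new Stats());
    }

    /** Number of domains currently tracked. */
    int size() {
        return domainStats.size();
    }

    public static void main(String[] args) {
        DomainStatsSketch report = new DomainStatsSketch();
        Stats stats = report.getOrCreate("example.dk");
        stats.objectCount += 3;
        stats.byteCount += 2048;
        // The second call returns the same object rather than a fresh one.
        System.out.println(report.getOrCreate("example.dk").byteCount);
    }
}
```

With this pattern a subclass can accumulate counts while parsing crawl results without first checking whether a domain has been seen.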


findDefaultStopReason

public static StopReason findDefaultStopReason(java.io.File logFile)
                                        throws ArgumentNotValid
Determines from the progress-statistics log whether the crawl stopped normally.

Parameters:
logFile - A progress-statistics.log file.
Returns:
StopReason.DOWNLOAD_COMPLETE if the progress statistics end with CRAWL ENDED; StopReason.DOWNLOAD_UNFINISHED otherwise, or if the file does not exist.
Throws:
ArgumentNotValid - if logFile is null.
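
The decision rule described for findDefaultStopReason can be sketched as follows. This is an illustration under stated assumptions, not the actual implementation: the local StopReason enum is a stand-in with only the two documented values, IllegalArgumentException stands in for ArgumentNotValid, and "ending with CRAWL ENDED" is interpreted as "the last non-empty line contains the text CRAWL ENDED". The fromLines helper is hypothetical.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.util.List;

class StopReasonSketch {
    /** Stand-in for StopReason, limited to the two values documented above. */
    enum StopReason { DOWNLOAD_COMPLETE, DOWNLOAD_UNFINISHED }

    /** Marker text; matching it against the last non-empty line is an assumption. */
    static final String CRAWL_ENDED_MARKER = "CRAWL ENDED";

    /** Hypothetical helper: decides the stop reason from the log's lines. */
    static StopReason fromLines(List<String> lines) {
        for (int i = lines.size() - 1; i >= 0; i--) {
            String line = lines.get(i).trim();
            if (!line.isEmpty()) {
                return line.contains(CRAWL_ENDED_MARKER)
                        ? StopReason.DOWNLOAD_COMPLETE
                        : StopReason.DOWNLOAD_UNFINISHED;
            }
        }
        return StopReason.DOWNLOAD_UNFINISHED; // empty log: treated as unfinished
    }

    static StopReason findDefaultStopReason(File logFile) {
        if (logFile == null) {
            // Stand-in for ArgumentNotValid.
            throw new IllegalArgumentException("logFile must not be null");
        }
        if (!logFile.isFile()) {
            return StopReason.DOWNLOAD_UNFINISHED; // missing file
        }
        try {
            return fromLines(Files.readAllLines(logFile.toPath()));
        } catch (IOException e) {
            // Treating an unreadable file as unfinished is an assumption.
            return StopReason.DOWNLOAD_UNFINISHED;
        }
    }

    public static void main(String[] args) {
        System.out.println(fromLines(List.of("header", "CRAWL ENDED - Finished")));
        System.out.println(findDefaultStopReason(new File("no-such-progress-statistics.log")));
    }
}
```

Separating the line-based decision from the file handling keeps the rule easy to check without a real progress-statistics.log on disk.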