dk.netarkivet.harvester.harvesting.report
Class LegacyHarvestReport

java.lang.Object
  extended by dk.netarkivet.harvester.harvesting.report.AbstractHarvestReport
      extended by dk.netarkivet.harvester.harvesting.report.LegacyHarvestReport
All Implemented Interfaces:
HarvestReport, java.io.Serializable

public class LegacyHarvestReport
extends AbstractHarvestReport

Class responsible for generating a domain harvest report from crawl logs created by Heritrix and presenting the relevant information to clients.

See Also:
Serialized Form

Nested Class Summary
 
Nested classes/interfaces inherited from class dk.netarkivet.harvester.harvesting.report.AbstractHarvestReport
AbstractHarvestReport.ProgressStatisticsConstants
 
Constructor Summary
LegacyHarvestReport()
           
LegacyHarvestReport(HeritrixFiles hFiles)
          The constructor reads the data in a crawl.log file and parses it.
 
Method Summary
 void postProcess(Job job)
          Post-processing happens on the scheduler side when ARC files have been uploaded.
 
Methods inherited from class dk.netarkivet.harvester.harvesting.report.AbstractHarvestReport
findDefaultStopReason, getByteCount, getDefaultStopReason, getDomainNames, getHeritrixFiles, getObjectCount, getOrCreateDomainStats, getStopReason, preProcess
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

LegacyHarvestReport

public LegacyHarvestReport(HeritrixFiles hFiles)
The constructor reads the data in a crawl.log file and parses it. The crawl.log format is described in the Heritrix user manual, section 8.2.1: http://crawler.archive.org/articles/user_manual/analysis.html#logs

Note: Invalid lines are logged and then ignored.

Each URL listed in the file is assigned to a domain, and the total object count and byte count per domain are calculated. Finally, a StopReason is found for each domain: when the response is CrawlURI.S_BLOCKED_BY_QUOTA (currently -5003), the StopReason is set to StopReason.SIZE_LIMIT if the annotation equals "Q:group-max-all-kb", or to StopReason.OBJECT_LIMIT if the annotation equals "Q:group-max-fetch-successes".

Parameters:
hFiles - the Heritrix reports and logs.
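
The per-domain bookkeeping and the quota-based StopReason rule described above can be illustrated with a minimal sketch. This is not the class's actual implementation: HarvestReportSketch, DomainStats, record and quotaStopReason are hypothetical names, the enum is a stand-in holding only the two StopReason values named above, and the domain is assumed to have been extracted from the URL already. Whether quota-blocked entries count towards the object total is also an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

/** Illustrative sketch only; not part of NetarchiveSuite. */
public class HarvestReportSketch {

    /** Hypothetical stand-in for StopReason; only the two values named in the text. */
    public enum StopReason { SIZE_LIMIT, OBJECT_LIMIT }

    /** Value the documentation gives for CrawlURI.S_BLOCKED_BY_QUOTA. */
    public static final int S_BLOCKED_BY_QUOTA = -5003;

    /** Hypothetical holder for the totals kept per domain. */
    public static final class DomainStats {
        public long objectCount;
        public long byteCount;
        /** Stays null until a quota annotation is seen for the domain. */
        public StopReason stopReason;
    }

    private final Map<String, DomainStats> statsByDomain = new LinkedHashMap<>();

    /** Maps a response code and annotation to a quota StopReason, if one applies. */
    public static Optional<StopReason> quotaStopReason(int status, String annotation) {
        if (status != S_BLOCKED_BY_QUOTA) {
            return Optional.empty();
        }
        if ("Q:group-max-all-kb".equals(annotation)) {
            return Optional.of(StopReason.SIZE_LIMIT);
        }
        if ("Q:group-max-fetch-successes".equals(annotation)) {
            return Optional.of(StopReason.OBJECT_LIMIT);
        }
        return Optional.empty();
    }

    /** Adds one already-parsed crawl.log entry to the totals of its domain. */
    public void record(String domain, int status, long bytes, String annotation) {
        DomainStats stats = statsByDomain.computeIfAbsent(domain, d -> new DomainStats());
        stats.objectCount++;          // assumption: every valid entry counts as one object
        stats.byteCount += bytes;
        quotaStopReason(status, annotation).ifPresent(r -> stats.stopReason = r);
    }

    /** Returns the totals for a domain, or null if no entry was recorded for it. */
    public DomainStats statsFor(String domain) {
        return statsByDomain.get(domain);
    }
}
```

In this sketch a caller would invoke record once per valid crawl.log line and skip (after logging) any line that fails to parse, mirroring the note about invalid lines.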

LegacyHarvestReport

public LegacyHarvestReport()

Method Detail

postProcess

public void postProcess(Job job)
Post-processing happens on the scheduler side when ARC files have been uploaded.

Specified by:
postProcess in interface HarvestReport
Specified by:
postProcess in class AbstractHarvestReport
Parameters:
job - the actual job.