dk.netarkivet.harvester.harvesting
Class HeritrixDomainHarvestReport

java.lang.Object
  extended by dk.netarkivet.harvester.harvesting.distribute.DomainHarvestReport
      extended by dk.netarkivet.harvester.harvesting.HeritrixDomainHarvestReport
All Implemented Interfaces:
java.io.Serializable

public class HeritrixDomainHarvestReport
extends DomainHarvestReport
implements java.io.Serializable

Class responsible for generating a domain harvest report from crawl logs created by Heritrix and presenting the relevant information to clients.

See Also:
Serialized Form

Field Summary
 
Fields inherited from class dk.netarkivet.harvester.harvesting.distribute.DomainHarvestReport
domainstats
 
Constructor Summary
HeritrixDomainHarvestReport(java.io.File reportFile, StopReason defaultStopReason)
          Reads the data in a Heritrix crawl.log file and parses it.
 
Method Summary
 
Methods inherited from class dk.netarkivet.harvester.harvesting.distribute.DomainHarvestReport
getByteCount, getDomainNames, getObjectCount, getStopReason
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

HeritrixDomainHarvestReport

public HeritrixDomainHarvestReport(java.io.File reportFile,
                                   StopReason defaultStopReason)
Reads the data in a Heritrix crawl.log file and parses it. The crawl.log format is described in the Heritrix user manual, section 8.2.1: http://crawler.archive.org/articles/user_manual.html#logs

Note: Invalid lines are logged and then ignored.

Each URL listed in the file is assigned to a domain, and the total object count and byte count are calculated per domain. Finally, a StopReason is determined for each domain: when the response code is CrawlURI.S_BLOCKED_BY_QUOTA (currently -5003), the StopReason is set to StopReason.SIZE_LIMIT if the annotation equals "Q:group-max-all-kb", or to StopReason.OBJECT_LIMIT if the annotation equals "Q:group-max-fetch-successes".

Parameters:
reportFile - a Heritrix crawl.log file
defaultStopReason - the default StopReason
Throws:
IOFailure - If unable to read reportFile
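
The parsing steps described for the constructor can be illustrated with a minimal, self-contained sketch. It is not the NetarchiveSuite implementation: the class CrawlLogSketch, its DomainStats holder, and the local StopReason enum are hypothetical stand-ins. The column positions (status code second, size third, URL fourth, annotations twelfth), the "last two host labels" domain rule, and the choice to count every valid line as one object are simplifying assumptions. In this sketch invalid lines are skipped silently rather than logged.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only; not the NetarchiveSuite implementation. */
class CrawlLogSketch {

    /** Hypothetical stand-in for the StopReason values named in the text. */
    enum StopReason { DOWNLOAD_COMPLETE, SIZE_LIMIT, OBJECT_LIMIT }

    /** Value of CrawlURI.S_BLOCKED_BY_QUOTA as quoted in the constructor text. */
    static final int S_BLOCKED_BY_QUOTA = -5003;

    /** Hypothetical per-domain totals. */
    static final class DomainStats {
        long objectCount;
        long byteCount;
        StopReason stopReason;

        DomainStats(StopReason initial) {
            this.stopReason = initial;
        }
    }

    /** Aggregates crawl.log lines per domain; invalid lines are skipped. */
    static Map<String, DomainStats> summarize(List<String> lines,
                                              StopReason defaultStopReason) {
        Map<String, DomainStats> stats = new LinkedHashMap<>();
        for (String line : lines) {
            // Assumed layout: whitespace-separated, annotations in column 12.
            String[] f = line.trim().split("\\s+", 12);
            if (f.length < 4) {
                continue; // too few columns: treat as invalid
            }
            int status;
            long size;
            String domain;
            try {
                status = Integer.parseInt(f[1]);
                size = f[2].equals("-") ? 0L : Long.parseLong(f[2]);
                domain = domainOf(f[3]);
            } catch (IllegalArgumentException e) {
                continue; // bad number or malformed URL: treat as invalid
            }
            if (domain == null) {
                continue; // no host part (for example an opaque URI)
            }
            DomainStats s = stats.computeIfAbsent(
                    domain, d -> new DomainStats(defaultStopReason));
            // Assumption: every valid line counts as one object.
            s.objectCount++;
            s.byteCount += size;
            if (status == S_BLOCKED_BY_QUOTA && f.length == 12) {
                List<String> annotations = Arrays.asList(f[11].split(","));
                if (annotations.contains("Q:group-max-all-kb")) {
                    s.stopReason = StopReason.SIZE_LIMIT;
                } else if (annotations.contains("Q:group-max-fetch-successes")) {
                    s.stopReason = StopReason.OBJECT_LIMIT;
                }
            }
        }
        return stats;
    }

    /** Simplified domain rule: the last two labels of the host name. */
    static String domainOf(String url) {
        String host = URI.create(url).getHost();
        if (host == null) {
            return null;
        }
        String[] labels = host.split("\\.");
        int n = labels.length;
        return n <= 2 ? host : labels[n - 2] + "." + labels[n - 1];
    }
}
```

Calling CrawlLogSketch.summarize(lines, CrawlLogSketch.StopReason.DOWNLOAD_COMPLETE) on the lines of a crawl.log yields one DomainStats entry per domain. A domain keeps the default stop reason unless one of its lines is blocked by quota with a matching annotation. The two-label domain rule is deliberately naive: it mishandles IP addresses and multi-part suffixes such as co.uk, which a production implementation would need to handle.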