Class HarvestReportGenerator


  • public class HarvestReportGenerator
    extends Object
    Base implementation for a harvest report.
    • Constructor Detail

      • HarvestReportGenerator

        public HarvestReportGenerator()
        Default constructor that does nothing. Subclasses are expected to do the real construction by populating the domainStats map with crawl results.
      • HarvestReportGenerator

        public HarvestReportGenerator(Heritrix3Files files)
        Constructor from Heritrix report files. Subclasses might use a different set of Heritrix reports.
        Parameters:
        files - the set of Heritrix reports.
    • Method Detail

      • preProcess

        public void preProcess(Heritrix3Files files)
        Pre-processing takes place when the report is built at the end of the crawl, before the ARC files are uploaded.
      • getOrCreateDomainStats

        protected DomainStats getOrCreateDomainStats(String domainName)
        Returns the existing DomainStats object for the given domain, or creates one with zero values if none is found.
        Parameters:
        domainName - the name of the domain to get DomainStats for.
        Returns:
        a DomainStats object for the given domain name.
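        The get-or-create behaviour described above can be sketched as follows. This is a minimal illustration, not the class's actual code: DomainStatsSketch and its two counters are hypothetical stand-ins for DomainStats, whose fields are not shown on this page, and the use of Map.computeIfAbsent is an assumption about how such a lookup might be written.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for DomainStats; the real class's fields are not documented here.
class DomainStatsSketch {
    long objectCount = 0; // assumed counter, starts at zero
    long byteCount = 0;   // assumed counter, starts at zero
}

// Hypothetical holder for the per-domain statistics map.
class DomainStatsRegistrySketch {
    private final Map<String, DomainStatsSketch> domainStats = new HashMap<>();

    // Returns the existing entry for the domain, or stores and returns a zero-valued one.
    DomainStatsSketch getOrCreateDomainStats(String domainName) {
        return domainStats.computeIfAbsent(domainName, name -> new DomainStatsSketch());
    }

    // Number of domains seen so far.
    int size() {
        return domainStats.size();
    }

    public static void main(String[] args) {
        DomainStatsRegistrySketch registry = new DomainStatsRegistrySketch();
        DomainStatsSketch stats = registry.getOrCreateDomainStats("example.org");
        stats.objectCount += 3; // record three fetched objects for this domain
        // A second lookup returns the same object rather than a fresh one.
        System.out.println(registry.getOrCreateDomainStats("example.org") == stats);
        System.out.println(registry.size());
    }
}
```

        Because the same object is returned on repeated lookups, a caller can update counters in place while iterating over crawl-log entries.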
      • findDefaultStopReason

        public static StopReason findDefaultStopReason(File logFile)
        Determines from the progress-statistics log whether the crawl stopped normally.
        Parameters:
        logFile - A progress-statistics.log file.
        Returns:
        StopReason.DOWNLOAD_COMPLETE if the progress statistics end with CRAWL ENDED; StopReason.DOWNLOAD_UNFINISHED otherwise, or if the file does not exist.
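        The return rule above can be illustrated with a short sketch. It is an approximation under stated assumptions, not the actual implementation: StopReasonSketch is a hypothetical stand-in for StopReason, "ending with CRAWL ENDED" is interpreted as the last non-blank line containing that text, and an unreadable file is treated like a missing one.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.List;

// Hypothetical stand-in for StopReason, limited to the two values named above.
enum StopReasonSketch { DOWNLOAD_COMPLETE, DOWNLOAD_UNFINISHED }

class StopReasonFinderSketch {
    static StopReasonSketch findDefaultStopReason(File logFile) {
        // A missing file counts as an unfinished download.
        if (logFile == null || !logFile.isFile()) {
            return StopReasonSketch.DOWNLOAD_UNFINISHED;
        }
        List<String> lines;
        try {
            lines = Files.readAllLines(logFile.toPath(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            // Assumption: an unreadable log is treated like a missing one.
            return StopReasonSketch.DOWNLOAD_UNFINISHED;
        }
        // Walk backwards to the last non-blank line and inspect it.
        for (int i = lines.size() - 1; i >= 0; i--) {
            String line = lines.get(i).trim();
            if (!line.isEmpty()) {
                return line.contains("CRAWL ENDED")
                        ? StopReasonSketch.DOWNLOAD_COMPLETE
                        : StopReasonSketch.DOWNLOAD_UNFINISHED;
            }
        }
        // An empty log has no end marker.
        return StopReasonSketch.DOWNLOAD_UNFINISHED;
    }

    public static void main(String[] args) {
        // A path that does not exist falls through to DOWNLOAD_UNFINISHED.
        System.out.println(findDefaultStopReason(new File("no-such-progress-statistics.log")));
    }
}
```

        Reading the whole file is acceptable for a sketch; for very large logs, reading only the tail of the file would be a reasonable alternative.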
      • getDefaultStopReason

        public StopReason getDefaultStopReason()
        Returns:
        the default stop reason.
      • getDomainStatsMap

        public Map<String,DomainStats> getDomainStatsMap()
        Returns:
        the domainStatsMap generated from parsing the crawl-log.
      • getDomainStatsReport

        public static DomainStatsReport getDomainStatsReport(Heritrix3Files files)
        Parameters:
        files - A set of Heritrix3 files used to produce a HarvestReport.
        Returns:
        a DomainStatsReport for a specific H3 crawl.