Class HarvestReportGenerator

    • Constructor Detail

      • HarvestReportGenerator

        public HarvestReportGenerator()
        Default constructor that does nothing. Subclasses are expected to do the real construction by filling the domainStats map with crawl results.
      • HarvestReportGenerator

        public HarvestReportGenerator(Heritrix3Files files)
        Constructor from Heritrix report files. Subclasses might use a different set of Heritrix reports.
        Parameters:
        files - the set of Heritrix reports.
    • Method Detail

      • preProcess

        public void preProcess(Heritrix3Files files)
        Pre-processing runs when the report is built at the end of the crawl, before the ARC files are uploaded.
      • getOrCreateDomainStats

        protected DomainStats getOrCreateDomainStats(java.lang.String domainName)
        Returns the existing DomainStats object for the given domain or, if none is found, creates one with zero values.
        Parameters:
        domainName - the name of the domain to get DomainStats for.
        Returns:
        a DomainStats object for the given domain-name.
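
The get-or-create behaviour described above can be sketched as follows. This is a minimal illustration, not the class's actual implementation: `DomainStatsSketch` and `DomainStatsRegistry` are hypothetical stand-ins, and the real DomainStats class may have different fields and constructors.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for DomainStats; only zero-initialised counters are modelled.
class DomainStatsSketch {
    long objectCount = 0;
    long byteCount = 0;
}

// Hypothetical holder for the domain-name -> statistics map.
class DomainStatsRegistry {
    private final Map<String, DomainStatsSketch> domainStats = new HashMap<>();

    // Returns the entry for domainName, creating a zeroed one on first access.
    DomainStatsSketch getOrCreateDomainStats(String domainName) {
        return domainStats.computeIfAbsent(domainName, name -> new DomainStatsSketch());
    }

    // Exposes the map, analogous in spirit to getDomainStatsMap().
    Map<String, DomainStatsSketch> getDomainStatsMap() {
        return domainStats;
    }
}
```

With this shape, a subclass that parses crawl results can call the method once per log entry and update the returned counters without first checking whether the domain has been seen.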
      • findDefaultStopReason

        public static StopReason findDefaultStopReason(java.io.File logFile)
        Determines from the progress-statistics log whether the crawl stopped normally.
        Parameters:
        logFile - A progress-statistics.log file.
        Returns:
        StopReason.DOWNLOAD_COMPLETE if the progress statistics end with CRAWL ENDED; StopReason.DOWNLOAD_UNFINISHED otherwise, or if the file does not exist.
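
One way the documented contract could be met is sketched below. The names `StopReasonSketch` and `StopReasonFinder` are hypothetical, and the check itself is an assumption: the sketch treats the log as "ending with CRAWL ENDED" when its last non-blank line contains that text, which may differ from what the real method does.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.List;

// Hypothetical stand-in for the two StopReason values named in the documentation.
enum StopReasonSketch { DOWNLOAD_COMPLETE, DOWNLOAD_UNFINISHED }

class StopReasonFinder {
    // Assumption: a normal stop is signalled by "CRAWL ENDED" on the last non-blank line.
    static StopReasonSketch findDefaultStopReason(File logFile) {
        if (logFile == null || !logFile.isFile()) {
            return StopReasonSketch.DOWNLOAD_UNFINISHED; // missing file counts as unfinished
        }
        List<String> lines;
        try {
            // Reads the whole file into memory; a very large log may call for a tail read instead.
            lines = Files.readAllLines(logFile.toPath(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            return StopReasonSketch.DOWNLOAD_UNFINISHED; // unreadable file treated like a missing one
        }
        // Walk backwards to the last line that is not blank.
        for (int i = lines.size() - 1; i >= 0; i--) {
            String line = lines.get(i).trim();
            if (!line.isEmpty()) {
                return line.contains("CRAWL ENDED")
                        ? StopReasonSketch.DOWNLOAD_COMPLETE
                        : StopReasonSketch.DOWNLOAD_UNFINISHED;
            }
        }
        return StopReasonSketch.DOWNLOAD_UNFINISHED; // empty file
    }
}
```

Returning DOWNLOAD_UNFINISHED on every failure path keeps the default conservative: a crawl is only reported as complete when the log positively says so.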
      • getDomainStatsMap

        public java.util.Map<java.lang.String,DomainStats> getDomainStatsMap()
        Returns:
        the domainStatsMap generated from parsing the crawl-log.