Class FullHarvest

  • All Implemented Interfaces:
    Named

    public class FullHarvest
    extends HarvestDefinition
    This class contains the specific properties and operations of snapshot harvest definitions.
    • Constructor Detail

      • FullHarvest

        public FullHarvest​(String harvestDefName,
                           String comments,
                           Long previousHarvestDefinitionOid,
                           long maxCountObjects,
                           long maxBytes,
                           long maxJobRunningTime,
                           boolean isIndexReady,
                           javax.inject.Provider<HarvestDefinitionDAO> hdDaoProvider,
                           javax.inject.Provider<JobDAO> jobDaoProvider,
                           javax.inject.Provider<ExtendedFieldDAO> extendedFieldDAOProvide,
                           javax.inject.Provider<DomainDAO> domainDAOProvider)
        Creates a new FullHarvest instance configured with the supplied values. Should only be used by the HarvestFactory class.
        Parameters:
        harvestDefName - the name of the harvest definition
        comments - comments on the harvest definition
        previousHarvestDefinitionOid - the OID of the previous harvest definition on which this FullHarvest is based
        maxCountObjects - Limit for how many objects can be fetched per domain
        maxBytes - Limit for how many bytes can be fetched per domain
        maxJobRunningTime - Limit on how much time can be spent on each job. 0 means no limit
        isIndexReady - Whether the deduplication index is ready for this harvest.
    • Method Detail

      • getPreviousHarvestDefinition

        public HarvestDefinition getPreviousHarvestDefinition()
        Get the previous HarvestDefinition on which this one is based.
        Returns:
        The previous HarvestDefinition
      • setPreviousHarvestDefinition

        public void setPreviousHarvestDefinition​(Long prev)
        Set the previous HarvestDefinition on which this one is based.
        Parameters:
        prev - The id of a HarvestDefinition
      • getMaxCountObjects

        public long getMaxCountObjects()
        Description copied from class: HarvestDefinition
        Returns how many objects to harvest per domain, or 0 for no limit.
        Specified by:
        getMaxCountObjects in class HarvestDefinition
        Returns:
        Returns the maxCountObjects.
      • setMaxCountObjects

        public void setMaxCountObjects​(long maxCountObjects)
        Parameters:
        maxCountObjects - The maxCountObjects to set.
      • getMaxBytes

        public long getMaxBytes()
        Get the maximum number of bytes that this fullharvest will harvest per domain, 0 for no limit.
        Specified by:
        getMaxBytes in class HarvestDefinition
        Returns:
        Total download limit in bytes per domain.
      • setMaxBytes

        public void setMaxBytes​(long maxBytes)
        Set the limit for how many bytes this fullharvest will harvest per domain, or -1 for no limit.
        Parameters:
        maxBytes - Number of bytes to stop harvesting at.
      • getDomainConfigurations

        public Iterator<DomainConfiguration> getDomainConfigurations()
        Returns an iterator of domain configurations for this harvest definition. Domains are filtered out if, on the previous harvest, they:
        1) were completed
        2) reached their maxBytes limit (and the maxBytes limit has not changed since time of harvest)
        3) reached their maxObjects limit (and the maxObjects limit has not changed since time of harvest)
        4) died uncleanly (e.g. due to a manual shutdown of Heritrix) on their last harvest.

        Domains are also excluded if they are aliases of another domain.

        Specified by:
        getDomainConfigurations in class HarvestDefinition
        Returns:
        Iterator containing information about the domain configurations
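
        The filtering rules described for getDomainConfigurations can be sketched as follows. This is only an illustration under stated assumptions: DomainFilterSketch, StopReason and PreviousResult are hypothetical stand-ins, not types from this API, and the actual implementation may decide differently.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in types for illustration; not part of the FullHarvest API.
class DomainFilterSketch {

    /** How a domain's crawl ended on the previous harvest (assumed categories). */
    enum StopReason { COMPLETED, BYTE_LIMIT, OBJECT_LIMIT, DIED_UNCLEANLY, UNFINISHED }

    /** Assumed summary of one domain's outcome on the previous harvest. */
    static final class PreviousResult {
        final String domain;
        final StopReason stopReason;
        final long limitAtHarvest; // limit in force when the domain was harvested
        final long currentLimit;   // the same limit as it is configured now
        final boolean alias;       // true if the domain is an alias of another domain

        PreviousResult(String domain, StopReason stopReason,
                       long limitAtHarvest, long currentLimit, boolean alias) {
            this.domain = domain;
            this.stopReason = stopReason;
            this.limitAtHarvest = limitAtHarvest;
            this.currentLimit = currentLimit;
            this.alias = alias;
        }
    }

    /** Applies the documented exclusion rules to one domain. */
    static boolean isFilteredOut(PreviousResult r) {
        if (r.alias) {
            return true; // aliases of another domain are excluded
        }
        switch (r.stopReason) {
            case COMPLETED:      // rule 1: finished on the previous harvest
            case DIED_UNCLEANLY: // rule 4: e.g. manual shutdown of the crawler
                return true;
            case BYTE_LIMIT:     // rule 2: limit reached and still unchanged
            case OBJECT_LIMIT:   // rule 3: limit reached and still unchanged
                return r.limitAtHarvest == r.currentLimit;
            default:
                return false;    // anything else is kept for this harvest
        }
    }

    /** Returns the names of the domains that survive the filter, in input order. */
    static Iterator<String> domainsToHarvest(List<PreviousResult> previous) {
        List<String> kept = new ArrayList<>();
        for (PreviousResult r : previous) {
            if (!isFilteredOut(r)) {
                kept.add(r.domain);
            }
        }
        return kept.iterator();
    }
}
```

        In this sketch a domain that hit its byte limit is harvested again only if the limit has changed since the previous harvest, which mirrors the parenthetical conditions in rules 2 and 3.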
      • getDomainConfigurationsForIterativeHarvest

        public Iterator<DomainConfiguration> getDomainConfigurationsForIterativeHarvest()
        Returns:
        an iterator of the DomainConfigurations not finished in the previous snapshot harvest
      • runNow

        public boolean runNow​(Date now)
        Check if this harvest definition should be run, given the time now.
        Specified by:
        runNow in class HarvestDefinition
        Parameters:
        now - The current time
        Returns:
        true if harvest definition should be run
      • isSnapShot

        public boolean isSnapShot()
        Returns whether this HarvestDefinition represents a snapshot harvest.
        Specified by:
        isSnapShot in class HarvestDefinition
        Returns:
        Returns true
      • getMaxJobRunningTime

        public long getMaxJobRunningTime()
        Returns:
        The maximum running time for each job, in seconds; 0 means no limit.
      • setMaxJobRunningTime

        public void setMaxJobRunningTime​(long maxJobRunningtime)
        Set the limit for how many seconds each crawl job in this FullHarvest will run, or 0 for no limit.
        Parameters:
        maxJobRunningtime - max number of seconds
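
        A caller reading this value needs to treat 0 as "no limit" rather than as "stop immediately". The following minimal sketch shows one way to do that; RunningTimeLimitSketch and hasExceeded are hypothetical names, and treating an elapsed time equal to the limit as exceeded is an assumption, not documented behaviour.

```java
// Hypothetical helper for illustration; not part of the FullHarvest API.
class RunningTimeLimitSketch {

    /** Per the documentation, a limit of 0 seconds means "no limit". */
    static final long NO_LIMIT = 0L;

    /**
     * Returns true if a job that has run for elapsedSeconds has used up
     * its allowance of maxJobRunningTimeSeconds.
     */
    static boolean hasExceeded(long maxJobRunningTimeSeconds, long elapsedSeconds) {
        if (maxJobRunningTimeSeconds == NO_LIMIT) {
            return false; // unlimited jobs never exceed their running time
        }
        // Assumption: reaching the limit exactly counts as exceeding it.
        return elapsedSeconds >= maxJobRunningTimeSeconds;
    }
}
```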
      • getIndexReady

        public boolean getIndexReady()
        Returns whether the index is ready. Used to check whether a FullHarvest is ready for scheduling. Scheduling requires that the deduplication index used by the jobs in the FullHarvest has already been prepared by the IndexServer.
        Returns:
        true, if the deduplication index is ready. Otherwise false.
      • setIndexReady

        public void setIndexReady​(boolean isIndexReady)
        Set the indexReady field.
        Parameters:
        isIndexReady - The new value of the indexReady field.
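
        The role of the indexReady flag as a scheduling gate can be illustrated with a small stand-in. HarvestStub and readyForScheduling below are hypothetical; in particular, combining the flag with the result of runNow is an assumption about how a scheduler might use the two, not a description of the actual scheduler.

```java
// Hypothetical stand-ins for illustration; not part of the FullHarvest API.
class IndexReadyGateSketch {

    /** Minimal stub exposing only the indexReady accessors documented above. */
    static final class HarvestStub {
        private boolean indexReady;

        boolean getIndexReady() {
            return indexReady;
        }

        void setIndexReady(boolean isIndexReady) {
            this.indexReady = isIndexReady;
        }
    }

    /**
     * Assumed gate: a harvest is scheduled only when it is due to run
     * and its deduplication index has been prepared.
     */
    static boolean readyForScheduling(HarvestStub harvest, boolean dueToRun) {
        return dueToRun && harvest.getIndexReady();
    }
}
```

        In this sketch, whatever component prepares the index would call setIndexReady(true) once it is done, after which a harvest that is due becomes eligible for scheduling.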