Class DomainConfiguration

  • All Implemented Interfaces:
    Named

    public class DomainConfiguration
    extends Object
    implements Named
    This class describes a configuration for harvesting a domain. It combines a number of seedlists, a number of passwords, an order template, and some specialised settings to define the way to harvest a domain.
    • Constructor Detail

      • DomainConfiguration

        public DomainConfiguration​(String theConfigName,
                                   Domain domain,
                                   List<SeedList> seedlists,
                                   List<Password> passwords)
        Create a new configuration for a domain.
        Parameters:
        theConfigName - The name of this configuration
        domain - The domain that this configuration is for
        seedlists - Seedlists to use in this configuration.
        passwords - Passwords to use in this configuration.
      • DomainConfiguration

        public DomainConfiguration​(String theConfigName,
                                   String domainName,
                                   DomainHistory history,
                                   List<String> crawlertraps,
                                   List<SeedList> seedlists,
                                   List<Password> passwords)
        Alternate constructor. TODO Filter all history not relevant for this configuration
        Parameters:
        theConfigName - The name of this configuration
        domainName - The name of the domain that this configuration is for
        history - The domainhistory of the given domain
        crawlertraps - The crawlertraps of the given domain
        seedlists - Seedlists to use in this configuration
        passwords - Passwords to use in this configuration.
    • Method Detail

      • setOrderXmlName

        public void setOrderXmlName​(String ordername)
        Specify the name of the order.xml template to use.
        Parameters:
        ordername - order.xml template name
        Throws:
        ArgumentNotValid - if ordername is null or empty
      • setMaxObjects

        public void setMaxObjects​(long max)
        Specify the maximum number of objects to retrieve from the domain.
        Parameters:
        max - maximum number of objects to retrieve
        Throws:
        ArgumentNotValid - if max < -1
      • setMaxRequestRate

        public void setMaxRequestRate​(int maxrate)
        Specify the maximum request rate to use when harvesting data.
        Parameters:
        maxrate - the maximum request rate
        Throws:
        ArgumentNotValid - if maxrate < 0
      • setMaxBytes

        public void setMaxBytes​(long maxBytes)
        Specify the maximum number of bytes to download from a domain in a single harvest.
        Parameters:
        maxBytes - Maximum number of bytes to download, or -1 for no limit.
        Throws:
        ArgumentNotValid - if maxBytes < -1
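        The limit convention documented for setMaxBytes (-1 means no limit, anything below -1 is rejected) can be illustrated with a minimal, self-contained sketch. LimitSketch is a hypothetical stand-in, not the real class; IllegalArgumentException replaces ArgumentNotValid so the code compiles on its own, and the initial value of the field is an assumption.

```java
// Hypothetical stand-in for the documented limit validation.
// IllegalArgumentException is used in place of ArgumentNotValid.
final class LimitSketch {
    // Documented sentinel: -1 means "no limit" for maxBytes.
    static final long NO_LIMIT = -1L;

    // Assumption: the sketch starts out unlimited; the real default may differ.
    private long maxBytes = NO_LIMIT;

    void setMaxBytes(long maxBytes) {
        // Values below -1 are rejected, as the Throws clause describes.
        if (maxBytes < NO_LIMIT) {
            throw new IllegalArgumentException("maxBytes must be >= -1, was " + maxBytes);
        }
        this.maxBytes = maxBytes;
    }

    long getMaxBytes() {
        return maxBytes;
    }

    boolean isUnlimited() {
        return maxBytes == NO_LIMIT;
    }
}
```

        Callers would typically test for the sentinel before comparing the value against a byte count, since -1 is not a usable size.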
      • getName

        public String getName()
        Get the configuration name.
        Specified by:
        getName in interface Named
        Returns:
        the configuration name
      • getComments

        public String getComments()
        Returns comments.
        Specified by:
        getComments in interface Named
        Returns:
        string containing comments
      • getOrderXmlName

        public String getOrderXmlName()
        Returns the name of the order xml file used by the domain.
        Returns:
        name of the order.xml file that should be used when harvesting the domain
      • getMaxObjects

        public long getMaxObjects()
        Returns the maximum number of objects to harvest from the domain.
        Returns:
        maximum number of objects to harvest
      • getMaxRequestRate

        public int getMaxRequestRate()
        Returns the maximum request rate to use when harvesting the domain.
        Returns:
        maximum request rate
      • getMaxBytes

        public long getMaxBytes()
        Returns the maximum number of bytes to download during a single harvest of a domain.
        Returns:
        Maximum bytes limit, or -1 for no limit.
      • getDomainName

        public String getDomainName()
        Returns the name of the domain aggregating this configuration.
        Returns:
        the name of the domain aggregating this configuration.
      • getSeedLists

        public Iterator<SeedList> getSeedLists()
        Get an iterator of seedlists used in this configuration.
        Returns:
        seedlists as iterator
      • addSeedList

        public void addSeedList​(Domain domain,
                                SeedList seedlist)
        Add a new seedlist to the configuration. The seedlist must exist in the associated domain and be equal to the seedlist defined there.
        Parameters:
        seedlist - the seedlist to add
        domain - The domain to check if the seedlist exists
        Throws:
        ArgumentNotValid - if the seedlist is null
        UnknownID - if the seedlist is not defined on the domain
        PermissionDenied - if the seedlist is different from the one on the domain.
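        The three documented failure cases for addSeedList can be sketched as follows. Everything here is a hypothetical stand-in: Seeds replaces SeedList, a Map keyed by seedlist name replaces Domain, and the nested exception classes only mimic the names UnknownID and PermissionDenied. The order of the checks is an assumption drawn from the Throws list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Objects;

// Self-contained sketch; none of these types are the real NetarchiveSuite classes.
final class SeedListSketch {
    static class UnknownID extends RuntimeException {
        UnknownID(String message) { super(message); }
    }

    static class PermissionDenied extends RuntimeException {
        PermissionDenied(String message) { super(message); }
    }

    // A seedlist reduced to a name and its seed URLs.
    static final class Seeds {
        final String name;
        final String urls;

        Seeds(String name, String urls) {
            this.name = name;
            this.urls = urls;
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof Seeds
                    && ((Seeds) o).name.equals(name)
                    && ((Seeds) o).urls.equals(urls);
        }

        @Override
        public int hashCode() {
            return Objects.hash(name, urls);
        }
    }

    private final List<Seeds> used = new ArrayList<>();

    void addSeedList(Map<String, Seeds> domainSeedLists, Seeds seedlist) {
        if (seedlist == null) {
            // Stand-in for ArgumentNotValid.
            throw new IllegalArgumentException("seedlist must not be null");
        }
        Seeds onDomain = domainSeedLists.get(seedlist.name);
        if (onDomain == null) {
            throw new UnknownID("No seedlist named '" + seedlist.name + "' on the domain");
        }
        if (!onDomain.equals(seedlist)) {
            throw new PermissionDenied("Seedlist '" + seedlist.name + "' differs from the domain's copy");
        }
        used.add(seedlist);
    }

    int size() {
        return used.size();
    }
}
```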
      • setSeedLists

        public void setSeedLists​(Domain domain,
                                 List<SeedList> newSeedlists)
        Sets the used seedlists to the given list. Note: list is copied.
        Parameters:
        newSeedlists - The seedlists to use.
        domain - The domain where the seedlists should come from
        Throws:
        ArgumentNotValid - if the seedlists are null
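        The note that the list is copied means later changes to the caller's list should not affect the configuration. A minimal sketch of that defensive-copy behaviour, using a hypothetical CopyOnSetSketch class with plain strings in place of SeedList objects and IllegalArgumentException in place of ArgumentNotValid:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in showing the "list is copied" contract.
final class CopyOnSetSketch {
    private List<String> seedlists = new ArrayList<>();

    void setSeedLists(List<String> newSeedlists) {
        if (newSeedlists == null) {
            throw new IllegalArgumentException("newSeedlists must not be null");
        }
        // Defensive copy: the caller keeps ownership of its own list.
        this.seedlists = new ArrayList<>(newSeedlists);
    }

    // Mirrors the documented getter shape, which exposes an iterator.
    Iterator<String> getSeedLists() {
        return seedlists.iterator();
    }

    int count() {
        return seedlists.size();
    }
}
```

        With this shape, adding to or clearing the original list after the call leaves the stored seedlists unchanged.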
      • getPasswords

        public Iterator<Password> getPasswords()
        Get an iterator of passwords used in this configuration.
        Returns:
        The passwords in an iterator
      • addPassword

        public void addPassword​(Domain domain,
                                Password password)
        Add password to the configuration.
        Parameters:
        password - to add (must exist in the domain)
        domain - the domain where the password should come from.
      • getExpectedNumberOfObjects

        public long getExpectedNumberOfObjects​(long objectLimit,
                                               long byteLimit)
        Gets the best expectation for how many objects a harvest using this configuration will retrieve, given a job with a maximum limit per domain.
        Parameters:
        objectLimit - The maximum limit, or Constants.HERITRIX_MAXOBJECTS_INFINITY for no limit. This limit overrides the limit set on the configuration, unless override is in effect.
        byteLimit - The maximum number of bytes that will be used as limit in the harvest. This limit overrides the limit set on the configuration, unless override is in effect.
        Returns:
        The expected number of objects.
      • minObjectsBytesLimit

        public long minObjectsBytesLimit​(long objectLimit,
                                         long byteLimit,
                                         long expectedObjectSize)
        Return the lowest limit for the two values, or MAX_DOMAIN_SIZE if both are infinite, which is the max size we harvest from this domain.
        Parameters:
        objectLimit - A long value defining an object limit, or 0 for infinite
        byteLimit - A long value defining a byte limit, or HarvesterSettings.MAX_DOMAIN_SIZE for infinite.
        expectedObjectSize - The expected number of bytes per object
        Returns:
        The lowest of the two boundaries, or MAX_DOMAIN_SIZE if both are unlimited.
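        One way to read this contract is sketched below. The sketch makes several assumptions that are not confirmed by the text above: it uses a single hypothetical sentinel INFINITE = -1 for both limits instead of the sentinel values listed in the parameter descriptions, it uses a made-up MAX_DOMAIN_SIZE value, and it compares the two limits in units of objects by dividing the byte limit by expectedObjectSize.

```java
// Hypothetical sketch of a "lowest of two limits" computation.
// Sentinel and fallback values are assumptions, not the library's constants.
final class LimitMathSketch {
    static final long INFINITE = -1L;          // assumed "no limit" marker
    static final long MAX_DOMAIN_SIZE = 5000L; // made-up fallback value

    // Assumes expectedObjectSize > 0.
    static long minObjectsBytesLimit(long objectLimit, long byteLimit, long expectedObjectSize) {
        boolean objectsUnlimited = objectLimit == INFINITE;
        boolean bytesUnlimited = byteLimit == INFINITE;
        if (objectsUnlimited && bytesUnlimited) {
            // Both limits are infinite: fall back to the global maximum.
            return MAX_DOMAIN_SIZE;
        }
        // Express the byte limit as a number of objects so both limits are comparable.
        long objectsByBytes = bytesUnlimited ? Long.MAX_VALUE : byteLimit / expectedObjectSize;
        long objects = objectsUnlimited ? Long.MAX_VALUE : objectLimit;
        return Math.min(objects, objectsByBytes);
    }
}
```

        For instance, under these assumptions an object limit of 100 combined with a byte limit of 1,000,000 and an expected object size of 1,000 bytes gives 100, because the byte limit alone would allow 1,000 objects.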
      • setComments

        public void setComments​(String comments)
        Set the comments field.
        Parameters:
        comments - User-entered free-form comments.
      • removePassword

        public void removePassword​(String passwordName)
        Remove a password from the list of passwords used in this domain.
        Parameters:
        passwordName - Name of the password to remove.
      • usesPassword

        public boolean usesPassword​(String passwordName)
        Check whether this domain uses a given password.
        Parameters:
        passwordName - The name of the password to check
        Returns:
        whether the given password is used
      • setPasswords

        public void setPasswords​(Domain domain,
                                 List<Password> newPasswords)
        Sets the used passwords to the given list. Note: list is copied.
        Parameters:
        newPasswords - The passwords to use.
        domain - The domain where the passwords should come from
        Throws:
        ArgumentNotValid - if the passwords are null
      • getID

        public Long getID()
        Get the ID of this configuration.
        Returns:
        the ID of this configuration
      • toString

        public String toString()
        ToString of DomainConfiguration class.
        Overrides:
        toString in class Object
        Returns:
        a string with info about the instance of this class.
      • setCrawlertraps

        public void setCrawlertraps​(List<String> someCrawlertraps)
        Set the crawlertraps for this configuration.
        Parameters:
        someCrawlertraps - a list of crawlertraps
      • getCrawlertraps

        public List<String> getCrawlertraps()
        Returns:
        the known crawlertraps for this configuration.
      • getDomainhistory

        public DomainHistory getDomainhistory()
        Returns:
        the domainhistory for this configuration
      • setDomainhistory

        public void setDomainhistory​(DomainHistory newDomainhistory)
        Set the domainHistory for this configuration.
        Parameters:
        newDomainhistory - the new domainHistory for this configuration (null is accepted for no history)
      • setName

        public void setName​(String configName)
        Change the name of this configuration to the given configName.
        Parameters:
        configName - a new name for this configuration.
      • getAttributesAndTypes

        public List<EAV.AttributeAndType> getAttributesAndTypes()
        Get this configuration's EAV attributes and attribute types.
        Returns:
        this configuration's EAV attributes and attribute types
      • setAttributesAndTypes

        public void setAttributesAndTypes​(List<EAV.AttributeAndType> attributesAndTypes)
        Set this configuration's EAV attributes and attribute types.
        Parameters:
        attributesAndTypes - EAV attributes and attribute types