Class DomainConfiguration

  • All Implemented Interfaces:
    Named

    public class DomainConfiguration
    extends java.lang.Object
    implements Named
    This class describes a configuration for harvesting a domain. It combines a number of seedlists, a number of passwords, an order template, and some specialised settings to define the way to harvest a domain.
    • Constructor Summary

      Constructors 
      Constructor Description
      DomainConfiguration​(java.lang.String theConfigName, Domain domain, java.util.List<SeedList> seedlists, java.util.List<Password> passwords)
      Create a new configuration for a domain.
      DomainConfiguration​(java.lang.String theConfigName, java.lang.String domainName, DomainHistory history, java.util.List<java.lang.String> crawlertraps, java.util.List<SeedList> seedlists, java.util.List<Password> passwords)
      Alternate constructor.
    • Method Summary

      Modifier and Type Method Description
      void addPassword​(Domain domain, Password password)
      Add password to the configuration.
      void addSeedList​(Domain domain, SeedList seedlist)
      Add a new seedlist to the configuration.
      static java.lang.String cfgToString​(DomainConfiguration cfg)  
      java.util.List<EAV.AttributeAndType> getAttributesAndTypes()
      Get this configuration's EAV attributes and attribute types.
      java.lang.String getComments()
      Returns comments.
      java.util.List<java.lang.String> getCrawlertraps()  
      DomainHistory getDomainhistory()  
      java.lang.String getDomainName()
      Returns the name of the domain aggregating this configuration.
      long getExpectedNumberOfObjects​(long objectLimit, long byteLimit)
      Gets the best expectation for how many objects a harvest using this configuration will retrieve, given a job with a maximum limit per domain.
      java.lang.Long getID()
      Get the ID of this configuration.
      long getMaxBytes()
      Returns the maximum number of bytes to download during a single harvest of a domain.
      long getMaxObjects()
      Returns the maximum number of objects to harvest from the domain.
      int getMaxRequestRate()
      Returns the maximum request rate to use when harvesting the domain.
      java.lang.String getName()
      Get the configuration name.
      java.lang.String getOrderXmlName()
      Returns the name of the order xml file used by the domain.
      java.util.Iterator<Password> getPasswords()
      Get an iterator of passwords used in this configuration.
      java.util.Iterator<SeedList> getSeedLists()
      Get an iterator of seedlists used in this configuration.
      long minObjectsBytesLimit​(long objectLimit, long byteLimit, long expectedObjectSize)
      Returns the lower of the two limits, or MAX_DOMAIN_SIZE (the maximum size harvested from this domain) if both are infinite.
      void removePassword​(java.lang.String passwordName)
      Remove a password from the list of passwords used in this domain.
      void setAttributesAndTypes​(java.util.List<EAV.AttributeAndType> attributesAndTypes)
      Set this configuration's EAV attributes and attribute types.
      void setComments​(java.lang.String comments)
      Set the comments field.
      void setCrawlertraps​(java.util.List<java.lang.String> someCrawlertraps)
      Set the crawlertraps for this configuration.
      void setDomainhistory​(DomainHistory newDomainhistory)
      Set the domainHistory for this configuration.
      void setMaxBytes​(long maxBytes)
      Specify the maximum number of bytes to download from a domain in a single harvest.
      void setMaxObjects​(long max)
      Specify the maximum number of objects to retrieve from the domain.
      void setMaxRequestRate​(int maxrate)
      Specify the maximum request rate to use when harvesting data.
      void setName​(java.lang.String configName)
      Change the name of the configuration to the given configName.
      void setOrderXmlName​(java.lang.String ordername)
      Specify the name of the order.xml template to use.
      void setPasswords​(Domain domain, java.util.List<Password> newPasswords)
      Sets the used passwords to the given list.
      void setSeedLists​(Domain domain, java.util.List<SeedList> newSeedlists)
      Sets the used seedlists to the given list.
      java.lang.String toString()
      ToString of DomainConfiguration class.
      boolean usesPassword​(java.lang.String passwordName)
      Check whether this domain uses a given password.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
    • Constructor Detail

      • DomainConfiguration

        public DomainConfiguration​(java.lang.String theConfigName,
                                   Domain domain,
                                   java.util.List<SeedList> seedlists,
                                   java.util.List<Password> passwords)
        Create a new configuration for a domain.
        Parameters:
        theConfigName - The name of this configuration
        domain - The domain that this configuration is for
        seedlists - Seedlists to use in this configuration.
        passwords - Passwords to use in this configuration.
      • DomainConfiguration

        public DomainConfiguration​(java.lang.String theConfigName,
                                   java.lang.String domainName,
                                   DomainHistory history,
                                   java.util.List<java.lang.String> crawlertraps,
                                   java.util.List<SeedList> seedlists,
                                   java.util.List<Password> passwords)
        Alternate constructor. TODO: Filter out all history not relevant to this configuration.
        Parameters:
        theConfigName - The name of this configuration
        domainName - The name of the domain that this configuration is for
        history - The domainhistory of the given domain
        crawlertraps - The crawlertraps of the given domain
        seedlists - Seedlists to use in this configuration
        passwords - Passwords to use in this configuration.
    • Method Detail

      • setOrderXmlName

        public void setOrderXmlName​(java.lang.String ordername)
        Specify the name of the order.xml template to use.
        Parameters:
        ordername - order.xml template name
        Throws:
        ArgumentNotValid - if ordername is null or empty
      • setMaxObjects

        public void setMaxObjects​(long max)
        Specify the maximum number of objects to retrieve from the domain.
        Parameters:
        max - maximum number of objects to retrieve
        Throws:
        ArgumentNotValid - if max < -1
      • setMaxRequestRate

        public void setMaxRequestRate​(int maxrate)
        Specify the maximum request rate to use when harvesting data.
        Parameters:
        maxrate - the maximum request rate
        Throws:
        ArgumentNotValid - if maxrate < 0
      • setMaxBytes

        public void setMaxBytes​(long maxBytes)
        Specify the maximum number of bytes to download from a domain in a single harvest.
        Parameters:
        maxBytes - Maximum number of bytes to download, or -1 for no limit.
        Throws:
        ArgumentNotValid - if maxBytes < -1
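The range checks documented for the three limit setters can be illustrated with a minimal sketch. It is not the NetarchiveSuite implementation: the class name LimitSettingsSketch is hypothetical, and IllegalArgumentException stands in for ArgumentNotValid so the example compiles on its own.

```java
/** Hypothetical holder that mirrors the documented limit checks. */
final class LimitSettingsSketch {
    /** Documented for setMaxBytes: -1 means "no limit". */
    static final long NO_BYTE_LIMIT = -1L;

    private long maxObjects;
    private long maxBytes = NO_BYTE_LIMIT;
    private int maxRequestRate;

    void setMaxObjects(long max) {
        if (max < -1) {
            // Stand-in for ArgumentNotValid.
            throw new IllegalArgumentException("max must be >= -1, was " + max);
        }
        this.maxObjects = max;
    }

    void setMaxBytes(long maxBytes) {
        if (maxBytes < NO_BYTE_LIMIT) {
            throw new IllegalArgumentException("maxBytes must be >= -1, was " + maxBytes);
        }
        this.maxBytes = maxBytes;
    }

    void setMaxRequestRate(int maxrate) {
        if (maxrate < 0) {
            throw new IllegalArgumentException("maxrate must be >= 0, was " + maxrate);
        }
        this.maxRequestRate = maxrate;
    }

    long getMaxObjects() { return maxObjects; }
    long getMaxBytes() { return maxBytes; }
    int getMaxRequestRate() { return maxRequestRate; }

    public static void main(String[] args) {
        LimitSettingsSketch limits = new LimitSettingsSketch();
        limits.setMaxBytes(NO_BYTE_LIMIT); // accepted: no byte limit
        limits.setMaxObjects(2000L);
        System.out.println(limits.getMaxBytes() + " " + limits.getMaxObjects());
    }
}
```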
      • getName

        public java.lang.String getName()
        Get the configuration name.
        Specified by:
        getName in interface Named
        Returns:
        the configuration name
      • getComments

        public java.lang.String getComments()
        Returns comments.
        Specified by:
        getComments in interface Named
        Returns:
        string containing comments
      • getOrderXmlName

        public java.lang.String getOrderXmlName()
        Returns the name of the order xml file used by the domain.
        Returns:
        name of the order.xml file that should be used when harvesting the domain
      • getMaxObjects

        public long getMaxObjects()
        Returns the maximum number of objects to harvest from the domain.
        Returns:
        maximum number of objects to harvest
      • getMaxRequestRate

        public int getMaxRequestRate()
        Returns the maximum request rate to use when harvesting the domain.
        Returns:
        maximum request rate
      • getMaxBytes

        public long getMaxBytes()
        Returns the maximum number of bytes to download during a single harvest of a domain.
        Returns:
        Maximum bytes limit, or -1 for no limit.
      • getDomainName

        public java.lang.String getDomainName()
        Returns the name of the domain aggregating this configuration.
        Returns:
        the name of the domain aggregating this configuration.
      • getSeedLists

        public java.util.Iterator<SeedList> getSeedLists()
        Get an iterator of seedlists used in this configuration.
        Returns:
        seedlists as iterator
      • addSeedList

        public void addSeedList​(Domain domain,
                                SeedList seedlist)
        Add a new seedlist to the configuration. The seedlist must exist in the associated domain and be equal to the seedlist defined there.
        Parameters:
        seedlist - the seedlist to add
        domain - The domain to check if the seedlist exists
        Throws:
        ArgumentNotValid - if the seedlist is null
        UnknownID - if the seedlist is not defined on the domain
        PermissionDenied - if the seedlist is different from the one on the domain.
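The three failure cases listed for addSeedList can be read as an ordered sequence of checks. The sketch below is an illustration under assumptions, not the library code: seedlists are modelled as name and text pairs, the domain is a hypothetical map, and the standard exceptions IllegalArgumentException, NoSuchElementException, and IllegalStateException stand in for ArgumentNotValid, UnknownID, and PermissionDenied.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.NoSuchElementException;

/** Hypothetical model of the documented addSeedList checks. */
final class SeedListCheckSketch {
    /** Stand-in for the domain: seedlist name mapped to its seed text. */
    private final Map<String, String> domainSeedLists = new HashMap<>();
    /** Names of the seedlists accepted into the configuration. */
    private final List<String> configSeedLists = new ArrayList<>();

    void defineOnDomain(String name, String seeds) {
        domainSeedLists.put(name, seeds);
    }

    void addSeedList(String name, String seeds) {
        if (name == null || seeds == null) {
            // Stand-in for ArgumentNotValid.
            throw new IllegalArgumentException("seedlist must not be null");
        }
        String onDomain = domainSeedLists.get(name);
        if (onDomain == null) {
            // Stand-in for UnknownID.
            throw new NoSuchElementException("seedlist not defined on domain: " + name);
        }
        if (!onDomain.equals(seeds)) {
            // Stand-in for PermissionDenied.
            throw new IllegalStateException("seedlist differs from the one on the domain: " + name);
        }
        configSeedLists.add(name);
    }

    /** Returns a copy of the accepted seedlist names. */
    List<String> names() {
        return new ArrayList<>(configSeedLists);
    }

    public static void main(String[] args) {
        SeedListCheckSketch config = new SeedListCheckSketch();
        config.defineOnDomain("default", "http://example.org/");
        config.addSeedList("default", "http://example.org/");
        System.out.println(config.names());
    }
}
```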
      • setSeedLists

        public void setSeedLists​(Domain domain,
                                 java.util.List<SeedList> newSeedlists)
        Sets the used seedlists to the given list. Note: list is copied.
        Parameters:
        newSeedlists - The seedlists to use.
        domain - The domain where the seedlists should come from
        Throws:
        ArgumentNotValid - if the seedlists are null
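The note that the list is copied means later changes to the caller's list should not affect the configuration. A minimal sketch of that behaviour follows, assuming a defensive copy through the ArrayList copy constructor; the class CopiedListSketch is hypothetical, it uses strings in place of SeedList objects, it omits the domain parameter, and IllegalArgumentException stands in for ArgumentNotValid.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Hypothetical illustration of "list is copied" semantics. */
final class CopiedListSketch {
    private List<String> seedlists = new ArrayList<>();

    void setSeedLists(List<String> newSeedlists) {
        if (newSeedlists == null) {
            // Stand-in for ArgumentNotValid.
            throw new IllegalArgumentException("newSeedlists must not be null");
        }
        // Copy the elements so the caller's list and ours are independent.
        this.seedlists = new ArrayList<>(newSeedlists);
    }

    Iterator<String> getSeedLists() {
        return seedlists.iterator();
    }

    int size() {
        return seedlists.size();
    }

    public static void main(String[] args) {
        List<String> callerList = new ArrayList<>();
        callerList.add("default");
        CopiedListSketch config = new CopiedListSketch();
        config.setSeedLists(callerList);
        callerList.add("added-later"); // does not reach the configuration
        System.out.println(config.size());
    }
}
```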
      • getPasswords

        public java.util.Iterator<Password> getPasswords()
        Get an iterator of passwords used in this configuration.
        Returns:
        The passwords in an iterator
      • addPassword

        public void addPassword​(Domain domain,
                                Password password)
        Add password to the configuration.
        Parameters:
        password - to add (must exist in the domain)
        domain - the domain where the password should come from.
      • getExpectedNumberOfObjects

        public long getExpectedNumberOfObjects​(long objectLimit,
                                               long byteLimit)
        Gets the best expectation for how many objects a harvest using this configuration will retrieve, given a job with a maximum limit per domain.
        Parameters:
        objectLimit - The maximum limit, or Constants.HERITRIX_MAXOBJECTS_INFINITY for no limit. This limit overrides the limit set on the configuration, unless override is in effect.
        byteLimit - The maximum number of bytes that will be used as limit in the harvest. This limit overrides the limit set on the configuration, unless override is in effect.
        Returns:
        The expected number of objects.
      • minObjectsBytesLimit

        public long minObjectsBytesLimit​(long objectLimit,
                                         long byteLimit,
                                         long expectedObjectSize)
        Returns the lower of the two limits, or MAX_DOMAIN_SIZE (the maximum size harvested from this domain) if both are infinite.
        Parameters:
        objectLimit - A long value defining an object limit, or 0 for infinite
        byteLimit - A long value defining a byte limit, or HarvesterSettings.MAX_DOMAIN_SIZE for infinite.
        expectedObjectSize - The expected number of bytes per object
        Returns:
        The lowest of the two boundaries, or MAX_DOMAIN_SIZE if both are unlimited.
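One way to read the rule for minObjectsBytesLimit is sketched below. Several details are assumptions and not taken from the library: a single sentinel of -1 marks either limit as infinite, the byte limit is converted to an object count by dividing by expectedObjectSize, and MAX_DOMAIN_SIZE is a placeholder constant where the real value would come from the HarvesterSettings.MAX_DOMAIN_SIZE setting.

```java
/** Hypothetical sketch of choosing the lower of an object limit and a byte limit. */
final class MinLimitSketch {
    /** Assumed sentinel for "infinite"; the real sentinels may differ. */
    static final long INFINITE = -1L;
    /** Placeholder value; the real limit is read from settings. */
    static final long MAX_DOMAIN_SIZE = 5000L;

    static long minObjectsBytesLimit(long objectLimit, long byteLimit, long expectedObjectSize) {
        if (expectedObjectSize <= 0) {
            throw new IllegalArgumentException("expectedObjectSize must be positive");
        }
        boolean objectsUnlimited = objectLimit == INFINITE;
        boolean bytesUnlimited = byteLimit == INFINITE;
        if (objectsUnlimited && bytesUnlimited) {
            return MAX_DOMAIN_SIZE;
        }
        if (bytesUnlimited) {
            return objectLimit;
        }
        // Assumption: express the byte limit as a number of objects.
        long objectsAllowedByBytes = byteLimit / expectedObjectSize;
        if (objectsUnlimited) {
            return objectsAllowedByBytes;
        }
        return Math.min(objectLimit, objectsAllowedByBytes);
    }

    public static void main(String[] args) {
        // 10000 bytes at 50 bytes per object allows 200 objects, above the object limit of 100.
        System.out.println(minObjectsBytesLimit(100L, 10000L, 50L));
    }
}
```

Under these assumptions the method always returns an object count, which keeps the two limits comparable even though they are given in different units.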
      • setComments

        public void setComments​(java.lang.String comments)
        Set the comments field.
        Parameters:
        comments - User-entered free-form comments.
      • removePassword

        public void removePassword​(java.lang.String passwordName)
        Remove a password from the list of passwords used in this domain.
        Parameters:
        passwordName - Name of the password to remove.
      • usesPassword

        public boolean usesPassword​(java.lang.String passwordName)
        Check whether this domain uses a given password.
        Parameters:
        passwordName - The name of the given password
        Returns:
        whether the given password is used
      • setPasswords

        public void setPasswords​(Domain domain,
                                 java.util.List<Password> newPasswords)
        Sets the used passwords to the given list. Note: list is copied.
        Parameters:
        newPasswords - The passwords to use.
        domain - The domain where the passwords should come from
        Throws:
        ArgumentNotValid - if the passwords are null
      • getID

        public java.lang.Long getID()
        Get the ID of this configuration.
        Returns:
        the ID of this configuration
      • toString

        public java.lang.String toString()
        ToString of DomainConfiguration class.
        Overrides:
        toString in class java.lang.Object
        Returns:
        a string with info about the instance of this class.
      • setCrawlertraps

        public void setCrawlertraps​(java.util.List<java.lang.String> someCrawlertraps)
        Set the crawlertraps for this configuration.
        Parameters:
        someCrawlertraps - a list of crawlertraps
      • getCrawlertraps

        public java.util.List<java.lang.String> getCrawlertraps()
        Returns:
        the known crawlertraps for this configuration.
      • setDomainhistory

        public void setDomainhistory​(DomainHistory newDomainhistory)
        Set the domainHistory for this configuration.
        Parameters:
        newDomainhistory - the new domainHistory for this configuration (null is accepted for no history)
      • setName

        public void setName​(java.lang.String configName)
        Change the name of the configuration to the given configName.
        Parameters:
        configName - a new name for this configuration.
      • getAttributesAndTypes

        public java.util.List<EAV.AttributeAndType> getAttributesAndTypes()
        Get this configuration's EAV attributes and attribute types.
        Returns:
        this configuration's EAV attributes and attribute types
      • setAttributesAndTypes

        public void setAttributesAndTypes​(java.util.List<EAV.AttributeAndType> attributesAndTypes)
        Set this configuration's EAV attributes and attribute types.
        Parameters:
        attributesAndTypes - EAV attributes and attribute types