dk.netarkivet.harvester.harvesting
Class SeedUriDomainnameQueueAssignmentPolicy

java.lang.Object
  extended by org.archive.crawler.frontier.QueueAssignmentPolicy
      extended by org.archive.crawler.frontier.HostnameQueueAssignmentPolicy
          extended by dk.netarkivet.harvester.harvesting.SeedUriDomainnameQueueAssignmentPolicy

public class SeedUriDomainnameQueueAssignmentPolicy
extends org.archive.crawler.frontier.HostnameQueueAssignmentPolicy

A modified version of DomainnameQueueAssignmentPolicy that uses a domain name as the queue name. The domain name returned is that of the candidate URI, except where the seed URI has a different domain name, in which case the seed URI's domain name is used. The domain is defined as the last two names of the full hostname, or the entirety of an IP address:

  x.y.z -> y.z
  y.z -> y.z
  nn.nn.nn.nn -> nn.nn.nn.nn
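
The hostname-to-domain rule described above can be sketched as follows. This is a minimal illustration, not the class's actual code: DomainKeySketch and domainOf are hypothetical names, the IPv4 check is deliberately simplistic, and the sketch takes "last two names" literally, so multi-label suffixes are not treated specially.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of NetarchiveSuite or Heritrix.
// Illustrates the "last two names, or the whole IP address" rule.
final class DomainKeySketch {
    // Simplistic IPv4 check: four dot-separated groups of 1 to 3 digits.
    private static final Pattern IPV4 =
            Pattern.compile("\\d{1,3}(\\.\\d{1,3}){3}");

    static String domainOf(String hostname) {
        if (IPV4.matcher(hostname).matches()) {
            return hostname; // nn.nn.nn.nn -> nn.nn.nn.nn
        }
        String[] labels = hostname.split("\\.");
        if (labels.length <= 2) {
            return hostname; // y.z -> y.z
        }
        // x.y.z -> y.z
        return labels[labels.length - 2] + "." + labels[labels.length - 1];
    }

    public static void main(String[] args) {
        System.out.println(domainOf("x.y.z"));
        System.out.println(domainOf("y.z"));
        System.out.println(domainOf("10.0.0.1"));
    }
}
```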


Field Summary
(package private) static java.lang.String DEFAULT_CLASS_KEY
          A key used when the URI cannot be interpreted.
 
Constructor Summary
SeedUriDomainnameQueueAssignmentPolicy()
           
 
Method Summary
 java.lang.String getClassKey(org.archive.crawler.framework.CrawlController controller, org.archive.crawler.datamodel.CandidateURI cauri)
          Return a key for queue names based on domain names (last two parts of host name) or IP address.
 
Methods inherited from class org.archive.crawler.frontier.QueueAssignmentPolicy
maximumNumberOfKeys
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

DEFAULT_CLASS_KEY

static final java.lang.String DEFAULT_CLASS_KEY
A key used when the URI cannot be interpreted. It is copied from the parent class, where it has private access. The parent returns this key for URIs such as about:blank.

See Also:
Constant Field Values
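
The fallback idea behind DEFAULT_CLASS_KEY can be sketched as below. This is an illustration under stated assumptions: it uses java.net.URI rather than Heritrix's own URI classes, DefaultKeySketch and hostOrFallback are hypothetical names, and FALLBACK_KEY is a placeholder because the real constant's value is not given on this page.

```java
import java.net.URI;

// Hypothetical helper, not part of NetarchiveSuite or Heritrix.
final class DefaultKeySketch {
    // Placeholder only; the real DEFAULT_CLASS_KEY value is not shown here.
    static final String FALLBACK_KEY = "fallback-key";

    // Returns the host of the URI, or the fallback key when no host
    // can be determined (opaque URIs such as about:blank, or bad syntax).
    static String hostOrFallback(String uri) {
        try {
            String host = URI.create(uri).getHost();
            return host != null ? host : FALLBACK_KEY;
        } catch (IllegalArgumentException e) {
            return FALLBACK_KEY;
        }
    }

    public static void main(String[] args) {
        System.out.println(hostOrFallback("http://www.example.org/a"));
        System.out.println(hostOrFallback("about:blank"));
    }
}
```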
Constructor Detail

SeedUriDomainnameQueueAssignmentPolicy

public SeedUriDomainnameQueueAssignmentPolicy()
Method Detail

getClassKey

public java.lang.String getClassKey(org.archive.crawler.framework.CrawlController controller,
                                    org.archive.crawler.datamodel.CandidateURI cauri)
Return a key for queue names based on the domain name (the last two parts of the host name) or the IP address. The key may include a # at the end.

Overrides:
getClassKey in class org.archive.crawler.frontier.HostnameQueueAssignmentPolicy
Parameters:
controller - The controller the crawl is running on.
cauri - A potential URI.
Returns:
a class key (really an arbitrary string), one of: a domain name or IP address, a domain name or IP address with a trailing # part, or "default...".
See Also:
HostnameQueueAssignmentPolicy.getClassKey( org.archive.crawler.framework.CrawlController, org.archive.crawler.datamodel.CandidateURI)
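
The choice between the candidate URI's domain and the seed URI's domain might look roughly like this. It is a sketch only, reading the class description as saying that the seed URI's domain name is used when the two differ. SeedKeySketch and chooseKey are hypothetical names, and both arguments are assumed to be already reduced to domain names.

```java
// Hypothetical helper, not part of NetarchiveSuite or Heritrix.
final class SeedKeySketch {
    // candidateDomain: domain name derived from the candidate URI.
    // seedDomain: domain name derived from the seed URI, or null if
    // no seed is known (an assumption made for this sketch).
    static String chooseKey(String candidateDomain, String seedDomain) {
        if (seedDomain != null && !seedDomain.equals(candidateDomain)) {
            return seedDomain; // seed domain differs, so it is used
        }
        return candidateDomain;
    }

    public static void main(String[] args) {
        System.out.println(chooseKey("cdn.example", "news.example"));
        System.out.println(chooseKey("news.example", null));
    }
}
```

Under this reading, URIs discovered from a seed share the seed's queue even when they are hosted on another domain.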