dk.netarkivet.harvester.tools
Class TwitterDecidingScope

java.lang.Object
  extended by javax.management.Attribute
      extended by org.archive.crawler.settings.Type
          extended by org.archive.crawler.settings.ComplexType
              extended by org.archive.crawler.settings.ModuleType
                  extended by org.archive.crawler.framework.Filter
                      extended by org.archive.crawler.framework.CrawlScope
                          extended by org.archive.crawler.deciderules.DecidingScope
                              extended by dk.netarkivet.harvester.tools.TwitterDecidingScope
All Implemented Interfaces:
java.io.Serializable, javax.management.DynamicMBean

public class TwitterDecidingScope
extends org.archive.crawler.deciderules.DecidingScope

Heritrix CrawlScope that uses the Twitter Search API (https://dev.twitter.com/docs/api/1/get/search) to add seeds to a crawl. The following Twitter search parameters are supported:
keywords: a list equivalent to Twitter's "query" text.
geo_locations: as defined in the Twitter API.
language: equivalent to Twitter's "lang" parameter.
Any of these may be omitted. In practice only "keywords" works well in the current version of Twitter. In addition, the number of results to be considered is determined by the parameters "pages" and "twitter_results_per_page".
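
The relationship between "keywords", "pages" and "twitter_results_per_page" can be illustrated with a minimal sketch. It is not the code used by this class. The class name TwitterSearchRequestSketch, the endpoint URL, the query parameter names q, rpp and page, and the choice to join all keywords into a single query are assumptions made for illustration, loosely based on the retired version 1 search API.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative sketch only; not the code used by TwitterDecidingScope. */
final class TwitterSearchRequestSketch {

    // Assumption: endpoint of the retired v1 search API.
    static final String SEARCH_URL = "http://search.twitter.com/search.json";

    /**
     * Builds one request URL per result page.
     * Assumption: all keywords are joined into a single query string, and
     * the parameters are named q, rpp (results per page) and page.
     */
    static List<String> buildRequests(List<String> keywords, int pages, int resultsPerPage) {
        String query = urlEncode(String.join(" ", keywords));
        List<String> requests = new ArrayList<>();
        for (int page = 1; page <= pages; page++) {
            requests.add(SEARCH_URL + "?q=" + query
                    + "&rpp=" + resultsPerPage + "&page=" + page);
        }
        return requests;
    }

    /** Upper bound on the number of results considered for one query. */
    static int maxResults(int pages, int resultsPerPage) {
        return pages * resultsPerPage;
    }

    private static String urlEncode(String text) {
        try {
            return URLEncoder.encode(text, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        for (String url : buildRequests(Arrays.asList("climate", "change"), 2, 20)) {
            System.out.println(url);
        }
        System.out.println("At most " + maxResults(2, 20) + " results");
    }
}
```

Under these assumptions, two pages of twenty results each give at most forty results for the query.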

See Also:
Serialized Form

Nested Class Summary
 
Nested classes/interfaces inherited from class org.archive.crawler.settings.ComplexType
org.archive.crawler.settings.ComplexType.MBeanAttributeInfoIterator
 
Field Summary
static java.lang.String ATTR_GEOLOCATIONS
          Attribute/value pair.
static java.lang.String ATTR_KEYWORDS
          Attribute/value pair.
static java.lang.String ATTR_LANG
          Attribute/value pair.
static java.lang.String ATTR_PAGES
          Attribute/value pair.
static java.lang.String ATTR_QUEUE_KEYWORD_LINKS
          Attribute/value pair specifying whether an html search for the given keyword(s) should also be queued.
static java.lang.String ATTR_QUEUE_LINKS
          Attribute/value pair specifying whether embedded links should be queued.
static java.lang.String ATTR_QUEUE_USER_STATUS
          Attribute/value pair specifying whether the status of discovered users should be harvested.
static java.lang.String ATTR_QUEUE_USER_STATUS_LINKS
          Attribute/value pair specifying whether one should additionally queue all links embedded in a user's status.
static java.lang.String ATTR_RESULTS_PER_PAGE
          Attribute/value pair.
(package private) static java.util.logging.Logger logger
          Logger for this class.
 
Fields inherited from class org.archive.crawler.deciderules.DecidingScope
ATTR_DECIDE_RULES
 
Fields inherited from class org.archive.crawler.framework.CrawlScope
ATTR_NAME, ATTR_REREAD_SEEDS_ON_CONFIG, ATTR_SEEDS, DEFAULT_REREAD_SEEDS_ON_CONFIG, seedListeners
 
Fields inherited from class org.archive.crawler.framework.Filter
ATTR_ENABLED
 
Fields inherited from class org.archive.crawler.settings.ComplexType
definition, definitionMap
 
Constructor Summary
TwitterDecidingScope(java.lang.String name)
          Constructor for this scope.
 
Method Summary
 boolean addSeed(org.archive.crawler.datamodel.CandidateURI curi)
          Adds a candidate uri as a seed for the crawl.
 void initialize(org.archive.crawler.framework.CrawlController controller)
          This routine makes any necessary Twitter API calls and queues the content discovered.
 
Methods inherited from class org.archive.crawler.deciderules.DecidingScope
getDecideRule, innerAccepts, kickUpdate
 
Methods inherited from class org.archive.crawler.framework.CrawlScope
addSeedListener, checkClose, getSeedfile, isSameHost, isSeed, listUsedFiles, refreshSeeds, seedsIterator, seedsIterator, toString
 
Methods inherited from class org.archive.crawler.framework.Filter
accepts, getFilterOffPosition, returnTrueIfMatches
 
Methods inherited from class org.archive.crawler.settings.ModuleType
addElement
 
Methods inherited from class org.archive.crawler.settings.ComplexType
addElementToDefinition, checkValue, earlyInitialize, getAbsoluteName, getAttribute, getAttribute, getAttribute, getAttributeInfo, getAttributeInfo, getAttributeInfoIterator, getAttributes, getDataContainerRecursive, getDataContainerRecursive, getDefaultValue, getDescription, getElementFromDefinition, getLegalValues, getLocalAttribute, getMBeanInfo, getMBeanInfo, getParent, getPreservedFields, getSettingsHandler, getUncheckedAttribute, getValue, globalSettings, invoke, isInitialized, isOverridden, iterator, removeElementFromDefinition, setAsOrder, setAttribute, setAttribute, setAttributes, setDescription, setPreservedFields, unsetAttribute
 
Methods inherited from class org.archive.crawler.settings.Type
addConstraint, equals, getConstraints, getLegalValueType, isExpertSetting, isOverrideable, isTransient, setExpertSetting, setLegalValueType, setOverrideable, setTransient
 
Methods inherited from class javax.management.Attribute
getName, hashCode
 
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
 

Field Detail

logger

static java.util.logging.Logger logger
Logger for this class.


ATTR_KEYWORDS

public static final java.lang.String ATTR_KEYWORDS
Attribute/value pair. The list of keywords to search for.

See Also:
Constant Field Values

ATTR_PAGES

public static final java.lang.String ATTR_PAGES
Attribute/value pair. The number of pages of results to process.

See Also:
Constant Field Values

ATTR_RESULTS_PER_PAGE

public static final java.lang.String ATTR_RESULTS_PER_PAGE
Attribute/value pair. The number of results per Twitter page.

See Also:
Constant Field Values

ATTR_GEOLOCATIONS

public static final java.lang.String ATTR_GEOLOCATIONS
Attribute/value pair. A list of geo_locations to include in the search. Each has the form lat,long,radius,units, e.g. 100.1,10.5,25.0,km.

See Also:
Constant Field Values
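
The lat,long,radius,units form described for ATTR_GEOLOCATIONS can be split into its four components as in the following sketch. It is only an illustration and not the parsing code of this class. The class name GeoLocationSketch and the toGeocode helper are hypothetical, and the "lat,long,radius" plus unit suffix layout produced by toGeocode is an assumption about Twitter's geocode search parameter.

```java
/** Illustrative sketch only; not the parsing code of TwitterDecidingScope. */
final class GeoLocationSketch {
    final double latitude;
    final double longitude;
    final double radius;
    final String units;

    private GeoLocationSketch(double latitude, double longitude,
                              double radius, String units) {
        this.latitude = latitude;
        this.longitude = longitude;
        this.radius = radius;
        this.units = units;
    }

    /** Parses a string of the form lat,long,radius,units. */
    static GeoLocationSketch parse(String spec) {
        // Split on commas, tolerating whitespace around each component.
        String[] parts = spec.trim().split("\\s*,\\s*");
        if (parts.length != 4) {
            throw new IllegalArgumentException(
                    "Expected lat,long,radius,units but got: " + spec);
        }
        // Double.parseDouble throws NumberFormatException on bad numbers.
        return new GeoLocationSketch(
                Double.parseDouble(parts[0]),
                Double.parseDouble(parts[1]),
                Double.parseDouble(parts[2]),
                parts[3]);
    }

    /** Assumption: the geocode value is lat,long,radius with the unit appended. */
    String toGeocode() {
        return latitude + "," + longitude + "," + radius + units;
    }

    public static void main(String[] args) {
        GeoLocationSketch location = parse("55.7,12.6,25.0,km");
        System.out.println(location.toGeocode());
    }
}
```

The sketch performs no range checks on latitude or longitude; a stricter parser could reject out-of-range values before issuing a search.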

ATTR_LANG

public static final java.lang.String ATTR_LANG
Attribute/value pair. If set, the language to which results are restricted. Unfortunately, Twitter's language identification heuristics are so poor that this option is unusable. (Broken. See http://code.google.com/p/twitter-api/issues/detail?id=1942 )

See Also:
Constant Field Values

ATTR_QUEUE_LINKS

public static final java.lang.String ATTR_QUEUE_LINKS
Attribute/value pair specifying whether embedded links should be queued.

See Also:
Constant Field Values

ATTR_QUEUE_USER_STATUS

public static final java.lang.String ATTR_QUEUE_USER_STATUS
Attribute/value pair specifying whether the status of discovered users should be harvested.

See Also:
Constant Field Values

ATTR_QUEUE_USER_STATUS_LINKS

public static final java.lang.String ATTR_QUEUE_USER_STATUS_LINKS
Attribute/value pair specifying whether one should additionally queue all links embedded in a user's status.

See Also:
Constant Field Values

ATTR_QUEUE_KEYWORD_LINKS

public static final java.lang.String ATTR_QUEUE_KEYWORD_LINKS
Attribute/value pair specifying whether an html search for the given keyword(s) should also be queued.

See Also:
Constant Field Values
Constructor Detail

TwitterDecidingScope

public TwitterDecidingScope(java.lang.String name)
Constructor for this scope. Sets up all known attributes.

Parameters:
name - the name of this scope.
Method Detail

initialize

public void initialize(org.archive.crawler.framework.CrawlController controller)
This routine makes any necessary Twitter API calls and queues the content discovered.

Overrides:
initialize in class org.archive.crawler.framework.CrawlScope
Parameters:
controller - The controller for this crawl.

addSeed

public boolean addSeed(org.archive.crawler.datamodel.CandidateURI curi)
Adds a candidate uri as a seed for the crawl.

Overrides:
addSeed in class org.archive.crawler.framework.CrawlScope
Parameters:
curi - The crawl uri to be added.
Returns:
whether the uri was added as a seed.