NetarchiveSuite includes a special implementation of a Heritrix DecidingScope which can be used with a modified order template to manage a crawl of Twitter and material linked from Twitter. A sample order template (twitter-frontpages.xml) is included with the distribution. To use TwitterDecidingScope, first configure the order xml for the desired crawl characteristics, then start a selective harvest of Twitter using that configuration and the desired harvest limits.
The TwitterDecidingScope functions as follows. The scope searches Twitter for tweets matching any of the specified keywords (which may also be Twitter hashtags or usernames). It is possible (and advisable) to restrict the number of tweets returned by specifying the desired language and geo_location(s). The scope then queues each discovered tweet for download as html.
The relevant section of the order-template looks something like:
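The exact contents depend on the template shipped with your installation, but a minimal sketch of the scope configuration might resemble the following. The class name, package, and attribute names here are illustrative assumptions based on the description in this section; consult the bundled sample template for the authoritative settings.

```xml
<!-- Illustrative sketch only: class, package, and attribute names are assumptions. -->
<newObject name="scope" class="dk.netarkivet.harvester.harvesting.TwitterDecidingScope">
  <!-- Keywords to search for; hashtags and usernames are also accepted -->
  <stringList name="keywords">
    <string>#dkpol</string>
    <string>netarkivet</string>
  </stringList>
  <!-- Restricting language and geo_location(s) reduces the number of tweets returned -->
  <string name="language">da</string>
  <stringList name="geo_locations">
    <string>56.0,10.0,300km</string>
  </stringList>
</newObject>
```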
In addition to queueing the individual tweets discovered by searching Twitter, the following boolean flags tell TwitterDecidingScope which additional material to download:
- queue_links: if true, queue any links/media found in the discovered tweets
- queue_user_status: if true, queue an html listing of tweets from all users responsible for the discovered tweets
- queue_user_status_links: if true, attempt to find and queue any other links in other tweets from the discovered users
- queue_keyword_links: if true, queue an html listing of a search on the specified keywords
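In the order template these flags would appear as boolean settings inside the scope definition; a hedged sketch, with the flag names taken from the list above and the values purely illustrative:

```xml
<!-- Illustrative sketch: flag names from the description above, values are examples -->
<boolean name="queue_links">true</boolean>
<boolean name="queue_user_status">true</boolean>
<boolean name="queue_user_status_links">false</boolean>
<boolean name="queue_keyword_links">false</boolean>
```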
Our experience from harvesting Danish content suggests the following:
- Language filtering works very poorly. It is a better strategy to use language-specific keywords.
- geo_location filtering also works poorly.
- A large proportion of the linked material is from major news outlets. If you already have a harvest strategy which collects these regularly, it may be wise to block them from Twitter harvests (treating them as crawler traps) to avoid unnecessary duplication.
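One way to block such sites is a REJECT decide rule in the order template's decide-rules sequence. A sketch using Heritrix's MatchesRegExpDecideRule follows; the rule name and the domains in the regexp are illustrative examples only, not a recommended block list:

```xml
<!-- Illustrative: reject already-collected news domains as crawler traps -->
<newObject name="rejectNewsOutlets" class="org.archive.crawler.deciderules.MatchesRegExpDecideRule">
  <string name="decision">REJECT</string>
  <string name="regexp">.*(dr\.dk|politiken\.dk|berlingske\.dk).*</string>
</newObject>
```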