These are some thoughts developed after spending a couple of days exploring Twitter's user and programming interfaces.
What is Twitter?
Twitter is a website where users can post messages (tweets) of up to 140 characters. Users can follow each other's tweets, engage in conversations via tweets, retweet other people's tweets to their own followers, or send direct messages to a follower. Tweets can be labelled with hashtags - a kind of not-very-well-controlled keyword vocabulary.
Why is Twitter interesting?
Tweets can be trivial (do you want to know what a given celebrity is eating for breakfast?). However, tweets can also:
- be an important source of breaking news
- offer a window into various kinds of activism
- link to important content on the wider web
From the point of view of web-harvesting, the last is very interesting. Tweets are tagged (though not with 100% accuracy) by language and geo-location. Twitter could therefore be an important gateway to relevant content for national web archives. For example, starting with a keen Danish tweeter, the politician Margrethe Vestager, one could:
- harvest her tweets
- harvest tweets from her followers
- harvest blogs linked to by her and her followers
in each case filtering content by language and/or location to try to restrict the harvest to material of Danish national interest. In this way one could hope to find a great deal of relevant web content which would be missed in a simple national-domain-level crawl.
But what is Twitter, really?
Twitter is obviously not (just) a website. Twitter users are as likely to be reading it in a browser plugin or a smartphone app as on the standard website. These applications are built on Twitter's application programming interface (API), which is open and well-documented. Shouldn't we be using this API to find the Twitter content that interests us?
What do we want from Twitter?
There are many method calls in the Twitter API which could be useful to us, such as methods for finding all the followers of a given tweeter. The most used method is likely to be the general search call, which looks like this: http://search.twitter.com/search.json?q=vestager&rpp=100. This is the API equivalent of typing https://twitter.com/search/vestager in a browser window. In general, one can always transform a web search on Twitter into an API search by changing the prefix of the URL to point to the API instead of the web interface.
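Assuming (from the examples above) that the only changes needed are the URL prefix and the query-string form, the web-to-API transformation can be sketched as a simple string manipulation. The class and method names here are hypothetical:

```java
// Sketch: turn a Twitter web-search URL into the equivalent Search API call.
// Assumes the URL shapes shown above; "rpp" is the results-per-page parameter.
public class TwitterSearchUrl {

    public static String toApiUrl(String webSearchUrl, int resultsPerPage) {
        // e.g. https://twitter.com/search/vestager -> query term "vestager"
        String query = webSearchUrl.substring(webSearchUrl.lastIndexOf('/') + 1);
        return "http://search.twitter.com/search.json?q=" + query
                + "&rpp=" + resultsPerPage;
    }

    public static void main(String[] args) {
        System.out.println(toApiUrl("https://twitter.com/search/vestager", 100));
    }
}
```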
What does the result of the above JSON search look like? Well, click on it ... or look at the formatted example.
The result is a structure which is essentially a paged list of tweets, including all sorts of relevant metadata such as language- and geo-codes. Each tweet has a unique id, so the JSON result can easily be used to build a web link to the given tweet, e.g. https://twitter.com/#!/vestager/status/153531518315282432.
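Assuming each tweet in the JSON result carries the author's screen name and a string tweet id (the exact field names vary between API versions, so treat them as assumptions), rebuilding the web link is trivial:

```java
// Sketch: rebuild a tweet's web URL from fields found in the JSON result.
public class TweetLink {

    public static String statusUrl(String screenName, String idStr) {
        // Matches the link form shown above: https://twitter.com/#!/<user>/status/<id>
        return "https://twitter.com/#!/" + screenName + "/status/" + idStr;
    }

    public static void main(String[] args) {
        System.out.println(statusUrl("vestager", "153531518315282432"));
    }
}
```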
This is smart, but one can also immediately see a problem. We could just use the search API to find a lot of relevant tweets and harvest them all one-by-one, but then we would lose much of the context - for example the timeline view of an individual's tweets, or the flow of conversation. Instead we would end up with a vast archive of disconnected individual tweets which would be difficult to navigate or interpret.
So what do we want?
- We want an organised record of the twitter behaviour of interesting users
- We want a discoverable record of the interactions between twitter users
- We want links in tweets to be clickable, at least where they are of relevance to our collection profile
(How) Can the Twitter API Help Us?
We are still at the proof-of-concept and prototype stage, but we have experimented with creating a Twitter link-extractor for Heritrix which uses the API to discover new content and queue it for harvesting. The extractor takes any Twitter web-search URL (such as https://twitter.com/search/vestager), transforms it into an API call, extracts all Twitter usernames and inline URLs (the so-called t.co URLs) from the API result, and then adds to the Heritrix queue:
- a web lookup for the timeline of each user's tweets
- each t.co URL found
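The extraction step above can be sketched with the kind of regular-expression parsing the prototype uses. The "from_user" field name and the patterns are assumptions about the Search API's JSON, not a definitive implementation:

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of the link extractor's core: pull usernames and t.co links
// straight out of the raw JSON text with regular expressions.
public class TweetExtractor {

    // Assumption: "from_user" is the username field in the search result.
    private static final Pattern USER =
            Pattern.compile("\"from_user\":\"([A-Za-z0-9_]+)\"");
    private static final Pattern TCO =
            Pattern.compile("https?://t\\.co/\\w+");

    /** Timeline URLs to queue: one per distinct username seen. */
    public static Set<String> timelineUrls(String json) {
        String cleaned = json.replace("\\/", "/"); // JSON may escape slashes
        Set<String> urls = new LinkedHashSet<>();
        Matcher m = USER.matcher(cleaned);
        while (m.find()) {
            urls.add("https://twitter.com/" + m.group(1));
        }
        return urls;
    }

    /** Inline t.co short links to queue. */
    public static Set<String> shortLinks(String json) {
        String cleaned = json.replace("\\/", "/");
        Set<String> urls = new LinkedHashSet<>();
        Matcher m = TCO.matcher(cleaned);
        while (m.find()) {
            urls.add(m.group());
        }
        return urls;
    }
}
```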
The harvest can be seeded with some relevant search terms - for example for prominent relevant users or hashtags. The extractor works as expected, but it is still difficult to navigate within the harvested archive, and more work is needed to understand how to improve the linkage between the different harvested pages.
There are many possible future refinements to this harvesting strategy, such as:
- creating new searches based on the results of previous searches
- using paging to get many more search results
- using geo- and language-restriction
- using time-based restriction to avoid reharvesting previously seen content
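For the paging refinement, the Search API accepts a page parameter alongside rpp, so a pager can be sketched as generating successive API URLs. The class and method names here are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: generate successive Search API calls using the "page" parameter.
public class SearchPager {

    public static List<String> pagedUrls(String query, int rpp, int pages) {
        List<String> urls = new ArrayList<>();
        for (int page = 1; page <= pages; page++) {
            urls.add("http://search.twitter.com/search.json?q=" + query
                    + "&rpp=" + rpp + "&page=" + page);
        }
        return urls;
    }

    public static void main(String[] args) {
        for (String url : pagedUrls("vestager", 100, 3)) {
            System.out.println(url);
        }
    }
}
```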
On a technical note, the prototype uses very ugly regular-expression parsing to get user IDs and URLs from the JSON output. Any production-ready development should use a proper Java API for Twitter such as twitter4j.
A technical note on AJAX URLs
Here is what Gina Jones of the Library of Congress told me about their experiences with harvesting Twitter:
"Hi Sabine, am trying to clean out my mailbox today for overdue tasks and responses.
It depends on what you are going to do - archive tweets by owner, or by hashtag?
In 2009, we crawled the Twitter hashtag #sotomayor (and, for misspellers, #sotomayer) for a Supreme Court Nomination web archive collection. The problem was how Twitter constructed the "more" pages. We couldn't narrowly focus the crawler on just the hashtags, so we ended up crawling the hashtags hourly. It was a mess anyway, the things people say…. Based on my experience, I do not think that hashtag collection is interesting.
There has been a change to Twitter's structure, and Nicholas Taylor, who is doing most of the problematic QA, sent the below to UK/TNA.
Nicholas, anything to add to this? (Nicholas, I edited a bit)
I made a test harvest: http://kb-test-adm-001.kb.dk:8080/History/Harveststatus-jobdetails.jsp?jobID=3281
According to the crawl log, the Twitter profiles seem to be harvested.
But the viewerproxy changes http to https in the URL.
Example: https://twitter.com/DRBreaking (and the viewerproxy collected this missing URL: http://22.214.171.124:8086443).