
These are some thoughts developed after spending a couple of days exploring Twitter's user and programming interfaces.

What is Twitter?

Twitter is a website where users can post messages (tweets) up to 140 characters long. Users can follow each other's tweets, engage in conversations via tweets, retweet other people's tweets to their own followers, or send direct messages to a follower. Tweets can be labelled with hashtags - a kind of not-very-well-controlled keyword vocabulary.

Why is Twitter interesting?

Tweets can be trivial (do you want to know what a given celebrity is eating for breakfast?). However, tweets can also:

  • be an important source of breaking news
  • provide a window into various kinds of activism
  • link to important content on the wider web

From the point of view of web-harvesting, the last is very interesting. Tweets are tagged (though not with 100% accuracy) by language and geo-location. Twitter could therefore be an important gateway to relevant content for national web archives. For example, starting with a keen Danish tweeter, the politician Margrethe Vestager, one could:

  • harvest her tweets
  • harvest tweets from her followers
  • harvest blogs linked to by her and her followers

in each case filtering content by language and/or location to try to restrict the harvest to material of Danish national interest. In this way one could hope to find a great deal of relevant web content which would be missed in a simple national-domain-level crawl.

But what is Twitter Really?

Twitter is obviously not (just) a website. Twitter users are as likely to be reading it in a browser plugin or a smartphone app as on the standard website. These applications are built on Twitter's application programming interface (API), which is open and well-documented. Shouldn't we be using this API to find the Twitter content that interests us?

What do we want from Twitter?

There are many method calls in the Twitter API which could be useful to us, such as methods for finding all the followers of a given tweeter. The most used method is likely to be the general search call, which is the API equivalent of typing a search into a browser window. In general one can always transform a web search in Twitter to an API search by changing the prefix of the URL to point to the API instead of the web interface.
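As a concrete illustration, the prefix rewrite described above can be sketched as follows. The endpoint name (search.twitter.com/search.json) reflects the 2012-era search API and is an assumption here, as is the example query:

```python
from urllib.parse import urlparse

def web_search_to_api(url):
    """Rewrite a Twitter web-search URL into the equivalent JSON search
    API call by swapping the URL prefix. The search.twitter.com endpoint
    name reflects the 2012-era API and is an assumption here."""
    query = urlparse(url).query
    return "http://search.twitter.com/search.json?" + query
```

For example, `web_search_to_api("http://twitter.com/search?q=%23dkpol")` yields `http://search.twitter.com/search.json?q=%23dkpol`.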

What does the result of such a JSON search look like?

The result is a structure which is essentially a paged list of tweets, including all sorts of relevant metadata such as language- and geo-codes. Each tweet has a unique id, so the JSON result can easily be used to build a web link to the given tweet, e.g. http://twitter.com/#!/vestager/status/153531518315282432.
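A minimal sketch of building such links from the JSON result. The field names (results, from_user, id_str) are assumptions based on the old search API, and the sample data is invented:

```python
import json

def tweet_links(search_json):
    """Build #!-style permalinks for each tweet in a search-API result.
    Field names (results, from_user, id_str) are assumptions based on
    the old search API."""
    data = json.loads(search_json)
    return ["http://twitter.com/#!/%s/status/%s" % (t["from_user"], t["id_str"])
            for t in data.get("results", [])]

# Invented sample in the shape sketched above:
sample = '{"results": [{"from_user": "vestager", "id_str": "153531518315282432"}]}'
# tweet_links(sample) -> ["http://twitter.com/#!/vestager/status/153531518315282432"]
```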

This is smart, but one can also immediately see a problem. We could just use the search API to find a lot of relevant tweets and harvest them all one-by-one, but then we would lose much of the context - for example the timeline view of an individual's tweets, or the flow of conversation. Instead we would end up with a vast archive of disconnected individual tweets which would be difficult to navigate or interpret.

So what do we want?

  • We want an organised record of the twitter behaviour of interesting users
  • We want a discoverable record of the interactions between twitter users
  • We want links in tweets to be clickable, at least where they are of relevance to our collection profile

(How) Can the Twitter API Help Us?

We are still at the proof-of-concept and prototype stage, but we have made some experiments with creating a Twitter link-extractor for Heritrix which uses the API to discover new content and queue it for harvesting. The extractor takes any Twitter web-search URL, transforms it into an API call, extracts all Twitter usernames and inline URLs from the API result, and then adds to the Heritrix queue:

  • a web lookup for the timeline of each user's tweets
  • each URL found

The harvest can be seeded with some relevant search terms - for example for prominent relevant users or hashtags. The extractor works as expected. It is still difficult to navigate within the harvested archive, and more work is needed to understand how to improve the linkage between the different harvested pages.
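The extraction step can be sketched like this, using proper JSON parsing rather than regular expressions. The field names and the sample data are assumptions based on the old search API:

```python
import json

def extract_candidates(api_json):
    """Collect the URIs the prototype queues from a search-API result:
    one timeline lookup per distinct user, plus every inline URL.
    Field names (results, from_user, entities.urls) are assumptions."""
    data = json.loads(api_json)
    users, urls = set(), []
    for tweet in data.get("results", []):
        users.add(tweet["from_user"])
        for entity in tweet.get("entities", {}).get("urls", []):
            urls.append(entity["url"])
    timelines = ["http://twitter.com/#!/%s" % u for u in sorted(users)]
    return timelines + urls

# Invented sample result with one tweet carrying one inline URL:
sample = ('{"results": [{"from_user": "vestager",'
          ' "entities": {"urls": [{"url": "http://example.org/blog"}]}}]}')
# extract_candidates(sample)
#   -> ["http://twitter.com/#!/vestager", "http://example.org/blog"]
```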

There are many possible future refinements to this harvesting strategy, such as:

  • creating new searches based on the results of previous searches
  • using paging to get many more search results
  • using geo- and language- restriction 
  • using time-based restriction to avoid reharvesting of previously seen content
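Several of these refinements map directly onto search-API parameters. A sketch, with parameter names (page, since_id, lang, geocode) taken from the old search API and therefore assumptions here:

```python
def search_params(query, page=1, since_id=None, lang=None, geocode=None):
    """Assemble search-API parameters for the refinements listed above.
    Parameter names follow the old search API and are assumptions here."""
    params = {"q": query, "page": page}   # paging to get more results
    if since_id:
        params["since_id"] = since_id     # skip previously seen tweets
    if lang:
        params["lang"] = lang             # e.g. "da" for Danish content
    if geocode:
        params["geocode"] = geocode       # "lat,long,radius" restriction
    return params
```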

On a technical note, the prototype uses very ugly regular-expression parsing to extract user ids and URLs from the JSON output. Any production-ready development should use a proper Java API for Twitter, such as twitter4j.

A technical note on AJAX URLs

Twitter web URLs typically look like http://twitter.com/#!/vestager. The elements after the "#" character are not sent to the Twitter server but are interpreted by the browser and used to create asynchronous queries in JavaScript. Google describes how these AJAX URLs can be transformed to enable crawling of AJAX sites which are otherwise difficult or impossible to harvest. We need a much more detailed investigation into whether this technique can usefully be employed to create more complete and better-navigable archives of Twitter and other AJAX-heavy websites such as Facebook.
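Google's AJAX-crawling convention rewrites a #! (hash-bang) URL into a crawlable URL carrying an _escaped_fragment_ query parameter. A minimal sketch (it omits the percent-encoding of the fragment that the full scheme specifies):

```python
def escaped_fragment(url):
    """Rewrite a #! (hash-bang) URL into its crawlable _escaped_fragment_
    form, per Google's AJAX-crawling convention. Percent-encoding of the
    fragment value is omitted in this sketch."""
    if "#!" not in url:
        return url
    base, frag = url.split("#!", 1)
    sep = "&" if "?" in base else "?"
    return base + sep + "_escaped_fragment_=" + frag

# escaped_fragment("http://twitter.com/#!/vestager")
#   -> "http://twitter.com/?_escaped_fragment_=/vestager"
```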


1 Comment

  1. Here is what Gina Jones, LOC, told me about their experiences with harvesting Twitter:

    "Hi Sabine, am trying to clean out my mailbox today for overdue tasks and responses.

    It depends on what you are going to do, archive tweets by owner? or hashtag?

    In 2009, we crawled a twitter hashtag #sotomayor (and, for misspellers, #sotomayer) for a Supreme Court Nomination Web archive collection. The problem was how twitter constructed the "more" pages. We couldn't narrowly focus the crawler on just the hashtags, so we ended up crawling the hashtags hourly. It was a mess anyway, the things people say…. Based on my experience, I do not think that hashtag collection is interesting.

    There has been a twitter structure change, and Nicholas Taylor, who is doing most of the problematic QR, sent the below to UK/TNA.

    Nicholas, anything to add to this? (Nicholas, I edited a bit)

    I looked  into this and found additional documentation at Internet Archive's website. They maintain that, as it stands, Heritrix should be able to crawl Twitter as long as these best practices are followed: Then, on the replay side, it's necessary to disable JavaScript when attempting to replay the captured Twitter pages in the archive. I can corroborate that this is the case with our captures of individual Twitter pages, though, unfortunately, I don't have any examples to point you to since none of our web archives that were captured so recently are yet available publicly."

    I made a test harvest:

    According to the crawllog the twitter profiles seem to be harvested:

    But the viewerproxy changes http to https in the url
    Example:  (and the viewerproxy collected this missing url: