
Motivation

The most basic frustration of web archiving is that virtually any modern computer running a recent browser can render virtually any web page, yet these same pages often defeat our purpose-built crawlers. So why not leverage the tremendous developer effort that goes into building browsers, and use a browser to render the page we want to harvest, executing any scripts, Flash, etc. on the site that might be needed to generate links? At the same time, we would like a single crawler to remain in overall charge of the crawl: of crawl budgeting, scope management, and WARC generation. The idea, then, is to use a plugin to the Heritrix crawler that enables it to use a conventional web browser as a link extractor. This is how both the Internet Archive's Umbra and the British Library's PhantomJS systems work, although with slight but important differences. This investigation focuses on Umbra.

[Figure: umbra_architecture]
