
Installing Heritrix 3

Clone the heritrix3 source from git.
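
For example, assuming you are working from the Internet Archive's repository on GitHub:

git clone https://github.com/internetarchive/heritrix3.git
cd heritrix3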

Build using Java 6 as JAVA_HOME:

 

mvn -DskipTests package

 

Then unpack the heritrix distribution from dist/target/heritrix-3.3.0-SNAPSHOT-dist.tar.gz.
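
A sketch, assuming you want the distribution under ~/heritrix-3/heritrix, which is the layout the cp command further down this page expects (the tarball's top-level directory name is an assumption; adjust it if yours differs):

# unpack the freshly built distribution and move it into place
mkdir -p ~/heritrix-3
tar xzf dist/target/heritrix-3.3.0-SNAPSHOT-dist.tar.gz -C ~/heritrix-3
mv ~/heritrix-3/heritrix-3.3.0-SNAPSHOT ~/heritrix-3/heritrix

Heritrix 3 is then started with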

 

heritrix/bin/heritrix -a admin:admin
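
The web UI should then be reachable at https://localhost:8443, using the admin:admin credentials passed with -a above.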

 

Building Contrib

Change directory to heritrix3/contrib and run

 

mvn -DskipTests package

 

Copy target/heritrix-contrib-3.3.0-SNAPSHOT.jar into the lib directory of the heritrix distribution. Also copy the AMQP client library from your local maven repository into the same lib directory, with something like

 

cp ~/.m2/repository/com/rabbitmq/amqp-client/3.2.1/amqp-client-3.2.1.jar ~/heritrix-3/heritrix/lib/
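
The contrib jar is copied the same way; assuming you are still in heritrix3/contrib and the distribution was unpacked under ~/heritrix-3 as above:

cp target/heritrix-contrib-3.3.0-SNAPSHOT.jar ~/heritrix-3/heritrix/lib/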

Using Umbra

To enable umbra in a crawl job you need to do two things:

  1. Create the publisher bean and reference it from the processors list of the fetchProcessors bean (see the consolidated sketch after this list):

     <bean id="umbraBean" class="org.archive.modules.AMQPPublishProcessor">
       <property name="clientId" value="requests"/>
     </bean>
     ...
     <ref bean="extractorSwf"/>
     <ref bean="umbraBean"/>
  2. Add the listener (receiver) bean at the top level of the crawler beans file: 

     <bean class="org.archive.crawler.frontier.AMQPUrlReceiver"/>
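
Putting the two together, the publisher ref from item 1 ends up inside the processors list of the existing fetchProcessors chain in your crawler beans file, roughly like this (the surrounding chain shown here is the stock one from the default profile, trimmed for brevity):

<bean id="fetchProcessors" class="org.archive.modules.FetchChain">
  <property name="processors">
    <list>
      <!-- ... the existing processors (preselector, fetchHttp, the extractors, ...) ... -->
      <ref bean="extractorSwf"/>
      <ref bean="umbraBean"/>
    </list>
  </property>
</bean>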

There is very little in umbra which is actually configurable; as far as I can see, only the names of the queues. This might be useful if you are running multiple heritrix instances sharing the same broker.

Things to Think About

  • Don't forget to set the "clientId" property as shown above. It is arguably a bug that this property is not set by default to be consistent with the receiver's default configuration.
  • Heritrix sends every queued http and https link to umbra, except for robots.txt urls and urls that were themselves received from umbra.
  • Urls are received from umbra asynchronously and put directly into the frontier. That means they have no discovery path and just get an "I". This also seems to mean that they are not subject to heritrix's normal scoping rules, though we have not confirmed this.
  • Urls received from umbra are marked in the crawl log with the string "receivedFromAMQP" so you can identify them.
  • Because the communication is asynchronous, there can still be urls left on the queue after the job has finished. Remember to drain the queue

    drain-queue

    before running the next harvest.
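
    If the drain-queue helper is not at hand, the queue can also be purged directly on the broker. A sketch using rabbitmqadmin, assuming a default RabbitMQ setup; the queue name here is an assumption and must match whatever your publisher and receiver are configured to use:

    # purge any leftover urls from the named queue on the broker
    rabbitmqadmin purge queue name=requests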
     

And Then ...

And then we ran some harvest jobs using umbra, on various interesting sites including twitter, facebook, youtube, and bt.dk. Running harvests takes more time than developing code or learning APIs, so this part of the investigation was unfortunately cut rather short. In our specific harvests we found very few cases where umbra identified urls which were not also found by the heritrix extractor. The only ones we were reasonably sure about were some JSON queries on twitter. This should not be taken as a criticism of umbra itself, but rather as an indication of our own inexperience with it. One possible reason we did not achieve much with umbra is that, because of time constraints, all our test harvests were run with quite small overall harvesting budgets. So perhaps we simply filled our budget with heritrix-extracted urls before umbra could really come into play.

More generally, as with any crawl-engineering problem, effective use of umbra requires both extensive experience and knowledge-sharing with other users. We hope that the information presented here will encourage other web archives to experiment with umbra and other browser-based link-identification and extraction tools, so that we can continue to build our knowledge base in the web-archiving community.


2 Comments

  1. Noah Levitt followed up on this with the following on the archive-crawler mailing list:

    Hello Colin, very glad you were able to get umbra to work. It looks
    like you managed to figure out the tricky parts, and document them,
    before I had a chance to do it myself! Thank you for that. At some
    point I'd like to get some of your insights into the umbra readme.
    Anyone is welcome to accelerate that process by sending a pull
    request. 
    
    Also want to briefly discuss your results:
    https://sbforge.org/display/NAS/Integrating+Umbra+And+Heritrix+3
    "And then we ran some harvest jobs using umbra, on various interesting
    sites including twitter, facebook, youtube, and bt.dk. Running
    harvests takes more time than developing code or learning APIs so this
    part of the investigation was unfortunately cut rather short. In our
    specific harvests we found very few cases where umbra identified urls
    which were not also found by the heritrix extractor. The only ones we
    were reasonably sure about were some JSON queries in twittter."
    
    As you probably know, umbra runs a "behavior" on each web page that it
    loads. These behaviors can be customized for different urls, and they
    can do things like scroll through the page, click on different
    elements, etc. So far we have only developed a handful of customized
    behaviors. Of the 4 sites you mentioned, only facebook has a custom
    behavior in umbra at this point. That does mean that the facebook
    pages you harvested should have benefited from umbra. For urls without
    an assigned custom behavior, umbra falls back on a default behavior.
    The main thing the default behavior does is scroll to the bottom of
    the page, and keep scrolling if more content is loaded. The twitter
    json queries were probably a result of that scrolling.
    https://github.com/internetarchive/umbra/tree/master/umbra/behaviors.d
    Pull requests for additional behaviors are also very welcome.