  1. NetarchiveSuite
  2. NAS-2761

Get Umbra + Heritrix Installation up and Running Locally

Details

    • New Feature
    • Resolution: Fixed
    • Minor
    • None
    • None
    • Netarkiv
    • Umbra Sprint 1

    Description

      This is the procedure that worked for kaah. It was performed using IntelliJ IDEA.

      Heritrix

      Installation

      Clone the source code from https://github.com/netarchivesuite/heritrix3.git, then check out the BrowserBased git branch.
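
      On the command line, the clone and branch selection could look roughly like the sketch below. The function name fetch_heritrix and the default target directory heritrix3 are assumptions made for illustration; only the repository URL and the branch name come from this issue.

```shell
# Clone the NetarchiveSuite Heritrix fork and switch to the BrowserBased branch.
# The target directory (default: heritrix3) is an assumption, not from the issue.
fetch_heritrix() {
    target_dir="${1:-heritrix3}"
    git clone https://github.com/netarchivesuite/heritrix3.git "$target_dir" &&
        git -C "$target_dir" checkout BrowserBased
}
```

      Calling fetch_heritrix ~/Code/Repository/heritrix3 would then place the checkout in that directory, which can be opened in IntelliJ afterwards.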

      Next, open the Maven Projects tool window, select Reimport All Maven Projects, and then choose Rebuild Project. In my case this produced a number of errors. I resolved them by closing the project, deleting the .idea folder, and opening the project again. After choosing Rebuild Project a second time there were no more errors.

      Then right-click engine/src/main/java/org/archive/crawler/Heritrix.java and choose Run Heritrix.main().

      The first run fails with an error saying that a valid username/password must be given. To fix this, edit the run configuration: enter “-Dheritrix.development” in VM options and “-a admin:admin -l dist/src/main/conf/logging.properties” in Program arguments.

      When the code is run again, Heritrix starts.

      Then open https://localhost:8443/ in a web browser to access the Heritrix GUI.
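
      To check from a terminal that the GUI is answering, something like the following sketch may help. It assumes that curl is installed, that Heritrix serves a self-signed certificate (hence -k), and that the admin:admin credentials from the run configuration above are in use. The function name heritrix_http_status is hypothetical.

```shell
# Print the HTTP status code returned by the local Heritrix GUI.
# -k          accept the (assumed) self-signed certificate
# --anyauth   let curl pick whichever auth scheme the server asks for
# --location  follow redirects from the root URL
heritrix_http_status() {
    credentials="${1:-admin:admin}"
    curl -k -s -o /dev/null -w '%{http_code}' \
         --anyauth -u "$credentials" --location \
         https://localhost:8443/
}
```

      If Heritrix is running and the credentials match, heritrix_http_status should print a success code; a connection error suggests that Heritrix did not start.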

       

      Further information

      https://webarchive.jira.com/wiki/spaces/Heritrix/pages/5735610/A+Quick+Guide+to+Running+Your+First+Crawl+Job

      and

      http://crawler.archive.org/articles/user_manual/usecases.html

      and

      https://people.emich.edu/csperlic/big_data/heritrix_quickstart/crawler_configuration_details.html

       

      Heritrix 3 + Umbra

      https://sbforge.org/display/NAS/Getting+Started+With+Umbra explains how to install Umbra on Ubuntu. The command is:

      sudo pip3 install git+https://github.com/internetarchive/umbra.git

      A tar.gz file is unpacked with: “tar xfz <filename>.tar.gz”

      In my case I had to enter “cp ~/.m2/repository/com/rabbitmq/amqp-client/3.2.1/amqp-client-3.2.1.jar ~/Code/Repository/heritrix3/heritrix-3.3.0-SNAPSHOT/lib” instead of “cp ~/.m2/repository/com/rabbitmq/amqp-client/3.2.1/amqp-client-3.2.1.jar ~/heritrix-3/heritrix/lib/”.
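
      Since the Heritrix lib directory differs between installations, the copy step can be wrapped so that the directory is passed in explicitly. This is only a sketch: the function name install_amqp_client is made up for illustration, while the jar path is the one quoted above.

```shell
# Copy the RabbitMQ amqp-client jar from the local Maven repository
# into the Heritrix lib directory given as the first argument.
install_amqp_client() {
    lib_dir="$1"
    jar="$HOME/.m2/repository/com/rabbitmq/amqp-client/3.2.1/amqp-client-3.2.1.jar"
    if [ ! -f "$jar" ]; then
        echo "amqp-client jar not found at $jar" >&2
        return 1
    fi
    if [ ! -d "$lib_dir" ]; then
        echo "Heritrix lib directory not found: $lib_dir" >&2
        return 1
    fi
    cp "$jar" "$lib_dir/"
}
```

      For the setup described here the call would be install_amqp_client ~/Code/Repository/heritrix3/heritrix-3.3.0-SNAPSHOT/lib.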

      After that:

      sudo apt-get install rabbitmq-server
      sudo rabbitmq-plugins enable rabbitmq_management
      rabbitmq-plugins enable rabbitmq_shovel rabbitmq_shovel_management
      sudo service rabbitmq-server restart
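
      One possible way to confirm that the three plugins ended up enabled is sketched below. It assumes that rabbitmq-plugins list accepts -e (enabled plugins only) and -m (names only), as its manual page describes, and that the command may need sudo on some systems. The function name missing_rabbitmq_plugins is hypothetical.

```shell
# Print the name of each required plugin that is NOT currently enabled.
# Prints nothing when all three plugins are enabled.
missing_rabbitmq_plugins() {
    # -e: enabled plugins only, -m: plugin names only (assumed flags)
    enabled="$(rabbitmq-plugins list -e -m)"
    for plugin in rabbitmq_management rabbitmq_shovel rabbitmq_shovel_management; do
        printf '%s\n' "$enabled" | grep -qx "$plugin" || echo "$plugin"
    done
}
```

      Empty output from missing_rabbitmq_plugins means nothing is missing; any printed name can be passed to rabbitmq-plugins enable again.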

      Problems that may occur

      If a “permission denied” error is encountered, it can be solved by first switching to a root shell:

      sudo su

      and, after the failing command has been executed there, leaving the root shell again with

      exit

       

      After these steps have been completed, a harvest is started by first starting Umbra and then Heritrix. In Heritrix, crawler-beans.cxml can now be edited to specify which domains are to be harvested. A harvest with Umbra is then run by building and launching the job.

       

      In case of a manual Umbra execution (instead of a Heritrix execution), the procedure is:

      sudo X :1

      After that, press <ctrl>+<alt>+<F7>, then run:

      export DISPLAY=:1;

      umbra -v &

      queue-url -v https://da-dk.facebook.com/larsloekke/

      or

      umbra -v &

      queue-url -v https://twitter.com/sortediamant?lang=da
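
      When several pages are to be queued in one go, the queue-url call can be put in a small loop. This is a sketch only: the function name queue_all is made up, and it assumes that umbra is already running in the background as shown above.

```shell
# Queue every URL given as an argument with Umbra's queue-url tool.
# A failure for one URL is reported on stderr and the loop continues.
queue_all() {
    for url in "$@"; do
        queue-url -v "$url" || echo "failed to queue $url" >&2
    done
}
```

      For example, queue_all https://da-dk.facebook.com/larsloekke/ "https://twitter.com/sortediamant?lang=da" queues both pages from the examples above. The quotes keep the shell from interpreting the question mark.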

       

      You can also get an idea of how it works by looking at the following crawler-beans.cxml:

      https://gist.github.com/anjackson/ce84c8aa61e6fa439e79

      Evaluation

      According to Noah’s comment in https://sbforge.org/display/NAS/Integrating+Umbra+And+Heritrix+3, explicit behavior scripts have to be provided for each URL type (for instance Facebook, Instagram, …). If a URL type isn’t in the list, the default behavior is to scroll down to the end of the page, but not to click on, for instance, images. So for each URL type that isn’t listed, JavaScript code has to be developed that implements the desired behavior. The list doesn’t contain any Danish URL types.

       

          People

            kaah Knud Åge Hansen
            csr Colin Rosenthal