  1. NetarchiveSuite
  2. NAS-2620

Adding brozzler as a possible harvesting option


Details

    Description

      Brozzler Installation Guide

      Install Python and Python-pip
      sudo yum install python34
      sudo yum install python34-pip
      
      Install RethinkDB
      sudo wget https://download.rethinkdb.com/centos/6/`uname -m`/rethinkdb.repo -O /etc/yum.repos.d/rethinkdb.repo
      sudo yum install rethinkdb
      
      Install Chromium
      sudo yum install chromium-browser
      
      Install Brozzler
      pip3 install brozzler[easy]
      pip3 install brozzler[dashboard]
      pip3 install warcprox
      pip3 install pywb
      pip3 install flask
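
After the installation steps it can be useful to confirm that the command-line tools used later in this guide are actually on the PATH. The following is a minimal sketch, not part of Brozzler itself: `check_tools` is a hypothetical helper name, and the list of tools is simply the set of commands mentioned in this guide.

```shell
#!/bin/sh
# Report which of the given command-line tools can be found on PATH.
# Returns success only if none of them is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=$((missing + 1))
        fi
    done
    [ "$missing" -eq 0 ]
}

# The tool names below are the commands used elsewhere in this guide.
check_tools rethinkdb brozzler-easy brozzler-worker brozzler-new-site \
    brozzler-new-job warcprox wb-manager wayback \
    || echo "Some tools are missing; re-check the installation steps."
```

If a tool is reported as missing even though pip3 finished without errors, the pip3 script directory may not be on the PATH of the current shell.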
      

      Brozzler Start Guide

      Start RethinkDB
      rethinkdb --bind all &
      The RethinkDB web interface is then available at http://ip-address:8080
      

      Start Brozzler Automatically (Option 1)

      Start Brozzler
      brozzler-easy
      

      Start Brozzler Manually (Option 2)

      Start Warcprox
      warcprox -d <path/to/warc>
      
      Start a Brozzler Worker
      brozzler-worker
      

      Start Crawl

      Queue a Site to Crawl
      brozzler-new-site http://example.com/
      
      or a job
      brozzler-new-job job1.yml
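
The guide does not show what job1.yml contains. As a rough sketch, a minimal job file could be written and queued like this. The job id and the seed URLs are placeholders chosen for illustration, and the field names (`id`, `seeds`, `url`) should be checked against the job configuration documentation of the installed Brozzler version.

```shell
#!/bin/sh
# Write a minimal, hypothetical job file. The id and seed URLs are placeholders.
cat > job1.yml <<'EOF'
id: job1
seeds:
  - url: http://example.com/
  - url: http://example.org/
EOF

# Queue the job only if brozzler is installed on this machine.
if command -v brozzler-new-job >/dev/null 2>&1; then
    brozzler-new-job job1.yml || echo "Queueing failed; is RethinkDB running?"
else
    echo "brozzler-new-job not found; job1.yml was written but not queued."
fi
```

A job file is mainly useful when several seeds should be crawled together; for a single site, brozzler-new-site as shown above is enough.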
      

      https://github.com/internetarchive/brozzler

      View Brozzler-crawled data from WARC files

      Install PyWB
      pip3 install pywb
      
      Add new collection
      wb-manager init my_coll
      wb-manager add my_coll <path/to/warc>
      

      Note that 'my_coll' is only an example collection name; you can change it to whatever you prefer.

      Start Wayback
      wayback -p 7080
      

      When starting wayback, make sure you are in the directory where your WARC files are placed.

      Start a browser
      http://localhost:7080
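
The PyWB steps above can be strung together in a small script. This is only a sketch under stated assumptions: `run` and `replay_warcs` are hypothetical helper names, `./warcs` is a placeholder for the real WARC directory, and `DRY_RUN` defaults to 1 so that the commands are printed rather than executed. Only the `wb-manager init`, `wb-manager add` and `wayback -p` invocations are taken from this guide.

```shell
#!/bin/sh
# With DRY_RUN=1 (the default here) commands are printed, not executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Create a collection, add every WARC file found in a directory, then serve it.
replay_warcs() {
    coll="$1"
    warc_dir="$2"
    port="${3:-7080}"
    run wb-manager init "$coll"
    for warc in "$warc_dir"/*.warc "$warc_dir"/*.warc.gz; do
        [ -e "$warc" ] || continue   # skip patterns that matched nothing
        run wb-manager add "$coll" "$warc"
    done
    run wayback -p "$port"
}

# './warcs' is a placeholder path; replace it with the directory warcprox wrote to.
replay_warcs my_coll ./warcs
```

Setting DRY_RUN=0 before running the script executes the commands for real; the last one starts wayback in the foreground on the chosen port.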
      

      https://github.com/ikreymer/pywb

          People

            Unassigned Unassigned
            svc Søren Vejrup Carlsen (Inactive)