Details
New Feature
Resolution: Unresolved
Minor
Description
Brozzler Installation Guide
Install Python and Python-pip
sudo yum install python34
sudo yum install python34-pip
Install RethinkDB
sudo wget https://download.rethinkdb.com/centos/6/`uname -m`/rethinkdb.repo -O /etc/yum.repos.d/rethinkdb.repo
sudo yum install rethinkdb
Install Chromium
sudo yum install chromium-browser
Install Brozzler
pip3 install brozzler[easy]
pip3 install brozzler[dashboard]
pip3 install warcprox
pip3 install pywb
pip3 install flas
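After the install steps it can be useful to confirm that the commands used later in this guide are actually on the PATH. The following is a minimal sketch, not part of Brozzler itself: missing_tools is a hypothetical helper name, and the list only contains commands that appear in this guide (the Chromium binary is left out because its name varies by package).

```shell
#!/bin/sh
# missing_tools CMD...: print every command that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Commands referenced in this guide; adjust the list to your setup.
missing=$(missing_tools pip3 rethinkdb brozzler-easy brozzler-worker \
    brozzler-new-site brozzler-new-job warcprox wb-manager wayback)

if [ -n "$missing" ]; then
    echo "Not found on PATH:"
    echo "$missing"
else
    echo "All tools found."
fi
```

If a command is reported as missing right after a pip3 install, the pip script directory may not be on the PATH for the current user.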
Brozzler Start Guide
Start RethinkDB
rethinkdb --bind all &
The RethinkDB web interface is then available at http://ip-address:8080
Start Brozzler Automatically (Option 1)
Start Brozzler
brozzler-easy
Start Brozzler Manually (Option 2)
Start Warcprox
warcprox -d <path/to/warc>
Start a Brozzler Worker
brozzler-worker
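For Option 2 several long-running processes have to be kept alive at once. One possible way to manage that is sketched below, assuming a POSIX shell: start_bg is a hypothetical helper (not a Brozzler command) that runs a command in the background, sends its output to NAME.log in the current directory, and prints the PID so the process can be stopped later.

```shell
#!/bin/sh
# start_bg NAME CMD...: run CMD in the background and log to NAME.log.
start_bg() {
    name=$1
    shift
    "$@" >"$name.log" 2>&1 &
    echo "$name started with PID $!"
}

# Example for the manual start sequence (commented out so the helper
# can be sourced without launching anything):
# start_bg rethinkdb rethinkdb --bind all
# start_bg warcprox warcprox -d <path/to/warc>
# start_bg brozzler-worker brozzler-worker
```

Keeping one log file per process makes it easier to see which of the three services failed if a crawl does not start.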
Start Crawl
Queue a Site to Crawl
brozzler-new-site http://example.com/
Or Queue a Job
brozzler-new-job job1.yml
https://github.com/internetarchive/brozzler
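The job command above expects a YAML file. As a rough sketch, a minimal job1.yml could be created as follows; the id value and the seed URL are placeholders, and the layout (an id plus a seeds list of url entries) follows the job configuration format as I understand it from the Brozzler repository, so check the documentation there for the full set of options.

```shell
#!/bin/sh
# Write a minimal job file; the id and the seed URL are placeholders.
cat > job1.yml <<'EOF'
id: job1
seeds:
  - url: http://example.com/
EOF

# Queue the job once RethinkDB is running (commented out so the file
# can be reviewed first):
# brozzler-new-job job1.yml
```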
View Brozzler-Crawled Data from WARC Files
Install PyWB
pip install pywb
Add new collection
wb-manager init my_coll
wb-manager add my_coll <path/to/warc>
Note that 'my_coll' is only an example collection name; replace it with any name you prefer.
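When a crawl has produced many WARC files, adding them one at a time becomes tedious. The sketch below loops over a directory instead. COLL, WARC_DIR and list_warcs are names chosen for this example, and the script assumes the WARC files sit directly in WARC_DIR with a .warc or .warc.gz extension.

```shell
#!/bin/sh
# Collection name and WARC location; both defaults are examples.
COLL=${COLL:-my_coll}
WARC_DIR=${WARC_DIR:-.}

# list_warcs DIR: list WARC files (plain or gzipped) directly inside DIR.
list_warcs() {
    find "$1" -maxdepth 1 -type f \( -name '*.warc' -o -name '*.warc.gz' \)
}

if command -v wb-manager >/dev/null 2>&1; then
    # Create the collection, then add every WARC file found.
    wb-manager init "$COLL"
    list_warcs "$WARC_DIR" | while IFS= read -r warc; do
        wb-manager add "$COLL" "$warc"
    done
else
    echo "wb-manager not found; install pywb first" >&2
fi
```

The script could be run as, for example, COLL=my_coll WARC_DIR=<path/to/warc> sh add_warcs.sh, where add_warcs.sh is whatever file name the script is saved under.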
Start Wayback
wayback -p 7080
When starting wayback, make sure you are in the directory where your WARC files are located.
Start a browser
http://localhost:7080
Attachments
1. Define the kind of templates for brozzler to be stored in NAS | Triage | Unassigned
2. Define how our domainConfiguration class will work with Brozzler | Triage | Unassigned
3. Define how the HarvestJobManager will schedule Brozzler jobs | Triage | Unassigned
4. Define a HarvestControllerServer for Brozzler | Triage | Unassigned
5. Define how to deploy brozzler in Netarchivesuite | Triage | Unassigned