which python and make sure that the
python there points at Python 3; if it does not, change it so that it does.
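A quick way to check this from the shell is sketched below. The helper name is_python3 is made up for this example, and the sketch assumes that python --version prints a string of the form "Python 3.x.y".

```shell
# is_python3 VERSION_STRING: succeeds when the string reports a 3.x release.
is_python3() {
  case "$1" in
    "Python 3."*) return 0 ;;
    *) return 1 ;;
  esac
}

# Ask whatever `python` resolves to for its version.
# Older 2.x releases print the version on stderr, hence the 2>&1.
version="$(python --version 2>&1 || true)"

if is_python3 "$version"; then
  echo "python is Python 3: $version"
else
  echo "python is not Python 3 (got: ${version:-nothing})"
fi
```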
Then, to install Umbra, do the following.
RabbitMQ should now be reachable at http://localhost:15672 (user: guest, pass: guest).
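If you prefer to check this from a shell rather than a browser, something like the following may work. It is only a sketch: it assumes curl is installed and that the address above is served by the RabbitMQ management plugin, whose HTTP API is exposed under /api/overview. The variable names are illustrative.

```shell
# Values taken from the address and credentials mentioned above.
RABBITMQ_URL="http://localhost:15672"
RABBITMQ_USER="guest"
RABBITMQ_PASS="guest"

# /api/overview is part of the management plugin's HTTP API (assumed enabled).
if curl --silent --fail --max-time 5 \
     --user "$RABBITMQ_USER:$RABBITMQ_PASS" \
     "$RABBITMQ_URL/api/overview" > /dev/null; then
  echo "RabbitMQ management interface is reachable"
else
  echo "RabbitMQ management interface is NOT reachable at $RABBITMQ_URL"
fi
```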
Make sure the Chromium browser is installed. If it is not, install it with:
sudo apt-get install chromium-browser
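To find out whether Chromium is already present before installing anything, a check along these lines can be used. It assumes the executable has the same name as the package, chromium-browser, which may differ on your distribution. The helper name have_cmd is made up for this example.

```shell
# have_cmd NAME: succeeds when NAME resolves to something runnable on PATH.
have_cmd() {
  command -v "$1" > /dev/null 2>&1
}

if have_cmd chromium-browser; then
  echo "chromium-browser found at $(command -v chromium-browser)"
else
  echo "chromium-browser not found; install it with apt-get first"
fi
```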
Then run Umbra as follows:
If you want to see what Umbra does in the Chromium browser, just do a
If you want Umbra to run without showing the browser, start an X server on display :1:
sudo X :1 (and then press <Ctrl> <Alt> <F7>)
and in another shell, run:
export DISPLAY=:1; umbra -v &
cd heritrix-3.3.0-BDB-5.0.x-NAS-1.0-SNAPSHOT/jobs/
mkdir myTestJob
cd myTestJob
Get a Heritrix 3 Crawl Job Configuration File (like this one:
put it in
and rename it to
At the top of the file, just above the first
bean tag (not
<bean class="org.archive.crawler.frontier.AMQPUrlReceiver"/>
<bean id="umbraBean" class="org.archive.modules.AMQPPublishProcessor">
  <property name="clientId" value="requests"/>
</bean>
Also, in the same file, find
and within that bean, just under
add the line:
Then, still under the
myTestJob dir, create a text file called
seeds.txt with a single line saying:
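The file can also be created from the shell. In this sketch the seed URL is a placeholder (http://example.com/ is not the seed used in this guide), and the JOB_DIR variable is an illustrative name that defaults to the current directory, which is expected to be the myTestJob dir.

```shell
# Placeholder seed; replace it with the URL you actually want to harvest.
SEED_URL="http://example.com/"
# Directory to write into; defaults to the current directory.
JOB_DIR="${JOB_DIR:-.}"

# Write the seed as the only line of seeds.txt.
printf '%s\n' "$SEED_URL" > "$JOB_DIR/seeds.txt"

# Print the line count to confirm the file holds a single line.
wc -l < "$JOB_DIR/seeds.txt"
```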
Now, in a browser, go to https://localhost:8443/ and paste the path to the
myTestJob dir into the field under add existing job, and click the add button. "myTestJob" should appear at the bottom of the window, so click it.
On the page that appears, click the build button, and when it has built, click the launch button and reload the page until it says "Job is Finished: FINISHED".
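The build and launch clicks can probably also be scripted against the REST interface that Heritrix 3 exposes on the same port. The following is a sketch under several assumptions: the job has already been added, the login admin:admin is only a placeholder for whatever operator login your Heritrix was started with, and the helper names job_url and heritrix_action are invented for this example. The --insecure flag is there because the server presents a self-signed certificate.

```shell
HERITRIX_URL="https://localhost:8443/engine"
HERITRIX_AUTH="admin:admin"   # placeholder; use your own operator login
JOB_NAME="myTestJob"

# job_url NAME: print the REST address of a job.
job_url() {
  printf '%s/job/%s\n' "$HERITRIX_URL" "$1"
}

# heritrix_action ACTION NAME: POST an action such as build or launch to a job.
heritrix_action() {
  curl --silent --insecure --anyauth --location --max-time 10 \
       --user "$HERITRIX_AUTH" \
       --header "Accept: application/xml" \
       --data "action=$1" \
       "$(job_url "$2")"
}

heritrix_action build "$JOB_NAME"  || echo "build request failed"
heritrix_action launch "$JOB_NAME" || echo "launch request failed"
```

The XML that comes back describes the job's current state, so repeating a plain request to the same address would be the scripted counterpart of reloading the page.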
The Umbra harvests are now probably running. It is still an open question where they write the resulting files.
NOTE: Heritrix can be killed by doing a