Using IIPC Heritrix 3 in NetarchiveSuite: feedback on BnF tests and analysis (see presentation)
Status of the production sites
Summer activities and plans
• We ran a mini event harvest on Trump's plan to buy Greenland from the Danish Queen, focusing on Twitter activity and reactions from foreign media.
From our ongoing projects:
During the summer, we continued preparing our broad crawl. We ran an HTTP test on circa 5 million seed URLs and identified 90 unwanted websites (hosting, ISP, parking, and domain name registration websites), which will enable us to exclude 187 400 domains from our seed list.
Our bandwidth was increased to 1.5 GB along with a general increase of BnF bandwidth. We are running tests to find the best compromise with our infrastructure (CPU, memory).
We are still working on the new version of BCweb and have now reached the administrator pages.
We upgraded OpenWayback to the latest version, 2.4.2, which was released in May 2019.
We are currently running the first stage of our yearly domain crawl and are in the last third of that stage. After exchanging our hardware (older PCs with weaker CPUs but more RAM), we are still experiencing the problem described in https://sbforge.org/jira/browse/NAS-2682, but not very often.
This year we also plan to run a second stage, which was not possible last year because of our storage limits. To make this possible every year, we need to negotiate for more storage. In preparation for our yearly budget discussion, we collected information about the last domain crawl in Denmark (the number of .dk domains is similar to that of .at). Thanks to Tue, who provided us with this information, which will hopefully help us get more storage in the future.