Using IIPC Heritrix 3 in NetarchiveSuite: feedback on BnF tests and analysis (see presentation)

Status of the production sites



Summer activities and plans

• We ran a mini event harvest on Trump's plan to buy Greenland from the Danish Queen, focusing on Twitter activity and reactions in foreign media.
• After our 2nd broad crawl for 2019, which finished in May, we reworked the results and changed some configurations. We started our 3rd broad crawl on 1 September; step 1 has a limit of 50 MB and step 2 will have a limit of 16 GB. Simultaneously with step 1, we started runs of "ultra big sites", "OAI-extraction (research databases)", "municipalities and regions", "ministries and administrative bodies", and YouTube videos.
• One of the most important problems we observed with the selective crawls is JavaScript lazy loading (js-lazy-load): images are not displayed or, worse, not even captured.

From our ongoing projects:
• We are looking forward to implementing the new features for BCWeb, so that we can continue building an external community to help us with the collection work using BCWeb.
• We are going to rethink our collection strategy within the framework of the general collection strategy for digital cultural heritage.
• We are investigating a solution with only one online copy of Netarchive.
• Hopefully we will soon be allocated more IT resources, so that we can continue implementing browser-based harvesting in our production system. Umbra is still not fully in place.
• There are still some issues to be solved before we can implement SolrWayback in our frontend, especially legal issues in connection with GDPR.



During the summer, we continued the preparation of our broad crawl. We ran an HTTP test on circa 5 million seed URLs and identified 90 unwanted websites (hosting, ISP, parking, and domain name registration websites), which will enable us to exclude 187,400 domains from our seed list.
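
As an illustration only, the seed-list cleanup described above could be sketched as follows. This is a minimal sketch, not the procedure BnF actually used: it assumes the HTTP test yields, for each seed, the final URL reached after redirects, and that a seed is excluded when that final URL lands on one of the identified unwanted hosts. The names `final_host`, `filter_seeds`, `http_results`, and `unwanted_hosts` are hypothetical.

```python
from urllib.parse import urlsplit


def final_host(url):
    """Return the lower-cased host of a URL, without a leading 'www.'."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def filter_seeds(http_results, unwanted_hosts):
    """Split seeds into (kept, excluded) lists.

    http_results: dict mapping seed URL -> final URL after redirects
                  (hypothetical output format of the HTTP test).
    unwanted_hosts: set of hosts identified as hosting, ISP, parking,
                    or domain name registration sites.
    """
    kept, excluded = [], []
    for seed, final_url in http_results.items():
        host = final_host(final_url)
        # Exclude the seed if it ends up on an unwanted host or one of its subdomains.
        if any(host == u or host.endswith("." + u) for u in unwanted_hosts):
            excluded.append(seed)
        else:
            kept.append(seed)
    return kept, excluded
```

With a structure like this, a short list of unwanted hosts can account for a much larger number of excluded seed domains, since many seeds may resolve to the same parking or registrar site.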

Our bandwidth was increased to 1.5 GB along with a general increase of BnF bandwidth. We are running tests to find the best compromise with our infrastructure (CPU, memory).

We are still working on the new version of BCWeb and are now working on the administrator pages.

We upgraded OpenWayback to the latest version, 2.4.2, which was released in May 2019.



We are currently running the first stage of our yearly domain crawl and are in the last third of that stage. After exchanging our hardware (old PCs with weaker CPUs but more RAM), we are still experiencing the problem, but not very often.

This year we also plan to run a second stage, which was not possible last year because of our storage limits. To make this possible every year, we need to negotiate for more storage. In preparation for our yearly budget discussion, we collected information about the last domain crawl in Denmark (the number of .dk domains is similar to that of .at). Thanks to Tue, who provided this information, which will hopefully help us get more storage in the future.