  • Our third broad crawl ran from 13 to 25 September with a budget of 10 MB per domain, that is, our usual step 1. Due to unresolved problems with H3, we will not be able to run step 2 (which normally has a budget of 100 MB).
  • We are preparing the event crawl of the local and regional elections on 21 November. As our selective crawls already cover the news media side of the elections, we will exclude news media from the event crawl. We discussed using the last broad crawl of 2017 as a backup for the event crawl by starting it just after election day, but since we will not be able to run an in-depth broad crawl before spring 2018, this is not an option. The focus will be on social media (Twitter, Facebook, YouTube), NGOs, companies, and other stakeholders.
  • We hope to get hints and help from the two-day social media workshop. The first day was very fruitful: it focused on how to identify relevant profiles, content, etc. on Twitter and Facebook. The second day will cover capturing content (APIs, etc.). After the second day, on Monday, I will provide any information that could be useful for you.
  • We have implemented BCWeb in our production system and intend to use it for the election event crawl. However, some questions remain open: in particular, the transfer in connection with our new way of building configurations (which do not include hops) is a major issue still to be solved.
  • We have started testing BnF's NAS preload tool for activating/deactivating domains and cleaning up their seeds for the broad crawls.
  • Our Webdanica project (automatically finding Danish content on TLDs other than .dk by capturing outlinks from domains archived in Netarchive) is almost ready to go into production. If you have any questions about this project, Stephen or Tue will be able to say more at our next meeting on Tuesday.
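The first step of the outlink-capture idea behind Webdanica can be illustrated with a small sketch: from the outlinks extracted from archived pages, keep only those pointing outside the home TLD as candidates for further analysis. This is a hypothetical illustration, not the project's actual code; the function name and the simple host-suffix check are assumptions, and the real project applies further heuristics before accepting a candidate as Danish content.

```python
from urllib.parse import urlparse

def candidate_outlinks(outlinks, home_tld=".dk"):
    """Hypothetical sketch: collect outlinks whose host lies outside the
    home TLD. These are only *candidates*; deciding whether they are
    actually Danish content requires further analysis downstream."""
    candidates = set()
    for url in outlinks:
        host = urlparse(url).hostname or ""
        if host and not host.endswith(home_tld):
            candidates.add(url)
    return sorted(candidates)

outlinks = [
    "http://example.dk/page",      # already in the home TLD -> skipped
    "http://danskforening.eu/",    # non-.dk host -> candidate
    "http://example.com/dk/news",  # non-.dk host -> candidate
]
print(candidate_outlinks(outlinks))
# → ['http://danskforening.eu/', 'http://example.com/dk/news']
```

The candidate list would then feed whatever content-analysis step decides which non-.dk pages belong in the Danish web archive.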


  • Our 2017 broad crawl was launched on 16 October. So far we have encountered no major problems; both H3 and the new infrastructure are functioning correctly. We are keeping a close watch on the volume of data collected to ensure that we stay within our storage budget: we are harvesting 1,500 URLs per domain.