  • Broad Crawl going great
  • In March and April we will focus on Browsertrix
  • A data dump of all text from Netarkivet is in the works for a research project building a new Danish language model.
  • Small organisational change in Copenhagen: the section manager at Digital Cultural Heritage is changing to Head of Department at the Digital Transformation department.
  • Anders visited Nettarkivet in Oslo to see their world premiere of researcher access to web archive data.
    • They used Pywb 2.6 but will soon move to the much better 2.7.x.
    • They had a prototype free-text search based on natural language extracted from HTML.
    • I showed the organisers SolrWayback. It will fulfil many of the wishes from researchers that came up during the workshop and save them development time. They need to index 1.8 PB of data, though.
    • Nettarkivet uses the browser-based crawler Veidemann for all their crawls, but I'm not sure of the scale (I will check). They have a legal deposit law but don't get a complete TLD list as KB does from DK Hostmaster.
    • They want to work together more.
  • Twitter API!



First of all, this week we are launching our first internal harvesting workshop of the year 2022. Until March 31, our team will experiment with Browsertrix on different types of websites. As part of this, we will also test the harvesting of social networks.

Following the TikTok crawl launched in 2022 on the theme of the elections, we are going to launch our first ongoing TikTok harvest this month.
So far, 198 TikTok accounts or tags have been selected.

On March 13, there will be a day of exchange on the results and future prospects of the ResPaDon project, whose aim is "to set up a network about web archives". The day will be held at the BnF and broadcast live on YouTube.