  • 3. Broadcrawl - step 2 - done 192 of 592 jobs
  • Presenting for
  • Workshop with the Danish Agency for Digitalisation regarding all of KB's language resources (data, delivery methods etc.)
  • Great knowledge sharing with Nina Heljeback from KB-Sweden
  • Testing on site installation of Browsertrix Cloud
    • Seems fine so far (but does not work in Google Chrome at the moment)
  • Some focus on SoMe and freedom of speech/burning of the Quran etc.
  • Data delivery projects still going on - researchers getting data via SFTP - LLM (40/15TB)
  • Still focusing on paywall content.
  • Working on proposals for IIPC WAC 2024
    • Paywall sites and more
    • Maybe Browsertrix status
    • ?
    • ?
  • Update of default seeds
  • Scraping sitemaps to get more quality content
  • Ingesting Archive-It files from 2020-2023
  • Still working on ingesting files from the IA 1996-1999 .dk crawls



The digital legal deposit service welcomes a new colleague, Florence Simonet, as digital collections manager. She will work primarily on the harvests.

The end of the summer is marked by two important projects to save sites before they close.
Skyblog, which was the largest French blogging platform in the 2000s, closed to the public on August 21st. The BnF harvest began last week and covers more than 12.6 million blogs for a total of 1261 jobs.
The harvest is expected to last about 2 months and the estimated size is about 40 TB.

Furthermore, the Orange personal pages hosting service will close on September 5th. This is a website creation service linked to the telephone operator Orange. Harvesting tests will begin soon and should cover around 450 000 sites for about 12 TB of data.

As every year, we are currently preparing our upcoming broad crawl, which will be launched in October 2023.