Our 2021 broad crawl was launched on the 11th of October. The chosen settings are 2100 URLs per domain, with a limit of 3 days per job. The crawl is due to finish in the middle of November and the budget should be around 112-115 TB.
At the start of the broad crawl, our jobs ran very slowly because several million URLs had been discovered.
Some of our seeds redirect to a location like "http://fr/" or "http://com/". Heritrix treated "fr" and "com" as domains and added all the .fr and .com sites to the queue (a fix is in progress in Heritrix: https://github.com/kris-sigur/heritrix3/commit/69b023199d3ad176b83c7e6d7dbb793c7a8adf66).
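To illustrate the kind of redirect target involved, here is a minimal Python sketch that flags URLs whose host is a single bare label such as "fr" or "com". It is not the Heritrix fix linked above, and it is not how Heritrix scopes its queues; the function names `is_bare_label_host` and `filter_redirect_targets` are hypothetical and only serve the example.

```python
from urllib.parse import urlsplit


def is_bare_label_host(url: str) -> bool:
    """Return True when the URL's host is a single label, e.g. "fr" or "com".

    Assumption for this sketch: any host without a dot is suspicious.
    This also flags names such as "localhost", which is acceptable here
    because a broad crawl of the public web should not follow them either.
    """
    host = urlsplit(url).hostname
    if not host:
        return False
    host = host.rstrip(".")  # tolerate a trailing root dot, as in "fr."
    if not host or ":" in host:  # empty host or IPv6 literal: not a bare label
        return False
    return "." not in host


def filter_redirect_targets(urls):
    """Keep only the URLs whose host is not a bare label."""
    return [u for u in urls if not is_bare_label_host(u)]
```

A check like this could be run over the seed list's redirect targets before a crawl starts, so that problematic seeds are spotted before they reach the queue.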
The BnF DataLab was opened on the 18th of October. It is a research assistance and support service set up by the BnF in partnership with the TGIR Huma-Num. The DataLab is intended for researchers who want to work on digital collections of the BnF.
The digital legal deposit team gave a presentation about web archives on this occasion.
Moreover, a research project relating to web archives has been selected from among nearly 20 responses to an earlier call for proposals launched by the BnF DataLab. This project, led by Valerie Schafer, is called "Buzz F, a history of online virality". Its purpose is to reconstruct fleeting phenomena of online virality from traces found in the archives.
A new access point to our "Archives de l’internet" will open at the Champs Libres Library in Rennes on the 18th of November. It is the 21st access point (out of 26) to be opened in public libraries.
Finally, we will also organize a webinar about regional web harvests on the 9th of November. Up to now, three regional crawls have been launched each year (Alsace, Lorraine and Languedoc-Roussillon). The aim is to share experience of these harvests and to develop new crawls with the other regions.