Status of the production sites

Netarkivet

  • Broad Crawl - finishing upload of step 2
  • Event harvest starting up - Kommunalvalg 2021
  • Twitter API solution/pilot project - next steps are close to being presented to management.
  • Looking into the NZ Web Archive's use of https://www.brandwatch.com/ to get Twitter, Instagram and Facebook content (including comments) via API (as .XLS or CSV, as I understood it)
  • Setting up a crawl log and Gephi system to find bottlenecks in harvests.
  • Working on curator workflow/nomination pipeline with students from IT-University of Copenhagen
  • YouTube harvesting: a contact was found via Google Denmark and emails have been sent. Awaiting their feedback.
  • Outsourcing harvest status
  • Making a presentation for the Polish Webarchive initiative  
  • IIPC to use SolrWayback for collections 

BnF

Our 2021 broad crawl was launched on the 11th of October. The chosen settings are 2100 URLs per domain, with a limit of 3 days per job. The crawl is due to finish in the middle of November and the budget should be around 112-115 TB.
At the start of the broad crawl, we had very slow jobs because of several million discovered URLs.
Some of our seeds redirect to a location like "http://fr/" or "http://com/". Heritrix considered "fr" and "com" as domains and added all the .fr or .com sites to the queue (a fix is ongoing on Heritrix: https://github.com/kris-sigur/heritrix3/commit/69b023199d3ad176b83c7e6d7dbb793c7a8adf66).
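
The redirect problem described above can also be screened for before a crawl starts. The following is a minimal sketch, not the Heritrix fix linked above: it assumes a list of (seed, redirect target) pairs has already been collected by some other means, and the function names `is_bare_label_host` and `suspicious_redirects` are hypothetical.

```python
from urllib.parse import urlsplit


def is_bare_label_host(url: str) -> bool:
    """Return True when the URL's host is a single label such as "fr" or "com"."""
    host = urlsplit(url).hostname
    if not host:
        return False
    host = host.rstrip(".")  # treat "fr." the same as "fr"
    if ":" in host or host == "localhost":
        # IPv6 literals and localhost are not the case we are looking for
        return False
    return "." not in host


def suspicious_redirects(redirects):
    """Yield (seed, target) pairs whose redirect target is a bare-label host.

    `redirects` is an iterable of (seed_url, target_url) tuples; how these
    pairs are gathered is outside the scope of this sketch.
    """
    for seed, target in redirects:
        if is_bare_label_host(target):
            yield seed, target
```

Seeds flagged this way could be reviewed or corrected by hand before the jobs are launched, so that a host like "fr" never enters the crawl scope.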

The BnF DataLab was opened on the 18th of October. It is a research assistance and support service set up by the BnF in partnership with the TGIR Huma-Num. The DataLab is intended for researchers who want to work on digital collections of the BnF.
A presentation about web archives was given by the digital legal deposit team on this occasion.
Moreover, a research project relating to web archives has been selected from nearly 20 responses to a call for proposals launched earlier by the BnF DataLab. This project, led by Valerie Schafer, is called "Buzz F, a history of online virality". Its purpose is to reconstruct fleeting phenomena of online virality from traces found in the archives.

A new access point to our "Archives de l’internet" will be opened at the Champs Libres Library in Rennes on the 18th of November. It will be the 21st access point (out of 26) opened in public libraries.

Finally, we will also organize a webinar about regional web harvests on the 9th of November. Up to now, three regional crawls have been launched each year (Alsace, Lorraine and Languedoc-Roussillon). The aim is to share experience about these harvests and to develop new crawls with the other provinces.