Corona event harvest
Allan Christophersen <ALCH@kb.dk> has joined as a project employee and works on Netarkivet 20% of his time.
http://webadmin.oszk.hu/solrwayback/ (Hungarian Archive)
Our annual broad crawl ended on 7 November. It lasted 32 days, ran 1037 jobs, and crawled 2.455 billion URLs, for a total size of 117.59 TB (compressed).
The French newspaper Liberation contacted our team to inform us that its blog platform (https://www.liberation.fr/blogs,26) would close during December. The platform hosts more than 300 blogs. We launched an emergency crawl last week to capture and preserve them.
We are working on full-text indexing (with Solr) of our COVID-19 crawl, which ran from February to July 2020 and covers the first wave of the pandemic. The collection is about 15 TB (compressed). It will go into production in December and will be available to readers through the Archives de l'internet Labs interface.