...


Broad crawl
Still moving along slowly. We are investigating why.


Corona event harvest
The harvest has been set to daily again due to the second lockdown.

Personnel

Allan Christophersen <ALCH@kb.dk> has joined as a project employee and works on Netarkivet 20% of his time.


SolrWayback

https://github.com/netarchivesuite/solrwayback/releases/tag/4.0.5

https://github.com/netarchivesuite/solrwayback


http://webadmin.oszk.hu/solrwayback/ (Hungarian Archive)


BnF


Our annual broad crawl ended on 7 November. It lasted 32 days, ran 1037 jobs, and crawled 2.455 billion URLs, for a total size of 117.59 TB (compressed).

The French newspaper Liberation contacted our team to inform us that their blog platform (https://www.liberation.fr/blogs,26) would be closed during December. The platform hosts more than 300 blogs. We launched an emergency crawl last week to capture and preserve these blogs.

We are working on the full-text indexing (with Solr) of our COVID-19 crawl, which was performed between February and July 2020 and covers the first wave of the pandemic. This collection is about 15 TB (compressed). The new collection will go into production during December and will be available to readers through the Archives de l'internet Labs GUI.

...