

Status of the production sites


  • Broad crawl: Step 2 of the third broad crawl of 2022 has started.
  • Heritrix: Using RSS and extractor modules to get more content, e.g. Issuu PDFs.
  • Focus on paywalls and IP validation. Some news sites respond quickly, others not at all.
  • Preparing for the Besocial 2022 conference.
  • SolrWayback "live" QA is still up and running and is great for QA (although some resources are missing due to deduplication).
  • IIPC browser-based crawling project: workshop today at 16-17 CET.
  • Progress is being made on the updated JWAT for validation of WARC files.



First of all, we welcome Nola N'Diaye to our team as harvesting manager and assistant head of the digital legal deposit team. She succeeds Pascal Tanésie, who is retiring in December.

Last month, nas-preload version 9.1 and NetarchiveSuite version 7.4.1 were released. The new version of NAS includes several improvements that will be useful for monitoring the crawls: display of the compressed data size of the WARC files produced by each running job, distinction of queue types on the Progression and Queues page, a bug fix for using a regex with a backslash on Browse/Delete frontier...

We are also going to launch a test broad crawl this week. Our production crawl will be launched in October.

The crawl stemming from the LIFRANUM project, which concerns French-language digital literature websites, ended last week. 1089 seeds (websites and blogs hosted on several platforms) were harvested. We also crawled a few thousand pages of contextual content separately, with a dedicated job. The selection step was carried out with Hyphe, a web corpus curation tool based on a web crawler.

Finally, the IIPC webinar "Web Archiving the War in Ukraine" took place last Wednesday. On this occasion our colleagues Vladimir Tybin and Anaïs Crinière-Boizet presented, with Kees Teszelszky, the "War in Ukraine" IIPC collaborative collection led by the BnF and the National Library of the Netherlands.





This month, we are organizing an online workshop for the member countries of ABINIA (Association of Iberoamerican States for the Development of National Libraries of Iberoamerica). It will take place in October or November. We want to show how the Spanish web archive works through its collections, infrastructure and operation. Many Latin American countries are beginning to consider creating their own web archives, and we want to help them with their first steps.

Lately, we have had some problems harvesting Twitter. On some days we see 429 errors in the NAS report. We think this is due to the high number of accounts we are collecting, currently about 3,000, most of them weekly. We are trying to reduce that number to avoid the problem.

The National Library of Peru is interested in using NAS. Would it be possible to invite them to the next NAS meeting?



Next meetings