  • Event harvest: harvesting of the shooting in the Fields shopping mall started very soon after the incident on Sunday evening.
    • Using NAS/Heritrix, Twitter API and 
  • Step 2 of the second broad crawl of 2022 is around half finished.
  • SolrWayback "live" QA is still up and running and works very well for QA.
  • IIPC browser-based crawling project
    • We have an update meeting tonight and have had input during the IIPC GA-sessions.
    • Lots of user input from the Netarkivet team (curators, engineers and more) to the Google Doc.
    • Great possibilities
    • Playback is important: browsers are playing a bigger role with more advanced crawling and playback. As Kris put it: advanced crawling needs advanced playback.
  • Work on an updated JWAT for validation of WARC files is ongoing.
  • Talks with the Norwegian web archive Nettarkivet: they use a browser-based crawler they made themselves, called Veidemann. They are looking into SolrWayback for search/discovery (and maybe playback).



Last week, we launched our "Auction house" crawl, which covers the websites of French auction houses. About 200 websites were selected. Last year, we were blacklisted by large auction sites, so this time we set up a specific harvest system for the hosts where many of the websites are hosted. Before starting the harvest, we added filters on all the other jobs in progress, and we created a special queue management that groups the URLs of all hosts belonging to a website into one particular queue. This avoids sending too many requests at the same time and limits the harvest to 100 000 URLs per website.
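
The queue grouping described above can be illustrated with a minimal Python sketch. It is not Heritrix or NetarchiveSuite code and not the actual configuration used for this crawl. The host names, the `HOST_TO_SITE` mapping and the `SiteQueues` class are hypothetical, introduced only to show the idea: every host of a website maps to one queue key, and each queue stops accepting URLs once the per-website budget is reached. Deduplication and politeness delays are left out.

```python
from collections import defaultdict, deque
from urllib.parse import urlsplit

# Hypothetical mapping from host name to the website it belongs to.
HOST_TO_SITE = {
    "www.auction-example.fr": "auction-example",
    "images.auction-example.fr": "auction-example",
    "catalogue.auction-example.fr": "auction-example",
}

SITE_BUDGET = 100_000  # per-website URL limit mentioned in the text


class SiteQueues:
    """One queue per website instead of one queue per host (illustrative only)."""

    def __init__(self, host_to_site, budget=SITE_BUDGET):
        self.host_to_site = host_to_site
        self.budget = budget
        self.queues = defaultdict(deque)   # queue key -> pending URLs
        self.scheduled = defaultdict(int)  # queue key -> URLs accepted so far

    def queue_key(self, url):
        # Hosts of a known website share its key; unknown hosts keep their own queue.
        host = (urlsplit(url).hostname or "").lower()
        return self.host_to_site.get(host, host)

    def schedule(self, url):
        # Refuse the URL once the website has used up its budget.
        key = self.queue_key(url)
        if self.scheduled[key] >= self.budget:
            return False
        self.queues[key].append(url)
        self.scheduled[key] += 1
        return True

    def next_url(self, key):
        # A crawler working one queue at a time sends one request per website at a time.
        queue = self.queues[key]
        return queue.popleft() if queue else None
```

Because all hosts of a website share a single queue, a crawler that serves each queue sequentially never hits the same website with parallel requests, and the budget counter applies to the website as a whole rather than to each host separately.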

The LIFRANUM crawl, carried out in partnership with researchers from the Jean Moulin University Lyon 3 and the Lumière University Lyon 2, is about to be launched.
The project aims to identify and map the corpus of French-language digital literature (sites, blogs, social networks). About 1100 sites will be crawled for this harvest with a specific budget of 15 000 URLs. The harvest should last about one to two weeks.

Finally, we are continuing the preparations for our 2022 broad crawl.