  • Step 2 of the first broad crawl will start  
    • Focus on cleaning up regular expressions
    • Limits of domains 
    • New harvester-server up and running soon
  • Outsourcing of harvesting - status: will be kept in house.
  • SolrWayback "live"-QA up and running 
  • RSS-Heritrix module tested and still needs some focus
  • The IIPC browser-based crawling project is proceeding. We have an update meeting tonight and have had input during the IIPC GA sessions.
  • A lot of data delivery for researchers at the moment.
  • Working on an updated JWAT for validation of WARC files (communicating with Nicholas and estimating the budget/finding a way to do this internally at KB)
    • Support sha256
    • Missing support for modern gzip os
    • Support for []{} in URLs
    • HTTP request headers
    • WARC 1.1 support



Last week, we had a meeting to prepare the schedule for the 2022 broad crawl.
In this context, an overhaul of nas-preload and developments concerning NAS are planned.
The registrars have been contacted and we've already got most of the lists. Two new TLDs from overseas departments and territories have been obtained. The launch of the broad crawl is scheduled for October.

The Official Publications harvest was launched last week and will run until the end of June. This harvest includes websites of ministries, public establishments, independent administrative authorities and local authorities. Nearly 900 websites have been selected.

Finally, our next Videos harvest is in preparation. We are encountering some difficulties because we have changed the metadata extraction tool. The new tool extracts many more metadata records than the previous one, and therefore many more videos to download, which raises budget issues.