

  • After completing our final tests of NetarchiveSuite 5.3 and Heritrix 3, we went into production and started our first crawls on March 20th!
  • The beginning of the year is also the time for writing our annual report. In 2016, we crawled 125.47 TB of data, including the largest broad crawl in our collection (90.5 TB). This year we chose to study the top-level domains (TLDs) in the broad crawl to measure the impact of including new regional TLDs in the seed list. How a TLD is used varies from one region to another (commercial purposes, public purposes, personal websites...), and the number of active websites is not proportional to the geographical area. As we did last year, we also analysed EPUB files to see whether anything has changed: their number is much the same, but the number of domains hosting them is growing. Overall, we exceeded our predictions because the average size of the harvested files increased.


  • We have been using NAS 5.3 in production for one week and have had no problems during selective crawls. Our domain crawl for 2017 will start soon.



Next meetings

  • May 9th
  • June 6th
  • July 4th
  • August 8th