Our biggest issue during the last weeks of 2019 was a harvest problem between Christmas and New Year: there were no uploads to the Copenhagen bit archive (neither from the last broad crawl of 2019 nor from the selective crawls) from 24 December 2019 to 1 January 2020. 453 jobs failed with upload errors. About 2400 files are missing on the server, but the system did not “tell” us that there was no server space left. The minSpaceleft setup implemented before Christmas apparently did not work.
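To illustrate the kind of check that would have warned us, here is a minimal sketch of a free-space guard that runs before an upload. It is not the NetarchiveSuite implementation of minSpaceleft; the function names, the `min_space_left` parameter and the optional `free` argument are hypothetical and chosen only for this example.

```python
import shutil


def free_bytes(path):
    """Return the number of free bytes on the filesystem holding `path`."""
    return shutil.disk_usage(path).free


def ensure_space_for_upload(path, upload_bytes, min_space_left, free=None):
    """Fail loudly if the upload would leave less than `min_space_left` free.

    `free` can be passed in explicitly (for example in tests); otherwise it
    is read from the filesystem. All names here are illustrative.
    """
    if free is None:
        free = free_bytes(path)
    remaining = free - upload_bytes
    if remaining < min_space_left:
        # Raising (rather than silently skipping) is what makes the
        # shortage visible to whoever monitors the jobs.
        raise RuntimeError(
            f"Not enough space on {path}: {remaining} bytes would remain, "
            f"{min_space_left} required"
        )
    return remaining
```

A caller would run `ensure_space_for_upload(archive_dir, file_size, min_space_left)` before each transfer and treat the exception as an alert rather than an ordinary job failure.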
We will work on uploading the missing 2400 files when the broad crawl is finished. We started this broad crawl on 8 November 2019.
A power cut in Copenhagen last night has probably caused some loss, as the broad crawl had not yet finished.
News from our test environment: the newest snapshot NAS 5.7 IIPC Heritrix bundle is able to run on NAS 5.5 without problems.
A happy New Year and best wishes to all for 2020 from the BnF web archiving team!
Our broad crawl, which started on October 12th, finished on December 23rd. It represents 2.2 billion URLs and 118.17 TB of compressed data. Despite technical problems related to our infrastructure (25% of the jobs were killed by their HarvestController because Heritrix needed too much time to initialise), it took less time than last year (11 weeks in 2019). Its size exceeds our initial budget of 110 TB because the average weight per URL is higher than we estimated (57,044 bytes against an estimate of 55,421 bytes). We'll analyse the reports to understand this increase: it probably comes from changes in the websites themselves.
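As a quick sanity check on these figures, the short sketch below compares the relative increase in average weight per URL with the relative budget overrun. It uses only the numbers quoted above; the helper name `relative_increase` is ours, and the calculation is a rough illustration rather than part of the crawl reports.

```python
def relative_increase(old, new):
    """Return the relative change from `old` to `new` (0.05 means +5%)."""
    return (new - old) / old


# Average weight per URL in bytes: estimated versus observed.
weight_increase = relative_increase(55_421, 57_044)

# Compressed volume in TB: initial budget versus actual crawl size.
budget_overrun = relative_increase(110.0, 118.17)

print(f"Weight per URL: {weight_increase:.2%}")  # Weight per URL: 2.93%
print(f"Budget overrun: {budget_overrun:.2%}")   # Budget overrun: 7.43%
```

On these figures the weight per URL is about 2.9% above the estimate while the total volume is about 7.4% above budget, so the report analysis may need to look at the number of URLs collected as well as their average weight.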
We welcome a newcomer to our team: Alexandre Faye. He will be in charge of cooperation with researchers and international cooperation.