We started our second broad crawl for 2020 on 20 August; the first step, with a byte limit of 50 MB, finished on 2 September. On 21 August we started the separate crawl of ultra-big sites, which is still running.
We have to decide whether to stop the event crawl on Corona in Denmark; opinions on this issue differ.
Everything is prepared for the French trainee: we signed a contract and he will start on 28 September. He wants to work on visualization of data and Netarchive.
We have started a collaboration with the IT University of Copenhagen: students taking a course on project work and communication for software developers will work with us on several special challenges.
We are trying to solve various technical issues; we became aware of most of them through emails from people dealing with certain web sites. These issues are, for example:
We are going to look at the new features in BCWeb on an installation in a test environment.
After the upgrade of NAS and Heritrix in June, we observed the evolution of the QA indicators by comparing similar jobs run before and after the upgrade. The findings are positive: for the same job type, we crawl more URLs with fewer 404 errors with the new version. The improvement is particularly significant for image files, where the number of crawled images grew by between 19% and 98% depending on the job type. We are very happy with this quality improvement; however, we have to cope with larger WARC files and reassess our budget estimate. Our annual broad crawl will be launched in October, and we have to adjust the parameters carefully in order to comply with the budget forecast.
The new version of BCWeb (7.3.0), with new functionalities such as duplication of records and improvements to the advanced search and to deduplication, was successfully put into production at the end of July.