After upgrading NAS and Heritrix in June, we observed how the QA indicators evolved by comparing similar jobs run before and after the upgrade. The findings are positive: for the same job type, the new version crawls more URLs with fewer 404 errors, and the improvement is particularly significant for image files, with the number of crawled images growing by between 19% and 98% depending on the job type. We are very happy with this quality improvement; however, we now have to manage larger WARC files and reassess our budget estimate. Our annual broad crawl will be launched in October, and we have to adjust its parameters carefully in order to comply with the budget forecast.
The new version of BC web (7.3.0), with new functionalities such as record duplication and improvements to the advanced search and to deduplication, was successfully put into production at the end of July.
This year the broad crawl yielded 1,930,000 web sites (around 50 terabytes of data). The number of domains has increased, but less information was published on the internet than the year before. Of all the web sites that were saved, 87 per cent were fully collected.
Meanwhile, we continue with the collection about the Coronavirus, which grows each week. It currently contains more than 4,000 web sites.
Do you treat certain types of web sites/domains as uninteresting to harvest, and limit their budget or reduce the harvest in other ways? If yes:
We would like to avoid the very large number of web sites containing huge product catalogues, often with lots of images for each product. But are there ways to find and avoid/limit them in some (semi-)automatic way?
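One semi-automatic heuristic might be to mine an earlier crawl's logs for hosts where image fetches dominate, and feed the resulting host list into per-domain budget limits. Below is a minimal sketch, assuming a whitespace-separated Heritrix crawl.log layout with the URI in field 4 and the MIME type in field 7; the thresholds are illustrative, not recommendations:

```python
from collections import defaultdict
from urllib.parse import urlparse

def image_heavy_hosts(log_lines, ratio_threshold=0.8, min_urls=1000):
    """Count fetched URLs per host and flag hosts where most fetches
    are images -- a rough signature of large product catalogues."""
    totals = defaultdict(int)
    images = defaultdict(int)
    for line in log_lines:
        fields = line.split()
        if len(fields) < 7:
            continue  # skip malformed or truncated log lines
        uri, mime = fields[3], fields[6]
        host = urlparse(uri).netloc
        totals[host] += 1
        if mime.startswith("image/"):
            images[host] += 1
    # Only flag hosts with enough URLs for the ratio to be meaningful.
    return sorted(
        host for host, n in totals.items()
        if n >= min_urls and images[host] / n >= ratio_threshold
    )
```

The flagged hosts could then be reviewed manually before lowering their byte or URL budget in the next crawl.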
(Also on the wish list – once you have identified such a site – would be a way to harvest a specified proportion of it, e.g. 1%, randomly selected among a representative selection of different types of pages … :) )
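Short of true stratified sampling by page type, a simpler approximation is to keep a fixed fraction of URLs chosen by hashing. A minimal sketch (the function name and the per-mille granularity are our own choices, not an existing crawler feature):

```python
import hashlib

def keep_for_sample(url: str, percent: float = 1.0) -> bool:
    """Deterministically keep roughly `percent`% of URLs by hashing
    each URL into a 0-9999 bucket. The same URL always gets the same
    decision, so the sample stays stable across crawl restarts."""
    h = int(hashlib.sha256(url.encode("utf-8")).hexdigest(), 16)
    return (h % 10000) < percent * 100
```

This samples uniformly over URLs rather than over page types, so a catalogue with a million product pages would still contribute mostly product pages; weighting by page type would need extra classification on top.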
A side-track to this is the more complicated crawler traps which often show up on these (and other) sites, e.g. infinite loops of types which Heritrix can’t detect (a/b/c/a/b/c, pages referring to themselves with extra parameters, etc.). Any hints?
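For the a/b/c/a/b/c case specifically, one option could be a custom DecideRule that generalises Heritrix’s built-in PathologicalPathDecideRule (which, as far as we know, only catches a single path segment repeated consecutively, like /a/a/a). A minimal Python sketch of the detection logic, which would need porting to a Java DecideRule for actual use in Heritrix:

```python
from urllib.parse import urlparse

def has_path_cycle(url: str, min_cycle: int = 1, min_repeats: int = 2) -> bool:
    """Return True if the URL path ends in a repeating cycle of
    segments, e.g. /a/b/c/a/b/c (cycle length 3, repeated twice).
    Covers the single-segment case /a/a as cycle length 1."""
    segs = [s for s in urlparse(url).path.split("/") if s]
    # Try every cycle length that could repeat min_repeats times.
    for cycle in range(min_cycle, len(segs) // min_repeats + 1):
        tail = segs[-cycle * min_repeats:]
        if all(tail[i] == tail[i % cycle] for i in range(len(tail))):
            return True
    return False
```

Pages referring to themselves with ever-growing query parameters are harder; there a limit on URL length or on the number of query parameters per URI may be the more practical guard.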