Do you treat certain types of web sites/domains as uninteresting to harvest, and limit their budget or reduce the harvest in other ways? If yes:
We would like to avoid the very large number of web sites containing huge product catalogues, often with lots of images for each product. Are there ways to find and avoid or limit them in some (semi-)automatic way?
(Also on the wish list: once you have identified such a site, a way to harvest a specified proportion of it, e.g. 1 %, randomly selected from a representative selection of different types of pages … :) )
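To make the wish concrete, here is a minimal sketch of what "a proportion, spread over page types" could look like. It assumes you already have a list of the site's URLs (for example from a sitemap or an earlier crawl log) and it is not a Heritrix feature. The names `page_type` and `sample_site` are hypothetical, and "page type" is approximated very crudely by the first path segment plus whether a query string is present.

```python
import random
from collections import defaultdict
from urllib.parse import urlsplit


def page_type(url: str) -> str:
    """Crude page-type key: first path segment, plus '?' if a query is present."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    first = segments[0] if segments else ""
    return f"/{first}{'?' if parts.query else ''}"


def sample_site(urls, proportion=0.01, seed=0):
    """Pick roughly `proportion` of the URLs, but at least one per page type."""
    rng = random.Random(seed)  # fixed seed so a selection can be reproduced
    groups = defaultdict(list)
    for url in urls:
        groups[page_type(url)].append(url)

    chosen = []
    for key in sorted(groups):
        members = groups[key]
        # Small groups would round down to zero, so keep a minimum of one.
        k = max(1, round(len(members) * proportion))
        chosen.extend(rng.sample(members, k))
    return chosen
```

The resulting list could then be fed to the crawler as seeds with a very tight scope. Sampling per group rather than over the whole list is what keeps rare page types (about pages, contact pages) from being drowned out by the product pages.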
A related side-track is the more complicated crawler traps that often show up on these (and other) sites, e.g. infinite loops of kinds that Heritrix can’t detect (a/b/c/a/b/c, pages referring to themselves with extra parameters, etc.). Any hints?
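For illustration, the two trap patterns mentioned above can be described by simple URL heuristics. This is only a sketch for post-processing a crawl log or a queue dump, not Heritrix configuration. The names `has_path_cycle` and `QueryVariantCounter` are hypothetical, and the thresholds are guesses that would need tuning per site.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def has_path_cycle(url: str, min_repeats: int = 2) -> bool:
    """True if some run of path segments occurs `min_repeats` times back to back,
    e.g. /a/b/c/a/b/c."""
    segs = [s for s in urlsplit(url).path.split("/") if s]
    n = len(segs)
    for size in range(1, n // min_repeats + 1):
        for start in range(0, n - size * min_repeats + 1):
            block = segs[start:start + size]
            if all(
                segs[start + i * size:start + (i + 1) * size] == block
                for i in range(1, min_repeats)
            ):
                return True
    return False


class QueryVariantCounter:
    """Flags a page once it has been seen with too many different query strings."""

    def __init__(self, max_variants: int = 50):
        self.max_variants = max_variants
        self._seen = defaultdict(set)  # (host, path) -> set of query strings

    def seen_too_often(self, url: str) -> bool:
        parts = urlsplit(url)
        variants = self._seen[(parts.netloc, parts.path)]
        variants.add(parts.query)
        return len(variants) > self.max_variants
```

Both checks can give false positives (a legitimate path like /news/news/, or a search page with many genuinely different results), so they are probably better suited to flagging hosts for manual review than to rejecting URLs outright.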
- October 6, 2020
- November 3, 2020
- December 8, 2020
- January 5, 2021