Do you treat certain types of web sites or domains as uninteresting to harvest, and limit their budget or reduce the harvest in other ways? If yes:

  • Which categories of web sites?
  • How do you identify the category and find which web sites to treat specially?
  • How do you reduce the harvest there – data limit, object count limit, reject rules?

We would like to avoid the very large number of web sites containing huge product catalogues, often with lots of images on each product. But are there ways to find and avoid or limit them in some (semi-)automatic way?
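One possible semi-automatic approach is to post-process a previous crawl's log and flag hosts that look like product catalogues, then feed the flagged list back into the next crawl's budget or reject rules. The sketch below is an illustrative heuristic only: the path tokens, thresholds, and the `(url, mime_type)` record format are all assumptions, not anything Heritrix provides out of the box.

```python
from collections import defaultdict
from urllib.parse import urlsplit

# Illustrative path tokens that often mark catalogue-style sites.
CATALOGUE_TOKENS = ("/product", "/shop", "/catalog", "/item")

def flag_catalogue_hosts(records, min_urls=1000,
                         min_image_frac=0.5, min_token_frac=0.3):
    """Flag hosts whose crawled URLs look like a product catalogue.

    `records` is an iterable of (url, mime_type) pairs, e.g. parsed
    from a crawl log. A host is flagged when it has at least
    `min_urls` URLs and either a high fraction of image responses
    or a high fraction of catalogue-looking paths.
    """
    stats = defaultdict(lambda: {"total": 0, "images": 0, "tokens": 0})
    for url, mime in records:
        parts = urlsplit(url)
        s = stats[parts.hostname or ""]
        s["total"] += 1
        if mime.startswith("image/"):
            s["images"] += 1
        if any(tok in parts.path.lower() for tok in CATALOGUE_TOKENS):
            s["tokens"] += 1
    flagged = []
    for host, s in stats.items():
        if (s["total"] >= min_urls
                and (s["images"] / s["total"] >= min_image_frac
                     or s["tokens"] / s["total"] >= min_token_frac)):
            flagged.append(host)
    return flagged
```

The flagged hosts could then be given a low per-host data or object budget in the next crawl rather than rejected outright.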

(On the wish list – once you have identified such a site – would also be a way to harvest a specified proportion of it, e.g. 1 %, randomly selected from a representative selection of different types of pages.)
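The "specified proportion" part of the wish is at least sketchable: hash each URL to a stable number in [0, 1) and accept it if it falls below the requested fraction. Hashing (rather than a random draw) makes the selection deterministic, so a re-crawl keeps the same subset. This is a minimal illustration; the harder part of the wish – stratifying the sample across different page types – is not addressed here.

```python
import hashlib

def in_sample(url, fraction=0.01):
    """Accept roughly `fraction` of URLs, chosen deterministically.

    The SHA-256 digest of the URL is mapped to a value in [0, 1);
    the URL is in the sample when that value is below `fraction`.
    """
    digest = hashlib.sha256(url.encode("utf-8")).digest()
    value = int.from_bytes(digest[:8], "big") / 2**64
    return value < fraction
```

With `fraction=0.01` this yields roughly 1 % of a site's URLs, uniformly at random with respect to the hash but repeatable across crawls.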

A side-track to this is the more complicated crawler traps which often show up on these (and other) sites, e.g. infinite loops of a kind that Heritrix cannot detect (a/b/c/a/b/c, pages referring to themselves with extra parameters, etc.). Hints?
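For the a/b/c/a/b/c case specifically: Heritrix's built-in pathological-path rule catches immediate repeats like /a/a/a/, but not longer repeating cycles. A custom decide rule could generalise the check by looking for any path-segment sequence repeated more than a given number of times in a row. The sketch below shows the idea in Python; the threshold and function name are assumptions (the self-reference-with-extra-parameters trap would need a separate, query-string-based check).

```python
from urllib.parse import urlsplit

def has_path_cycle(url, max_repeats=2):
    """Return True if the URL path contains any segment sequence
    repeated more than `max_repeats` times consecutively,
    e.g. /a/b/c/a/b/c/a/b/c/ or /a/a/a/."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    n = len(segments)
    # Try every cycle length that could fit max_repeats+1 times.
    for size in range(1, n // (max_repeats + 1) + 1):
        for start in range(0, n - size * (max_repeats + 1) + 1):
            window = segments[start:start + size]
            repeats = 1
            pos = start + size
            while segments[pos:pos + size] == window:
                repeats += 1
                pos += size
            if repeats > max_repeats:
                return True
    return False
```

A URL failing this test would be rejected before scheduling; the quadratic scan is cheap at typical path lengths.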

Next meetings

  • October 6, 2020
  • November 3, 2020
  • December 8, 2020
  • January 5, 2021