  1. NetarchiveSuite
  2. NAS-1495

inline objects should be counted on the domain-queue-account

Details

    • New Feature
    • Resolution: Duplicate
    • Major
    • None
    • 3.2
    • None

    Description

      Currently, inline objects (images, videos, etc.) are counted by the QuotaEnforcer mechanism as belonging to their own domains, each with its own limits.
      This causes a lot of material to be fetched from foreign domains. We have seen crawls with hundreds of gigabytes from "unknown" domains.
      It would be more logical to treat such inline objects as material from the originating seed domain.
      The originating seed can be carried along in the Heritrix internals by enabling <source-tag-seeds> (setting it to true) in the frontier section of the order template.
      In addition, DomainNameQueueAssignmentPolicy should put all URIs (apart from DNS requests) in the originating domain's queue instead of the URI's own domain queue.
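
      The proposed queue assignment could look roughly like the sketch below. It is a standalone illustration only: the class SeedDomainQueueKey and its methods are hypothetical and do not use the Heritrix or NetarchiveSuite APIs, the seed is passed in as a plain string standing in for the source tag, and the domain is approximated naively as the last two host labels, where a real policy would presumably rely on a proper TLD list.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

/** Hypothetical helper; not part of Heritrix or NetarchiveSuite. */
public final class SeedDomainQueueKey {

    private SeedDomainQueueKey() {}

    /** Naive domain extraction (assumption): keep the last two host labels. */
    static String domainOf(String host) {
        String lower = host.toLowerCase(Locale.ROOT);
        String[] labels = lower.split("\\.");
        if (labels.length <= 2) {
            return lower;
        }
        return labels[labels.length - 2] + "." + labels[labels.length - 1];
    }

    /** Returns the host of a URI, or null if it cannot be determined. */
    static String hostOf(String uri) {
        try {
            URI parsed = new URI(uri);
            if ("dns".equalsIgnoreCase(parsed.getScheme())) {
                // "dns:example.com" is an opaque URI; the host is the scheme-specific part.
                return parsed.getSchemeSpecificPart();
            }
            return parsed.getHost();
        } catch (URISyntaxException e) {
            return null;
        }
    }

    /**
     * Chooses the queue key for a URI.
     * DNS requests, and URIs without a known seed, stay in their own domain queue;
     * everything else goes into the queue of the originating seed's domain.
     */
    public static String queueKeyFor(String uri, String seedUri) {
        boolean isDns = uri.toLowerCase(Locale.ROOT).startsWith("dns:");
        String basis = (isDns || seedUri == null) ? uri : seedUri;
        String host = hostOf(basis);
        if (host == null) {
            // Fall back to the URI's own host if the seed could not be parsed.
            host = hostOf(uri);
        }
        return host == null ? "unknown" : domainOf(host);
    }
}
```

      With this sketch, an image fetched from a foreign host while crawling a given seed ends up in that seed's domain queue, so it would count against the seed domain's limits rather than opening a queue of its own.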

          People

            Unassigned
            bja Bjarne Andersen
            Watchers: 0

            Dates

              Created:
              Updated:
              Resolved: