  1. NetarchiveSuite
  2. NAS-1672

Count inline material as belonging to the domain it's inlined in

Details

    • SB/KB
    • Rough
      1. Install NetarchiveSuite.
      2. Find a website/domain containing several inline images. Example found: www.tilbudsugen.dk, which contains inline images from the domains googlesyndication.com and adform.net.
      3. Make sure that only one selective harvester (i.e. HIGH priority) is running.
      4. Make sure that "settings.harvester.harvesting.harvestReport.disregardSeedURLInfo" is not enabled for the remaining selective harvester (it is disabled by default).

      Make sure that in the frontier of the template used by the domain,
      "source-tag-seeds" is set to true,
      and that the queue-assignment-policy is changed from
      "dk.netarkivet.harvester.harvesting.DomainnameQueueAssignmentPolicy"
      to
      "dk.netarkivet.harvester.harvesting.SeedUriDomainnameQueueAssignmentPolicy".

      5. Harvest 100 objects from tilbudsugen.dk.
      6. Then harvest 100 objects from tilbudsugen.dk again, this time without counting the inlined objects as part of the harvest.
      7. Note the difference: in the first harvest the 4 inlined objects on the main page are added to the bytes harvested for tilbudsugen.dk (though they in fact come from a different domain), making the sum 448,238, whereas in the second harvest they are not, making the sum 418,164.
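
As a quick check of the two sums reported above, the bytes attributable to the inlined objects can be worked out directly. This is only arithmetic on the figures in this issue; the variable names are illustrative, and the per-object average assumes that all of the difference comes from the 4 inlined objects.

```python
# Byte sums reported for tilbudsugen.dk in the two harvests above.
sum_with_inlines = 448_238     # inlined objects counted towards tilbudsugen.dk
sum_without_inlines = 418_164  # inlined objects not counted

# Bytes that came from the inlined objects on the main page.
inline_bytes = sum_with_inlines - sum_without_inlines

# Average size per inlined object, assuming the whole difference is due
# to the 4 inlined objects mentioned in the note.
inline_objects = 4
average_inline_bytes = inline_bytes / inline_objects

print(inline_bytes, average_inline_bytes)
```

The difference is what would otherwise go unaccounted for in the byte count of tilbudsugen.dk.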

    Description

      This feature was requested at the NetarchiveSuite workshop in September 2007:
      A crawl can greatly exceed its limits, because objects from other domains are not counted towards the domain size limits even when they are inline images. When another domain's items are downloaded as inline images, they should be counted towards the limit of the domain they are inlined in, both in Heritrix and in the historical info. Essentially, no downloaded material should remain unaccounted for.
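
The requested accounting can be illustrated with a minimal sketch. It is not NetarchiveSuite or Heritrix code: the record layout, the function names, the host names and the object sizes are all hypothetical, and `domain_of` is a deliberately naive stand-in (last two host labels) for real domain-name parsing.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def domain_of(url):
    """Naive domain guess: the last two labels of the host name (an assumption)."""
    host = urlsplit(url).hostname or ""
    return ".".join(host.split(".")[-2:])


def bytes_per_domain(records, count_by_seed):
    """Sum object sizes per domain.

    With count_by_seed=True each object is charged to the domain of the seed
    it was discovered from; otherwise to the domain of its own URL.
    """
    totals = defaultdict(int)
    for url, seed_url, size in records:
        totals[domain_of(seed_url if count_by_seed else url)] += size
    return dict(totals)


# Hypothetical records: (object URL, seed URL, size in bytes).
seed = "http://www.tilbudsugen.dk/"
records = [
    ("http://www.tilbudsugen.dk/", seed, 1000),
    ("http://www.tilbudsugen.dk/logo.png", seed, 500),
    ("http://pagead2.googlesyndication.com/ad.gif", seed, 300),
    ("http://track.adform.net/pixel.gif", seed, 200),
]

by_seed = bytes_per_domain(records, count_by_seed=True)
by_own = bytes_per_domain(records, count_by_seed=False)
print(by_seed)
print(by_own)
```

When counting by seed, all four objects are charged to tilbudsugen.dk; when counting by the object's own domain, the two inlined objects are charged to googlesyndication.com and adform.net instead, and so do not count towards the tilbudsugen.dk limit.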

          People

            svc Søren Vejrup Carlsen (Inactive)
            lars lars [X] (Inactive)
            Mikis Seth Sørensen (Inactive)
            Watchers: 0

            Dates

              Created:
              Updated:
              Resolved:

              Time Tracking

                Estimated: 21h
                Remaining: 11h
                Logged: 10h