NetarchiveSuite / NAS-2746

Add sitemap support in bundled Heritrix3

Details

    • Improvement
    • Resolution: Unresolved
    • Minor
    • None
    • None
    • H3-extensions
    • None

    Description

      A robots.txt file can contain links to sitemaps, for example:

      Sitemap: http://kursuskatalog.au.dk/sitemap.xml
      

      and

      User-Agent: *
      Disallow: /da/error/
      Disallow: /en/error/
      Disallow: /image_client
      Disallow: /da/soeg/
      Sitemap: http://www.kb.dk/da/Sitemap.jsp
      Sitemap: http://www.kb.dk/en/Sitemap.jsp
      

      We should enable Heritrix3 to extract these sitemap links and add them to the H3 queues for harvesting. Once a sitemap has been harvested, it should be parsed by a SiteMapExtractor, and the links it finds should be added to the H3 queues as well.
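A minimal sketch of the two steps, for discussion only. It uses only the Java standard library and does not touch any Heritrix3 classes. The class name SitemapSketch and both method names are hypothetical, and wiring the results into the H3 queues is left out, since that depends on how the extractor is integrated.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration; not part of Heritrix3 or NetarchiveSuite.
final class SitemapSketch {

    private static final String FIELD = "sitemap:";

    /** Step 1: collect the URLs from all "Sitemap:" lines of a robots.txt body. */
    static List<String> sitemapUrlsFromRobots(String robotsTxt) {
        List<String> urls = new ArrayList<>();
        for (String line : robotsTxt.split("\\r?\\n")) {
            String trimmed = line.trim();
            // Match the field name case-insensitively at the start of the line.
            if (trimmed.regionMatches(true, 0, FIELD, 0, FIELD.length())) {
                String url = trimmed.substring(FIELD.length()).trim();
                if (!url.isEmpty()) {
                    urls.add(url);
                }
            }
        }
        return urls;
    }

    /** Step 2: collect the text of every <loc> element in a harvested sitemap. */
    static List<String> locsFromSitemap(String sitemapXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            // Harvested content is untrusted, so refuse DOCTYPE declarations.
            factory.setFeature(
                "http://apache.org/xml/features/disallow-doctype-decl", true);
            Document doc = factory.newDocumentBuilder().parse(
                new ByteArrayInputStream(sitemapXml.getBytes(StandardCharsets.UTF_8)));
            // "*" matches <loc> in any namespace, so both <urlset> sitemaps
            // and <sitemapindex> files are covered.
            NodeList nodes = doc.getElementsByTagNameNS("*", "loc");
            List<String> locs = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                String loc = nodes.item(i).getTextContent().trim();
                if (!loc.isEmpty()) {
                    locs.add(loc);
                }
            }
            return locs;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse sitemap", e);
        }
    }
}
```

Applied to the second robots.txt example above, sitemapUrlsFromRobots would return the two www.kb.dk Sitemap.jsp URLs. Each URL returned by either method would then be a candidate for the H3 queues. Note that a <loc> in a sitemap index points at another sitemap rather than at a page, so those links would need to go through the same extractor again.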

          People

            Assignee: Unassigned
            Reporter: svc Søren Vejrup Carlsen (Inactive)