Details
Type: Improvement
Resolution: Unresolved
Priority: Minor
Description
Links to sitemaps can be present in robots.txt files, e.g.:

Sitemap: http://kursuskatalog.au.dk/sitemap.xml

and

User-Agent: *
Disallow: /da/error/
Disallow: /en/error/
Disallow: /image_client
Disallow: /da/soeg/
Sitemap: http://www.kb.dk/da/Sitemap.jsp
Sitemap: http://www.kb.dk/en/Sitemap.jsp
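Extracting these links is straightforward: the Sitemap directive is matched case-insensitively and may appear anywhere in the file, independent of any User-Agent group. A minimal standalone sketch (class and method names are hypothetical, not existing Heritrix3 code):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: collect the Sitemap URLs declared in a robots.txt body.
public class RobotsSitemapParser {
    public static List<String> extractSitemaps(String robotsTxt) {
        List<String> sitemaps = new ArrayList<>();
        for (String line : robotsTxt.split("\\r?\\n")) {
            String trimmed = line.trim();
            // Match the "Sitemap:" directive case-insensitively.
            if (trimmed.regionMatches(true, 0, "Sitemap:", 0, 8)) {
                String url = trimmed.substring(8).trim();
                if (!url.isEmpty()) {
                    sitemaps.add(url);
                }
            }
        }
        return sitemaps;
    }
}
```

Run against the second example above, this would return the two www.kb.dk sitemap URLs.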
We should enable Heritrix3 to extract these sitemap links and add them to the H3 frontier queues for harvesting. Once a sitemap has been harvested, it should be parsed by a SitemapExtractor and the discovered links added to the H3 queues as well.
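The parsing step the proposed SitemapExtractor would perform can be sketched as follows. A real Heritrix3 extractor module would subclass the crawler's extractor base class and operate on fetched content; this standalone sketch (with a hypothetical class name) only shows pulling the <loc> URLs out of a sitemap XML document, which works for both urlset and sitemapindex files:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical sketch of the parsing a SitemapExtractor would do:
// extract every <loc> URL from a sitemap XML document.
public class SitemapLocParser {
    public static List<String> extractLocs(String sitemapXml) throws Exception {
        List<String> urls = new ArrayList<>();
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(
                        sitemapXml.getBytes(StandardCharsets.UTF_8)));
        // <loc> elements carry the URLs in both urlset and sitemapindex files.
        NodeList locs = doc.getElementsByTagName("loc");
        for (int i = 0; i < locs.getLength(); i++) {
            urls.add(locs.item(i).getTextContent().trim());
        }
        return urls;
    }
}
```

Each returned URL would then be queued for harvesting; for a sitemapindex, the queued URLs are themselves sitemaps and would pass through the same extractor again.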
Attachments
1. Add robotsTxtExtractor | Triage | Unassigned
2. Add SitemapExtractor | Triage | Unassigned