Details

Type: Improvement
Resolution: Unresolved
Priority: Minor
Fix Version/s: None
Affects Version/s: None
Component/s: H3-extensions
Labels: None

Description

robots.txt files can contain links to sitemaps, e.g.

Sitemap: http://kursuskatalog.au.dk/sitemap.xml

and

User-Agent: *
Disallow: /da/error/
Disallow: /en/error/
Disallow: /image_client
Disallow: /da/soeg/
Sitemap: http://www.kb.dk/da/Sitemap.jsp
Sitemap: http://www.kb.dk/en/Sitemap.jsp
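
As a rough illustration of the first step, the Sitemap lines in the examples above could be picked out of a robots.txt body along these lines. This is a minimal standalone sketch in Python, not Heritrix3 code; the function name extract_sitemap_urls is hypothetical, and it assumes the field name is matched case-insensitively and that anything after a "#" is a comment.

```python
from typing import List


def extract_sitemap_urls(robots_txt: str) -> List[str]:
    """Return the URLs of all 'Sitemap:' lines, in order, without duplicates.

    Hypothetical helper for illustration only; not part of Heritrix3.
    """
    urls: List[str] = []
    for raw_line in robots_txt.splitlines():
        # Assumption: '#' starts a comment, as elsewhere in robots.txt.
        line = raw_line.split("#", 1)[0].strip()
        # Split on the first ':' only, since the URL itself contains ':'.
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "sitemap":
            value = value.strip()
            if value and value not in urls:
                urls.append(value)
    return urls


if __name__ == "__main__":
    sample = (
        "User-Agent: *\n"
        "Disallow: /da/soeg/\n"
        "Sitemap: http://www.kb.dk/da/Sitemap.jsp\n"
        "Sitemap: http://www.kb.dk/en/Sitemap.jsp\n"
    )
    for url in extract_sitemap_urls(sample):
        print(url)
```

Sitemap lines are independent of the User-Agent groups, so the sketch scans the whole file rather than a single group.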

We should enable Heritrix3 to extract these sitemap links and add them to the h3 queues for harvesting. Once harvested, the sitemaps should be parsed by a SiteMapExtractor, and the links found in them should also be added to the h3 queues.
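
The parsing step for a harvested sitemap could look roughly like the following. Again this is only a standalone Python sketch, not the proposed extractor itself: the function name parse_sitemap is hypothetical, and it assumes the sitemap follows the sitemaps.org XML layout, where a urlset holds url/loc entries and a sitemapindex holds sitemap/loc entries that point to further sitemaps.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple


def _local_name(tag: str) -> str:
    """Strip an XML namespace prefix such as '{http://...}loc' -> 'loc'."""
    return tag.rsplit("}", 1)[-1]


def parse_sitemap(xml_text: str) -> Tuple[List[str], List[str]]:
    """Return (page_urls, nested_sitemap_urls) found in a sitemap document.

    Hypothetical helper for illustration only; not part of Heritrix3.
    """
    root = ET.fromstring(xml_text)
    pages: List[str] = []
    nested: List[str] = []
    for entry in root:
        kind = _local_name(entry.tag)  # 'url' or 'sitemap'
        for child in entry:
            if _local_name(child.tag) != "loc" or not child.text:
                continue
            loc = child.text.strip()
            if kind == "url":
                pages.append(loc)
            elif kind == "sitemap":
                # Sitemap index entry: points at another sitemap to harvest.
                nested.append(loc)
    return pages, nested


if __name__ == "__main__":
    # example.org is a placeholder URL, not taken from the issue.
    sample = (
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        "<url><loc>http://example.org/a</loc></url>"
        "</urlset>"
    )
    print(parse_sitemap(sample))
```

Page URLs and nested sitemap URLs are returned separately because a nested sitemap would need to go through the same parsing step again once harvested, while page URLs can be queued directly.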

Sub-Tasks

1.	Add robotsTxtExtractor		Triage	Unassigned
2.	Add SitemapExtractor		Triage	Unassigned

People

Assignee: Unassigned

Reporter: Søren Vejrup Carlsen (Inactive)

Watchers: 1

Dates

Created: 26/Apr/18 2:25 PM

Updated: 26/Apr/18 2:25 PM