Details
Improvement
Resolution: Fixed
Critical
None
None
BNF
Description
Integrate into NetarchiveSuite the most recent H3 release:
H3 release 3.4.0-20200518: https://github.com/internetarchive/heritrix3/tree/3.4.0-20200518
This represents a major upgrade since 2016.
The features included in this release are listed in the release notes:
https://github.com/internetarchive/heritrix3/wiki/Release%20Notes%20-%20Heritrix%203.4.0-20200518
And notably our contributions:
#1 Extend the FTP fetcher to harvest documents hosted on SFTP servers,
#2 Extend the HTML extractor to extract data-prefixed attributes (data-src, data-original, data-original-src and data-original-set) so that images offered in several resolutions on responsive-design websites are harvested,
#3 Fix the crawl status shown in the CrawlSummary report on the H3 console (currently it is always "Finished - Abnormal exit from crawling") and refresh the content whenever the report is viewed (currently it is only updated at the end of the crawl).
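
To illustrate contribution #2, here is a minimal Python sketch of the idea of collecting links from data-prefixed attributes. It is not the Heritrix implementation (Heritrix is written in Java, and its extractor is not reproduced here). The names DATA_ATTRS, DataAttrLinkParser and extract_data_links are hypothetical, and treating data-original-set as a srcset-style comma-separated list is an assumption made only for this example.

```python
from html.parser import HTMLParser

# Attribute names taken from item #2 of the description.
DATA_ATTRS = ("data-src", "data-original", "data-original-src", "data-original-set")


class DataAttrLinkParser(HTMLParser):
    """Collects candidate image URLs found in data-prefixed attributes (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases attribute names before calling this method.
        for name, value in attrs:
            if name not in DATA_ATTRS or not value:
                continue
            if name.endswith("-set"):
                # Assumption: srcset-like syntax, "url descriptor, url descriptor".
                # Splitting on commas is a simplification; URLs containing commas would break it.
                for candidate in value.split(","):
                    parts = candidate.split()
                    if parts:
                        self.links.append(parts[0])
            else:
                self.links.append(value.strip())


def extract_data_links(html):
    """Return the URLs found in data-prefixed attributes, in document order."""
    parser = DataAttrLinkParser()
    parser.feed(html)
    parser.close()
    return parser.links
```

Called on a tag such as `<img src="a.jpg" data-src="b.jpg">`, the function returns only the value of data-src; the plain src attribute is ignored because a standard extractor already handles it.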