  1. NetarchiveSuite
  2. NAS-2865

Integrate the latest H3 IIPC community version: 3.4.0-20200518

Details

    • Improvement
    • Resolution: Fixed
    • Critical
    • 6.0
    • None
    • Heritrix 3
    • None
    • BNF

    Description

      Integrate into NetarchiveSuite the most recent H3 release:
      H3 release 3.4.0-20200518: https://github.com/internetarchive/heritrix3/tree/3.4.0-20200518
      This represents a major upgrade since 2016.

      This release includes the following features:
      https://github.com/internetarchive/heritrix3/wiki/Release%20Notes%20-%20Heritrix%203.4.0-20200518

      And notably our contributions:
      #1 Extend the FTP fetcher to harvest documents hosted on SFTP servers,
      #2 Extend the HTML extractor to extract data- prefixed attributes (data-src, data-original, data-original-src and data-original-set) in order to harvest images offered in multiple resolutions on responsive-design websites,
      #3 Fix the crawl status shown in the CrawlSummary report on the H3 console (currently it is always "Finished - Abnormal exit from crawling") and refresh the report content each time it is viewed (currently it is only updated at the end of the crawl).
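
As an illustration of contribution #2, here is a minimal Python sketch of collecting links from the data- prefixed attributes named above. It is not the Heritrix extractor code (Heritrix is written in Java). The class and function names are hypothetical, and treating data-original-set as a srcset-style, comma-separated list of candidates is an assumption of this sketch.

```python
from html.parser import HTMLParser

# Attribute names taken from the issue description.
# Assumption: these three hold a single URL each.
URL_ATTRS = {"data-src", "data-original", "data-original-src"}
# Assumption: this one holds srcset-style candidates ("url 1x, url 2x").
SET_ATTRS = {"data-original-set"}


class DataAttrLinkCollector(HTMLParser):
    """Hypothetical helper that gathers URLs found in data- attributes."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases attribute names before calling this method.
        for name, value in attrs:
            if not value:
                continue
            if name in URL_ATTRS:
                self.links.append(value.strip())
            elif name in SET_ATTRS:
                # Keep only the URL part of each candidate and drop the
                # descriptor (e.g. "2x"). URLs containing commas would
                # need more careful parsing than this.
                for candidate in value.split(","):
                    parts = candidate.split()
                    if parts:
                        self.links.append(parts[0])


def extract_data_links(html):
    """Return the data- attribute URLs in document order."""
    collector = DataAttrLinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links
```

For example, calling extract_data_links on an img tag that carries both data-src and data-original-set would return the data-src URL followed by each candidate URL from the set, while the ordinary src attribute is left to the regular link extraction.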

          People

            Unassigned
            Sara Aubry
            Colin Rosenthal
            Watchers: 1