  1. NetarchiveSuite
  2. NAS-2410

Include the Icelandic H3 extractorJs in our heritrix3-extensions

Details

    • Improvement
    • Resolution: Fixed
    • Minor
    • 5.0
    • None
    • Heritrix 3
    • None
    • SB/KB

    Description

      Hi Søren,

      This requires a variant of the JavaScript extractor and does not work with stock code.

      Attached is the extractor variant's source. The changes are mostly centered around the shouldIgnorePossibleRelativeLink method.

      Note that this variant also uses heuristics equivalent to Heritrix 1.14 for JS extraction rather than the updated H3 ones, because I created it at a time when H3 JS extraction was spectacularly over-aggressive. A comparable change could no doubt be introduced in the current ExtractorJS class with little difficulty.

      The configuration we are currently using wires the JS extractor as follows:

      <bean id="extractorJs" class="is.landsbokasafn.crawler.extractors.ExtractorJS">
        <property name="rejectRelativeMatchingRegexList">
          <list>
            <value>^text/javascript$</value>
            <value>^text/css$</value>
            <value>^a\.[^/]+$</value>
            <value>^div\.[^/]+$</value>
            <value>^[a-zA-Z-]+\.is$</value>
            <!-- E.g. 3.5.0. Very common in some JS libraries for strings of this nature but very unlikely to
                 be a relative URL -->
            <value>^[0-9]\.([0-9]\.)[0-9]$</value>
            <value>^Microsoft\.XMLHTTP$</value>
          </list>
        </property>
      </bean>
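
      To illustrate what a reject list like the one above does, here is a minimal standalone sketch in Java. It is not the attached extractor source: the class name `RejectListSketch` and the method `shouldIgnore` are hypothetical, and whole-string matching is an assumption about how the variant applies the patterns. Only the regular expressions themselves are taken from the configuration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration only; not the attached ExtractorJS variant.
public class RejectListSketch {

    // Patterns copied from the rejectRelativeMatchingRegexList configuration.
    static final List<Pattern> REJECT = Arrays.asList(
            Pattern.compile("^text/javascript$"),
            Pattern.compile("^text/css$"),
            Pattern.compile("^a\\.[^/]+$"),
            Pattern.compile("^div\\.[^/]+$"),
            Pattern.compile("^[a-zA-Z-]+\\.is$"),
            Pattern.compile("^[0-9]\\.([0-9]\\.)[0-9]$"),
            Pattern.compile("^Microsoft\\.XMLHTTP$"));

    // Returns true when a candidate string found in JavaScript matches any
    // reject pattern and so should not be treated as a relative URL.
    // Assumption: the whole candidate must match the pattern.
    static boolean shouldIgnore(String candidate) {
        for (Pattern p : REJECT) {
            if (p.matcher(candidate).matches()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] candidates = {
            "3.5.0", "text/css", "a.button", "Microsoft.XMLHTTP",
            "landsbokasafn.is", "js/app.js", "10.2.1"
        };
        for (String s : candidates) {
            System.out.println(s + " -> " + (shouldIgnore(s) ? "ignored" : "kept"));
        }
    }
}
```

      Under these assumptions, a string such as `js/app.js` is kept as a possible link, while `3.5.0` and `Microsoft.XMLHTTP` are ignored. Note that the version pattern as configured only matches three single-digit components, so a string like `10.2.1` would not be rejected by it.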

      Best,
      Kristinn

          People

            svc Søren Vejrup Carlsen (Inactive)

              Time Tracking

                Estimated: Not Specified
                Remaining: Not Specified
                Logged: 0.15h