Details
Improvement
Resolution: Fixed
Minor
None
None
SB/KB
Description
Hi Søren,
This requires a variant of the JavaScript extractor and does not work with stock code.
Attached is the extractor variant's source. The changes are mostly centered around the shouldIgnorePossibleRelativeLink method.
Note that this variant also uses heuristics equivalent to those of Heritrix 1.14 for JS extraction rather than the updated H3 ones, because I created it while H3 JS extraction was spectacularly over-aggressive. A comparable change could no doubt be introduced in the current ExtractorJS class with little difficulty.
The configuration we are currently using wires the JS extractor as follows:
<bean id="extractorJs" class="is.landsbokasafn.crawler.extractors.ExtractorJS">
  <property name="rejectRelativeMatchingRegexList">
    <list>
      <value>^text/javascript$</value>
      <value>^text/css$</value>
      <value>^a\.[^/]+$</value>
      <value>^div\.[^/]+$</value>
      <value>^[a-zA-Z-]+\.is$</value>
      <!-- E.g. 3.5.0. Very common in some JS libraries for strings of this nature
           but very unlikely to be a relative URL -->
      <value>^[0-9]\.([0-9]\.)[0-9]$</value>
      <value>^Microsoft\.XMLHTTP$</value>
    </list>
  </property>
</bean>
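For anyone reading without the attachment, here is a minimal sketch of how a reject list like this might be consulted. It is not the attached source: the class name RelativeLinkRejectSketch and the helper withExampleConfig are made up for illustration, and the assumption is simply that a candidate string is dropped when any configured regex matches it. Only the method name shouldIgnorePossibleRelativeLink and the regex values come from the text above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch only; NOT the attached extractor variant's source.
class RelativeLinkRejectSketch {

    // Compiled once up front so each candidate string is checked cheaply.
    private final List<Pattern> rejectPatterns = new ArrayList<>();

    RelativeLinkRejectSketch(List<String> rejectRelativeMatchingRegexList) {
        for (String regex : rejectRelativeMatchingRegexList) {
            rejectPatterns.add(Pattern.compile(regex));
        }
    }

    // Hypothetical helper: the same regex values as the bean configuration.
    static RelativeLinkRejectSketch withExampleConfig() {
        return new RelativeLinkRejectSketch(Arrays.asList(
                "^text/javascript$",
                "^text/css$",
                "^a\\.[^/]+$",
                "^div\\.[^/]+$",
                "^[a-zA-Z-]+\\.is$",
                "^[0-9]\\.([0-9]\\.)[0-9]$",
                "^Microsoft\\.XMLHTTP$"));
    }

    // Assumed behaviour: ignore the candidate if any reject regex matches.
    // The patterns carry their own ^...$ anchors, so find() is sufficient.
    boolean shouldIgnorePossibleRelativeLink(String candidate) {
        for (Pattern p : rejectPatterns) {
            if (p.matcher(candidate).find()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        RelativeLinkRejectSketch filter = withExampleConfig();
        for (String s : new String[] {"3.5.0", "Microsoft.XMLHTTP", "images/logo.png"}) {
            System.out.println(s + " ignored: " + filter.shouldIgnorePossibleRelativeLink(s));
        }
    }
}
```

Under this sketch, strings such as 3.5.0 and Microsoft.XMLHTTP are ignored, while something like images/logo.png is still treated as a possible relative link.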
Best,
Kristinn