Details

Type: Improvement
Resolution: Fixed
Priority: Minor
Fix Version/s: 5.0
Affects Version/s: None
Component/s: Heritrix 3
Labels: None

Organization: SB/KB

Description

Hi Søren,

This requires a variant of the JavaScript extractor; it does not work with the stock code.

Attached is the extractor variant's source. The changes are mostly centered around the shouldIgnorePossibleRelativeLink method.

Note that this variant also uses heuristics equivalent to Heritrix 1.14 for JS extraction rather than the updated H3 ones, because I created it while H3 JS extraction was spectacularly over-aggressive. A comparable change could no doubt be introduced in the current ExtractorJS class with little difficulty.

The configuration we currently use wires the JS extractor as follows:

<bean id="extractorJs" class="is.landsbokasafn.crawler.extractors.ExtractorJS">
  <property name="rejectRelativeMatchingRegexList">
    <list>
      <value>^text/javascript$</value>
      <value>^text/css$</value>
      <value>^a\.[^/]+$</value>
      <value>^div\.[^/]+$</value>
      <value>^[a-zA-Z-]+\.is$</value>
      <value>^[0-9]\.([0-9]\.)[0-9]$</value>
      <value>^Microsoft\.XMLHTTP$</value>
    </list>
  </property>
</bean>
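
For readers without access to the attachment, here is a minimal sketch of how a reject list like the one above might be applied to candidate relative links. It is not the attached source: the class name RelativeLinkFilterSketch and its constructor are hypothetical, and only the method name shouldIgnorePossibleRelativeLink and the regex values come from this issue. It assumes each candidate string must match a pattern in full to be rejected.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-in for the extractor variant; not the attached class.
final class RelativeLinkFilterSketch {

    // Compiled form of the rejectRelativeMatchingRegexList values.
    private final List<Pattern> rejectPatterns = new ArrayList<>();

    RelativeLinkFilterSketch(List<String> regexes) {
        for (String regex : regexes) {
            rejectPatterns.add(Pattern.compile(regex));
        }
    }

    // Returns true when the candidate string matches any reject pattern,
    // meaning it should not be treated as a relative link.
    boolean shouldIgnorePossibleRelativeLink(String candidate) {
        for (Pattern pattern : rejectPatterns) {
            if (pattern.matcher(candidate).matches()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Same values as the Spring configuration in this issue.
        RelativeLinkFilterSketch filter = new RelativeLinkFilterSketch(Arrays.asList(
                "^text/javascript$",
                "^text/css$",
                "^a\\.[^/]+$",
                "^div\\.[^/]+$",
                "^[a-zA-Z-]+\\.is$",
                "^[0-9]\\.([0-9]\\.)[0-9]$",
                "^Microsoft\\.XMLHTTP$"));

        // Strings commonly found in JavaScript that are not links.
        System.out.println(filter.shouldIgnorePossibleRelativeLink("text/javascript"));
        System.out.println(filter.shouldIgnorePossibleRelativeLink("Microsoft.XMLHTTP"));
        // A plausible relative path that none of the patterns reject.
        System.out.println(filter.shouldIgnorePossibleRelativeLink("scripts/app.js"));
    }
}
```

With these patterns, MIME types, DOM-style selectors such as a.something, bare .is domain names, three-part single-digit version strings, and the Microsoft.XMLHTTP identifier are rejected, while an ordinary relative path such as scripts/app.js is still treated as a possible link.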

Best,
Kristinn


People

Assignee: Søren Vejrup Carlsen (Inactive)

Reporter: Søren Vejrup Carlsen (Inactive)

Watchers: 1

Dates

Created: 06/May/15 1:09 PM

Updated: 07/Sep/15 12:06 PM

Resolved: 02/Sep/15 1:39 PM

Time Tracking

Estimated: Not Specified

Remaining: Not Specified

Logged: 0.15h