[NAS-2420] Need to XML escape Global Crawlertraps Created: 09/Jun/15  Updated: 04/Oct/18

Status: Open
Project: NetarchiveSuite
Component/s: Heritrix 3
Affects Version/s: 5.0
Fix Version/s: 5.5.1

Type: Bug
Priority: Minor
Reporter: Søren Vejrup Carlsen (Inactive)
Assignee: Unassigned
Resolution: Unresolved  
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   

The list (https://sbforge.org/download/attachments/9928737/crawlertrapsCollection.txt?version=1&modificationDate=1398345665399&api=v2)
contains many '&' characters, which need to be XML-escaped.

If they are not, H3 fails to start.

It seems that what worked for H1 does not work for H3.
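
To illustrate the kind of escaping meant here, the following is a minimal sketch of escaping one crawler trap expression before it is pasted into the H3 template as text. The class and method names (CrawlerTrapEscaper, xmlEscape) are hypothetical and not part of NetarchiveSuite; the only thing taken from the XML specification is the set of five predefined entities.

```java
// Hypothetical helper, not existing NetarchiveSuite code.
final class CrawlerTrapEscaper {
    private CrawlerTrapEscaper() {}

    /** Replaces the five XML special characters with their predefined entities. */
    static String xmlEscape(String regex) {
        StringBuilder sb = new StringBuilder(regex.length() + 16);
        for (int i = 0; i < regex.length(); i++) {
            char c = regex.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"': sb.append("&quot;"); break;
                case '\'': sb.append("&apos;"); break;
                default: sb.append(c);
            }
        }
        return sb.toString();
    }
}
```

Note that this sketch would double-escape input that is already escaped (for example "&amp;" becomes "&amp;amp;"), so it should only ever be applied to the raw expressions.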



 Comments   
Comment by Colin Rosenthal [ 26/Oct/16 ]

I've uploaded an XML-escaped version of the old test data to the TEST2 page, so that at least that part of the issue is dealt with.

Comment by Colin Rosenthal [ 28/Apr/16 ]

I think we can do better than we do now, without rewriting the whole logic. All crawler traps should be validated as regexps whenever a global crawler trap list is edited.
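
A rough sketch of what such a validation step could look like, assuming the expressions are ultimately compiled with java.util.regex (as Java-based crawlers normally do). CrawlerTrapValidator and findInvalid are hypothetical names, not existing NetarchiveSuite code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical helper, not existing NetarchiveSuite code.
final class CrawlerTrapValidator {
    private CrawlerTrapValidator() {}

    /** Returns the expressions that do not compile as Java regular expressions. */
    static List<String> findInvalid(List<String> traps) {
        List<String> invalid = new ArrayList<>();
        for (String trap : traps) {
            try {
                Pattern.compile(trap);
            } catch (PatternSyntaxException e) {
                // Syntactically broken expression: report it instead of saving it.
                invalid.add(trap);
            }
        }
        return invalid;
    }
}
```

If the returned list is non-empty, the edit could be rejected and the offending expressions shown to the user. This only checks regex syntax; it says nothing about XML escaping, which is a separate concern.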

Comment by Colin Rosenthal [ 17/Sep/15 ]

I changed the title of this issue to make it clear that it is a NAS issue and not just an issue with bad test data.

Comment by Søren Vejrup Carlsen (Inactive) [ 15/Jun/15 ]

Another suggestion came from N. Levitt (IA):

You could use an xml library to modify your cxml. That would take care of encoding things as necessary.

Noah
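
A minimal sketch of this approach using the JDK's built-in DOM and transformer APIs. It assumes the crawler traps end up as <value> children of some list element in the cxml; the class name CxmlTrapInserter, the method addTraps and the listTagName parameter are hypothetical, and picking the first element with that tag name is a simplification, since a real template contains many lists.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper, not existing NetarchiveSuite code.
final class CxmlTrapInserter {
    private CxmlTrapInserter() {}

    /** Appends each trap as a <value> child of the first element named listTagName. */
    static String addTraps(String xml, String listTagName, List<String> traps) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Element list = (Element) doc.getElementsByTagName(listTagName).item(0);
            if (list == null) {
                throw new IllegalArgumentException("No <" + listTagName + "> element found");
            }
            for (String trap : traps) {
                Element value = doc.createElement("value");
                // The raw regex is stored as text; the serializer escapes '&', '<' and '>'.
                value.setTextContent(trap);
                list.appendChild(value);
            }
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (ParserConfigurationException | SAXException | IOException | TransformerException e) {
            throw new IllegalStateException("Could not update the template", e);
        }
    }
}
```

The point of the design is that the calling code never handles escaped strings at all: it passes the expressions as they are, and the escaping happens once, at serialization time.
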
Comment by Søren Vejrup Carlsen (Inactive) [ 15/Jun/15 ]

However, to follow Kristinn's suggestion, we need to modify the Job class so that the crawler traps are added to the Job as a data structure instead of just being inserted into the template during job generation.

Comment by Søren Vejrup Carlsen (Inactive) [ 15/Jun/15 ]

Kristinn had the following suggestion:

I got sufficiently sick of this XML encoding of regexes that I made a variant to the MatchesListRegexDecideRule that reads the regexes from a plain text file. Source: http://pastebin.com/SJKdhjhZ

Wire it in like so:

  <bean id="listRegexFilterOut" class="is.landsbokasafn.crawler.deciderules.MatchesListRegexDecideRule">
    <property name="decision" value="REJECT" />
    <!-- <property name="listLogicalOr" value="true" /> -->
    <property name="regexSource">
      <bean class="org.archive.spring.ConfigFile">
        <property name="path" value="listRegexFilterOut.txt" />
      </bean>
    </property>
  </bean>

Any non-empty line that doesn't start with a hash (#) is treated as a single regular expression.

It should also be much easier to programmatically update a plain text file.

- Kris
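
The line rule described above can be sketched as follows. This is not Kristinn's code (that is behind the pastebin link); PlainTextTrapList and parse are hypothetical names, and treating whitespace-only lines as empty is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper illustrating the line rule only.
final class PlainTextTrapList {
    private PlainTextTrapList() {}

    /** Keeps every non-empty line that does not start with '#', one regex per line. */
    static List<String> parse(List<String> lines) {
        List<String> regexes = new ArrayList<>();
        for (String line : lines) {
            // Assumption: whitespace-only lines count as empty.
            if (line.trim().isEmpty() || line.startsWith("#")) {
                continue;
            }
            // No XML escaping is needed: '&' stays as it is in a plain text file.
            regexes.add(line);
        }
        return regexes;
    }
}
```

With this format, an expression such as .*a=1&b=2.* is written to the file unchanged, which is what makes programmatic updates simpler than editing the cxml.
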
Comment by Søren Vejrup Carlsen (Inactive) [ 09/Jun/15 ]

We need to validate the expressions before inserting them into the Heritrix template.

I have sent a query to the Heritrix mailing list for tips on how to do this.

Generated at Sat Apr 27 03:37:18 CEST 2024 using Jira 9.4.15#940015-sha1:bdaa9cbecfb6791ea579749728cab771f0dfe90b.