[NAS-2420] Need to XML escape Global Crawlertraps Created: 09/Jun/15 Updated: 04/Oct/18 |
|
Status: | Open |
Project: | NetarchiveSuite |
Component/s: | Heritrix 3 |
Affects Version/s: | 5.0 |
Fix Version/s: | 5.5.1 |
Type: | Bug | Priority: | Minor |
Reporter: | Søren Vejrup Carlsen (Inactive) | Assignee: | Unassigned |
Resolution: | Unresolved | ||
Labels: | None | ||
Remaining Estimate: | Not Specified | ||
Time Spent: | Not Specified | ||
Original Estimate: | Not Specified |
Description |
The list (https://sbforge.org/download/attachments/9928737/crawlertrapsCollection.txt?version=1&modificationDate=1398345665399&api=v2) needs to be XML-escaped. If not, H3 fails to start. It seems that what worked for H1 will not work for H3. |
Comments |
Comment by Colin Rosenthal [ 26/Oct/16 ] |
I've uploaded an xml-escaped version of the old test-data to the TEST2 page, so that at least that part of the issue is dealt with. |
Comment by Colin Rosenthal [ 28/Apr/16 ] |
I think we can do better than we do now, without rewriting the whole logic. All crawler traps should be validated as regexps whenever a global crawler trap list is edited. |
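A minimal sketch of the validation proposed above, assuming the trap list arrives as a list of strings and that java.util.regex syntax is the relevant check. The class and method names (CrawlerTrapValidator, findInvalid) are hypothetical and not part of NetarchiveSuite.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical helper, not an existing NetarchiveSuite class.
final class CrawlerTrapValidator {

    /** Returns the non-blank entries that do not compile as java.util.regex patterns. */
    static List<String> findInvalid(List<String> traps) {
        List<String> invalid = new ArrayList<>();
        for (String trap : traps) {
            String trimmed = trap.trim();
            if (trimmed.isEmpty()) {
                continue; // blank lines are skipped, not reported
            }
            try {
                Pattern.compile(trimmed);
            } catch (PatternSyntaxException e) {
                invalid.add(trimmed);
            }
        }
        return invalid;
    }

    public static void main(String[] args) {
        List<String> traps = Arrays.asList(
                ".*/calendar/.*",
                ".*[unclosed",          // unclosed character class, will be reported
                ".*\\?sid=[0-9a-f]+.*");
        System.out.println("Invalid traps: " + findInvalid(traps));
    }
}
```

Running this check when a list is saved would let the editor reject a bad list up front. Note that it only checks regex syntax; it does not make the expressions safe to paste into XML.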
Comment by Colin Rosenthal [ 17/Sep/15 ] |
I changed the title of this issue to make it clear that it is a NAS issue and not just an issue with bad test-data. |
Comment by Søren Vejrup Carlsen (Inactive) [ 15/Jun/15 ] |
Another suggestion came from N. Levitt (IA): "You could use an xml library to modify your cxml. That would take care of encoding things as necessary. - Noah" |
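A rough illustration of Noah's point, using only the XML APIs bundled with the JDK. The element name "value" and the class name CrawlerTrapXml are assumptions for the example; how the result would be merged into the actual cxml template is not shown.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical helper, not an existing NetarchiveSuite class.
final class CrawlerTrapXml {

    /** Serializes one regex as a <value> element; the serializer does the escaping. */
    static String toValueElement(String regex) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element value = doc.createElement("value"); // element name is an assumption
            value.setTextContent(regex);
            doc.appendChild(value);

            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not serialize crawler trap", e);
        }
    }

    /** Parses the serialized element again and returns its text, i.e. the original regex. */
    static String readBack(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse crawler trap XML", e);
        }
    }

    public static void main(String[] args) {
        String regex = ".*\\?page=1&sort=<asc>.*";
        String xml = toValueElement(regex);
        System.out.println(xml);
        System.out.println(readBack(xml).equals(regex));
    }
}
```

Because the library writes the text node, characters such as & and < are escaped on output and restored on parsing, so the regex itself never has to be escaped by hand.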
Comment by Søren Vejrup Carlsen (Inactive) [ 15/Jun/15 ] |
However, to follow Kristinn's suggestion, we need to modify the Job class so that the crawler traps are added to Job as a data structure instead of just being inserted into the template. |
Comment by Søren Vejrup Carlsen (Inactive) [ 15/Jun/15 ] |
Kristinn had the following suggestion:
I got sufficiently sick of this XML encoding of regexes that I made a variant to the MatchesListRegexDecideRule that reads the regexes from a plain text file.
Source: http://pastebin.com/SJKdhjhZ
Wire it in like so:
<bean id="listRegexFilterOut" class="is.landsbokasafn.crawler.deciderules.MatchesListRegexDecideRule">
  <property name="decision" value="REJECT" />
  <!-- <property name="listLogicalOr" value="true" /> -->
  <property name="regexSource">
    <bean class="org.archive.spring.ConfigFile">
      <property name="path" value="listRegexFilterOut.txt" />
    </bean>
  </property>
</bean>
Any non-empty line that doesn't start with a hash (#) is treated as a single regular expression. It should also be much easier to programmatically update a plain text file.
- Kris |
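The line rule Kristinn describes (skip empty lines and lines starting with a hash, treat every other line as one regex) could look roughly like this. This is not his pastebin code; the class PlainTextRegexList and its methods are hypothetical, and treating whitespace-only lines as empty is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration of the plain-text format, not the linked implementation.
final class PlainTextRegexList {

    /** Compiles every line that is neither blank nor a #-comment. */
    static List<Pattern> parse(List<String> lines) {
        List<Pattern> patterns = new ArrayList<>();
        for (String line : lines) {
            // Assumption: whitespace-only lines count as empty.
            if (line.trim().isEmpty() || line.startsWith("#")) {
                continue;
            }
            patterns.add(Pattern.compile(line));
        }
        return patterns;
    }

    /** True if the whole URI matches at least one pattern. */
    static boolean matchesAny(List<Pattern> patterns, String uri) {
        for (Pattern pattern : patterns) {
            if (pattern.matcher(uri).matches()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "# global crawler traps",
                "",
                ".*/calendar/.*",
                ".*\\?a=1&b=2.*"); // no XML escaping needed in a plain text file
        List<Pattern> patterns = parse(lines);
        System.out.println(patterns.size());
        System.out.println(matchesAny(patterns, "http://example.org/calendar/2015"));
    }
}
```

In practice the lines would come from the text file itself, for example via java.nio.file.Files.readAllLines on the path configured in the bean above.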
Comment by Søren Vejrup Carlsen (Inactive) [ 09/Jun/15 ] |
We need to validate the expressions before inserting them into the Heritrix template. I have sent a query to the Heritrix mailing list for tips on how to do this. |