[NAS-2690] The function "Browse only relevant crawl-log lines for this domain" is faulty Created: 08/Jan/18  Updated: 24/Apr/18  Resolved: 24/Apr/18

Status: Closed
Project: NetarchiveSuite
Component/s: GUI
Affects Version/s: 5.2.2, 5.3.1
Fix Version/s: 5.4

Type: Bug Priority: Minor
Reporter: Søren Vejrup Carlsen (Inactive) Assignee: Søren Vejrup Carlsen (Inactive)
Resolution: Fixed  
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

External reference:

https://sbprojects.statsbiblioteket.dk/jira/browse/NARK-1212

Sprint: NAS 5.4
Verification:

How to test this:
Construct a tool that fetches the relevant crawl-log lines from a local metadata WARC file
(args: domain, metadata WARC file).
Fetch a metadata WARC file with a large crawl log.

Test that the new domain-specific regexp returns the correct lines.
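
A minimal sketch of such a test tool, under stated assumptions: the crawl log has already been extracted from the metadata WARC file into a plain text file (the WARC parsing itself is left out), and lines are matched in full, as the leading and trailing ".*" of the pattern suggest. The class name CrawlLogDomainFilter and its methods are hypothetical and not part of NetarchiveSuite. The pattern is the one quoted in the comment of 12/Jan/18.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of NetarchiveSuite.
public class CrawlLogDomainFilter {

    // Builds the domain-specific pattern quoted in this issue.
    static Pattern domainPattern(String domain) {
        String escaped = domain.replaceAll("\\.", "\\\\."); // "a.dk" -> "a\.dk"
        return Pattern.compile(
            ".*(https?:\\/\\/(www\\.)?|dns:|ftp:\\/\\/)"
            + "([\\w_-]+\\.)?([\\w_-]+\\.)?([\\w_-]+\\.)?"
            + escaped + "($|\\/|\\w|\\s).*");
    }

    // Assumption: crawlLog is a plain text crawl log, one entry per line.
    static List<String> matchingLines(Path crawlLog, String domain) throws IOException {
        Pattern pattern = domainPattern(domain);
        try (Stream<String> lines = Files.lines(crawlLog, StandardCharsets.UTF_8)) {
            return lines.filter(line -> pattern.matcher(line).matches())
                        .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: CrawlLogDomainFilter <domain> <crawl-log file>");
            System.exit(1);
        }
        for (String line : matchingLines(Paths.get(args[1]), args[0])) {
            System.out.println(line);
        }
    }
}
```

Running it against a large crawl log and spot-checking the printed lines for foreign host names is one way to carry out the check described above.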


 Description   

The code in harvester/qa-gui/src/main/webapp/QA-searchcrawllog.jsp is faulty:

if (regexp != null && regexp.length() != 0) {
    crawlLogExtract = Reporting.getCrawlLoglinesMatchingRegexp(jobid, regexp);
} else { // use 'domain' as the regular expression
    regexp = ".*" + domain.replaceAll("\\.", "\\\\.") + ".*";
    crawlLogExtract = Reporting.getCrawlLoglinesMatchingRegexp(jobid, regexp);
}

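The regexp in the else branch is used for the "Browse only..." functionality. It matches the domain name anywhere in a crawl-log line, so it also returns lines where the domain only appears as part of another host name or elsewhere in a URL.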



 Comments   
Comment by Søren Vejrup Carlsen (Inactive) [ 06/Mar/18 ]

Do you still want a review, or is it enough just to read the diff: https://sbforge.org/fisheye/changelog/NetarchiveSuite-Github?cs=33f143e1f05312303548b4f512202b071032a374

Comment by Colin Rosenthal [ 02/Feb/18 ]

Is there a review? If so, it doesn't seem to be linked to the issue.

Comment by Søren Vejrup Carlsen (Inactive) [ 16/Jan/18 ]

This code is ready for review.

Comment by Søren Vejrup Carlsen (Inactive) [ 12/Jan/18 ]

A valid regular expression has now been found:

".*(https?:\\/\\/(www\\.)?|dns:|ftp:\\/\\/)([\\w_-]+\\.)?([\\w_-]+\\.)?([\\w_-]+\\.)?" + domain.replaceAll("\\.", "\\\\.") +  "($|\\/|\\w|\\s).*";
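
The difference between the original pattern and this one can be sketched as follows. The class DomainRegexpSketch, its methods and the sample lines are made up for illustration (the lines are simplified, not real crawl-log entries), and full-line matching is an assumption about how Reporting.getCrawlLoglinesMatchingRegexp applies the pattern.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical comparison harness, not part of NetarchiveSuite.
public class DomainRegexpSketch {

    // Simplified, made-up crawl-log lines.
    static final List<String> SAMPLE_LINES = Arrays.asList(
        "2018-01-12T10:00:00.000Z 200 1234 http://netarkivet.dk/ - - text/html",
        "2018-01-12T10:00:01.000Z 200 512 http://www.netarkivet.dk/robots.txt P http://www.netarkivet.dk/ text/plain",
        "2018-01-12T10:00:02.000Z 1 52 dns:netarkivet.dk P http://netarkivet.dk/ text/dns",
        "2018-01-12T10:00:03.000Z 200 2048 http://notnetarkivet.dk/ - - text/html",
        "2018-01-12T10:00:04.000Z 200 4096 http://example.com/links?to=netarkivet.dk - - text/html",
        "2018-01-12T10:00:05.000Z 200 100 http://example.org/ - - text/html");

    // Pattern from the original else branch: the domain anywhere in the line.
    static String oldRegexp(String domain) {
        return ".*" + domain.replaceAll("\\.", "\\\\.") + ".*";
    }

    // Pattern from this comment: the domain must follow a scheme and
    // at most three optional host-name labels.
    static String newRegexp(String domain) {
        return ".*(https?:\\/\\/(www\\.)?|dns:|ftp:\\/\\/)"
            + "([\\w_-]+\\.)?([\\w_-]+\\.)?([\\w_-]+\\.)?"
            + domain.replaceAll("\\.", "\\\\.") + "($|\\/|\\w|\\s).*";
    }

    // Keeps the lines that match the pattern in full.
    static List<String> filter(List<String> lines, String regexp) {
        Pattern pattern = Pattern.compile(regexp);
        List<String> result = new ArrayList<>();
        for (String line : lines) {
            if (pattern.matcher(line).matches()) {
                result.add(line);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String domain = "netarkivet.dk";
        System.out.println("old: " + filter(SAMPLE_LINES, oldRegexp(domain)).size());
        System.out.println("new: " + filter(SAMPLE_LINES, newRegexp(domain)).size());
    }
}
```

On these sample lines the old pattern keeps five lines and the new one three: the lines for notnetarkivet.dk and for example.com with the domain in its query string are no longer returned. Because the new pattern still begins with ".*", a line that mentions the domain in another URL field (for example the referring URL) is still matched.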
Generated at Fri Apr 19 21:39:59 CEST 2024 using Jira 9.4.15#940015-sha1:bdaa9cbecfb6791ea579749728cab771f0dfe90b.