[NAS-2726] The methods in dk.netarkivet.viewerproxy.webinterface.Reporting do not support metadata files using the BNF naming Created: 22/Mar/18  Updated: 11/Apr/18  Resolved: 11/Apr/18

Status: Resolved
Project: NetarchiveSuite
Component/s: GUI
Affects Version/s: None
Fix Version/s: 5.4

Type: Bug Priority: Blocker
Reporter: Søren Vejrup Carlsen (Inactive) Assignee: Søren Vejrup Carlsen (Inactive)
Resolution: Fixed  
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to NAS-2712 Indexserver includes too much in the ... Ready for release test
Spawned
was spawned by NAS-2676 The regexp used by Reporting.getMetad... Closed

 Description   

The methods in https://github.com/netarchivesuite/netarchivesuite/blob/master/harvester/harvester-core/src/main/java/dk/netarkivet/viewerproxy/webinterface/Reporting.java
all use the helper method

private static String getMetadataFilePatternForJobId(long jobid) {
    // The old invalid metadataFilePattern
    // return ".*" + jobid + ".*" + metadatafile_suffix;
    return jobid + metadatafile_suffix;
}

This code currently assumes that we use the legacy-style naming of the metadata files
and not the prefix style:
https://github.com/netarchivesuite/netarchivesuite/blob/master/harvester/harvester-core/src/main/java/dk/netarkivet/harvester/harvesting/metadata/MetadataFileWriter.java

The difference:

if (isPrefix) {
    return collectionName + "-" + jobID + "-" + harvestID + "-metadata-" + versionNumber + ".warc" + possibleGzSuffix;
} else {
    return jobID + "-metadata-" + versionNumber + ".warc" + possibleGzSuffix;
}

Currently, collectionName is read from the setting HarvesterSettings.HERITRIX_PREFIX_COLLECTION_NAME.



 Comments   
Comment by Søren Vejrup Carlsen (Inactive) [ 11/Apr/18 ]

And I have also tested it using BNF-style naming:

<settings>
<harvester><harvesting>
<heritrix>
<archiveNaming>
    <class>dk.netarkivet.harvester.harvesting.CollectionPrefixNamingConvention</class>
    <collectionName>BNF</collectionName> 
</archiveNaming>
</heritrix>

<metadata>
  <metadataFileNameFormat>prefix</metadataFileNameFormat>
  <filename>
    <versionnumber>1</versionnumber>
  </filename>
</metadata>

</harvesting></harvester></settings>

And it works perfectly

Comment by Søren Vejrup Carlsen (Inactive) [ 10/Apr/18 ]

I am currently testing your solution with the standard naming setup

Comment by Søren Vejrup Carlsen (Inactive) [ 10/Apr/18 ]

Yes, the same regexp is used for selecting the correct metadatafile to search in

Comment by Sara Aubry [ 10/Apr/18 ]

Will this regexp also fix the bugs in the features "Display crawl lines matching this regexp" and "Browse only crawl-log lines for this domain XXX"?

Comment by Sara Aubry [ 10/Apr/18 ]

Bert says we should stick to his proposal as the metadatafile_suffix starts with a hyphen.
So if you use "-(.*)?", there will be two hyphens in a row which would not match your naming scheme.
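The double-hyphen problem can be illustrated with a minimal sketch. It assumes that metadatafile_suffix has the value `-metadata-[0-9]+\.(w)?arc(\.gz)?` (inferred from the expanded regexp quoted in this thread, not checked against Reporting.java). The class and method names are hypothetical, and job ID 1073 is just the example number used in the thread.

```java
import java.util.regex.Pattern;

// Hypothetical sketch, not the NetarchiveSuite code.
class DoubleHyphenSketch {
    // Assumed suffix value, inferred from the expanded regexp in this issue.
    static final String METADATAFILE_SUFFIX = "-metadata-[0-9]+\\.(w)?arc(\\.gz)?";

    // Variant with the hyphen outside the optional group: "-(.*)?"
    static String hyphenOutside(long jobid) {
        return "(.*-)?" + jobid + "-(.*)?" + METADATAFILE_SUFFIX;
    }

    // Variant with the hyphen inside the optional group: "(-.*)?"
    static String hyphenInside(long jobid) {
        return "(.*-)?" + jobid + "(-.*)?" + METADATAFILE_SUFFIX;
    }

    static boolean matches(String regex, String filename) {
        // Pattern.matches requires the whole filename to match.
        return Pattern.matches(regex, filename);
    }

    public static void main(String[] args) {
        String legacy = "1073-metadata-1.warc";
        // "1073-" followed by "-metadata-" would need two hyphens in a row.
        System.out.println(matches(hyphenOutside(1073), legacy)); // false
        System.out.println(matches(hyphenInside(1073), legacy));  // true
    }
}
```

Under these assumptions, the variant with the hyphen outside the group still matches a prefix-style name such as BnF-1073-55-metadata-1.warc.gz, but it no longer matches the legacy-style name.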

Comment by Colin Rosenthal [ 10/Apr/18 ]

That looks good, but isn't there always a hyphen after the jobid? So:

"(.*-)?" + jobid + "-(.*)?" + metadatafile_suffix

Comment by Sara Aubry [ 10/Apr/18 ]

Here is Bert's recommendation:


(.*-)?1073(-.*)?-metadata-[0-9]+\.(w)?arc(\.gz)?

hence

 "(.*-)?" + jobid + "(-.*)?" + metadatafile_suffix

should work for both KB and BnF naming schemes.
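As a rough check of the recommended pattern, the sketch below applies it to the file names mentioned in this issue. It assumes metadatafile_suffix equals `-metadata-[0-9]+\.(w)?arc(\.gz)?`, as the expanded regexp above suggests. The class name and the `matches` helper are hypothetical and are not part of Reporting.java.

```java
import java.util.regex.Pattern;

// Hypothetical sketch of the proposed helper, not the committed fix.
class MetadataPatternSketch {
    // Assumed suffix value, inferred from the expanded regexp above.
    static final String METADATAFILE_SUFFIX = "-metadata-[0-9]+\\.(w)?arc(\\.gz)?";

    // Optional "prefix-" before the job ID, optional "-something" after it.
    static String getMetadataFilePatternForJobId(long jobid) {
        return "(.*-)?" + jobid + "(-.*)?" + METADATAFILE_SUFFIX;
    }

    static boolean matches(long jobid, String filename) {
        // The whole filename must match the pattern.
        return Pattern.matches(getMetadataFilePatternForJobId(jobid), filename);
    }

    public static void main(String[] args) {
        // Legacy naming: jobID-metadata-version.warc[.gz]
        System.out.println(matches(25740, "25740-metadata-1.warc.gz"));         // true
        // Prefix naming: collectionName-jobID-harvestID-metadata-version.warc[.gz]
        System.out.println(matches(25740, "BnF-25740-55-metadata-1.warc.gz"));  // true
        // Other jobs whose IDs merely contain the digits 25740 are rejected.
        System.out.println(matches(25740, "125740-metadata-1.warc.gz"));        // false
        System.out.println(matches(25740, "BnF-257401-55-metadata-1.warc.gz")); // false
    }
}
```

One caveat of the sketch: because the part before the job ID is an arbitrary optional prefix, a prefix-style name whose harvest ID happens to equal the job ID (for example BnF-999-25740-metadata-1.warc.gz for job 25740) also matches.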

Comment by Sara Aubry [ 10/Apr/18 ]

Here is our naming scheme:

  • for metadata:
    BnF-25740-55-metadata-1.warc.gz
    BnF-25740-55 being prefix+jobid+harvestdefinitionid
    prefix is always the same, based on initial IIPC recommendations for ARC and WARC file naming.
  • for data:
    BnF-25740-55-20180409170133-00000-ciblee_2018_fogg126.bnf.fr.warc.gz

Comment by Colin Rosenthal [ 10/Apr/18 ]

It's surely not beyond our intellectual limits to devise a regex that satisfies both sets of requirements. We need an optional ({0,1}) "prefix-" before the jobid and a "-" after. Questions: can the prefix contain international characters? Does it contain only letters and numbers? (See https://stackoverflow.com/questions/14636540/java-regular-expression-with-international-letters for use of international letters in regexes.)

Comment by Søren Vejrup Carlsen (Inactive) [ 22/Mar/18 ]

If we return to the old metadatafile pattern

 ".*"+jobid + ".*" + metadatafile_suffix

the pattern will include metadatafiles for other jobs as well.
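A small sketch of the over-matching follows. It assumes metadatafile_suffix is `-metadata-[0-9]+\.(w)?arc(\.gz)?` (the value implied by the expanded regexp quoted in a later comment). The class and method names are hypothetical, and the job IDs are made up for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical sketch of the old pattern, for illustration only.
class OldPatternOvermatchSketch {
    // Assumed suffix value, inferred from the regexp quoted in this issue.
    static final String METADATAFILE_SUFFIX = "-metadata-[0-9]+\\.(w)?arc(\\.gz)?";

    // The old pattern: anything before and anything after the job ID.
    static String oldPattern(long jobid) {
        return ".*" + jobid + ".*" + METADATAFILE_SUFFIX;
    }

    static boolean oldMatches(long jobid, String filename) {
        return Pattern.matches(oldPattern(jobid), filename);
    }

    public static void main(String[] args) {
        // The intended file for job 1073.
        System.out.println(oldMatches(1073, "1073-metadata-1.warc"));  // true
        // Files for jobs 11073 and 10731 match too, since ".*" absorbs the extra digit.
        System.out.println(oldMatches(1073, "11073-metadata-1.warc")); // true
        System.out.println(oldMatches(1073, "10731-metadata-1.warc")); // true
    }
}
```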
