[NAS-2545] harvestInfo.performer in warcinfo records is not included in harvestInfo.xml Created: 27/Jul/16  Updated: 03/Nov/16  Resolved: 19/Oct/16

Status: Resolved
Project: NetarchiveSuite
Component/s: WARC
Affects Version/s: 5.1
Fix Version/s: 5.2

Type: New Feature Priority: Minor
Reporter: Sara Aubry Assignee: Søren Vejrup Carlsen (Inactive)
Resolution: Fixed  
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Organization:
BNF
Inspector: lam.mai lam.mai
Sprint: NAS 5.2

 Description   

In 5.1, the harvestInfo.performer field included in warcinfo records is empty, and it is not included in harvestInfo.xml at all.

#added by NetarchiveSuite Version: 5.1 (<a href="https://github.com/netarchivesuite/netarchivesuite/commit/cde61d78299cabccae6195908b81ef77c84a76b9">cde61d7829</a>)
harvestInfo.version: 0.5
harvestInfo.jobId: 20
harvestInfo.channel: CIBLEE
harvestInfo.harvestNum: 3
harvestInfo.origHarvestDefinitionID: 2
harvestInfo.maxBytesPerDomain: -1
harvestInfo.maxObjectsPerDomain: 50
harvestInfo.orderXMLName: default_NAS5_1_KLM
harvestInfo.origHarvestDefinitionName: test saa
harvestInfo.scheduleName: annuelle
harvestInfo.harvestFilenamePrefix: BnF-20-2
harvestInfo.jobSubmitDate: Mon Jul 25 13:08:46 CEST 2016
harvestInfo.performer:
harvestInfo.audience: champ public

<?xml version="1.0" encoding="UTF-8"?>
<harvestInfo>
<version>0.5</version>
<jobId>20</jobId>
<channel>CIBLEE</channel>
<harvestNum>3</harvestNum>
<origHarvestDefinitionID>2</origHarvestDefinitionID>
<maxBytesPerDomain>-1</maxBytesPerDomain>
<maxObjectsPerDomain>50</maxObjectsPerDomain>
<orderXMLName>default_NAS5_1_KLM</orderXMLName>
<origHarvestDefinitionName>test saa</origHarvestDefinitionName>
<origHarvestDefinitionComments>Collecte réalisée avec NAS 5.1 pour contrôler les WARC de données et de métadonnées.</origHarvestDefinitionComments>
<scheduleName>annuelle</scheduleName>
<harvestFilenamePrefix>BnF-20-2</harvestFilenamePrefix>
<jobSubmitDate>2016-07-25T11:08:46Z</jobSubmitDate>
<audience>champ public</audience>
</harvestInfo>



 Comments   
Comment by Sara Aubry [ 10/Oct/16 ]

Tested: if settings.harvester.performer is not declared, the performer appears neither in harvestInfo.xml nor in the warcinfo metadata of the data files.

Comment by Sara Aubry [ 28/Sep/16 ]

Great, we'll test it.

Comment by Søren Vejrup Carlsen (Inactive) [ 27/Sep/16 ]

NAS is now consistent. It no longer adds empty performer values to the warcInfo metadata.

Comment by Søren Vejrup Carlsen (Inactive) [ 21/Sep/16 ]

So you probably just need to override the empty settings value:

settings.harvester.performer

What we probably should do is avoid inserting an empty performer into the warcInfo metadata.

Comment by Sara Aubry [ 21/Sep/16 ]

Yes, we saw that by declaring a performer in the settings.
But to be consistent, if the performer is not declared in the settings, maybe the harvestInfo.performer should not be inserted with an empty value in the warcinfo records of the WARC data files. It is currently not inserted in the harvestInfo.xml.

Comment by Søren Vejrup Carlsen (Inactive) [ 21/Sep/16 ]

The value is then appended to the warcinfo metadata by this code in H3HeritrixTemplate.insertWarcInfoMetadata:

if (performer != null) {
    sb.append(startMetadataEntry);
    sb.append(HARVESTINFO_PERFORMER + valuePart + performer + endMetadataEntry);
}
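
The null check above lets an empty string through, which matches the empty harvestInfo.performer line reported in the description. A minimal sketch of a stricter guard follows. It is not the actual fix committed to NetarchiveSuite: the class name, the method name, and the constant values are hypothetical placeholders, since the real constants live in H3HeritrixTemplate and their values are not shown in this issue.

```java
// Hypothetical sketch; not the actual NetarchiveSuite implementation.
final class PerformerMetadataSketch {
    // Assumed placeholder values; the real constants are defined in H3HeritrixTemplate.
    static final String HARVESTINFO_PERFORMER = "harvestInfo.performer";
    static final String START_METADATA_ENTRY = "\n";
    static final String VALUE_PART = ": ";
    static final String END_METADATA_ENTRY = "";

    /** Appends the performer entry only when the value is neither null nor blank. */
    static void appendPerformer(StringBuilder sb, String performer) {
        if (performer == null || performer.trim().isEmpty()) {
            return; // no performer declared: leave the metadata untouched
        }
        sb.append(START_METADATA_ENTRY);
        sb.append(HARVESTINFO_PERFORMER)
          .append(VALUE_PART)
          .append(performer)
          .append(END_METADATA_ENTRY);
    }
}
```

With this guard, an undeclared or empty setting produces no harvestInfo.performer entry at all, which is the behaviour the reporter asked for so that the warcinfo records and harvestInfo.xml agree.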
Comment by Søren Vejrup Carlsen (Inactive) [ 21/Sep/16 ]

The value of the performer is currently read from settings in the JobDispatcher.doOneCrawl method
and inserted into the template:

if (job.getContinuationOf() == null) {
    ht.insertWarcInfoMetadata(job, origHarvestName, origHarvestSchedule,
            Settings.get(HarvesterSettings.PERFORMER));
} else {
    log.info("Job is a continuation of " + job.getContinuationOf() + " so no need to replace WarcInfoMetadata");
}
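
An alternative to guarding at append time is to normalize the setting where it is read, so that every consumer sees the same "declared or not" answer. The sketch below is only an illustration of that idea: PerformerSettingSketch and declaredPerformer are hypothetical names, not part of NetarchiveSuite.

```java
import java.util.Optional;

// Hypothetical helper; not part of NetarchiveSuite.
final class PerformerSettingSketch {
    /** Treats a null, empty, or whitespace-only setting value as "not declared". */
    static Optional<String> declaredPerformer(String rawSetting) {
        if (rawSetting == null) {
            return Optional.empty();
        }
        String trimmed = rawSetting.trim();
        return trimmed.isEmpty() ? Optional.empty() : Optional.of(trimmed);
    }
}
```

Under this assumption, the call site could pass declaredPerformer(Settings.get(HarvesterSettings.PERFORMER)).orElse(null) to insertWarcInfoMetadata, and the existing null check would then also skip empty values.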