[NAS-2565] Fix orderXMLName and add operator/templateUpdateDate/templateDescription fields in harvestInfo.xml Created: 11/Oct/16  Updated: 07/Nov/23  Resolved: 03/Mar/17

Status: Resolved
Project: NetarchiveSuite
Component/s: WARC
Affects Version/s: None
Fix Version/s: 5.3

Type: Improvement Priority: Critical
Reporter: Sara Aubry Assignee: Unassigned
Resolution: Fixed  
Labels: None
Remaining Estimate: Not Specified
Time Spent: 1h 1m
Original Estimate: Not Specified

Organization:
BNF
Sprint: NAS 5.3

 Description   

We should replace orderXMLName (which was meant for H1) with templateName (which is more generic and compatible with H3).

Also, to ease the extraction of configuration information for a preservation repository, we would like to include three more fields:
<operator> => metadata.operator in crawler-beans.cxml
<templateUpdateDate> => metadata.date in crawler-beans.cxml
<templateDescription> => metadata.description in crawler-beans.cxml
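
The mapping above can be sketched in a few lines of Python. This is only an illustration, not NetarchiveSuite code: the names FIELD_MAP, parse_overrides and harvest_info_fields are hypothetical, the element names are the ones proposed in this description, and the parser handles only plain key=value lines rather than the full Java property list format.

```python
# Hypothetical mapping from Heritrix metadata.* overrides to the
# harvestInfo.xml element names proposed in this description.
FIELD_MAP = {
    "metadata.operator": "operator",
    "metadata.date": "templateUpdateDate",
    "metadata.description": "templateDescription",
}


def parse_overrides(text):
    """Parse simple key=value lines, skipping blank lines and '#' comments.

    Simplified on purpose: line continuations and escape sequences of the
    full Java property list format are not handled.
    """
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def harvest_info_fields(props):
    """Return {element name: value} for the overrides that are present."""
    return {
        element: props[key]
        for key, element in FIELD_MAP.items()
        if key in props
    }
```

Overrides that are absent from the template are simply left out of the result, so a caller can decide whether a missing value should be an error.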



 Comments   
Comment by Sara Aubry [ 23/Feb/17 ]

To test this feature:
1) Edit a harvest template and insert the following in the simple overrides section:

    <!-- SIMPLE OVERRIDES (START)
    Overrides from a text property list 
    -->
    <bean id="simpleOverrides" class="org.springframework.beans.factory.config.PropertyOverrideConfigurer">
        <property name="properties">
            <!-- Overrides the default values used by Heritrix -->
            <value>
                # This Properties map is specified in the Java 'property list' text format
                # http://java.sun.com/javase/6/docs/api/java/util/Properties.html#load%28java.io.Reader%29

                ###
                ### Metadata overrides (placed here for preservation purpose)
                ###

                metadata.jobName=domaine
                metadata.description=Parametres utilises pour la collecte ciblee permettant l'archivage de l'URL de depart ainsi que de toutes les pages internes au domaine / Parameters for the focused crawl used to harvest the seed URL and all pages within the same domain. Pas de respect du protocole robots.txt / Does not respect robots.txt
                metadata.operator=BnF - DLWeb
                metadata.organization=Bibliotheque nationale de France
                metadata.date=20170116110000

2) Run and complete a job using this template.

3) In the associated metadata file, check that the harvestInfo.xml record has fields consistent with the template:

WARC/1.0
WARC-Type: resource
WARC-Record-ID: <urn:uuid:d1a8a657-e7d2-48a2-ae77-a8053f079594>
WARC-Date: 2017-02-20T16:20:34Z
Content-Length: 1248
Content-Type: text/xml
WARC-Block-Digest: sha1:OOZDFR2P4J2SZDQM6E47VCSRIF5LWAAS
WARC-IP-Address: 172.20.22.181
WARC-Target-URI: metadata://netarchivesuite.bnf.fr/crawl/setup/harvestInfo.xml?heritrixVersion=3.3.0-LBS-2016-02&harvestid=53&jobid=22313
WARC-Warcinfo-ID: <urn:uuid:c3703684-c400-40df-937c-8da8f40d5007>

<?xml version="1.0" encoding="UTF-8"?>

<harvestInfo>
  <version>0.6</version>
  <jobId>22313</jobId>
  <channel>CIBLEE</channel>
  <harvestNum>9</harvestNum>
  <origHarvestDefinitionID>53</origHarvestDefinitionID>
  <maxBytesPerDomain>-1</maxBytesPerDomain>
  <maxObjectsPerDomain>150</maxObjectsPerDomain>
  <templateName>domaine</templateName>
  <templateLastUpdateDate>20170116110000</templateLastUpdateDate>
  <templateDescription>Parametres utilises pour la collecte ciblee permettant l'archivage de l'URL de depart ainsi que de toutes les pages internes au domaine / Parameters for the focused crawl used to harvest the seed URL and all pages within the same domain. Pas de respect du protocole robots.txt / Does not respect robots.txt</templateDescription>
  <origHarvestDefinitionName>SAA test rapide</origHarvestDefinitionName>
  <origHarvestDefinitionComments>Cet EC permet de tester le fonctionnement de NAS5 et H3 en environnement de MAB.</origHarvestDefinitionComments>
  <scheduleName>annuelle</scheduleName>
  <harvestFilenamePrefix>BnF-22313-53</harvestFilenamePrefix>
  <jobSubmitDate>2017-02-20T15:29:08Z</jobSubmitDate>
  <performer>Bibliotheque nationale de France</performer>
  <operator>BnF - DLWeb</operator>
</harvestInfo>

4) Check that the harvest template is given in templateName (and not orderXMLName).

5) Check that the version of the format using these three new fields is 0.6.
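
Checks 3) to 5) could be automated along these lines once the harvestInfo.xml payload has been extracted from the metadata WARC record. This is a minimal sketch using only the Python standard library; the function name check_harvest_info and the REQUIRED tuple are hypothetical, and extracting the record from the WARC file is assumed to happen elsewhere.

```python
import xml.etree.ElementTree as ET

# Elements expected in a version 0.6 harvestInfo.xml, as seen in the
# sample record above.
REQUIRED = (
    "templateName",
    "templateLastUpdateDate",
    "templateDescription",
    "operator",
)


def check_harvest_info(xml_text):
    """Return a list of problems; an empty list means every check passed."""
    root = ET.fromstring(xml_text)
    problems = []
    if root.tag != "harvestInfo":
        problems.append(f"unexpected root element: {root.tag}")
    version = root.findtext("version")
    if version != "0.6":
        problems.append(f"version is {version!r}, expected '0.6'")
    if root.find("orderXMLName") is not None:
        problems.append("orderXMLName is still present")
    for name in REQUIRED:
        value = root.findtext(name)
        if value is None or not value.strip():
            problems.append(f"missing or empty element: {name}")
    return problems
```

Comparing the element values with the metadata.* overrides of the template (step 3) still has to be done against the template itself.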

Comment by Colin Rosenthal [ 23/Feb/17 ]

Closed by mistake, reopened.

Comment by Sara Aubry [ 05/Dec/16 ]

Currently, harvestInfo.xml looks like this:
<harvestInfo>
<version>0.5</version>
<jobId>21636</jobId>
<channel>PRESSE</channel>
<harvestNum>1568</harvestNum>
<origHarvestDefinitionID>33</origHarvestDefinitionID>
<maxBytesPerDomain>-1</maxBytesPerDomain>
<maxObjectsPerDomain>-1</maxObjectsPerDomain>
<orderXMLName>lindependant</orderXMLName>
<origHarvestDefinitionName>BnF presse payante quotidienne illimite</origHarvestDefinitionName>
<origHarvestDefinitionComments>Collecte quotidienne des sites de presse payante, proposés par les correspondants du dépôt légal du Web, réalisée par la Bibliothèque nationale de France. Daily crawl of subscription based press websites, selected by curators for Web legal deposit, performed by the Bibliothèque nationale de France.</origHarvestDefinitionComments>
<scheduleName>quotidienne</scheduleName>
<harvestFilenamePrefix>BnF-21636-33</harvestFilenamePrefix>
<jobSubmitDate>2016-11-14T13:02:39Z</jobSubmitDate>
</harvestInfo>

As a recap:
<orderXMLName> => should become <templateName>
<operator> => new field filled with metadata.operator in crawler-beans.cxml
<templateLastUpdateDate> => new field filled with metadata.date in crawler-beans.cxml
<templateDescription> => new field filled with metadata.description in crawler-beans.cxml
<version> => should probably go from 0.5 to 0.6
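
To make the recap concrete, here is a sketch of the same changes applied to a 0.5 document. It is purely illustrative and is not how NetarchiveSuite produces the file: the function upgrade_to_0_6 and its parameters are hypothetical, and the new elements are appended at the end rather than placed in any particular order.

```python
import xml.etree.ElementTree as ET


def upgrade_to_0_6(xml_text, operator, last_update_date, description):
    """Apply the recap to a 0.5 harvestInfo document and return 0.6 XML."""
    root = ET.fromstring(xml_text)

    # <version> goes from 0.5 to 0.6 (created if it was missing).
    version = root.find("version")
    if version is None:
        version = ET.SubElement(root, "version")
    version.text = "0.6"

    # <orderXMLName> becomes <templateName>; the value is kept.
    old = root.find("orderXMLName")
    if old is not None:
        old.tag = "templateName"

    # The three new fields, filled from the template's metadata.* values.
    new_fields = (
        ("operator", operator),
        ("templateLastUpdateDate", last_update_date),
        ("templateDescription", description),
    )
    for tag, value in new_fields:
        ET.SubElement(root, tag).text = value

    return ET.tostring(root, encoding="unicode")
```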
