[NAS-2565] Fix orderXMLName and add operator/templateUpdateDate/templateDescription fields in harvestInfo.xml Created: 11/Oct/16 Updated: 07/Nov/23 Resolved: 03/Mar/17 |
|
Status: | Resolved |
Project: | NetarchiveSuite |
Component/s: | WARC |
Affects Version/s: | None |
Fix Version/s: | 5.3 |
Type: | Improvement | Priority: | Critical |
Reporter: | Sara Aubry | Assignee: | Unassigned |
Resolution: | Fixed | ||
Labels: | None | ||
Remaining Estimate: | Not Specified | ||
Time Spent: | 1h 1m | ||
Original Estimate: | Not Specified |
Organization: |
BNF
|
Sprint: | NAS 5.3 |
Description |
We should replace orderXMLName (which was meant for H1) with templateName (which is more generic and compatible with H3). Also, to ease the extraction of configuration information for a preservation repository, we would like to include three more fields: operator, the template update date and the template description. |
Comments |
Comment by Sara Aubry [ 23/Feb/17 ] |
To test this feature:
1) Use a harvest template whose metadata overrides are set, for example:
<!-- SIMPLE OVERRIDES (START) Overrides from a text property list -->
<bean id="simpleOverrides" class="org.springframework.beans.factory.config.PropertyOverrideConfigurer">
<property name="properties">
<!-- Overrides the default values used by Heritrix -->
<value>
# This Properties map is specified in the Java 'property list' text format
# http://java.sun.com/javase/6/docs/api/java/util/Properties.html#load%28java.io.Reader%29
###
### Metadata overrides (placed here for preservation purpose)
###
metadata.jobName=domaine
metadata.description=Parametres utilises pour la collecte ciblee permettant l'archivage de l'URL de depart ainsi que de toutes les pages internes au domaine / Parameters for the focused crawl used to harvest the seed URL and all pages within the same domain. Pas de respect du protocole robots.txt / Does not respect robots.txt
metadata.operator=BnF - DLWeb
metadata.organization=Bibliotheque nationale de France
metadata.date=20170116110000
2) Run and complete a job using this template.
3) In the associated metadata file, check that the harvestInfo.xml record has consistent fields:
WARC/1.0
WARC-Type: resource
WARC-Record-ID: <urn:uuid:d1a8a657-e7d2-48a2-ae77-a8053f079594>
WARC-Date: 2017-02-20T16:20:34Z
Content-Length: 1248
Content-Type: text/xml
WARC-Block-Digest: sha1:OOZDFR2P4J2SZDQM6E47VCSRIF5LWAAS
WARC-IP-Address: 172.20.22.181
WARC-Target-URI: metadata://netarchivesuite.bnf.fr/crawl/setup/harvestInfo.xml?heritrixVersion=3.3.0-LBS-2016-02&harvestid=53&jobid=22313
WARC-Warcinfo-ID: <urn:uuid:c3703684-c400-40df-937c-8da8f40d5007>

<?xml version="1.0" encoding="UTF-8"?>
<harvestInfo>
  <version>0.6</version>
  <jobId>22313</jobId>
  <channel>CIBLEE</channel>
  <harvestNum>9</harvestNum>
  <origHarvestDefinitionID>53</origHarvestDefinitionID>
  <maxBytesPerDomain>-1</maxBytesPerDomain>
  <maxObjectsPerDomain>150</maxObjectsPerDomain>
  <templateName>domaine</templateName>
  <templateLastUpdateDate>20170116110000</templateLastUpdateDate>
  <templateDescription>Parametres utilises pour la collecte ciblee permettant l'archivage de l'URL de depart ainsi que de toutes les pages internes au domaine / Parameters for the focused crawl used to harvest the seed URL and all pages within the same domain. Pas de respect du protocole robots.txt / Does not respect robots.txt</templateDescription>
  <origHarvestDefinitionName>SAA test rapide</origHarvestDefinitionName>
  <origHarvestDefinitionComments>Cet EC permet de tester le fonctionnement de NAS5 et H3 en environnement de MAB.</origHarvestDefinitionComments>
  <scheduleName>annuelle</scheduleName>
  <harvestFilenamePrefix>BnF-22313-53</harvestFilenamePrefix>
  <jobSubmitDate>2017-02-20T15:29:08Z</jobSubmitDate>
  <performer>Bibliotheque nationale de France</performer>
  <operator>BnF - DLWeb</operator>
</harvestInfo>
4) Check that the harvest template is stated in templateName (and not orderXMLName).
5) Check that the version using these 3 new fields is 0.6. |
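Checks 3) to 5) could be scripted along these lines. This is only a minimal sketch, not part of NetarchiveSuite: the function name check_harvest_info is hypothetical, the field names are taken from the sample record above, and it assumes the harvestInfo.xml payload has already been extracted from the WARC metadata record as a string.

```python
import xml.etree.ElementTree as ET

# Field names and version copied from the sample record in this comment.
REQUIRED_FIELDS = (
    "templateName",
    "templateLastUpdateDate",
    "templateDescription",
    "operator",
)
EXPECTED_VERSION = "0.6"


def check_harvest_info(xml_text):
    """Return a list of problems; an empty list means every check passed.

    Hypothetical helper: xml_text is the harvestInfo.xml payload only,
    without the WARC headers that precede it in the metadata file.
    """
    root = ET.fromstring(xml_text)
    problems = []

    if root.tag != "harvestInfo":
        problems.append("root element is <%s>, expected <harvestInfo>" % root.tag)

    # Step 5: the version carrying the new fields should be 0.6.
    version = root.findtext("version")
    if version != EXPECTED_VERSION:
        problems.append("version is %r, expected %r" % (version, EXPECTED_VERSION))

    # Step 4: the old H1-specific element should no longer be written.
    if root.find("orderXMLName") is not None:
        problems.append("<orderXMLName> is still present")

    # Step 3: each new field should exist and be non-empty.
    for name in REQUIRED_FIELDS:
        value = root.findtext(name)
        if value is None or not value.strip():
            problems.append("missing or empty <%s>" % name)

    return problems
```

Calling check_harvest_info on the record shown above should return an empty list; any returned string names the check that failed.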
Comment by Colin Rosenthal [ 23/Feb/17 ] |
Closed by mistake, reopened. |
Comment by Sara Aubry [ 05/Dec/16 ] |
Currently, harvestInfo.xml looks like this: As a recap: |