Description
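The harvestInfo.XXXX fields included in the warcinfo records of data files are empty when the job has been resubmitted. In the example below, no harvestInfo field follows the "#added by NetarchiveSuite" line of the warcinfo record: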
{{WARC/1.0
WARC-Type: warcinfo
WARC-Date: 2017-05-24T05:09:17Z
WARC-Filename: BnF-23279-25-20170524050917-00160-ciblee_2017_gulliver134.bnf.fr.warc.gz
WARC-Record-ID: <urn:uuid:1d537950-9140-4f10-8a38-e4ffb200607a>
Content-Type: application/warc-fields
Content-Length: 942
software: Heritrix/3.3.0-LBS-2016-02 http://crawler.archive.org
ip: 172.20.22.68
hostname: gulliver134.bnf.fr
format: WARC File Format 1.0
conformsTo: http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf
operator: BnF - DLWeb
publisher: Bibliotheque nationale de France
isPartOf: domaine
description: Parametres utilises pour la collecte ciblee permettant l'archivage de l'URL de depart ainsi que de toutes les pages internes au domaine / Parameters for the focused crawl used to harvest the seed URL and all pages within the same domain. Pas de respect du protocole robots.txt / Does not respect robots.txt
robots: ignore
http-header-user-agent: Mozilla/5.0 (compatible; bnf.fr_bot; +http://www.bnf.fr/fr/outils/a.dl_web_capture_robot.html)
http-header-from: robot@bnf.fr
#added by NetarchiveSuite Version: 5.3 (https://github.com/netarchivesuite/netarchivesuite/commit/8d1ce389cbd9d12dab176709ec1e0833e835e308)
WARC/1.0
WARC-Type: response
WARC-Target-URI: http://ad.id/
WARC-Date: 2017-05-24T05:09:17Z
WARC-IP-Address: 203.119.112.50
WARC-Payload-Digest: sha1:Y2GG7DZ3JIXQ2I3F6KCNEAXVWG4VGXZA
WARC-Record-ID: <urn:uuid:68474f16-7b8d-446b-a83a-130d2ac9708b>
Content-Type: application/http; msgtype=response
Content-Length: 4264}}
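A quick way to check a data file for this symptom is to scan the warcinfo payload for harvestInfo lines. The following is only a minimal sketch, not NetarchiveSuite code: the function names `harvest_info_fields` and `missing_or_empty` are hypothetical, the payload is assumed to have already been extracted as text, and `harvestInfo.XXXX` stands in for the real field names as in the description above.

```python
def harvest_info_fields(warcinfo_payload: str) -> dict:
    """Collect the harvestInfo.* fields of a warcinfo payload (hypothetical helper)."""
    fields = {}
    for line in warcinfo_payload.splitlines():
        stripped = line.strip()
        # Skip blank lines and comment lines such as "#added by NetarchiveSuite ...".
        if not stripped or stripped.startswith("#"):
            continue
        # Split on the first colon only, so values containing URLs stay intact.
        name, sep, value = stripped.partition(":")
        if sep and name.strip().startswith("harvestInfo."):
            fields[name.strip()] = value.strip()
    return fields


def missing_or_empty(warcinfo_payload: str) -> bool:
    """True if no harvestInfo.* field is present, or if any of them has an empty value."""
    fields = harvest_info_fields(warcinfo_payload)
    return not fields or any(value == "" for value in fields.values())


# Excerpt of the warcinfo payload quoted in this issue.
payload_from_issue = "\n".join([
    "software: Heritrix/3.3.0-LBS-2016-02 http://crawler.archive.org",
    "robots: ignore",
    "http-header-from: robot@bnf.fr",
    "#added by NetarchiveSuite Version: 5.3",
])

print(missing_or_empty(payload_from_issue))
```

The check treats both cases as a failure, a payload with no harvestInfo lines at all and a payload where a harvestInfo line has an empty value, because the description says "empty" while the quoted record shows the lines absent altogether.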
Issue Links
- related to NAS-2678 NasWARCProcessor throws ugly NullPointerException if harvestInfo missing from template (Closed)