NetarchiveSuite / NAS-2649

harvestInfo.XXXX fields are not added in warcinfo records for resubmitted jobs

Details

    • Bug
    • Resolution: Fixed
    • Major
    • 5.4
    • 5.3, 5.3.1
    • WARC
    • None
    • BNF
    • NAS 5.4
    • Test:

      Start a harvest.
      While it is harvesting, restart the system.
      When the harvest shows up as failed, restart (resubmit) it.
      Wait for it to finish.
      Check that the warcinfo records in the data WARC files contain harvestInfo data.

    Description

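      The harvestInfo.XXXX fields that should be included in the warcinfo records of data files are missing when the job has been resubmitted. In the example below, nothing follows the "#added by NetarchiveSuite" comment line: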

      {{WARC/1.0
      WARC-Type: warcinfo
      WARC-Date: 2017-05-24T05:09:17Z
      WARC-Filename: BnF-23279-25-20170524050917-00160-ciblee_2017_gulliver134.bnf.fr.warc.gz
      WARC-Record-ID: <urn:uuid:1d537950-9140-4f10-8a38-e4ffb200607a>
      Content-Type: application/warc-fields
      Content-Length: 942

      software: Heritrix/3.3.0-LBS-2016-02 http://crawler.archive.org
      ip: 172.20.22.68
      hostname: gulliver134.bnf.fr
      format: WARC File Format 1.0
      conformsTo: http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf
      operator: BnF - DLWeb
      publisher: Bibliotheque nationale de France
      isPartOf: domaine
      description: Parametres utilises pour la collecte ciblee permettant l'archivage de l'URL de depart ainsi que de toutes les pages internes au domaine / Parameters for the focused crawl used to harvest the seed URL and all pages within the same domain. Pas de respect du protocole robots.txt / Does not respect robots.txt
      robots: ignore
      http-header-user-agent: Mozilla/5.0 (compatible; bnf.fr_bot; +http://www.bnf.fr/fr/outils/a.dl_web_capture_robot.html)
      http-header-from: robot@bnf.fr

      #added by NetarchiveSuite Version: 5.3 (https://github.com/netarchivesuite/netarchivesuite/commit/8d1ce389cbd9d12dab176709ec1e0833e835e308)

      WARC/1.0
      WARC-Type: response
      WARC-Target-URI: http://ad.id/
      WARC-Date: 2017-05-24T05:09:17Z
      WARC-IP-Address: 203.119.112.50
      WARC-Payload-Digest: sha1:Y2GG7DZ3JIXQ2I3F6KCNEAXVWG4VGXZA
      WARC-Record-ID: <urn:uuid:68474f16-7b8d-446b-a83a-130d2ac9708b>
      Content-Type: application/http; msgtype=response
      Content-Length: 4264}}
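
The last step of the test (checking that the data WARC contains harvestInfo data) can be scripted. The following is a minimal sketch in Python, not part of NetarchiveSuite: it assumes the data file is a gzip-compressed WARC and that the fields appear in the warcinfo payload as lines of the form "harvestInfo.<name>: <value>". The function names are made up for illustration.

```python
import gzip


def iter_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC byte stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"unexpected line at record start: {line!r}")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        length = int(headers.get("content-length", "0"))
        yield headers, stream.read(length)


def harvest_info_fields(payload):
    """Return the harvestInfo.* fields found in a warcinfo payload (assumed line format)."""
    fields = {}
    for line in payload.decode("utf-8", "replace").splitlines():
        if line.startswith("harvestInfo."):
            name, _, value = line.partition(":")
            fields[name.strip()] = value.strip()
    return fields


def check_warc(path):
    """Yield (WARC-Filename, harvestInfo fields) for every warcinfo record in a .warc.gz file."""
    with gzip.open(path, "rb") as stream:
        for headers, payload in iter_warc_records(stream):
            if headers.get("warc-type") == "warcinfo":
                yield headers.get("warc-filename"), harvest_info_fields(payload)
```

Calling check_warc on a data file from a resubmitted job and finding an empty dictionary for a warcinfo record would reproduce the problem described here; a non-empty dictionary would indicate the fields were written.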

            People

              Assignee: Colin Rosenthal (csr)
              Reporter: Sara Aubry (sara)