  1. NetarchiveSuite
  2. NAS-2602

WARC-Refers-To-Date in WARC revisits records do not have the right original record date

Details

    • Bug
    • Resolution: Fixed
    • Blocker
    • 5.3
    • 5.2.1
    • WARC
    • None
    • BNF
    • NAS 5.3
    • I took a page from flickr.com and used it as seed for an event harvest. The test worked perfectly.

    Description

      While trying to display revisit records in OpenWayback, we noticed that the WARC-Refers-To-Date header in WARC revisit records does not contain the date of the original record.

      Here is a sample original record and the associated line in the crawl.log:
      {{
      WARC/1.0
      WARC-Type: response
      WARC-Target-URI: http://i3.ytimg.com/vi/siyBp8Csugk/0.jpg
      WARC-Date: 2017-01-16T16:14:21Z
      WARC-IP-Address: 216.58.198.206
      WARC-Payload-Digest: sha1:UQ2OU27CBE76DZ6YAZLNQVWHTGOQVO2I
      WARC-Record-ID: <urn:uuid:f6ed4965-92a2-483e-977c-3794a96af663>
      Content-Type: application/http; msgtype=response
      Content-Length: 22951

      HTTP/1.0 200 OK
      Content-Type: image/jpeg
      Date: Mon, 16 Jan 2017 16:13:10 GMT
      Expires: Mon, 16 Jan 2017 18:13:10 GMT
      ETag: "1484105439"
      X-Content-Type-Options: nosniff
      Server: sffe
      Content-Length: 22660
      X-XSS-Protection: 1; mode=block
      Cache-Control: public, max-age=7200
      Age: 71

      2017-01-16T16:14:26.982Z 200 22660 http://i3.ytimg.com/vi/siyBp8Csugk/0.jpg LXE http://rue89.nouvelobs.com/2017/01/15/rue89.com image/jpeg #161 20170116161421526+52 sha1:UQ2OU27CBE76DZ6YAZLNQVWHTGOQVO2I http://rue89.nouvelobs.com content-size:22951
      }}

      WARC-Date is 2017-01-16T16:14:21Z, which comes from the 9th field of the crawl.log line (fetch start timestamp plus fetch duration in milliseconds): 20170116161421526+52
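      The relationship between the 9th crawl.log field and WARC-Date can be sketched as follows. This is only an illustration in Python, not the NetarchiveSuite or Heritrix code; the function name `warc_date_from_crawl_log` is hypothetical, and the sketch assumes whitespace-separated fields and truncation (not rounding) to whole seconds.

```python
from datetime import datetime

# crawl.log line of the original record, copied from the report above
ORIGINAL_LINE = (
    "2017-01-16T16:14:26.982Z 200 22660 "
    "http://i3.ytimg.com/vi/siyBp8Csugk/0.jpg LXE "
    "http://rue89.nouvelobs.com/2017/01/15/rue89.com image/jpeg #161 "
    "20170116161421526+52 sha1:UQ2OU27CBE76DZ6YAZLNQVWHTGOQVO2I "
    "http://rue89.nouvelobs.com content-size:22951"
)


def warc_date_from_crawl_log(line: str) -> str:
    """Derive a WARC-Date style string from the 9th crawl.log field."""
    fields = line.split()
    fetch_field = fields[8]  # 9th field, e.g. "20170116161421526+52"
    start, _, _duration_ms = fetch_field.partition("+")
    # Keep the first 14 digits (yyyyMMddHHmmss) and drop the milliseconds.
    fetch_start = datetime.strptime(start[:14], "%Y%m%d%H%M%S")
    return fetch_start.strftime("%Y-%m-%dT%H:%M:%SZ")


print(warc_date_from_crawl_log(ORIGINAL_LINE))
```

      For the sample line this yields the same value as the WARC-Date header of the original record.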

      Here is an associated revisit record and its line in the crawl.log:
      {{

      WARC/1.0
      WARC-Type: revisit
      WARC-Target-URI: http://i3.ytimg.com/vi/siyBp8Csugk/0.jpg
      WARC-Date: 2017-01-17T14:02:30Z
      WARC-IP-Address: 216.58.198.206
      WARC-Profile: http://netpreserve.org/warc/1.0/revisit/identical-payload-digest
      WARC-Truncated: length
      WARC-Payload-Digest: UQ2OU27CBE76DZ6YAZLNQVWHTGOQVO2I
      WARC-Refers-To-Date: 2017-01-16T16:14:26Z
      WARC-Refers-To-Target-URI: http://i3.ytimg.com/vi/siyBp8Csugk/0.jpg
      WARC-Record-ID: <urn:uuid:483dbdfa-123e-45f6-9f8f-5be02c3789f7>
      Content-Type: application/http; msgtype=response
      Content-Length: 292

      HTTP/1.0 200 OK
      Content-Type: image/jpeg
      Date: Tue, 17 Jan 2017 13:55:26 GMT
      Expires: Tue, 17 Jan 2017 15:55:26 GMT
      ETag: "1484105439"
      X-Content-Type-Options: nosniff
      Server: sffe
      Content-Length: 22660
      X-XSS-Protection: 1; mode=block
      Cache-Control: public, max-age=7200
      Age: 424

      2017-01-17T14:02:38.705Z 200 22660 http://i3.ytimg.com/vi/siyBp8Csugk/0.jpg E http://rue89.nouvelobs.com/ image/jpeg #037 20170117140230363+47 sha1:UQ2OU27CBE76DZ6YAZLNQVWHTGOQVO2I http://rue89.nouvelobs.com duplicate:"BnF-22218-28-20170116161005144-00002-ciblee_2016_gulliver228.bnf.fr.warc.gz,254013214,20170116161426982",content-size:22952,3t
      }}

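      WARC-Refers-To-Date is 2017-01-16T16:14:26Z, corresponding to 20170116161426982 in the duplicate annotation in the crawl.log. This date is wrong: it matches the 1st column of the original record's crawl.log line (2017-01-16T16:14:26.982Z), which is the time the line was written to the crawl.log, not the WARC-Date of the original record.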
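      The mismatch can be checked mechanically. The Python sketch below is an illustration only: the helper names are hypothetical, and reading the three parts of the duplicate annotation as (file name, a number that is presumably the record offset, 17-digit timestamp) is an assumption based on the sample above.

```python
import re

# Values copied from the two crawl.log lines in the report
ORIGINAL_LOG_TIME = "2017-01-16T16:14:26.982Z"   # 1st column, original record
ORIGINAL_FETCH_FIELD = "20170116161421526+52"    # 9th field, original record
REVISIT_ANNOTATIONS = (
    'duplicate:"BnF-22218-28-20170116161005144-00002-ciblee_2016_'
    'gulliver228.bnf.fr.warc.gz,254013214,20170116161426982",'
    "content-size:22952,3t"
)

# Assumed layout: duplicate:"<file name>,<number>,<17-digit timestamp>"
DUPLICATE_RE = re.compile(r'duplicate:"([^,"]+),(\d+),(\d{17})"')


def duplicate_timestamp(annotations: str) -> str:
    """Return the 17-digit timestamp stored in the duplicate annotation."""
    match = DUPLICATE_RE.search(annotations)
    if match is None:
        raise ValueError("no duplicate annotation found")
    return match.group(3)


def digits_only(value: str) -> str:
    """Strip everything except digits, e.g. from an ISO 8601 timestamp."""
    return re.sub(r"\D", "", value)


annotation_ts = duplicate_timestamp(REVISIT_ANNOTATIONS)
fetch_start = ORIGINAL_FETCH_FIELD.split("+")[0]

print(annotation_ts == digits_only(ORIGINAL_LOG_TIME))  # log-write time
print(annotation_ts == fetch_start)                     # fetch start time
```

      On the sample data, the annotation timestamp equals the log-write time of the original line and differs from its fetch start time, which is the value used for WARC-Date.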

      The two dates differ by only five seconds:
      2017-01-16T16:14:21Z (WARC-Date of the original record)
      2017-01-16T16:14:26Z (WARC-Refers-To-Date in the revisit record)
      but the mismatch prevents OpenWayback from finding the original payload.
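      To illustrate why a few seconds matter, here is a toy lookup in Python. It assumes, as a simplification, that the replay tool resolves a revisit by an exact match on the pair (target URI, 14-digit timestamp). This is not OpenWayback's actual code; the index structure and the function names are hypothetical.

```python
URI = "http://i3.ytimg.com/vi/siyBp8Csugk/0.jpg"


def to_14_digits(warc_date: str) -> str:
    """Convert e.g. '2017-01-16T16:14:21Z' to '20170116161421'."""
    return "".join(ch for ch in warc_date if ch.isdigit())


# Toy index: the original response record is keyed by its own WARC-Date.
index = {
    (URI, to_14_digits("2017-01-16T16:14:21Z")): "original response record",
}


def resolve_revisit(idx, refers_to_uri, refers_to_date):
    """Return the original record for a revisit, or None if not found."""
    return idx.get((refers_to_uri, to_14_digits(refers_to_date)))


# Date currently written in WARC-Refers-To-Date: no match
print(resolve_revisit(index, URI, "2017-01-16T16:14:26Z"))
# WARC-Date of the original record: match
print(resolve_revisit(index, URI, "2017-01-16T16:14:21Z"))
```

      Under that assumption, only the WARC-Date of the original record leads back to its payload.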

            People

              Unassigned
              Sara Aubry