  1. NetarchiveSuite
  2. NAS-2602

WARC-Refers-To-Date in WARC revisit records does not have the right original record date

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 5.2.1
    • Fix Version/s: 5.3
    • Component/s: WARC
    • Labels:
      None
    • Organization:
      BNF
    • Sprint:
      NAS 5.3
    • Verification:
      I took a page from flickr.com and used it as a seed for an event harvest. The test worked perfectly.

      Description

      While trying to display revisit records in OpenWayback, we noticed that the WARC-Refers-To-Date header in WARC revisit records does not contain the correct date of the original record.

      Here is a sample original record and the associated line in the crawl.log:
      {{
      WARC/1.0
      WARC-Type: response
      WARC-Target-URI: http://i3.ytimg.com/vi/siyBp8Csugk/0.jpg
      WARC-Date: 2017-01-16T16:14:21Z
      WARC-IP-Address: 216.58.198.206
      WARC-Payload-Digest: sha1:UQ2OU27CBE76DZ6YAZLNQVWHTGOQVO2I
      WARC-Record-ID: <urn:uuid:f6ed4965-92a2-483e-977c-3794a96af663>
      Content-Type: application/http; msgtype=response
      Content-Length: 22951

      HTTP/1.0 200 OK
      Content-Type: image/jpeg
      Date: Mon, 16 Jan 2017 16:13:10 GMT
      Expires: Mon, 16 Jan 2017 18:13:10 GMT
      ETag: "1484105439"
      X-Content-Type-Options: nosniff
      Server: sffe
      Content-Length: 22660
      X-XSS-Protection: 1; mode=block
      Cache-Control: public, max-age=7200
      Age: 71

      2017-01-16T16:14:26.982Z 200 22660 http://i3.ytimg.com/vi/siyBp8Csugk/0.jpg LXE http://rue89.nouvelobs.com/2017/01/15/rue89.com image/jpeg #161 20170116161421526+52 sha1:UQ2OU27CBE76DZ6YAZLNQVWHTGOQVO2I http://rue89.nouvelobs.com content-size:22951
      }}

      WARC-Date is 2017-01-16T16:14:21Z, which comes from the 9th field of the crawl.log line: 20170116161421526+52
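      As a minimal sketch of that mapping, the snippet below derives the WARC-Date from the 9th field of the sample crawl.log line. The function name is hypothetical (it is not a NetarchiveSuite or Heritrix API), and it assumes the part before the "+" is a 17-digit yyyyMMddHHmmssSSS timestamp that is truncated to whole seconds, which is consistent with the sample record above.

```python
from datetime import datetime

def warc_date_from_crawl_log(line: str) -> str:
    """Return the WARC-Date implied by the 9th field of a crawl.log line.

    Hypothetical helper for illustration only.
    """
    fields = line.split()
    fetch_field = fields[8]              # 9th field, e.g. "20170116161421526+52"
    start = fetch_field.split("+")[0]    # drop the suffix after "+"
    # Assumption: 17 digits, yyyyMMddHHmmssSSS; keep second precision only.
    parsed = datetime.strptime(start[:14], "%Y%m%d%H%M%S")
    return parsed.strftime("%Y-%m-%dT%H:%M:%SZ")

line = (
    "2017-01-16T16:14:26.982Z 200 22660 "
    "http://i3.ytimg.com/vi/siyBp8Csugk/0.jpg LXE "
    "http://rue89.nouvelobs.com/2017/01/15/rue89.com image/jpeg #161 "
    "20170116161421526+52 sha1:UQ2OU27CBE76DZ6YAZLNQVWHTGOQVO2I "
    "http://rue89.nouvelobs.com content-size:22951"
)
print(warc_date_from_crawl_log(line))  # prints 2017-01-16T16:14:21Z
```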

      Here is an associated revisit record and its line in the crawl.log:
      {{

      WARC/1.0
      WARC-Type: revisit
      WARC-Target-URI: http://i3.ytimg.com/vi/siyBp8Csugk/0.jpg
      WARC-Date: 2017-01-17T14:02:30Z
      WARC-IP-Address: 216.58.198.206
      WARC-Profile: http://netpreserve.org/warc/1.0/revisit/identical-payload-digest
      WARC-Truncated: length
      WARC-Payload-Digest: UQ2OU27CBE76DZ6YAZLNQVWHTGOQVO2I
      WARC-Refers-To-Date: 2017-01-16T16:14:26Z
      WARC-Refers-To-Target-URI: http://i3.ytimg.com/vi/siyBp8Csugk/0.jpg
      WARC-Record-ID: <urn:uuid:483dbdfa-123e-45f6-9f8f-5be02c3789f7>
      Content-Type: application/http; msgtype=response
      Content-Length: 292

      HTTP/1.0 200 OK
      Content-Type: image/jpeg
      Date: Tue, 17 Jan 2017 13:55:26 GMT
      Expires: Tue, 17 Jan 2017 15:55:26 GMT
      ETag: "1484105439"
      X-Content-Type-Options: nosniff
      Server: sffe
      Content-Length: 22660
      X-XSS-Protection: 1; mode=block
      Cache-Control: public, max-age=7200
      Age: 424

      2017-01-17T14:02:38.705Z 200 22660 http://i3.ytimg.com/vi/siyBp8Csugk/0.jpg E http://rue89.nouvelobs.com/ image/jpeg #037 20170117140230363+47 sha1:UQ2OU27CBE76DZ6YAZLNQVWHTGOQVO2I http://rue89.nouvelobs.com duplicate:"BnF-22218-28-20170116161005144-00002-ciblee_2016_gulliver228.bnf.fr.warc.gz,254013214,20170116161426982",content-size:22952,3t
      }}

      WARC-Refers-To-Date is 2017-01-16T16:14:26Z, corresponding to 20170116161426982 in the duplicate annotation in the crawl.log. This date is wrong: it matches the 1st column of the original record's crawl.log line (the time the line was written to the crawl.log), not the 9th field from which the original WARC-Date was taken.

      The two dates differ by only 5 seconds:
      2017-01-16T16:14:21Z
      2017-01-16T16:14:26Z
      but that is enough to prevent OpenWayback from finding the original payload.
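      The mismatch can be reproduced from the sample data with a short check. This is only a sketch: the helper names are hypothetical, and it assumes the duplicate annotation has the form duplicate:"filename,offset,timestamp" as in the revisit line above.

```python
import re
from datetime import datetime

def to_warc_date(ts17: str) -> str:
    """Convert a yyyyMMddHHmmssSSS timestamp to WARC date format (whole seconds)."""
    return datetime.strptime(ts17[:14], "%Y%m%d%H%M%S").strftime("%Y-%m-%dT%H:%M:%SZ")

def duplicate_timestamp(annotations: str) -> str:
    """Extract the timestamp from a duplicate:"file,offset,timestamp" annotation."""
    match = re.search(r'duplicate:"([^"]*)"', annotations)
    if match is None:
        raise ValueError("no duplicate annotation found")
    # Split from the right so commas in the file name would not break parsing.
    _filename, _offset, timestamp = match.group(1).rsplit(",", 2)
    return timestamp

# Annotation field of the revisit's crawl.log line (copied from the sample).
annotations = (
    'duplicate:"BnF-22218-28-20170116161005144-00002-'
    'ciblee_2016_gulliver228.bnf.fr.warc.gz,254013214,20170116161426982",'
    "content-size:22952,3t"
)

original_warc_date = "2017-01-16T16:14:21Z"   # WARC-Date of the original record
refers_to_date = to_warc_date(duplicate_timestamp(annotations))

print(refers_to_date)                         # prints 2017-01-16T16:14:26Z
print(refers_to_date == original_warc_date)   # prints False
```

      A replay tool that looks up the original capture by exact URI and date would miss the record whenever this comparison is false.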

              People

              • Assignee:
                lam.mai
              • Reporter:
                Sara Aubry (sara)