Description
While trying to display revisit records in OpenWayback, we noticed that the WARC-Refers-To-Date header in WARC revisit records does not contain the date of the original record.
Here is a sample original record and the associated line in the crawl.log:
{{
WARC/1.0
WARC-Type: response
WARC-Target-URI: http://i3.ytimg.com/vi/siyBp8Csugk/0.jpg
WARC-Date: 2017-01-16T16:14:21Z
WARC-IP-Address: 216.58.198.206
WARC-Payload-Digest: sha1:UQ2OU27CBE76DZ6YAZLNQVWHTGOQVO2I
WARC-Record-ID: <urn:uuid:f6ed4965-92a2-483e-977c-3794a96af663>
Content-Type: application/http; msgtype=response
Content-Length: 22951
HTTP/1.0 200 OK
Content-Type: image/jpeg
Date: Mon, 16 Jan 2017 16:13:10 GMT
Expires: Mon, 16 Jan 2017 18:13:10 GMT
ETag: "1484105439"
X-Content-Type-Options: nosniff
Server: sffe
Content-Length: 22660
X-XSS-Protection: 1; mode=block
Cache-Control: public, max-age=7200
Age: 71
2017-01-16T16:14:26.982Z 200 22660 http://i3.ytimg.com/vi/siyBp8Csugk/0.jpg LXE http://rue89.nouvelobs.com/2017/01/15/rue89.com image/jpeg #161 20170116161421526+52 sha1:UQ2OU27CBE76DZ6YAZLNQVWHTGOQVO2I http://rue89.nouvelobs.com content-size:22951
}}
WARC-Date is 2017-01-16T16:14:21Z, which comes from the 9th field of the crawl.log line (fetch start timestamp plus duration): 20170116161421526+52
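As a minimal sketch of that mapping, the snippet below converts the 9th crawl.log field into the WARC-Date form seen in the record. The function name is hypothetical, and the field layout (17-digit timestamp, `+`, duration in milliseconds) and the truncation of milliseconds are assumptions inferred from the sample above, not taken from the Heritrix source.

```python
from datetime import datetime, timezone


def fetch_field_to_warc_date(field: str) -> str:
    """Convert a crawl.log fetch field such as '20170116161421526+52'
    into a WARC-Date string (hypothetical helper, layout assumed)."""
    # Assumed layout: yyyyMMddHHmmssSSS, then '+', then duration in ms.
    start, _, _duration_ms = field.partition("+")
    # Keep the first 14 digits (seconds precision); the sample shows the
    # milliseconds being dropped, not rounded (21.526 -> 21).
    dt = datetime.strptime(start[:14], "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
    return dt.strftime("%Y-%m-%dT%H:%M:%SZ")


print(fetch_field_to_warc_date("20170116161421526+52"))
```

With the sample field this yields the same value as the WARC-Date of the original record.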
Here is an associated revisit record and its line in the crawl.log:
{{
WARC/1.0
WARC-Type: revisit
WARC-Target-URI: http://i3.ytimg.com/vi/siyBp8Csugk/0.jpg
WARC-Date: 2017-01-17T14:02:30Z
WARC-IP-Address: 216.58.198.206
WARC-Profile: http://netpreserve.org/warc/1.0/revisit/identical-payload-digest
WARC-Truncated: length
WARC-Payload-Digest: UQ2OU27CBE76DZ6YAZLNQVWHTGOQVO2I
WARC-Refers-To-Date: 2017-01-16T16:14:26Z
WARC-Refers-To-Target-URI: http://i3.ytimg.com/vi/siyBp8Csugk/0.jpg
WARC-Record-ID: <urn:uuid:483dbdfa-123e-45f6-9f8f-5be02c3789f7>
Content-Type: application/http; msgtype=response
Content-Length: 292
HTTP/1.0 200 OK
Content-Type: image/jpeg
Date: Tue, 17 Jan 2017 13:55:26 GMT
Expires: Tue, 17 Jan 2017 15:55:26 GMT
ETag: "1484105439"
X-Content-Type-Options: nosniff
Server: sffe
Content-Length: 22660
X-XSS-Protection: 1; mode=block
Cache-Control: public, max-age=7200
Age: 424
2017-01-17T14:02:38.705Z 200 22660 http://i3.ytimg.com/vi/siyBp8Csugk/0.jpg E http://rue89.nouvelobs.com/ image/jpeg #037 20170117140230363+47 sha1:UQ2OU27CBE76DZ6YAZLNQVWHTGOQVO2I http://rue89.nouvelobs.com duplicate:"BnF-22218-28-20170116161005144-00002-ciblee_2016_gulliver228.bnf.fr.warc.gz,254013214,20170116161426982",content-size:22952,3t
}}
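WARC-Refers-To-Date is 2017-01-16T16:14:26Z, which corresponds to 20170116161426982 in the duplicate annotation of the crawl.log line. This date is wrong: it matches the 1st column of the original record's crawl.log line (2017-01-16T16:14:26.982Z), which is the time the line was written to the crawl.log, not the WARC-Date of the original record.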
The two dates differ by only five seconds:
2017-01-16T16:14:21Z (WARC-Date of the original record)
2017-01-16T16:14:26Z (WARC-Refers-To-Date of the revisit record)
but this is enough to prevent OpenWayback from finding the original payload.
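The mismatch can be checked mechanically from the two crawl.log lines. The following is a rough sketch: the helper names are hypothetical, and the shape of the duplicate annotation (`duplicate:"<warc file>,<offset>,<timestamp>"`) is inferred from the sample in this report, not from Heritrix documentation.

```python
import re
from datetime import datetime, timezone


def to_warc_date(ts17: str) -> str:
    """Turn a yyyyMMddHHmmssSSS timestamp into a seconds-precision WARC date."""
    dt = datetime.strptime(ts17[:14], "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
    return dt.strftime("%Y-%m-%dT%H:%M:%SZ")


def duplicate_timestamp(annotations: str) -> str:
    """Extract the third element of the duplicate annotation (assumed layout)."""
    m = re.search(r'duplicate:"([^",]+),(\d+),(\d+)"', annotations)
    if m is None:
        raise ValueError("no duplicate annotation found")
    return m.group(3)


# Values copied from the two crawl.log lines quoted in this report.
original_log_time = "2017-01-16T16:14:26.982Z"   # 1st column, original line
original_fetch_field = "20170116161421526+52"     # 9th field, original line
revisit_annotations = (
    'duplicate:"BnF-22218-28-20170116161005144-00002-'
    'ciblee_2016_gulliver228.bnf.fr.warc.gz,254013214,20170116161426982",'
    "content-size:22952,3t"
)

refers_to = to_warc_date(duplicate_timestamp(revisit_annotations))
expected = to_warc_date(original_fetch_field.split("+")[0])
log_written = original_log_time[:19] + "Z"  # drop the milliseconds

print("WARC-Refers-To-Date  :", refers_to)
print("original WARC-Date   :", expected)
print("matches WARC-Date    :", refers_to == expected)
print("matches log-line time:", refers_to == log_written)
```

On the sample data, the timestamp in the duplicate annotation agrees with the time the original line was written to the crawl.log and not with the original record's WARC-Date, which is the behaviour described above.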