[NAS-2602] WARC-Refers-To-Date in WARC revisits records do not have the right original record date Created: 02/Feb/17  Updated: 07/Nov/23  Resolved: 03/Mar/17

Status: Resolved
Project: NetarchiveSuite
Component/s: WARC
Affects Version/s: 5.2.1
Fix Version/s: 5.3

Type: Bug Priority: Blocker
Reporter: Sara Aubry Assignee: Unassigned
Resolution: Fixed  
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Organization:
BNF
Sprint: NAS 5.3
Verification:

I took a page from flickr.com and used it as a seed for an event harvest. The test worked perfectly.


 Description   

While trying to display revisit records in OpenWayback, we noticed that the WARC-Refers-To-Date header in WARC revisit records does not contain the date of the original record.

Here is a sample original record and the associated line in the crawl.log:
{{
WARC/1.0
WARC-Type: response
WARC-Target-URI: http://i3.ytimg.com/vi/siyBp8Csugk/0.jpg
WARC-Date: 2017-01-16T16:14:21Z
WARC-IP-Address: 216.58.198.206
WARC-Payload-Digest: sha1:UQ2OU27CBE76DZ6YAZLNQVWHTGOQVO2I
WARC-Record-ID: <urn:uuid:f6ed4965-92a2-483e-977c-3794a96af663>
Content-Type: application/http; msgtype=response
Content-Length: 22951

HTTP/1.0 200 OK
Content-Type: image/jpeg
Date: Mon, 16 Jan 2017 16:13:10 GMT
Expires: Mon, 16 Jan 2017 18:13:10 GMT
ETag: "1484105439"
X-Content-Type-Options: nosniff
Server: sffe
Content-Length: 22660
X-XSS-Protection: 1; mode=block
Cache-Control: public, max-age=7200
Age: 71

2017-01-16T16:14:26.982Z 200 22660 http://i3.ytimg.com/vi/siyBp8Csugk/0.jpg LXE http://rue89.nouvelobs.com/2017/01/15/rue89.com image/jpeg #161 20170116161421526+52 sha1:UQ2OU27CBE76DZ6YAZLNQVWHTGOQVO2I http://rue89.nouvelobs.com content-size:22951
}}

The WARC-Date is 2017-01-16T16:14:21Z, which is taken from the 9th field of the crawl.log line: 20170116161421526+52
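To make the mapping concrete, here is a minimal Python sketch of how the 9th crawl.log field relates to the WARC-Date above. The function name warc_date_from_fetch_field is hypothetical and is not part of NetarchiveSuite or Heritrix; the sketch assumes the field has the form timestamp+duration, where the first 14 digits of the timestamp are the date down to the second.

```python
from datetime import datetime, timezone


def warc_date_from_fetch_field(field: str) -> str:
    """Turn a crawl.log field such as '20170116161421526+52' into a WARC-Date string.

    Hypothetical helper for illustration only.
    """
    # Everything before '+' is the timestamp; the part after it is ignored here.
    start, _, _duration = field.partition("+")
    # Keep the first 14 digits (date and time to the second), drop the milliseconds.
    moment = datetime.strptime(start[:14], "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
    return moment.strftime("%Y-%m-%dT%H:%M:%SZ")


# crawl.log line of the original record, copied from the sample above.
crawl_line = (
    "2017-01-16T16:14:26.982Z 200 22660 http://i3.ytimg.com/vi/siyBp8Csugk/0.jpg LXE "
    "http://rue89.nouvelobs.com/2017/01/15/rue89.com image/jpeg #161 20170116161421526+52 "
    "sha1:UQ2OU27CBE76DZ6YAZLNQVWHTGOQVO2I http://rue89.nouvelobs.com content-size:22951"
)

fields = crawl_line.split()
# fields[8] is the 9th whitespace-separated field.
print(warc_date_from_fetch_field(fields[8]))  # prints 2017-01-16T16:14:21Z
```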

Here is an associated revisit record and its line in the crawl.log:
{{

WARC/1.0
WARC-Type: revisit
WARC-Target-URI: http://i3.ytimg.com/vi/siyBp8Csugk/0.jpg
WARC-Date: 2017-01-17T14:02:30Z
WARC-IP-Address: 216.58.198.206
WARC-Profile: http://netpreserve.org/warc/1.0/revisit/identical-payload-digest
WARC-Truncated: length
WARC-Payload-Digest: UQ2OU27CBE76DZ6YAZLNQVWHTGOQVO2I
WARC-Refers-To-Date: 2017-01-16T16:14:26Z
WARC-Refers-To-Target-URI: http://i3.ytimg.com/vi/siyBp8Csugk/0.jpg
WARC-Record-ID: <urn:uuid:483dbdfa-123e-45f6-9f8f-5be02c3789f7>
Content-Type: application/http; msgtype=response
Content-Length: 292

HTTP/1.0 200 OK
Content-Type: image/jpeg
Date: Tue, 17 Jan 2017 13:55:26 GMT
Expires: Tue, 17 Jan 2017 15:55:26 GMT
ETag: "1484105439"
X-Content-Type-Options: nosniff
Server: sffe
Content-Length: 22660
X-XSS-Protection: 1; mode=block
Cache-Control: public, max-age=7200
Age: 424

2017-01-17T14:02:38.705Z 200 22660 http://i3.ytimg.com/vi/siyBp8Csugk/0.jpg E http://rue89.nouvelobs.com/ image/jpeg #037 20170117140230363+47 sha1:UQ2OU27CBE76DZ6YAZLNQVWHTGOQVO2I http://rue89.nouvelobs.com duplicate:"BnF-22218-28-20170116161005144-00002-ciblee_2016_gulliver228.bnf.fr.warc.gz,254013214,20170116161426982",content-size:22952,3t
}}

WARC-Refers-To-Date is 2017-01-16T16:14:26Z, which corresponds to 20170116161426982 in the duplicate annotation of the crawl.log. This date is wrong: it corresponds to the 1st column of the original record's crawl.log line, i.e. the time at which that line was written to the crawl.log.

The two dates differ by only five seconds:
2017-01-16T16:14:21Z
2017-01-16T16:14:26Z
but this difference is enough to prevent OpenWayback from finding the original payload.
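The mismatch can be reproduced from the two crawl.log lines alone. The following Python sketch is only an illustration: the helper names are hypothetical, and it assumes the duplicate annotation has the form duplicate:"file,offset,timestamp" as in the sample above.

```python
import re

# crawl.log lines copied from the samples above.
ORIGINAL_LINE = (
    "2017-01-16T16:14:26.982Z 200 22660 http://i3.ytimg.com/vi/siyBp8Csugk/0.jpg LXE "
    "http://rue89.nouvelobs.com/2017/01/15/rue89.com image/jpeg #161 20170116161421526+52 "
    "sha1:UQ2OU27CBE76DZ6YAZLNQVWHTGOQVO2I http://rue89.nouvelobs.com content-size:22951"
)
REVISIT_LINE = (
    "2017-01-17T14:02:38.705Z 200 22660 http://i3.ytimg.com/vi/siyBp8Csugk/0.jpg E "
    "http://rue89.nouvelobs.com/ image/jpeg #037 20170117140230363+47 "
    "sha1:UQ2OU27CBE76DZ6YAZLNQVWHTGOQVO2I http://rue89.nouvelobs.com "
    'duplicate:"BnF-22218-28-20170116161005144-00002-ciblee_2016_gulliver228.bnf.fr.warc.gz,'
    '254013214,20170116161426982",content-size:22952,3t'
)


def duplicate_timestamp(crawl_line):
    """Return the timestamp stored in the duplicate annotation, or None if absent."""
    match = re.search(r'duplicate:"([^"]+)"', crawl_line)
    if match is None:
        return None
    # Split from the right so a comma in the file name would not break the parse.
    _warc_file, _offset, timestamp = match.group(1).rsplit(",", 2)
    return timestamp


def log_write_time(crawl_line):
    """1st column (time the line was written), reduced to its digits."""
    return re.sub(r"\D", "", crawl_line.split()[0])


def fetch_start_time(crawl_line):
    """Timestamp part of the 9th column, the one used for WARC-Date."""
    return crawl_line.split()[8].partition("+")[0]


annotated = duplicate_timestamp(REVISIT_LINE)
print(annotated == log_write_time(ORIGINAL_LINE))    # True: the annotation holds the 1st column
print(annotated == fetch_start_time(ORIGINAL_LINE))  # False: it should hold the 9th column
```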



 Comments   
Comment by Sara Aubry [ 23/Feb/17 ]

To test this fix:
1) Run and complete a job that contains at least one big image or big PDF (this image/PDF should be recorded as a WARC response record).
2) Run a second job on the same harvest (the image/PDF should now be recorded as a WARC revisit record).
3) Check that the WARC-Refers-To-Date of the revisit record matches the WARC-Date of the original record.
4) Compare the crawl.log files of the two jobs; the same date should appear in both:

The first 14 digits should be the same, as the current dates have the following format: YYYYMMDDHHmmss
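Checks 3) and 4) can also be scripted. The sketch below is a hedged Python illustration, not a NetarchiveSuite tool: the helper names are hypothetical, and the header snippets are shortened copies of the records in the description, so they show the behaviour before the fix.

```python
def parse_warc_headers(block):
    """Collect the WARC-* header lines of a record into a dict."""
    headers = {}
    for line in block.splitlines():
        name, sep, value = line.partition(":")
        if sep and name.startswith("WARC-"):
            headers[name] = value.strip()
    return headers


def refers_to_original(original, revisit):
    """Check 3): WARC-Refers-To-Date of the revisit equals WARC-Date of the original."""
    return revisit.get("WARC-Refers-To-Date") == original.get("WARC-Date")


def same_second(timestamp_a, timestamp_b):
    """Check 4): the first 14 digits (YYYYMMDDHHmmss) are identical."""
    return timestamp_a[:14] == timestamp_b[:14]


# Shortened header blocks taken from the records in the description.
original = parse_warc_headers(
    "WARC/1.0\n"
    "WARC-Type: response\n"
    "WARC-Date: 2017-01-16T16:14:21Z\n"
)
revisit = parse_warc_headers(
    "WARC/1.0\n"
    "WARC-Type: revisit\n"
    "WARC-Date: 2017-01-17T14:02:30Z\n"
    "WARC-Refers-To-Date: 2017-01-16T16:14:26Z\n"
)

# With the data from the description, both checks fail.
print(refers_to_original(original, revisit))                  # False
print(same_second("20170116161421526", "20170116161426982"))  # False
```

On a build that contains the fix, both calls are expected to return True for a response/revisit pair produced by steps 1) and 2).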

Generated at Fri Mar 29 08:10:38 CET 2024 using Jira 9.4.15#940015-sha1:bdaa9cbecfb6791ea579749728cab771f0dfe90b.