[NAS-2602] WARC-Refers-To-Date in WARC revisits records do not have the right original record date Created: 02/Feb/17 Updated: 07/Nov/23 Resolved: 03/Mar/17 |
|
Status: | Resolved |
Project: | NetarchiveSuite |
Component/s: | WARC |
Affects Version/s: | 5.2.1 |
Fix Version/s: | 5.3 |
Type: | Bug | Priority: | Blocker |
Reporter: | Sara Aubry | Assignee: | Unassigned |
Resolution: | Fixed | ||
Labels: | None | ||
Remaining Estimate: | Not Specified | ||
Time Spent: | Not Specified | ||
Original Estimate: | Not Specified |
Organization: |
BNF
|
Sprint: | NAS 5.3 |
Verification: | I took a page from flickr.com and used it as seed for an event harvest. The test worked perfectly. |
Description |
Trying to display revisit records in OpenWayback, we noticed that WARC-Refers-To-Date in WARC revisit records does not have the right original record date.

Here is a sample original record and the associated line in the crawl.log:

HTTP/1.0 200 OK

2017-01-16T16:14:26.982Z 200 22660 http://i3.ytimg.com/vi/siyBp8Csugk/0.jpg LXE http://rue89.nouvelobs.com/2017/01/15/rue89.com image/jpeg #161 20170116161421526+52 sha1:UQ2OU27CBE76DZ6YAZLNQVWHTGOQVO2I http://rue89.nouvelobs.com content-size:22951

WARC-Date is 2017-01-16T16:14:21Z, taken from the 9th field in the crawl.log: 20170116161421526+52

Here is an associated revisit record and its line in the crawl.log:

WARC/1.0
HTTP/1.0 200 OK

2017-01-17T14:02:38.705Z 200 22660 http://i3.ytimg.com/vi/siyBp8Csugk/0.jpg E http://rue89.nouvelobs.com/ image/jpeg #037 20170117140230363+47 sha1:UQ2OU27CBE76DZ6YAZLNQVWHTGOQVO2I http://rue89.nouvelobs.com duplicate:"BnF-22218-28-20170116161005144-00002-ciblee_2016_gulliver228.bnf.fr.warc.gz,254013214,20170116161426982",content-size:22952,3t

WARC-Refers-To-Date is 2017-01-16T16:14:26Z, corresponding to 20170116161426982 in the duplicate annotation in the crawl.log.

=> This date is wrong: it corresponds to the 1st column, which is the time the line was written to the crawl.log.

There is not much difference between the two: |
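For illustration only, here is a minimal Python sketch of how the two dates discussed in this description can be read out of the sample crawl.log lines. It assumes whitespace-separated crawl.log fields and a duplicate annotation of the form `duplicate:"<file>,<offset>,<timestamp>"`, as in the sample above. The function names are hypothetical and are not part of NetarchiveSuite or Heritrix.

```python
import re
from datetime import datetime

# Sample crawl.log lines copied from the description above.
ORIGINAL_LINE = (
    "2017-01-16T16:14:26.982Z 200 22660 "
    "http://i3.ytimg.com/vi/siyBp8Csugk/0.jpg LXE "
    "http://rue89.nouvelobs.com/2017/01/15/rue89.com image/jpeg #161 "
    "20170116161421526+52 sha1:UQ2OU27CBE76DZ6YAZLNQVWHTGOQVO2I "
    "http://rue89.nouvelobs.com content-size:22951"
)
REVISIT_LINE = (
    "2017-01-17T14:02:38.705Z 200 22660 "
    "http://i3.ytimg.com/vi/siyBp8Csugk/0.jpg E "
    "http://rue89.nouvelobs.com/ image/jpeg #037 "
    "20170117140230363+47 sha1:UQ2OU27CBE76DZ6YAZLNQVWHTGOQVO2I "
    "http://rue89.nouvelobs.com "
    'duplicate:"BnF-22218-28-20170116161005144-00002-'
    'ciblee_2016_gulliver228.bnf.fr.warc.gz,254013214,20170116161426982",'
    "content-size:22952,3t"
)


def warc_date_from_crawl_log(line):
    """Hypothetical helper: build a WARC-Date from the 9th crawl.log field."""
    fetch_field = line.split()[8]          # e.g. "20170116161421526+52"
    fetch_start = fetch_field.split("+")[0]
    # Keep only YYYYMMDDHHmmss; the milliseconds are dropped, not rounded.
    parsed = datetime.strptime(fetch_start[:14], "%Y%m%d%H%M%S")
    return parsed.strftime("%Y-%m-%dT%H:%M:%SZ")


def duplicate_timestamp(line):
    """Hypothetical helper: timestamp stored in the duplicate annotation."""
    match = re.search(r'duplicate:"([^"]*)"', line)
    if match is None:
        return None
    # Assumed layout: <warc file name>,<offset>,<timestamp>
    return match.group(1).rsplit(",", 2)[-1]


if __name__ == "__main__":
    print("original WARC-Date:  ", warc_date_from_crawl_log(ORIGINAL_LINE))
    print("duplicate annotation:", duplicate_timestamp(REVISIT_LINE))
```

On the sample lines, the first value comes from the fetch start time (9th field) of the original record, while the second comes from the annotation that the revisit record's WARC-Refers-To-Date was built from.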
Comments |
Comment by Sara Aubry [ 23/Feb/17 ] |
To test this fix:
The first 14 digits should be the same, as the current dates have the following format: YYYYMMDDHHmmss |
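The check on the first 14 digits could be scripted roughly as follows. This is a sketch under one assumption: that the two values being compared are the revisit record's WARC-Refers-To-Date and the original record's date, which the comment does not state explicitly. The helper names are hypothetical.

```python
import re


def first_14_digits(value):
    """Reduce a date in either notation to its leading YYYYMMDDHHmmss digits."""
    digits = re.sub(r"\D", "", value)  # drop '-', ':', 'T', 'Z', '+', ...
    return digits[:14]


def same_to_the_second(date_a, date_b):
    """True when both dates agree on their first 14 digits."""
    return first_14_digits(date_a) == first_14_digits(date_b)


if __name__ == "__main__":
    original = "20170116161421526+52"  # 9th crawl.log field of the original
    # Value reported in this issue (wrong) and the value expected after the fix.
    print(same_to_the_second("2017-01-16T16:14:26Z", original))
    print(same_to_the_second("2017-01-16T16:14:21Z", original))
```

Comparing only the first 14 digits ignores the millisecond part and any `+duration` suffix, so the same check works on crawl.log timestamps and on WARC header dates.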