[NAS-2290] Generate revisit records Created: 29/Apr/14  Updated: 27/Oct/16  Resolved: 21/Sep/16

Status: Closed
Project: NetarchiveSuite
Component/s: H3-extensions, WARC
Affects Version/s: 5.0-Milestone1
Fix Version/s: 5.2

Type: New Feature Priority: Major
Reporter: Sara Aubry Assignee: Søren Vejrup Carlsen (Inactive)
Resolution: Fixed  
Labels: None
Remaining Estimate: Not Specified
Time Spent: 55m
Original Estimate: Not Specified

Attachments: File 23-14-20160719160131809-00000-dia-prod-udv-01.kb.dk.warc.gz    
Issue Links:
Duplicate
is duplicated by NAS-2521 Support WARC Revisit records Closed
Spawned
spawned NAS-2547 Create WARC 1.1 records instead of WA... Open
Organization:
BNF
Sprint: NAS 5.2
Verification:

Verify by running a standard netarkivet.dk selective harvest using the hourly schedule. Deduplication must be enabled both in the template and in the NAS instance.
Activate the harvest and let it run twice. Verify that revisit records are present in the second harvest.
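
The check in the second step could be scripted roughly as follows. This is only a sketch: the class name RevisitCounter and its method are hypothetical and not part of NAS or Heritrix, the match on the header line is naive, and it assumes the JDK's GZIPInputStream reads through all concatenated gzip members of the .warc.gz file.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;

// Hypothetical helper, not part of NAS or Heritrix.
final class RevisitCounter {

    /** Counts lines that mark a WARC record as a revisit. */
    static long countRevisits(BufferedReader reader) {
        // Naive match: a payload that happens to contain this exact line
        // would be counted as well.
        return reader.lines()
                .filter(line -> line.trim().equalsIgnoreCase("WARC-Type: revisit"))
                .count();
    }

    public static void main(String[] args) throws IOException {
        // args[0]: path to a .warc.gz file written by the second harvest
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new FileInputStream(args[0])),
                StandardCharsets.ISO_8859_1))) {
            System.out.println("revisit records: " + countRevisits(reader));
        }
    }
}
```

A count of zero for the second harvest would suggest that deduplication is not enabled in the template or in the NAS instance.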


 Description   

Currently, neither of the two WARC writers (the default WARCArchiver from Heritrix, org.archive.crawler.writer.WARCWriterProcessor, or the one from NAS,
dk.netarkivet.harvester.harvesting.WARCWriterProcessor)
produces WARC revisit records for duplicates, although the duplicates are identified in the crawl.log.

According to Søren, it is not the WarcWriter that needs to be changed to
enable generation of revisit records, but the deduplication module. The
estimated effort to do this is 1-2 weeks.



 Comments   
Comment by Søren Vejrup Carlsen (Inactive) [ 29/Sep/16 ]

Some tests to fix before I commit the code

Failed tests: 
  DedupCrawlLogIndexCacheTester.testCombine:153->verifySearchResult:173 Should have correct origin for url http://www.kb.dk/bevarbogen/images/menu_03.gif expected:<...1.kb.dk.arc,92248220[]> but was:<...1.kb.dk.arc,92248220[,20050506114818114]>
  CDXOriginCrawlLogIteratorTester.testbug680:238 Wrong origin
  CDXOriginCrawlLogIteratorTester.testOriginCrawlLogIterator:83 Must have right origin from CDXReader for http://base.kb.dk/pls/fag_web/fag_www_front.intro expected:<...1.kb.dk.arc,95054220[]> but was:<...1.kb.dk.arc,95054220[,20050506114817950]>
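
The failures above show the origin string gaining a third comma-separated field, a 17-digit timestamp, after the file name and offset. A minimal sketch of a parser that accepts both forms follows. The class Origin is hypothetical and does not reflect the actual NAS code, and example.arc is a placeholder file name.

```java
// Hypothetical value class, not the NAS implementation.
final class Origin {
    final String fileName;
    final long offset;
    final String timestamp; // null when the origin has no timestamp field

    private Origin(String fileName, long offset, String timestamp) {
        this.fileName = fileName;
        this.offset = offset;
        this.timestamp = timestamp;
    }

    /** Parses "file,offset" or "file,offset,timestamp". */
    static Origin parse(String origin) {
        String[] parts = origin.split(",");
        if (parts.length < 2 || parts.length > 3) {
            throw new IllegalArgumentException("Unexpected origin: " + origin);
        }
        String timestamp = parts.length == 3 ? parts[2] : null;
        return new Origin(parts[0], Long.parseLong(parts[1]), timestamp);
    }

    public static void main(String[] args) {
        Origin withDate = Origin.parse("example.arc,92248220,20050506114818114");
        System.out.println(withDate.fileName + " @ " + withDate.offset
                + " harvested " + withDate.timestamp);
    }
}
```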
Comment by Søren Vejrup Carlsen (Inactive) [ 29/Sep/16 ]

There is now a property for this on the DeDuplicator bean, in case you want to turn revisit records off:

<property name="revisitInWarcs" value="true"/>

Generation of revisit records is enabled by default.

Comment by Søren Vejrup Carlsen (Inactive) [ 15/Sep/16 ]

No, not currently. It is the Deduplicator that triggers H3 to make revisit records. When deduplication is enabled, revisit records are produced
whenever the deduplicator finds a match in the deduplication index.
Of course, if you want to be able to turn off revisit records, we could add a setting for the deduplicator.

Comment by Sara Aubry [ 15/Sep/16 ]

Quick question: will the production of revisit records be configurable in NAS settings?

Comment by Sara Aubry [ 15/Sep/16 ]

Using named fields that are not officially declared in the WARC 1.0 standard is not forbidden (confirmed by Clement).
So there is no problem in using the two new fields (WARC-Refers-To-Target-URI and WARC-Refers-To-Date) and sticking to the 1.0 version.
BL and LBS have been producing revisit records with these two fields for a while.

Comment by Søren Vejrup Carlsen (Inactive) [ 03/Aug/16 ]

Now Heritrix3 should produce the correct WARC 1.0 revisit-records.
Making Heritrix3 write 1.1 WARC-records is a different matter.
Please address this in the NAS-2547 issue

Comment by Søren Vejrup Carlsen (Inactive) [ 26/Jul/16 ]

The new deduplicator writes revisit records like this, conforming to the WARC 1.0 standard:

WARC/1.0
WARC-Type: revisit
WARC-Target-URI: http://www.familien-carlsen.dk/pania-de-croce.jpg
WARC-Date: 2016-07-19T16:01:36Z
WARC-IP-Address: 46.30.212.223
WARC-Profile: http://netpreserve.org/warc/1.0/revisit/identical-payload-digest
WARC-Truncated: length
WARC-Refers-To-Target-URI: http://www.familien-carlsen.dk/pania-de-croce.jpg
WARC-Refers-To-Date: 20160718160141564
WARC-Payload-Digest: OR2GYA2BOWYYTONJ6U5TF5YHOOQYIJBT
WARC-Record-ID: <urn:uuid:716e15b5-43ed-45dc-adb2-755762f216e1>
Content-Type: application/http; msgtype=response
Content-Length: 295

HTTP/1.1 200 OK
Server: Apache
Last-Modified: Wed, 15 Sep 2004 09:55:21 GMT
ETag: "38e41f81-b2f38-3e41ded90b440"
Content-Type: image/jpeg
Content-Length: 732984
Accept-Ranges: bytes
Date: Tue, 19 Jul 2016 16:01:37 GMT
X-Varnish: 930321265
Age: 0
Via: 1.1 varnish
Connection: close

Aside from the record being marked as WARC/1.0, it seems only the date format of WARC-Refers-To-Date is wrong. It should look like this:

WARC-Refers-To-Date: 2016-09-19T17:20:24Z
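
The 17-digit value in the record above (20160718160141564) can be converted to the WARC date format along these lines. This is a sketch under assumptions: the class WarcDates is hypothetical, the 17-digit timestamp is taken to be UTC in the form yyyyMMddHHmmssSSS, and the milliseconds are dropped.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, not part of NAS or Heritrix.
final class WarcDates {
    private static final DateTimeFormatter COMPACT =
            DateTimeFormatter.ofPattern("yyyyMMddHHmmss");
    private static final DateTimeFormatter WARC =
            DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss'Z'");

    /** Converts e.g. 20160718160141564 to 2016-07-18T16:01:41Z (input assumed UTC). */
    static String toWarcDate(String timestamp) {
        if (timestamp == null || !timestamp.matches("\\d{14,17}")) {
            throw new IllegalArgumentException("Expected 14 to 17 digits: " + timestamp);
        }
        // Only the first 14 digits (down to seconds) are used; milliseconds are dropped.
        LocalDateTime parsed = LocalDateTime.parse(timestamp.substring(0, 14), COMPACT);
        return parsed.format(WARC);
    }

    public static void main(String[] args) {
        System.out.println(toWarcDate("20160718160141564")); // prints 2016-07-18T16:01:41Z
    }
}
```

Parsing only the first 14 digits sidesteps fractional-second parsing and yields the second-level precision used in the WARC-Date fields shown in this issue.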

Comment by Søren Vejrup Carlsen (Inactive) [ 15/Jul/16 ]

The WARCWriterProcessor bundled with H3 is hardwired to write WARC/1.0 records.
So if we want to write WARC/1.1 records, we must create our own WARCWriterProcessor or extend the existing one.

The WARCWriterProcessor used by NAS 5.X (the class dk.netarkivet.harvester.harvesting.NasWARCProcessor) already extends
org.archive.modules.writer.WARCWriterProcessor, so it may be possible to fix this by extending it further.

Comment by Sara Aubry [ 12/Jul/16 ]

I double-checked in the WARC 1.1 revision: we now have the new fields WARC-Refers-To-Target-URI and WARC-Refers-To-Date, which can be used instead of, or as a complement to, WARC-Refers-To in revisit records.

Citation:
Using a WARC-Refers-To header to identify a specific prior record from which the matching content can be retrieved is recommended, to minimize the risk of misinterpreting the 'revisit' record. The following two optional fields can also be used to associate the revisit record with the original record. Their use is recommended:

  • WARC-Refers-To-Target-URI. Its value should be equal to the WARC-Target-URI in the WARC record that the current record is considered a duplicate of.
  • WARC-Refers-To-Date. Its value should be equal to the WARC-Date in the WARC record that the current record is considered a duplicate of.

Example of a revisit record:
WARC/1.1
WARC-Type: revisit
WARC-Target-URI: http://www.archive.org/images/logoc.jpg
WARC-Date: 2017-06-23T12:43:35Z
WARC-Profile: http://netpreserve.org/warc/1.1/server-not-modified
WARC-Record-ID: <urn:uuid:16da6da0-bcdc-49c3-927e-57494593bbbb>
WARC-Refers-To: <urn:uuid:92283950-ef2f-4d72-b224-f54c6ec90bb0>
WARC-Refers-To-Target-URI: http://www.archive.org/images/logoc.jpg
WARC-Refers-To-Date: 2016-09-19T17:20:24Z
Content-Type: message/http
Content-Length: 226

HTTP/1.x 304 Not Modified
Date: Tue, 06 Mar 2017 00:43:35 GMT
Server: Apache/2.0.54 (Ubuntu) PHP/5.0.5-2ubuntu1.4
Connection: Keep-Alive
Keep-Alive: timeout=15, max=100
ETag: "3e45-67e-2ed02ec0"

If we want to create these records, we need to change the record version from WARC/1.0 to WARC/1.1

Comment by Sara Aubry [ 28/May/15 ]

During the WARC workshop at the Stanford GA, everyone agreed that the recommendations on revisit records should be changed and clarified. In 2013, the harvesting working group made this proposal:
https://github.com/iipc/openwayback/wiki/How-OpenWayback-handles-revisit-records-in-WARC-files
which introduces these fields:
WARC-Refers-To-Target-URI: This value should be equal to the WARC-Target-URI in the WARC record that the current record is considered a duplicate of.
WARC-Refers-To-Date: This value should be equal to the WARC-Date in the WARC record that the current record is considered a duplicate of.
And the use of WARC-Profile set to ‘identical-payload-digest’.

A sample file produced by BL:
WARC/1.0
WARC-Type: revisit
WARC-Target-URI: http://bl.uk/subjects/news-media/
WARC-Date: 2014-11-29T09:30:53Z
WARC-Payload-Digest: sha1:IUTFLOMMNZVZEJ6EIHSQLOFFFG3PBA5S
WARC-IP-Address: 194.66.233.215
WARC-Profile: http://netpreserve.org/warc/1.0/revisit/identical-payload-digest
WARC-Truncated: length
WARC-Refers-To-Target-URI: http://bl.uk/subjects/news-media/
WARC-Refers-To-Date: 2014-11-29T09:18:39Z
WARC-Record-ID: <urn:uuid:09c6d242-3165-42ac-89ba-c7a2189dff87>
Content-Type: application/http; msgtype=response
Content-Length: 385

HTTP/1.1 200 OK
Date: Sat, 29 Nov 2014 09:30:58 GMT
Server: Microsoft-IIS/8.5
Cache-Control: no-cache, no-store
Pragma: no-cache
Content-Type: text/html; charset=utf-8
Expires: -1
X-UA-Compatible: IE=edge,chrome=1
Content-Length: 75331
Set-Cookie: SC_ANALYTICS_SESSION_COOKIE=9489BB2196AA43EB8295F6D98063C3DD|1|bhjlnvnx3eh3e3wxjuahbpdo; path=/; HttpOnly
Connection: close

So we should not need WARC IDs anymore, but we would need the original record date.
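
To illustrate the point, here is a sketch of the named fields of such a revisit record, built from the original record's URI and date without any WARC-Refers-To record ID. The class RevisitHeaders and its method are hypothetical. The field names, profile URI and sample values are taken from the BL record above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of NAS or Heritrix.
final class RevisitHeaders {
    static final String PROFILE =
            "http://netpreserve.org/warc/1.0/revisit/identical-payload-digest";

    /** Builds the revisit-specific named fields; no WARC-Refers-To record ID is needed. */
    static Map<String, String> build(String targetUri, String warcDate,
                                     String originalUri, String originalWarcDate,
                                     String payloadDigest) {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("WARC-Type", "revisit");
        headers.put("WARC-Target-URI", targetUri);
        headers.put("WARC-Date", warcDate);
        headers.put("WARC-Payload-Digest", payloadDigest);
        headers.put("WARC-Profile", PROFILE);
        // The original record is identified by its URI and date only.
        headers.put("WARC-Refers-To-Target-URI", originalUri);
        headers.put("WARC-Refers-To-Date", originalWarcDate);
        return headers;
    }

    public static void main(String[] args) {
        Map<String, String> headers = build(
                "http://bl.uk/subjects/news-media/", "2014-11-29T09:30:53Z",
                "http://bl.uk/subjects/news-media/", "2014-11-29T09:18:39Z",
                "sha1:IUTFLOMMNZVZEJ6EIHSQLOFFFG3PBA5S");
        headers.forEach((name, value) -> System.out.println(name + ": " + value));
    }
}
```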

Comment by Søren Vejrup Carlsen (Inactive) [ 15/May/15 ]

I believe that a revisit record requires a WARC ID for the original record.
Currently, we only have filename,offset to represent the original record in the Lucene index.

Generated at Tue Apr 23 10:54:11 CEST 2024 using Jira 9.4.15#940015-sha1:bdaa9cbecfb6791ea579749728cab771f0dfe90b.