[NAS-2290] Generate revisit records Created: 29/Apr/14 Updated: 27/Oct/16 Resolved: 21/Sep/16 |
|
Status: | Closed |
Project: | NetarchiveSuite |
Component/s: | H3-extensions, WARC |
Affects Version/s: | 5.0-Milestone1 |
Fix Version/s: | 5.2 |
Type: | New Feature | Priority: | Major |
Reporter: | Sara Aubry | Assignee: | Søren Vejrup Carlsen (Inactive) |
Resolution: | Fixed |
Labels: | None |
Remaining Estimate: | Not Specified |
Time Spent: | 55m |
Original Estimate: | Not Specified |
Attachments: | 23-14-20160719160131809-00000-dia-prod-udv-01.kb.dk.warc.gz |
Issue Links: |
Organization: | BNF |
Sprint: | NAS 5.2 |
Verification: | Verify by running a standard netarkivet.dk selective harvest using the hourly schedule. Deduplication must be enabled both in the template and in the NAS instance. |
Description |
Currently, neither one of the two WARC Writers (the default WARCArchiver from Heritrix org.archive.crawler.writer.WARCWriterProcessor or the one from NAS According to Søren it is not the WarcWriter which needs to be changed to |
Comments |
Comment by Søren Vejrup Carlsen (Inactive) [ 29/Sep/16 ] |
Some tests to fix before I commit the code.
Failed tests:
DedupCrawlLogIndexCacheTester.testCombine:153->verifySearchResult:173 Should have correct origin for url http://www.kb.dk/bevarbogen/images/menu_03.gif expected:<...1.kb.dk.arc,92248220[]> but was:<...1.kb.dk.arc,92248220[,20050506114818114]>
CDXOriginCrawlLogIteratorTester.testbug680:238 Wrong origin
CDXOriginCrawlLogIteratorTester.testOriginCrawlLogIterator:83 Must have right origin from CDXReader for http://base.kb.dk/pls/fag_web/fag_www_front.intro expected:<...1.kb.dk.arc,95054220[]> but was:<...1.kb.dk.arc,95054220[,20050506114817950]> |
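The failures above come down to the origin string gaining a third, comma-separated field. As a rough illustration only (Python, not the NAS test code), an origin value could be parsed so that both the two-field and the three-field form are accepted. The function name, the field names, and the file name `example.arc` are hypothetical; the reading of the third field as a 17-digit harvest timestamp is an assumption based on the values in the test output.

```python
from typing import NamedTuple, Optional


class Origin(NamedTuple):
    filename: str
    offset: int
    timestamp: Optional[str]  # assumed: 17-digit harvest time, absent in the old form


def parse_origin(origin: str) -> Origin:
    """Split 'file,offset' or 'file,offset,timestamp' into its parts."""
    parts = origin.split(",")
    if len(parts) not in (2, 3):
        raise ValueError(f"unexpected origin format: {origin!r}")
    filename, offset = parts[0], int(parts[1])
    timestamp = parts[2] if len(parts) == 3 else None
    return Origin(filename, offset, timestamp)


# Old form (what the tests expected) and new form (what the code now returns).
print(parse_origin("example.arc,92248220"))
print(parse_origin("example.arc,92248220,20050506114818114"))
```

A test that compares only `filename` and `offset` would pass for both forms, while one that compares the raw string would need its expected value updated.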
Comment by Søren Vejrup Carlsen (Inactive) [ 29/Sep/16 ] |
There is now a property for this on the DeDuplicator bean, in case you want to turn this off: <property name="revisitInWarcs" value="true"/>. By default, writing revisit records is enabled. |
Comment by Søren Vejrup Carlsen (Inactive) [ 15/Sep/16 ] |
No, not currently. It is the Deduplicator that triggers H3 to write revisit records, and whenever deduplication is enabled, revisit records are produced. |
Comment by Sara Aubry [ 15/Sep/16 ] |
Quick question: will the production of revisit records be configurable in NAS settings? |
Comment by Sara Aubry [ 15/Sep/16 ] |
The use of named fields which are not officially declared in the WARC 1.0 standard is not forbidden (confirmed by Clement). |
Comment by Søren Vejrup Carlsen (Inactive) [ 03/Aug/16 ] |
Now Heritrix3 should produce the correct WARC 1.0 revisit-records. |
Comment by Søren Vejrup Carlsen (Inactive) [ 26/Jul/16 ] |
The new deduplicator writes revisit records like this, conforming to the WARC 1.0 standard:
WARC/1.0
WARC-Type: revisit
WARC-Target-URI: http://www.familien-carlsen.dk/pania-de-croce.jpg
WARC-Date: 2016-07-19T16:01:36Z
WARC-IP-Address: 46.30.212.223
WARC-Profile: http://netpreserve.org/warc/1.0/revisit/identical-payload-digest
WARC-Truncated: length
WARC-Refers-To-Target-URI: http://www.familien-carlsen.dk/pania-de-croce.jpg
WARC-Refers-To-Date: 20160718160141564
WARC-Payload-Digest: OR2GYA2BOWYYTONJ6U5TF5YHOOQYIJBT
WARC-Record-ID: <urn:uuid:716e15b5-43ed-45dc-adb2-755762f216e1>
Content-Type: application/http; msgtype=response
Content-Length: 295

HTTP/1.1 200 OK
Server: Apache
Last-Modified: Wed, 15 Sep 2004 09:55:21 GMT
ETag: "38e41f81-b2f38-3e41ded90b440"
Content-Type: image/jpeg
Content-Length: 732984
Accept-Ranges: bytes
Date: Tue, 19 Jul 2016 16:01:37 GMT
X-Varnish: 930321265
Age: 0
Via: 1.1 varnish
Connection: close

It seems that only the date format of WARC-Refers-To-Date is wrong (it should look like 2016-09-19T17:20:24Z), aside from the record being marked as WARC/1.0. |
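The date-format issue in the record above can be illustrated with a short sketch (Python, not the actual Heritrix or NAS code). It assumes the 17-digit WARC-Refers-To-Date value is a UTC timestamp of the form yyyyMMddHHmmssSSS and converts it to the second-precision form that WARC-Date uses in the same record. The function name is hypothetical.

```python
from datetime import datetime, timezone


def to_warc_date(ts17: str) -> str:
    """Convert a 17-digit timestamp (assumed yyyyMMddHHmmssSSS, UTC) to WARC date form."""
    if len(ts17) != 17 or not ts17.isdigit():
        raise ValueError(f"expected 17 digits, got {ts17!r}")
    # The target form has second precision, so the trailing milliseconds are dropped.
    parsed = datetime.strptime(ts17[:14], "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
    return parsed.strftime("%Y-%m-%dT%H:%M:%SZ")


# Value taken from the WARC-Refers-To-Date header in the record above.
print(to_warc_date("20160718160141564"))  # prints 2016-07-18T16:01:41Z
```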
Comment by Søren Vejrup Carlsen (Inactive) [ 15/Jul/16 ] |
The WARCWriterProcessor bundled with H3 is hardwired to write WARC/1.0 warc records. The WARCWriterProcessor used by NAS 5.X (the class dk.netarkivet.harvester.harvesting.NasWARCProcessor) already extends |
Comment by Sara Aubry [ 12/Jul/16 ] |
I double-checked in the WARC 1.1 revision: we now have the new fields WARC-Refers-To-Target-URI and WARC-Refers-To-Date, which can be used instead of, or as a complement to, WARC-Refers-To in revisit records. Citation:
Example of a revisit record: HTTP/1.x 304 Not Modified
If we want to create these records, we need to change the record version from WARC/1.0 to WARC/1.1. |
Comment by Sara Aubry [ 28/May/15 ] |
During the WARC workshop at the Stanford GA, everyone agreed that the recommendations on revisit records should be changed and clarified. In 2013, the harvesting working group made this proposal:
A sample file produced by BL: HTTP/1.1 200 OK
So we should not need WARC IDs anymore, but we would need the original record date. |
Comment by Søren Vejrup Carlsen (Inactive) [ 15/May/15 ] |
I believe that a revisit record requires a warc-id for the original record. |