Details
New Feature
Resolution: Fixed
Major
5.0-Milestone1
None
BNF
NAS 5.2
Description
Currently, neither of the two WARC writers (the default Heritrix one, org.archive.crawler.writer.WARCWriterProcessor, and the NAS one,
dk.netarkivet.harvester.harvesting.WARCWriterProcessor)
produces WARC revisit records for duplicates, although the duplicates are identified in the crawl.log.
According to Søren, it is not the WARC writer that needs to be changed to
enable generation of revisit records, but the deduplication module. The
estimated effort to do this is 1-2 weeks.
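
For reference, a rough sketch of the header block of a WARC 1.0 revisit record for a duplicate. This is not code from Heritrix or NAS: the class RevisitHeaderSketch, its methods, and all sample values (URI, date, digest, record IDs) are made up for illustration. Only the header field names and the identical-payload-digest profile URI come from the WARC 1.0 specification.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.UUID;

// Hypothetical illustration only; not part of Heritrix or NAS.
public class RevisitHeaderSketch {

    // Revisit profile defined by WARC 1.0 for payloads with an identical digest.
    public static final String PROFILE =
            "http://netpreserve.org/warc/1.0/revisit/identical-payload-digest";

    // Collects the named header fields of a revisit record in insertion order.
    public static Map<String, String> revisitHeaders(String recordId, String targetUri,
            String date, String payloadDigest, String refersTo, long contentLength) {
        Map<String, String> h = new LinkedHashMap<>();
        h.put("WARC-Type", "revisit");
        h.put("WARC-Record-ID", recordId);
        h.put("WARC-Target-URI", targetUri);
        h.put("WARC-Date", date);
        h.put("WARC-Profile", PROFILE);
        h.put("WARC-Payload-Digest", payloadDigest);
        // Optional: points at the earlier record that holds the payload.
        h.put("WARC-Refers-To", refersTo);
        // The record block carries only the HTTP response headers.
        h.put("Content-Type", "application/http; msgtype=response");
        h.put("Content-Length", Long.toString(contentLength));
        return h;
    }

    // Renders the version line and the headers, ending with an empty line.
    public static String render(Map<String, String> headers) {
        StringBuilder sb = new StringBuilder("WARC/1.0\r\n");
        for (Map.Entry<String, String> e : headers.entrySet()) {
            sb.append(e.getKey()).append(": ").append(e.getValue()).append("\r\n");
        }
        sb.append("\r\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        // All values below are invented sample data.
        Map<String, String> h = revisitHeaders(
                "<urn:uuid:" + UUID.randomUUID() + ">",
                "http://example.org/logo.png",
                "2013-01-15T10:00:00Z",
                "sha1:AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA",
                "<urn:uuid:" + UUID.randomUUID() + ">",
                0L);
        System.out.print(render(h));
    }
}
```

In such a setup the deduplication module would have to pass on the digest and the reference to the original record, which fits Søren's point that the change belongs there rather than in the writer.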