Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 5.0-Milestone1
    • Fix Version/s: 5.2
    • Component/s: H3-extensions, WARC
    • Labels:
      None
    • Organization:
      BNF
    • Sprint:
      NAS 5.2
    • Verification:
      Verify by running a standard netarkivet.dk selective harvest on the hourly schedule. Deduplication must be enabled both in the template and in the NAS instance.
      Activate the harvest and let it run twice. Verify that revisit records are present in the second harvest.
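The check for revisit records in the second harvest could be scripted roughly as follows. This is a minimal sketch, not part of NAS or Heritrix: the function names `count_warc_types` and `count_warc_types_in_file` are hypothetical, and it assumes well-formed WARC records with a `Content-Length` header and no folded header lines.

```python
import gzip
import io
from collections import Counter


def count_warc_types(stream):
    """Count records per WARC-Type in an uncompressed WARC byte stream.

    Hypothetical helper: assumes every record starts with a "WARC/" version
    line and carries a Content-Length header.
    """
    counts = Counter()
    while True:
        line = stream.readline()
        if not line:
            break  # end of stream
        if not line.startswith(b"WARC/"):
            continue  # blank separator lines between records
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the header block
            name, _, value = line.partition(b":")
            headers[name.strip().lower()] = value.strip()
        record_type = headers.get(b"warc-type", b"unknown")
        counts[record_type.decode("ascii", "replace")] += 1
        # Skip the payload so its content is never mistaken for headers.
        stream.read(int(headers.get(b"content-length", b"0")))
    return counts


def count_warc_types_in_file(path):
    """Open a .warc or .warc.gz file and count its record types."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rb") as stream:
        return count_warc_types(stream)


if __name__ == "__main__":
    # Tiny in-memory sample with a single, empty revisit record.
    sample = b"WARC/1.0\r\nWARC-Type: revisit\r\nContent-Length: 0\r\n\r\n\r\n\r\n"
    print(count_warc_types(io.BytesIO(sample)))
```

Running `count_warc_types_in_file` on the WARC files of the second harvest job and checking that the count for `revisit` is greater than zero would cover the last verification step.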

      Description

      Currently, neither of the two WARC writers (the default Heritrix writer, org.archive.crawler.writer.WARCWriterProcessor, or the NAS writer,
      dk.netarkivet.harvester.harvesting.WARCWriterProcessor)
      produces WARC revisit records for duplicates, although the duplicates are identified in the crawl.log.

      According to Søren, it is not the WARC writer that needs to be changed to
      enable generation of revisit records, but the deduplication module. The
      estimated effort for this is 1-2 weeks.

              People

              • Assignee:
                svc Søren Vejrup Carlsen (Inactive)
              • Reporter:
                sara Sara Aubry
              • Watchers:
                2

                Dates

                • Created:
                  Updated:
                  Resolved:

                  Time Tracking

                  Estimated:
                  Not Specified
                  Remaining:
                  Not Specified
                  Logged:
                  55m