NetarchiveSuite / NAS-2722

The current DeduplicationCDXExtractionBatchJob produces invalid CDX entries if the metadata file contains a deduplicationmigration record

Details

    • Bug
    • Resolution: Unresolved
    • Minor
    • None
    • 5.4
    • Wayback
    • None

    Description

      The waybackindexer uses the dk.netarkivet.wayback.indexer.ArchiveFile.index method to index ARC and WARC files.

      For metadata files, it currently runs the DeduplicationCDXExtractionBatchJob batch job to generate deduplication CDX entries from the duplicate entries in the crawl log.

      This does not work for metadata files that contain a deduplicationmigration record: the generated CDX entries are invalid.

      Instead, we should fetch both the deduplicationmigration record and the crawl log from the metadata file, and
      then do the replacement, as the RawMetadataCache.migrateDuplicates method already does.
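
The replacement step could look roughly like the sketch below. It is not taken from NetarchiveSuite: the class name DedupMigrationSketch, the migration-line layout ("<filename> <oldOffset> <newOffset>") and the crawl-log annotation format (duplicate:"<filename>,<offset>,<timestamp>") are all assumptions made for illustration, and the real formats should be checked against RawMetadataCache.migrateDuplicates before anything like this is used.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative sketch only; hypothetical class, not part of NetarchiveSuite. */
class DedupMigrationSketch {

    // ASSUMPTION: duplicates are annotated as duplicate:"<filename>,<offset>,<timestamp>"
    private static final Pattern DUPLICATE =
            Pattern.compile("duplicate:\"([^,\"]+),(\\d+)");

    /**
     * Builds a lookup from "filename,oldOffset" to the new offset.
     * ASSUMPTION: each migration line is "<filename> <oldOffset> <newOffset>".
     */
    static Map<String, Long> buildLookup(List<String> migrationLines) {
        Map<String, Long> lookup = new HashMap<>();
        for (String line : migrationLines) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length != 3) {
                continue; // skip lines that do not match the assumed layout
            }
            lookup.put(parts[0] + "," + parts[1], Long.parseLong(parts[2]));
        }
        return lookup;
    }

    /** Rewrites the offset in every duplicate annotation that has a migration entry. */
    static List<String> migrate(List<String> crawlLogLines, Map<String, Long> lookup) {
        List<String> result = new ArrayList<>();
        for (String line : crawlLogLines) {
            Matcher m = DUPLICATE.matcher(line);
            StringBuffer sb = new StringBuffer();
            while (m.find()) {
                Long newOffset = lookup.get(m.group(1) + "," + m.group(2));
                // Leave the annotation untouched when no migration entry exists.
                String replacement = (newOffset == null)
                        ? m.group()
                        : "duplicate:\"" + m.group(1) + "," + newOffset;
                m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
            }
            m.appendTail(sb);
            result.add(sb.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample data, only to show how the two steps fit together.
        Map<String, Long> lookup = buildLookup(Arrays.asList("old-file.arc 1234 987"));
        List<String> migrated = migrate(Arrays.asList(
                "200 http://example.org/ duplicate:\"old-file.arc,1234,20100101000000\""),
                lookup);
        migrated.forEach(System.out::println);
    }
}
```

With this split, the CDX extraction would run on the migrated crawl-log lines rather than on the raw ones, so that the offsets it emits point at the migrated records.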

          People

            svc Søren Vejrup Carlsen (Inactive)