Wayback Usage at ONB

Wayback Indexing Process

Build Pathindex

  • Loops over all files in all mounted data segments (/mnt/wa001 to /mnt/waNNN), writes one tab-separated line per file (Filename\tAbsoluteFilename), and creates one CSV file per segment
  • Merges and sorts all segment files into a single pathindex file
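
The two steps above can be sketched roughly as follows. This is only an illustration, not the script used at ONB: the function names write_segment_index and merge_segment_indexes are hypothetical, and sorting in memory is an assumption that may not scale to a full archive (an external sort may be needed there).

```python
import os


def write_segment_index(segment_dir, index_file):
    """Write one 'Filename<TAB>AbsoluteFilename' line per file in a segment.

    Hypothetical helper: walks the segment recursively and returns the
    number of lines written.
    """
    count = 0
    with open(index_file, "w", encoding="utf-8") as out:
        for root, _dirs, files in os.walk(segment_dir):
            for name in sorted(files):
                absolute = os.path.abspath(os.path.join(root, name))
                out.write(f"{name}\t{absolute}\n")
                count += 1
    return count


def merge_segment_indexes(index_files, merged_file):
    """Concatenate all per-segment files and sort the lines into one pathindex.

    Assumption: the combined index fits in memory. Python sorts strings by
    code point, which matches byte order for plain ASCII file names.
    """
    lines = []
    for index_file in index_files:
        with open(index_file, encoding="utf-8") as src:
            lines.extend(line for line in src if line.strip())
    lines.sort()
    with open(merged_file, "w", encoding="utf-8") as out:
        out.writelines(lines)
    return len(lines)
```

A call such as write_segment_index("/mnt/wa001", "wa001.csv") would produce the per-segment file, and merge_segment_indexes would then be run once over all of them; the output file names here are placeholders.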

Generate CDX

  • Loops over the pathindex files of each segment and, for every ARC that does not yet have a CDX file, generates one: dk.netarkivet.wayback.batch.ExtractWaybackCDXBatchJob is called for data ARCs and dk.netarkivet.wayback.batch.ExtractDeduplicateCDXBatchJob for metadata ARCs
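
The selection logic in this step might look like the sketch below. It only decides which ARCs still need a CDX file and which batch job class applies; how the batch job is actually launched is left out. Two points are assumptions and not taken from this page: that a CDX file is named after its ARC with ".cdx" appended, and that metadata ARCs can be recognised by "metadata" in the file name. The function name plan_cdx_jobs is hypothetical.

```python
import os

DATA_JOB = "dk.netarkivet.wayback.batch.ExtractWaybackCDXBatchJob"
METADATA_JOB = "dk.netarkivet.wayback.batch.ExtractDeduplicateCDXBatchJob"


def plan_cdx_jobs(pathindex_file, cdx_dir):
    """Return (arc_path, cdx_path, job_class) for every ARC lacking a CDX file.

    Assumptions (not confirmed by the source): the CDX file is named
    '<arc file name>.cdx', and metadata ARCs contain 'metadata' in their name.
    """
    jobs = []
    with open(pathindex_file, encoding="utf-8") as src:
        for line in src:
            line = line.rstrip("\n")
            if not line:
                continue
            name, arc_path = line.split("\t", 1)
            cdx_path = os.path.join(cdx_dir, name + ".cdx")
            if os.path.exists(cdx_path):
                continue  # a CDX file already exists, nothing to do
            job = METADATA_JOB if "metadata" in name else DATA_JOB
            jobs.append((arc_path, cdx_path, job))
    return jobs
```

Skipping ARCs that already have a CDX file keeps repeated runs cheap, since only newly added ARCs are processed.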

Merge CDX

  • Merges all individual CDX files into one large CDX file per segment
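
A minimal sketch of the per-segment merge, assuming that all CDX files of a segment sit in one directory and end in ".cdx" (both are assumptions, as is the function name merge_cdx_files). The result is a plain concatenation and is not yet sorted.

```python
from pathlib import Path


def merge_cdx_files(cdx_dir, merged_file):
    """Concatenate every *.cdx file in cdx_dir into merged_file.

    Returns the number of lines written. Lines missing a trailing newline
    are terminated so that records from adjacent files do not run together.
    """
    merged = Path(merged_file)
    # Collect the sources first and skip the output file in case it lives
    # in the same directory.
    sources = [
        p for p in sorted(Path(cdx_dir).glob("*.cdx"))
        if p.resolve() != merged.resolve()
    ]
    count = 0
    with open(merged, "wb") as out:
        for cdx in sources:
            with open(cdx, "rb") as src:
                for line in src:
                    out.write(line if line.endswith(b"\n") else line + b"\n")
                    count += 1
    return count
```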

Sort CDX

  • Sorts the CDX files of all segments into one large sorted CDX file using the Linux sort command
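
The sort step could be driven as in the sketch below, which simply hands the per-segment files to the system sort command. Setting LC_ALL=C to force byte-order sorting is an assumption here, not something this page states; it is commonly done for CDX indexes that are searched by binary lookup. The function name sort_cdx is hypothetical.

```python
import os
import subprocess


def sort_cdx(segment_cdx_files, sorted_file):
    """Sort all per-segment CDX files into one file with the system 'sort'.

    Assumption: LC_ALL=C is set so that lines are ordered by byte value
    regardless of the locale of the calling shell.
    """
    env = dict(os.environ, LC_ALL="C")
    cmd = ["sort", "-o", str(sorted_file), *map(str, segment_cdx_files)]
    subprocess.run(cmd, check=True, env=env)
    return sorted_file
```

Delegating to sort instead of sorting in Python means large inputs are handled by the external merge sort built into the command, which spills to temporary files when the data does not fit in memory.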