  1. NetarchiveSuite
  2. NAS-2547

Create WARC 1.1 records instead of WARC 1.0 records

Details

    • Improvement
    • Resolution: Unresolved
    • Minor
    • 5.5.1
    • None
    • Heritrix 3
    • BNF

    Description

      Currently, heritrix3 writes WARC records that conform to the WARC 1.0 standard.
      Browsing through the WARCWriterProcessor that we currently extend in NetarchiveSuite (org.archive.modules.writer.WARCWriterProcessor), it appears to use the WARCWriter and WARCConstants classes from the archive-commons package, which hardwire the WARC version to 1.0.

      Imports in the WARCWriterProcessor from the archive-commons package:

      import org.archive.io.warc.WARCRecordInfo;
      import org.archive.io.warc.WARCWriter;
      import org.archive.io.warc.WARCWriterPool;
      import org.archive.io.warc.WARCWriterPoolSettings;
      
      import static org.archive.format.warc.WARCConstants
      

      The easiest way to implement 1.1 would be to copy the above classes into the folder:
      https://github.com/netarchivesuite/netarchivesuite/tree/master/harvester/heritrix3/heritrix3-extensions/src/main/java

      And then adapt the copied WARCConstants to be 1.1 compliant instead.
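
      As a rough illustration of that last step, the sketch below shows what a locally copied constants class might look like once the version is bumped. The class name Warc11Constants, its field names, and the helper methods are hypothetical and chosen only for this example; they are not the archive-commons API, and a real change would also need to review the other 1.1 differences in the specification.

```java
/**
 * Hypothetical local copy of the WARC constants, adapted to WARC 1.1.
 * Names are illustrative assumptions, not the archive-commons class.
 */
final class Warc11Constants {
    /** Prefix that starts the version line of every WARC record. */
    static final String WARC_MAGIC = "WARC/";

    /** The only value that changes relative to a 1.0-only constants class. */
    static final String WARC_VERSION = "1.1";

    /** Full version identifier written as the first line of a record. */
    static final String WARC_ID = WARC_MAGIC + WARC_VERSION;

    /** WARC header lines are terminated by CRLF. */
    static final String CRLF = "\r\n";

    private Warc11Constants() {
        // constants holder, not meant to be instantiated
    }

    /** Returns the version line as it would appear at the start of a record. */
    static String versionLine() {
        return WARC_ID + CRLF;
    }

    /** Checks whether a record's first line (without line ending) declares WARC 1.1. */
    static boolean isWarc11(String firstLine) {
        return firstLine != null && WARC_ID.equals(firstLine.trim());
    }
}
```

      A writer that builds its record headers from these constants would then emit "WARC/1.1" as the first line of each record, and isWarc11 could be used in a test to confirm that harvested files no longer declare 1.0.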

            People

              Assignee: Unassigned
              Reporter: Søren Vejrup Carlsen (Inactive)