NetarchiveSuite / NAS-1958

Replace the "ARCWriterProcessor" with "WARCWriterProcessor" in our Heritrix templates.

Details

    • Rough

    Description

      Currently our harvest templates include the following XML fragment, which configures Heritrix to write ARC files:

            <newObject name="Archiver" class="org.archive.crawler.writer.ARCWriterProcessor">
              <boolean name="enabled">true</boolean>
              <newObject name="Archiver#decide-rules" class="org.archive.crawler.deciderules.DecideRuleSequence">
                <map name="rules">
                </map>
              </newObject>
              <boolean name="compress">false</boolean>
              <string name="prefix">netarkivet</string>
              <string name="suffix">${HOSTNAME}</string>
              <long name="max-size-bytes">100000000</long>
              <stringList name="path">
                <string>arcs</string>
              </stringList>
              <integer name="pool-max-active">5</integer>
              <integer name="pool-max-wait">300000</integer>
              <long name="total-bytes-to-write">0</long>
              <boolean name="skip-identical-digests">false</boolean>
            </newObject>
          
      

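      Replacing that fragment with the following tells Heritrix to write WARC files instead. Apart from the object names, the processor class, and the output path (warcs instead of arcs), the only additions are the four WARC-specific write-* settings at the end: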

            <newObject name="WARCArchiver" class="org.archive.crawler.writer.WARCWriterProcessor">
              <boolean name="enabled">true</boolean>
              <newObject name="WARCArchiver#decide-rules" class="org.archive.crawler.deciderules.DecideRuleSequence">
                <map name="rules">
                </map>
              </newObject>
              <boolean name="compress">false</boolean>
              <string name="prefix">netarkivet</string>
              <string name="suffix">${HOSTNAME}</string>
              <long name="max-size-bytes">100000000</long>
              <stringList name="path">
                <string>warcs</string>
              </stringList>
              <integer name="pool-max-active">5</integer>
              <integer name="pool-max-wait">300000</integer>
              <long name="total-bytes-to-write">0</long>
              <boolean name="skip-identical-digests">false</boolean>
              <boolean name="write-requests">true</boolean>
              <boolean name="write-metadata">true</boolean>
              <boolean name="write-revisit-for-identical-digests">true</boolean>
              <boolean name="write-revisit-for-not-modified">true</boolean>
            </newObject>
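
      The replacement described above could be applied to existing templates with a small script instead of by hand. The following is only a rough sketch, not part of NetarchiveSuite: the function name convert_arc_to_warc is hypothetical, and it assumes the template uses un-namespaced element names and that the output path entry is literally "arcs", as in the fragments shown here.

```python
import xml.etree.ElementTree as ET

ARC_CLASS = "org.archive.crawler.writer.ARCWriterProcessor"
WARC_CLASS = "org.archive.crawler.writer.WARCWriterProcessor"

# Settings that appear only in the WARC fragment of this issue; all set to "true".
WARC_ONLY_SETTINGS = (
    "write-requests",
    "write-metadata",
    "write-revisit-for-identical-digests",
    "write-revisit-for-not-modified",
)


def convert_arc_to_warc(root):
    """Rewrite every ARC writer element under root in place (hypothetical helper).

    Returns the number of writer elements that were converted.
    """
    converted = 0
    # Materialise the iterator first, because the loop appends new children.
    for writer in list(root.iter("newObject")):
        if writer.get("class") != ARC_CLASS:
            continue

        old_name = writer.get("name", "")
        new_name = "WARC" + old_name  # "Archiver" becomes "WARCArchiver"
        writer.set("name", new_name)
        writer.set("class", WARC_CLASS)

        # Nested objects such as "Archiver#decide-rules" carry the parent name.
        for child in writer.findall("newObject"):
            child_name = child.get("name", "")
            if child_name.startswith(old_name + "#"):
                child.set("name", new_name + child_name[len(old_name):])

        # Assumption: the ARC output directory is literally "arcs".
        for entry in writer.findall("stringList[@name='path']/string"):
            if entry.text == "arcs":
                entry.text = "warcs"

        # Append the WARC-only settings after the existing ones.
        for setting in WARC_ONLY_SETTINGS:
            flag = ET.SubElement(writer, "boolean", {"name": setting})
            flag.text = "true"

        converted += 1
    return converted
```

      A template file could then be loaded with ET.parse(path), passed as tree.getroot() to the function, and saved with tree.write(path). Note that ElementTree discards XML comments by default and may change whitespace, so the result should be diffed against the original before it is used for a harvest.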
          
      

            People

              svc Søren Vejrup Carlsen (Inactive)
              mss Mikis Seth Sørensen (Inactive)

                Time Tracking

                  Estimated: 14h
                  Remaining: 14h
                  Logged: Not Specified