NetarchiveSuite / NAS-1958

Replace the "ARCWriterProcessor" with "WARCWriterProcessor" in our Heritrix templates.

    Details

    • Type: Task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: I53, 4.0
    • Labels: None
    • Accuracy of estimate: Rough

      Description

      Our harvest templates currently include the following XML fragment, which configures Heritrix to write ARC files:

            <newObject name="Archiver" class="org.archive.crawler.writer.ARCWriterProcessor">
              <boolean name="enabled">true</boolean>
              <newObject name="Archiver#decide-rules" class="org.archive.crawler.deciderules.DecideRuleSequence">
                <map name="rules">
                </map>
              </newObject>
              <boolean name="compress">false</boolean>
              <string name="prefix">netarkivet</string>
              <string name="suffix">${HOSTNAME}</string>
              <long name="max-size-bytes">100000000</long>
              <stringList name="path">
                <string>arcs</string>
              </stringList>
              <integer name="pool-max-active">5</integer>
              <integer name="pool-max-wait">300000</integer>
              <long name="total-bytes-to-write">0</long>
              <boolean name="skip-identical-digests">false</boolean>
            </newObject>
          
      

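      Replacing that fragment with the one below makes Heritrix write WARC files instead. Apart from the new class and object names, it changes the output path from arcs to warcs and adds four WARC-specific write-* settings: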

            <newObject name="WARCArchiver" class="org.archive.crawler.writer.WARCWriterProcessor">
              <boolean name="enabled">true</boolean>
              <newObject name="WARCArchiver#decide-rules" class="org.archive.crawler.deciderules.DecideRuleSequence">
                <map name="rules">
                </map>
              </newObject>
              <boolean name="compress">false</boolean>
              <string name="prefix">netarkivet</string>
              <string name="suffix">${HOSTNAME}</string>
              <long name="max-size-bytes">100000000</long>
              <stringList name="path">
                <string>warcs</string>
              </stringList>
              <integer name="pool-max-active">5</integer>
              <integer name="pool-max-wait">300000</integer>
              <long name="total-bytes-to-write">0</long>
              <boolean name="skip-identical-digests">false</boolean>
              <boolean name="write-requests">true</boolean>
              <boolean name="write-metadata">true</boolean>
              <boolean name="write-revisit-for-identical-digests">true</boolean>
              <boolean name="write-revisit-for-not-modified">true</boolean>
            </newObject>
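
The replacement described above could be automated across many templates. The following is only a minimal sketch, not the change that was actually applied for this issue: the function name `arc_to_warc` and the rule of prefixing the object name with "WARC" are assumptions that simply mirror the two fragments shown here.

```python
import xml.etree.ElementTree as ET

ARC_CLASS = "org.archive.crawler.writer.ARCWriterProcessor"
WARC_CLASS = "org.archive.crawler.writer.WARCWriterProcessor"

# Boolean settings that appear in the WARC fragment but not in the ARC one.
WARC_ONLY_BOOLEANS = (
    "write-requests",
    "write-metadata",
    "write-revisit-for-identical-digests",
    "write-revisit-for-not-modified",
)


def arc_to_warc(root):
    """Convert every ARCWriterProcessor under root in place; return the count."""
    converted = 0
    # Materialize the list first because the loop adds child elements.
    for obj in list(root.iter("newObject")):
        if obj.get("class") != ARC_CLASS:
            continue
        old_name = obj.get("name", "")
        new_name = "WARC" + old_name  # assumption: "Archiver" -> "WARCArchiver"
        obj.set("class", WARC_CLASS)
        obj.set("name", new_name)
        # Rename nested objects such as "Archiver#decide-rules".
        for child in obj.iter("newObject"):
            name = child.get("name", "")
            if name.startswith(old_name + "#"):
                child.set("name", new_name + name[len(old_name):])
        # Write to "warcs" instead of "arcs".
        for path in obj.findall("stringList[@name='path']/string"):
            if path.text == "arcs":
                path.text = "warcs"
        # Append the WARC-only settings, all enabled as in the fragment above.
        for setting in WARC_ONLY_BOOLEANS:
            extra = ET.SubElement(obj, "boolean", {"name": setting})
            extra.text = "true"
        converted += 1
    return converted
```

For a template file (the name `order.xml` is hypothetical here), one would parse it with `ET.parse("order.xml")`, call `arc_to_warc(tree.getroot())`, and write the tree back out. ElementTree's default parser drops XML comments and does not preserve the original formatting, so the result should be reviewed before it replaces a template.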
          
      

              People

              • Assignee:
                svc Søren Vejrup Carlsen
                Reporter:
                mss Mikis Seth Sørensen (Inactive)
              • Watchers:
                0

                  Time Tracking

                  Estimated: 14h
                  Remaining: 14h
                  Logged: Not Specified