[NAS-1958] Replace the "ARCWriterProcessor" with "WARCWriterProcessor" in our Heritrix templates. Created: 29/Sep/11  Updated: 16/Feb/16  Resolved: 21/Jan/13

Status: Resolved
Project: NetarchiveSuite
Component/s: Harvester Controller Server
Affects Version/s: None
Fix Version/s: I53, 4.0

Type: Task Priority: Major
Reporter: Mikis Seth Sørensen (Inactive) Assignee: Søren Vejrup Carlsen (Inactive)
Resolution: Fixed  
Labels: None
Remaining Estimate: 14h
Time Spent: Not Specified
Original Estimate: 14h

Issue Links:
Spawned
spawned NAS-2131 Add WARCWriterProcessor attributes to... Resolved
was spawned by NAS-1720 Enable WARC file writing and handling... Resolved
Accuracy of estimate: Rough

 Description   

Our harvest templates currently include the following XML fragment, which configures Heritrix to write ARC files:

      <newObject name="Archiver" class="org.archive.crawler.writer.ARCWriterProcessor">
        <boolean name="enabled">true</boolean>
        <newObject name="Archiver#decide-rules" class="org.archive.crawler.deciderules.DecideRuleSequence">
          <map name="rules">
          </map>
        </newObject>
        <boolean name="compress">false</boolean>
        <string name="prefix">netarkivet</string>
        <string name="suffix">${HOSTNAME}</string>
        <long name="max-size-bytes">100000000</long>
        <stringList name="path">
          <string>arcs</string>
        </stringList>
        <integer name="pool-max-active">5</integer>
        <integer name="pool-max-wait">300000</integer>
        <long name="total-bytes-to-write">0</long>
        <boolean name="skip-identical-digests">false</boolean>
      </newObject>
    

Replacing that fragment with the following tells Heritrix to write WARC files instead:

      <newObject name="WARCArchiver" class="org.archive.crawler.writer.WARCWriterProcessor">
        <boolean name="enabled">true</boolean>
        <newObject name="WARCArchiver#decide-rules" class="org.archive.crawler.deciderules.DecideRuleSequence">
          <map name="rules">
          </map>
        </newObject>
        <boolean name="compress">false</boolean>
        <string name="prefix">netarkivet</string>
        <string name="suffix">${HOSTNAME}</string>
        <long name="max-size-bytes">100000000</long>
        <stringList name="path">
          <string>warcs</string>
        </stringList>
        <integer name="pool-max-active">5</integer>
        <integer name="pool-max-wait">300000</integer>
        <long name="total-bytes-to-write">0</long>
        <boolean name="skip-identical-digests">false</boolean>
        <boolean name="write-requests">true</boolean>
        <boolean name="write-metadata">true</boolean>
        <boolean name="write-revisit-for-identical-digests">true</boolean>
        <boolean name="write-revisit-for-not-modified">true</boolean>
      </newObject>
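
The swap described above could be scripted rather than edited by hand in each template. The following is only a minimal sketch in Python, not part of NetarchiveSuite: the function name `convert_arc_to_warc`, the constant names, and the wrapping `<map>` element in the sample are assumptions made for illustration. It renames the writer and its decide-rules object, switches the output path from `arcs` to `warcs`, and appends the four WARC-only attributes with the value `true`, as in the fragment above.

```python
import xml.etree.ElementTree as ET

ARC_CLASS = "org.archive.crawler.writer.ARCWriterProcessor"
WARC_CLASS = "org.archive.crawler.writer.WARCWriterProcessor"

# Attributes that appear only in the WARC fragment of this issue.
WARC_ONLY_BOOLEANS = [
    "write-requests",
    "write-metadata",
    "write-revisit-for-identical-digests",
    "write-revisit-for-not-modified",
]


def convert_arc_to_warc(root):
    """Rewrite every ARC writer under root in place and return how many were changed."""
    converted = 0
    # Materialise the matches first so the tree is not modified while iterating.
    for node in list(root.iter("newObject")):
        if node.get("class") != ARC_CLASS:
            continue
        node.set("name", "WARCArchiver")
        node.set("class", WARC_CLASS)
        for child in node:
            if child.tag == "newObject" and child.get("name") == "Archiver#decide-rules":
                child.set("name", "WARCArchiver#decide-rules")
            elif child.tag == "stringList" and child.get("name") == "path":
                for entry in child.findall("string"):
                    if entry.text == "arcs":
                        entry.text = "warcs"
        # Append the WARC-only switches, all enabled as in the issue description.
        for name in WARC_ONLY_BOOLEANS:
            flag = ET.SubElement(node, "boolean", {"name": name})
            flag.text = "true"
        converted += 1
    return converted


# Shortened sample; the enclosing <map> element is a hypothetical wrapper.
SAMPLE = """<map name="write-processors">
  <newObject name="Archiver" class="org.archive.crawler.writer.ARCWriterProcessor">
    <boolean name="enabled">true</boolean>
    <newObject name="Archiver#decide-rules" class="org.archive.crawler.deciderules.DecideRuleSequence">
      <map name="rules"/>
    </newObject>
    <stringList name="path">
      <string>arcs</string>
    </stringList>
    <boolean name="skip-identical-digests">false</boolean>
  </newObject>
</map>"""

if __name__ == "__main__":
    tree_root = ET.fromstring(SAMPLE)
    convert_arc_to_warc(tree_root)
    print(ET.tostring(tree_root, encoding="unicode"))
```

The sketch leaves all shared attributes (prefix, suffix, size limits, pool settings) untouched, since both fragments in this issue use the same values for them.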
    


 Comments   
Comment by Søren Vejrup Carlsen (Inactive) [ 29/Sep/11 ]

The following attributes should be tied to settings in NetarchiveSuite, and their values should be written into the harvest template for the harvest job (the Job class):

<boolean name="skip-identical-digests">false</boolean>
<boolean name="write-requests">true</boolean>
<boolean name="write-metadata">true</boolean>
<boolean name="write-revisit-for-identical-digests">true</boolean>
<boolean name="write-revisit-for-not-modified">true</boolean>
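
One way to picture this: a small helper that writes setting values into the matching `<boolean>` elements of the writer. This is a hedged sketch in Python, not the NetarchiveSuite implementation. The function name `apply_warc_settings` and the plain dictionary standing in for NetarchiveSuite settings are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def apply_warc_settings(writer, settings):
    """Set each named <boolean> child of the writer element from a name-to-bool mapping."""
    for name, value in settings.items():
        node = writer.find(f"boolean[@name='{name}']")
        if node is None:
            # The attribute is missing from the template, so add it.
            node = ET.SubElement(writer, "boolean", {"name": name})
        node.text = "true" if value else "false"


if __name__ == "__main__":
    writer = ET.fromstring(
        '<newObject name="WARCArchiver" '
        'class="org.archive.crawler.writer.WARCWriterProcessor">'
        '<boolean name="skip-identical-digests">false</boolean>'
        '<boolean name="write-requests">true</boolean>'
        '</newObject>'
    )
    # Hypothetical values; in practice they would come from NetarchiveSuite settings.
    apply_warc_settings(writer, {"write-requests": False, "write-metadata": True})
    print(ET.tostring(writer, encoding="unicode"))
```

Attributes not named in the mapping keep whatever value the template already has.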