[NAS-1958] Replace the "ARCWriterProcessor" with "WARCWriterProcessor" in our Heritrix templates. Created: 29/Sep/11 Updated: 16/Feb/16 Resolved: 21/Jan/13 |
|
Status: | Resolved |
Project: | NetarchiveSuite |
Component/s: | Harvester Controller Server |
Affects Version/s: | None |
Fix Version/s: | I53, 4.0 |
Type: | Task | Priority: | Major |
Reporter: | Mikis Seth Sørensen (Inactive) | Assignee: | Søren Vejrup Carlsen (Inactive) |
Resolution: | Fixed | ||
Labels: | None | ||
Remaining Estimate: | 14h | ||
Time Spent: | Not Specified | ||
Original Estimate: | 14h |
Issue Links: |
|
Accuracy of estimate: | Rough |
Description |
Currently our harvest templates include the following piece of XML, which configures Heritrix to write ARC files:
<newObject name="Archiver" class="org.archive.crawler.writer.ARCWriterProcessor">
  <boolean name="enabled">true</boolean>
  <newObject name="Archiver#decide-rules" class="org.archive.crawler.deciderules.DecideRuleSequence">
    <map name="rules"> </map>
  </newObject>
  <boolean name="compress">false</boolean>
  <string name="prefix">netarkivet</string>
  <string name="suffix">${HOSTNAME}</string>
  <long name="max-size-bytes">100000000</long>
  <stringList name="path">
    <string>arcs</string>
  </stringList>
  <integer name="pool-max-active">5</integer>
  <integer name="pool-max-wait">300000</integer>
  <long name="total-bytes-to-write">0</long>
  <boolean name="skip-identical-digests">false</boolean>
</newObject>
Replacing this piece of XML with the following tells Heritrix to write WARC files instead:
<newObject name="WARCArchiver" class="org.archive.crawler.writer.WARCWriterProcessor">
  <boolean name="enabled">true</boolean>
  <newObject name="WARCArchiver#decide-rules" class="org.archive.crawler.deciderules.DecideRuleSequence">
    <map name="rules"> </map>
  </newObject>
  <boolean name="compress">false</boolean>
  <string name="prefix">netarkivet</string>
  <string name="suffix">${HOSTNAME}</string>
  <long name="max-size-bytes">100000000</long>
  <stringList name="path">
    <string>warcs</string>
  </stringList>
  <integer name="pool-max-active">5</integer>
  <integer name="pool-max-wait">300000</integer>
  <long name="total-bytes-to-write">0</long>
  <boolean name="skip-identical-digests">false</boolean>
  <boolean name="write-requests">true</boolean>
  <boolean name="write-metadata">true</boolean>
  <boolean name="write-revisit-for-identical-digests">true</boolean>
  <boolean name="write-revisit-for-not-modified">true</boolean>
</newObject> |
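The replacement described above could be scripted across many templates. The following is a minimal sketch in Python, not NetarchiveSuite code: the function name `arc_to_warc` is hypothetical, and it assumes each template is well-formed XML in which the ARC writer appears as a `newObject` element with the class shown in the description. Note that ElementTree drops any XML declaration or DOCTYPE on output, so a real template may need those restored.

```python
import xml.etree.ElementTree as ET

ARC_CLASS = "org.archive.crawler.writer.ARCWriterProcessor"
WARC_CLASS = "org.archive.crawler.writer.WARCWriterProcessor"

# WARC-only options listed in the description; all default to "true" there.
WARC_ONLY_BOOLEANS = [
    "write-requests",
    "write-metadata",
    "write-revisit-for-identical-digests",
    "write-revisit-for-not-modified",
]


def arc_to_warc(template_xml: str) -> str:
    """Return the template with every ARC writer block turned into a WARC writer block."""
    root = ET.fromstring(template_xml)
    # Materialise the list first so the tree is not modified while iterating it.
    for proc in list(root.iter("newObject")):
        if proc.get("class") != ARC_CLASS:
            continue
        old_name = proc.get("name")
        proc.set("name", "WARCArchiver")
        proc.set("class", WARC_CLASS)
        # Rename the nested decide-rules object to match the new processor name.
        for child in proc.findall("newObject"):
            if child.get("name") == f"{old_name}#decide-rules":
                child.set("name", "WARCArchiver#decide-rules")
        # Write to the "warcs" directory instead of "arcs".
        for path in proc.findall("stringList[@name='path']"):
            for entry in path.findall("string"):
                if entry.text == "arcs":
                    entry.text = "warcs"
        # Append the WARC-only booleans that are not already present.
        existing = {b.get("name") for b in proc.findall("boolean")}
        for name in WARC_ONLY_BOOLEANS:
            if name not in existing:
                ET.SubElement(proc, "boolean", {"name": name}).text = "true"
    return ET.tostring(root, encoding="unicode")
```

All other settings (prefix, suffix, pool sizes, and so on) are left untouched, which matches the two blocks in the description: they differ only in the names, the class, the output path, and the four extra booleans.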
Comments |
Comment by Søren Vejrup Carlsen (Inactive) [ 29/Sep/11 ] |
The following attributes should be associated with settings in NetarchiveSuite, and should be updated in the harvest template for the harvestJob (the Job class):
<boolean name="skip-identical-digests">false</boolean>
<boolean name="write-requests">true</boolean>
<boolean name="write-metadata">true</boolean>
<boolean name="write-revisit-for-identical-digests">true</boolean>
<boolean name="write-revisit-for-not-modified">true</boolean> |
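One way to picture this is as a step that overwrites the five booleans in a job's template from configured values. The sketch below is in Python for illustration only and is not the Job class. The function name `apply_warc_settings` and the plain dictionary are assumptions, since the ticket does not say which NetarchiveSuite settings keys would hold these values.

```python
import xml.etree.ElementTree as ET

WARC_CLASS = "org.archive.crawler.writer.WARCWriterProcessor"

# The five attributes named in the comment above.
CONFIGURABLE_BOOLEANS = {
    "skip-identical-digests",
    "write-requests",
    "write-metadata",
    "write-revisit-for-identical-digests",
    "write-revisit-for-not-modified",
}


def apply_warc_settings(template_xml: str, settings: dict) -> str:
    """Overwrite the WARC writer booleans with values from `settings`.

    `settings` is a hypothetical mapping from attribute name to bool; names
    outside CONFIGURABLE_BOOLEANS and attributes missing from the template
    are ignored.
    """
    root = ET.fromstring(template_xml)
    for proc in root.iter("newObject"):
        if proc.get("class") != WARC_CLASS:
            continue
        for flag in proc.findall("boolean"):
            name = flag.get("name")
            if name in CONFIGURABLE_BOOLEANS and name in settings:
                # Heritrix templates spell booleans in lower case.
                flag.text = "true" if settings[name] else "false"
    return ET.tostring(root, encoding="unicode")
```

For example, calling `apply_warc_settings(template, {"write-requests": False})` would flip only the `write-requests` value and leave the other four as the template defines them.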