  1. NetarchiveSuite
  2. NAS-2131

Add WARCWriterProcessor attributes to HarvesterSettings


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • None
    • 4.2
    • None

      Install a standard NetarchiveSuite (in WARC mode).
      Run a standard netarkivet.dk harvest.
      On the jobdetails page, click "Show harvest template for job 1". The template should contain:

      <boolean name="skip-identical-digests">false</boolean>
      <boolean name="write-requests">false</boolean>
      <boolean name="write-metadata">false</boolean>
      <boolean name="write-revisit-for-identical-digests">false</boolean>
      <boolean name="write-revisit-for-not-modified">false</boolean>

      All the booleans above should be false!
      If you want further confirmation, check the order.xml inserted in the
      metadata-file. It should have the same values.

      Stop the HarvestJobManager, and insert the following overrides in the settings file for the HarvestJobManager:

      <heritrix>
      ..
      <warc>
      <skipIdenticalDigests>true</skipIdenticalDigests>
      <writeRequests>true</writeRequests>
      <writeMetadata>true</writeMetadata>
      <writeRevisitForIdenticalDigests>true</writeRevisitForIdenticalDigests>
      <writeRevisitForNotModified>true</writeRevisitForNotModified>
      </warc>
      ..
      </heritrix>

      Restart the HarvestJobManager, then run another netarkivet.dk harvest.
      Wait for this harvest to finish.
      On the jobdetails page, click "Show harvest template for job 2".
      All the values that were previously false should now be true.

      If you want further confirmation, check the order.xml inserted in the
      metadata-file. It should have the same values.
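      The manual check of the order.xml described above could also be scripted. The following is a minimal sketch, not part of NetarchiveSuite: the class name WarcFlagCheck, the method readBooleans, and the <processor> wrapper element are assumptions made for illustration. It uses only the JDK's DOM parser to list every <boolean name="..."> value so the five flags can be compared before and after the override.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical helper (not a NetarchiveSuite class) for inspecting template flags. */
public class WarcFlagCheck {

    /** Collects every <boolean name="..."> element in the XML into a name -> value map. */
    public static Map<String, Boolean> readBooleans(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        NodeList nodes = doc.getElementsByTagName("boolean");
        Map<String, Boolean> flags = new LinkedHashMap<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            Element element = (Element) nodes.item(i);
            flags.put(element.getAttribute("name"),
                    Boolean.parseBoolean(element.getTextContent().trim()));
        }
        return flags;
    }

    public static void main(String[] args) throws Exception {
        // <processor> is an assumed wrapper; a real order.xml has more structure around it.
        String fragment = "<processor>"
                + "<boolean name=\"skip-identical-digests\">false</boolean>"
                + "<boolean name=\"write-requests\">false</boolean>"
                + "<boolean name=\"write-metadata\">false</boolean>"
                + "<boolean name=\"write-revisit-for-identical-digests\">false</boolean>"
                + "<boolean name=\"write-revisit-for-not-modified\">false</boolean>"
                + "</processor>";
        // Print each flag so the values for job 1 and job 2 can be compared by eye.
        readBooleans(fragment).forEach((name, value) ->
                System.out.println(name + " = " + value));
    }
}
```

      In practice the string passed to readBooleans would be the order.xml text taken from the job's metadata file; how that text is extracted is left open here.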


    Description

      The following attributes of the WARCWriterProcessor should be backed by settings in NetarchiveSuite, and their values should be written into the harvest template for the harvest job (the Job class):

      <boolean name="skip-identical-digests">false</boolean>
      <boolean name="write-requests">true</boolean>
      <boolean name="write-metadata">true</boolean>
      <boolean name="write-revisit-for-identical-digests">true</boolean>
      <boolean name="write-revisit-for-not-modified">true</boolean>

      together with some or all of

      <boolean name="compress">false</boolean>
      <string name="prefix">netarkivet</string>
      <string name="suffix">HOSTNAME</string>
      <long name="max-size-bytes">100000000</long>
      <integer name="pool-max-active">5</integer>
      <integer name="pool-max-wait">300000</integer>
      <long name="total-bytes-to-write">0</long>

      The prefix and suffix should probably be derived from a configured naming convention.
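      As an illustration of the requested behaviour, here is a minimal sketch of how setting values could be copied into the template's boolean attributes. It is not the actual NetarchiveSuite implementation: the class WarcTemplateOverrides, its methods, and the <processor> wrapper element are hypothetical, and the camelCase-to-hyphenated name mapping is inferred from the setting names used in the test steps of this issue.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical helper (not a NetarchiveSuite class) that applies settings to a template. */
public class WarcTemplateOverrides {

    /** Maps a camelCase setting name (writeRequests) to the template form (write-requests). */
    public static String toAttributeName(String settingName) {
        return settingName.replaceAll("([a-z0-9])([A-Z])", "$1-$2").toLowerCase(Locale.ROOT);
    }

    /** Returns the template XML with each matching <boolean name="..."> set to the given value. */
    public static String apply(String templateXml, Map<String, Boolean> settings) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(templateXml)));
        NodeList nodes = doc.getElementsByTagName("boolean");
        for (Map.Entry<String, Boolean> setting : settings.entrySet()) {
            String attributeName = toAttributeName(setting.getKey());
            for (int i = 0; i < nodes.getLength(); i++) {
                Element element = (Element) nodes.item(i);
                if (attributeName.equals(element.getAttribute("name"))) {
                    element.setTextContent(String.valueOf(setting.getValue()));
                }
            }
        }
        // Serialize the modified DOM back to a string without an XML declaration.
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // <processor> is an assumed wrapper element used only for this example.
        String template = "<processor>"
                + "<boolean name=\"skip-identical-digests\">false</boolean>"
                + "<boolean name=\"write-requests\">false</boolean>"
                + "</processor>";
        Map<String, Boolean> settings = new LinkedHashMap<>();
        settings.put("skipIdenticalDigests", true);
        settings.put("writeRequests", true);
        System.out.println(apply(template, settings));
    }
}
```

      Booleans in the template that have no corresponding setting are left unchanged, so a partial set of overrides keeps the template's existing values for everything else.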

            People

              svc Søren Vejrup Carlsen (Inactive)
              svc Søren Vejrup Carlsen (Inactive)