For Heritrix configuration related to NetarchiveSuite, please refer to the "Configure Heritrix process" section of Detailed Configurations.

For more specific Heritrix configurations, please refer to Appendix B1 - Managing Heritrix1 Harvest Templates (order.xml), B2: Managing Heritrix 3 Crawl-order Templates, and Migrating H1 templates to H3 to use with NetarchiveSuite 5.0.

By default, crawling in NetarchiveSuite uses deduplication.

How to configure which Heritrix reports are uploaded to the metadata ARC/WARC file

Three settings properties control which Heritrix reports are added to the metadata ARC or WARC file:

  • settings.harvester.harvesting.metadata.heritrixFilePattern is a Java regular expression pattern that selects which files in the crawl directory (not recursively) to include in the metadata ARC or WARC file.
  • settings.harvester.harvesting.metadata.reportFilePattern is also a Java pattern. It controls which of the files selected by heritrixFilePattern are treated as report files. All other selected files are treated as setup files.
  • settings.harvester.harvesting.metadata.logFilePattern is a third Java pattern. It controls which files in the logs subdirectory of the crawl directory are added as log files to the metadata ARC or WARC file.
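
The way the three patterns interact can be sketched in a few lines of Java. This is only an illustration of the selection rules described above, not NetarchiveSuite's actual code. The class name, the classify helper, and the three pattern values are hypothetical and are not the shipped defaults. The sketch also assumes that each pattern must match the whole file name (Matcher.matches()), which should be checked against your NetarchiveSuite version.

```java
import java.util.regex.Pattern;

// Hypothetical sketch of how the three settings partition crawl files.
class MetadataFileSelector {
    // Example values only; these are NOT the NetarchiveSuite defaults.
    static final Pattern HERITRIX_FILE_PATTERN = Pattern.compile(".*\\.(xml|txt|log)");
    static final Pattern REPORT_FILE_PATTERN = Pattern.compile(".*-report\\.txt");
    static final Pattern LOG_FILE_PATTERN = Pattern.compile(".*\\.log");

    /**
     * Classifies a file by name.
     * inLogsDir is true for files in the logs subdirectory of the crawl dir,
     * false for files located directly in the crawl dir.
     */
    static String classify(String fileName, boolean inLogsDir) {
        if (inLogsDir) {
            // Only logFilePattern applies inside the logs subdirectory.
            return LOG_FILE_PATTERN.matcher(fileName).matches() ? "log" : "skipped";
        }
        // heritrixFilePattern decides whether the file is included at all.
        if (!HERITRIX_FILE_PATTERN.matcher(fileName).matches()) {
            return "skipped";
        }
        // reportFilePattern splits the included files into reports and setup files.
        return REPORT_FILE_PATTERN.matcher(fileName).matches() ? "report" : "setup";
    }

    public static void main(String[] args) {
        String[] crawlDirFiles = {"crawl-report.txt", "order.xml", "image.png"};
        for (String name : crawlDirFiles) {
            System.out.println(name + " -> " + classify(name, false));
        }
        String[] logsDirFiles = {"crawl.log", "crawl.log.lck"};
        for (String name : logsDirFiles) {
            System.out.println("logs/" + name + " -> " + classify(name, true));
        }
    }
}
```

With these example values, a file in the crawl directory that matches heritrixFilePattern but not reportFilePattern, such as order.xml, ends up as a setup file. A file that does not match heritrixFilePattern is left out of the metadata file entirely.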