
Note that this documentation is for the upcoming release and is still a work in progress.
For documentation on the released versions, please view the previous versions of the NetarchiveSuite documentation and select the relevant version.


For configuration related to NetarchiveSuite, please refer to the section "Configure Heritrix process" in Detailed Configurations.

For more specific Heritrix configurations, please refer to Appendix B - Managing Heritrix1 Harvest Templates (order.xml) and Appendix C - Migrate the Heritrix templates to NetarchiveSuite 3.6.0+ of this document.

By default, crawling in NetarchiveSuite uses deduplication. This feature, and how to disable it, is described in the Configuration Manual, Section 8.1.2.

How to configure which Heritrix reports are uploaded to the metadata ARC file

Three settings properties control which Heritrix reports are added to the metadata ARC file:

  • settings.harvester.harvesting.metadata.heritrixFilePattern is a Java regular expression that selects which files in the crawl dir (not recursively) to include in the metadata ARC file.
  • settings.harvester.harvesting.metadata.reportFilePattern is also a Java regular expression. It controls which subset of the files selected by heritrixFilePattern are treated as report files. All other selected files are treated as setup files.
  • settings.harvester.harvesting.metadata.logFilePattern is a third Java regular expression. It controls which files in the logs subdirectory of the crawl dir are added as log files to the metadata ARC file.
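
The interaction of the three patterns can be illustrated with a minimal sketch. The pattern values, the class name MetadataPatternSketch, and its methods are hypothetical and chosen only for illustration; they are not necessarily the NetarchiveSuite defaults or its actual implementation. The sketch also assumes that a pattern must match the whole file name.

```java
import java.util.regex.Pattern;

// Illustrative sketch only: class, methods, and pattern values are assumptions,
// not taken from the NetarchiveSuite code base.
class MetadataPatternSketch {
    // Hypothetical value for heritrixFilePattern: files in the crawl dir to include.
    static final Pattern HERITRIX_FILE_PATTERN = Pattern.compile(".*\\.(xml|txt|log|out)");
    // Hypothetical value for reportFilePattern: subset of the above treated as reports.
    static final Pattern REPORT_FILE_PATTERN = Pattern.compile(".*-report\\.txt");
    // Hypothetical value for logFilePattern: files in the logs subdirectory to include.
    static final Pattern LOG_FILE_PATTERN = Pattern.compile(".*\\.(log|out)");

    /** Classifies a file located directly in the crawl dir (no recursion). */
    static String classifyCrawlDirFile(String name) {
        if (!HERITRIX_FILE_PATTERN.matcher(name).matches()) {
            return "skipped"; // not selected, so not added to the metadata ARC file
        }
        // Selected files are either report files or setup files.
        return REPORT_FILE_PATTERN.matcher(name).matches() ? "report" : "setup";
    }

    /** Classifies a file located in the logs subdirectory of the crawl dir. */
    static String classifyLogsDirFile(String name) {
        return LOG_FILE_PATTERN.matcher(name).matches() ? "log" : "skipped";
    }

    public static void main(String[] args) {
        String[] crawlDirFiles = {"crawl-report.txt", "order.xml", "seeds.txt", "notes.tmp"};
        for (String name : crawlDirFiles) {
            System.out.println(name + " -> " + classifyCrawlDirFile(name));
        }
        System.out.println("logs/crawl.log -> " + classifyLogsDirFile("crawl.log"));
    }
}
```

With these example values, a file such as crawl-report.txt would be stored as a report file, order.xml and seeds.txt as setup files, and notes.tmp would be left out of the metadata ARC file.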
