Note that this documentation is for the coming release, NetarchiveSuite 7.4, and is still work-in-progress.

For documentation on the released versions, please view the previous versions of the NetarchiveSuite documentation and select the relevant version.

A Heritrix3 harvest is defined by a Crawler-Bean (.cxml) file, which is a bean-definition file from the Spring framework. You can use Heritrix3's own documentation to create Crawler-Bean files, which can then be uploaded to NetarchiveSuite (NAS) via the GUI. Before scheduling a harvest, NetarchiveSuite overwrites certain placeholder values in every Crawler-Bean definition. The placeholders are listed below; some are required in every Crawler-Bean file, others are optional. If an optional placeholder is missing from the Crawler-Bean definition, any attempt to redefine its value via the GUI is ignored. This version of NetarchiveSuite does not validate Crawler-Bean files, so a missing required placeholder first shows up as a harvest job that fails to start. Some form of validation will be introduced in a later version of NetarchiveSuite.
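The substitution step described above can be pictured with a small sketch. This is not NetarchiveSuite's actual implementation: the function names (find_placeholders, substitute, missing_required) and the name REQUIRED_EXAMPLE are hypothetical, and the only thing taken from this page is the %{NAME} placeholder syntax. Since this version does no validation itself, something along these lines could serve as a check before uploading a template.

```python
import re

# %{NAME} is the placeholder syntax used in the examples on this page.
PLACEHOLDER = re.compile(r"%\{(\w+)\}")


def find_placeholders(template):
    """Return the set of placeholder names that occur in the template text."""
    return set(PLACEHOLDER.findall(template))


def substitute(template, values):
    """Replace every %{NAME} that has a value; leave other placeholders untouched."""
    return PLACEHOLDER.sub(lambda m: values.get(m.group(1), m.group(0)), template)


def missing_required(template, required):
    """Return the required placeholder names that the template does not contain."""
    return set(required) - find_placeholders(template)


line = '<property name="maxHops" value="%{MAX_HOPS}" />'
print(substitute(line, {"MAX_HOPS": "20"}))
# REQUIRED_EXAMPLE is a made-up name used only to demonstrate the check.
print(missing_required(line, {"REQUIRED_EXAMPLE"}))
```

A non-empty result from missing_required would indicate a template that is likely to produce a job which fails to start.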

Required Placeholders

In PropertyOverrideConfigurer

See discussion below


In PropertyOverrideConfigurer

See discussion below

In PropertyOverrideConfigurer

See discussion below  
in the regexList in MatchesListRegexDecideRule
Substituted with global crawler traps defined in NAS
At the first xml nesting level, inside the <beans> element 
Inside the DispositionChain bean. 


Optional Placeholders

In PropertyOverrideConfigurer

If absent (e.g. if maxTimeSeconds is hardcoded in the crawler-beans file), NAS will never override this value.
<property name="indexLocation" value="%{DEDUPLICATION_INDEX_LOCATION_PLACEHOLDER}"/> 
Inside the bean with class is.hi.bok.deduplicator.DeDuplicator
If absent, there will be no deduplication


<property name="robotsPolicyName" value="%{HONOR_ROBOTS_DOT_TXT}"/> 

In PropertyOverrideConfigurer


In metadata bean

If absent, the robotsPolicy will be "ignore" (the default in H3) or hardwired to either obey or ignore
extractorHtml.extractJavascript=%{EXTRACT_JAVASCRIPT}
In PropertyOverrideConfigurer
If absent, the H3 template will use default value(?) or be hardwired to either true or false

scope.rules[2].maxHops=%{MAX_HOPS} (assuming TooManyHopsDecideRule is the 3rd bean defined in the "scope" bean)


<property name="maxHops" value="%{MAX_HOPS}" />

 In PropertyOverrideConfigurer


in bean for class

If absent, the H3 template will use default value (20) or be hardwired to something else
<property name="enabled" value="%{DEDUPLICATION_ENABLED_PLACEHOLDER}" />
in the bean of class is.hi.bok.deduplicator.DeDuplicator

When jobs are generated, it is replaced by the value of the setting harvester.harvesting.deduplication.enabled for the HarvestJobManager application.

Note that this property is only valid for the version of DeDuplicator included with NetarchiveSuite.
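
Because an override is ignored when its optional placeholder is absent from the template, it can help to check a template before relying on a GUI setting. The sketch below is illustrative only: ignored_overrides is a hypothetical helper, not part of NetarchiveSuite, and it uses only placeholder names that appear on this page.

```python
import re


def ignored_overrides(template, overrides):
    """List the override names whose %{NAME} placeholder is absent from the template."""
    present = set(re.findall(r"%\{(\w+)\}", template))
    return sorted(name for name in overrides if name not in present)


# A template fragment where maxHops is hardcoded but extractJavascript is a placeholder.
template = """
extractorHtml.extractJavascript=%{EXTRACT_JAVASCRIPT}
<property name="maxHops" value="20" />
"""

print(ignored_overrides(template, {"EXTRACT_JAVASCRIPT": "false", "MAX_HOPS": "10"}))
```

Here MAX_HOPS is reported because the template hardwires the value, so a new value for it would have no effect.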

Quota Enforcement

All three quota/budget-related placeholders are required, but their interpretation depends on the NAS setting harvester.scheduler.jobGen.objectLimitIsSetByQuotaEnforcer.

Behaviour is as follows:

If objectLimitIsSetByQuotaEnforcer is true:

queueTotalBudget is set to infinity

groupMaxFetchSuccesses is set to the maxObjectsPerDomain value from NAS

If objectLimitIsSetByQuotaEnforcer is false:

queueTotalBudget is set to the maxObjectsPerDomain value from NAS

groupMaxFetchSuccesses is set to infinity

In all cases, groupMaxAllKb is set to the value determined from the maxBytesPerDomain setting from the NAS GUI (default value is -1 which is equivalent to no limit).
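
The two cases above can be summarised in a short sketch. It rests on several assumptions that are not stated on this page: that the first case listed corresponds to objectLimitIsSetByQuotaEnforcer being true, that "infinity" is written as -1 (by analogy with the maxBytesPerDomain default), and that the byte limit is converted to whole kilobytes by rounding up. The functions quota_values and bytes_to_kb are hypothetical and are not NetarchiveSuite code.

```python
NO_LIMIT = -1  # assumption: "infinity" is represented by -1


def bytes_to_kb(max_bytes):
    """Hypothetical conversion: keep "no limit" as is, otherwise round up to whole KB."""
    if max_bytes < 0:
        return NO_LIMIT
    return (max_bytes + 1023) // 1024


def quota_values(limit_set_by_quota_enforcer, max_objects_per_domain, max_bytes_per_domain):
    """Return the three quota/budget values for one job, keyed by property name."""
    if limit_set_by_quota_enforcer:
        # Assumed "true" case: the object limit goes to groupMaxFetchSuccesses.
        queue_total_budget = NO_LIMIT
        group_max_fetch_successes = max_objects_per_domain
    else:
        # Assumed "false" case: the object limit goes to queueTotalBudget.
        queue_total_budget = max_objects_per_domain
        group_max_fetch_successes = NO_LIMIT
    return {
        "queueTotalBudget": queue_total_budget,
        "groupMaxFetchSuccesses": group_max_fetch_successes,
        "groupMaxAllKb": bytes_to_kb(max_bytes_per_domain),
    }


print(quota_values(True, 2000, -1))
print(quota_values(False, 2000, 2048))
```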

Umbra Integration

To enable browser-based harvesting with Internet Archive's Umbra system, the following placeholders need to be added. If a template containing these placeholders is sent to a harvester that is not Umbra-enabled, they are silently removed. In other words, the same template file can be used for both Umbra and non-Umbra harvesting.

inside the <value> element in the <properties> element in the "simpleOverrides" bean.
at the top level in the crawler-beans
at the top level in the crawler-beans
at the end of the list of processors in the "fetchProcessors" bean
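
The silent removal described above could look roughly like the following sketch. It is an illustration under assumptions, not the NetarchiveSuite implementation: strip_placeholders is a hypothetical helper, and UMBRA_EXAMPLE is a made-up placeholder name, since the real Umbra placeholder names are not shown here.

```python
import re


def strip_placeholders(template, names):
    """Remove every %{NAME} token whose NAME is in names; leave all other text as is."""
    return re.sub(
        r"%\{(\w+)\}",
        lambda m: "" if m.group(1) in names else m.group(0),
        template,
    )


# UMBRA_EXAMPLE is a made-up placeholder name used only for this illustration.
template = "<beans>%{UMBRA_EXAMPLE}<!-- other beans --></beans>"
print(strip_placeholders(template, {"UMBRA_EXAMPLE"}))
```

Placeholders not named in the set are left in place, so the remaining substitutions can still be applied afterwards.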

