Check that the default_orderxml template is a Heritrix3 template: go to the Edit Harvest Templates tab and retrieve default_orderxml. If the file header contains "HERITRIX 3 CRAWL JOB CONFIGURATION FILE", it is OK.

Check that the DispositionChain includes a deduplicator. If it does not, update the default cxml file to support deduplication by adding the bean

<bean id="DeDuplicator" class="is.hi.bok.deduplicator.DeDuplicator">
        <!-- DEDUPLICATION_INDEX_LOCATION_PLACEHOLDER is replaced by path on harvest-server -->
        <property name="indexLocation" value="%{DEDUPLICATION_INDEX_LOCATION_PLACEHOLDER}" />
        <property name="matchingMethod" value="URL" />
        <property name="tryEquivalent" value="TRUE" />
        <property name="changeContentSize" value="false" />
        <property name="mimeFilter" value="^text/.*" />
        <property name="filterMode" value="BLACKLIST" />
        <property name="origin" value="" />
        <property name="originHandling" value="INDEX" />
        <property name="statsPerHost" value="true" />
</bean>

to the processors list in the DispositionChain.
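
As a rough sketch of what to look for, the processors list of the DispositionChain might resemble the fragment below. It assumes the DeDuplicator bean is defined at the top level of the cxml file and is referenced by id from the list. The chain id dispositionProcessors and the neighbouring processor ids (warcWriter, candidates, disposition) are taken from the stock Heritrix 3 profile and are assumptions here; your template may use other ids or a different order.

```xml
<!-- Sketch only: bean ids other than DeDuplicator are assumed from the stock Heritrix 3 profile -->
<bean id="dispositionProcessors" class="org.archive.modules.DispositionChain">
        <property name="processors">
                <list>
                        <!-- the deduplicator is assumed to run before the writer,
                             so duplicates are detected before records are written -->
                        <ref bean="DeDuplicator" />
                        <ref bean="warcWriter" />
                        <ref bean="candidates" />
                        <ref bean="disposition" />
                </list>
        </property>
</bean>
```

If the list in your template already contains a reference to the DeDuplicator bean, no change is needed.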

2. Running selective harvest