Class DeDuplicator

  • All Implemented Interfaces:
    org.archive.checkpointing.Checkpointable, org.archive.spring.HasKeyedProperties, org.springframework.beans.factory.Aware, org.springframework.beans.factory.BeanNameAware, org.springframework.beans.factory.InitializingBean, org.springframework.context.Lifecycle

    public class DeDuplicator
    extends org.archive.modules.Processor
    implements org.springframework.beans.factory.InitializingBean
    A Heritrix-compatible processor that determines whether CrawlURIs are duplicates.

    Duplicate detection can only be performed after the fetch processors have run.

    Modified by SVC (Søren Vejrup Carlsen) to use Lucene 4.x.
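
    The ordering constraint above can be illustrated with a minimal sketch. It assumes digest-based matching: a URI counts as a duplicate when the digest of its fetched payload equals the digest recorded for the same URL in an earlier crawl. The class DedupSketch, its in-memory map, and its method names are hypothetical stand-ins, not part of the DeDuplicator or Heritrix API; the real processor looks records up in a Lucene index. The point is only that the digest does not exist until the content has been fetched.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration only; not the DeDuplicator or Heritrix API. */
class DedupSketch {
    // Assumed stand-in for the index of an earlier crawl: URL -> hex SHA-1 of the payload.
    private final Map<String, String> previousCrawl = new HashMap<>();

    /** Hex-encoded SHA-1 of a payload; computable only once the content is in hand. */
    static String sha1Hex(byte[] payload) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(payload)) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 not available", e);
        }
    }

    /** Records what an earlier crawl saw for this URL. */
    void recordPreviousCrawl(String url, byte[] payload) {
        previousCrawl.put(url, sha1Hex(payload));
    }

    /** Returns true only when a fetched payload matches the recorded digest for the URL. */
    boolean isDuplicate(String url, byte[] fetchedPayload) {
        if (fetchedPayload == null) {
            return false; // nothing fetched yet, so there is no digest to compare
        }
        return sha1Hex(fetchedPayload).equals(previousCrawl.get(url));
    }

    public static void main(String[] args) {
        DedupSketch sketch = new DedupSketch();
        byte[] body = "unchanged page".getBytes(StandardCharsets.UTF_8);
        sketch.recordPreviousCrawl("http://example.org/", body);
        System.out.println(sketch.isDuplicate("http://example.org/", body));
        System.out.println(sketch.isDuplicate("http://example.org/", null));
    }
}
```

    In this sketch an unfetched URI (null payload) is never reported as a duplicate, which mirrors why the processor has to sit after the fetch processors in the chain.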

    Author:
    Kristinn Sigurðsson, Søren Vejrup Carlsen