Class DeDuplicator

  • All Implemented Interfaces:
    org.archive.checkpointing.Checkpointable, org.archive.spring.HasKeyedProperties, org.springframework.beans.factory.Aware, org.springframework.beans.factory.BeanNameAware, org.springframework.beans.factory.InitializingBean, org.springframework.context.Lifecycle

    public class DeDuplicator
    extends org.archive.modules.Processor
    implements org.springframework.beans.factory.InitializingBean
    Heritrix-compatible processor.

    Determines whether CrawlURIs are duplicates.

    Duplicate detection can only be performed after the fetch processors have run. Modified by SVC to use Lucene 4.X.

    Author:
    Kristinn Sigurðsson, Søren Vejrup Carlsen
    • Constructor Detail

      • DeDuplicator

        public DeDuplicator()
    • Method Detail

      • getEnabled

        public boolean getEnabled()
        Overrides:
        getEnabled in class org.archive.modules.Processor
      • setEnabled

        public void setEnabled​(boolean enabled)
        Overrides:
        setEnabled in class org.archive.modules.Processor
      • getIndexLocation

        public String getIndexLocation()
      • setIndexLocation

        public void setIndexLocation​(String indexLocation)
        Setter used by Spring.
      • getJumpTo

        public String getJumpTo()
      • setJumpTo

        public void setJumpTo​(String jumpTo)
        Spring setter. TODO: Is this property in use? Netarkivet does not use it.
      • getOrigin

        public String getOrigin()
      • setOrigin

        public void setOrigin​(String origin)
        Spring setter.
      • getTryEquivalent

        public Boolean getTryEquivalent()
      • setTryEquivalent

        public void setTryEquivalent​(Boolean tryEquivalent)
        Spring setter.
      • getMimeFilter

        public String getMimeFilter()
      • setMimeFilter

        public void setMimeFilter​(String mimeFilter)
      • getBlacklist

        public Boolean getBlacklist()
      • getAnalyzeTimestamp

        public boolean getAnalyzeTimestamp()
      • getChangeContentSize

        public Boolean getChangeContentSize()
      • setChangeContentSize

        public void setChangeContentSize​(Boolean changeContentSize)
        Spring setter.
      • getStatsPerHost

        public Boolean getStatsPerHost()
      • setStatsPerHost

        public void setStatsPerHost​(Boolean statsPerHost)
      • setRevisitInWarcs

        public void setRevisitInWarcs​(Boolean revisitOn)
      • getRevisitInWarcs

        public Boolean getRevisitInWarcs()
      • getServerCache

        public org.archive.modules.net.ServerCache getServerCache()
      • setServerCache

        @Autowired
        public void setServerCache​(org.archive.modules.net.ServerCache serverCache)
      • afterPropertiesSet

        public void afterPropertiesSet()
                                throws Exception
        Specified by:
        afterPropertiesSet in interface org.springframework.beans.factory.InitializingBean
        Throws:
        Exception
      • shouldProcess

        protected boolean shouldProcess​(org.archive.modules.CrawlURI curi)
        Specified by:
        shouldProcess in class org.archive.modules.Processor
      • innerProcess

        protected void innerProcess​(org.archive.modules.CrawlURI puri)
        Specified by:
        innerProcess in class org.archive.modules.Processor
      • innerProcessResult

        protected org.archive.modules.ProcessResult innerProcessResult​(org.archive.modules.CrawlURI curi)
                                                                throws InterruptedException
        Overrides:
        innerProcessResult in class org.archive.modules.Processor
        Throws:
        InterruptedException
      • lookupByURL

        protected org.apache.lucene.document.Document lookupByURL​(org.archive.modules.CrawlURI curi,
                                                                  is.hi.bok.deduplicator.Statistics currHostStats)
        Processes a CrawlURI by looking it up in the index by URL.
        Parameters:
        curi - The CrawlURI to process.
        currHostStats - A statistics object for the current host. If per-host statistics tracking is enabled, this must be non-null, and the method increments the appropriate counters on it.
        Returns:
        The result of the lookup (a Lucene document), or null if no duplicate is found.
      • lookupByDigest

        protected org.apache.lucene.document.Document lookupByDigest​(org.archive.modules.CrawlURI curi,
                                                                     is.hi.bok.deduplicator.Statistics currHostStats)
        Processes a CrawlURI by looking it up in the index by content digest.
        Parameters:
        curi - The CrawlURI to process.
        currHostStats - A statistics object for the current host. If per-host statistics tracking is enabled, this must be non-null, and the method increments the appropriate counters on it.
        Returns:
        The result of the lookup (a Lucene document), or null if no duplicate is found.
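        The null-on-miss contract shared by lookupByURL and lookupByDigest can be sketched without Lucene or Heritrix on the classpath. This is only an illustration under stated assumptions: a HashMap stands in for the Lucene index, a Map of field names to values stands in for the Lucene Document, and HostStats with its two counters is a hypothetical replacement for is.hi.bok.deduplicator.Statistics. None of these names come from the library, and the sketch treats a null statistics object as "per-host tracking disabled".

```java
import java.util.HashMap;
import java.util.Map;

public class DigestLookupSketch {

    /** Hypothetical stand-in for the per-host statistics object. */
    public static class HostStats {
        public long handled;     // lookups performed for this host
        public long duplicates;  // lookups that found an earlier capture
    }

    // Stand-in index: content digest -> stored fields of an earlier capture.
    private final Map<String, Map<String, String>> index = new HashMap<>();

    /** Records an earlier capture so that later lookups can find it. */
    public void add(String digest, String url) {
        Map<String, String> doc = new HashMap<>();
        doc.put("digest", digest);
        doc.put("url", url);
        index.put(digest, doc);
    }

    /** Returns the stored record for the digest, or null if there is no duplicate. */
    public Map<String, String> lookupByDigest(String digest, HostStats stats) {
        Map<String, String> hit = index.get(digest);
        if (stats != null) {          // assumption: null means tracking is off
            stats.handled++;
            if (hit != null) {
                stats.duplicates++;
            }
        }
        return hit;
    }

    public static void main(String[] args) {
        DigestLookupSketch sketch = new DigestLookupSketch();
        HostStats stats = new HostStats();
        sketch.add("sha1:ABC", "http://example.org/");

        System.out.println(sketch.lookupByDigest("sha1:ABC", stats) != null);
        System.out.println(sketch.lookupByDigest("sha1:XYZ", stats) == null);
        System.out.println(stats.handled + " handled, " + stats.duplicates + " duplicate(s)");
    }
}
```

        A caller only needs a null check on the result to decide whether the URI is a duplicate, which is why the counters are updated inside the lookup rather than by the caller.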
      • report

        public String report()
        Overrides:
        report in class org.archive.modules.Processor
      • getPercentage

        protected static String getPercentage​(double portion,
                                              double total)
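        This page does not document the output format of getPercentage. The following is a minimal sketch of one plausible helper with the same signature, assuming one decimal place, a trailing percent sign, and "0.0%" when total is zero; all three choices are assumptions and may differ from the actual implementation.

```java
import java.util.Locale;

public class PercentageSketch {

    /** Formats portion/total as a percentage string (assumed format, e.g. "25.0%"). */
    public static String getPercentage(double portion, double total) {
        if (total == 0) {
            return "0.0%"; // assumption: avoid NaN or Infinity in reports
        }
        // Locale.ROOT keeps the decimal separator a dot regardless of platform locale.
        return String.format(Locale.ROOT, "%.1f%%", 100.0 * portion / total);
    }

    public static void main(String[] args) {
        System.out.println(getPercentage(1, 4));
        System.out.println(getPercentage(0, 0));
    }
}
```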
      • doAnalysis

        protected void doAnalysis​(org.archive.modules.CrawlURI curi,
                                  is.hi.bok.deduplicator.Statistics currHostStats,
                                  boolean isDuplicate)
      • doTimestampAnalysis

        protected void doTimestampAnalysis​(org.archive.modules.CrawlURI curi,
                                           org.apache.lucene.document.Document urlHit,
                                           is.hi.bok.deduplicator.Statistics currHostStats,
                                           boolean isDuplicate)
      • queryField

        protected org.apache.lucene.search.Query queryField​(String fieldName,
                                                            String value)
        Builds a simple Lucene query for a single term in a single field.
        Parameters:
        fieldName - The name of the field to look in.
        value - The value to query for.
        Returns:
        A Query for the given value in the given field.