Package is.hi.bok.deduplicator
Class DeDuplicator
- java.lang.Object
-
- org.archive.modules.Processor
-
- is.hi.bok.deduplicator.DeDuplicator
-
- All Implemented Interfaces:
org.archive.checkpointing.Checkpointable, org.archive.spring.HasKeyedProperties, org.springframework.beans.factory.Aware, org.springframework.beans.factory.BeanNameAware, org.springframework.beans.factory.InitializingBean, org.springframework.context.Lifecycle
public class DeDuplicator extends org.archive.modules.Processor implements org.springframework.beans.factory.InitializingBean
Heritrix-compatible processor that determines whether CrawlURIs are duplicates.
Duplicate detection can only be performed after the fetch processors have run. Modified by SVC to use Lucene 4.X.
- Author:
- Kristinn Sigurðsson, Søren Vejrup Carlsen
Other option: DIGEST. Other option: WHITELIST. Other options: NONE, TIMESTAMP_AND_ETAG. Other options: NONE, PROCESSOR. (From deduplicator-commons/src/main/java/is/landsbokasafn/deduplicator/IndexFields.java: these enums correspond to the names of fields in the Lucene index.)
-
-
Nested Class Summary
Nested Classes
Modifier and Type | Class | Description
static class
DeDuplicator.AnalysisMode
static class
DeDuplicator.FilterMode
static class
DeDuplicator.MatchingMethod
static class
DeDuplicator.OriginHandling
-
Field Summary
Fields
Modifier and Type | Field | Description
static String
ATTR_ANALYZE_MODE
static String
ATTR_CHANGE_CONTENT_SIZE
static String
ATTR_EQUIVALENT
static String
ATTR_FILTER_MODE
static String
ATTR_JUMP_TO
static String
ATTR_MIME_FILTER
static String
ATTR_ORIGIN
static String
ATTR_ORIGIN_HANDLING
static String
ATTR_REVISIT_IN_WARCS
static String
ATTR_STATS_PER_HOST
static String
DEFAULT_MIME_FILTER
static DeDuplicator.OriginHandling
DEFAULT_ORIGIN_HANDLING
protected org.apache.lucene.index.IndexReader
indexReader
protected org.apache.lucene.search.IndexSearcher
indexSearcher
protected boolean
lookupByURL
protected HashMap<String,is.hi.bok.deduplicator.Statistics>
perHostStats
protected org.archive.modules.net.ServerCache
serverCache
protected is.hi.bok.deduplicator.Statistics
stats
protected boolean
statsPerHost
protected boolean
useOrigin
protected boolean
useOriginFromIndex
-
Constructor Summary
Constructors
Constructor | Description
DeDuplicator()
-
Method Summary
Modifier and Type | Method | Description
void
afterPropertiesSet()
protected void
doAnalysis(org.archive.modules.CrawlURI curi, is.hi.bok.deduplicator.Statistics currHostStats, boolean isDuplicate)
protected void
doTimestampAnalysis(org.archive.modules.CrawlURI curi, org.apache.lucene.document.Document urlHit, is.hi.bok.deduplicator.Statistics currHostStats, boolean isDuplicate)
DeDuplicator.AnalysisMode
getAnalysisMode()
boolean
getAnalyzeTimestamp()
Boolean
getBlacklist()
Boolean
getChangeContentSize()
boolean
getEnabled()
DeDuplicator.FilterMode
getFilterMode()
String
getIndexLocation()
String
getJumpTo()
DeDuplicator.MatchingMethod
getMatchingMethod()
String
getMimeFilter()
String
getOrigin()
DeDuplicator.OriginHandling
getOriginHandling()
protected static String
getPercentage(double portion, double total)
Boolean
getRevisitInWarcs()
org.archive.modules.net.ServerCache
getServerCache()
Boolean
getStatsPerHost()
Boolean
getTryEquivalent()
protected void
innerProcess(org.archive.modules.CrawlURI puri)
protected org.archive.modules.ProcessResult
innerProcessResult(org.archive.modules.CrawlURI curi)
protected org.apache.lucene.document.Document
lookupByDigest(org.archive.modules.CrawlURI curi, is.hi.bok.deduplicator.Statistics currHostStats)
Process a CrawlURI looking up in the index by content digest
protected org.apache.lucene.document.Document
lookupByURL(org.archive.modules.CrawlURI curi, is.hi.bok.deduplicator.Statistics currHostStats)
Process a CrawlURI looking up in the index by URL
protected org.apache.lucene.search.Query
queryField(String fieldName, String value)
Run a simple Lucene query for a single term in a single field.
String
report()
void
setAnalysisMode(DeDuplicator.AnalysisMode analyzeMode)
void
setChangeContentSize(Boolean changeContentSize)
SPRING SETTER
void
setEnabled(boolean enabled)
void
setfilterMode(DeDuplicator.FilterMode filterMode)
SPRING SETTER method
void
setIndexLocation(String indexLocation)
SETTER used by Spring
void
setJumpTo(String jumpTo)
SPRING SETTER.
void
setMatchingMethod(DeDuplicator.MatchingMethod method)
SETTER used by Spring
void
setMimeFilter(String mimeFilter)
void
setOrigin(String origin)
SPRING SETTER
void
setOriginHandling(DeDuplicator.OriginHandling originHandling)
void
setRevisitInWarcs(Boolean revisitOn)
void
setServerCache(org.archive.modules.net.ServerCache serverCache)
void
setStatsPerHost(Boolean statsPerHost)
void
setTryEquivalent(Boolean tryEquivalent)
SPRING SETTER
protected boolean
shouldProcess(org.archive.modules.CrawlURI curi)
-
Methods inherited from class org.archive.modules.Processor
doCheckpoint, finishCheckpoint, flattenVia, fromCheckpointJson, getBeanName, getKeyedProperties, getRecordedSize, getShouldProcessRule, getURICount, hasHttpAuthenticationCredential, innerRejectProcess, isRunning, isSuccess, process, setBeanName, setRecoveryCheckpoint, setShouldProcessRule, start, startCheckpoint, stop, toCheckpointJson
-
-
-
-
Field Detail
-
ATTR_JUMP_TO
public static final String ATTR_JUMP_TO
- See Also:
- Constant Field Values
-
ATTR_ORIGIN
public static final String ATTR_ORIGIN
- See Also:
- Constant Field Values
-
ATTR_EQUIVALENT
public static final String ATTR_EQUIVALENT
- See Also:
- Constant Field Values
-
ATTR_MIME_FILTER
public static final String ATTR_MIME_FILTER
- See Also:
- Constant Field Values
-
DEFAULT_MIME_FILTER
public static final String DEFAULT_MIME_FILTER
- See Also:
- Constant Field Values
-
ATTR_FILTER_MODE
public static final String ATTR_FILTER_MODE
- See Also:
- Constant Field Values
-
ATTR_ANALYZE_MODE
public static final String ATTR_ANALYZE_MODE
- See Also:
- Constant Field Values
-
ATTR_CHANGE_CONTENT_SIZE
public static final String ATTR_CHANGE_CONTENT_SIZE
- See Also:
- Constant Field Values
-
ATTR_STATS_PER_HOST
public static final String ATTR_STATS_PER_HOST
- See Also:
- Constant Field Values
-
ATTR_ORIGIN_HANDLING
public static final String ATTR_ORIGIN_HANDLING
- See Also:
- Constant Field Values
-
DEFAULT_ORIGIN_HANDLING
public static final DeDuplicator.OriginHandling DEFAULT_ORIGIN_HANDLING
-
ATTR_REVISIT_IN_WARCS
public static final String ATTR_REVISIT_IN_WARCS
- See Also:
- Constant Field Values
-
serverCache
protected org.archive.modules.net.ServerCache serverCache
-
indexSearcher
protected org.apache.lucene.search.IndexSearcher indexSearcher
-
indexReader
protected org.apache.lucene.index.IndexReader indexReader
-
lookupByURL
protected boolean lookupByURL
-
statsPerHost
protected boolean statsPerHost
-
useOrigin
protected boolean useOrigin
-
useOriginFromIndex
protected boolean useOriginFromIndex
-
stats
protected is.hi.bok.deduplicator.Statistics stats
-
-
Method Detail
-
getEnabled
public boolean getEnabled()
- Overrides:
getEnabled
in class org.archive.modules.Processor
-
setEnabled
public void setEnabled(boolean enabled)
- Overrides:
setEnabled
in class org.archive.modules.Processor
-
getIndexLocation
public String getIndexLocation()
-
setIndexLocation
public void setIndexLocation(String indexLocation)
SETTER used by Spring
-
getMatchingMethod
public DeDuplicator.MatchingMethod getMatchingMethod()
-
setMatchingMethod
public void setMatchingMethod(DeDuplicator.MatchingMethod method)
SETTER used by Spring
-
getJumpTo
public String getJumpTo()
-
setJumpTo
public void setJumpTo(String jumpTo)
SPRING SETTER. TODO: Is this property used? Netarkivet does not use it.
-
getOrigin
public String getOrigin()
-
setOrigin
public void setOrigin(String origin)
SPRING SETTER
-
getTryEquivalent
public Boolean getTryEquivalent()
-
setTryEquivalent
public void setTryEquivalent(Boolean tryEquivalent)
SPRING SETTER
-
getMimeFilter
public String getMimeFilter()
-
setMimeFilter
public void setMimeFilter(String mimeFilter)
-
getFilterMode
public DeDuplicator.FilterMode getFilterMode()
-
getBlacklist
public Boolean getBlacklist()
-
setfilterMode
public void setfilterMode(DeDuplicator.FilterMode filterMode)
SPRING SETTER method
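The MIME filter and filter mode accessors above carry no description here. As a rough illustration only, the sketch below assumes the filter is a regular expression over the content type, that BLACKLIST skips matching types, and that WHITELIST processes only matching types. The class `MimeFilterSketch`, its local `FilterMode` enum, and the pattern `^text/.*` are hypothetical and are not taken from DeDuplicator.

```java
import java.util.regex.Pattern;

// Hypothetical illustration; not the DeDuplicator implementation.
class MimeFilterSketch {
    // Local stand-in for DeDuplicator.FilterMode (values assumed).
    enum FilterMode { BLACKLIST, WHITELIST }

    /**
     * Decides whether a document of the given MIME type should be processed.
     * Assumption: mimeFilter is a regular expression matched against the whole type.
     */
    static boolean shouldProcess(String mimeType, String mimeFilter, FilterMode mode) {
        boolean matches = Pattern.matches(mimeFilter, mimeType);
        // BLACKLIST: skip what matches. WHITELIST: keep only what matches.
        return mode == FilterMode.BLACKLIST ? !matches : matches;
    }

    public static void main(String[] args) {
        String filter = "^text/.*"; // assumed example pattern
        System.out.println(shouldProcess("text/html", filter, FilterMode.BLACKLIST));
        System.out.println(shouldProcess("image/png", filter, FilterMode.BLACKLIST));
    }
}
```

Under these assumptions a blacklist of text types would leave images and other binary content eligible for duplicate detection.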
-
getAnalyzeTimestamp
public boolean getAnalyzeTimestamp()
-
setAnalysisMode
public void setAnalysisMode(DeDuplicator.AnalysisMode analyzeMode)
-
getAnalysisMode
public DeDuplicator.AnalysisMode getAnalysisMode()
-
getChangeContentSize
public Boolean getChangeContentSize()
-
setChangeContentSize
public void setChangeContentSize(Boolean changeContentSize)
SPRING SETTER
-
getStatsPerHost
public Boolean getStatsPerHost()
-
setStatsPerHost
public void setStatsPerHost(Boolean statsPerHost)
-
getOriginHandling
public DeDuplicator.OriginHandling getOriginHandling()
-
setOriginHandling
public void setOriginHandling(DeDuplicator.OriginHandling originHandling)
-
setRevisitInWarcs
public void setRevisitInWarcs(Boolean revisitOn)
-
getRevisitInWarcs
public Boolean getRevisitInWarcs()
-
getServerCache
public org.archive.modules.net.ServerCache getServerCache()
-
setServerCache
@Autowired public void setServerCache(org.archive.modules.net.ServerCache serverCache)
-
afterPropertiesSet
public void afterPropertiesSet() throws Exception
- Specified by:
afterPropertiesSet
in interface org.springframework.beans.factory.InitializingBean
- Throws:
Exception
-
shouldProcess
protected boolean shouldProcess(org.archive.modules.CrawlURI curi)
- Specified by:
shouldProcess
in class org.archive.modules.Processor
-
innerProcess
protected void innerProcess(org.archive.modules.CrawlURI puri)
- Specified by:
innerProcess
in class org.archive.modules.Processor
-
innerProcessResult
protected org.archive.modules.ProcessResult innerProcessResult(org.archive.modules.CrawlURI curi) throws InterruptedException
- Overrides:
innerProcessResult
in class org.archive.modules.Processor
- Throws:
InterruptedException
-
lookupByURL
protected org.apache.lucene.document.Document lookupByURL(org.archive.modules.CrawlURI curi, is.hi.bok.deduplicator.Statistics currHostStats)
Process a CrawlURI looking up in the index by URL
- Parameters:
curi
- The CrawlURI to process
currHostStats
- A statistics object for the current host. If per-host statistics tracking is enabled, this must be non-null and the method will increment the appropriate counters on it.
- Returns:
- The result of the lookup (a Lucene document). If no duplicate is found, null is returned.
-
lookupByDigest
protected org.apache.lucene.document.Document lookupByDigest(org.archive.modules.CrawlURI curi, is.hi.bok.deduplicator.Statistics currHostStats)
Process a CrawlURI looking up in the index by content digest
- Parameters:
curi
- The CrawlURI to process
currHostStats
- A statistics object for the current host. If per-host statistics tracking is enabled, this must be non-null and the method will increment the appropriate counters on it.
- Returns:
- The result of the lookup (a Lucene document). If no duplicate is found, null is returned.
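The contract shared by the two lookup methods (return the matching index entry, or null when no duplicate is found, and update per-host counters when a statistics object is supplied) can be sketched with plain maps in place of the Lucene index. Everything in this sketch is hypothetical: the `Entry` and `Stats` types, the counter names, and the rule that a URL hit only counts as a duplicate when its digest also matches are assumptions made for illustration, not the DeDuplicator code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory stand-in for the index-backed lookups.
class LookupSketch {
    /** Stand-in for an indexed record (not a Lucene Document). */
    static final class Entry {
        final String url;
        final String digest;
        Entry(String url, String digest) { this.url = url; this.digest = digest; }
    }

    /** Stand-in for the per-host statistics object; counter names are assumed. */
    static final class Stats {
        long handled;
        long duplicates;
    }

    private final Map<String, Entry> byUrl = new HashMap<>();
    private final Map<String, Entry> byDigest = new HashMap<>();

    void add(String url, String digest) {
        Entry e = new Entry(url, digest);
        byUrl.put(url, e);
        byDigest.put(digest, e);
    }

    /** URL lookup; assumed rule: the stored digest must equal the fetched digest. */
    Entry lookupByUrl(String url, String digest, Stats hostStats) {
        if (hostStats != null) hostStats.handled++;
        Entry hit = byUrl.get(url);
        if (hit == null || !hit.digest.equals(digest)) return null;
        if (hostStats != null) hostStats.duplicates++;
        return hit;
    }

    /** Digest lookup; any record with the same digest counts as a duplicate. */
    Entry lookupByDigest(String digest, Stats hostStats) {
        if (hostStats != null) hostStats.handled++;
        Entry hit = byDigest.get(digest);
        if (hit != null && hostStats != null) hostStats.duplicates++;
        return hit;
    }

    public static void main(String[] args) {
        LookupSketch index = new LookupSketch();
        index.add("http://example.org/a", "sha1:AAA");
        Stats stats = new Stats();
        System.out.println(index.lookupByUrl("http://example.org/a", "sha1:AAA", stats) != null);
        System.out.println(index.lookupByDigest("sha1:ZZZ", stats) != null);
    }
}
```

Both strategies need the content digest of the fetched document, which is consistent with the class note that duplicate detection can only run after the fetch processors.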
-
report
public String report()
- Overrides:
report
in class org.archive.modules.Processor
-
getPercentage
protected static String getPercentage(double portion, double total)
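The page gives no description for getPercentage. Judging only by its signature, it presumably renders portion as a percentage of total for the report. The sketch below is one possible reading: the one-decimal format, the trailing percent sign, and the handling of a zero total are all assumptions, and `PercentageSketch` is a hypothetical class.

```java
import java.util.Locale;

// Hypothetical illustration of a percentage formatter with the same signature.
class PercentageSketch {
    /**
     * Formats portion/total as a percentage with one decimal place.
     * Assumption: a zero total yields "0.0%" rather than NaN or an exception.
     */
    static String getPercentage(double portion, double total) {
        if (total == 0.0) {
            return "0.0%";
        }
        // Locale.ROOT keeps the decimal separator a dot on every platform.
        return String.format(Locale.ROOT, "%.1f%%", 100.0 * portion / total);
    }

    public static void main(String[] args) {
        System.out.println(getPercentage(1, 4));
        System.out.println(getPercentage(0, 0));
    }
}
```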
-
doAnalysis
protected void doAnalysis(org.archive.modules.CrawlURI curi, is.hi.bok.deduplicator.Statistics currHostStats, boolean isDuplicate)
-
doTimestampAnalysis
protected void doTimestampAnalysis(org.archive.modules.CrawlURI curi, org.apache.lucene.document.Document urlHit, is.hi.bok.deduplicator.Statistics currHostStats, boolean isDuplicate)
-
queryField
protected org.apache.lucene.search.Query queryField(String fieldName, String value)
Run a simple Lucene query for a single term in a single field.
- Parameters:
fieldName
- name of the field to look in.
value
- The value to query for
- Returns:
- A Query for the given value in the given field.
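To make "a single term in a single field" concrete without depending on Lucene, the sketch below models a query as a predicate that matches records whose named field equals the value exactly. It is an in-memory stand-in, not the Lucene query that queryField builds; the class `TermMatchSketch`, the `search` helper, and the field names `url` and `digest` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical in-memory analogue of a single-field, single-term query.
class TermMatchSketch {
    /** Builds a predicate matching records whose field equals value exactly. */
    static Predicate<Map<String, String>> queryField(String fieldName, String value) {
        // A record lacking the field yields null from get() and never matches.
        return doc -> value.equals(doc.get(fieldName));
    }

    /** Returns every record accepted by the query, in input order. */
    static List<Map<String, String>> search(List<Map<String, String>> docs,
                                            Predicate<Map<String, String>> query) {
        List<Map<String, String>> hits = new ArrayList<>();
        for (Map<String, String> doc : docs) {
            if (query.test(doc)) {
                hits.add(doc);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Map<String, String>> docs = List.of(
                Map.of("url", "http://example.org/a", "digest", "sha1:AAA"),
                Map.of("url", "http://example.org/b", "digest", "sha1:BBB"));
        System.out.println(search(docs, queryField("digest", "sha1:BBB")).size());
    }
}
```

Exact matching with no tokenizing or scoring is the point of the analogy: URLs and digests are looked up as whole values.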
-
-