Package is.hi.bok.deduplicator
Class DigestIndexer
- java.lang.Object
  - is.hi.bok.deduplicator.DigestIndexer
public class DigestIndexer extends Object
A class for building a de-duplication index. Indexing can be done from the command line (run with the --help parameter to print usage information) or by embedding the class directly in other applications.
This class also defines string constants for the Lucene field names.
- Author:
- Kristinn Sigurðsson, Søren Vejrup Carlsen
-
Field Summary
Fields
static String FIELD_DIGEST
The content digest as String.
static String FIELD_ETAG
The document's etag.
static String FIELD_ORIGIN
A field containing meta-data on where the original version of a document is stored.
static String FIELD_TIMESTAMP
The URL's timestamp (time of fetch).
static String FIELD_URL
The URL.
static String FIELD_URL_NORMALIZED
A stripped (normalized) version of the URL.
static String MODE_BOTH
Both URL and hash are indexed.
static String MODE_HASH
Index the hash, enabling lookups by hash (content digest).
static String MODE_URL
Index the URL, enabling lookups by URL.
-
Constructor Summary
Constructors
DigestIndexer(String indexLocation, String indexingMode, boolean includeNormalizedURL, boolean includeTimestamp, boolean includeEtag, boolean addToExistingIndex)
Each instance of this class wraps one Lucene index for writing deduplication information to it.
-
Method Summary
Methods
void close()
Close the index.
org.apache.lucene.index.IndexWriter getIndex()
static void main(String[] args)
static String stripURL(String url)
An aggressive URL normalizer.
long writeToIndex(CrawlDataIterator dataIt, String mimefilter, boolean blacklist, String defaultOrigin, boolean verbose)
Writes the contents of a CrawlDataIterator to this index.
long writeToIndex(CrawlDataIterator dataIt, String mimefilter, boolean blacklist, String defaultOrigin, boolean verbose, boolean skipDuplicates)
Writes the contents of a CrawlDataIterator to this index.
-
Field Detail
-
FIELD_URL
public static final String FIELD_URL
The URL.
- See Also:
- Constant Field Values
-
FIELD_DIGEST
public static final String FIELD_DIGEST
The content digest as String.
- See Also:
- Constant Field Values
-
FIELD_TIMESTAMP
public static final String FIELD_TIMESTAMP
The URL's timestamp (time of fetch). The exact nature of this time may vary slightly depending on the source (e.g. crawl.log and ARCs contain slightly different times, but both indicate roughly when the document was obtained). The time is encoded as a String with the Java date format yyyyMMddHHmmssSSS.
- See Also:
- Constant Field Values
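The date format above can be exercised with the standard java.text.SimpleDateFormat class. The following is a minimal sketch, not code taken from DigestIndexer: the class name TimestampSketch, its helper methods, and the choice of UTC as time zone are assumptions made for illustration, since the documentation does not state which time zone the indexed timestamps use.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical helper illustrating the yyyyMMddHHmmssSSS encoding.
final class TimestampSketch {
    // The pattern named in the FIELD_TIMESTAMP description.
    static final String PATTERN = "yyyyMMddHHmmssSSS";

    // SimpleDateFormat is not thread-safe, so a fresh instance is built per call.
    private static SimpleDateFormat newFormat() {
        SimpleDateFormat format = new SimpleDateFormat(PATTERN, Locale.ROOT);
        // Assumption: UTC. The source does not specify a time zone.
        format.setTimeZone(TimeZone.getTimeZone("UTC"));
        return format;
    }

    // Encodes a fetch time as a 17-digit timestamp string.
    static String encode(Date fetchTime) {
        return newFormat().format(fetchTime);
    }

    // Decodes a timestamp string back into a Date.
    static Date decode(String timestamp) {
        try {
            return newFormat().parse(timestamp);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Not a " + PATTERN + " timestamp: " + timestamp, e);
        }
    }

    public static void main(String[] args) {
        String encoded = encode(new Date(0L));
        System.out.println(encoded);
        System.out.println(decode(encoded).getTime());
    }
}
```

Because the fields are fixed-width and ordered from most to least significant, timestamps in this format also sort chronologically when compared as plain strings.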
-
FIELD_ETAG
public static final String FIELD_ETAG
The document's etag.
- See Also:
- Constant Field Values
-
FIELD_URL_NORMALIZED
public static final String FIELD_URL_NORMALIZED
A stripped (normalized) version of the URL.
- See Also:
- Constant Field Values
-
FIELD_ORIGIN
public static final String FIELD_ORIGIN
A field containing meta-data on where the original version of a document is stored.
- See Also:
- Constant Field Values
-
MODE_URL
public static final String MODE_URL
Index the URL, enabling lookups by URL. If normalized URLs are included in the index, they will also be indexed and searchable.
- See Also:
- Constant Field Values
-
MODE_HASH
public static final String MODE_HASH
Index the hash, enabling lookups by hash (content digest).
- See Also:
- Constant Field Values
-
MODE_BOTH
public static final String MODE_BOTH
Both URL and hash are indexed.
- See Also:
- Constant Field Values
-
Constructor Detail
-
DigestIndexer
public DigestIndexer(String indexLocation, String indexingMode, boolean includeNormalizedURL, boolean includeTimestamp, boolean includeEtag, boolean addToExistingIndex) throws IOException
Each instance of this class wraps one Lucene index for writing deduplication information to it.
- Parameters:
indexLocation - The location of the index (path).
indexingMode - The indexing mode: MODE_URL, MODE_HASH or MODE_BOTH.
includeNormalizedURL - Whether a normalized version of the URL should be added to the index. See stripURL(String).
includeTimestamp - Whether a timestamp should be included in the index.
includeEtag - Whether an Etag should be included in the index.
addToExistingIndex - Whether an existing index is being opened. Setting this to false will cause any index at indexLocation to be overwritten.
- Throws:
IOException - If an error occurs opening the index.
-
Method Detail
-
getIndex
public org.apache.lucene.index.IndexWriter getIndex()
- Returns:
- the IndexWriter
-
writeToIndex
public long writeToIndex(CrawlDataIterator dataIt, String mimefilter, boolean blacklist, String defaultOrigin, boolean verbose) throws IOException
Writes the contents of a CrawlDataIterator to this index.
This method may be invoked multiple times with different CrawlDataIterators until close() has been called.
- Parameters:
dataIt - The CrawlDataIterator that provides the data to index.
mimefilter - A regular expression that is used as a filter on the mimetypes to include in the index.
blacklist - If true then the mimefilter is used as a blacklist for mimetypes. If false then the mimefilter is treated as a whitelist.
defaultOrigin - If an item is missing an origin, this default value will be assigned to it. Can be null if no default origin value should be assigned.
verbose - If true then progress information will be sent to System.out.
- Returns:
- The number of items added to the index.
- Throws:
IOException - If an error occurs writing the index.
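The interaction between mimefilter and blacklist can be sketched as follows. This is an illustration of the described behavior only, not the code DigestIndexer runs: the class MimeFilterSketch and its accept method are hypothetical, and the sketch assumes the regular expression must match the whole mimetype, which the documentation does not specify.

```java
import java.util.regex.Pattern;

// Hypothetical illustration of the mimefilter / blacklist semantics.
final class MimeFilterSketch {
    // Returns true if an item with the given mimetype would be indexed.
    static boolean accept(String mimeType, String mimefilter, boolean blacklist) {
        // Assumption: the filter is applied as a full match on the mimetype.
        boolean matches = Pattern.matches(mimefilter, mimeType);
        // Blacklist: matching types are excluded. Whitelist: only matching types are included.
        return blacklist ? !matches : matches;
    }

    public static void main(String[] args) {
        String filter = "^text/.*";
        System.out.println(accept("text/html", filter, true));
        System.out.println(accept("image/png", filter, true));
        System.out.println(accept("text/html", filter, false));
        System.out.println(accept("image/png", filter, false));
    }
}
```

With the filter ^text/.* and blacklist set to true, text documents are left out and everything else is indexed; with blacklist set to false, only text documents are indexed.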
-
writeToIndex
public long writeToIndex(CrawlDataIterator dataIt, String mimefilter, boolean blacklist, String defaultOrigin, boolean verbose, boolean skipDuplicates) throws IOException
Writes the contents of a CrawlDataIterator to this index.
This method may be invoked multiple times with different CrawlDataIterators until close() has been called.
- Parameters:
dataIt - The CrawlDataIterator that provides the data to index.
mimefilter - A regular expression that is used as a filter on the mimetypes to include in the index.
blacklist - If true then the mimefilter is used as a blacklist for mimetypes. If false then the mimefilter is treated as a whitelist.
defaultOrigin - If an item is missing an origin, this default value will be assigned to it. Can be null if no default origin value should be assigned.
verbose - If true then progress information will be sent to System.out.
skipDuplicates - If true, URLs that are marked as duplicates are not added to the index.
- Returns:
- The number of items added to the index.
- Throws:
IOException - If an error occurs writing the index.
-
close
public void close() throws IOException
Close the index.
- Throws:
IOException - If an error occurs while closing the index.
-
stripURL
public static String stripURL(String url)
An aggressive URL normalizer. This method removes any www[0-9]. segments from a URL, along with any trailing slashes and all parameters.
Example: http://www.bok.hi.is/?lang=ice would become http://bok.hi.is
- Parameters:
url - The URL to strip.
- Returns:
- A normalized URL.
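The normalization described above can be approximated with a few string operations. The sketch below is an independent illustration, not the source of stripURL: the class StripUrlSketch is hypothetical, and it assumes that the www segment is "www" optionally followed by digits and sits directly after an http or https scheme, as the example suggests. The real method may handle more cases.

```java
// Hypothetical re-implementation of the documented stripURL behavior.
final class StripUrlSketch {
    static String strip(String url) {
        // Drop all parameters: everything from the first '?' onward.
        int queryStart = url.indexOf('?');
        String stripped = queryStart >= 0 ? url.substring(0, queryStart) : url;
        // Assumption: remove a leading "www." or "www<digits>." host label after the scheme.
        stripped = stripped.replaceFirst("^(https?://)www[0-9]*\\.", "$1");
        // Remove any trailing slashes.
        stripped = stripped.replaceAll("/+$", "");
        return stripped;
    }

    public static void main(String[] args) {
        System.out.println(strip("http://www.bok.hi.is/?lang=ice"));
    }
}
```

Run on the example from the description, the sketch yields http://bok.hi.is. Normalizing this aggressively means distinct URLs can collapse to the same key, which is why the normalized form is stored alongside the original URL and not in place of it.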