Class DigestIndexer

  • public class DigestIndexer
    extends java.lang.Object
    A class for building a de-duplication index.

    Indexing can be run from the command line (run with the --help parameter to print usage information) or embedded natively in other applications.

    This class also defines string constants for the Lucene field names.

    Author:
    Kristinn Sigurðsson, Søren Vejrup Carlsen
    • Field Summary

      Modifier and Type Field Description
      static java.lang.String FIELD_DIGEST
      The content digest as String.
      static java.lang.String FIELD_ETAG
      The document's etag.
      static java.lang.String FIELD_ORIGIN
      A field containing meta-data on where the original version of a document is stored.
      static java.lang.String FIELD_TIMESTAMP
      The URL's timestamp (time of fetch).
      static java.lang.String FIELD_URL
      The URL.
      static java.lang.String FIELD_URL_NORMALIZED
      A stripped (normalized) version of the URL.
      static java.lang.String MODE_BOTH
      Both URL and hash are indexed.
      static java.lang.String MODE_HASH
      Index the hash, enabling lookups by hash (content digest).
      static java.lang.String MODE_URL
      Index the URL, enabling lookups by URL.
    • Constructor Summary

      Constructor Description
      DigestIndexer​(java.lang.String indexLocation, java.lang.String indexingMode, boolean includeNormalizedURL, boolean includeTimestamp, boolean includeEtag, boolean addToExistingIndex)
      Each instance of this class wraps one Lucene index for writing deduplication information to it.
    • Method Summary

      Modifier and Type Method Description
      void close()
      Close the index.
      org.apache.lucene.index.IndexWriter getIndex()  
      static void main​(java.lang.String[] args)  
      static java.lang.String stripURL​(java.lang.String url)
      An aggressive URL normalizer.
      long writeToIndex​(CrawlDataIterator dataIt, java.lang.String mimefilter, boolean blacklist, java.lang.String defaultOrigin, boolean verbose)
      Writes the contents of a CrawlDataIterator to this index.
      long writeToIndex​(CrawlDataIterator dataIt, java.lang.String mimefilter, boolean blacklist, java.lang.String defaultOrigin, boolean verbose, boolean skipDuplicates)
      Writes the contents of a CrawlDataIterator to this index.
    • Field Detail


      • FIELD_TIMESTAMP

        public static final java.lang.String FIELD_TIMESTAMP
        The URL's timestamp (time of fetch). The exact nature of this time may vary slightly depending on the source (e.g. crawl.log and ARCs contain slightly different times, but both indicate roughly when the document was obtained). The time is encoded as a String with the Java date format yyyyMMddHHmmssSSS.
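        The date format named above can be illustrated with the standard java.text.SimpleDateFormat class. This is a minimal sketch, not the indexer's own code: the class name TimestampFormatSketch, the helper names encode and decode, and the choice of UTC as the time zone are assumptions made for illustration only.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

// Hypothetical helper: shows how a fetch time maps to a yyyyMMddHHmmssSSS string.
final class TimestampFormatSketch {
    static final String PATTERN = "yyyyMMddHHmmssSSS";

    // SimpleDateFormat is not thread-safe, so a fresh instance is built per call.
    private static SimpleDateFormat newFormat() {
        SimpleDateFormat fmt = new SimpleDateFormat(PATTERN);
        fmt.setTimeZone(TimeZone.getTimeZone("UTC")); // assumption: UTC, the source does not say
        fmt.setLenient(false);                        // reject out-of-range field values
        return fmt;
    }

    // Encode a fetch time as a 17-character timestamp string.
    static String encode(Date fetchTime) {
        return newFormat().format(fetchTime);
    }

    // Decode a timestamp string back into a Date.
    static Date decode(String timestamp) {
        try {
            return newFormat().parse(timestamp);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Not a " + PATTERN + " timestamp: " + timestamp, e);
        }
    }

    public static void main(String[] args) {
        String ts = encode(new Date(0L));
        System.out.println(ts);                   // prints 19700101000000000
        System.out.println(decode(ts).getTime()); // prints 0
    }
}
```

        Because every field is fixed-width and ordered from most to least significant, timestamps in this format sort chronologically when compared as plain strings.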
      • FIELD_ORIGIN

        public static final java.lang.String FIELD_ORIGIN
        A field containing meta-data on where the original version of a document is stored.
      • MODE_URL

        public static final java.lang.String MODE_URL
        Index the URL, enabling lookups by URL. If normalized URLs are included in the index, they will also be indexed and searchable.
    • Constructor Detail

      • DigestIndexer

        public DigestIndexer​(java.lang.String indexLocation,
                             java.lang.String indexingMode,
                             boolean includeNormalizedURL,
                             boolean includeTimestamp,
                             boolean includeEtag,
                             boolean addToExistingIndex)
        Each instance of this class wraps one Lucene index for writing deduplication information to it.
        Parameters:
        indexLocation - The location of the index (path).
        indexingMode - The indexing mode: MODE_URL, MODE_HASH or MODE_BOTH.
        includeNormalizedURL - Whether a normalized version of the URL should be added to the index. See stripURL(String).
        includeTimestamp - Whether a timestamp should be included in the index.
        includeEtag - Whether an ETag should be included in the index.
        addToExistingIndex - Whether to add to an existing index. Setting this to false causes any index at indexLocation to be overwritten.
    • Method Detail

      • getIndex

        public org.apache.lucene.index.IndexWriter getIndex()
        Returns:
        The IndexWriter.
      • writeToIndex

        public long writeToIndex​(CrawlDataIterator dataIt,
                                 java.lang.String mimefilter,
                                 boolean blacklist,
                                 java.lang.String defaultOrigin,
                                 boolean verbose)
        Writes the contents of a CrawlDataIterator to this index.

        This method may be invoked multiple times with different CrawlDataIterators until close() has been called.

        Parameters:
        dataIt - The CrawlDataIterator that provides the data to index.
        mimefilter - A regular expression that is used as a filter on the mimetypes to include in the index.
        blacklist - If true then the mimefilter is used as a blacklist for mimetypes. If false then the mimefilter is treated as a whitelist.
        defaultOrigin - If an item is missing an origin, this default value will be assigned to it. Can be null if no default origin value should be assigned.
        verbose - If true then progress information will be sent to System.out.
        Returns:
        The number of items added to the index.
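        The interaction between mimefilter and blacklist can be sketched as a single predicate. This is an illustrative sketch only, not the indexer's implementation: the class MimeFilterSketch and the method accept are hypothetical names, and it assumes the regular expression must match the whole mimetype string.

```java
import java.util.regex.Pattern;

// Hypothetical helper: decides whether an item with the given mimetype would be indexed.
final class MimeFilterSketch {

    // Assumption: the filter is applied as a full match against the mimetype.
    static boolean accept(String mimeType, String mimefilter, boolean blacklist) {
        boolean matches = Pattern.matches(mimefilter, mimeType);
        // Blacklist: matching types are excluded. Whitelist: only matching types are included.
        return blacklist ? !matches : matches;
    }

    public static void main(String[] args) {
        String filter = "^text/.*";
        System.out.println(accept("text/html", filter, true));   // prints false (blacklisted)
        System.out.println(accept("image/png", filter, true));   // prints true
        System.out.println(accept("text/html", filter, false));  // prints true (whitelisted)
        System.out.println(accept("image/png", filter, false));  // prints false
    }
}
```

        With the same expression, flipping blacklist inverts the selection, so a filter such as ^text/.* can either exclude or exclusively select textual documents.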
      • writeToIndex

        public long writeToIndex​(CrawlDataIterator dataIt,
                                 java.lang.String mimefilter,
                                 boolean blacklist,
                                 java.lang.String defaultOrigin,
                                 boolean verbose,
                                 boolean skipDuplicates)
        Writes the contents of a CrawlDataIterator to this index.

        This method may be invoked multiple times with different CrawlDataIterators until close() has been called.

        Parameters:
        dataIt - The CrawlDataIterator that provides the data to index.
        mimefilter - A regular expression that is used as a filter on the mimetypes to include in the index.
        blacklist - If true then the mimefilter is used as a blacklist for mimetypes. If false then the mimefilter is treated as a whitelist.
        defaultOrigin - If an item is missing an origin, this default value will be assigned to it. Can be null if no default origin value should be assigned.
        verbose - If true then progress information will be sent to System.out.
        skipDuplicates - If true, URLs that are marked as duplicates are not added to the index.
        Returns:
        The number of items added to the index.
      • close

        public void close()
        Close the index.
        Throws:
      • stripURL

        public static java.lang.String stripURL​(java.lang.String url)
        An aggressive URL normalizer. This method removes any www[0-9]. segments from a URL, along with any trailing slashes and all parameters.

        Example: would become

        Parameters:
        url - The URL to strip.
        Returns:
        A normalized URL.
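        The three normalization steps described for stripURL can be approximated with plain regular-expression replacements. The following is a sketch under stated assumptions rather than the method's actual source: StripUrlSketch and stripUrl are hypothetical names, the www segment is assumed to be "www" followed by optional digits and a dot, parameters are assumed to mean everything from the first '?' onward, and the example.com URLs are made up for illustration.

```java
// Hypothetical re-creation of the described behaviour; not the library's code.
final class StripUrlSketch {

    static String stripUrl(String url) {
        // Assumption: a www segment is "www", optional digits, then a dot (www., www2., ...).
        String stripped = url.replaceAll("www[0-9]*\\.", "");
        // Assumption: "parameters" means the query string, from the first '?' to the end.
        stripped = stripped.replaceAll("\\?.*$", "");
        // Remove any trailing slashes left at the end.
        stripped = stripped.replaceAll("/+$", "");
        return stripped;
    }

    public static void main(String[] args) {
        System.out.println(stripUrl("http://www2.example.com/path/?id=1")); // prints http://example.com/path
        System.out.println(stripUrl("http://www.example.com/"));            // prints http://example.com
    }
}
```

        Normalization this aggressive can map distinct resources to the same key, which is why the normalized URL is stored as an optional additional field next to the original URL rather than replacing it.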
      • main

        public static void main​(java.lang.String[] args)
                         throws java.lang.Exception