dk.netarkivet.archive.indexserver
Class CrawlLogIndexCache

java.lang.Object
  extended by dk.netarkivet.archive.indexserver.FileBasedCache<java.util.Set<T>>
      extended by dk.netarkivet.archive.indexserver.MultiFileBasedCache<T>
          extended by dk.netarkivet.archive.indexserver.CombiningMultiFileBasedCache<java.lang.Long>
              extended by dk.netarkivet.archive.indexserver.CrawlLogIndexCache
All Implemented Interfaces:
JobIndexCache
Direct Known Subclasses:
DedupCrawlLogIndexCache, FullCrawlLogIndexCache

public abstract class CrawlLogIndexCache
extends CombiningMultiFileBasedCache<java.lang.Long>
implements JobIndexCache

A cache that serves Lucene indices of crawl logs for given job IDs. It uses the DigestIndexer from the deduplicator software: http://deduplicator.sourceforge.net/apidocs/is/hi/bok/deduplicator/DigestIndexer.html. When the underlying files are combined, each file in the Lucene index is gzipped, and the compressed versions are stored in the directory given by getCacheFile(). Each subclass determines in its constructor call which MIME types are included.


Field Summary
 
Fields inherited from class dk.netarkivet.archive.indexserver.CombiningMultiFileBasedCache
rawcache
 
Fields inherited from class dk.netarkivet.archive.indexserver.FileBasedCache
cacheDir
 
Constructor Summary
CrawlLogIndexCache(java.lang.String name, boolean blacklist, java.lang.String mimeFilter)
          Constructor for the CrawlLogIndexCache class.
 
Method Summary
protected  void combine(java.util.Map<java.lang.Long,java.io.File> rawfiles)
          Combine a number of crawl.log files into one Lucene index.
protected  java.util.Map<java.lang.Long,java.io.File> prepareCombine(java.util.Set<java.lang.Long> ids)
          Prepare data for combining.
 
Methods inherited from class dk.netarkivet.archive.indexserver.CombiningMultiFileBasedCache
cacheData
 
Methods inherited from class dk.netarkivet.archive.indexserver.MultiFileBasedCache
getCacheFile
 
Methods inherited from class dk.netarkivet.archive.indexserver.FileBasedCache
cache, get, getCacheDir, getIndex
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 
Methods inherited from interface dk.netarkivet.common.distribute.indexserver.JobIndexCache
getIndex
 

Constructor Detail

CrawlLogIndexCache

public CrawlLogIndexCache(java.lang.String name,
                          boolean blacklist,
                          java.lang.String mimeFilter)
Constructor for the CrawlLogIndexCache class.

Parameters:
name - The name of the CrawlLogIndexCache
blacklist - Whether mimeFilter is treated as a blacklist (true) or as a whitelist (false)
mimeFilter - A regular expression matching the MIME types to exclude (blacklist) or include (whitelist)
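
The interplay of the blacklist flag and the mimeFilter expression can be illustrated with a minimal sketch. The class MimeFilterSketch and its accepts method are hypothetical and not part of NetarchiveSuite or the deduplicator. The sketch assumes that true means blacklist and that the expression must match the whole MIME type, which the documentation above does not state.

```java
import java.util.regex.Pattern;

/** Hypothetical helper illustrating blacklist/whitelist semantics; not the real implementation. */
final class MimeFilterSketch {
    private final boolean blacklist;
    private final Pattern mimePattern;

    MimeFilterSketch(boolean blacklist, String mimeFilter) {
        this.blacklist = blacklist;
        this.mimePattern = Pattern.compile(mimeFilter);
    }

    /** Returns true if an entry with the given MIME type would be kept. */
    boolean accepts(String mimeType) {
        // Assumption: the expression must match the entire MIME type.
        boolean matches = mimePattern.matcher(mimeType).matches();
        // Blacklist: keep everything that does not match.
        // Whitelist: keep only what matches.
        return blacklist ? !matches : matches;
    }
}
```

Under these assumptions, a filter of "^text/.*" with blacklist set to true would drop text/html entries and keep image/png entries; with blacklist set to false the result is reversed.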
Method Detail

prepareCombine

protected java.util.Map<java.lang.Long,java.io.File> prepareCombine(java.util.Set<java.lang.Long> ids)
Prepare data for combining. This class overrides prepareCombine to make sure that CDX data is available.

Overrides:
prepareCombine in class CombiningMultiFileBasedCache<java.lang.Long>
Parameters:
ids - Set of IDs that will be combined.
Returns:
A map from ID to the file of data to combine, containing only the IDs for which data could be found.
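
The shape of the return value, a map holding only the IDs for which data was found, can be sketched as follows. The class PrepareCombineSketch and the lookup function are hypothetical stand-ins for whatever the underlying caches do; the sketch assumes a missing entry is signalled by null.

```java
import java.io.File;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

/** Hypothetical illustration of the "only IDs with data" contract; not the real implementation. */
final class PrepareCombineSketch {
    /**
     * Builds a map of ID to file for the IDs where lookup returns a file.
     * The lookup function is an assumption: it returns null when no data exists.
     */
    static Map<Long, File> collectAvailable(Set<Long> ids, Function<Long, File> lookup) {
        Map<Long, File> found = new HashMap<>();
        for (Long id : ids) {
            File file = lookup.apply(id);
            if (file != null) {
                // Only IDs with data end up in the result, so no value is null.
                found.put(id, file);
            }
        }
        return found;
    }
}
```

A map built this way never contains null values, which matches the requirement stated for the rawfiles parameter of combine.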

combine

protected void combine(java.util.Map<java.lang.Long,java.io.File> rawfiles)
Combine a number of crawl.log files into one Lucene index. The files of this index are stored gzipped in the directory returned by getCacheFile().

Specified by:
combine in class CombiningMultiFileBasedCache<java.lang.Long>
Parameters:
rawfiles - A map from job ID to the file holding that job's crawl.log contents. No null values are allowed in this map.
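
The final step, gzipping each file of the finished index into the cache directory, might look roughly like the sketch below. It uses only the Java standard library. The class GzipIndexSketch, the method gzipFiles and the ".gz" suffix are assumptions made for illustration, and the Lucene indexing itself is not shown.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPOutputStream;

/** Hypothetical illustration of gzipping index files; not the real implementation. */
final class GzipIndexSketch {
    /**
     * Writes a gzipped copy of every regular file in indexDir into targetDir.
     * The ".gz" suffix is an assumption, not taken from the documentation.
     */
    static void gzipFiles(Path indexDir, Path targetDir) throws IOException {
        Files.createDirectories(targetDir);
        try (DirectoryStream<Path> files = Files.newDirectoryStream(indexDir)) {
            for (Path file : files) {
                if (!Files.isRegularFile(file)) {
                    continue; // skip subdirectories and other non-file entries
                }
                Path gz = targetDir.resolve(file.getFileName().toString() + ".gz");
                try (OutputStream out = new GZIPOutputStream(Files.newOutputStream(gz))) {
                    // Stream the file contents through the gzip compressor.
                    Files.copy(file, out);
                }
            }
        }
    }
}
```

In this sketch targetDir plays the role of the directory returned by getCacheFile(), and indexDir is a temporary directory holding the uncompressed index.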