dk.netarkivet.archive.indexserver
Class CrawlLogIndexCache
java.lang.Object
dk.netarkivet.archive.indexserver.FileBasedCache<java.util.Set<T>>
dk.netarkivet.archive.indexserver.MultiFileBasedCache<T>
dk.netarkivet.archive.indexserver.CombiningMultiFileBasedCache<java.lang.Long>
dk.netarkivet.archive.indexserver.CrawlLogIndexCache
- All Implemented Interfaces:
- JobIndexCache
- Direct Known Subclasses:
- DedupCrawlLogIndexCache, FullCrawlLogIndexCache
public abstract class CrawlLogIndexCache
- extends CombiningMultiFileBasedCache<java.lang.Long>
- implements JobIndexCache
A cache that serves Lucene indices of crawl logs for given job IDs.
Uses the DigestIndexer in the deduplicator software:
http://deduplicator.sourceforge.net/apidocs/is/hi/bok/deduplicator/DigestIndexer.html
Upon combination of underlying files, each file in the Lucene index is
gzipped and the compressed versions are stored in the directory given by
getCacheFile().
Each subclass determines, through its constructor call, which MIME types are
included.
Constructor Summary
CrawlLogIndexCache(java.lang.String name,
boolean blacklist,
java.lang.String mimeFilter)
Constructor for the CrawlLogIndexCache class.
Method Summary
protected void
combine(java.util.Map<java.lang.Long,java.io.File> rawfiles)
Combine a number of crawl.log files into one Lucene index.
protected java.util.Map<java.lang.Long,java.io.File>
prepareCombine(java.util.Set<java.lang.Long> ids)
Prepare data for combining.
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
CrawlLogIndexCache
public CrawlLogIndexCache(java.lang.String name,
boolean blacklist,
java.lang.String mimeFilter)
- Constructor for the CrawlLogIndexCache class.
- Parameters:
name
- The name of the CrawlLogIndexCache.
blacklist
- Whether the mimeFilter is treated as a blacklist (true) or a
whitelist (false).
mimeFilter
- A regular expression for the MIME types to exclude or include.
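The interaction between blacklist and mimeFilter can be illustrated with a small standalone sketch. It is not the NetarchiveSuite or DigestIndexer code: the class MimeFilterSketch and the helper isIncluded are hypothetical, and the sketch assumes the regular expression must match the whole MIME type string.

```java
import java.util.regex.Pattern;

public class MimeFilterSketch {

    /**
     * Hypothetical helper: decides whether an entry with the given MIME type
     * would be kept. Assumes the filter must match the entire MIME type.
     */
    static boolean isIncluded(String mimeType, String mimeFilter, boolean blacklist) {
        boolean matches = Pattern.matches(mimeFilter, mimeType);
        // Blacklist: matching types are excluded. Whitelist: only matching types are kept.
        return blacklist ? !matches : matches;
    }

    public static void main(String[] args) {
        String filter = "^text/.*";
        System.out.println(isIncluded("text/html", filter, true));   // prints false
        System.out.println(isIncluded("image/png", filter, true));   // prints true
        System.out.println(isIncluded("text/html", filter, false));  // prints true
        System.out.println(isIncluded("image/png", filter, false));  // prints false
    }
}
```

With the same filter, flipping the blacklist flag inverts which entries are kept, which is why a subclass has to fix both values together in its constructor call.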
prepareCombine
protected java.util.Map<java.lang.Long,java.io.File> prepareCombine(java.util.Set<java.lang.Long> ids)
- Prepare data for combining. This class overrides prepareCombine to
make sure that CDX data is available.
- Overrides:
prepareCombine
in class CombiningMultiFileBasedCache<java.lang.Long>
- Parameters:
ids
- Set of IDs that will be combined.
- Returns:
- Map from ID to File holding the data to combine, containing only the
IDs for which data could be found.
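The return contract above (only IDs with available data appear in the map) can be sketched as follows. This is an illustration under assumptions, not the actual implementation: the class PrepareCombineSketch, the method keepAvailable and the lookup function are hypothetical stand-ins for however the cache locates data for a job ID.

```java
import java.io.File;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

public class PrepareCombineSketch {

    /**
     * Hypothetical helper: looks up a file for every requested ID and keeps
     * only the IDs for which the lookup returned a non-null file.
     */
    static Map<Long, File> keepAvailable(Set<Long> ids, Function<Long, File> lookup) {
        Map<Long, File> found = new HashMap<>();
        for (Long id : ids) {
            File data = lookup.apply(id);
            if (data != null) {
                found.put(id, data); // IDs without data are silently left out
            }
        }
        return found;
    }
}
```

A caller can compare the key set of the result with the requested IDs to see which jobs had no data.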
combine
protected void combine(java.util.Map<java.lang.Long,java.io.File> rawfiles)
- Combine a number of crawl.log files into one Lucene index. This index
is placed as gzip files under the directory returned by getCacheFile().
- Specified by:
combine
in class CombiningMultiFileBasedCache<java.lang.Long>
- Parameters:
rawfiles
- A map from job ID to the file containing that job's crawl.log
contents. No null values are allowed in this map.
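The gzip step described for combine can be sketched with the standard library alone. This is a minimal illustration, not the class's actual code: GzipIndexSketch and gzipFiles are hypothetical names, and the ".gz" suffix on the stored files is an assumption.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.util.zip.GZIPOutputStream;

public class GzipIndexSketch {

    /**
     * Hypothetical helper: compresses every regular file in indexDir and
     * writes the result to cacheDir as <name>.gz (suffix is an assumption).
     */
    static void gzipFiles(File indexDir, File cacheDir) throws IOException {
        Files.createDirectories(cacheDir.toPath());
        File[] files = indexDir.listFiles(File::isFile);
        if (files == null) {
            throw new IOException("Not a readable directory: " + indexDir);
        }
        for (File source : files) {
            File target = new File(cacheDir, source.getName() + ".gz");
            try (InputStream in = Files.newInputStream(source.toPath());
                 OutputStream out = new GZIPOutputStream(
                         Files.newOutputStream(target.toPath()))) {
                byte[] buffer = new byte[8192];
                int read;
                // Copy the file contents through the gzip stream
                while ((read = in.read(buffer)) != -1) {
                    out.write(buffer, 0, read);
                }
            }
        }
    }
}
```

In the real class the target directory would be the one returned by getCacheFile(); here it is simply passed in as cacheDir.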