java.lang.Object
  dk.netarkivet.harvester.indexserver.FileBasedCache<java.util.Set<T>>
    dk.netarkivet.harvester.indexserver.MultiFileBasedCache<T>
      dk.netarkivet.harvester.indexserver.CombiningMultiFileBasedCache<java.lang.Long>
        dk.netarkivet.harvester.indexserver.CrawlLogIndexCache
public abstract class CrawlLogIndexCache
A cache that serves Lucene indices of crawl logs for given job IDs. It uses the DigestIndexer from the deduplicator software: http://deduplicator.sourceforge.net/apidocs/is/hi/bok/deduplicator/DigestIndexer.html

When the underlying files are combined, each file in the resulting Lucene index is gzipped, and the compressed versions are stored in the directory given by getCacheFile(). Each subclass must specify in its constructor call which MIME types are included.
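The gzip step described above can be pictured with a minimal sketch. This is not the NetarchiveSuite implementation: the class name `GzipIndexSketch`, the method `gzipFiles`, and the `.gz` suffix on the stored files are all assumptions made for illustration, and only standard `java.nio.file` and `java.util.zip` calls are used.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, not part of NetarchiveSuite.
final class GzipIndexSketch {

    /**
     * Gzips every regular file found directly in indexDir and writes the
     * compressed copy to cacheDir as "<original name>.gz" (assumed naming).
     * Returns the number of files that were compressed.
     */
    static int gzipFiles(Path indexDir, Path cacheDir) throws IOException {
        Files.createDirectories(cacheDir);
        int count = 0;
        try (DirectoryStream<Path> files = Files.newDirectoryStream(indexDir)) {
            for (Path file : files) {
                if (!Files.isRegularFile(file)) {
                    continue; // skip subdirectories and special files
                }
                Path target = cacheDir.resolve(file.getFileName().toString() + ".gz");
                try (InputStream in = Files.newInputStream(file);
                     OutputStream out = new GZIPOutputStream(Files.newOutputStream(target))) {
                    byte[] buffer = new byte[8192];
                    int read;
                    while ((read = in.read(buffer)) != -1) {
                        out.write(buffer, 0, read);
                    }
                }
                count++;
            }
        }
        return count;
    }
}
```

Compressing file by file, rather than archiving the whole directory, lets a consumer decompress each index file independently; whether the real class relies on this is not stated here.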
Field Summary |
---|
Fields inherited from class dk.netarkivet.harvester.indexserver.CombiningMultiFileBasedCache |
---|
rawcache |
Fields inherited from class dk.netarkivet.harvester.indexserver.FileBasedCache |
---|
cacheDir |
Constructor Summary | |
---|---|
CrawlLogIndexCache(java.lang.String name,
boolean blacklist,
java.lang.String mimeFilter)
Constructor for the CrawlLogIndexCache class. |
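The constructor takes a `blacklist` flag and a `mimeFilter` regular expression. How the two might interact can be sketched as follows. The class `MimeFilterSketch` and its `includes` method are hypothetical, and the sketch assumes that `true` means blacklist and that the expression must match the whole MIME type; the actual class may apply the filter differently.

```java
import java.util.regex.Pattern;

// Hypothetical illustration of a blacklist/whitelist MIME filter.
final class MimeFilterSketch {
    private final boolean blacklist;
    private final Pattern mimePattern;

    MimeFilterSketch(boolean blacklist, String mimeFilter) {
        this.blacklist = blacklist;
        this.mimePattern = Pattern.compile(mimeFilter);
    }

    /** Returns true if an entry with the given MIME type would be included. */
    boolean includes(String mimeType) {
        // Assumption: the expression has to match the entire MIME type.
        boolean matches = mimePattern.matcher(mimeType).matches();
        // Blacklist: matching types are excluded. Whitelist: only matching types pass.
        return blacklist ? !matches : matches;
    }
}
```

Under these assumptions, `new MimeFilterSketch(true, "^text/.*")` would leave out `text/html` and keep `image/png`, while passing `false` with the same expression would do the opposite.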
Method Summary | |
---|---|
protected void |
combine(java.util.Map<java.lang.Long,java.io.File> rawfiles)
Combine a number of crawl.log files into one Lucene index. |
protected static is.hi.bok.deduplicator.DigestIndexer |
createStandardIndexer(java.lang.String indexLocation)
Create standard deduplication indexer. |
protected static java.io.File |
getSortedCDX(java.io.File cdxFile)
Get a sorted, temporary CDX file corresponding to the given CDXfile. |
protected static java.io.File |
getSortedCrawlLog(java.io.File file)
Get a sorted, temporary crawl.log file from an unsorted one. |
protected static void |
indexFile(java.lang.Long id,
java.io.File crawllogfile,
java.io.File cdxfile,
is.hi.bok.deduplicator.DigestIndexer indexer,
DigestOptions options)
Ingest a single crawl.log file using the corresponding CDX file to find offsets. |
protected java.util.Map<java.lang.Long,java.io.File> |
prepareCombine(java.util.Set<java.lang.Long> ids)
Prepare data for combining. |
Methods inherited from class dk.netarkivet.harvester.indexserver.CombiningMultiFileBasedCache |
---|
cacheData |
Methods inherited from class dk.netarkivet.harvester.indexserver.MultiFileBasedCache |
---|
getCacheFile |
Methods inherited from class dk.netarkivet.harvester.indexserver.FileBasedCache |
---|
cache, get, getCacheDir, getIndex |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Methods inherited from interface dk.netarkivet.common.distribute.indexserver.JobIndexCache |
---|
getIndex, requestIndex |
Constructor Detail |
---|
public CrawlLogIndexCache(java.lang.String name, boolean blacklist, java.lang.String mimeFilter)
name - The name of the CrawlLogIndexCache.
blacklist - Whether mimeFilter is to be treated as a blacklist or a whitelist.
mimeFilter - A regular expression for the MIME types to exclude or include.
Method Detail |
---|
protected java.util.Map<java.lang.Long,java.io.File> prepareCombine(java.util.Set<java.lang.Long> ids)
prepareCombine in class CombiningMultiFileBasedCache<java.lang.Long>
ids - Set of IDs that will be combined.
protected void combine(java.util.Map<java.lang.Long,java.io.File> rawfiles)
combine in class CombiningMultiFileBasedCache<java.lang.Long>
rawfiles - The map from job ID to crawl.log contents. No null values are allowed in this map.
protected static void indexFile(java.lang.Long id, java.io.File crawllogfile, java.io.File cdxfile, is.hi.bok.deduplicator.DigestIndexer indexer, DigestOptions options)
id - ID of a job to ingest.
crawllogfile - The file containing the crawl.log data for the job.
cdxfile - The file containing the CDX data for the job.
options - The digesting options used.
indexer - The indexer to add to.
protected static java.io.File getSortedCDX(java.io.File cdxFile)
cdxFile - A CDX file.
protected static java.io.File getSortedCrawlLog(java.io.File file)
file - The file containing an unsorted crawl.log.
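As a rough picture of what producing a sorted, temporary copy of a line-oriented file involves, here is a minimal sketch. It is not the actual getSortedCrawlLog code: the class `SortedCopySketch` and both of its methods are hypothetical, and the sketch assumes the file fits in memory, is UTF-8 encoded, and is sorted by plain string order of whole lines. The real method may sort on a specific field or use an external sort.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of NetarchiveSuite.
final class SortedCopySketch {

    /** Returns a new list with the lines in natural String order. */
    static List<String> sortedLines(List<String> lines) {
        List<String> copy = new ArrayList<>(lines);
        Collections.sort(copy);
        return copy;
    }

    /** Writes a sorted copy of the given file to a temporary file and returns it. */
    static File sortedTempCopy(File unsorted) throws IOException {
        // Assumption: the whole file fits in memory and is UTF-8 encoded.
        List<String> lines = Files.readAllLines(unsorted.toPath(), StandardCharsets.UTF_8);
        File sorted = File.createTempFile("sorted-", ".tmp");
        sorted.deleteOnExit(); // temporary file, removed when the JVM exits
        Files.write(sorted.toPath(), sortedLines(lines), StandardCharsets.UTF_8);
        return sorted;
    }
}
```

The input file is left untouched, which matches the description of the result as a separate temporary file.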
protected static is.hi.bok.deduplicator.DigestIndexer createStandardIndexer(java.lang.String indexLocation) throws java.io.IOException
indexLocation - The full path to the indexing directory.
java.io.IOException - If unable to open the index.