public abstract class CrawlLogIndexCache extends CombiningMultiFileBasedCache<Long> implements JobIndexCache
Inherited fields: rawcache, cacheDir
Constructor and Description |
---|
CrawlLogIndexCache(String name, boolean blacklist, String mimeFilter): Constructor for the CrawlLogIndexCache class. |
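The constructor's blacklist and mimeFilter parameters describe a regular-expression filter on MIME types. The following is a minimal, self-contained sketch of how such a filter could be evaluated. It is not taken from the CrawlLogIndexCache source: the class `MimeFilterSketch` and the method `accepts` are hypothetical, and it assumes that `blacklist == true` means matching types are excluded and that the expression must match the whole MIME type.

```java
import java.util.regex.Pattern;

public class MimeFilterSketch {
    /**
     * Decide whether a MIME type passes the filter.
     * Assumption: the regular expression must match the entire MIME type,
     * and blacklist == true means matching types are excluded.
     */
    public static boolean accepts(String mimeType, boolean blacklist, String mimeFilter) {
        boolean matches = Pattern.matches(mimeFilter, mimeType);
        // Blacklist: matching types are rejected.
        // Whitelist: only matching types are accepted.
        return blacklist ? !matches : matches;
    }

    public static void main(String[] args) {
        String filter = "^text/.*";
        System.out.println(accepts("text/html", true, filter));  // blacklist on text/*
        System.out.println(accepts("image/png", true, filter));  // blacklist on text/*
        System.out.println(accepts("text/html", false, filter)); // whitelist on text/*
    }
}
```

With a blacklist, the filter names what to leave out of the index. With a whitelist, it names the only types to keep.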
Modifier and Type | Method and Description |
---|---|
protected void | combine(Map<Long,File> rawfiles): Combine a number of crawl.log files into one Lucene index. |
protected static DigestIndexer | createStandardIndexer(String indexLocation): Create standard deduplication indexer. |
protected static File | getSortedCDX(File cdxFile): Get a sorted, temporary CDX file corresponding to the given CDX file. |
protected static File | getSortedCrawlLog(File file): Get a sorted, temporary crawl.log file from an unsorted one. |
protected static void | indexFile(Long id, File crawllogfile, File cdxfile, DigestIndexer indexer, DigestOptions options): Ingest a single crawl.log file using the corresponding CDX file to find offsets. |
protected Map<Long,File> | prepareCombine(Set<Long> ids): Prepare data for combining. |
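getSortedCrawlLog and getSortedCDX both return a sorted, temporary copy of an input file. The sketch below illustrates that general pattern using only the Java standard library. It is an illustration under stated assumptions, not the library's implementation: the class `SortedLogSketch` and its methods are hypothetical, and it sorts whole lines lexicographically in memory, whereas the real methods may sort on a specific field or use an external sort.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class SortedLogSketch {
    /** Return a sorted copy of the given lines (plain lexicographic order, an assumption). */
    public static List<String> sortLines(List<String> lines) {
        List<String> sorted = new ArrayList<>(lines);
        Collections.sort(sorted);
        return sorted;
    }

    /** Write a sorted copy of an unsorted log to a new temporary file and return its path. */
    public static Path sortedTempCopy(Path unsorted) throws IOException {
        List<String> lines = Files.readAllLines(unsorted, StandardCharsets.UTF_8);
        Path sorted = Files.createTempFile("sorted-", ".log");
        sorted.toFile().deleteOnExit(); // temporary: removed when the JVM exits
        Files.write(sorted, sortLines(lines), StandardCharsets.UTF_8);
        return sorted;
    }

    public static void main(String[] args) throws IOException {
        // Build a small unsorted input file for demonstration purposes.
        Path input = Files.createTempFile("crawl-", ".log");
        input.toFile().deleteOnExit();
        Files.write(input, Arrays.asList("b line", "a line", "c line"), StandardCharsets.UTF_8);

        Path sorted = sortedTempCopy(input);
        System.out.println(Files.readAllLines(sorted, StandardCharsets.UTF_8));
    }
}
```

The input file is left untouched and the sorted copy is a separate temporary file, which matches the "sorted, temporary" wording in the method descriptions.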
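Methods inherited from class CombiningMultiFileBasedCache: cacheData
Methods inherited from class MultiFileBasedCache: getCacheFile
Methods inherited from class FileBasedCache: cache, get, getCacheDir, getIndex
Methods inherited from class java.lang.Object: clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface JobIndexCache: getIndex, requestIndex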
public CrawlLogIndexCache(String name, boolean blacklist, String mimeFilter)
Parameters:
name - The name of the CrawlLogIndexCache.
blacklist - Whether the MIME filter is treated as a blacklist or a whitelist.
mimeFilter - A regular expression for the MIME types to exclude or include.

protected Map<Long,File> prepareCombine(Set<Long> ids)
prepareCombine in class CombiningMultiFileBasedCache<Long>
Parameters:
ids - Set of IDs that will be combined.

protected void combine(Map<Long,File> rawfiles)
combine in class CombiningMultiFileBasedCache<Long>
Parameters:
rawfiles - The map from job ID to crawl.log contents. No null values are allowed in this map.

protected static void indexFile(Long id, File crawllogfile, File cdxfile, DigestIndexer indexer, DigestOptions options)
Parameters:
id - ID of a job to ingest.
crawllogfile - The file containing the crawl.log data for the job.
cdxfile - The file containing the CDX data for the job.
options - The digesting options used.
indexer - The indexer to add to.

protected static File getSortedCDX(File cdxFile)
Parameters:
cdxFile - A CDX file.

protected static File getSortedCrawlLog(File file)
Parameters:
file - The file containing an unsorted crawl.log file.

protected static DigestIndexer createStandardIndexer(String indexLocation) throws IOException
Parameters:
indexLocation - The full path to the indexing directory.
Throws:
IOException - If unable to open the index.

Copyright © 2005–2016 The Royal Danish Library, the Danish State and University Library, the National Library of France and the Austrian National Library. All rights reserved.