Class CrawlLogIndexCache

  • All Implemented Interfaces:
    JobIndexCache
  • Direct Known Subclasses:
    DedupCrawlLogIndexCache, FullCrawlLogIndexCache

    public abstract class CrawlLogIndexCache
    extends CombiningMultiFileBasedCache<Long>
    implements JobIndexCache
    A cache that serves Lucene indices of crawl logs for given job IDs. Uses the DigestIndexer from the deduplicator software: http://deduplicator.sourceforge.net/apidocs/is/hi/bok/deduplicator/DigestIndexer.html
    When the underlying files are combined, each file in the resulting Lucene index is gzipped, and the compressed versions are stored in the directory given by getCacheFile(). Each subclass determines, through the arguments of its constructor call, which MIME types are included.
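    Because the cached index is stored as individually gzipped Lucene files, a consumer has to decompress them before the index can be opened with Lucene. Below is a minimal, hypothetical sketch of that unpacking step using only standard java.util.zip; the helper and its directory parameters are placeholders, not part of this class's API:

        import java.io.*;
        import java.util.zip.GZIPInputStream;

        /** Hypothetical helper: unpack a directory of gzipped Lucene index files. */
        static void gunzipIndex(File gzippedIndexDir, File targetDir) throws IOException {
            targetDir.mkdirs();
            for (File gz : gzippedIndexDir.listFiles((dir, name) -> name.endsWith(".gz"))) {
                // Strip the ".gz" suffix to restore the original Lucene file name.
                String name = gz.getName();
                File out = new File(targetDir, name.substring(0, name.length() - 3));
                try (InputStream in = new GZIPInputStream(new FileInputStream(gz));
                     OutputStream os = new FileOutputStream(out)) {
                    in.transferTo(os); // requires Java 9+
                }
            }
        }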
    • Constructor Detail

      • CrawlLogIndexCache

        public CrawlLogIndexCache(String name,
                                  boolean blacklist,
                                  String mimeFilter)
        Constructor for the CrawlLogIndexCache class.
        Parameters:
        name - The name of the CrawlLogIndexCache
        blacklist - Whether the mime filter should be treated as a blacklist (true) or a whitelist (false)
        mimeFilter - A regular expression matching the MIME types to exclude (blacklist) or include (whitelist)
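        As a hedged sketch of how a subclass might fix these parameters, the two classes below are invented for illustration; their names, cache names, and filter regexes are not taken from the real subclasses:

            /** Hypothetical blacklist cache: image responses are excluded. */
            class NoImagesCrawlLogIndexCache extends CrawlLogIndexCache {
                NoImagesCrawlLogIndexCache() {
                    // blacklist = true: entries whose MIME type matches the
                    // regex are left out of the index.
                    super("noimagescrawllogindex", true, "image/.*");
                }
            }

            /** Hypothetical whitelist cache: only text responses are indexed. */
            class TextOnlyCrawlLogIndexCache extends CrawlLogIndexCache {
                TextOnlyCrawlLogIndexCache() {
                    // blacklist = false: only entries whose MIME type matches
                    // the regex are included in the index.
                    super("textonlycrawllogindex", false, "text/.*");
                }
            }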
    • Method Detail

      • prepareCombine

        protected Map<Long, File> prepareCombine(Set<Long> ids)
        Prepare data for combining. This class overrides prepareCombine to make sure that CDX data is available.
        Overrides:
        prepareCombine in class CombiningMultiFileBasedCache<Long>
        Parameters:
        ids - Set of IDs that will be combined.
        Returns:
        A map from ID to file of the data to combine, containing only those IDs for which data could be found.
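        A hedged sketch of the override pattern described above; the cdxcache field and its cache(...) method are assumptions made for illustration, not documented API:

            // Hypothetical override pattern: ensure the auxiliary CDX data
            // exists first, then let the superclass gather the crawl.log files.
            @Override
            protected Map<Long, File> prepareCombine(Set<Long> ids) {
                for (Long id : ids) {
                    cdxcache.cache(id); // hypothetical: make CDX data available for the job
                }
                return super.prepareCombine(ids);
            }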
      • combine

        protected void combine(Map<Long, File> rawfiles)
        Combine a number of crawl.log files into one Lucene index. The index is stored as gzipped files in the directory returned by getCacheFile().
        Specified by:
        combine in class CombiningMultiFileBasedCache<Long>
        Parameters:
        rawfiles - A map from job ID to the file containing that job's crawl.log data. No null values are allowed in this map.
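        The gzip step described above can be pictured with the following minimal sketch, which uses only standard java.util.zip; the helper and its directory parameters are placeholders, not part of this class:

            import java.io.*;
            import java.util.zip.GZIPOutputStream;

            /** Hypothetical helper: gzip every file of a finished Lucene index
             *  into the cache directory, mirroring the behaviour described above. */
            static void gzipIndex(File luceneIndexDir, File cacheDir) throws IOException {
                cacheDir.mkdirs();
                for (File f : luceneIndexDir.listFiles()) {
                    File gz = new File(cacheDir, f.getName() + ".gz");
                    try (InputStream in = new FileInputStream(f);
                         OutputStream out = new GZIPOutputStream(new FileOutputStream(gz))) {
                        in.transferTo(out); // requires Java 9+
                    }
                }
            }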
      • indexFile

        protected static void indexFile(Long id,
                                        File crawllogfile,
                                        File cdxfile,
                                        DigestIndexer indexer,
                                        DigestOptions options)
        Ingest a single crawl.log file using the corresponding CDX file to find offsets.
        Parameters:
        id - ID of a job to ingest.
        crawllogfile - The file containing the crawl.log data for the job
        cdxfile - The file containing the CDX data for the job
        indexer - The indexer to add to.
        options - The digesting options used.
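        For illustration, the ingest of a single job might be sequenced as in the hedged sketch below; jobId, crawlLog, cdxFile, indexer, and options are placeholders, and whether callers must pre-sort the CDX file themselves is an assumption:

            // Hypothetical single-job ingest, deleting the temporary sorted
            // CDX file eagerly, as getSortedCDX below recommends.
            File sortedCdx = getSortedCDX(cdxFile);
            try {
                indexFile(jobId, crawlLog, sortedCdx, indexer, options);
            } finally {
                sortedCdx.delete();
            }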
      • getSortedCDX

        protected static File getSortedCDX(File cdxFile)
        Get a sorted, temporary CDX file corresponding to the given CDX file.
        Parameters:
        cdxFile - A cdxfile
        Returns:
        A temporary file containing the CDX information from the given file, sorted according to the standard CDX sorting rules. The file will be deleted when the JVM exits, but callers should attempt to delete it as soon as it is no longer needed.
      • getSortedCrawlLog

        protected static File getSortedCrawlLog(File file)
        Get a sorted, temporary crawl.log file from an unsorted one.
        Parameters:
        file - A file containing an unsorted crawl.log.
        Returns:
        A temporary file containing the entries sorted by URL. The file will be deleted when the JVM exits, but callers should attempt to delete it as soon as it is no longer needed.
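        The sorting performed by getSortedCDX and getSortedCrawlLog presumably exists so that the two files can be merged in a single linear pass, much like a merge join. The hedged sketch below illustrates the idea; the mergeSorted helper and urlOf(...) are made up for illustration, and the real ingest logic is not documented on this page:

            import java.io.*;

            /** Hypothetical merge over two URL-sorted files; urlOf(...) is a
             *  made-up extractor for the URL field of a line. */
            static void mergeSorted(File crawlLog, File cdxFile) throws IOException {
                try (BufferedReader log = new BufferedReader(new FileReader(getSortedCrawlLog(crawlLog)));
                     BufferedReader cdx = new BufferedReader(new FileReader(getSortedCDX(cdxFile)))) {
                    String logLine = log.readLine();
                    String cdxLine = cdx.readLine();
                    while (logLine != null && cdxLine != null) {
                        int cmp = urlOf(logLine).compareTo(urlOf(cdxLine));
                        if (cmp == 0) {
                            // Same URL in both files: combine the crawl.log entry
                            // with the offset information from the CDX line here.
                            logLine = log.readLine();
                            cdxLine = cdx.readLine();
                        } else if (cmp < 0) {
                            logLine = log.readLine(); // crawl.log entry with no CDX match
                        } else {
                            cdxLine = cdx.readLine(); // CDX entry with no crawl.log match
                        }
                    }
                }
            }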
      • createStandardIndexer

        protected static DigestIndexer createStandardIndexer(String indexLocation)
                                                      throws IOException
        Create a standard deduplication indexer.
        Parameters:
        indexLocation - The full path to the indexing directory
        Returns:
        the created deduplication indexer.
        Throws:
        IOException - If unable to open the index.
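        A hedged usage sketch follows; the index location is illustrative, rawfiles and options are placeholders, cdxFileFor(...) is a made-up lookup, and how the indexer is flushed or closed afterwards depends on the DigestIndexer API, which this page does not document:

            // Hypothetical pipeline: open one indexer, then ingest each job into it.
            DigestIndexer indexer = createStandardIndexer("/tmp/crawllog-index"); // path is illustrative
            for (Map.Entry<Long, File> entry : rawfiles.entrySet()) {
                // cdxFileFor(...) is a made-up helper locating the job's CDX file.
                indexFile(entry.getKey(), entry.getValue(), cdxFileFor(entry.getKey()), indexer, options);
            }
            // Closing or optimizing the indexer is omitted: that step depends on
            // the DigestIndexer API, which is not documented on this page.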