

dk.netarkivet.common.distribute.indexserver.JobIndexCache (interface):

/**
 * An interface to a cache of data for jobs.
 */
public interface JobIndexCache {
	/**
	 * Get an index for the given list of job IDs. The resulting file contains a suitably sorted list. This method
	 * should always be safe for asynchronous calling. This method may use a cached version of the file.
	 *
	 * @param jobIDs Set of job IDs to generate index for.
	 * @return An index, consisting of a file and the set this is an index for. This file must not be modified or
	 * deleted, since it is part of the cache of data.
	 */
	Index<Set<Long>> getIndex(Set<Long> jobIDs);

	/**
	 * Request an index from the indexserver. Prepare the index but don't give it to me.
	 *
	 * @param jobSet Set of job IDs to generate index for.
	 * @param harvestId Harvestdefinition associated with this set of jobs
	 */
	void requestIndex(Set<Long> jobSet, Long harvestId);
}
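
A minimal sketch of how a caller might use the two methods together: first ask for the index to be prepared, then fetch it later. The Index stand-in, its accessor names (getIndexFile, getIndexSet), the StubCache class and the file name it builds are assumptions made only to keep the example self-contained; they are not the NetarchiveSuite implementation.

```java
import java.io.File;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

public class JobIndexCacheUsageSketch {
    /** Stand-in for the real Index type; the accessor names are assumptions. */
    public interface Index<T> {
        File getIndexFile();
        T getIndexSet();
    }

    /** Same two methods as the interface quoted above. */
    public interface JobIndexCache {
        Index<Set<Long>> getIndex(Set<Long> jobIDs);
        void requestIndex(Set<Long> jobSet, Long harvestId);
    }

    /** Hypothetical in-memory stub; a real cache would talk to the indexserver. */
    public static class StubCache implements JobIndexCache {
        public final Set<Set<Long>> prepared = new HashSet<>();

        @Override
        public void requestIndex(Set<Long> jobSet, Long harvestId) {
            // Only records the request; nothing is returned to the caller.
            prepared.add(new TreeSet<>(jobSet));
        }

        @Override
        public Index<Set<Long>> getIndex(Set<Long> jobIDs) {
            final Set<Long> ids = new TreeSet<>(jobIDs);
            // Hypothetical file name; the file is never created in this sketch.
            final File file = new File("index-" + ids.hashCode() + "-cache");
            return new Index<Set<Long>>() {
                @Override public File getIndexFile() { return file; }
                @Override public Set<Long> getIndexSet() { return ids; }
            };
        }
    }

    public static void main(String[] args) {
        JobIndexCache cache = new StubCache();
        Set<Long> jobs = new TreeSet<>(Arrays.asList(1L, 2L, 3L));

        // Step 1: ask for the index to be prepared ahead of time.
        cache.requestIndex(jobs, 42L);

        // Step 2: fetch the index later; the caller must treat the file as read-only.
        Index<Set<Long>> index = cache.getIndex(jobs);
        System.out.println(index.getIndexFile() + " covers jobs " + index.getIndexSet());
    }
}
```

The point of the split is that requestIndex() lets index generation start early, while getIndex() is the call that actually hands back the file.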


All relevant implementations of the JobIndexCache are:


The types of jobs you can request are defined by the enum class dk.netarkivet.common.distribute.indexserver.RequestType:

public enum RequestType {
	CDX, DEDUP_CRAWL_LOG, FULL_CRAWL_LOG
}


The cache files are named by the MultiFileBasedCache#getCacheFile() method:

/**
* Get the filename for the file containing the combined data for a set of IDs.
*
* @param ids A set of IDs to generate a filename for
* @return A filename that uniquely identifies this set of IDs within the cache. It is considered acceptable to have
* collisions at a likelihood the order of 1/2^128 (i.e. use MD5 to abbreviate long lists).
*/
public File getCacheFile(Set<T> ids) {
	String fileName = FileUtils.generateFileNameFromSet(ids, "-cache");
	return new File(getCacheDir(), fileName);
}
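
The Javadoc above only states the contract: the name must identify the set of IDs, and long lists may be abbreviated with MD5. The following is a hedged sketch of one way such a name could be built. It is not the actual FileUtils.generateFileNameFromSet implementation; the class name, the MAX_IDS_IN_NAME threshold and the exact name layout are assumptions.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

public class CacheFileNameSketch {
    /** Hypothetical threshold: longer ID lists are abbreviated with a digest. */
    static final int MAX_IDS_IN_NAME = 4;

    /** Builds a deterministic file name for a set of IDs. */
    public static String fileNameFor(Set<Long> ids, String suffix) {
        List<Long> sorted = new ArrayList<>(ids);
        // Sorting makes the name independent of the set's iteration order.
        Collections.sort(sorted);

        List<String> parts = new ArrayList<>();
        for (Long id : sorted) {
            parts.add(id.toString());
        }
        String all = String.join("-", parts);
        if (sorted.size() <= MAX_IDS_IN_NAME) {
            return all + suffix;
        }
        // Keep a readable prefix and replace the remainder with an MD5 of the full list.
        String head = String.join("-", parts.subList(0, MAX_IDS_IN_NAME));
        return head + "-" + md5Hex(all) + suffix;
    }

    /** Returns the MD5 digest of the string as 32 lowercase hex characters. */
    static String md5Hex(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(s.getBytes(StandardCharsets.UTF_8));
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            // MD5 support is mandatory for every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }
}
```

Because the digest is 128 bits, two different long ID lists sharing the same prefix collide only with a likelihood on the order of 1/2^128, which matches the tolerance stated in the Javadoc.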

1 Comment

  1. I have started a NextGenIndexing branch (https://github.com/netarchivesuite/netarchivesuite/tree/NextGenIndexing) with some minor recommended changes to the code, primarily additional logging.

    Some further comments:

    • DedupCrawlLogIndexCache uses its own CDXIndexCache object, which could be a synchronization issue.
    • All these .working files should be removed after the cache file or cache directory has been created.
    • The class hierarchy seems quite complicated:
      • We may not need all of the following classes: CombiningMultiFileBasedCache, MultiFileBasedCache, RawDataCache, RawMetadataCache, etc.
    • Remove the interface RawDataCache. There is only one implementation, namely RawMetadataCache.
    • Look at the FIXMEs in the FileBasedCache#cache() method.
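
    The point about .working files could be handled with a try/finally block. The sketch below is an illustration only: it assumes the .working file is a temporary output file next to the cache file, which may differ from its role in the actual FileBasedCache code, and the class and method names are hypothetical.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;

public class WorkingFileCleanupSketch {
    /** Writes content to cacheFile via a sibling ".working" file that never outlives the call. */
    public static void writeCacheFile(File cacheFile, String content) throws IOException {
        File working = new File(cacheFile.getAbsolutePath() + ".working");
        try {
            // Build the data in the .working file first.
            Files.write(working.toPath(), content.getBytes(StandardCharsets.UTF_8));
            // Publish it under the final cache file name.
            Files.move(working.toPath(), cacheFile.toPath(), StandardCopyOption.REPLACE_EXISTING);
        } finally {
            // Runs on success and on failure, so no stale .working file is left behind.
            Files.deleteIfExists(working.toPath());
        }
    }
}
```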