Class RawMetadataCache

  • All Implemented Interfaces:
    RawDataCache
    Direct Known Subclasses:
    CDXDataCache, CrawlLogDataCache

    public class RawMetadataCache
    extends FileBasedCache<Long>
    implements RawDataCache
    An implementation of RawDataCache specialized for data extracted from metadata files. It uses regular expressions to match the URL and mime-type of ARC entries, selecting only the kind of metadata that is wanted.
    • Field Detail

      • MATCH_ALL_PATTERN

        public static final Pattern MATCH_ALL_PATTERN
        A regular expression object that matches everything.
    • Constructor Detail

      • RawMetadataCache

        public RawMetadataCache(String prefix,
                                Pattern urlMatcher,
                                Pattern mimeMatcher)
        Create a new RawMetadataCache. For a given job ID, this will fetch and cache selected content from that job's metadata files (<ID>-metadata-[0-9]+.arc). Every entry in a metadata file that matches both patterns will be returned. The returned data does not directly indicate which file it came from, though parts intrinsic to the particular format might.
        Parameters:
        prefix - A prefix that distinguishes this cache's files from those of other caches. It is used to create a directory, so it must contain only characters that are legal in directory names.
        urlMatcher - A pattern for matching URLs of the desired entries. If null, a .* pattern will be used.
        mimeMatcher - A pattern for matching mime-types of the desired entries. If null, a .* pattern will be used.
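
        The selection rule described above can be sketched with plain java.util.regex, without the cache itself. This is an illustrative sketch only, not the class's actual code: the class name MetadataSelectionSketch, its methods, and the sample URL are hypothetical, the dot before "arc" is escaped here although the description writes it unescaped, and whole-string matching (matches()) is an assumption, since the description does not say whether the patterns are applied with matches() or find().

```java
import java.util.regex.Pattern;

public class MetadataSelectionSketch {
    // Hypothetical stand-in for a pattern that matches everything.
    static final Pattern MATCH_ALL = Pattern.compile(".*");

    // Pattern for the metadata file names of one job: <ID>-metadata-[0-9]+.arc
    // (assumption: the dot before "arc" is meant literally, so it is escaped).
    static Pattern metadataFilePattern(long jobId) {
        return Pattern.compile(jobId + "-metadata-[0-9]+\\.arc");
    }

    // A null pattern falls back to ".*", as the parameter descriptions state.
    static Pattern orMatchAll(Pattern p) {
        return p == null ? MATCH_ALL : p;
    }

    // An entry is selected only if BOTH its URL and its mime-type match.
    // Assumption: the whole string must match.
    static boolean isSelected(String url, String mime,
                              Pattern urlMatcher, Pattern mimeMatcher) {
        return orMatchAll(urlMatcher).matcher(url).matches()
            && orMatchAll(mimeMatcher).matcher(mime).matches();
    }

    public static void main(String[] args) {
        // Hypothetical entry URL and patterns, for illustration only.
        Pattern urlMatcher = Pattern.compile("metadata://[^/]*/crawl/logs/crawl\\.log.*");
        Pattern mimeMatcher = Pattern.compile("text/plain");
        String url = "metadata://example.org/crawl/logs/crawl.log?job=42";

        System.out.println(isSelected(url, "text/plain", urlMatcher, mimeMatcher));
        System.out.println(isSelected(url, "image/png", urlMatcher, mimeMatcher));
        System.out.println(isSelected(url, "image/png", null, null));
        System.out.println(metadataFilePattern(42L).matcher("42-metadata-1.arc").matches());
    }
}
```

        In this sketch, passing null for both matchers selects every entry, and a mismatch on either the URL or the mime-type is enough to exclude an entry.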