Class CDXIndexer

  • All Implemented Interfaces:
    Indexer

    public class CDXIndexer
    extends Object
    implements Indexer
    Class for creating CDX indexes from archive files.
    • Field Detail

      • warcAdapter

        protected final org.archive.wayback.resourcestore.indexer.WARCRecordToSearchResultAdapter warcAdapter
        Adapter that converts WARC records into search results.
      • arcAdapter

        protected final org.archive.wayback.resourcestore.indexer.ARCRecordToSearchResultAdapter arcAdapter
        Adapter that converts ARC records into search results.
      • cdxLineCreator

        protected final org.archive.wayback.resourceindex.cdx.SearchResultToCDXLineAdapter cdxLineCreator
        The CDX line creator, which creates the CDX lines from the search results produced by the record adapters.
      • urlCanonicalizer

        protected final org.archive.wayback.UrlCanonicalizer urlCanonicalizer
    • Constructor Detail

      • CDXIndexer

        public CDXIndexer()
        Constructor.
    • Method Detail

      • index

        public List<String> index(InputStream archiveInputStream,
                                  String archiveName)
                           throws IOException
        Indexes an archive file read from the given input stream.
        Parameters:
        archiveInputStream - An input stream for reading the archive file.
        archiveName - The name of the archive file.
        Returns:
        The extracted CDX lines from the file.
        Throws:
        IOException
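
        A minimal sketch of calling a method with this signature. The StreamIndexer interface and the fake implementation are hypothetical stand-ins so the example runs without the real class; only the index(InputStream, String) signature is taken from the documentation above. The documentation does not say whether index closes the stream, so the sketch closes it itself with try-with-resources.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class IndexStreamSketch {

    /** Hypothetical stand-in that mirrors the documented index(InputStream, String) signature. */
    interface StreamIndexer {
        List<String> index(InputStream archiveInputStream, String archiveName) throws IOException;
    }

    /** Opens the stream in try-with-resources so it is closed even if indexing fails. */
    static List<String> indexBytes(StreamIndexer indexer, byte[] content, String archiveName)
            throws IOException {
        try (InputStream in = new ByteArrayInputStream(content)) {
            return indexer.index(in, archiveName);
        }
    }

    public static void main(String[] args) throws IOException {
        // Fake indexer used only so the sketch runs without the real class:
        // it emits one placeholder line per line of input, prefixed with the archive name.
        StreamIndexer fake = (in, name) -> {
            List<String> lines = new ArrayList<>();
            BufferedReader reader =
                    new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(name + " " + line);
            }
            return lines;
        };

        byte[] content = "record-1\nrecord-2\n".getBytes(StandardCharsets.UTF_8);
        for (String cdxLine : indexBytes(fake, content, "example.warc")) {
            System.out.println(cdxLine);
        }
    }
}
```

        With the real class, the fake lambda would presumably be replaced by a method reference such as new CDXIndexer()::index, and the byte array by a stream over an actual archive file.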
      • indexFile

        public List<String> indexFile(File archiveFile)
                               throws IOException
        Create the CDX indexes from an archive file.
        Specified by:
        indexFile in interface Indexer
        Parameters:
        archiveFile - The archive file.
        Returns:
        The CDX lines for the records in the archive file.
        Throws:
        IOException - If it fails to read the archive file.
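
        A minimal sketch of turning the returned lines into a CDX file. The Indexer interface below is a stand-in that declares only the indexFile method documented above, and writeSortedCdx is a hypothetical helper. Sorting the lines before writing is an assumption about how the index will be consumed, not something this documentation states.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class IndexFileSketch {

    /** Stand-in for the Indexer interface; only indexFile is taken from the documentation. */
    interface Indexer {
        List<String> indexFile(File archiveFile) throws IOException;
    }

    /** Hypothetical helper: indexes the archive, sorts the lines, and writes them to cdxFile. */
    static Path writeSortedCdx(Indexer indexer, File archiveFile, Path cdxFile)
            throws IOException {
        // Copy first, because the returned list may be unmodifiable.
        List<String> lines = new ArrayList<>(indexer.indexFile(archiveFile));
        Collections.sort(lines);
        return Files.write(cdxFile, lines, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Fake indexer with placeholder lines, so the sketch runs without the real class.
        Indexer fake = file -> Arrays.asList("placeholder-line-b", "placeholder-line-a");

        Path cdxFile = Files.createTempFile("example", ".cdx");
        writeSortedCdx(fake, new File("example.warc"), cdxFile);
        System.out.println("Wrote " + cdxFile);
    }
}
```

        Since CDXIndexer implements Indexer, an instance of it could be passed where the fake is used here, assuming the real interface is on the classpath in place of the stand-in.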
      • extractCDXLines

        protected List<String> extractCDXLines(org.archive.io.ArchiveReader reader)
        Extracts the CDX lines from an ArchiveReader.
        Parameters:
        reader - The ArchiveReader which is actively reading an archive file (e.g. WARC).
        Returns:
        The list of CDX index lines for the records of the archive in the reader.