Class CDXIndexer

  • All Implemented Interfaces:
    Indexer

    public class CDXIndexer
    extends java.lang.Object
    implements Indexer
    Class for creating CDX indexes from archive files.
    • Field Summary

      Fields 
      Modifier and Type Field Description
      protected org.archive.wayback.resourcestore.indexer.ARCRecordToSearchResultAdapter arcAdapter  
      protected org.archive.wayback.resourceindex.cdx.SearchResultToCDXLineAdapter cdxLineCreator
      The CDX line creator, which creates the CDX lines from the WARC records.
      protected org.archive.wayback.UrlCanonicalizer urlCanonicalizer  
      protected org.archive.wayback.resourcestore.indexer.WARCRecordToSearchResultAdapter warcAdapter
      The adapter that converts WARC records to search results.
    • Constructor Summary

      Constructors 
      Constructor Description
      CDXIndexer()
      Constructor.
    • Method Summary

      All Methods Instance Methods Concrete Methods 
      Modifier and Type Method Description
      protected java.util.List<java.lang.String> extractCDXLines​(org.archive.io.ArchiveReader reader)
      Extracts the CDX lines from an ArchiveReader.
      java.util.List<java.lang.String> index​(java.io.InputStream archiveInputStream, java.lang.String archiveName)
      Index the given archive file.
      java.util.List<java.lang.String> indexFile​(java.io.File archiveFile)
      Create the CDX indexes from an archive file.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • warcAdapter

        protected final org.archive.wayback.resourcestore.indexer.WARCRecordToSearchResultAdapter warcAdapter
        The adapter that converts WARC records to search results.
      • arcAdapter

        protected final org.archive.wayback.resourcestore.indexer.ARCRecordToSearchResultAdapter arcAdapter
      • cdxLineCreator

        protected final org.archive.wayback.resourceindex.cdx.SearchResultToCDXLineAdapter cdxLineCreator
        The CDX line creator, which creates the CDX lines from the WARC records.
      • urlCanonicalizer

        protected final org.archive.wayback.UrlCanonicalizer urlCanonicalizer
    • Constructor Detail

    • Method Detail

      • index

        public java.util.List<java.lang.String> index​(java.io.InputStream archiveInputStream,
                                                      java.lang.String archiveName)
                                               throws java.io.IOException
        Index the given archive file.
        Parameters:
        archiveInputStream - An input stream for the archive file.
        archiveName - The name of the given file.
        Returns:
        The extracted CDX lines from the file.
        Throws:
        java.io.IOException
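        The stream-based entry point is useful when the archive is not a local file. The following is a minimal sketch of a caller, not part of this class: StreamIndexer is a hypothetical stand-in that only mirrors the documented signature of index, and indexBytes is an illustrative helper. It assumes the caller is responsible for closing the stream, which this page does not state.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.List;

public class StreamIndexingSketch {

    /** Hypothetical stand-in that mirrors the documented signature of CDXIndexer.index. */
    public interface StreamIndexer {
        List<String> index(InputStream archiveInputStream, String archiveName) throws IOException;
    }

    /**
     * Indexes archive content held in memory.
     * The stream is closed here on the assumption that the indexer does not close it.
     */
    public static List<String> indexBytes(StreamIndexer indexer, byte[] archiveBytes, String archiveName)
            throws IOException {
        try (InputStream in = new ByteArrayInputStream(archiveBytes)) {
            // The archive name is passed alongside the stream, as the method signature requires.
            return indexer.index(in, archiveName);
        }
    }
}
```

        With the real class, a method reference such as new CDXIndexer()::index could be passed as the indexer argument.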
      • indexFile

        public java.util.List<java.lang.String> indexFile​(java.io.File archiveFile)
                                                   throws java.io.IOException
        Create the CDX indexes from an archive file.
        Specified by:
        indexFile in interface Indexer
        Parameters:
        archiveFile - The archive file.
        Returns:
        The CDX lines for the records in the archive file.
        Throws:
        java.io.IOException - If it fails to read the archive file.
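        A typical use of the returned lines is to write them to a CDX file. The sketch below is illustrative only: the nested Indexer interface is a hypothetical stand-in that declares just the documented indexFile method, and writeSortedCdx is not part of this API. It sorts the lines before writing because CDX lookups generally expect a sorted index; whether indexFile already returns sorted lines is not stated here.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class CdxFileWriterSketch {

    /** Hypothetical stand-in for the Indexer interface; only indexFile is taken from the documentation. */
    public interface Indexer {
        List<String> indexFile(File archiveFile) throws IOException;
    }

    /**
     * Indexes one archive file and writes the resulting CDX lines, sorted, to cdxFile.
     * Returns the number of lines written.
     */
    public static int writeSortedCdx(Indexer indexer, File archiveFile, Path cdxFile) throws IOException {
        // Copy first, since the returned list may be unmodifiable.
        List<String> lines = new ArrayList<>(indexer.indexFile(archiveFile));
        // Natural String order; for ASCII content this matches byte order.
        Collections.sort(lines);
        Files.write(cdxFile, lines, StandardCharsets.UTF_8);
        return lines.size();
    }
}
```

        With the real class, new CDXIndexer()::indexFile could be passed as the indexer argument.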
      • extractCDXLines

        protected java.util.List<java.lang.String> extractCDXLines​(org.archive.io.ArchiveReader reader)
        Extracts the CDX lines from an ArchiveReader.
        Parameters:
        reader - The ArchiveReader which is actively reading an archive file (e.g. WARC).
        Returns:
        The list of CDX index lines for the records of the archive in the reader.