Package dk.netarkivet.wayback.hadoop
Class CDXIndexer
java.lang.Object
    dk.netarkivet.wayback.hadoop.CDXIndexer
All Implemented Interfaces: Indexer
public class CDXIndexer extends java.lang.Object implements Indexer
Class for creating CDX indexes from archive files.
-
Field Summary
Modifier and Type, Field, Description
protected org.archive.wayback.resourcestore.indexer.ARCRecordToSearchResultAdapter arcAdapter
protected org.archive.wayback.resourceindex.cdx.SearchResultToCDXLineAdapter cdxLineCreator
    The CDX line creator, which creates the CDX lines from the WARC records.
protected org.archive.wayback.UrlCanonicalizer urlCanonicalizer
protected org.archive.wayback.resourcestore.indexer.WARCRecordToSearchResultAdapter warcAdapter
    The WARC record searcher.
-
Constructor Summary
Constructor, Description
CDXIndexer()
    Constructor.
-
Method Summary
Modifier and Type, Method, Description
protected java.util.List<java.lang.String> extractCDXLines(org.archive.io.ArchiveReader reader)
    Method for extracting the CDX lines from an ArchiveReader.
java.util.List<java.lang.String> index(java.io.InputStream archiveInputStream, java.lang.String archiveName)
    Index the given archive file.
java.util.List<java.lang.String> indexFile(java.io.File archiveFile)
    Create the CDX indexes from an archive file.
-
Field Detail
-
warcAdapter
protected final org.archive.wayback.resourcestore.indexer.WARCRecordToSearchResultAdapter warcAdapter
The WARC record searcher.
-
arcAdapter
protected final org.archive.wayback.resourcestore.indexer.ARCRecordToSearchResultAdapter arcAdapter
-
cdxLineCreator
protected final org.archive.wayback.resourceindex.cdx.SearchResultToCDXLineAdapter cdxLineCreator
The CDX line creator, which creates the CDX lines from the WARC records.
-
urlCanonicalizer
protected final org.archive.wayback.UrlCanonicalizer urlCanonicalizer
-
Constructor Detail
-
CDXIndexer
public CDXIndexer()
Constructor.
-
Method Detail
-
index
public java.util.List<java.lang.String> index(java.io.InputStream archiveInputStream, java.lang.String archiveName) throws java.io.IOException
Index the given archive file.
Parameters:
archiveInputStream - An input stream to the given file.
archiveName - The name of the given file.
Returns:
The extracted CDX lines from the file.
Throws:
java.io.IOException
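A minimal usage sketch for index, under stated assumptions. The real class needs the NetarchiveSuite and OpenWayback libraries on the classpath, so the block substitutes a hypothetical stand-in interface, CdxSource, that copies only the documented index signature. The helper name writeCdx and the UTF-8 output encoding are illustrative choices, not part of the documented API.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

final class IndexUsageSketch {

    /** Hypothetical stand-in that mirrors the documented index(...) signature. */
    interface CdxSource {
        List<String> index(InputStream archiveInputStream, String archiveName) throws IOException;
    }

    /**
     * Opens the archive, hands the stream and the bare file name to the indexer,
     * and writes the returned CDX lines to cdxOut, one per line.
     * Returns the number of lines written.
     */
    static int writeCdx(CdxSource indexer, Path archive, Path cdxOut) throws IOException {
        // try-with-resources closes the stream even if indexing fails
        try (InputStream in = Files.newInputStream(archive)) {
            List<String> lines = indexer.index(in, archive.getFileName().toString());
            Files.write(cdxOut, lines, StandardCharsets.UTF_8);
            return lines.size();
        }
    }
}
```

With the real library available, a call such as writeCdx(new CDXIndexer()::index, archive, out) should fit this shape, since the method reference matches the documented signature.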
-
indexFile
public java.util.List<java.lang.String> indexFile(java.io.File archiveFile) throws java.io.IOException
Create the CDX indexes from an archive file.
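As an illustration of how indexFile might be used over several archives, the sketch below merges the lines returned for each file and sorts them, since merged CDX indexes are commonly kept sorted for lookup. FileIndexer is a hypothetical stand-in that copies only the documented indexFile signature, and indexAll is an assumed helper name; whether sorting belongs at this step depends on the surrounding pipeline.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class BatchIndexSketch {

    /** Hypothetical stand-in that mirrors the documented indexFile(...) signature. */
    interface FileIndexer {
        List<String> indexFile(File archiveFile) throws IOException;
    }

    /** Indexes every archive file and returns all CDX lines merged and sorted. */
    static List<String> indexAll(FileIndexer indexer, List<File> archives) throws IOException {
        List<String> merged = new ArrayList<>();
        for (File archive : archives) {
            merged.addAll(indexer.indexFile(archive));
        }
        // Natural String order; for ASCII-only lines this matches byte order.
        Collections.sort(merged);
        return merged;
    }
}
```

With the real library on the classpath, new CDXIndexer()::indexFile should be usable as the FileIndexer argument.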
-
extractCDXLines
protected java.util.List<java.lang.String> extractCDXLines(org.archive.io.ArchiveReader reader)
Method for extracting the CDX lines from an ArchiveReader.
Parameters:
reader - The ArchiveReader which is actively reading an archive file (e.g. WARC).
Returns:
The list of CDX index lines for the records of the archive in the reader.
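To make the returned strings concrete, here is a small sketch that splits one CDX line into named fields. It assumes the space-separated nine-field layout described by the CDX legend "N b a m s k r V g"; the actual layout depends on how cdxLineCreator is configured, so the field names and the parse helper below are illustrative assumptions rather than part of this class.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class CdxLineSketch {

    // Assumed field order, following the CDX legend "N b a m s k r V g".
    private static final String[] FIELDS = {
        "urlKey", "timestamp", "originalUrl", "mimeType", "statusCode",
        "digest", "redirect", "offset", "fileName"
    };

    /** Splits a single CDX line on spaces and maps each value to its assumed field name. */
    static Map<String, String> parse(String cdxLine) {
        String[] parts = cdxLine.trim().split(" ");
        if (parts.length != FIELDS.length) {
            throw new IllegalArgumentException(
                "Expected " + FIELDS.length + " fields but found " + parts.length);
        }
        Map<String, String> record = new LinkedHashMap<>();
        for (int i = 0; i < FIELDS.length; i++) {
            record.put(FIELDS[i], parts[i]);
        }
        return record;
    }
}
```

Lines that do not match the assumed layout are rejected with an exception rather than being partially parsed.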
-