Package dk.netarkivet.wayback.hadoop
Class CDXIndexer
java.lang.Object
    dk.netarkivet.wayback.hadoop.CDXIndexer
Field Summary
Fields
Modifier and Type, Field, Description
protected org.archive.wayback.resourcestore.indexer.ARCRecordToSearchResultAdapter
arcAdapter
protected org.archive.wayback.resourceindex.cdx.SearchResultToCDXLineAdapter
cdxLineCreator
The CDX line creator, which creates the CDX lines from the WARC records.
protected org.archive.wayback.UrlCanonicalizer
urlCanonicalizer
protected org.archive.wayback.resourcestore.indexer.WARCRecordToSearchResultAdapter
warcAdapter
The WARC record searcher.
-
Constructor Summary
Constructors
Constructor, Description
CDXIndexer()
Constructor.
-
Method Summary
Modifier and Type, Method, Description
protected List<String>
extractCDXLines(org.archive.io.ArchiveReader reader)
Method for extracting the CDX lines from an ArchiveReader.
List<String>
index(InputStream archiveInputStream, String archiveName)
Index the given archive file.
List<String>
indexFile(File archiveFile)
Create the CDX indexes from an archive file.
-
-
-
Field Detail
-
warcAdapter
protected final org.archive.wayback.resourcestore.indexer.WARCRecordToSearchResultAdapter warcAdapter
The WARC record searcher.
-
arcAdapter
protected final org.archive.wayback.resourcestore.indexer.ARCRecordToSearchResultAdapter arcAdapter
-
cdxLineCreator
protected final org.archive.wayback.resourceindex.cdx.SearchResultToCDXLineAdapter cdxLineCreator
The CDX line creator, which creates the CDX lines from the WARC records.
-
urlCanonicalizer
protected final org.archive.wayback.UrlCanonicalizer urlCanonicalizer
-
-
Method Detail
-
index
public List<String> index(InputStream archiveInputStream, String archiveName) throws IOException
Index the given archive file.
Parameters:
archiveInputStream - An input stream for the given file.
archiveName - The name of the given file.
Returns:
The CDX lines extracted from the file.
Throws:
IOException
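A minimal sketch of how a caller might feed a file on disk to `index`, assuming the caller is responsible for opening and closing the stream. So that the sketch compiles without the NetarchiveSuite and wayback jars, `StreamIndexer` is a hypothetical stand-in interface that mirrors the documented signature; with the real class on the classpath, `new CDXIndexer()::index` could be passed in its place. The class and method names `IndexStreamSketch` and `indexPath` are likewise illustrative only.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

final class IndexStreamSketch {

    /** Hypothetical stand-in mirroring the documented signature of CDXIndexer.index. */
    @FunctionalInterface
    public interface StreamIndexer {
        List<String> index(InputStream archiveInputStream, String archiveName) throws IOException;
    }

    /**
     * Opens the archive, passes the stream and the bare file name to the indexer,
     * and closes the stream afterwards, whether or not indexing succeeded.
     */
    public static List<String> indexPath(StreamIndexer indexer, Path archive) throws IOException {
        try (InputStream in = new BufferedInputStream(Files.newInputStream(archive))) {
            // Assumption: the archive name is simply the file name, as the parameter description suggests.
            return indexer.index(in, archive.getFileName().toString());
        }
    }
}
```

The try-with-resources block is a defensive choice: this page does not say whether `index` closes the stream itself, and closing an already closed stream is harmless.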
-
indexFile
public List<String> indexFile(File archiveFile) throws IOException
Create the CDX indexes from an archive file.
Specified by:
indexFile in interface Indexer
Parameters:
archiveFile - The archive file.
Returns:
The CDX lines for the records in the archive file.
Throws:
IOException - If it fails to read the archive file.
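A sketch of one way the returned lines might be used: indexing several archive files, merging the results, and writing them to a single CDX file. `FileIndexer` is a hypothetical stand-in that mirrors the documented `indexFile` signature so the sketch compiles on its own; with the real class available, `new CDXIndexer()::indexFile` could be passed instead. The sorting step is an assumption about the consumer (lookup tools for CDX files generally expect sorted input) and is not something this page states.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class IndexFilesSketch {

    /** Hypothetical stand-in mirroring the documented signature of CDXIndexer.indexFile. */
    @FunctionalInterface
    interface FileIndexer {
        List<String> indexFile(File archiveFile) throws IOException;
    }

    /** Indexes every archive and returns all CDX lines in sorted order. */
    static List<String> indexAll(FileIndexer indexer, List<File> archives) throws IOException {
        List<String> all = new ArrayList<>();
        for (File archive : archives) {
            // An IOException from any single file aborts the whole batch.
            all.addAll(indexer.indexFile(archive));
        }
        // Natural String order; for ASCII-only lines this matches byte order.
        Collections.sort(all);
        return all;
    }

    /** Writes the lines to the target file, one CDX line per row. */
    static void writeCdx(List<String> cdxLines, Path target) throws IOException {
        Files.write(target, cdxLines, StandardCharsets.UTF_8);
    }
}
```

Letting the `IOException` propagate keeps the sketch short; a batch job that must tolerate unreadable files would catch it per file and record the failure instead.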
-
extractCDXLines
protected List<String> extractCDXLines(org.archive.io.ArchiveReader reader)
Method for extracting the CDX lines from an ArchiveReader.
Parameters:
reader - The ArchiveReader which is actively reading an archive file (e.g. WARC).
Returns:
The list of CDX index lines for the records of the archive in the reader.
-
-