dk.netarkivet.wayback.batch
Interface DeduplicateToCDXAdapterInterface

All Known Implementing Classes:
DeduplicateToCDXAdapter

public interface DeduplicateToCDXAdapterInterface

Interface describing a class which can be used to convert duplicate records in a crawl log to wayback-compatible cdx records.


Method Summary
 java.lang.String adaptLine(java.lang.String line)
          Takes a deduplicate line from a crawl log and converts it to a line in a cdx file suitable for searching in wayback.
 void adaptStream(java.io.InputStream is, java.io.OutputStream os)
          Scans an input stream from a crawl log and converts all lines containing deduplicate information to cdx records which it outputs to an output stream.
 

Method Detail

adaptLine

java.lang.String adaptLine(java.lang.String line)
Takes a deduplicate line from a crawl log and converts it to a line in a cdx file suitable for searching in wayback. The target url in the line is canonicalized by this method. The type of canonicalization is determined by the default canonicalizer from the wayback settings.xml file. If the input String is not a crawl-log duplicate line, null is returned.

Parameters:
line - a line from a crawl log
Returns:
a line for a cdx file or null if the input is not a duplicate line
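
Because adaptLine returns null for any input that is not a duplicate line, callers need a null check before using the result. The following is a minimal sketch of that calling pattern, not code from the library. The LineAdapter interface is a local mirror of the documented signature so that the example compiles on its own, and the stub adapter's "duplicate:" test and "cdx " output prefix are placeholder assumptions, not the real crawl-log or CDX formats.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class AdaptLineSketch {
    /** Local mirror of the documented adaptLine signature (assumption: used here only so the sketch is self-contained). */
    interface LineAdapter {
        String adaptLine(String line);
    }

    /** Returns the converted lines, skipping every input for which the adapter returns null. */
    static List<String> collectCdxLines(LineAdapter adapter, List<String> crawlLogLines) {
        List<String> cdxLines = new ArrayList<>();
        for (String line : crawlLogLines) {
            String cdx = adapter.adaptLine(line);
            if (cdx != null) { // null means "not a duplicate line"
                cdxLines.add(cdx);
            }
        }
        return cdxLines;
    }

    public static void main(String[] args) {
        // Hypothetical stand-in adapter: the marker check and the output format are placeholders.
        LineAdapter stub = line -> line.contains("duplicate:") ? "cdx " + line : null;
        List<String> input = Arrays.asList(
                "first entry duplicate:\"some-file,123\"",
                "second entry without the marker");
        for (String cdx : collectCdxLines(stub, input)) {
            System.out.println(cdx);
        }
    }
}
```

With the real library on the classpath, an instance of an implementing class such as DeduplicateToCDXAdapter would take the place of the stub.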

adaptStream

void adaptStream(java.io.InputStream is,
                 java.io.OutputStream os)
Scans an input stream from a crawl log and converts all lines containing deduplicate information to cdx records which it outputs to an output stream.

Parameters:
is - the input stream
os - the output stream
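
The stream variant can be exercised with in-memory streams. The sketch below shows one plausible way the documented behaviour could be structured: read the input line by line and write out only the lines that adaptLine converts. It is an illustration under stated assumptions, not the actual DeduplicateToCDXAdapter code. The Adapter interface is a local mirror of the two documented signatures, StubAdapter and convert are hypothetical names, and the UTF-8 encoding, the "duplicate:" marker test, and the "cdx " prefix are placeholders.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class AdaptStreamSketch {
    /** Local mirror of the two documented method signatures (assumption: for self-containment only). */
    interface Adapter {
        String adaptLine(String line);
        void adaptStream(InputStream is, OutputStream os);
    }

    /** Hypothetical stand-in implementation; it does not produce real CDX records. */
    static class StubAdapter implements Adapter {
        @Override
        public String adaptLine(String line) {
            // Placeholder rule: treat any line containing "duplicate:" as a duplicate line.
            return line.contains("duplicate:") ? "cdx " + line : null;
        }

        @Override
        public void adaptStream(InputStream is, OutputStream os) {
            try {
                BufferedReader reader =
                        new BufferedReader(new InputStreamReader(is, StandardCharsets.UTF_8));
                String line;
                while ((line = reader.readLine()) != null) {
                    String cdx = adaptLine(line);
                    if (cdx != null) { // lines that are not duplicate lines are skipped
                        os.write((cdx + "\n").getBytes(StandardCharsets.UTF_8));
                    }
                }
                os.flush();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }

    /** Runs adaptStream over an in-memory crawl log and returns everything written to the output stream. */
    static String convert(Adapter adapter, String crawlLog) {
        ByteArrayOutputStream os = new ByteArrayOutputStream();
        adapter.adaptStream(
                new ByteArrayInputStream(crawlLog.getBytes(StandardCharsets.UTF_8)), os);
        return new String(os.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String log = "entry one duplicate:\"some-file,123\"\nentry two\n";
        System.out.print(convert(new StubAdapter(), log));
    }
}
```

In a real setting the input stream would typically come from a crawl log file and the output stream from the CDX file being written; both would be opened and closed by the caller.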