Package dk.netarkivet.wayback.batch
Interface DeduplicateToCDXAdapterInterface
All Known Implementing Classes:
DeduplicateToCDXAdapter
public interface DeduplicateToCDXAdapterInterface
Interface describing a class which can be used to convert duplicate records in a crawl log to wayback-compatible cdx records.
Method Summary
String adaptLine(String line)
Takes a deduplicate line from a crawl log and converts it to a line in a cdx file suitable for searching in wayback.
void adaptStream(InputStream is, OutputStream os)
Scans an input stream from a crawl log and converts all lines containing deduplicate information to cdx records which it outputs to an output stream.
Method Detail
adaptLine
String adaptLine(String line)
Takes a deduplicate line from a crawl log and converts it to a line in a cdx file suitable for searching in wayback. The target url in the line is canonicalized by this method. The type of canonicalization is determined by the default canonicalizer from the wayback settings.xml file. If the input String is not a crawl-log duplicate line, null is returned.
Parameters:
line - a line from a crawl log
Returns:
a line for a cdx file or null if the input is not a duplicate line
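Because adaptLine returns null for any line that is not a duplicate line, callers need a null check before using the result. The following is a minimal sketch of that calling pattern, not the NetarchiveSuite implementation. The nested LineAdapter interface only mirrors the documented adaptLine signature so the example compiles on its own. The stub adapter, its "duplicate:" test, and the sample log lines are hypothetical stand-ins and do not reflect the real crawl-log or cdx formats.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class AdaptLineSketch {

    /** Local mirror of the documented adaptLine signature (assumption: used only so the sketch compiles standalone). */
    interface LineAdapter {
        String adaptLine(String line);
    }

    /** Returns the adapted lines for all inputs the adapter accepts, skipping null results. */
    static List<String> collectCdxLines(LineAdapter adapter, List<String> crawlLogLines) {
        List<String> cdxLines = new ArrayList<>();
        for (String line : crawlLogLines) {
            String cdx = adapter.adaptLine(line);
            if (cdx != null) { // null signals "not a duplicate line"
                cdxLines.add(cdx);
            }
        }
        return cdxLines;
    }

    public static void main(String[] args) {
        // Hypothetical stub: a real adapter would parse the crawl-log fields and
        // canonicalize the target url; this one only tags lines it accepts.
        LineAdapter stub = line -> line.contains("duplicate:") ? "adapted: " + line : null;

        // Made-up sample lines, not real crawl-log records.
        List<String> log = Arrays.asList(
                "http://example.org/a duplicate:\"file.arc,123\"",
                "http://example.org/b");

        for (String cdx : collectCdxLines(stub, log)) {
            System.out.println(cdx);
        }
    }
}
```

In real use, an instance of the implementing class DeduplicateToCDXAdapter would take the place of the stub.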
adaptStream
void adaptStream(InputStream is, OutputStream os)
Scans an input stream from a crawl log and converts all lines containing deduplicate information to cdx records which it outputs to an output stream.
Parameters:
is - the input stream
os - the output stream
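One plausible way to relate the two methods is for adaptStream to read the input line by line, pass each line to adaptLine, and write every non-null result to the output. The sketch below illustrates that idea under several assumptions that this page does not confirm: UTF-8 encoding, one record per line, newline-terminated output, and streams that stay open for the caller to close. The LineAdapter interface and the stub adapter are hypothetical and exist only to make the example self-contained.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class AdaptStreamSketch {

    /** Local mirror of the documented adaptLine signature (assumption: for a standalone sketch only). */
    interface LineAdapter {
        String adaptLine(String line);
    }

    /** Writes one adapted line per accepted input line; lines adapted to null are skipped. */
    static void adaptStream(LineAdapter adapter, InputStream is, OutputStream os) {
        try {
            BufferedReader reader =
                    new BufferedReader(new InputStreamReader(is, StandardCharsets.UTF_8));
            BufferedWriter writer =
                    new BufferedWriter(new OutputStreamWriter(os, StandardCharsets.UTF_8));
            String line;
            while ((line = reader.readLine()) != null) {
                String cdx = adapter.adaptLine(line);
                if (cdx != null) {
                    writer.write(cdx);
                    writer.write('\n');
                }
            }
            writer.flush(); // flush only; the caller owns and closes both streams
        } catch (IOException e) {
            // The documented signature declares no checked exception, so wrap it.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical stub adapter and made-up input, not real crawl-log data.
        LineAdapter stub = line -> line.contains("duplicate:") ? "adapted: " + line : null;
        byte[] input = "a duplicate:x\nb\n".getBytes(StandardCharsets.UTF_8);

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        adaptStream(stub, new ByteArrayInputStream(input), out);
        System.out.print(new String(out.toByteArray(), StandardCharsets.UTF_8));
    }
}
```

Processing one line at a time keeps memory use independent of the crawl log's size, which matters because crawl logs can be large.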