dk.netarkivet.wayback.batch
Class DeduplicateToCDXAdapter

java.lang.Object
  extended by dk.netarkivet.wayback.batch.DeduplicateToCDXAdapter
All Implemented Interfaces:
DeduplicateToCDXAdapterInterface

public class DeduplicateToCDXAdapter
extends java.lang.Object
implements DeduplicateToCDXAdapterInterface

Class containing methods for turning duplicate entries in a crawl log into lines in a CDX index file.


Field Summary
(package private)  org.archive.wayback.UrlCanonicalizer canonicalizer
          Canonicalizer used to canonicalize URLs.
 
Constructor Summary
DeduplicateToCDXAdapter()
          Default constructor.
 
Method Summary
 java.lang.String adaptLine(java.lang.String line)
          If the input line is a crawl log entry representing a duplicate, the corresponding CDX entry is returned; otherwise null is returned.
 void adaptStream(java.io.InputStream is, java.io.OutputStream os)
          Reads an input stream representing a crawl log line by line and converts any lines representing duplicate entries to wayback-compliant CDX lines.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

canonicalizer

org.archive.wayback.UrlCanonicalizer canonicalizer
Canonicalizer used to canonicalize URLs.

Constructor Detail

DeduplicateToCDXAdapter

public DeduplicateToCDXAdapter()
Default constructor. Initializes the canonicalizer.

Method Detail

adaptLine

public java.lang.String adaptLine(java.lang.String line)
If the input line is a crawl log entry representing a duplicate, the corresponding CDX entry is returned. In all other cases, including when an error occurs while processing the line, null is returned.

Specified by:
adaptLine in interface DeduplicateToCDXAdapterInterface
Parameters:
line - the crawl-log line to be analysed
Returns:
a CDX line (without newline) or null
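
Because adaptLine returns null both for non-duplicate lines and for errors, callers need a null check before using the result. The following is a minimal sketch of that calling pattern, not code from this package: the class AdaptLineUsageSketch, the method collectCdxLines, and the stand-in lambda (which treats any line containing the text "duplicate:" as a duplicate) are all hypothetical and exist only so the example runs without NetarchiveSuite on the classpath.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

public class AdaptLineUsageSketch {

    /**
     * Applies an adaptLine-style function to each crawl-log line and keeps
     * only the non-null results. The function is expected to follow the
     * documented contract: a CDX line without newline, or null.
     */
    static List<String> collectCdxLines(List<String> crawlLogLines,
                                        Function<String, String> adaptLine) {
        List<String> cdxLines = new ArrayList<>();
        for (String line : crawlLogLines) {
            String cdx = adaptLine.apply(line);
            if (cdx != null) { // null means "not a duplicate" or "error"
                cdxLines.add(cdx);
            }
        }
        return cdxLines;
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for DeduplicateToCDXAdapter.adaptLine; the
        // real duplicate detection and CDX formatting are not shown here.
        Function<String, String> standIn =
                line -> line.contains("duplicate:") ? "CDX<" + line + ">" : null;

        List<String> result = collectCdxLines(
                List.of("first entry duplicate:x", "second entry"), standIn);
        System.out.println(result);
    }
}
```

With the real class available, the stand-in could presumably be replaced by a method reference such as new DeduplicateToCDXAdapter()::adaptLine, since the documented signature takes and returns a java.lang.String.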

adaptStream

public void adaptStream(java.io.InputStream is,
                        java.io.OutputStream os)
Reads an input stream representing a crawl log line by line and converts any lines representing duplicate entries to wayback-compliant CDX lines.

Specified by:
adaptStream in interface DeduplicateToCDXAdapterInterface
Parameters:
is - The input stream from which data is read.
os - The output stream to which the CDX lines are written.
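
The line-by-line behaviour described for adaptStream can be illustrated with a short sketch. This is not the actual implementation: the class AdaptStreamSketch, its methods, the choice of UTF-8, the newline terminator, and the per-line function passed in are all assumptions made for the example. It only shows the general shape of reading a crawl log from a stream, converting each line, and writing the non-null results.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.function.Function;

public class AdaptStreamSketch {

    /**
     * Reads the input line by line, applies the adaptLine-style function,
     * and writes each non-null result followed by a newline.
     * UTF-8 is an assumption; the documented API does not state an encoding.
     */
    static void adaptStream(InputStream is, OutputStream os,
                            Function<String, String> adaptLine) throws IOException {
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(is, StandardCharsets.UTF_8));
        String line;
        while ((line = reader.readLine()) != null) {
            String cdx = adaptLine.apply(line);
            if (cdx != null) { // skip non-duplicates and failed lines
                os.write((cdx + "\n").getBytes(StandardCharsets.UTF_8));
            }
        }
        os.flush();
    }

    /** Convenience wrapper that runs the loop over in-memory strings. */
    static String run(String crawlLog, Function<String, String> adaptLine) {
        ByteArrayOutputStream os = new ByteArrayOutputStream();
        try {
            adaptStream(new ByteArrayInputStream(
                    crawlLog.getBytes(StandardCharsets.UTF_8)), os, adaptLine);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(os.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Hypothetical per-line converter standing in for adaptLine.
        Function<String, String> standIn =
                l -> l.endsWith("dup") ? l.toUpperCase() : null;
        System.out.print(run("x dup\ny\nz dup\n", standIn));
    }
}
```

Passing the per-line conversion in as a function keeps the sketch independent of the real adapter, so the stream handling can be exercised in isolation.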