Interface DeduplicateToCDXAdapterInterface

  • All Known Implementing Classes:
    DeduplicateToCDXAdapter

    public interface DeduplicateToCDXAdapterInterface
    Interface describing a class that converts duplicate records in a crawl log to Wayback-compatible CDX records.
    • Method Summary

      Modifier and Type Method Description
      String adaptLine(String line)
      Takes a deduplicate line from a crawl log and converts it to a line of a CDX file suitable for searching in Wayback.
      void adaptStream(InputStream is, OutputStream os)
      Scans an input stream from a crawl log and converts every line containing deduplicate information to a CDX record, which it writes to an output stream.
    • Method Detail

      • adaptLine

        String adaptLine(String line)
        Takes a deduplicate line from a crawl log and converts it to a line of a CDX file suitable for searching in Wayback. This method canonicalizes the target URL in the line, using the default canonicalizer configured in the Wayback settings.xml file. If the input string is not a crawl-log duplicate line, null is returned.
        Parameters:
        line - a line from a crawl log
        Returns:
        a line for a CDX file, or null if the input is not a duplicate line
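
        Because adaptLine returns null for anything that is not a duplicate line, callers have to check the result before using it. The following is a minimal sketch of that calling pattern, not the library's own code. It declares a local copy of the interface so that it compiles on its own; StubAdapter, AdaptLineSketch and the "duplicate:" marker test are hypothetical stand-ins, and the placeholder string it returns is not a real CDX record.

```java
import java.io.InputStream;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Local copy of the two documented signatures, so the sketch compiles without the library.
interface DeduplicateToCDXAdapterInterface {
    String adaptLine(String line);
    void adaptStream(InputStream is, OutputStream os);
}

// Hypothetical stand-in, NOT the real DeduplicateToCDXAdapter. It only mimics the
// "null for non-duplicate lines" contract; the marker test and output are assumptions.
class StubAdapter implements DeduplicateToCDXAdapterInterface {
    @Override
    public String adaptLine(String line) {
        if (line == null || !line.contains("duplicate:")) {
            return null;
        }
        return "CDX-PLACEHOLDER for " + line;
    }

    @Override
    public void adaptStream(InputStream is, OutputStream os) {
        throw new UnsupportedOperationException("not needed for this sketch");
    }
}

class AdaptLineSketch {
    // Converts each crawl-log line and keeps only the non-null results.
    static List<String> adaptAll(DeduplicateToCDXAdapterInterface adapter,
                                 List<String> crawlLogLines) {
        List<String> cdxLines = new ArrayList<>();
        for (String line : crawlLogLines) {
            String cdx = adapter.adaptLine(line);
            if (cdx != null) { // null means "not a duplicate line"
                cdxLines.add(cdx);
            }
        }
        return cdxLines;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "some entry duplicate:\"file.arc,42\"",
                "an ordinary crawl-log entry");
        for (String cdx : adaptAll(new StubAdapter(), lines)) {
            System.out.println(cdx);
        }
    }
}
```

        With the real adapter in place of StubAdapter, the same null check applies; only the content of the returned lines changes.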
      • adaptStream

        void adaptStream(InputStream is,
                         OutputStream os)
        Scans an input stream from a crawl log and converts every line containing deduplicate information to a CDX record, which it writes to an output stream.
        Parameters:
        is - the input stream from which crawl-log lines are read
        os - the output stream to which the CDX records are written
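
        The stream variant can be exercised entirely in memory. The sketch below shows one plausible way adaptStream could be built on top of adaptLine, together with a small driver; it is an illustration under stated assumptions, not the actual implementation. LineByLineStubAdapter, AdaptStreamSketch, the "duplicate:" marker test, the UTF-8 encoding and the one-record-per-line output are all hypothetical. The documentation above does not say whether the method closes the streams, so the sketch leaves them open and only flushes.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Local copy of the two documented signatures, so the sketch compiles without the library.
interface DeduplicateToCDXAdapterInterface {
    String adaptLine(String line);
    void adaptStream(InputStream is, OutputStream os);
}

// Hypothetical stand-in, NOT the real DeduplicateToCDXAdapter.
class LineByLineStubAdapter implements DeduplicateToCDXAdapterInterface {
    @Override
    public String adaptLine(String line) {
        // Assumption: duplicate lines can be recognised by a "duplicate:" marker.
        return (line != null && line.contains("duplicate:")) ? "CDX-PLACEHOLDER" : null;
    }

    @Override
    public void adaptStream(InputStream is, OutputStream os) {
        try {
            BufferedReader reader =
                    new BufferedReader(new InputStreamReader(is, StandardCharsets.UTF_8));
            String line;
            while ((line = reader.readLine()) != null) {
                String cdx = adaptLine(line);
                if (cdx != null) { // skip lines without deduplicate information
                    os.write((cdx + "\n").getBytes(StandardCharsets.UTF_8));
                }
            }
            os.flush();
        } catch (IOException e) {
            // The documented signature declares no checked exception, so wrap it.
            throw new UncheckedIOException(e);
        }
    }
}

class AdaptStreamSketch {
    // Feeds an in-memory crawl log through the adapter and returns what it wrote.
    static String convert(DeduplicateToCDXAdapterInterface adapter, String crawlLog) {
        ByteArrayInputStream in =
                new ByteArrayInputStream(crawlLog.getBytes(StandardCharsets.UTF_8));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        adapter.adaptStream(in, out);
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String log = "first entry duplicate:\"a.arc,10\"\n"
                + "ordinary entry\n"
                + "third entry duplicate:\"b.arc,20\"\n";
        System.out.print(convert(new LineByLineStubAdapter(), log));
    }
}
```

        For real crawl logs the in-memory streams would typically be replaced by file streams opened in a try-with-resources block, so that the caller controls when they are closed.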