Class DeduplicateToCDXApplication


  • public class DeduplicateToCDXApplication
    extends java.lang.Object
    A simple command-line application that generates cdx files from local crawl.log files.
    • Method Summary

      Modifier and Type Method Description
      void generateCDX(java.lang.String[] localCrawlLogs)
      Takes an array of file names (relative or full paths) of crawl.log files from which duplicate records are to be extracted.
      static void main(java.lang.String[] args)
      An application to generate unsorted cdx files from duplicate records present in a crawl.log file.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Method Detail

      • generateCDX

        public void generateCDX(java.lang.String[] localCrawlLogs)
                         throws java.io.IOException
        Takes an array of file names (relative or full paths) of crawl.log files from which duplicate records are to be extracted. Writes the concatenated cdx output for all duplicate records in these files to standard out. An exception is thrown if any of the files cannot be read for any reason, or if the argument is null.
        Parameters:
        localCrawlLogs - an array of crawl.log file names (relative or full paths)
        Throws:
        java.io.FileNotFoundException - if one of the files cannot be found
        java.io.IOException - if any of the files cannot be read
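        Because the output goes to standard out, a caller that wants the cdx records in a file has to redirect the stream. The following is a minimal sketch of one way to do that, not part of this class: CdxCapture, StdoutWriter and captureStdout are hypothetical names, and the action passed in stands for any call that prints to standard out.

```java
import java.io.IOException;
import java.io.PrintStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of the documented API.
final class CdxCapture {

    /** Stand-in for any call that prints cdx lines to standard out. */
    @FunctionalInterface
    interface StdoutWriter {
        void run() throws IOException;
    }

    /**
     * Runs the given action with System.out temporarily redirected to
     * the target file, then restores the original stream.
     */
    static void captureStdout(StdoutWriter writer, Path target) throws IOException {
        PrintStream original = System.out;
        try (PrintStream fileOut = new PrintStream(
                Files.newOutputStream(target), true, StandardCharsets.UTF_8)) {
            System.setOut(fileOut);
            try {
                writer.run();
            } finally {
                // Restore standard out even if the action throws.
                System.setOut(original);
            }
        }
    }
}
```

        A call might then look like `CdxCapture.captureStdout(() -> DeduplicateToCDXApplication.main(new String[] {"crawl.log"}), Path.of("duplicates.cdx"))`, where both file names are placeholders. Note that System.setOut affects the whole JVM, so this sketch is only suitable for single-threaded use.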
      • main

        public static void main(java.lang.String[] args)
                         throws java.io.IOException
        An application to generate unsorted cdx files from duplicate records present in a crawl.log file. The only arguments are the paths of the crawl.log files to process. Output is written to standard out.
        Parameters:
        args - the file names (relative or absolute paths)
        Throws:
        java.io.FileNotFoundException - if one or more of the files does not exist
        java.io.IOException
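        Since the generated output is unsorted, a consumer that needs ordered lines has to sort them in a separate step. The sketch below shows one possible approach under that assumption. CdxSorter and sortLines are hypothetical names, and the ordering used is Java's natural String order, which may differ from what a particular index tool expects.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of the documented API.
final class CdxSorter {

    /**
     * Returns a sorted copy of the given lines, leaving the input untouched.
     * Natural String order compares UTF-16 code units, which for pure ASCII
     * lines matches byte order.
     */
    static List<String> sortLines(List<String> unsorted) {
        List<String> sorted = new ArrayList<>(unsorted);
        Collections.sort(sorted);
        return sorted;
    }
}
```

        The lines could come from a file that holds the captured output, read for example with `java.nio.file.Files.readAllLines`.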