Class HadoopJobUtils


  • public class HadoopJobUtils
    extends java.lang.Object
    Utilities for Hadoop jobs.
    • Method Summary

      Modifier and Type Method Description
      static java.util.List<java.lang.String> collectOutputLines​(org.apache.hadoop.fs.FileSystem fileSystem, org.apache.hadoop.fs.Path outputFolder)
      Collects lines from a job's output files at a specified path.
      static void doKerberosLogin()
      Logs in to Kerberos using the settings specified in CommonSettings.
      static java.util.List<CDXRecord> getCDXRecordListFromCDXLines​(java.util.List<java.lang.String> cdxLines)
      Converts a list of CDX line strings to a list of CDXRecords.
      static org.apache.hadoop.conf.Configuration getConf()
      Initializes a Hadoop configuration.
      static org.apache.hadoop.security.UserGroupInformation getUserGroupInformation()
      Obtains a logged-in UserGroupInformation for running Hadoop jobs from the Kerberos parameters defined in CommonSettings.
      static void writeHadoopInputFileLinesToInputFile​(java.util.List<java.nio.file.Path> files, java.nio.file.Path inputFilePath)
      Given a list of file paths, prepends 'file://' to each entry and writes the entries as newline-separated lines to the given input file path.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Method Detail

      • getUserGroupInformation

        public static org.apache.hadoop.security.UserGroupInformation getUserGroupInformation()
                                                                                       throws sun.security.krb5.KrbException,
                                                                                              java.io.IOException
        Obtains a logged-in UserGroupInformation for running Hadoop jobs from the Kerberos parameters defined in CommonSettings.
        Returns:
        The UserGroupInformation instance
        Throws:
        sun.security.krb5.KrbException - if the Kerberos configuration is invalid
        java.io.IOException - if the Kerberos login fails
      • doKerberosLogin

        public static void doKerberosLogin()
                                    throws sun.security.krb5.KrbException,
                                           java.io.IOException
        Logs in to Kerberos using the settings specified in CommonSettings.
        Throws:
        sun.security.krb5.KrbException - if the Kerberos configuration is invalid
        java.io.IOException - if the Kerberos login fails
      • getConf

        public static org.apache.hadoop.conf.Configuration getConf()
        Initializes a Hadoop configuration. The basic configuration must be in a directory on the classpath. This method additionally sets the path to the uber jar specified in CommonSettings#HADOOP_MAPRED_UBER_JAR.
        Returns:
        A new configuration to use for a job.
      • writeHadoopInputFileLinesToInputFile

        public static void writeHadoopInputFileLinesToInputFile​(java.util.List<java.nio.file.Path> files,
                                                                java.nio.file.Path inputFilePath)
                                                         throws java.io.IOException
        Given a list of file paths, prepends 'file://' to each entry and writes the entries as newline-separated lines to the given input file path.
        Parameters:
        files - A list of input file paths to operate on
        inputFilePath - The path of the file to write the lines to
        Throws:
        java.io.IOException - If the input file path cannot be written to
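The behavior described above can be sketched with plain java.nio. This is a minimal illustration, not the HadoopJobUtils implementation: the class name `InputFileSketch` and the method name `writeInputLines` are hypothetical, and converting each path to an absolute path before adding the prefix is an assumption the documentation does not confirm.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in; not the HadoopJobUtils implementation. */
final class InputFileSketch {

    /** Writes one "file://"-prefixed line per path to inputFilePath. */
    static void writeInputLines(List<Path> files, Path inputFilePath) throws IOException {
        List<String> lines = new ArrayList<>();
        for (Path file : files) {
            // Assumption: the path is made absolute before the prefix is added.
            lines.add("file://" + file.toAbsolutePath());
        }
        // Files.write terminates each line with the platform line separator.
        Files.write(inputFilePath, lines, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Path input = Files.createTempFile("hadoop-input", ".txt");
        List<Path> files = Arrays.asList(Paths.get("/data/a.warc"), Paths.get("/data/b.warc"));
        writeInputLines(files, input);
        // Print the generated lines, then clean up the temporary file.
        Files.readAllLines(input).forEach(System.out::println);
        Files.delete(input);
    }
}
```

A file written this way lists one local file URI per line, so it could be handed to a job as its input listing.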
      • collectOutputLines

        public static java.util.List<java.lang.String> collectOutputLines​(org.apache.hadoop.fs.FileSystem fileSystem,
                                                                          org.apache.hadoop.fs.Path outputFolder)
                                                                   throws java.io.IOException
        Collects lines from a job's output files at a specified path. Also deletes the folder once the output has been collected.
        Parameters:
        fileSystem - The filesystem that the result is collected from.
        outputFolder - The output folder to find the job result files in.
        Returns:
        A list of lines collected from all the output files.
        Throws:
        java.io.IOException - If the output folder or its contents cannot be read.
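The collect-then-delete pattern can be illustrated on a local directory with java.nio. This is an analogue under stated assumptions, not the documented method: the real method works on an org.apache.hadoop.fs.FileSystem, whereas the hypothetical `OutputCollectorSketch.collectLines` below reads a local folder, sorts the files by name for a deterministic order, and handles only a flat folder of regular files.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical local-filesystem analogue; not the HadoopJobUtils implementation. */
final class OutputCollectorSketch {

    /** Reads every regular file in outputFolder, then deletes the files and the folder. */
    static List<String> collectLines(Path outputFolder) throws IOException {
        List<Path> files = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(outputFolder)) {
            for (Path entry : stream) {
                if (Files.isRegularFile(entry)) {
                    files.add(entry);
                }
            }
        }
        // Assumption: name order (part-00000, part-00001, ...) is the desired order.
        Collections.sort(files);

        // Read everything first so a failed read leaves the folder untouched.
        List<String> lines = new ArrayList<>();
        for (Path file : files) {
            lines.addAll(Files.readAllLines(file));
        }

        // Delete only after all output has been collected.
        for (Path file : files) {
            Files.delete(file);
        }
        // Throws DirectoryNotEmptyException if subdirectories remain.
        Files.delete(outputFolder);
        return lines;
    }
}
```

Reading all files before deleting any means a read error leaves the output intact for a retry.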
      • getCDXRecordListFromCDXLines

        public static java.util.List<CDXRecord> getCDXRecordListFromCDXLines(java.util.List<java.lang.String> cdxLines)
        Converts a list of CDX line strings to a list of CDXRecords.
        Parameters:
        cdxLines - The list to convert
        Returns:
        A list of CDXRecords corresponding to the input lines