Class HadoopJobUtils


  • public class HadoopJobUtils
    extends Object
    Utilities for Hadoop jobs.
    • Method Summary

      Modifier and Type Method Description
      static List<String> collectOutputLines​(org.apache.hadoop.fs.FileSystem fileSystem, org.apache.hadoop.fs.Path outputFolder)
      Collects lines from a job's output files at a specified path.
      static void configureCaching​(org.apache.hadoop.conf.Configuration configuration)  
      static void doKerberosLogin()
      Logs in to Kerberos using the settings specified in CommonSettings.
      static org.apache.hadoop.conf.Configuration enableMapOnlyUberTask​(org.apache.hadoop.conf.Configuration configuration, Integer appMasterMemory, Integer appMasterCores)
      Call setMapCoresPerTask() and setMapMemory() BEFORE calling this, as it uses their values.
      static org.apache.hadoop.conf.Configuration enableUberTask​(org.apache.hadoop.conf.Configuration configuration, Integer appMasterMemory, Integer appMasterCores)
      Call the set*CoresPerTask() and set*Memory() methods BEFORE calling this, as it uses their values.
      static List<CDXRecord> getCDXRecordListFromCDXLines​(List<String> cdxLines)
      Converts a list of CDX line strings to a list of CDXRecords. TODO: this code would look better with streams.
      static org.apache.hadoop.conf.Configuration getConf()
      Initializes a Hadoop configuration.
      static org.apache.hadoop.security.UserGroupInformation getUserGroupInformation()
      Obtains a logged-in UserGroupInformation for running Hadoop jobs from the Kerberos parameters defined in CommonSettings.
      static org.apache.hadoop.conf.Configuration setAppMasterCores​(org.apache.hadoop.conf.Configuration configuration, int cores)  
      static org.apache.hadoop.conf.Configuration setAppMasterMemory​(org.apache.hadoop.conf.Configuration configuration, int memory)  
      static void setBatchQueue​(org.apache.hadoop.conf.Configuration conf)  
      static void setInteractiveQueue​(org.apache.hadoop.conf.Configuration conf)  
      static org.apache.hadoop.conf.Configuration setMapCoresPerTask​(org.apache.hadoop.conf.Configuration configuration, int cores)  
      static org.apache.hadoop.conf.Configuration setMapMemory​(org.apache.hadoop.conf.Configuration configuration, int memory)  
      static org.apache.hadoop.conf.Configuration setReduceCoresPerTask​(org.apache.hadoop.conf.Configuration configuration, int cores)  
      static org.apache.hadoop.conf.Configuration setReducerMemory​(org.apache.hadoop.conf.Configuration configuration, int memory)  
      static void writeHadoopInputFileLinesToInputFile​(List<Path> files, Path inputFilePath)
      Given a list of file paths, prepends 'file://' to every entry and writes them as newline-separated lines to the given input file path.
    • Method Detail

      • getUserGroupInformation

        public static org.apache.hadoop.security.UserGroupInformation getUserGroupInformation()
                                                                                       throws sun.security.krb5.KrbException,
                                                                                              IOException
        Obtains a logged-in UserGroupInformation for running Hadoop jobs from the Kerberos parameters defined in CommonSettings.
        Returns:
        The UserGroupInformation instance
        Throws:
        sun.security.krb5.KrbException - if the Kerberos configuration is invalid
        IOException - if the Kerberos login fails
      • doKerberosLogin

        public static void doKerberosLogin()
                                    throws sun.security.krb5.KrbException,
                                           IOException
        Logs in to Kerberos using the settings specified in CommonSettings.
        Throws:
        sun.security.krb5.KrbException - if the Kerberos configuration is invalid
        IOException - if the Kerberos login fails
      • getConf

        public static org.apache.hadoop.conf.Configuration getConf()
        Initializes a Hadoop configuration. The basic configuration must be in a directory on the classpath. This method additionally sets the path to the uber jar specified in CommonSettings#HADOOP_MAPRED_UBER_JAR.
        Returns:
        A new configuration to use for a job.
      • enableUberTask

        public static org.apache.hadoop.conf.Configuration enableUberTask​(org.apache.hadoop.conf.Configuration configuration,
                                                                          Integer appMasterMemory,
                                                                          Integer appMasterCores)
        Call the set*CoresPerTask() and set*Memory() methods BEFORE calling this, as it uses their values.
        Parameters:
        configuration -
        Returns:
      • enableMapOnlyUberTask

        public static org.apache.hadoop.conf.Configuration enableMapOnlyUberTask​(org.apache.hadoop.conf.Configuration configuration,
                                                                                 Integer appMasterMemory,
                                                                                 Integer appMasterCores)
        Call setMapCoresPerTask() and setMapMemory() BEFORE calling this, as it uses their values.
        Parameters:
        configuration -
        Returns:
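        The call-order requirement above can be illustrated with a small stand-in. This is only a sketch: a plain Map replaces org.apache.hadoop.conf.Configuration, the class UberTaskOrderSketch and its methods are hypothetical, and the rule that the application master gets at least as much memory as a map task is an assumption made to show why the order matters. The internals of the real enableMapOnlyUberTask are not shown on this page. The property keys are standard Hadoop keys, and the 1024 MB fallback is a hypothetical default.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in; not the HadoopJobUtils implementation.
final class UberTaskOrderSketch {
    static final String MAP_MEMORY = "mapreduce.map.memory.mb";
    static final String AM_MEMORY = "yarn.app.mapreduce.am.resource.mb";
    static final String UBER_ENABLE = "mapreduce.job.ubertask.enable";

    /** Stores the map task memory (in MB) and returns the same map for chaining. */
    static Map<String, String> setMapMemory(Map<String, String> conf, int memory) {
        conf.put(MAP_MEMORY, Integer.toString(memory));
        return conf;
    }

    /**
     * Assumed behavior: reads the map memory that is set at call time and sizes
     * the application master to at least that value. A later setMapMemory call
     * is therefore not reflected in the application master settings.
     */
    static Map<String, String> enableMapOnlyUberTask(Map<String, String> conf, int appMasterMemory) {
        // Hypothetical fallback of 1024 MB when no map memory has been set yet.
        int mapMemory = Integer.parseInt(conf.getOrDefault(MAP_MEMORY, "1024"));
        conf.put(UBER_ENABLE, "true");
        conf.put(AM_MEMORY, Integer.toString(Math.max(appMasterMemory, mapMemory)));
        return conf;
    }

    public static void main(String[] args) {
        // Setter first: the uber task setup sees the 4096 MB map memory.
        Map<String, String> setterFirst = new HashMap<>();
        setMapMemory(setterFirst, 4096);
        enableMapOnlyUberTask(setterFirst, 1024);
        System.out.println(setterFirst.get(AM_MEMORY)); // prints 4096

        // Setter last: the uber task setup only saw the fallback value.
        Map<String, String> setterLast = new HashMap<>();
        enableMapOnlyUberTask(setterLast, 1024);
        setMapMemory(setterLast, 4096);
        System.out.println(setterLast.get(AM_MEMORY)); // prints 1024
    }
}
```

        In the second case the application master would be smaller than the task it has to host, which is the kind of mismatch the ordering note is meant to prevent.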
      • setMapMemory

        public static org.apache.hadoop.conf.Configuration setMapMemory​(org.apache.hadoop.conf.Configuration configuration,
                                                                        int memory)
      • setReducerMemory

        public static org.apache.hadoop.conf.Configuration setReducerMemory​(org.apache.hadoop.conf.Configuration configuration,
                                                                            int memory)
      • setAppMasterMemory

        public static org.apache.hadoop.conf.Configuration setAppMasterMemory​(org.apache.hadoop.conf.Configuration configuration,
                                                                              int memory)
      • setMapCoresPerTask

        public static org.apache.hadoop.conf.Configuration setMapCoresPerTask​(org.apache.hadoop.conf.Configuration configuration,
                                                                              int cores)
      • setReduceCoresPerTask

        public static org.apache.hadoop.conf.Configuration setReduceCoresPerTask​(org.apache.hadoop.conf.Configuration configuration,
                                                                                 int cores)
      • setAppMasterCores

        public static org.apache.hadoop.conf.Configuration setAppMasterCores​(org.apache.hadoop.conf.Configuration configuration,
                                                                             int cores)
      • writeHadoopInputFileLinesToInputFile

        public static void writeHadoopInputFileLinesToInputFile​(List<Path> files,
                                                                Path inputFilePath)
                                                         throws IOException
        Given a list of file paths, prepends 'file://' to every entry and writes them as newline-separated lines to the given input file path.
        Parameters:
        files - A list of input file paths to operate on
        inputFilePath - The path of the file to write the lines to
        Throws:
        IOException - If the input file path cannot be written to
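        The described behavior can be sketched with the standard library alone. This is a minimal illustration, not the HadoopJobUtils implementation: the class InputFileLinesSketch is hypothetical, and it assumes that the unqualified Path in the signature is java.nio.file.Path and that the lines are written as UTF-8.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in; not the HadoopJobUtils implementation.
final class InputFileLinesSketch {

    /** Writes one "file://<path>" line per entry to inputFilePath. */
    static void writeInputFileLines(List<Path> files, Path inputFilePath) throws IOException {
        List<String> lines = new ArrayList<>();
        for (Path file : files) {
            // Prefix each entry as given; no normalization is applied here.
            lines.add("file://" + file);
        }
        // Files.write terminates every line with the platform line separator.
        Files.write(inputFilePath, lines, StandardCharsets.UTF_8);
    }
}
```

        A job can then use the written file as its input, so that each map call receives one file URI per line.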
      • collectOutputLines

        public static List<String> collectOutputLines​(org.apache.hadoop.fs.FileSystem fileSystem,
                                                      org.apache.hadoop.fs.Path outputFolder)
                                               throws IOException
        Collects lines from a job's output files at a specified path. Also deletes the folder once the output has been collected.
        Parameters:
        fileSystem - The filesystem that the result is collected from.
        outputFolder - The output folder to find the job result files in.
        Returns:
        A list of lines collected from all the output files.
        Throws:
        IOException - If the output folder or its contents cannot be read.
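        The collect-then-delete pattern can be sketched on the local filesystem. Note the substitution: this sketch uses java.nio.file instead of org.apache.hadoop.fs.FileSystem, so it is an analogy and not the actual implementation. The class CollectOutputSketch is hypothetical, reading the files in sorted name order is an assumption, and only a flat folder without subdirectories is handled.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical local-filesystem analogue; not the HadoopJobUtils implementation.
final class CollectOutputSketch {

    /** Reads all lines from the regular files in outputFolder, then deletes the folder. */
    static List<String> collectOutputLines(Path outputFolder) throws IOException {
        List<Path> files = new ArrayList<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(outputFolder)) {
            for (Path entry : entries) {
                if (Files.isRegularFile(entry)) {
                    files.add(entry);
                }
            }
        }
        // Assumption: sort by name so the result order is deterministic.
        Collections.sort(files);

        List<String> lines = new ArrayList<>();
        for (Path file : files) {
            lines.addAll(Files.readAllLines(file));
        }

        // Delete only after everything was read, so a read failure keeps the output.
        for (Path file : files) {
            Files.delete(file);
        }
        Files.delete(outputFolder); // fails if the folder still has other entries
        return lines;
    }
}
```

        Because the folder is removed as a side effect, callers that need the raw output files afterwards have to copy them before collecting the lines.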
      • getCDXRecordListFromCDXLines

        public static List<CDXRecord> getCDXRecordListFromCDXLines​(List<String> cdxLines)
        Converts a list of CDX line strings to a list of CDXRecords. TODO: this code would look better with streams.
        Parameters:
        cdxLines - The list to convert
        Returns:
        A list of CDXRecords representing the given CDX lines
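        A stream-based version of this conversion might look like the sketch below. It is not the project's code: CdxFields is a hypothetical stand-in for CDXRecord that only stores the split fields, and splitting each line on whitespace is an assumption about the CDX line format.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical stand-in; not the HadoopJobUtils implementation.
final class CdxStreamSketch {

    /** Hypothetical replacement for CDXRecord: holds the fields of one CDX line. */
    static final class CdxFields {
        final String[] fields;

        CdxFields(String[] fields) {
            this.fields = fields;
        }
    }

    /** Converts each CDX line to one record, preserving the input order. */
    static List<CdxFields> toRecords(List<String> cdxLines) {
        return cdxLines.stream()
                // Assumption: the fields of a CDX line are separated by whitespace.
                .map(line -> line.split("\\s+"))
                .map(CdxFields::new)
                .collect(Collectors.toList());
    }
}
```

        The stream pipeline replaces an explicit loop that builds the result list, while keeping one output record per input line.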
      • configureCaching

        public static void configureCaching​(org.apache.hadoop.conf.Configuration configuration)
      • setBatchQueue

        public static void setBatchQueue​(org.apache.hadoop.conf.Configuration conf)
      • setInteractiveQueue

        public static void setInteractiveQueue​(org.apache.hadoop.conf.Configuration conf)