Class HadoopJobUtils
- java.lang.Object
-
- dk.netarkivet.common.utils.hadoop.HadoopJobUtils
-
public class HadoopJobUtils extends Object
Utilities for Hadoop jobs.
-
-
Field Summary
Modifier and Type    Field    Description
static String
DEFAULT_FILESYSTEM
static String
MAPREDUCE_FRAMEWORK
static String
YARN_RESOURCEMANAGER_ADDRESS
-
Method Summary
Modifier and Type    Method    Description
static List<String>
collectOutputLines(org.apache.hadoop.fs.FileSystem fileSystem, org.apache.hadoop.fs.Path outputFolder)
Collects lines from a job's output files at a specified path.
static void
configureCaching(org.apache.hadoop.conf.Configuration configuration)
static void
doKerberosLogin()
Logs in to Kerberos using the settings specified in CommonSettings.
static org.apache.hadoop.conf.Configuration
enableMapOnlyUberTask(org.apache.hadoop.conf.Configuration configuration, Integer appMasterMemory, Integer appMasterCores)
Call setMapCoresPerTask() and setMapMemory() BEFORE calling this, as it uses their values.
static org.apache.hadoop.conf.Configuration
enableUberTask(org.apache.hadoop.conf.Configuration configuration, Integer appMasterMemory, Integer appMasterCores)
Call the set*CoresPerTask() and set*Memory() methods BEFORE calling this, as it uses their values.
static List<CDXRecord>
getCDXRecordListFromCDXLines(List<String> cdxLines)
Converts a list of CDX line strings to a list of CDXRecords. TODO: this code would look better with streams.
static org.apache.hadoop.conf.Configuration
getConf()
Initializes a Hadoop configuration.
static org.apache.hadoop.security.UserGroupInformation
getUserGroupInformation()
Obtains a logged-in UserGroupInformation for running Hadoop jobs from the Kerberos parameters defined in CommonSettings.
static org.apache.hadoop.conf.Configuration
setAppMasterCores(org.apache.hadoop.conf.Configuration configuration, int cores)
static org.apache.hadoop.conf.Configuration
setAppMasterMemory(org.apache.hadoop.conf.Configuration configuration, int memory)
static void
setBatchQueue(org.apache.hadoop.conf.Configuration conf)
static void
setInteractiveQueue(org.apache.hadoop.conf.Configuration conf)
static org.apache.hadoop.conf.Configuration
setMapCoresPerTask(org.apache.hadoop.conf.Configuration configuration, int cores)
static org.apache.hadoop.conf.Configuration
setMapMemory(org.apache.hadoop.conf.Configuration configuration, int memory)
static org.apache.hadoop.conf.Configuration
setReduceCoresPerTask(org.apache.hadoop.conf.Configuration configuration, int cores)
static org.apache.hadoop.conf.Configuration
setReducerMemory(org.apache.hadoop.conf.Configuration configuration, int memory)
static void
writeHadoopInputFileLinesToInputFile(List<Path> files, Path inputFilePath)
Given a list of file paths, prepends 'file://' to every entry and writes the entries as newline-separated lines to the given input file path.
-
-
-
Field Detail
-
DEFAULT_FILESYSTEM
public static final String DEFAULT_FILESYSTEM
- See Also:
- Constant Field Values
-
MAPREDUCE_FRAMEWORK
public static final String MAPREDUCE_FRAMEWORK
- See Also:
- Constant Field Values
-
YARN_RESOURCEMANAGER_ADDRESS
public static final String YARN_RESOURCEMANAGER_ADDRESS
- See Also:
- Constant Field Values
-
-
Method Detail
-
getUserGroupInformation
public static org.apache.hadoop.security.UserGroupInformation getUserGroupInformation() throws sun.security.krb5.KrbException, IOException
Obtains a logged-in UserGroupInformation for running Hadoop jobs from the Kerberos parameters defined in CommonSettings.
- Returns:
- The UserGroupInformation instance
- Throws:
sun.security.krb5.KrbException
- if the Kerberos configuration is invalid
IOException
- if the Kerberos login fails
-
doKerberosLogin
public static void doKerberosLogin() throws sun.security.krb5.KrbException, IOException
Logs in to Kerberos using the settings specified in CommonSettings.
- Throws:
sun.security.krb5.KrbException
- if the Kerberos configuration is invalid
IOException
- if the Kerberos login fails
-
getConf
public static org.apache.hadoop.conf.Configuration getConf()
Initializes a Hadoop configuration. The basic configuration must be in a directory on the classpath. This method additionally sets the path to the uber jar specified in CommonSettings#HADOOP_MAPRED_UBER_JAR.
- Returns:
- A new configuration to use for a job.
-
enableUberTask
public static org.apache.hadoop.conf.Configuration enableUberTask(org.apache.hadoop.conf.Configuration configuration, Integer appMasterMemory, Integer appMasterCores)
Call the set*CoresPerTask() and set*Memory() methods BEFORE calling this, as it uses their values.
- Parameters:
configuration
-
- Returns:
-
enableMapOnlyUberTask
public static org.apache.hadoop.conf.Configuration enableMapOnlyUberTask(org.apache.hadoop.conf.Configuration configuration, Integer appMasterMemory, Integer appMasterCores)
Call setMapCoresPerTask() and setMapMemory() BEFORE calling this, as it uses their values.
- Parameters:
configuration
-
- Returns:
-
setMapMemory
public static org.apache.hadoop.conf.Configuration setMapMemory(org.apache.hadoop.conf.Configuration configuration, int memory)
-
setReducerMemory
public static org.apache.hadoop.conf.Configuration setReducerMemory(org.apache.hadoop.conf.Configuration configuration, int memory)
-
setAppMasterMemory
public static org.apache.hadoop.conf.Configuration setAppMasterMemory(org.apache.hadoop.conf.Configuration configuration, int memory)
-
setMapCoresPerTask
public static org.apache.hadoop.conf.Configuration setMapCoresPerTask(org.apache.hadoop.conf.Configuration configuration, int cores)
-
setReduceCoresPerTask
public static org.apache.hadoop.conf.Configuration setReduceCoresPerTask(org.apache.hadoop.conf.Configuration configuration, int cores)
-
setAppMasterCores
public static org.apache.hadoop.conf.Configuration setAppMasterCores(org.apache.hadoop.conf.Configuration configuration, int cores)
-
writeHadoopInputFileLinesToInputFile
public static void writeHadoopInputFileLinesToInputFile(List<Path> files, Path inputFilePath) throws IOException
Given a list of file paths, prepends 'file://' to every entry and writes the entries as newline-separated lines to the given input file path.
- Parameters:
files
- A list of input file paths to operate on
inputFilePath
- The path of the file to write the lines to
- Throws:
IOException
- If the input file path cannot be written to
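The behaviour described above can be sketched with the standard library alone. This is a minimal illustration, not the NetarchiveSuite implementation: the class name InputFileSketch and the method name writeInputFileLines are hypothetical, Path is assumed to be java.nio.file.Path, and the use of Path.toString() (rather than an absolute or normalized path) and UTF-8 encoding are assumptions.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in class; it only mirrors the documented behaviour.
final class InputFileSketch {

    /** Prefixes every path with "file://" and writes one entry per line. */
    static void writeInputFileLines(List<Path> files, Path inputFilePath) throws IOException {
        List<String> lines = new ArrayList<>();
        for (Path file : files) {
            // Assumption: the plain string form of the path is what gets prefixed.
            lines.add("file://" + file.toString());
        }
        // Files.write terminates each line with the platform line separator.
        Files.write(inputFilePath, lines, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Path inputFile = Files.createTempFile("hadoop-input", ".txt");
        List<Path> files = List.of(Paths.get("/data/a.warc.gz"), Paths.get("/data/b.warc.gz"));
        writeInputFileLines(files, inputFile);
        // Print the generated input file, one URI-style line per input path.
        Files.readAllLines(inputFile).forEach(System.out::println);
        Files.delete(inputFile);
    }
}
```

A Hadoop job can then take the written file as its input, with each line naming one file to process.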
-
collectOutputLines
public static List<String> collectOutputLines(org.apache.hadoop.fs.FileSystem fileSystem, org.apache.hadoop.fs.Path outputFolder) throws IOException
Collects lines from a job's output files at a specified path. Also deletes the folder once the output has been collected.
- Parameters:
fileSystem
- The filesystem that the result is collected from.
outputFolder
- The output folder to find the job result files in.
- Returns:
- A list of lines collected from all the output files.
- Throws:
IOException
- If the output folder or its contents cannot be read.
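The collect-then-delete behaviour can be illustrated with a local-filesystem analogue. The sketch below deliberately uses java.nio.file instead of org.apache.hadoop.fs.FileSystem so that it runs without Hadoop on the classpath. The class name CollectOutputSketch and the method name collectAndDelete are hypothetical, and reading every regular file in sorted name order is an assumption about which files count as output.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical local-filesystem analogue of the documented behaviour.
final class CollectOutputSketch {

    /** Reads all lines from the files in outputFolder, then deletes the folder. */
    static List<String> collectAndDelete(Path outputFolder) throws IOException {
        List<Path> files;
        try (Stream<Path> entries = Files.list(outputFolder)) {
            // Assumption: every regular file in the folder is a result file.
            files = entries.filter(Files::isRegularFile).sorted().collect(Collectors.toList());
        }
        List<String> lines = new ArrayList<>();
        for (Path file : files) {
            lines.addAll(Files.readAllLines(file));
        }
        // Delete children before parents by walking the tree in reverse order.
        List<Path> toDelete;
        try (Stream<Path> tree = Files.walk(outputFolder)) {
            toDelete = tree.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path path : toDelete) {
            Files.delete(path);
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        Path folder = Files.createTempDirectory("job-output");
        Files.write(folder.resolve("part-00000"), List.of("first", "second"));
        Files.write(folder.resolve("part-00001"), List.of("third"));
        System.out.println(collectAndDelete(folder));
        System.out.println("folder still exists: " + Files.exists(folder));
    }
}
```

Because the folder is removed after collection, callers that need the raw output files afterwards would have to copy them before calling a method like this.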
-
getCDXRecordListFromCDXLines
public static List<CDXRecord> getCDXRecordListFromCDXLines(List<String> cdxLines)
Converts a list of CDX line strings to a list of CDXRecords. TODO: this code would look better with streams.
- Parameters:
cdxLines
- The list to convert
- Returns:
- A list of CDXRecords representing the input lines
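The stream-based form that the TODO alludes to might look like the sketch below. CdxEntry is a hypothetical stand-in for CDXRecord so that the example compiles on its own, and splitting each line on whitespace is an assumption about how a CDX line maps to record fields; neither is taken from the NetarchiveSuite source.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch of a stream-based line-to-record conversion.
final class CdxStreamSketch {

    /** Hypothetical stand-in for CDXRecord: holds the fields of one CDX line. */
    static final class CdxEntry {
        final List<String> fields;

        CdxEntry(String[] fields) {
            this.fields = Arrays.asList(fields);
        }
    }

    /** Maps every CDX line to one entry, preserving the input order. */
    static List<CdxEntry> toEntries(List<String> cdxLines) {
        return cdxLines.stream()
                // Assumption: fields are separated by runs of whitespace.
                .map(line -> line.trim().split("\\s+"))
                .map(CdxEntry::new)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "http://example.org/ 20200101000000 text/html",
                "http://example.org/a 20200102000000 image/png");
        for (CdxEntry entry : toEntries(lines)) {
            System.out.println(entry.fields);
        }
    }
}
```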
-
configureCaching
public static void configureCaching(org.apache.hadoop.conf.Configuration configuration)
-
setBatchQueue
public static void setBatchQueue(org.apache.hadoop.conf.Configuration conf)
-
setInteractiveQueue
public static void setInteractiveQueue(org.apache.hadoop.conf.Configuration conf)
-
-