java.lang.Object
- dk.netarkivet.common.utils.hadoop.HadoopJobUtils

public class HadoopJobUtils
extends Object

Utilities for Hadoop jobs.

Field Summary

Fields
Modifier and Type	Field	Description
`static String`	`DEFAULT_FILESYSTEM`
`static String`	`MAPREDUCE_FRAMEWORK`
`static String`	`YARN_RESOURCEMANAGER_ADDRESS`

Method Summary

Modifier and Type	Method	Description
`static List<String>`	`collectOutputLines(org.apache.hadoop.fs.FileSystem fileSystem, org.apache.hadoop.fs.Path outputFolder)`	Collects lines from a job's output files at a specified path.
`static void`	`configureCaching(org.apache.hadoop.conf.Configuration configuration)`
`static void`	`doKerberosLogin()`	Log in to Kerberos using the settings specified in CommonSettings.
`static org.apache.hadoop.conf.Configuration`	`enableMapOnlyUberTask(org.apache.hadoop.conf.Configuration configuration, Integer appMasterMemory, Integer appMasterCores)`	Call setMapCoresPerTask() and setMapMemory() BEFORE calling this, as it uses their values.
`static org.apache.hadoop.conf.Configuration`	`enableUberTask(org.apache.hadoop.conf.Configuration configuration, Integer appMasterMemory, Integer appMasterCores)`	Call the set*CoresPerTask() and set*Memory() methods BEFORE calling this, as it uses their values.
`static List<CDXRecord>`	`getCDXRecordListFromCDXLines(List<String> cdxLines)`	Converts a list of CDX line strings to a list of CDXRecords.
`static org.apache.hadoop.conf.Configuration`	`getConf()`	Initialize a Hadoop configuration.
`static org.apache.hadoop.security.UserGroupInformation`	`getUserGroupInformation()`	Obtain a logged-in UserGroupInformation for running Hadoop jobs from the Kerberos parameters defined in CommonSettings.
`static org.apache.hadoop.conf.Configuration`	`setAppMasterCores(org.apache.hadoop.conf.Configuration configuration, int cores)`
`static org.apache.hadoop.conf.Configuration`	`setAppMasterMemory(org.apache.hadoop.conf.Configuration configuration, int memory)`
`static void`	`setBatchQueue(org.apache.hadoop.conf.Configuration conf)`
`static void`	`setInteractiveQueue(org.apache.hadoop.conf.Configuration conf)`
`static org.apache.hadoop.conf.Configuration`	`setMapCoresPerTask(org.apache.hadoop.conf.Configuration configuration, int cores)`
`static org.apache.hadoop.conf.Configuration`	`setMapMemory(org.apache.hadoop.conf.Configuration configuration, int memory)`
`static org.apache.hadoop.conf.Configuration`	`setReduceCoresPerTask(org.apache.hadoop.conf.Configuration configuration, int cores)`
`static org.apache.hadoop.conf.Configuration`	`setReducerMemory(org.apache.hadoop.conf.Configuration configuration, int memory)`
`static void`	`writeHadoopInputFileLinesToInputFile(List<Path> files, Path inputFilePath)`	Given a list of file paths, prepend 'file://' to every entry and write them as newline-separated lines to the given input file path.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Detail
- DEFAULT_FILESYSTEM
```
public static final String DEFAULT_FILESYSTEM
```
  See Also: Constant Field Values
- MAPREDUCE_FRAMEWORK
```
public static final String MAPREDUCE_FRAMEWORK
```
  See Also: Constant Field Values
- YARN_RESOURCEMANAGER_ADDRESS
```
public static final String YARN_RESOURCEMANAGER_ADDRESS
```
  See Also: Constant Field Values

Method Detail

getUserGroupInformation

public static org.apache.hadoop.security.UserGroupInformation getUserGroupInformation()
                                                                               throws sun.security.krb5.KrbException,
                                                                                      IOException

Obtain a logged-in UserGroupInformation for running Hadoop jobs from the Kerberos parameters defined in CommonSettings.

Returns:

The UserGroupInformation instance

Throws:

sun.security.krb5.KrbException - if the Kerberos configuration is invalid

IOException - if the Kerberos login fails

doKerberosLogin
```
public static void doKerberosLogin()
                            throws sun.security.krb5.KrbException,
                                   IOException
```
Log in to Kerberos using the settings specified in CommonSettings.

Throws:

sun.security.krb5.KrbException - if the Kerberos configuration is invalid

IOException - if the Kerberos login fails

getConf
```
public static org.apache.hadoop.conf.Configuration getConf()
```
Initialize a Hadoop configuration. The basic configuration must be in a directory on the classpath. This method additionally sets the path to the uber jar specified in CommonSettings#HADOOP_MAPRED_UBER_JAR.

Returns:

A new configuration to use for a job.

enableUberTask

public static org.apache.hadoop.conf.Configuration enableUberTask(org.apache.hadoop.conf.Configuration configuration,
                                                                  Integer appMasterMemory,
                                                                  Integer appMasterCores)

Call the set*CoresPerTask() and set*Memory() methods BEFORE calling this, as it uses their values.

Parameters:

configuration - The Hadoop configuration to update

Returns:

The configuration with the uber task settings applied
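The source does not show why the call order matters. One plausible reading is that the method reads the task resource values already stored in the configuration, so values set afterwards are never seen. The sketch below illustrates only that dependency. A plain `Map` stands in for `org.apache.hadoop.conf.Configuration`; the property names are the standard Hadoop ones, but the method bodies, the 1024 MB fallback and the handling of a null `appMasterMemory` are assumptions, not the actual HadoopJobUtils implementation.

```java
import java.util.Map;

/** Sketch only: a Map stands in for org.apache.hadoop.conf.Configuration. */
final class UberTaskOrderingSketch {
    // Standard Hadoop property names.
    static final String MAP_MEMORY = "mapreduce.map.memory.mb";
    static final String REDUCE_MEMORY = "mapreduce.reduce.memory.mb";
    static final String AM_MEMORY = "yarn.app.mapreduce.am.resource.mb";
    static final String UBER_ENABLE = "mapreduce.job.ubertask.enable";

    static Map<String, String> setMapMemory(Map<String, String> conf, int memory) {
        conf.put(MAP_MEMORY, Integer.toString(memory));
        return conf;
    }

    static Map<String, String> setReducerMemory(Map<String, String> conf, int memory) {
        conf.put(REDUCE_MEMORY, Integer.toString(memory));
        return conf;
    }

    /**
     * Hypothetical behaviour: when no explicit value is given, size the
     * application master from the task memory present at call time.
     */
    static Map<String, String> enableUberTask(Map<String, String> conf, Integer appMasterMemory) {
        // 1024 is an assumed fallback for this sketch.
        int mapMb = Integer.parseInt(conf.getOrDefault(MAP_MEMORY, "1024"));
        int reduceMb = Integer.parseInt(conf.getOrDefault(REDUCE_MEMORY, "1024"));
        int amMb = appMasterMemory != null ? appMasterMemory : Math.max(mapMb, reduceMb);
        conf.put(UBER_ENABLE, "true");
        conf.put(AM_MEMORY, Integer.toString(amMb));
        return conf;
    }
}
```

Under these assumptions, calling `setMapMemory` after `enableUberTask` leaves the application master sized from the fallback rather than from the intended task memory.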

enableMapOnlyUberTask

public static org.apache.hadoop.conf.Configuration enableMapOnlyUberTask(org.apache.hadoop.conf.Configuration configuration,
                                                                         Integer appMasterMemory,
                                                                         Integer appMasterCores)

Call setMapCoresPerTask() and setMapMemory() BEFORE calling this, as it uses their values.

Parameters:

configuration - The Hadoop configuration to update

Returns:

The configuration with the map-only uber task settings applied

setMapMemory

public static org.apache.hadoop.conf.Configuration setMapMemory(org.apache.hadoop.conf.Configuration configuration,
                                                                int memory)

setReducerMemory

public static org.apache.hadoop.conf.Configuration setReducerMemory(org.apache.hadoop.conf.Configuration configuration,
                                                                    int memory)

setAppMasterMemory

public static org.apache.hadoop.conf.Configuration setAppMasterMemory(org.apache.hadoop.conf.Configuration configuration,
                                                                      int memory)

setMapCoresPerTask

public static org.apache.hadoop.conf.Configuration setMapCoresPerTask(org.apache.hadoop.conf.Configuration configuration,
                                                                      int cores)

setReduceCoresPerTask

public static org.apache.hadoop.conf.Configuration setReduceCoresPerTask(org.apache.hadoop.conf.Configuration configuration,
                                                                         int cores)

setAppMasterCores

public static org.apache.hadoop.conf.Configuration setAppMasterCores(org.apache.hadoop.conf.Configuration configuration,
                                                                     int cores)

writeHadoopInputFileLinesToInputFile
```
public static void writeHadoopInputFileLinesToInputFile(List<Path> files,
                                                        Path inputFilePath)
                                                 throws IOException
```
Given a list of file paths, prepend 'file://' to every entry and write them as newline-separated lines to the given input file path.

Parameters:

files - A list of input file paths to operate on

inputFilePath - The path of the file to write the lines to

Throws:

IOException - If the input file path cannot be written to
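The behaviour described above can be sketched with the standard library alone. This is a minimal illustration, not the HadoopJobUtils source: the class and method names are hypothetical, `Path` is assumed to be `java.nio.file.Path`, and each path is written exactly as given (no normalisation to an absolute path).

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical helper mirroring the documented behaviour. */
final class InputFileSketch {

    /** Prefix every path with "file://", keeping the input order. */
    static List<String> toInputLines(List<Path> files) {
        return files.stream()
                .map(file -> "file://" + file)
                .collect(Collectors.toList());
    }

    /** Write one prefixed path per line to the given input file. */
    static void writeInputFile(List<Path> files, Path inputFilePath) throws IOException {
        Files.write(inputFilePath, toInputLines(files), StandardCharsets.UTF_8);
    }
}
```

Splitting the line construction from the write keeps the string logic testable without touching the filesystem.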

collectOutputLines
```
public static List<String> collectOutputLines(org.apache.hadoop.fs.FileSystem fileSystem,
                                              org.apache.hadoop.fs.Path outputFolder)
                                       throws IOException
```
Collects lines from a job's output files at a specified path. Also deletes the folder once the output has been collected.

Parameters:

fileSystem - The filesystem that the result is collected from.

outputFolder - The output folder to find the job result files in.

Returns:

A list of lines collected from all the output files.

Throws:

IOException - If the output folder or its contents cannot be read.
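Because the folder is deleted after collection, the call is effectively one-shot: the output cannot be read a second time. The following local-filesystem analogue illustrates the collect-then-delete pattern. It is a sketch under assumptions: it uses `java.nio.file` instead of the Hadoop `FileSystem` API, the class name is hypothetical, files are read in name order, and the folder is assumed to contain only regular files.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Local-filesystem analogue of the documented behaviour (hypothetical). */
final class CollectOutputSketch {

    static List<String> collectOutputLines(Path outputFolder) throws IOException {
        // List the regular files in the folder, sorted by name.
        List<Path> files;
        try (Stream<Path> entries = Files.list(outputFolder)) {
            files = entries.filter(Files::isRegularFile)
                    .sorted()
                    .collect(Collectors.toList());
        }

        // Read every file fully before anything is deleted.
        List<String> lines = new ArrayList<>();
        for (Path file : files) {
            lines.addAll(Files.readAllLines(file));
        }

        // Remove the files, then the now-empty folder.
        for (Path file : files) {
            Files.delete(file);
        }
        Files.delete(outputFolder);
        return lines;
    }
}
```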

getCDXRecordListFromCDXLines
```
public static List<CDXRecord> getCDXRecordListFromCDXLines(List<String> cdxLines)
```
Converts a list of CDX line strings to a list of CDXRecords.

Parameters:

cdxLines - The list to convert

Returns:

A list of CDXRecords representing the input lines
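A one-to-one conversion like this maps naturally onto the Stream API. The sketch below shows the shape of such a conversion. `Fields` is a hypothetical stand-in for `CDXRecord`, whose constructor is not shown in this page, and splitting on whitespace is an assumption about the line format made only for illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;

final class CdxLinesSketch {

    /** Hypothetical stand-in for CDXRecord: the whitespace-separated fields of one line. */
    static final class Fields {
        final List<String> values;

        Fields(String line) {
            this.values = Collections.unmodifiableList(
                    Arrays.asList(line.trim().split("\\s+")));
        }
    }

    /** Map each line to one record, preserving order. */
    static List<Fields> toRecords(List<String> cdxLines) {
        return cdxLines.stream()
                .map(Fields::new)
                .collect(Collectors.toList());
    }
}
```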

configureCaching

public static void configureCaching(org.apache.hadoop.conf.Configuration configuration)

setBatchQueue

public static void setBatchQueue(org.apache.hadoop.conf.Configuration conf)

setInteractiveQueue

public static void setInteractiveQueue(org.apache.hadoop.conf.Configuration conf)
