To better support configuring the Streaming environment whilst running .Net Streaming jobs, I have made a change to the “Generics based Framework for .Net Hadoop MapReduce Job Submission” code.
I have fixed a few bugs around setting job configuration options that were being controlled by the submission code. More importantly, however, I have added support for two additional command-line submission options:
- jobconf – to support setting standard job configuration values
- cmdenv – to support setting environment variables which are accessible to the Map/Reduce code
The full set of options for the command line submission is now:
-help (Required=false) : Display Help Text
-input (Required=true) : Input Directory or Files
-output (Required=true) : Output Directory
-mapper (Required=true) : Mapper Class
-reducer (Required=true) : Reducer Class
-combiner (Required=false) : Combiner Class (Optional)
-format (Required=false) : Input Format |Text(Default)|Binary|Xml|
-numberReducers (Required=false) : Number of Reduce Tasks (Optional)
-numberKeys (Required=false) : Number of MapReduce Keys (Optional)
-outputFormat (Required=false) : Reducer Output Format |Text|Binary| (Optional)
-file (Required=true) : Processing Files (Must include Map and Reduce Class files)
-nodename (Required=false) : XML Processing Nodename (Optional)
-cmdenv (Required=false) : List of Environment Variables for Job (Optional)
-jobconf (Required=false) : List of Job Configuration Parameters (Optional)
-debug (Required=false) : Turns on Debugging Options
Command Options
The job configuration option supports setting a list of standard job configuration values. As an example, to set the name of a job and to compress the Map output, which can improve performance, one would add:
-jobconf mapred.compress.map.output=true
-jobconf mapred.job.name=MobileDataDebug
For a complete list of options, one should review the Hadoop documentation.
The command environment option supports setting environment variables that are accessible to the Streaming process. Note, however, that non-alphanumeric characters are replaced with an underscore “_” character. As an example, if one wanted to pass a filter into the Streaming process, one would add:
-cmdenv DATA_FILTER=Windows
This would then be accessible in the Streaming process.
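As a rough sketch of how such a variable might be consumed, the Map/Reduce code could read it with the standard .NET Environment API; the FilterExample class and IsMatch method below are purely illustrative and not part of the framework:

using System;

public class FilterExample
{
    // Illustrative only: the class and method names are made up for the example
    public static bool IsMatch(string record)
    {
        // Reads the value supplied on the command line as: -cmdenv DATA_FILTER=Windows
        string filter = Environment.GetEnvironmentVariable("DATA_FILTER");

        // If no filter was supplied, accept every record
        if (string.IsNullOrEmpty(filter))
            return true;

        return record.Contains(filter);
    }
}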
Context Object
To support providing better feedback to the running Hadoop environment, I have added a new static class to the code, named Context. The Context class contains the original FormatKeys() and GetKeys() operations, along with the following additions:
public static class Context
{
    /// Formats the provided keys, TAB delimited, for a Map/Reduce output file
    public static string FormatKeys(string[] keys);

    /// Returns the split keys as an array of values
    public static string[] GetKeys(string key);

    /// Gets an environment variable value for runtime configuration
    public static string GetVariable(string name);

    /// Increments the specified counter for the given category
    public static void IncrementCounter(string category, string instance);

    /// Updates the specified counter for the given category
    public static void UpdateCounter(string category, string instance, int value);

    /// Emits the specified status information
    public static void UpdateStatus(string message);
}
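As a quick, hypothetical illustration of how these additions might be called from within Map or Reduce code (the counter category, counter names, and status text below are made up for the example):

// Illustrative usage only; the counter category, counter names and status text are invented
public static void ProcessRecord(string record, string filter)
{
    if (filter == null || record.Contains(filter))
    {
        // Surfaces as a custom counter in the Hadoop job counters
        Context.IncrementCounter("Mobile Data", "Records Processed");
    }
    else
    {
        Context.IncrementCounter("Mobile Data", "Records Skipped");
    }

    // Surfaces as the task status for the running task
    Context.UpdateStatus("Processing mobile data records");
}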
The code contained in this Context object, although simple, will hopefully provide some abstraction over the idiosyncrasies of using the Streaming interface.
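Those idiosyncrasies largely come down to the Hadoop Streaming reporter protocol: a Streaming task reports counter and status updates by writing specially formatted lines to standard error. The following is only a minimal sketch of how such helpers could be written, not the framework’s actual implementation:

using System;

// A minimal sketch, not the framework's actual code. Hadoop Streaming picks up
// lines written to standard error in the form
//   reporter:counter:<group>,<counter>,<amount>
//   reporter:status:<message>
// and turns them into job counters and task status updates.
public static class StreamingReporter
{
    public static void IncrementCounter(string category, string instance)
    {
        Console.Error.WriteLine("reporter:counter:{0},{1},1", category, instance);
    }

    public static void UpdateCounter(string category, string instance, int value)
    {
        Console.Error.WriteLine("reporter:counter:{0},{1},{2}", category, instance, value);
    }

    public static void UpdateStatus(string message)
    {
        Console.Error.WriteLine("reporter:status:{0}", message);
    }
}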