Framework for .Net Hadoop MapReduce Job Submission configuration update

To better support configuring the Streaming environment when running .Net Streaming jobs, I have made a change to the “Generics based Framework for .Net Hadoop MapReduce Job Submission” code.

I have fixed a few bugs around setting job configuration options that were being controlled by the submission code. More importantly, I have added support for two additional command-line submission options:

  • jobconf – to support setting standard job configuration values
  • cmdenv – to support setting environment variables which are accessible to the Map/Reduce code

The full set of options for the command line submission is now:

-help (Required=false) : Display Help Text
-input (Required=true) : Input Directory or Files
-output (Required=true) : Output Directory
-mapper (Required=true) : Mapper Class
-reducer (Required=true) : Reducer Class
-combiner (Required=false) : Combiner Class (Optional)
-format (Required=false) : Input Format |Text(Default)|Binary|Xml|
-numberReducers (Required=false) : Number of Reduce Tasks (Optional)
-numberKeys (Required=false) : Number of MapReduce Keys (Optional)
-outputFormat (Required=false) : Reducer Output Format |Text|Binary| (Optional)
-file (Required=true) : Processing Files (Must include Map and Reduce Class files)
-nodename (Required=false) : XML Processing Nodename (Optional)
-cmdenv (Required=false) : List of Environment Variables for Job (Optional)
-jobconf (Required=false) : List of Job Configuration Parameters (Optional)
-debug (Required=false) : Turns on Debugging Options

Command Options

The job configuration option supports providing a list of standard job options. As an example, to set the name of a job and compress the Map output, which could improve performance, one would add:

-jobconf mapred.compress.map.output=true
-jobconf mapred.job.name=MobileDataDebug

For a complete list of options one would need to review the Hadoop documentation.
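
When a job is launched from a script or from another program, the repeated -jobconf arguments can be generated from a set of key/value pairs. The following is a minimal sketch only; the JobOptions class and its ToArguments method are hypothetical helpers written for illustration and are not part of the submission framework.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not part of the framework): expands key/value pairs
// into repeated "-option key=value" command line arguments.
public static class JobOptions
{
    public static List<string> ToArguments(
        string option,
        IEnumerable<KeyValuePair<string, string>> settings)
    {
        var arguments = new List<string>();
        foreach (var setting in settings)
        {
            // Each setting becomes its own "-option" followed by "key=value".
            arguments.Add("-" + option);
            arguments.Add(setting.Key + "=" + setting.Value);
        }
        return arguments;
    }
}
```

Called with the option name "jobconf" and the two settings shown above, this returns the four arguments -jobconf, mapred.compress.map.output=true, -jobconf and mapred.job.name=MobileDataDebug, in the order the settings were supplied. The same helper would work for cmdenv. Values containing spaces would still need quoting when the arguments are joined into a single command line.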

The command environment option supports setting environment variables accessible to the Streaming process. However, it will replace non-alphanumeric characters with an underscore “_” character. As an example, if one wanted to pass a filter into the Streaming process one would add:

-cmdenv DATA_FILTER=Windows

The DATA_FILTER variable would then be accessible in the Streaming process.
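
Within a .Net mapper or reducer the value can be read with the standard Environment.GetEnvironmentVariable call. A minimal sketch follows. The StreamingEnvironment class, its method names and the fallback behaviour are assumptions made for illustration, and the name normalization assumes that the underscore substitution described above applies to the variable name.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of the framework.
public static class StreamingEnvironment
{
    // Replaces every character that is not an ASCII letter or digit with "_".
    // Assumption: this mirrors the substitution applied to variable names.
    public static string NormalizeName(string name)
    {
        return Regex.Replace(name, "[^A-Za-z0-9]", "_");
    }

    // Returns the variable's value, or the fallback when it is unset or empty.
    public static string GetValue(string name, string fallback)
    {
        string value = Environment.GetEnvironmentVariable(NormalizeName(name));
        return string.IsNullOrEmpty(value) ? fallback : value;
    }
}
```

A mapper could then call StreamingEnvironment.GetValue("DATA_FILTER", "") once at start-up and skip any input record that does not match the filter.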

Context Object

To provide better feedback to the running Hadoop environment, I have added a new class of static methods to the code, named Context. The Context object contains the original FormatKeys() and GetKeys() operations, along with the following additions:

public class Context
{
    public Context();

    /// Formats with a TAB the provided keys for a Map/Reduce output file
    public static string FormatKeys(string[] keys);

    /// Return the split keys as an array of values
    public static string[] GetKeys(string key);

    /// Gets the value of an environment variable used for runtime configuration
    public static string GetVariable(string name);

    /// Increments the specified counter for the given category
    public static void IncrementCounter(string category, string instance);

    /// Updates the specified counter for the given category
    public static void UpdateCounter(string category, string instance, int value);

    /// Emits the specified status information
    public static void UpdateStatus(string message);
}

The code contained in this Context object, although simple, will hopefully provide some abstraction over the idiosyncrasies of using the Streaming interface.
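
As an illustration of those idiosyncrasies: Hadoop Streaming picks up counter and status updates from specially formatted lines that the task writes to standard error (reporter:counter:<group>,<counter>,<amount> and reporter:status:<message>), and by default it separates keys from values with a TAB. The following is a minimal sketch of how methods with the signatures listed above could be written on that basis. It is not the framework's actual source: the class name ContextSketch is hypothetical and the method bodies are assumptions.

```csharp
using System;

// Hypothetical sketch; the real Context class may be implemented differently.
public static class ContextSketch
{
    // Joins the keys with a TAB, the default Streaming field separator.
    public static string FormatKeys(string[] keys)
    {
        return string.Join("\t", keys);
    }

    // Splits a TAB-delimited key back into its parts.
    public static string[] GetKeys(string key)
    {
        return key.Split('\t');
    }

    // Reads a runtime configuration value from the environment (null when unset).
    public static string GetVariable(string name)
    {
        return Environment.GetEnvironmentVariable(name);
    }

    // Adds 1 to the counter.
    public static void IncrementCounter(string category, string instance)
    {
        UpdateCounter(category, instance, 1);
    }

    // Adds the given amount to the counter using the Streaming reporter
    // protocol, which is read from standard error.
    public static void UpdateCounter(string category, string instance, int value)
    {
        Console.Error.WriteLine("reporter:counter:{0},{1},{2}", category, instance, value);
    }

    // Sets the task status text using the same reporter protocol.
    public static void UpdateStatus(string message)
    {
        Console.Error.WriteLine("reporter:status:{0}", message);
    }
}
```

Because the reporter line uses a comma as its delimiter, the category and instance names passed to the counter methods should not contain commas.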