Framework for .Net Hadoop MapReduce Job Submission Binary Output
To end the week I decided to make a minor change to the “Generics based Framework for .Net Hadoop MapReduce Job Submission”.
I have been doing some work on creating a co-occurrence matrix for item recommendations. I was going to map the process to one or more MapReduce jobs, but then came across the issue of how to output the vector data from the reducer. In the current framework the reducer outputs the key/value data in a string format. This works fine for simple data, but for a vector it quickly becomes problematic, as each vector has to be formatted as text and then parsed back by hand.
To resolve this I have added a parameter called “outputFormat”. The default output is the usual string format, which can also be requested explicitly with the parameter value “Text”. A parameter value of “Binary” is also supported:
MSDN.Hadoop.Submission.Console.exe
-input "mobile/data" -output "mobile/querytimes"
-mapper "MSDN.Hadoop.MapReduceFSharp.MobilePhoneQueryMapper, MSDN.Hadoop.MapReduceFSharp"
-reducer "MSDN.Hadoop.MapReduceFSharp.MobilePhoneQueryReducer, MSDN.Hadoop.MapReduceFSharp"
-outputFormat Binary
-file "C:\Projects\Release\MSDN.Hadoop.MapReduceFSharp.dll"
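To make the two modes concrete, here is a minimal sketch of how a reducer value might be rendered under each format. It is not the framework's actual code: the `OutputFormat` type and the `parseOutputFormat` and `formatValue` functions are hypothetical names, exact-case matching of the parameter value is an assumption, and the sketch assumes a .Net Framework runtime where `BinaryFormatter` is available.

```fsharp
open System
open System.IO
open System.Runtime.Serialization.Formatters.Binary

// Hypothetical type; the framework's own representation is not shown in the post.
type OutputFormat =
    | Text
    | Binary

// Map the -outputFormat argument to a format; a missing argument falls back to Text.
// Exact-case matching is an assumption made for this sketch.
let parseOutputFormat (arg: string option) =
    match arg with
    | None | Some "Text" -> Text
    | Some "Binary" -> Binary
    | Some other -> invalidArg "arg" (sprintf "Unsupported outputFormat: %s" other)

// Render a reducer value as the string that ends up in the output.
let formatValue (format: OutputFormat) (value: obj) =
    match format with
    | Text -> string value
    | Binary ->
        // Binary serialize the value, then encode the bytes as Base64 text.
        use stream = new MemoryStream()
        BinaryFormatter().Serialize(stream, value)
        Convert.ToBase64String(stream.ToArray())
```

With this shape, the Text branch keeps the existing string behaviour, while the Binary branch still produces a single line of text, so the output remains line oriented.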
When the output format is specified as binary, the reducer value is output as a binary serialized version of the data, represented as a Base64 string. When reading the reduced output, one can then easily deserialize this value back into a .Net type:
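open System
open System.IO
open System.Runtime.Serialization.Formatters.Binary

// Decode the Base64 text and deserialize the bytes back into a .Net object.
let Deserialize (value:string) =
    let bytes = Convert.FromBase64String(value)
    use stream = new MemoryStream(bytes)
    let formatter = new BinaryFormatter()
    formatter.Deserialize(stream)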
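Since Deserialize returns an untyped object, the caller still has to cast the result to the expected type. The sketch below shows one way this might look when reading a line of reduced output for the co-occurrence case. It rests on several assumptions: the key and value are tab separated, the value is a float array, and the `serialize` and `parseLine` helpers and the sample key `item42` are made up for illustration. It also assumes a .Net Framework runtime where `BinaryFormatter` is available.

```fsharp
open System
open System.IO
open System.Runtime.Serialization.Formatters.Binary

// Hypothetical inverse of Deserialize: object -> binary -> Base64 text.
let serialize (value: obj) =
    use stream = new MemoryStream()
    BinaryFormatter().Serialize(stream, value)
    Convert.ToBase64String(stream.ToArray())

// Same logic as the Deserialize function above.
let deserialize (value: string) =
    use stream = new MemoryStream(Convert.FromBase64String(value))
    BinaryFormatter().Deserialize(stream)

// Split an output line into its key and a typed value.
// Assumes the key and the Base64 value are separated by a single tab.
let parseLine<'T> (line: string) : string * 'T =
    match line.Split([| '\t' |], 2) with
    | [| key; encoded |] -> key, unbox<'T> (deserialize encoded)
    | _ -> failwithf "Malformed output line: %s" line

// Round trip with a made-up item key and co-occurrence vector.
let sampleLine = "item42\t" + serialize (box [| 1.0; 0.0; 3.5 |])
let key, vector = parseLine<float[]> sampleLine
printfn "%s -> %A" key vector
```

Base64 text never contains a tab, so splitting on the first tab cannot cut through the encoded value.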
Hopefully one will find this a lot simpler than performing string manipulations.