Apache Hadoop on Windows Azure Part 7 – Writing your very own WordCount Hadoop Job in Java and deploying to Windows Azure Cluster

In this article, I will help you writing your own WordCount Hadoop Job and then deploy it to Windows Azure Cluster for further processing.

 

Let’s create Java code file as “AvkashWordCount.java” as below:

 

package org.myorg;

import java.io.IOException;

import java.util.*;

import org.apache.hadoop.fs.Path;

import org.apache.hadoop.conf.*;

import org.apache.hadoop.io.*;

import org.apache.hadoop.util.*;

import org.apache.hadoop.mapreduce.Mapper;

import org.apache.hadoop.mapreduce.Reducer;

import org.apache.hadoop.conf.Configuration;

import org.apache.hadoop.conf.Configured;

import org.apache.hadoop.mapreduce.Job;

import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

 

 

public class AvkashWordCount {

             public static class Map extends Mapper

                                                                  <LongWritable, Text, Text, IntWritable> {

                           private final static IntWritable one = new IntWritable(1);

                           private Text word = new Text();

                          

                           public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

                                        String line = value.toString();

                                        StringTokenizer tokenizer = new StringTokenizer(line);

                                        while (tokenizer.hasMoreTokens()) {

                                                     word.set(tokenizer.nextToken());

                                                     context.write(word, one);

                                        }

                           }

             }

             public static class Reduce extends Reducer

                                                                  <Text, IntWritable, Text, IntWritable> {

                           public void reduce(Text key, Iterator<IntWritable> values, Context context) throws IOException, InterruptedException {

                                        int sum = 0;

                                        while (values.hasNext()) {

                                                     sum += values.next().get();

                                        }

                                        context.write(key, new IntWritable(sum));

                           }

             }

             public static void main(String[] args) throws Exception {

                           Configuration conf = new Configuration();

                           Job job = new Job(conf);

                           job.setJarByClass(AvkashWordCount.class);

                           job.setJobName("avkashwordcountjob");

                           job.setOutputKeyClass(Text.class);

                           job.setOutputValueClass(IntWritable.class);

                           job.setMapperClass(AvkashWordCount.Map.class);

                           job.setCombinerClass(AvkashWordCount.Reduce.class);

                           job.setReducerClass(AvkashWordCount.Reduce.class);

                           FileInputFormat.addInputPath(job, new Path(args[0]));

        FileOutputFormat.setOutputPath(job, new Path(args[1]));

                           job.waitForCompletion(true);

                           }

}

 

Let’s Compile the Java code first. You must have Hadoop 0.20 or above installed in your machined to use this code:

 

C:\Azure\Java>C:\Apps\java\openjdk7\bin\javac -classpath c:\Apps\dist\hadoop-core-0.20.203.1-SNAPSHOT.jar -d . AvkashWordCount.java

 

Now let’s crate the JAR file

C:\Azure\Java>C:\Apps\java\openjdk7\bin\jar -cvf AvkashWordCount.jar org

 added manifest

adding: org/(in = 0) (out= 0)(stored 0%)

adding: org/myorg/(in = 0) (out= 0)(stored 0%)

adding: org/myorg/AvkashWordCount$Map.class(in = 1893) (out= 792)(deflated 58%)

adding: org/myorg/AvkashWordCount$Reduce.class(in = 1378) (out= 596)(deflated 56%)

adding: org/myorg/AvkashWordCount.class(in = 1399) (out= 754)(deflated 46%)

 

Once Jar is created please deploy it to your Windows Azure Hadoop Cluster as below:

 

In the page below please follow all the steps as described below:

  • Step 1: Click Browse to select your "AvkashWordCount.Jar" file here
  • Step 2: Enter the Job name as defined in the source code
  • Step 3: Add the parameter as below
  • Step 4: Add folder name where files will be read to word count
  • Step 5: Add output folder name where the results will be stored
  • Step 6: Start the Job

 

 

 

Note: Be sure to have some data in your input folder. (Avkash I am using /user/avkash/inputfolder which has a text file with lots of word to be used as Word Count input file)

Once the job is stared, you will see the results as below:

 

avkashwordcountjob

•••

Job Info

Status: Completed Sucessfully Type: jar Start time: 12/31/2011 4:06:51 PM End time: 12/31/2011 4:07:53 PM Exit code: 0

Command

call hadoop.cmd jar AvkashWordCount.jar org.myorg.AvkashWordCount /user/avkash/inputfolder /user/avkash/outputfolder

Output (stdout)

 

Errors (stderr)

11/12/31 16:06:53 INFO input.FileInputFormat: Total input paths to process : 1 11/12/31 16:06:54 INFO mapred.JobClient: Running job: job_201112310614_0001 11/12/31 16:06:55 INFO mapred.JobClient: map 0% reduce 0% 11/12/31 16:07:20 INFO mapred.JobClient: map 100% reduce 0% 11/12/31 16:07:42 INFO mapred.JobClient: map 100% reduce 100% 11/12/31 16:07:53 INFO mapred.JobClient: Job complete: job_201112310614_0001 11/12/31 16:07:53 INFO mapred.JobClient: Counters: 25 11/12/31 16:07:53 INFO mapred.JobClient: Job Counters 11/12/31 16:07:53 INFO mapred.JobClient: Launched reduce tasks=1 11/12/31 16:07:53 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=29029 11/12/31 16:07:53 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0 11/12/31 16:07:53 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0 11/12/31 16:07:53 INFO mapred.JobClient: Launched map tasks=1 11/12/31 16:07:53 INFO mapred.JobClient: Data-local map tasks=1 11/12/31 16:07:53 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=18764 11/12/31 16:07:53 INFO mapred.JobClient: File Output Format Counters 11/12/31 16:07:53 INFO mapred.JobClient: Bytes Written=123 11/12/31 16:07:53 INFO mapred.JobClient: FileSystemCounters 11/12/31 16:07:53 INFO mapred.JobClient: FILE_BYTES_READ=709 11/12/31 16:07:53 INFO mapred.JobClient: HDFS_BYTES_READ=234 11/12/31 16:07:53 INFO mapred.JobClient: FILE_BYTES_WRITTEN=43709 11/12/31 16:07:53 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=123 11/12/31 16:07:53 INFO mapred.JobClient: File Input Format Counters 11/12/31 16:07:53 INFO mapred.JobClient: Bytes Read=108 11/12/31 16:07:53 INFO mapred.JobClient: Map-Reduce Framework 11/12/31 16:07:53 INFO mapred.JobClient: Reduce input groups=7 11/12/31 16:07:53 INFO mapred.JobClient: Map output materialized bytes=189 11/12/31 16:07:53 INFO mapred.JobClient: Combine output records=15 11/12/31 16:07:53 INFO mapred.JobClient: Map input records=15 11/12/31 16:07:53 INFO mapred.JobClient: Reduce shuffle bytes=0 11/12/31 16:07:53 INFO mapred.JobClient: Reduce output records=15 11/12/31 16:07:53 INFO mapred.JobClient: Spilled Records=30 11/12/31 16:07:53 INFO mapred.JobClient: Map output bytes=153 11/12/31 16:07:53 INFO mapred.JobClient: Combine input records=15 11/12/31 16:07:53 INFO mapred.JobClient: Map output records=15 11/12/31 16:07:53 INFO mapred.JobClient: SPLIT_RAW_BYTES=126 11/12/31 16:07:53 INFO mapred.JobClient: Reduce input records=15

 

 

Finally you can open output folder /user/avkash/outputfolder and read the Word Count results.

Keywords: Windows Azure, Hadoop, Apache, BigData, Cloud, MapReduce

Comments

  • Anonymous
    May 14, 2012
    Hi Avkash, You have compiled the java files using hadoop mahout core version 0.20 as shown below C:AzureJava>C:Appsjavaopenjdk7binjavac -classpath c:Appsdisthadoop-core-0.20.203.1-SNAPSHOT.jar -d . AvkashWordCount.java We have the jars from hadoop mahout 0.4 version, which we want to test in Azure hadoop, so are these jars compatible with the platform same as the above. Thanks