A preview version of Hadoop on Windows Azure is available. The details of that availability are at:
availability-of-community-technology-preview-ctp-of-hadoop-based-service-on-windows-azure.aspx
A good introduction to what Hadoop and Map/Reduce are is available at:
https://developer.yahoo.com/hadoop/tutorial/module4.html
As a developer using Hadoop, you write a mapper function and a reducer function, and Hadoop does the rest:
- it distributes the code to the nodes where the data resides
- it executes the code on those nodes
- it provides each reducer with all the values generated by the mappers for a same key
One of the most often used examples is the WordCount example.
In this WordCount example:
- the mapper function emits each word found as a key, and 1 as the value
- the reducer function adds the values for the same key
Thus, you get each word and its number of occurrences as the result of the map/reduce (see the short trace after the links below). This sample can be found in different places, including:
https://wiki.apache.org/hadoop/WordCount
https://hadoop.apache.org/common/docs/r0.20.2/mapred_tutorial.html
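For instance, if the input contains the line "the cat and the hat", the mappers emit (the, 1), (cat, 1), (and, 1), (the, 1), (hat, 1); Hadoop then groups these pairs by key, and the reducers output (and, 1), (cat, 1), (hat, 1), (the, 2).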
Let’s try this on a Hadoop on Azure cluster, after having changed the code so that it only counts words made of the letters a to z and at least 4 letters long.
Here is the code:
package com.benjguin.hadoopSamples;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class WordCount {

    public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                // split each token on non-letter characters, then keep only the words worth counting
                String[] wordsToCount = Utils.wordsToCount(tokenizer.nextToken());
                for (int i = 0; i < wordsToCount.length; i++) {
                    if (Utils.countThisWord(wordsToCount[i])) {
                        word.set(wordsToCount[i]);
                        output.collect(word, one); // emit (word, 1)
                    }
                }
            }
        }
    }

    public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            // sum the 1s emitted for this word by the mappers (and by the combiner)
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(Map.class);
        // the reducer is also used as a combiner, to pre-aggregate counts on the map side
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        // the input and output folders are passed as command-line arguments
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}
and:
package com.benjguin.hadoopSamples;

public class Utils {
    // lowercase the token and split it on every non-letter character;
    // this can produce empty strings, which countThisWord filters out
    public static String[] wordsToCount(String word) {
        return word.toLowerCase().split("[^a-zA-Z]");
    }

    // keep only words that are at least 4 letters long
    public static boolean countThisWord(String word) {
        return word.length() > 3;
    }
}
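As a quick sanity check outside the cluster, these helpers can be exercised from a plain main method (this small harness is not part of the original sample):

package com.benjguin.hadoopSamples;

public class UtilsDemo {
    public static void main(String[] args) {
        // "Hello, world!" is lowercased, then split on each non-letter character,
        // which yields ["hello", "", "world"]: the comma and the space each end a token
        for (String w : Utils.wordsToCount("Hello, world!")) {
            // the empty string (length 0) and any word of 3 letters or fewer are skipped
            if (Utils.countThisWord(w)) {
                System.out.println(w); // prints "hello" then "world"
            }
        }
    }
}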
The first step is to compile the code and generate a JAR file. This can be done with Eclipse, for instance:
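If Eclipse is not at hand, the same can be done from a command line; the Hadoop core JAR name below (hadoop-core-0.20.2.jar) is an assumption that depends on the Hadoop version you build against:

mkdir classes
javac -classpath hadoop-core-0.20.2.jar -d classes com/benjguin/hadoopSamples/WordCount.java com/benjguin/hadoopSamples/Utils.java
jar cf wordcount.jar -C classes .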
We also need some data. For that, it is possible to download a few books from the Gutenberg project.
Then, a Hadoop on Azure cluster is requested, as explained at:
https://social.technet.microsoft.com/wiki/contents/articles/6225.aspx
Let’s upload the files to HDFS (Hadoop’s distributed file system) by using the interactive JavaScript console:
Note: for large volumes of data, FTPS would be a better option. Please refer to How To FTP Data To Hadoop on Windows Azure.
Let’s create a folder and upload the 3 books into that HDFS folder:
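In the interactive JavaScript console, HDFS shell commands can be run with a # prefix, and fs.put() opens an upload dialog; the folder name gutenberg below is just an example, and the exact helpers may differ in your build of the preview:

js> #mkdir gutenberg
js> fs.put()    // choose a book and gutenberg as the destination, once per book
js> #ls gutenberg

The same result can also be obtained from a Hadoop command prompt with hadoop fs -mkdir and hadoop fs -put.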
Then it is possible to create the job.
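The job takes the JAR file, the main class name, and two parameters: the input and output folders. An equivalent command line would be the following, where gutenberg is the input folder created above and gutenbergcount is a hypothetical output folder:

hadoop jar wordcount.jar com.benjguin.hadoopSamples.WordCount gutenberg gutenbergcount

Once submitted, the job reports its progress and counters: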
11/12/19 17:51:27 INFO mapred.FileInputFormat: Total input paths to process : 3
11/12/19 17:51:27 INFO mapred.JobClient: Running job: job_201112190923_0004
11/12/19 17:51:28 INFO mapred.JobClient: map 0% reduce 0%
11/12/19 17:51:53 INFO mapred.JobClient: map 25% reduce 0%
11/12/19 17:51:54 INFO mapred.JobClient: map 75% reduce 0%
11/12/19 17:51:55 INFO mapred.JobClient: map 100% reduce 0%
11/12/19 17:52:14 INFO mapred.JobClient: map 100% reduce 100%
11/12/19 17:52:25 INFO mapred.JobClient: Job complete: job_201112190923_0004
11/12/19 17:52:25 INFO mapred.JobClient: Counters: 26
11/12/19 17:52:25 INFO mapred.JobClient: Job Counters
11/12/19 17:52:25 INFO mapred.JobClient: Launched reduce tasks=1
11/12/19 17:52:25 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=57703
11/12/19 17:52:25 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
11/12/19 17:52:25 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
11/12/19 17:52:25 INFO mapred.JobClient: Launched map tasks=4
11/12/19 17:52:25 INFO mapred.JobClient: Data-local map tasks=4
11/12/19 17:52:25 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=18672
11/12/19 17:52:25 INFO mapred.JobClient: File Input Format Counters
11/12/19 17:52:25 INFO mapred.JobClient: Bytes Read=1554158
11/12/19 17:52:25 INFO mapred.JobClient: File Output Format Counters
11/12/19 17:52:25 INFO mapred.JobClient: Bytes Written=186556
11/12/19 17:52:25 INFO mapred.JobClient: FileSystemCounters
11/12/19 17:52:25 INFO mapred.JobClient: FILE_BYTES_READ=427145
11/12/19 17:52:25 INFO mapred.JobClient: HDFS_BYTES_READ=1554642
11/12/19 17:52:25 INFO mapred.JobClient: FILE_BYTES_WRITTEN=964132
11/12/19 17:52:25 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=186556
11/12/19 17:52:25 INFO mapred.JobClient: Map-Reduce Framework
11/12/19 17:52:25 INFO mapred.JobClient: Map output materialized bytes=426253
11/12/19 17:52:25 INFO mapred.JobClient: Map input records=19114
11/12/19 17:52:25 INFO mapred.JobClient: Reduce shuffle bytes=426253
11/12/19 17:52:25 INFO mapred.JobClient: Spilled Records=60442
11/12/19 17:52:25 INFO mapred.JobClient: Map output bytes=1482365
11/12/19 17:52:25 INFO mapred.JobClient: Map input bytes=1535450
11/12/19 17:52:25 INFO mapred.JobClient: Combine input records=135431
11/12/19 17:52:25 INFO mapred.JobClient: SPLIT_RAW_BYTES=484
11/12/19 17:52:25 INFO mapred.JobClient: Reduce input records=30221
11/12/19 17:52:25 INFO mapred.JobClient: Reduce input groups=17618
11/12/19 17:52:25 INFO mapred.JobClient: Combine output records=30221
11/12/19 17:52:25 INFO mapred.JobClient: Reduce output records=17618
11/12/19 17:52:25 INFO mapred.JobClient: Map output records=135431
Go back to the interactive JavaScript console.
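In the preview, the console exposes a fluent Pig-based query API. The query below is only a sketch modeled on the preview's samples, where gutenbergcount is the output folder of the previous job and gutenbergtop10 is a hypothetical destination folder; the exact syntax may differ:

js> pig.from("gutenbergcount").orderBy("count DESC").take(10).to("gutenbergtop10")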
This generates another Map/Reduce job that will sort the result:
(…)
Then, it is possible to read the data back and show it as a chart:
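A sequence along the following lines should do it; fs.read(), parse() and graph.bar() are the console helpers used by the preview's tutorial, and gutenbergtop10 is the hypothetical folder from the previous step:

js> file = fs.read("gutenbergtop10")
js> data = parse(file.data, "word, count:long")
js> graph.bar(data)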
It is also possible to have a more complete console by connecting to the cluster through Remote Desktop (RDP).