A preview version of Hadoop on Windows Azure is available. The details of that availability are at:
availability-of-community-technology-preview-ctp-of-hadoop-based-service-on-windows-azure.aspx
A good introduction to what Hadoop and Map/Reduce are is available at:
https://developer.yahoo.com/hadoop/tutorial/module4.html
As a developer using Hadoop, you write a mapper function and a reducer function, and Hadoop does the rest:
- it distributes the code to the nodes where the data resides
- it executes the code on those nodes
- it provides each reducer with all the values generated by the mappers for a same key
One of the most often used examples is the WordCount example.
In this WordCount example:
- the mapper function emits each word found as a key, and 1 as the value
- the reducer function adds the values for the same key
Thus, you get each word and its number of occurrences as the result of the map/reduce (see the short trace after the links below). This sample can be found in different places, including:
https://wiki.apache.org/hadoop/WordCount
https://hadoop.apache.org/common/docs/r0.20.2/mapred_tutorial.html
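For instance, if the input contains the line "the cat and the hat", the mappers emit (the, 1), (cat, 1), (and, 1), (the, 1), (hat, 1); Hadoop then groups these pairs by key, and the reducers output (and, 1), (cat, 1), (hat, 1), (the, 2).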
Let’s try this on a Hadoop on Azure cluster, after having changed the code so that it only counts words made of the letters a to z and at least 4 letters long.
Here is the code:
package com.benjguin.hadoopSamples;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class WordCount {

    public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                // split each token on non-letter characters, then keep only the words worth counting
                String[] wordsToCount = Utils.wordsToCount(tokenizer.nextToken());
                for (int i = 0; i < wordsToCount.length; i++) {
                    if (Utils.countThisWord(wordsToCount[i])) {
                        word.set(wordsToCount[i]);
                        output.collect(word, one); // emit (word, 1)
                    }
                }
            }
        }
    }

    public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            // sum the 1s emitted for this word by the mappers (and by the combiner)
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(Map.class);
        // the reducer is also used as a combiner, to pre-aggregate counts on the map side
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        // the input and output folders are passed as command-line arguments
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}
and:
package com.benjguin.hadoopSamples;

public class Utils {
    // lowercase the token and split it on every non-letter character;
    // this can produce empty strings, which countThisWord filters out
    public static String[] wordsToCount(String word) {
        return word.toLowerCase().split("[^a-zA-Z]");
    }

    // keep only words that are at least 4 letters long
    public static boolean countThisWord(String word) {
        return word.length() > 3;
    }
}
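As a quick sanity check outside the cluster, these helpers can be exercised from a plain main method (this small harness is not part of the original sample):

package com.benjguin.hadoopSamples;

public class UtilsDemo {
    public static void main(String[] args) {
        // "Hello, world!" is lowercased, then split on each non-letter character,
        // which yields ["hello", "", "world"]: the comma and the space each end a token
        for (String w : Utils.wordsToCount("Hello, world!")) {
            // the empty string (length 0) and any word of 3 letters or fewer are skipped
            if (Utils.countThisWord(w)) {
                System.out.println(w); // prints "hello" then "world"
            }
        }
    }
}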
The first step is to compile the code and generate a JAR file. This can be done with Eclipse, for instance:
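If Eclipse is not at hand, the same can be done from a command line; the Hadoop core JAR name below (hadoop-core-0.20.2.jar) is an assumption that depends on the Hadoop version you build against:

mkdir classes
javac -classpath hadoop-core-0.20.2.jar -d classes com/benjguin/hadoopSamples/WordCount.java com/benjguin/hadoopSamples/Utils.java
jar cf wordcount.jar -C classes .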
We also need some data. For that, it is possible to download a few books from the Gutenberg project.
Then, a Hadoop on Azure cluster is requested, as explained at:
https://social.technet.microsoft.com/wiki/contents/articles/6225.aspx
Let’s upload the files to HDFS (Hadoop’s distributed file system) by using the interactive JavaScript console:
Note: for large volumes of data, FTPS would be a better option. Please refer to How To FTP Data To Hadoop on Windows Azure.
Let’s create a folder and upload the 3 books into that HDFS folder:
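In the interactive JavaScript console, HDFS shell commands can be run with a # prefix, and fs.put() opens an upload dialog; the folder name gutenberg below is just an example, and the exact helpers may differ in your build of the preview:

js> #mkdir gutenberg
js> fs.put()    // choose a book and gutenberg as the destination, once per book
js> #ls gutenberg

The same result can also be obtained from a Hadoop command prompt with hadoop fs -mkdir and hadoop fs -put.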
Then it is possible to create the job.
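The job takes the JAR file, the main class name, and two parameters: the input and output folders. An equivalent command line would be the following, where gutenberg is the input folder created above and gutenbergcount is a hypothetical output folder:

hadoop jar wordcount.jar com.benjguin.hadoopSamples.WordCount gutenberg gutenbergcount

Once submitted, the job reports its progress and counters: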
11/12/19 17:51:27 INFO mapred.FileInputFormat: Total input paths to process : 3
11/12/19 17:51:27 INFO mapred.JobClient: Running job: job_201112190923_0004
11/12/19 17:51:28 INFO mapred.JobClient: map 0% reduce 0%
11/12/19 17:51:53 INFO mapred.JobClient: map 25% reduce 0%
11/12/19 17:51:54 INFO mapred.JobClient: map 75% reduce 0%
11/12/19 17:51:55 INFO mapred.JobClient: map 100% reduce 0%
11/12/19 17:52:14 INFO mapred.JobClient: map 100% reduce 100%
11/12/19 17:52:25 INFO mapred.JobClient: Job complete: job_201112190923_0004
11/12/19 17:52:25 INFO mapred.JobClient: Counters: 26
11/12/19 17:52:25 INFO mapred.JobClient: Job Counters
11/12/19 17:52:25 INFO mapred.JobClient: Launched reduce tasks=1
11/12/19 17:52:25 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=57703
11/12/19 17:52:25 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
11/12/19 17:52:25 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
11/12/19 17:52:25 INFO mapred.JobClient: Launched map tasks=4
11/12/19 17:52:25 INFO mapred.JobClient: Data-local map tasks=4
11/12/19 17:52:25 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=18672
11/12/19 17:52:25 INFO mapred.JobClient: File Input Format Counters
11/12/19 17:52:25 INFO mapred.JobClient: Bytes Read=1554158
11/12/19 17:52:25 INFO mapred.JobClient: File Output Format Counters
11/12/19 17:52:25 INFO mapred.JobClient: Bytes Written=186556
11/12/19 17:52:25 INFO mapred.JobClient: FileSystemCounters
11/12/19 17:52:25 INFO mapred.JobClient: FILE_BYTES_READ=427145
11/12/19 17:52:25 INFO mapred.JobClient: HDFS_BYTES_READ=1554642
11/12/19 17:52:25 INFO mapred.JobClient: FILE_BYTES_WRITTEN=964132
11/12/19 17:52:25 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=186556
11/12/19 17:52:25 INFO mapred.JobClient: Map-Reduce Framework
11/12/19 17:52:25 INFO mapred.JobClient: Map output materialized bytes=426253
11/12/19 17:52:25 INFO mapred.JobClient: Map input records=19114
11/12/19 17:52:25 INFO mapred.JobClient: Reduce shuffle bytes=426253
11/12/19 17:52:25 INFO mapred.JobClient: Spilled Records=60442
11/12/19 17:52:25 INFO mapred.JobClient: Map output bytes=1482365
11/12/19 17:52:25 INFO mapred.JobClient: Map input bytes=1535450
11/12/19 17:52:25 INFO mapred.JobClient: Combine input records=135431
11/12/19 17:52:25 INFO mapred.JobClient: SPLIT_RAW_BYTES=484
11/12/19 17:52:25 INFO mapred.JobClient: Reduce input records=30221
11/12/19 17:52:25 INFO mapred.JobClient: Reduce input groups=17618
11/12/19 17:52:25 INFO mapred.JobClient: Combine output records=30221
11/12/19 17:52:25 INFO mapred.JobClient: Reduce output records=17618
11/12/19 17:52:25 INFO mapred.JobClient: Map output records=135431
Go back to the interactive JavaScript console.
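In the preview, the console exposes a fluent Pig-based query API. The query below is only a sketch modeled on the preview's samples, where gutenbergcount is the output folder of the previous job and gutenbergtop10 is a hypothetical destination folder; the exact syntax may differ:

js> pig.from("gutenbergcount").orderBy("count DESC").take(10).to("gutenbergtop10")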
This generates another Map/Reduce job that will sort the result:
(…)
Then, it is possible to read the data back and show it as a chart:
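A sequence along the following lines should do it; fs.read(), parse() and graph.bar() are the console helpers used by the preview's tutorial, and gutenbergtop10 is the hypothetical folder from the previous step:

js> file = fs.read("gutenbergtop10")
js> data = parse(file.data, "word, count:long")
js> graph.bar(data)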
It is also possible to have a more complete console by connecting to the cluster through Remote Desktop (RDP).