Apache Hadoop on Windows Azure Part 10 - Running a JavaScript Map/Reduce Job from Interactive JavaScript Console
The Microsoft distribution of Apache Hadoop on Windows Azure lets you run JavaScript Map/Reduce jobs directly from the web-based Interactive JavaScript Console. To start, let's write the JavaScript code for a Map/Reduce word count job:
File name: wordcount.js
// Split each input line into words and emit (word, 1) for every non-empty token.
var map = function (key, value, context) {
    var words = value.split(/[^a-zA-Z]/);
    for (var i = 0; i < words.length; i++) {
        if (words[i] !== "") {
            context.write(words[i].toLowerCase(), 1);
        }
    }
};

// Add up all the counts emitted for one word and write the total once.
var reduce = function (key, values, context) {
    var sum = 0;
    while (values.hasNext()) {
        sum += parseInt(values.next(), 10);
    }
    context.write(key, sum);
};
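Before uploading the script, it can help to check the logic outside the cluster. The following is a minimal sketch, assuming a plain JavaScript runtime such as Node.js. The makeIterator and runLocally helpers are hypothetical stand-ins for the values iterator and the context object that Hadoop supplies; they are not part of the console's API.

```javascript
// Same map and reduce functions as in wordcount.js.
var map = function (key, value, context) {
    var words = value.split(/[^a-zA-Z]/);
    for (var i = 0; i < words.length; i++) {
        if (words[i] !== "") {
            context.write(words[i].toLowerCase(), 1);
        }
    }
};

var reduce = function (key, values, context) {
    var sum = 0;
    while (values.hasNext()) {
        sum += parseInt(values.next(), 10);
    }
    context.write(key, sum);
};

// Hypothetical stand-in for the values iterator passed to reduce().
function makeIterator(items) {
    var index = 0;
    return {
        hasNext: function () { return index < items.length; },
        next: function () { return items[index++]; }
    };
}

// Hypothetical local driver: map every line, group values by key, reduce each group.
function runLocally(lines) {
    var groups = {};
    var mapContext = {
        write: function (key, value) {
            if (!Object.prototype.hasOwnProperty.call(groups, key)) {
                groups[key] = [];
            }
            groups[key].push(value);
        }
    };
    lines.forEach(function (line, lineNumber) {
        map(lineNumber, line, mapContext);
    });

    var counts = {};
    var reduceContext = {
        write: function (key, value) { counts[key] = value; }
    };
    Object.keys(groups).forEach(function (key) {
        reduce(key, makeIterator(groups[key]), reduceContext);
    });
    return counts;
}

var counts = runLocally(["The quick brown fox", "the lazy dog and THE fox"]);
console.log(counts);
```

The driver only imitates the map, group, and reduce phases in memory, so it says nothing about how the real job is scheduled on the cluster.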
Next, upload the wordcount.js file to HDFS and verify that it is there:
js> fs.put()
js> #ls
Found 2 items
drwxr-xr-x - avkash supergroup 0 2012-01-02 20:25 /user/avkash/.oink
-rw-r--r-- 3 avkash supergroup 418 2012-01-02 20:17 /user/avkash/wordcount.js
Now create a folder named “wordsfolder” and upload a few .txt files to it. We will use this folder as the input for the word count Map/Reduce job.
js> #ls
Found 3 items
drwxr-xr-x - avkash supergroup 0 2012-01-02 20:25 /user/avkash/.oink
-rw-r--r-- 3 avkash supergroup 418 2012-01-02 20:17 /user/avkash/wordcount.js
drwxr-xr-x - avkash supergroup 0 2012-01-02 20:22 /user/avkash/wordsfolder
js> #ls wordsfolder
Found 3 items
-rw-r--r-- 3 avkash supergroup 1395667 2012-01-02 20:22 /user/avkash/wordsfolder/davinci.txt
-rw-r--r-- 3 avkash supergroup 674762 2012-01-02 20:22 /user/avkash/wordsfolder/outlineofscience.txt
-rw-r--r-- 3 avkash supergroup 1573044 2012-01-02 20:22 /user/avkash/wordsfolder/ulysses.txt
Now we can run the JavaScript Map/Reduce job, which counts the words, keeps the 15 most frequent in descending order of count, and stores the result in a folder named “top15words”:
js> from("wordsfolder").mapReduce("wordcount.js", "word, count:long").orderBy("count DESC").take(15).to("top15words")
View Log
If you open the “View Log” link above in a new tab, you can follow the activity of the Map/Reduce job. I have added the full log at the end of this post.
Finally, when the job completes, the folder “top15words” is created:
js> #ls
Found 4 items
drwxr-xr-x - avkash supergroup 0 2012-01-02 20:26 /user/avkash/.oink
drwxr-xr-x - avkash supergroup 0 2012-01-02 20:31 /user/avkash/top15words
-rw-r--r-- 3 avkash supergroup 418 2012-01-02 20:17 /user/avkash/wordcount.js
drwxr-xr-x - avkash supergroup 0 2012-01-02 20:22 /user/avkash/wordsfolder
Now we can read the data from the “top15words” folder:
js> file = fs.read("top15words")
the 47430
of 25263
and 18664
a 14213
in 13125
to 12634
is 7876
that 7057
it 7005
on 5081
he 5037
with 4931
his 4314
as 4289
by 4119
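The orderBy("count DESC").take(15) part of the query is what turns the full word count into the short list above. As a rough in-memory picture of that step, here is a sketch in plain JavaScript. The topN helper is hypothetical and is not how the console runs the query (the job log reports the Pig features ORDER_BY and LIMIT for it); the sample rows are taken from the output above.

```javascript
// Hypothetical in-memory stand-in for orderBy("count DESC").take(n).
function topN(rows, n) {
    return rows
        .slice() // copy so the caller's array is not reordered
        .sort(function (a, b) { return b.count - a.count; })
        .slice(0, n);
}

// A few rows from the result above, deliberately out of order.
var rows = [
    { word: "in", count: 13125 },
    { word: "the", count: 47430 },
    { word: "a", count: 14213 },
    { word: "of", count: 25263 },
    { word: "and", count: 18664 }
];

var top3 = topN(rows, 3);
console.log(top3.map(function (r) { return r.word + "\t" + r.count; }).join("\n"));
```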
Let’s also parse the data into records, using the same schema as in the query:
js> data = parse(file.data,"word, count:long")
0: {
word: "the"
count: 47430
1: {
word: "of"
count: 25263
2: {
word: "and"
count: 18664
3: {
word: "a"
count: 14213
4: {
word: "in"
count: 13125
5: {
word: "to"
count: 12634
6: {
word: "is"
count: 7876
7: {
word: "that"
count: 7057
8: {
word: "it"
count: 7005
9: {
word: "on"
count: 5081
10: {
word: "he"
count: 5037
11: {
word: "with"
count: 4931
12: {
word: "his"
count: 4314
13: {
word: "as"
count: 4289
14: {
word: "by"
count: 4119
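To make the parsing step above less of a black box, here is a minimal sketch of what turning the raw text into records could look like. It assumes the file is tab-separated with one record per line and that the schema string is a comma-separated list of name or name:type entries. The parseRows function is hypothetical and only imitates the console's parse(); the real function may behave differently.

```javascript
// Hypothetical re-implementation for illustration; not the console's parse().
function parseRows(text, schema) {
    // "word, count:long" -> [{ name: "word", type: "string" }, { name: "count", type: "long" }]
    var fields = schema.split(",").map(function (part) {
        var pieces = part.trim().split(":");
        return { name: pieces[0], type: pieces[1] || "string" };
    });

    return text
        .split("\n")
        .filter(function (line) { return line.trim() !== ""; })
        .map(function (line) {
            var cells = line.split("\t"); // assumption: tab-separated columns
            var row = {};
            fields.forEach(function (field, i) {
                row[field.name] = field.type === "long"
                    ? parseInt(cells[i], 10)
                    : cells[i];
            });
            return row;
        });
}

var sample = "the\t47430\nof\t25263\nand\t18664\n";
var data = parseRows(sample, "word, count:long");
console.log(data[0].word, data[0].count);
```

Declaring count as long in the schema is what makes the values come back as numbers rather than strings, which matters once the records are sorted or charted.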
Finally, let's create a line graph from the results:
Here is the log of the Map/Reduce job:
2012-01-02 20:26:52,304 [main] INFO org.apache.pig.Main - Logging error messages to: c:\apps\dist\bin\pig_1325536012304.log
2012-01-02 20:26:52,570 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://
2012-01-02 20:26:53,038 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at:
2012-01-02 20:26:53,304 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: ORDER_BY,LIMIT,NATIVE
2012-01-02 20:26:53,304 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - pig.usenewlogicalplan is set to true. New logical plan will be used.
2012-01-02 20:26:53,507 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - (Name: q2: Store(hdfs:// - scope-12 Operator Key: scope-12)
2012-01-02 20:26:53,523 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
2012-01-02 20:26:53,742 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 5
2012-01-02 20:26:53,742 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 5
2012-01-02 20:26:53,945 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job
2012-01-02 20:26:53,992 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2012-01-02 20:26:55,179 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
2012-01-02 20:26:55,210 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
2012-01-02 20:26:55,710 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2012-01-02 20:26:55,835 [Thread-4] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 3
2012-01-02 20:26:55,835 [Thread-4] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 3
2012-01-02 20:26:55,882 [Thread-4] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1
2012-01-02 20:26:57,226 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_201201021955_0002
2012-01-02 20:26:57,226 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - More information at:
2012-01-02 20:27:28,772 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 10% complete
2012-01-02 20:27:40,771 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 20% complete
2012-01-02 20:27:42,646 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2012-01-02 20:27:43,209 [main] INFO org.apache.hadoop.mapred.JobClient - Running job: job_201201021955_0003
2012-01-02 20:27:44,224 [main] INFO org.apache.hadoop.mapred.JobClient - map 0% reduce 0%
2012-01-02 20:28:12,223 [main] INFO org.apache.hadoop.mapred.JobClient - map 100% reduce 0%
2012-01-02 20:28:36,223 [main] INFO org.apache.hadoop.mapred.JobClient - map 100% reduce 100%
2012-01-02 20:28:47,222 [main] INFO org.apache.hadoop.mapred.JobClient - Job complete: job_201201021955_0003
2012-01-02 20:28:47,222 [main] INFO org.apache.hadoop.mapred.JobClient - Counters: 25
2012-01-02 20:28:47,222 [main] INFO org.apache.hadoop.mapred.JobClient - Job Counters
2012-01-02 20:28:47,222 [main] INFO org.apache.hadoop.mapred.JobClient - Launched reduce tasks=1
2012-01-02 20:28:47,222 [main] INFO org.apache.hadoop.mapred.JobClient - SLOTS_MILLIS_MAPS=32061
2012-01-02 20:28:47,238 [main] INFO org.apache.hadoop.mapred.JobClient - Total time spent by all reduces waiting after reserving slots (ms)=0
2012-01-02 20:28:47,238 [main] INFO org.apache.hadoop.mapred.JobClient - Total time spent by all maps waiting after reserving slots (ms)=0
2012-01-02 20:28:47,238 [main] INFO org.apache.hadoop.mapred.JobClient - Launched map tasks=1
2012-01-02 20:28:47,238 [main] INFO org.apache.hadoop.mapred.JobClient - Data-local map tasks=1
2012-01-02 20:28:47,238 [main] INFO org.apache.hadoop.mapred.JobClient - SLOTS_MILLIS_REDUCES=21531
2012-01-02 20:28:47,238 [main] INFO org.apache.hadoop.mapred.JobClient - File Output Format Counters
2012-01-02 20:28:47,238 [main] INFO org.apache.hadoop.mapred.JobClient - Bytes Written=424066
2012-01-02 20:28:47,238 [main] INFO org.apache.hadoop.mapred.JobClient - FileSystemCounters
2012-01-02 20:28:47,238 [main] INFO org.apache.hadoop.mapred.JobClient - FILE_BYTES_READ=11850310
2012-01-02 20:28:47,238 [main] INFO org.apache.hadoop.mapred.JobClient - HDFS_BYTES_READ=3597791
2012-01-02 20:28:47,238 [main] INFO org.apache.hadoop.mapred.JobClient - FILE_BYTES_WRITTEN=17819374
2012-01-02 20:28:47,238 [main] INFO org.apache.hadoop.mapred.JobClient - HDFS_BYTES_WRITTEN=424066
2012-01-02 20:28:47,238 [main] INFO org.apache.hadoop.mapred.JobClient - File Input Format Counters
2012-01-02 20:28:47,238 [main] INFO org.apache.hadoop.mapred.JobClient - Bytes Read=3597657
2012-01-02 20:28:47,238 [main] INFO org.apache.hadoop.mapred.JobClient - Map-Reduce Framework
2012-01-02 20:28:47,238 [main] INFO org.apache.hadoop.mapred.JobClient - Reduce input groups=39491
2012-01-02 20:28:47,238 [main] INFO org.apache.hadoop.mapred.JobClient - Map output materialized bytes=5924329
2012-01-02 20:28:47,238 [main] INFO org.apache.hadoop.mapred.JobClient - Combine output records=0
2012-01-02 20:28:47,238 [main] INFO org.apache.hadoop.mapred.JobClient - Map input records=77934
2012-01-02 20:28:47,238 [main] INFO org.apache.hadoop.mapred.JobClient - Reduce shuffle bytes=0
2012-01-02 20:28:47,238 [main] INFO org.apache.hadoop.mapred.JobClient - Reduce output records=39491
2012-01-02 20:28:47,238 [main] INFO org.apache.hadoop.mapred.JobClient - Spilled Records=1890066
2012-01-02 20:28:47,238 [main] INFO org.apache.hadoop.mapred.JobClient - Map output bytes=4664279
2012-01-02 20:28:47,238 [main] INFO org.apache.hadoop.mapred.JobClient - Combine input records=0
2012-01-02 20:28:47,238 [main] INFO org.apache.hadoop.mapred.JobClient - Map output records=630022
2012-01-02 20:28:47,238 [main] INFO org.apache.hadoop.mapred.JobClient - SPLIT_RAW_BYTES=134
2012-01-02 20:28:47,238 [main] INFO org.apache.hadoop.mapred.JobClient - Reduce input records=630022
2012-01-02 20:28:47,238 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 40% complete
2012-01-02 20:28:47,238 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job
2012-01-02 20:28:47,238 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2012-01-02 20:28:48,629 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
2012-01-02 20:28:48,644 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
2012-01-02 20:28:49,035 [Thread-24] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2012-01-02 20:28:49,035 [Thread-24] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
2012-01-02 20:28:49,035 [Thread-24] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1
2012-01-02 20:28:50,050 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_201201021955_0004
2012-01-02 20:28:50,050 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - More information at:
2012-01-02 20:29:17,550 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 50% complete
2012-01-02 20:29:20,049 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 50% complete
2012-01-02 20:29:25,049 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 50% complete
2012-01-02 20:29:29,549 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job
2012-01-02 20:29:29,549 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2012-01-02 20:29:30,768 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
2012-01-02 20:29:30,830 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
2012-01-02 20:29:31,205 [Thread-34] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2012-01-02 20:29:31,205 [Thread-34] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
2012-01-02 20:29:31,205 [Thread-34] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1
2012-01-02 20:29:31,330 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 60% complete
2012-01-02 20:29:32,252 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_201201021955_0005
2012-01-02 20:29:32,252 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - More information at:
2012-01-02 20:30:11,251 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 70% complete
2012-01-02 20:30:12,251 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 70% complete
2012-01-02 20:30:17,251 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 70% complete
2012-01-02 20:30:22,251 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 70% complete
2012-01-02 20:30:27,250 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 70% complete
2012-01-02 20:30:32,750 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 80% complete
2012-01-02 20:30:37,250 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 80% complete
2012-01-02 20:30:42,250 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 80% complete
2012-01-02 20:30:46,765 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job
2012-01-02 20:30:46,765 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2012-01-02 20:30:47,937 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
2012-01-02 20:30:47,984 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
2012-01-02 20:30:48,406 [Thread-45] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2012-01-02 20:30:48,406 [Thread-45] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
2012-01-02 20:30:48,406 [Thread-45] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1
2012-01-02 20:30:48,484 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 80% complete
2012-01-02 20:30:49,390 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_201201021955_0006
2012-01-02 20:30:49,390 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - More information at:
2012-01-02 20:31:17,889 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 90% complete
2012-01-02 20:31:19,389 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 90% complete
2012-01-02 20:31:24,389 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 90% complete
2012-01-02 20:31:34,389 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 90% complete
2012-01-02 20:31:48,982 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2012-01-02 20:31:48,998 [main] INFO org.apache.pig.tools.pigstats.PigStats - Script Statistics:
HadoopVersion PigVersion UserId StartedAt FinishedAt Features
0.8.1-SNAPSHOT avkash 2012-01-02 20:26:53 2012-01-02 20:31:48 ORDER_BY,LIMIT,NATIVE
Job Stats (time in seconds):
JobId Maps Reduces MaxMapTime MinMapTIme AvgMapTime MaxReduceTime MinReduceTime AvgReduceTime Alias Feature Outputs
job_201201021955_0002 1 0 15 15 15 0 0 0 q0 MAP_ONLY
job_201201021955_0004 1 0 12 12 12 0 0 0 q1 MAP_ONLY
job_201201021955_0005 1 1 11 11 11 21 21 21 q2 SAMPLER
job_201201021955_0006 1 1 12 12 12 18 18 18 q2 ORDER_BY,COMBINER hdfs://,
job_D:/Users/avkash/AppData/Local/Temp/MRjs1699097122276446870.jar__0001 0 0 0 0 0 0 0 0 NATIVE
Input(s):
Successfully read 77934 records (3644014 bytes) from: "hdfs://"
Successfully read 39491 records (424454 bytes) from: "hdfs://"
Output(s):
Successfully stored 15 records (132 bytes) in: "hdfs://"
Counters:
Total records written : 15
Total bytes written : 132
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
job_201201021955_0002 -> job_D:/Users/avkash/AppData/Local/Temp/MRjs1699097122276446870.jar__0001,
job_D:/Users/avkash/AppData/Local/Temp/MRjs1699097122276446870.jar__0001 -> job_201201021955_0004,
job_201201021955_0004 -> job_201201021955_0005,
job_201201021955_0005 -> job_201201021955_0006,
job_201201021955_0006
2012-01-02 20:31:49,092 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
Keywords: Windows Azure, Hadoop, Apache, BigData, Cloud, MapReduce
- Anonymous
January 03, 2012
Can you compare Azure Hadoop with Azure HPC please? Great blogs! thank you.