The Microsoft distribution of Apache Hadoop on Windows Azure lets you run JavaScript Map/Reduce jobs directly from the web-based Interactive JavaScript Console. To start, let's write the JavaScript code for a Map/Reduce word count job:
File name: wordcount.js
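// Map: split each input line into words and emit (word, 1) for every word.
var map = function (key, value, context) {
    var words = value.split(/[^a-zA-Z]/);
    for (var i = 0; i < words.length; i++) {
        if (words[i] !== "") {
            context.write(words[i].toLowerCase(), 1);
        }
    }
};

// Reduce: add up the counts emitted for each word.
var reduce = function (key, values, context) {
    var sum = 0;
    while (values.hasNext()) {
        sum += parseInt(values.next(), 10); // radix 10: always read the value as decimal
    }
    context.write(key, sum);
};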
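Before uploading the script, it can help to dry-run the two functions locally. The following is only a sketch in plain JavaScript: the mock context objects, the mock iterator, and the runLocally helper are my own assumptions that imitate the write/hasNext/next calls used in wordcount.js; they are not the Hadoop or console APIs.

```javascript
// Local dry run of the word count map and reduce functions.
// ASSUMPTION: the mock context and iterator below only imitate the calls
// that wordcount.js makes (write, hasNext, next); they are not Hadoop APIs.
var map = function (key, value, context) {
    var words = value.split(/[^a-zA-Z]/);
    for (var i = 0; i < words.length; i++) {
        if (words[i] !== "") {
            context.write(words[i].toLowerCase(), 1);
        }
    }
};

var reduce = function (key, values, context) {
    var sum = 0;
    while (values.hasNext()) {
        sum += parseInt(values.next(), 10);
    }
    context.write(key, sum);
};

// runLocally is a hypothetical helper: it feeds lines to map, groups the
// emitted values by key, then calls reduce once per key.
function runLocally(lines) {
    var groups = {}; // word -> array of emitted values
    var mapContext = {
        write: function (k, v) {
            if (!Object.prototype.hasOwnProperty.call(groups, k)) {
                groups[k] = [];
            }
            groups[k].push(v);
        }
    };
    lines.forEach(function (line, index) {
        map(index, line, mapContext);
    });

    var results = {}; // word -> total count
    var reduceContext = {
        write: function (k, v) { results[k] = v; }
    };
    Object.keys(groups).forEach(function (word) {
        var vals = groups[word];
        var pos = 0;
        var iterator = {
            hasNext: function () { return pos < vals.length; },
            next: function () { return vals[pos++]; }
        };
        reduce(word, iterator, reduceContext);
    });
    return results;
}

var counts = runLocally(["The quick brown fox", "the lazy dog; THE end"]);
console.log(counts);
```

Running this with Node.js should show "the" counted three times, because the map function lowercases every word before emitting it.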
After that you can upload this wordcount.js file to HDFS and verify it as below:
js> fs.put()
js> #ls
Found 2 items
drwxr-xr-x - avkash supergroup 0 2012-01-02 20:25 /user/avkash/.oink
-rw-r--r-- 3 avkash supergroup 418 2012-01-02 20:17 /user/avkash/wordcount.js
Now you can create a folder named “wordsfolder” and upload a few .txt files to it. We will use this folder as the input folder for the word count Map/Reduce job.
js> #ls
Found 3 items
drwxr-xr-x - avkash supergroup 0 2012-01-02 20:25 /user/avkash/.oink
-rw-r--r-- 3 avkash supergroup 418 2012-01-02 20:17 /user/avkash/wordcount.js
drwxr-xr-x - avkash supergroup 0 2012-01-02 20:22 /user/avkash/wordsfolder
js> #ls wordsfolder
Found 3 items
-rw-r--r-- 3 avkash supergroup 1395667 2012-01-02 20:22 /user/avkash/wordsfolder/davinci.txt
-rw-r--r-- 3 avkash supergroup 674762 2012-01-02 20:22 /user/avkash/wordsfolder/outlineofscience.txt
-rw-r--r-- 3 avkash supergroup 1573044 2012-01-02 20:22 /user/avkash/wordsfolder/ulysses.txt
Now we can run the JavaScript Map/Reduce job to count the words, order them by count in descending order, take the top 15, and store the result in a folder named “top15words”:
js> from("wordsfolder").mapReduce("wordcount.js", "word, count:long").orderBy("count DESC").take(15).to("top15words")
View Log
If you open the “View Log” link above in a new tab, you can see the activity of the Map/Reduce job. I have added that log at the end of this post.
Finally, when the job completes, the “top15words” folder is created:
js> #ls
Found 4 items
drwxr-xr-x - avkash supergroup 0 2012-01-02 20:26 /user/avkash/.oink
drwxr-xr-x - avkash supergroup 0 2012-01-02 20:31 /user/avkash/top15words
-rw-r--r-- 3 avkash supergroup 418 2012-01-02 20:17 /user/avkash/wordcount.js
drwxr-xr-x - avkash supergroup 0 2012-01-02 20:22 /user/avkash/wordsfolder
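To make the orderBy("count DESC").take(15) part of the query more concrete, here is a small plain JavaScript sketch of the same ranking idea applied to an in-memory array. It is not how the console or Pig executes the step; rankTop is a hypothetical helper name, and the sample rows simply reuse four of the counts from this job.

```javascript
// Sketch of the ranking step: sort rows by count (descending), keep the first n.
// ASSUMPTION: rows are plain {word, count} objects held in memory;
// rankTop is a hypothetical helper, not part of the console API.
function rankTop(rows, n) {
    return rows
        .slice() // copy so the caller's array is not reordered
        .sort(function (a, b) { return b.count - a.count; })
        .slice(0, n);
}

var sample = [
    { word: "in", count: 13125 },
    { word: "the", count: 47430 },
    { word: "and", count: 18664 },
    { word: "of", count: 25263 }
];

var top3 = rankTop(sample, 3);
console.log(top3.map(function (r) { return r.word; }).join(" "));
```

On the cluster the same result comes from separate sampling and ordering jobs, as the job log at the end of this post shows; the sketch only illustrates the outcome.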
Now we can read the data from the “top15words” folder:
js> file = fs.read("top15words")
the 47430
of 25263
and 18664
a 14213
in 13125
to 12634
is 7876
that 7057
it 7005
on 5081
he 5037
with 4931
his 4314
as 4289
by 4119
Let’s parse the data also:
js> data = parse(file.data,"word, count:long")
[
0: {
word: "the"
count: 47430
}
1: {
word: "of"
count: 25263
}
2: {
word: "and"
count: 18664
}
3: {
word: "a"
count: 14213
}
4: {
word: "in"
count: 13125
}
5: {
word: "to"
count: 12634
}
6: {
word: "is"
count: 7876
}
7: {
word: "that"
count: 7057
}
8: {
word: "it"
count: 7005
}
9: {
word: "on"
count: 5081
}
10: {
word: "he"
count: 5037
}
11: {
word: "with"
count: 4931
}
12: {
word: "his"
count: 4314
}
13: {
word: "as"
count: 4289
}
14: {
word: "by"
count: 4119
}
]
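If you are curious what a schema string such as "word, count:long" implies, the following sketch shows one way to turn delimited text into typed records. It is a guess at the behavior, not the console's actual parse implementation: parseRecords is a hypothetical function, and I am assuming one record per line, whitespace-delimited fields, and that a "long" type means an integer.

```javascript
// Hypothetical stand-in for parse(text, schema); NOT the console's implementation.
// ASSUMPTIONS: one record per line, fields separated by whitespace,
// schema entries look like "name" or "name:type", and "long" means integer.
function parseRecords(text, schema) {
    var fields = schema.split(",").map(function (spec) {
        var parts = spec.trim().split(":");
        return { name: parts[0], type: parts[1] || "string" };
    });

    return text
        .split(/\r?\n/)
        .filter(function (line) { return line.trim() !== ""; })
        .map(function (line) {
            var cols = line.trim().split(/\s+/);
            var record = {};
            fields.forEach(function (field, i) {
                record[field.name] = field.type === "long"
                    ? parseInt(cols[i], 10)
                    : cols[i];
            });
            return record;
        });
}

var rows = parseRecords("the\t47430\nof\t25263\nand\t18664\n", "word, count:long");
console.log(rows);
```

The point of the typed schema is that count comes back as a number rather than a string, which matters for anything that sorts or charts the values.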
Finally, let's create a line graph from the results:
Here are the Map/Reduce job results:
2012-01-02 20:26:52,304 [main] INFO org.apache.pig.Main - Logging error messages to: c:\apps\dist\bin\pig_1325536012304.log
2012-01-02 20:26:52,570 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://10.26.104.45:9000
2012-01-02 20:26:53,038 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: 10.26.104.45:9010
2012-01-02 20:26:53,304 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: ORDER_BY,LIMIT,NATIVE
2012-01-02 20:26:53,304 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - pig.usenewlogicalplan is set to true. New logical plan will be used.
2012-01-02 20:26:53,507 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - (Name: q2: Store(hdfs://10.26.104.45:9000/user/avkash/top15words:org.apache.pig.builtin.PigStorage) - scope-12 Operator Key: scope-12)
2012-01-02 20:26:53,523 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
2012-01-02 20:26:53,742 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 5
2012-01-02 20:26:53,742 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 5
2012-01-02 20:26:53,945 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job
2012-01-02 20:26:53,992 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2012-01-02 20:26:55,179 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
2012-01-02 20:26:55,210 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
2012-01-02 20:26:55,710 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2012-01-02 20:26:55,835 [Thread-4] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 3
2012-01-02 20:26:55,835 [Thread-4] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 3
2012-01-02 20:26:55,882 [Thread-4] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1
2012-01-02 20:26:57,226 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_201201021955_0002
2012-01-02 20:26:57,226 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - More information at: https://10.26.104.45:50030/jobdetails.jsp?jobid=job_201201021955_0002
2012-01-02 20:27:28,772 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 10% complete
2012-01-02 20:27:40,771 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 20% complete
2012-01-02 20:27:42,646 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2012-01-02 20:27:43,209 [main] INFO org.apache.hadoop.mapred.JobClient - Running job: job_201201021955_0003
2012-01-02 20:27:44,224 [main] INFO org.apache.hadoop.mapred.JobClient - map 0% reduce 0%
2012-01-02 20:28:12,223 [main] INFO org.apache.hadoop.mapred.JobClient - map 100% reduce 0%
2012-01-02 20:28:36,223 [main] INFO org.apache.hadoop.mapred.JobClient - map 100% reduce 100%
2012-01-02 20:28:47,222 [main] INFO org.apache.hadoop.mapred.JobClient - Job complete: job_201201021955_0003
2012-01-02 20:28:47,222 [main] INFO org.apache.hadoop.mapred.JobClient - Counters: 25
2012-01-02 20:28:47,222 [main] INFO org.apache.hadoop.mapred.JobClient - Job Counters
2012-01-02 20:28:47,222 [main] INFO org.apache.hadoop.mapred.JobClient - Launched reduce tasks=1
2012-01-02 20:28:47,222 [main] INFO org.apache.hadoop.mapred.JobClient - SLOTS_MILLIS_MAPS=32061
2012-01-02 20:28:47,238 [main] INFO org.apache.hadoop.mapred.JobClient - Total time spent by all reduces waiting after reserving slots (ms)=0
2012-01-02 20:28:47,238 [main] INFO org.apache.hadoop.mapred.JobClient - Total time spent by all maps waiting after reserving slots (ms)=0
2012-01-02 20:28:47,238 [main] INFO org.apache.hadoop.mapred.JobClient - Launched map tasks=1
2012-01-02 20:28:47,238 [main] INFO org.apache.hadoop.mapred.JobClient - Data-local map tasks=1
2012-01-02 20:28:47,238 [main] INFO org.apache.hadoop.mapred.JobClient - SLOTS_MILLIS_REDUCES=21531
2012-01-02 20:28:47,238 [main] INFO org.apache.hadoop.mapred.JobClient - File Output Format Counters
2012-01-02 20:28:47,238 [main] INFO org.apache.hadoop.mapred.JobClient - Bytes Written=424066
2012-01-02 20:28:47,238 [main] INFO org.apache.hadoop.mapred.JobClient - FileSystemCounters
2012-01-02 20:28:47,238 [main] INFO org.apache.hadoop.mapred.JobClient - FILE_BYTES_READ=11850310
2012-01-02 20:28:47,238 [main] INFO org.apache.hadoop.mapred.JobClient - HDFS_BYTES_READ=3597791
2012-01-02 20:28:47,238 [main] INFO org.apache.hadoop.mapred.JobClient - FILE_BYTES_WRITTEN=17819374
2012-01-02 20:28:47,238 [main] INFO org.apache.hadoop.mapred.JobClient - HDFS_BYTES_WRITTEN=424066
2012-01-02 20:28:47,238 [main] INFO org.apache.hadoop.mapred.JobClient - File Input Format Counters
2012-01-02 20:28:47,238 [main] INFO org.apache.hadoop.mapred.JobClient - Bytes Read=3597657
2012-01-02 20:28:47,238 [main] INFO org.apache.hadoop.mapred.JobClient - Map-Reduce Framework
2012-01-02 20:28:47,238 [main] INFO org.apache.hadoop.mapred.JobClient - Reduce input groups=39491
2012-01-02 20:28:47,238 [main] INFO org.apache.hadoop.mapred.JobClient - Map output materialized bytes=5924329
2012-01-02 20:28:47,238 [main] INFO org.apache.hadoop.mapred.JobClient - Combine output records=0
2012-01-02 20:28:47,238 [main] INFO org.apache.hadoop.mapred.JobClient - Map input records=77934
2012-01-02 20:28:47,238 [main] INFO org.apache.hadoop.mapred.JobClient - Reduce shuffle bytes=0
2012-01-02 20:28:47,238 [main] INFO org.apache.hadoop.mapred.JobClient - Reduce output records=39491
2012-01-02 20:28:47,238 [main] INFO org.apache.hadoop.mapred.JobClient - Spilled Records=1890066
2012-01-02 20:28:47,238 [main] INFO org.apache.hadoop.mapred.JobClient - Map output bytes=4664279
2012-01-02 20:28:47,238 [main] INFO org.apache.hadoop.mapred.JobClient - Combine input records=0
2012-01-02 20:28:47,238 [main] INFO org.apache.hadoop.mapred.JobClient - Map output records=630022
2012-01-02 20:28:47,238 [main] INFO org.apache.hadoop.mapred.JobClient - SPLIT_RAW_BYTES=134
2012-01-02 20:28:47,238 [main] INFO org.apache.hadoop.mapred.JobClient - Reduce input records=630022
2012-01-02 20:28:47,238 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 40% complete
2012-01-02 20:28:47,238 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job
2012-01-02 20:28:47,238 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2012-01-02 20:28:48,629 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
2012-01-02 20:28:48,644 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
2012-01-02 20:28:49,035 [Thread-24] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2012-01-02 20:28:49,035 [Thread-24] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
2012-01-02 20:28:49,035 [Thread-24] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1
2012-01-02 20:28:50,050 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_201201021955_0004
2012-01-02 20:28:50,050 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - More information at: https://10.26.104.45:50030/jobdetails.jsp?jobid=job_201201021955_0004
2012-01-02 20:29:17,550 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 50% complete
2012-01-02 20:29:20,049 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 50% complete
2012-01-02 20:29:25,049 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 50% complete
2012-01-02 20:29:29,549 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job
2012-01-02 20:29:29,549 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2012-01-02 20:29:30,768 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
2012-01-02 20:29:30,830 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
2012-01-02 20:29:31,205 [Thread-34] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2012-01-02 20:29:31,205 [Thread-34] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
2012-01-02 20:29:31,205 [Thread-34] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1
2012-01-02 20:29:31,330 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 60% complete
2012-01-02 20:29:32,252 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_201201021955_0005
2012-01-02 20:29:32,252 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - More information at: https://10.26.104.45:50030/jobdetails.jsp?jobid=job_201201021955_0005
2012-01-02 20:30:11,251 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 70% complete
2012-01-02 20:30:12,251 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 70% complete
2012-01-02 20:30:17,251 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 70% complete
2012-01-02 20:30:22,251 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 70% complete
2012-01-02 20:30:27,250 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 70% complete
2012-01-02 20:30:32,750 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 80% complete
2012-01-02 20:30:37,250 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 80% complete
2012-01-02 20:30:42,250 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 80% complete
2012-01-02 20:30:46,765 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job
2012-01-02 20:30:46,765 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2012-01-02 20:30:47,937 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
2012-01-02 20:30:47,984 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
2012-01-02 20:30:48,406 [Thread-45] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2012-01-02 20:30:48,406 [Thread-45] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
2012-01-02 20:30:48,406 [Thread-45] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1
2012-01-02 20:30:48,484 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 80% complete
2012-01-02 20:30:49,390 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_201201021955_0006
2012-01-02 20:30:49,390 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - More information at: https://10.26.104.45:50030/jobdetails.jsp?jobid=job_201201021955_0006
2012-01-02 20:31:17,889 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 90% complete
2012-01-02 20:31:19,389 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 90% complete
2012-01-02 20:31:24,389 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 90% complete
2012-01-02 20:31:34,389 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 90% complete
2012-01-02 20:31:48,982 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2012-01-02 20:31:48,998 [main] INFO org.apache.pig.tools.pigstats.PigStats - Script Statistics:
HadoopVersion PigVersion UserId StartedAt FinishedAt Features
0.20.203.1-SNAPSHOT 0.8.1-SNAPSHOT avkash 2012-01-02 20:26:53 2012-01-02 20:31:48 ORDER_BY,LIMIT,NATIVE
Success!
Job Stats (time in seconds):
JobId Maps Reduces MaxMapTime MinMapTIme AvgMapTime MaxReduceTime MinReduceTime AvgReduceTime Alias Feature Outputs
job_201201021955_0002 1 0 15 15 15 0 0 0 q0 MAP_ONLY
job_201201021955_0004 1 0 12 12 12 0 0 0 q1 MAP_ONLY
job_201201021955_0005 1 1 11 11 11 21 21 21 q2 SAMPLER
job_201201021955_0006 1 1 12 12 12 18 18 18 q2 ORDER_BY,COMBINER hdfs://10.26.104.45:9000/user/avkash/top15words,
job_D:/Users/avkash/AppData/Local/Temp/MRjs1699097122276446870.jar__0001 0 0 0 0 0 0 0 0 NATIVE
Input(s):
Successfully read 77934 records (3644014 bytes) from: "hdfs://10.26.104.45:9000/user/avkash/wordsfolder"
Successfully read 39491 records (424454 bytes) from: "hdfs://10.26.104.45:9000/user/avkash/.oink/output2/mr/out"
Output(s):
Successfully stored 15 records (132 bytes) in: "hdfs://10.26.104.45:9000/user/avkash/top15words"
Counters:
Total records written : 15
Total bytes written : 132
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
job_201201021955_0002 -> job_D:/Users/avkash/AppData/Local/Temp/MRjs1699097122276446870.jar__0001,
job_D:/Users/avkash/AppData/Local/Temp/MRjs1699097122276446870.jar__0001 -> job_201201021955_0004,
job_201201021955_0004 -> job_201201021955_0005,
job_201201021955_0005 -> job_201201021955_0006,
job_201201021955_0006
2012-01-02 20:31:49,092 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
Keywords: Windows Azure, Hadoop, Apache, BigData, Cloud, MapReduce
Comments
- Anonymous
January 03, 2012
Can you compare Azure Hadoop with Azure HPC please? Great blogs! thank you.