Understanding Map/Reduce job in Apache Hadoop on Windows Azure (A Reverse Approach)
When you run Map/Reduce job in Hadoop cluster on Windows Azure you will get an aggregated progress and log directly on portal, so you can see what is happening with your job. This log is different the what see when you do check individual job status in the datanode instead this log gives you cumulative details about how the job was started and how it was completed.
2012-01-20 22:27:37,646 [main] INFO org.apache.pig.Main - Logging error messages to: c:\apps\dist\bin\pig_1327098457646.log
2012-01-20 22:27:38,036 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://10.114.178.123:9000
2012-01-20 22:27:38,443 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: 10.114.178.123:9010
2012-01-20 22:27:38,661 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: ORDER_BY,LIMIT,NATIVE
2012-01-20 22:27:38,661 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - pig.usenewlogicalplan is set to true. New logical plan will be used.
2012-01-20 22:27:40,286 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - (Name: q2: Store(asv://hadoop/outjan20:org.apache.pig.builtin.PigStorage) - scope-12 Operator Key: scope-12)
2012-01-20 22:27:40,302 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
2012-01-20 22:27:40,380 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 5
2012-01-20 22:27:40,380 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 5
2012-01-20 22:27:40,489 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job
2012-01-20 22:27:40,505 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
At this point verification related with your jobs are completed, and now the job can be sent to Job queue. If you have had any issue related with your data source or where you are going to store the results, you could not have come here.. reaching here means the verification is done map/reduce job is pushed in queue:
2012-01-20 22:27:41,724 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
2012-01-20 22:27:41,755 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
2012-01-20 22:27:42,255 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
Depend on your input(s), at this point the data from the data source is ready to process:
2012-01-20 22:27:43,771 [Thread-4] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2012-01-20 22:27:43,771 [Thread-4] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
2012-01-20 22:27:43,802 [Thread-4] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1
At this point the input data source was processed and can be used later and Hadoop Job ID is associated with your job. If you look for this job ID inside you will see get more info about the tasks associated with this Job:
https://<datanode_ipaddress>:50030/jobdetailshistory.jsp?*
2012-01-20 22:27:44,771 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_201201202036_0012
2012-01-20 22:27:44,771 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - More information at: https://10.114.178.123:50030/jobdetails.jsp?jobid=job\_201201202036\_0012
2012-01-20 22:28:18,286 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 10% complete
2012-01-20 22:28:30,286 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 20% complete
2012-01-20 22:28:35,130 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
At this point A new Job is being processed:
2012-01-20 22:28:35,473 [main] INFO org.apache.hadoop.mapred.JobClient - Running job: job_201201202036_0013
2012-01-20 22:28:36,473 [main] INFO org.apache.hadoop.mapred.JobClient - map 0% reduce 0%
2012-01-20 22:29:04,489 [main] INFO org.apache.hadoop.mapred.JobClient - map 100% reduce 0%
2012-01-20 22:29:25,489 [main] INFO org.apache.hadoop.mapred.JobClient - map 100% reduce 100%
2012-01-20 22:29:36,489 [main] INFO org.apache.hadoop.mapred.JobClient - Job complete: job_201201202036_0013
After the Job is completed, here are the Job details which are same as what you would find at cluster:
Visit: https://hadoop.apache.org/mapreduce/docs/current/api/org/apache/hadoop/mapreduce/JobCounter.html to learn more about Job counter.
Counter |
Map |
Reduce |
Total |
|
Job Counters |
Launched reduce tasks |
0 |
0 |
1 |
SLOTS_MILLIS_MAPS |
0 |
0 |
29,234 |
|
Total time spent by all reduces waiting after reserving slots (ms) |
0 |
0 |
0 |
|
Total time spent by all maps waiting after reserving slots (ms) |
0 |
0 |
0 |
|
Launched map tasks |
0 |
0 |
1 |
|
Data-local map tasks |
0 |
0 |
1 |
|
SLOTS_MILLIS_REDUCES |
0 |
0 |
18,719 |
|
File Output Format Counters |
Bytes Written |
0 |
56 |
56 |
FileSystemCounters |
FILE_BYTES_READ |
542 |
986 |
1,528 |
HDFS_BYTES_READ |
238 |
0 |
238 |
|
FILE_BYTES_WRITTEN |
23,715 |
23,944 |
47,659 |
|
HDFS_BYTES_WRITTEN |
0 |
56 |
56 |
|
File Input Format Counters |
Bytes Read |
102 |
0 |
102 |
Map-Reduce Framework |
Reduce input groups |
0 |
8 |
8 |
Map output materialized bytes |
184 |
0 |
184 |
|
Combine output records |
0 |
0 |
0 |
|
Map input records |
7 |
0 |
7 |
|
Reduce shuffle bytes |
0 |
0 |
0 |
|
Reduce output records |
0 |
8 |
8 |
|
Spilled Records |
19 |
19 |
38 |
|
Map output bytes |
140 |
0 |
140 |
|
Combine input records |
0 |
0 |
0 |
|
Map output records |
19 |
0 |
19 |
|
SPLIT_RAW_BYTES |
136 |
0 |
136 |
|
Reduce input records |
0 |
19 |
19 |
2012-01-20 22:29:36,489 [main] INFO org.apache.hadoop.mapred.JobClient - Counters: 25
2012-01-20 22:29:36,489 [main] INFO org.apache.hadoop.mapred.JobClient - Job Counters
2012-01-20 22:29:36,489 [main] INFO org.apache.hadoop.mapred.JobClient - Launched reduce tasks=1
2012-01-20 22:29:36,489 [main] INFO org.apache.hadoop.mapred.JobClient - SLOTS_MILLIS_MAPS=29234
2012-01-20 22:29:36,489 [main] INFO org.apache.hadoop.mapred.JobClient - Total time spent by all reduces waiting after reserving slots (ms)=0
2012-01-20 22:29:36,489 [main] INFO org.apache.hadoop.mapred.JobClient - Total time spent by all maps waiting after reserving slots (ms)=0
2012-01-20 22:29:36,489 [main] INFO org.apache.hadoop.mapred.JobClient - Launched map tasks=1
2012-01-20 22:29:36,489 [main] INFO org.apache.hadoop.mapred.JobClient - Data-local map tasks=1
2012-01-20 22:29:36,489 [main] INFO org.apache.hadoop.mapred.JobClient - SLOTS_MILLIS_REDUCES=18719
2012-01-20 22:29:36,489 [main] INFO org.apache.hadoop.mapred.JobClient - File Output Format Counters
2012-01-20 22:29:36,489 [main] INFO org.apache.hadoop.mapred.JobClient - Bytes Written=56
2012-01-20 22:29:36,489 [main] INFO org.apache.hadoop.mapred.JobClient - FileSystemCounters
2012-01-20 22:29:36,489 [main] INFO org.apache.hadoop.mapred.JobClient - FILE_BYTES_READ=1528
2012-01-20 22:29:36,504 [main] INFO org.apache.hadoop.mapred.JobClient - HDFS_BYTES_READ=238
2012-01-20 22:29:36,504 [main] INFO org.apache.hadoop.mapred.JobClient - FILE_BYTES_WRITTEN=47659
2012-01-20 22:29:36,504 [main] INFO org.apache.hadoop.mapred.JobClient - HDFS_BYTES_WRITTEN=56
2012-01-20 22:29:36,504 [main] INFO org.apache.hadoop.mapred.JobClient - File Input Format Counters
2012-01-20 22:29:36,504 [main] INFO org.apache.hadoop.mapred.JobClient - Bytes Read=102
2012-01-20 22:29:36,504 [main] INFO org.apache.hadoop.mapred.JobClient - Map-Reduce Framework
2012-01-20 22:29:36,504 [main] INFO org.apache.hadoop.mapred.JobClient - Reduce input groups=8
2012-01-20 22:29:36,504 [main] INFO org.apache.hadoop.mapred.JobClient - Map output materialized bytes=184
2012-01-20 22:29:36,504 [main] INFO org.apache.hadoop.mapred.JobClient - Combine output records=0
2012-01-20 22:29:36,504 [main] INFO org.apache.hadoop.mapred.JobClient - Map input records=7
2012-01-20 22:29:36,504 [main] INFO org.apache.hadoop.mapred.JobClient - Reduce shuffle bytes=0
2012-01-20 22:29:36,504 [main] INFO org.apache.hadoop.mapred.JobClient - Reduce output records=8
2012-01-20 22:29:36,504 [main] INFO org.apache.hadoop.mapred.JobClient - Spilled Records=38
2012-01-20 22:29:36,504 [main] INFO org.apache.hadoop.mapred.JobClient - Map output bytes=140
2012-01-20 22:29:36,504 [main] INFO org.apache.hadoop.mapred.JobClient - Combine input records=0
2012-01-20 22:29:36,504 [main] INFO org.apache.hadoop.mapred.JobClient - Map output records=19
2012-01-20 22:29:36,504 [main] INFO org.apache.hadoop.mapred.JobClient - SPLIT_RAW_BYTES=136
2012-01-20 22:29:36,504 [main] INFO org.apache.hadoop.mapred.JobClient - Reduce input records=19
2012-01-20 22:29:36,520 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 40% complete
2012-01-20 22:29:36,520 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job
2012-01-20 22:29:36,520 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2012-01-20 22:29:37,770 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
2012-01-20 22:29:37,785 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
2012-01-20 22:29:38,051 [Thread-24] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2012-01-20 22:29:38,051 [Thread-24] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
2012-01-20 22:29:38,051 [Thread-24] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1
Introduction of another Job:
2012-01-20 22:29:38,864 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_201201202036_0014
2012-01-20 22:29:38,864 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - More information at: https://10.114.178.123:50030/jobdetails.jsp?jobid=job\_201201202036\_0014
2012-01-20 22:30:03,379 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 50% complete
2012-01-20 22:30:15,457 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 60% complete
2012-01-20 22:30:18,473 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job
2012-01-20 22:30:18,473 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2012-01-20 22:30:19,613 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
2012-01-20 22:30:19,645 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
2012-01-20 22:30:20,004 [Thread-34] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2012-01-20 22:30:20,004 [Thread-34] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
2012-01-20 22:30:20,020 [Thread-34] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1
Introduction of another Job:
2012-01-20 22:30:20,848 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_201201202036_0015
2012-01-20 22:30:20,848 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - More information at: https://10.114.178.123:50030/jobdetails.jsp?jobid=job\_201201202036\_0015
2012-01-20 22:30:51,863 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 70% complete
2012-01-20 22:31:12,863 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 80% complete
2012-01-20 22:31:15,863 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 80% complete
2012-01-20 22:31:25,378 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job
2012-01-20 22:31:25,378 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2012-01-20 22:31:26,503 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
2012-01-20 22:31:26,597 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
2012-01-20 22:31:27,097 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 80% complete
2012-01-20 22:31:27,769 [Thread-45] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2012-01-20 22:31:27,769 [Thread-45] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
2012-01-20 22:31:27,769 [Thread-45] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1
Introduction of another Job:
2012-01-20 22:31:29,144 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_201201202036_0016
2012-01-20 22:31:29,144 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - More information at: https://10.114.178.123:50030/jobdetails.jsp?jobid=job\_201201202036\_0016
2012-01-20 22:31:57,643 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 90% complete
2012-01-20 22:31:59,143 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 90% complete
2012-01-20 22:32:04,143 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 90% complete
2012-01-20 22:32:09,143 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 90% complete
2012-01-20 22:32:14,143 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 90% complete
2012-01-20 22:32:19,159 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 90% complete
2012-01-20 22:32:24,159 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 90% complete
2012-01-20 22:32:43,877 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2012-01-20 22:32:43,893 [main] INFO org.apache.pig.tools.pigstats.PigStats - Script Statistics:
List of Hadoop components used along with their version along with features:
HadoopVersion PigVersion UserId StartedAt FinishedAt Features
0.20.203.1-SNAPSHOT 0.8.1-SNAPSHOT avkash 2012-01-20 22:27:40 2012-01-20 22:32:43 ORDER_BY,LIMIT,NATIVE
This line really makes all the trouble go away:
Success!
This is very important description about your jobs. This table shows the type of Map/Reduce job, each job is and how much time it tool. You will actually learn a lot just by looking here:
Job Stats (time in seconds):
JobId Maps Reduces MaxMapTime MinMapTIme AvgMapTime MaxReduceTime MinReduceTime AvgReduceTime Alias Feature Outputs
job_201201202036_0012 1 0 15 15 15 0 0 0 q0 MAP_ONLY
job_201201202036_0014 1 0 12 12 12 0 0 0 q1 MAP_ONLY
job_201201202036_0015 1 1 12 12 12 20 20 20 q2 SAMPLER
job_201201202036_0016 1 1 11 11 11 29 29 29 q2 ORDER_BY,COMBINER asv://hadoop/outjan20,
job_D:/Users/avkash/AppData/Local/Temp/MRjs7020713045504718236.jar__0001 0 0 0 0 0 0 0 0 NATIVE
Note: If you are adventures enough to dig, please remote into cluster and unzip above MRjs7020713045504718236.jar JAR file and you will learn how to Java Script job is packaged into a JAR files before submission.
This line prints data source location when I submitted the job:
Input(s):
[I am still digging why this line here?]
Successfully read 8 records (446 bytes) from: "hdfs://10.114.178.123:9000/user/avkash/.oink/output3/mr/out"
If I look at the content at this location, I see there is only 1 text blob with 7 lines in it. That’s why the log shows total records read here are 7:
Successfully read 7 records (355 bytes) from: "asv://hadoop/input"
This line shows what was my output location which was used when I submitted the job:
Output(s):
Successfully stored 8 records in: "asv://hadoop/outjan20"
This section shows the job results. Based on your job if there was any results, you will see the details about the result as belo:
Counters:
Total records written : 8
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
The dependency between Hadoop jobs can be expressed as a directed acyclic graph (DAG), where each node represents a Hadoop job. DAG is useful for modeling relationship between entities that have partial ordering. An application is either a single job in the classic MapReduce jobs or a DAG of such jobs. This section shows the DAG graph between all of the Jobs in your application:
Job DAG:
job_201201202036_0012 -> job_D:/Users/avkash/AppData/Local/Temp/MRjs7020713045504718236.jar__0001,
job_D:/Users/avkash/AppData/Local/Temp/MRjs7020713045504718236.jar__0001 -> job_201201202036_0014,
job_201201202036_0014 -> job_201201202036_0015,
job_201201202036_0015 -> job_201201202036_0016,
job_201201202036_0016
2012-01-20 22:32:43,955 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
All of these jobs could be analyzed individually on https://localhost:50030/jobhistoryhome.jsp
Job submit time |
Job Id |
Name |
User |
Fri Jan 20 22:31:28 GMT 2012 |
PigLatin:DefaultJobName |
avkash |
|
Fri Jan 20 22:30:20 GMT 2012 |
PigLatin:DefaultJobName |
avkash |
|
Fri Jan 20 22:29:38 GMT 2012 |
PigLatin:DefaultJobName |
avkash |
|
Fri Jan 20 22:28:35 GMT 2012 |
MRjs |
avkash |
|
Fri Jan 20 22:27:44 GMT 2012 |
PigLatin:DefaultJobName |
avkash |
In most cases the Job name will be the Map/Reduce job class name and in above case MRjs shows that it is a JavaScript Map/Reduce Job.
Note: I will keep updating this article as I think new users can learn a lot about the processing just by looking at these results.
Resources:
- Apache Hadoop on Windows Azure Technet WiKi
- Keywords: Apache Hadoop, Windows Azure, BigData, Cloud, MapReduce