Understanding Map/Reduce job in Apache Hadoop on Windows Azure (A Reverse Approach)

When you run Map/Reduce job in Hadoop cluster on Windows Azure you will get an aggregated progress and log directly on portal, so you can see what is happening with your job. This log is different the what see when you do check individual job status in the datanode instead this log gives you cumulative details about how the job was started and how it was completed.

 

2012-01-20 22:27:37,646 [main] INFO  org.apache.pig.Main - Logging error messages to: c:\apps\dist\bin\pig_1327098457646.log

2012-01-20 22:27:38,036 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://10.114.178.123:9000

2012-01-20 22:27:38,443 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: 10.114.178.123:9010

2012-01-20 22:27:38,661 [main] INFO  org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: ORDER_BY,LIMIT,NATIVE

2012-01-20 22:27:38,661 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - pig.usenewlogicalplan is set to true. New logical plan will be used.

2012-01-20 22:27:40,286 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - (Name: q2: Store(asv://hadoop/outjan20:org.apache.pig.builtin.PigStorage) - scope-12 Operator Key: scope-12)

2012-01-20 22:27:40,302 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false

2012-01-20 22:27:40,380 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 5

2012-01-20 22:27:40,380 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 5

2012-01-20 22:27:40,489 [main] INFO  org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job

2012-01-20 22:27:40,505 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3

 

At this point verification related with your jobs are completed, and now the job can be sent to Job queue. If you have had any issue related with your data source or where you are going to store the results, you could not have come here.. reaching here means the verification is done map/reduce job is pushed in queue:

 

2012-01-20 22:27:41,724 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job

2012-01-20 22:27:41,755 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.

2012-01-20 22:27:42,255 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete

 

Depend on your input(s), at this point the data from the data source is ready to process:

 

2012-01-20 22:27:43,771 [Thread-4] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1

2012-01-20 22:27:43,771 [Thread-4] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1

2012-01-20 22:27:43,802 [Thread-4] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1

 

At this point the input data source was processed and can be used later and Hadoop Job ID is associated with your job. If you look for this job ID inside you will see get more info about the tasks associated with this Job:

https://<datanode_ipaddress>:50030/jobdetailshistory.jsp?*

 

2012-01-20 22:27:44,771 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_201201202036_0012

2012-01-20 22:27:44,771 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - More information at: https://10.114.178.123:50030/jobdetails.jsp?jobid=job\_201201202036\_0012

2012-01-20 22:28:18,286 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 10% complete

2012-01-20 22:28:30,286 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 20% complete

2012-01-20 22:28:35,130 [main] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1

 

At this point A new Job is being processed:

 

2012-01-20 22:28:35,473 [main] INFO  org.apache.hadoop.mapred.JobClient - Running job: job_201201202036_0013

2012-01-20 22:28:36,473 [main] INFO  org.apache.hadoop.mapred.JobClient -  map 0% reduce 0%

2012-01-20 22:29:04,489 [main] INFO  org.apache.hadoop.mapred.JobClient -  map 100% reduce 0%

2012-01-20 22:29:25,489 [main] INFO  org.apache.hadoop.mapred.JobClient -  map 100% reduce 100%

2012-01-20 22:29:36,489 [main] INFO  org.apache.hadoop.mapred.JobClient - Job complete: job_201201202036_0013

 

After the Job is completed, here are the Job details which are same as what you would find at cluster:

https://localhost:50030/jobdetailshistory.jsp?logFile=file:/c:/Apps/dist/logs/history/done/version-1/10.114.178.123_1327091770147_/2012/01/20/000000/job_201201202036_0013_1327098515411_avkash_MRjs

 Visit: https://hadoop.apache.org/mapreduce/docs/current/api/org/apache/hadoop/mapreduce/JobCounter.html to learn more about Job counter.

Counter

Map

Reduce

Total

Job Counters

Launched reduce tasks

0

0

1

SLOTS_MILLIS_MAPS

0

0

29,234

Total time spent by all reduces waiting after reserving slots (ms)

0

0

0

Total time spent by all maps waiting after reserving slots (ms)

0

0

0

Launched map tasks

0

0

1

Data-local map tasks

0

0

1

SLOTS_MILLIS_REDUCES

0

0

18,719

File Output Format Counters

Bytes Written

0

56

56

FileSystemCounters

FILE_BYTES_READ

542

986

1,528

HDFS_BYTES_READ

238

0

238

FILE_BYTES_WRITTEN

23,715

23,944

47,659

HDFS_BYTES_WRITTEN

0

56

56

File Input Format Counters

Bytes Read

102

0

102

Map-Reduce Framework

Reduce input groups

0

8

8

Map output materialized bytes

184

0

184

Combine output records

0

0

0

Map input records

7

0

7

Reduce shuffle bytes

0

0

0

Reduce output records

0

8

8

Spilled Records

19

19

38

Map output bytes

140

0

140

Combine input records

0

0

0

Map output records

19

0

19

SPLIT_RAW_BYTES

136

0

136

Reduce input records

0

19

19

 

 

2012-01-20 22:29:36,489 [main] INFO  org.apache.hadoop.mapred.JobClient - Counters: 25

2012-01-20 22:29:36,489 [main] INFO  org.apache.hadoop.mapred.JobClient -   Job Counters

2012-01-20 22:29:36,489 [main] INFO  org.apache.hadoop.mapred.JobClient -     Launched reduce tasks=1

2012-01-20 22:29:36,489 [main] INFO  org.apache.hadoop.mapred.JobClient -     SLOTS_MILLIS_MAPS=29234

2012-01-20 22:29:36,489 [main] INFO  org.apache.hadoop.mapred.JobClient -     Total time spent by all reduces waiting after reserving slots (ms)=0

2012-01-20 22:29:36,489 [main] INFO  org.apache.hadoop.mapred.JobClient -     Total time spent by all maps waiting after reserving slots (ms)=0

2012-01-20 22:29:36,489 [main] INFO  org.apache.hadoop.mapred.JobClient -     Launched map tasks=1

2012-01-20 22:29:36,489 [main] INFO  org.apache.hadoop.mapred.JobClient -     Data-local map tasks=1

2012-01-20 22:29:36,489 [main] INFO  org.apache.hadoop.mapred.JobClient -     SLOTS_MILLIS_REDUCES=18719

2012-01-20 22:29:36,489 [main] INFO  org.apache.hadoop.mapred.JobClient -   File Output Format Counters

2012-01-20 22:29:36,489 [main] INFO  org.apache.hadoop.mapred.JobClient -     Bytes Written=56

2012-01-20 22:29:36,489 [main] INFO  org.apache.hadoop.mapred.JobClient -   FileSystemCounters

2012-01-20 22:29:36,489 [main] INFO  org.apache.hadoop.mapred.JobClient -     FILE_BYTES_READ=1528

2012-01-20 22:29:36,504 [main] INFO  org.apache.hadoop.mapred.JobClient -     HDFS_BYTES_READ=238

2012-01-20 22:29:36,504 [main] INFO  org.apache.hadoop.mapred.JobClient -     FILE_BYTES_WRITTEN=47659

2012-01-20 22:29:36,504 [main] INFO  org.apache.hadoop.mapred.JobClient -     HDFS_BYTES_WRITTEN=56

2012-01-20 22:29:36,504 [main] INFO  org.apache.hadoop.mapred.JobClient -   File Input Format Counters

2012-01-20 22:29:36,504 [main] INFO  org.apache.hadoop.mapred.JobClient -     Bytes Read=102

2012-01-20 22:29:36,504 [main] INFO  org.apache.hadoop.mapred.JobClient -   Map-Reduce Framework

2012-01-20 22:29:36,504 [main] INFO  org.apache.hadoop.mapred.JobClient -     Reduce input groups=8

2012-01-20 22:29:36,504 [main] INFO  org.apache.hadoop.mapred.JobClient -     Map output materialized bytes=184

2012-01-20 22:29:36,504 [main] INFO  org.apache.hadoop.mapred.JobClient -     Combine output records=0

2012-01-20 22:29:36,504 [main] INFO  org.apache.hadoop.mapred.JobClient -     Map input records=7

2012-01-20 22:29:36,504 [main] INFO  org.apache.hadoop.mapred.JobClient -     Reduce shuffle bytes=0

2012-01-20 22:29:36,504 [main] INFO  org.apache.hadoop.mapred.JobClient -     Reduce output records=8

2012-01-20 22:29:36,504 [main] INFO  org.apache.hadoop.mapred.JobClient -     Spilled Records=38

2012-01-20 22:29:36,504 [main] INFO  org.apache.hadoop.mapred.JobClient -     Map output bytes=140

2012-01-20 22:29:36,504 [main] INFO  org.apache.hadoop.mapred.JobClient -     Combine input records=0

2012-01-20 22:29:36,504 [main] INFO  org.apache.hadoop.mapred.JobClient -     Map output records=19

2012-01-20 22:29:36,504 [main] INFO  org.apache.hadoop.mapred.JobClient -     SPLIT_RAW_BYTES=136

2012-01-20 22:29:36,504 [main] INFO  org.apache.hadoop.mapred.JobClient -     Reduce input records=19

2012-01-20 22:29:36,520 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 40% complete

2012-01-20 22:29:36,520 [main] INFO  org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job

2012-01-20 22:29:36,520 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3

2012-01-20 22:29:37,770 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job

2012-01-20 22:29:37,785 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.

2012-01-20 22:29:38,051 [Thread-24] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1

2012-01-20 22:29:38,051 [Thread-24] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1

2012-01-20 22:29:38,051 [Thread-24] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1

 

Introduction of another Job:

 

2012-01-20 22:29:38,864 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_201201202036_0014

2012-01-20 22:29:38,864 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - More information at: https://10.114.178.123:50030/jobdetails.jsp?jobid=job\_201201202036\_0014

2012-01-20 22:30:03,379 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 50% complete

2012-01-20 22:30:15,457 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 60% complete

2012-01-20 22:30:18,473 [main] INFO  org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job

2012-01-20 22:30:18,473 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3

2012-01-20 22:30:19,613 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job

2012-01-20 22:30:19,645 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.

2012-01-20 22:30:20,004 [Thread-34] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1

2012-01-20 22:30:20,004 [Thread-34] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1

2012-01-20 22:30:20,020 [Thread-34] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1

 

Introduction of another Job:

 

2012-01-20 22:30:20,848 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_201201202036_0015

2012-01-20 22:30:20,848 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - More information at: https://10.114.178.123:50030/jobdetails.jsp?jobid=job\_201201202036\_0015

2012-01-20 22:30:51,863 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 70% complete

2012-01-20 22:31:12,863 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 80% complete

2012-01-20 22:31:15,863 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 80% complete

2012-01-20 22:31:25,378 [main] INFO  org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job

2012-01-20 22:31:25,378 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3

2012-01-20 22:31:26,503 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job

2012-01-20 22:31:26,597 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.

2012-01-20 22:31:27,097 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 80% complete

2012-01-20 22:31:27,769 [Thread-45] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1

2012-01-20 22:31:27,769 [Thread-45] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1

2012-01-20 22:31:27,769 [Thread-45] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1

 

Introduction of another Job:

 

2012-01-20 22:31:29,144 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_201201202036_0016

2012-01-20 22:31:29,144 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - More information at: https://10.114.178.123:50030/jobdetails.jsp?jobid=job\_201201202036\_0016

2012-01-20 22:31:57,643 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 90% complete

2012-01-20 22:31:59,143 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 90% complete

2012-01-20 22:32:04,143 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 90% complete

2012-01-20 22:32:09,143 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 90% complete

2012-01-20 22:32:14,143 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 90% complete

2012-01-20 22:32:19,159 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 90% complete

2012-01-20 22:32:24,159 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 90% complete

2012-01-20 22:32:43,877 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete

2012-01-20 22:32:43,893 [main] INFO  org.apache.pig.tools.pigstats.PigStats - Script Statistics:

 

List of Hadoop components used along with their version along with features:

 

HadoopVersion  PigVersion     UserId StartedAt      FinishedAt     Features

0.20.203.1-SNAPSHOT   0.8.1-SNAPSHOT avkash  2012-01-20 22:27:40   2012-01-20 22:32:43       ORDER_BY,LIMIT,NATIVE

 

This line really makes all the trouble go away:

 

Success!

 

This is very important description about your jobs. This table shows the type of Map/Reduce job, each job is and how much time it tool. You will actually learn a lot just by looking here:

 

Job Stats (time in seconds):

JobId  Maps    Reduces MaxMapTime     MinMapTIme     AvgMapTime     MaxReduceTime       MinReduceTime  AvgReduceTime  Alias   Feature Outputs

job_201201202036_0012 1       0      15      15     15     0       0      0       q0      MAP_ONLY      

job_201201202036_0014 1       0      12      12     12     0       0      0       q1      MAP_ONLY      

job_201201202036_0015 1       1      12      12     12     20      20     20       q2      SAMPLER

job_201201202036_0016 1       1      11      11     11     29      29     29       q2      ORDER_BY,COMBINER asv://hadoop/outjan20,

job_D:/Users/avkash/AppData/Local/Temp/MRjs7020713045504718236.jar__0001   0       0       0      0       0      0       0      0              NATIVE

 

Note: If you are adventures enough to dig, please remote into cluster and unzip above MRjs7020713045504718236.jar JAR file and you will learn how to Java Script job is packaged into a JAR files before submission.

 

This line prints data source location when I submitted the job:

 

Input(s):

 

[I am still digging why this line here?]

Successfully read 8 records (446 bytes) from: "hdfs://10.114.178.123:9000/user/avkash/.oink/output3/mr/out"

 

If I look at the content at this location, I see there is only 1 text blob with 7 lines in it. That’s why the log shows total records read here are 7:

 

Successfully read 7 records (355 bytes) from: "asv://hadoop/input"

 

This line shows what was my output location which was used when I submitted the job:

 

Output(s):

Successfully stored 8 records in: "asv://hadoop/outjan20"

 

This section shows the job results. Based on your job if there was any results, you will see the details about the result as belo:

 

Counters:

Total records written : 8

Total bytes written : 0

Spillable Memory Manager spill count : 0

Total bags proactively spilled: 0

Total records proactively spilled: 0

 

The dependency between Hadoop jobs can be expressed as a directed acyclic graph (DAG), where each node represents a Hadoop job. DAG is useful for modeling relationship between entities that have partial ordering. An application is either a single job in the classic MapReduce jobs or a DAG of such jobs. This section shows the DAG graph between all of the Jobs in your application:

 

Job DAG:

job_201201202036_0012 ->       job_D:/Users/avkash/AppData/Local/Temp/MRjs7020713045504718236.jar__0001,

job_D:/Users/avkash/AppData/Local/Temp/MRjs7020713045504718236.jar__0001   ->       job_201201202036_0014,

job_201201202036_0014 ->      job_201201202036_0015,

job_201201202036_0015 ->      job_201201202036_0016,

job_201201202036_0016

 

 

2012-01-20 22:32:43,955 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!

  

All of these jobs could be analyzed individually on https://localhost:50030/jobhistoryhome.jsp

 

Job submit time

Job Id

Name

User

Fri Jan 20 22:31:28 GMT 2012

job_201201202036_0016

PigLatin:DefaultJobName

avkash

Fri Jan 20 22:30:20 GMT 2012

job_201201202036_0015

PigLatin:DefaultJobName

avkash

Fri Jan 20 22:29:38 GMT 2012

job_201201202036_0014

PigLatin:DefaultJobName

avkash

Fri Jan 20 22:28:35 GMT 2012

job_201201202036_0013

MRjs

avkash

Fri Jan 20 22:27:44 GMT 2012

job_201201202036_0012

PigLatin:DefaultJobName

avkash

In most cases the Job name will be the Map/Reduce job class name and in above case MRjs shows that it is a JavaScript Map/Reduce Job.

Note: I will keep updating this article as I think new users can learn a lot about the processing just by looking at these results. 

Resources: