Understanding Map/Reduce job in Apache Hadoop on Windows Azure (A Reverse Approach)

项目
01/20/2012

When you run Map/Reduce job in Hadoop cluster on Windows Azure you will get an aggregated progress and log directly on portal, so you can see what is happening with your job. This log is different the what see when you do check individual job status in the datanode instead this log gives you cumulative details about how the job was started and how it was completed.

2012-01-20 22:27:37,646 [main] INFO org.apache.pig.Main - Logging error messages to: c:\apps\dist\bin\pig_1327098457646.log

2012-01-20 22:27:38,036 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://10.114.178.123:9000

2012-01-20 22:27:38,443 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: 10.114.178.123:9010

2012-01-20 22:27:38,661 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: ORDER_BY,LIMIT,NATIVE

2012-01-20 22:27:38,661 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - pig.usenewlogicalplan is set to true. New logical plan will be used.

2012-01-20 22:27:40,286 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - (Name: q2: Store(asv://hadoop/outjan20:org.apache.pig.builtin.PigStorage) - scope-12 Operator Key: scope-12)

2012-01-20 22:27:40,302 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false

2012-01-20 22:27:40,380 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 5

2012-01-20 22:27:40,380 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 5

2012-01-20 22:27:40,489 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job

2012-01-20 22:27:40,505 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3

At this point verification related with your jobs are completed, and now the job can be sent to Job queue. If you have had any issue related with your data source or where you are going to store the results, you could not have come here.. reaching here means the verification is done map/reduce job is pushed in queue:

2012-01-20 22:27:41,724 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job

2012-01-20 22:27:41,755 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.

2012-01-20 22:27:42,255 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete

Depend on your input(s), at this point the data from the data source is ready to process:

2012-01-20 22:27:43,771 [Thread-4] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1

2012-01-20 22:27:43,771 [Thread-4] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1

2012-01-20 22:27:43,802 [Thread-4] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1

At this point the input data source was processed and can be used later and Hadoop Job ID is associated with your job. If you look for this job ID inside you will see get more info about the tasks associated with this Job:

https://<datanode_ipaddress>:50030/jobdetailshistory.jsp?*

2012-01-20 22:27:44,771 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_201201202036_0012

2012-01-20 22:27:44,771 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - More information at: https://10.114.178.123:50030/jobdetails.jsp?jobid=job\_201201202036\_0012

2012-01-20 22:28:18,286 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 10% complete

2012-01-20 22:28:30,286 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 20% complete

2012-01-20 22:28:35,130 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1

At this point A new Job is being processed:

2012-01-20 22:28:35,473 [main] INFO org.apache.hadoop.mapred.JobClient - Running job: job_201201202036_0013

2012-01-20 22:28:36,473 [main] INFO org.apache.hadoop.mapred.JobClient - map 0% reduce 0%

2012-01-20 22:29:04,489 [main] INFO org.apache.hadoop.mapred.JobClient - map 100% reduce 0%

2012-01-20 22:29:25,489 [main] INFO org.apache.hadoop.mapred.JobClient - map 100% reduce 100%

2012-01-20 22:29:36,489 [main] INFO org.apache.hadoop.mapred.JobClient - Job complete: job_201201202036_0013

After the Job is completed, here are the Job details which are same as what you would find at cluster:

https://localhost:50030/jobdetailshistory.jsp?logFile=file:/c:/Apps/dist/logs/history/done/version-1/10.114.178.123_1327091770147_/2012/01/20/000000/job_201201202036_0013_1327098515411_avkash_MRjs

Visit: https://hadoop.apache.org/mapreduce/docs/current/api/org/apache/hadoop/mapreduce/JobCounter.html to learn more about Job counter.

	Counter	Map	Reduce	Total
Job Counters	Launched reduce tasks	0	0	1
	SLOTS_MILLIS_MAPS	0	0	29,234
	Total time spent by all reduces waiting after reserving slots (ms)	0	0	0
	Total time spent by all maps waiting after reserving slots (ms)	0	0	0
	Launched map tasks	0	0	1
	Data-local map tasks	0	0	1
	SLOTS_MILLIS_REDUCES	0	0	18,719
File Output Format Counters	Bytes Written	0	56	56
FileSystemCounters	FILE_BYTES_READ	542	986	1,528
	HDFS_BYTES_READ	238	0	238
	FILE_BYTES_WRITTEN	23,715	23,944	47,659
	HDFS_BYTES_WRITTEN	0	56	56
File Input Format Counters	Bytes Read	102	0	102
Map-Reduce Framework	Reduce input groups	0	8	8
	Map output materialized bytes	184	0	184
	Combine output records	0	0	0
	Map input records	7	0	7
	Reduce shuffle bytes	0	0	0
	Reduce output records	0	8	8
	Spilled Records	19	19	38
	Map output bytes	140	0	140
	Combine input records	0	0	0
	Map output records	19	0	19
	SPLIT_RAW_BYTES	136	0	136
	Reduce input records	0	19	19

2012-01-20 22:29:36,489 [main] INFO org.apache.hadoop.mapred.JobClient - Counters: 25

2012-01-20 22:29:36,489 [main] INFO org.apache.hadoop.mapred.JobClient - Job Counters

2012-01-20 22:29:36,489 [main] INFO org.apache.hadoop.mapred.JobClient - Launched reduce tasks=1

2012-01-20 22:29:36,489 [main] INFO org.apache.hadoop.mapred.JobClient - SLOTS_MILLIS_MAPS=29234

2012-01-20 22:29:36,489 [main] INFO org.apache.hadoop.mapred.JobClient - Total time spent by all reduces waiting after reserving slots (ms)=0

2012-01-20 22:29:36,489 [main] INFO org.apache.hadoop.mapred.JobClient - Total time spent by all maps waiting after reserving slots (ms)=0

2012-01-20 22:29:36,489 [main] INFO org.apache.hadoop.mapred.JobClient - Launched map tasks=1

2012-01-20 22:29:36,489 [main] INFO org.apache.hadoop.mapred.JobClient - Data-local map tasks=1

2012-01-20 22:29:36,489 [main] INFO org.apache.hadoop.mapred.JobClient - SLOTS_MILLIS_REDUCES=18719

2012-01-20 22:29:36,489 [main] INFO org.apache.hadoop.mapred.JobClient - File Output Format Counters

2012-01-20 22:29:36,489 [main] INFO org.apache.hadoop.mapred.JobClient - Bytes Written=56

2012-01-20 22:29:36,489 [main] INFO org.apache.hadoop.mapred.JobClient - FileSystemCounters

2012-01-20 22:29:36,489 [main] INFO org.apache.hadoop.mapred.JobClient - FILE_BYTES_READ=1528

2012-01-20 22:29:36,504 [main] INFO org.apache.hadoop.mapred.JobClient - HDFS_BYTES_READ=238

2012-01-20 22:29:36,504 [main] INFO org.apache.hadoop.mapred.JobClient - FILE_BYTES_WRITTEN=47659

2012-01-20 22:29:36,504 [main] INFO org.apache.hadoop.mapred.JobClient - HDFS_BYTES_WRITTEN=56

2012-01-20 22:29:36,504 [main] INFO org.apache.hadoop.mapred.JobClient - File Input Format Counters

2012-01-20 22:29:36,504 [main] INFO org.apache.hadoop.mapred.JobClient - Bytes Read=102

2012-01-20 22:29:36,504 [main] INFO org.apache.hadoop.mapred.JobClient - Map-Reduce Framework

2012-01-20 22:29:36,504 [main] INFO org.apache.hadoop.mapred.JobClient - Reduce input groups=8

2012-01-20 22:29:36,504 [main] INFO org.apache.hadoop.mapred.JobClient - Map output materialized bytes=184

2012-01-20 22:29:36,504 [main] INFO org.apache.hadoop.mapred.JobClient - Combine output records=0

2012-01-20 22:29:36,504 [main] INFO org.apache.hadoop.mapred.JobClient - Map input records=7

2012-01-20 22:29:36,504 [main] INFO org.apache.hadoop.mapred.JobClient - Reduce shuffle bytes=0

2012-01-20 22:29:36,504 [main] INFO org.apache.hadoop.mapred.JobClient - Reduce output records=8

2012-01-20 22:29:36,504 [main] INFO org.apache.hadoop.mapred.JobClient - Spilled Records=38

2012-01-20 22:29:36,504 [main] INFO org.apache.hadoop.mapred.JobClient - Map output bytes=140

2012-01-20 22:29:36,504 [main] INFO org.apache.hadoop.mapred.JobClient - Combine input records=0

2012-01-20 22:29:36,504 [main] INFO org.apache.hadoop.mapred.JobClient - Map output records=19

2012-01-20 22:29:36,504 [main] INFO org.apache.hadoop.mapred.JobClient - SPLIT_RAW_BYTES=136

2012-01-20 22:29:36,504 [main] INFO org.apache.hadoop.mapred.JobClient - Reduce input records=19

2012-01-20 22:29:36,520 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 40% complete

2012-01-20 22:29:36,520 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job

2012-01-20 22:29:36,520 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3

2012-01-20 22:29:37,770 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job

2012-01-20 22:29:37,785 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.

2012-01-20 22:29:38,051 [Thread-24] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1

2012-01-20 22:29:38,051 [Thread-24] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1

2012-01-20 22:29:38,051 [Thread-24] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1

Introduction of another Job:

2012-01-20 22:29:38,864 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_201201202036_0014

2012-01-20 22:29:38,864 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - More information at: https://10.114.178.123:50030/jobdetails.jsp?jobid=job\_201201202036\_0014

2012-01-20 22:30:03,379 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 50% complete

2012-01-20 22:30:15,457 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 60% complete

2012-01-20 22:30:18,473 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job

2012-01-20 22:30:18,473 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3

2012-01-20 22:30:19,613 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job

2012-01-20 22:30:19,645 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.

2012-01-20 22:30:20,004 [Thread-34] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1

2012-01-20 22:30:20,004 [Thread-34] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1

2012-01-20 22:30:20,020 [Thread-34] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1

Introduction of another Job:

2012-01-20 22:30:20,848 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_201201202036_0015

2012-01-20 22:30:20,848 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - More information at: https://10.114.178.123:50030/jobdetails.jsp?jobid=job\_201201202036\_0015

2012-01-20 22:30:51,863 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 70% complete

2012-01-20 22:31:12,863 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 80% complete

2012-01-20 22:31:15,863 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 80% complete

2012-01-20 22:31:25,378 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job

2012-01-20 22:31:25,378 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3

2012-01-20 22:31:26,503 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job

2012-01-20 22:31:26,597 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.

2012-01-20 22:31:27,097 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 80% complete

2012-01-20 22:31:27,769 [Thread-45] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1

2012-01-20 22:31:27,769 [Thread-45] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1

2012-01-20 22:31:27,769 [Thread-45] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1

Introduction of another Job:

2012-01-20 22:31:29,144 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_201201202036_0016

2012-01-20 22:31:29,144 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - More information at: https://10.114.178.123:50030/jobdetails.jsp?jobid=job\_201201202036\_0016

2012-01-20 22:31:57,643 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 90% complete

2012-01-20 22:31:59,143 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 90% complete

2012-01-20 22:32:04,143 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 90% complete

2012-01-20 22:32:09,143 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 90% complete

2012-01-20 22:32:14,143 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 90% complete

2012-01-20 22:32:19,159 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 90% complete

2012-01-20 22:32:24,159 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 90% complete

2012-01-20 22:32:43,877 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete

2012-01-20 22:32:43,893 [main] INFO org.apache.pig.tools.pigstats.PigStats - Script Statistics:

List of Hadoop components used along with their version along with features:

HadoopVersion PigVersion UserId StartedAt FinishedAt Features

0.20.203.1-SNAPSHOT 0.8.1-SNAPSHOT avkash 2012-01-20 22:27:40 2012-01-20 22:32:43 ORDER_BY,LIMIT,NATIVE

This line really makes all the trouble go away:

Success!

This is very important description about your jobs. This table shows the type of Map/Reduce job, each job is and how much time it tool. You will actually learn a lot just by looking here:

Job Stats (time in seconds):

JobId Maps Reduces MaxMapTime MinMapTIme AvgMapTime MaxReduceTime MinReduceTime AvgReduceTime Alias Feature Outputs

job_201201202036_0012 1 0 15 15 15 0 0 0 q0 MAP_ONLY

job_201201202036_0014 1 0 12 12 12 0 0 0 q1 MAP_ONLY

job_201201202036_0015 1 1 12 12 12 20 20 20 q2 SAMPLER

job_201201202036_0016 1 1 11 11 11 29 29 29 q2 ORDER_BY,COMBINER asv://hadoop/outjan20,

job_D:/Users/avkash/AppData/Local/Temp/MRjs7020713045504718236.jar__0001 0 0 0 0 0 0 0 0 NATIVE

Note: If you are adventures enough to dig, please remote into cluster and unzip above MRjs7020713045504718236.jar JAR file and you will learn how to Java Script job is packaged into a JAR files before submission.

This line prints data source location when I submitted the job:

Input(s):

[I am still digging why this line here?]

Successfully read 8 records (446 bytes) from: "hdfs://10.114.178.123:9000/user/avkash/.oink/output3/mr/out"

If I look at the content at this location, I see there is only 1 text blob with 7 lines in it. That’s why the log shows total records read here are 7:

Successfully read 7 records (355 bytes) from: "asv://hadoop/input"

This line shows what was my output location which was used when I submitted the job:

Output(s):

Successfully stored 8 records in: "asv://hadoop/outjan20"

This section shows the job results. Based on your job if there was any results, you will see the details about the result as belo:

Counters:

Total records written : 8

Total bytes written : 0

Spillable Memory Manager spill count : 0

Total bags proactively spilled: 0

Total records proactively spilled: 0

The dependency between Hadoop jobs can be expressed as a directed acyclic graph (DAG), where each node represents a Hadoop job. DAG is useful for modeling relationship between entities that have partial ordering. An application is either a single job in the classic MapReduce jobs or a DAG of such jobs. This section shows the DAG graph between all of the Jobs in your application:

Job DAG:

job_201201202036_0012 -> job_D:/Users/avkash/AppData/Local/Temp/MRjs7020713045504718236.jar__0001,

job_D:/Users/avkash/AppData/Local/Temp/MRjs7020713045504718236.jar__0001 -> job_201201202036_0014,

job_201201202036_0014 -> job_201201202036_0015,

job_201201202036_0015 -> job_201201202036_0016,

job_201201202036_0016

2012-01-20 22:32:43,955 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!

All of these jobs could be analyzed individually on https://localhost:50030/jobhistoryhome.jsp

Job submit time	Job Id	Name	User
Fri Jan 20 22:31:28 GMT 2012	job_201201202036_0016	PigLatin:DefaultJobName	avkash
Fri Jan 20 22:30:20 GMT 2012	job_201201202036_0015	PigLatin:DefaultJobName	avkash
Fri Jan 20 22:29:38 GMT 2012	job_201201202036_0014	PigLatin:DefaultJobName	avkash
Fri Jan 20 22:28:35 GMT 2012	job_201201202036_0013	MRjs	avkash
Fri Jan 20 22:27:44 GMT 2012	job_201201202036_0012	PigLatin:DefaultJobName	avkash

In most cases the Job name will be the Map/Reduce job class name and in above case MRjs shows that it is a JavaScript Map/Reduce Job.

Note: I will keep updating this article as I think new users can learn a lot about the processing just by looking at these results.

Resources:

Apache Hadoop on Windows Azure Technet WiKi
Keywords: Apache Hadoop, Windows Azure, BigData, Cloud, MapReduce

通过

Understanding Map/Reduce job in Apache Hadoop on Windows Azure (A Reverse Approach)

其他资源