Internals of Hadoop Pig Operators as MapReduce Job
I was recently asked to show that Pig scripts actually run as MapReduce jobs. To explain this as simply as possible, I created the following example:
- Read a text file using Pig Script
- Dump the content of the file
As you can see below, a MapReduce job was initiated when the “dump” command was used:
c:\apps\dist>pig
2012-02-09 05:19:12,777 [main] INFO org.apache.pig.Main - Logging error messages to: c:\apps\dist\pig_1328764752777.log
2012-02-09 05:19:13,198 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://10.114.226.34:9000
2012-02-09 05:19:13,652 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: 10.114.226.34:9010
grunt> raw = load 'avkashwordfile.txt';
grunt> dump raw;
2012-02-09 05:19:46,542 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: UNKNOWN
2012-02-09 05:19:46,542 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - pig.usenewlogicalplan is set to true. New logical plan will be used.
2012-02-09 05:19:46,761 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - (Name: raw: Store(hdfs://10.114.226.34:9000/tmp/temp-1709215369/tmp275450578:org.apache.pig.impl.io.InterStorage) - scope-1 Operator Key: scope-1)
2012-02-09 05:19:46,776 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
2012-02-09 05:19:46,823 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
2012-02-09 05:19:46,823 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
2012-02-09 05:19:46,995 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job
2012-02-09 05:19:47,026 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2012-02-09 05:19:48,308 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
2012-02-09 05:19:48,339 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
2012-02-09 05:19:48,839 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2012-02-09 05:19:48,870 [Thread-6] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2012-02-09 05:19:48,870 [Thread-6] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
2012-02-09 05:19:48,886 [Thread-6] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1
2012-02-09 05:19:51,183 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_201202082253_0006
2012-02-09 05:19:51,183 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - More information at: https://10.114.226.34:50030/jobdetails.jsp?jobid=job_201202082253_0006
2012-02-09 05:20:15,198 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 50% complete
2012-02-09 05:20:16,198 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 50% complete
2012-02-09 05:20:21,198 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 50% complete
2012-02-09 05:20:30,932 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2012-02-09 05:20:30,932 [main] INFO org.apache.pig.tools.pigstats.PigStats - Script Statistics:
HadoopVersion PigVersion UserId StartedAt FinishedAt Features
0.20.203.1-SNAPSHOT 0.8.1-SNAPSHOT avkash 2012-02-09 05:19:46 2012-02-09 05:20:30 UNKNOWN
Success!
Job Stats (time in seconds):
JobId Maps Reduces MaxMapTime MinMapTIme AvgMapTime MaxReduceTime MinReduceTime AvgReduceTime Alias Feature Outputs
job_201202082253_0006 1 0 12 12 12 0 0 0 raw MAP_ONLY hdfs://10.114.226.34:9000/tmp/temp-1709215369/tmp275450578,
Input(s):
Successfully read 15 records (482 bytes) from: "hdfs://10.114.226.34:9000/user/avkash/avkashwordfile.txt"
Output(s):
Successfully stored 15 records (183 bytes) in: "hdfs://10.114.226.34:9000/tmp/temp-1709215369/tmp275450578"
Counters:
Total records written : 15
Total bytes written : 183
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
job_201202082253_0006
2012-02-09 05:20:30,948 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
2012-02-09 05:20:30,979 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2012-02-09 05:20:30,979 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
(avkash)
(amit)
(akhil)
(avkash)
(hello)
(world)
(hello)
(state)
(avkash)
(akhil)
(world)
(state)
(world)
(state)
(hello)
grunt>
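The "Job Stats" row in the output is the quickest place to confirm what kind of MapReduce job Pig launched: one map task, zero reduce tasks, and the feature MAP_ONLY. As a small illustration, the Python sketch below pairs the header columns with the values of that row. It is not part of Pig; the helper names `parse_job_stats` and `is_map_only` are hypothetical. The header and row are copied from the session above, with the wrapped output path joined back onto one line. Splitting on whitespace is an assumption that only holds here because no column value contains a space.

```python
# Header and row copied from the "Job Stats" section of the Pig output.
HEADER = ("JobId Maps Reduces MaxMapTime MinMapTIme AvgMapTime "
          "MaxReduceTime MinReduceTime AvgReduceTime Alias Feature Outputs")
ROW = ("job_201202082253_0006 1 0 12 12 12 0 0 0 raw MAP_ONLY "
       "hdfs://10.114.226.34:9000/tmp/temp-1709215369/tmp275450578,")


def parse_job_stats(header, row):
    """Pair each column name with its value (hypothetical helper).

    Assumes columns are whitespace-separated and no value contains a space.
    """
    names = header.split()
    values = row.split()
    if len(names) != len(values):
        raise ValueError("header and row have different column counts")
    return dict(zip(names, values))


def is_map_only(stats):
    """A job with at least one map task and no reduce tasks is map-only."""
    return int(stats["Maps"]) > 0 and int(stats["Reduces"]) == 0


stats = parse_job_stats(HEADER, ROW)
print(stats["JobId"],
      "maps:", stats["Maps"],
      "reduces:", stats["Reduces"],
      "feature:", stats["Feature"],
      "map-only:", is_map_only(stats))
```

For this session the parsed row agrees with the Feature column: a plain load followed by a dump needed only a map phase. To look at the plan without running a job, the grunt shell also has an "explain" command (for example "explain raw;"), which prints the logical, physical, and MapReduce plans for an alias.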