

Apache Hadoop on Windows Azure Part 5 - Running 10GB Sort Hadoop Job with Teragen, TeraSort and TeraValidate Options

This example consists of the three map/reduce applications that Owen O'Malley and Arun Murthy used to win the annual general-purpose (Daytona) terabyte sort benchmark at sortbenchmark.org. The sample is part of the prebuilt package in your Hadoop on Azure portal, so, just like any other prebuilt sample, you can deploy it to your cluster as shown below:

 

 

There are three steps to this example:
1. TeraGen is a map/reduce program to generate the data.
2. TeraSort samples the input data and uses map/reduce to sort the data into a total order.
3. TeraValidate is a map/reduce program that validates that the output is sorted.
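
The three steps above can be sketched as three command lines run in order. This is only an illustration: the jar name and the input path come from the TeraGen command used in this post, while the sort output and validation report paths are hypothetical names picked for the example, since the post only runs the first step. The `terasort` and `teravalidate` program names are the ones shipped in the Hadoop examples jar.

```python
from typing import List

# Jar name and input path are taken from the TeraGen command in this post.
EXAMPLES_JAR = "hadoop-examples-0.20.203.1-SNAPSHOT.jar"
INPUT_DIR = "/example/data/10GB-sort-input"

# Hypothetical paths: the post does not say where sorted data or the report go.
OUTPUT_DIR = "/example/data/10GB-sort-out"
REPORT_DIR = "/example/data/10GB-sort-validate"


def teragen_cmd(rows: int, maps: int) -> List[str]:
    """Step 1: generate `rows` rows of input data using `maps` map tasks."""
    return ["hadoop", "jar", EXAMPLES_JAR, "teragen",
            f"-Dmapred.map.tasks={maps}", str(rows), INPUT_DIR]


def terasort_cmd() -> List[str]:
    """Step 2: sort the generated data into a total order."""
    return ["hadoop", "jar", EXAMPLES_JAR, "terasort", INPUT_DIR, OUTPUT_DIR]


def teravalidate_cmd() -> List[str]:
    """Step 3: check that the sorted output really is in order."""
    return ["hadoop", "jar", EXAMPLES_JAR, "teravalidate", OUTPUT_DIR, REPORT_DIR]


if __name__ == "__main__":
    # Print the commands instead of running them, so this works without a cluster.
    for cmd in (teragen_cmd(100_000_000, 50), terasort_cmd(), teravalidate_cmd()):
        print(" ".join(cmd))
```

Each step reads what the previous one wrote, so the commands have to run in this order.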

The sample deployment comes preloaded with the first job, TeraGen.

1. TeraGen (the sample loaded by default)
> hadoop jar hadoop-examples-0.20.203.1-SNAPSHOT.jar teragen "-Dmapred.map.tasks=50" 100000000 /example/data/10GB-sort-input
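
To see why these parameters give a 10GB data set, here is a small sizing sketch. It relies on the fact that TeraGen writes fixed-size 100-byte rows; the function name `teragen_plan` is made up for this example. The per-map row count matches the "step of 2000000" reported in the job output further down.

```python
ROW_BYTES = 100  # TeraGen writes fixed-size 100-byte rows


def teragen_plan(rows: int, maps: int) -> dict:
    """Estimate the data volume for a TeraGen run (hypothetical helper)."""
    rows_per_map = rows // maps  # each map task generates one slice of the rows
    return {
        "total_bytes": rows * ROW_BYTES,
        "rows_per_map": rows_per_map,
        "bytes_per_file": rows_per_map * ROW_BYTES,  # one output file per map
    }


if __name__ == "__main__":
    plan = teragen_plan(rows=100_000_000, maps=50)
    print(plan["total_bytes"])     # 10000000000 bytes, i.e. 10GB
    print(plan["rows_per_map"])    # 2000000 rows per map task
    print(plan["bytes_per_file"])  # 200000000 bytes, i.e. 200MB per file
```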

 

Once the sample is deployed to the cluster, verify the parameters and then start the job:

 

 

Once the job starts, it first creates the input data in 50 different files on HDFS...

 

...which you can verify in HDFS management as shown below:

Finally, when the job completes, the results are displayed as shown below:

 

10GB Terasort Example


Job Info

Status: Completed Sucessfully
Type: jar
Start time: 12/30/2011 5:54:16 PM
End time: 12/30/2011 6:04:59 PM
Exit code: 0

Command

call hadoop.cmd jar hadoop-examples-0.20.203.1-SNAPSHOT.jar teragen "-Dmapred.map.tasks=50" 100000000 /example/data/10GB-sort-input

Output (stdout)

Generating 100000000 using 50 maps with step of 2000000

Errors (stderr)

11/12/30 17:54:20 INFO mapred.JobClient: map 0% reduce 0%
11/12/30 17:54:49 INFO mapred.JobClient: map 2% reduce 0%
11/12/30 17:54:52 INFO mapred.JobClient: map 4% reduce 0%
11/12/30 17:54:55 INFO mapred.JobClient: map 5% reduce 0%
11/12/30 17:55:01 INFO mapred.JobClient: map 6% reduce 0%
11/12/30 17:55:22 INFO mapred.JobClient: map 7% reduce 0%
11/12/30 17:55:28 INFO mapred.JobClient: map 8% reduce 0%
11/12/30 17:55:43 INFO mapred.JobClient: map 9% reduce 0%
11/12/30 17:55:46 INFO mapred.JobClient: map 12% reduce 0%
11/12/30 17:55:49 INFO mapred.JobClient: map 14% reduce 0%
11/12/30 17:56:10 INFO mapred.JobClient: map 15% reduce 0%
11/12/30 17:56:13 INFO mapred.JobClient: map 16% reduce 0%
11/12/30 17:56:28 INFO mapred.JobClient: map 18% reduce 0%
11/12/30 17:56:31 INFO mapred.JobClient: map 19% reduce 0%
11/12/30 17:56:34 INFO mapred.JobClient: map 20% reduce 0%
11/12/30 17:56:43 INFO mapred.JobClient: map 21% reduce 0%
11/12/30 17:56:49 INFO mapred.JobClient: map 22% reduce 0%
11/12/30 17:56:52 INFO mapred.JobClient: map 23% reduce 0%
11/12/30 17:56:58 INFO mapred.JobClient: map 24% reduce 0%
11/12/30 17:57:01 INFO mapred.JobClient: map 25% reduce 0%
11/12/30 17:57:04 INFO mapred.JobClient: map 26% reduce 0%
11/12/30 17:57:10 INFO mapred.JobClient: map 28% reduce 0%
11/12/30 17:57:19 INFO mapred.JobClient: map 29% reduce 0%
11/12/30 17:57:22 INFO mapred.JobClient: map 30% reduce 0%
11/12/30 17:57:28 INFO mapred.JobClient: map 31% reduce 0%
11/12/30 17:57:31 INFO mapred.JobClient: map 32% reduce 0%
11/12/30 17:58:04 INFO mapred.JobClient: map 33% reduce 0%
11/12/30 17:58:07 INFO mapred.JobClient: map 35% reduce 0%
11/12/30 17:58:10 INFO mapred.JobClient: map 36% reduce 0%
11/12/30 17:58:13 INFO mapred.JobClient: map 37% reduce 0%
11/12/30 17:58:19 INFO mapred.JobClient: map 38% reduce 0%
11/12/30 17:58:25 INFO mapred.JobClient: map 39% reduce 0%
11/12/30 17:58:34 INFO mapred.JobClient: map 40% reduce 0%
11/12/30 17:58:37 INFO mapred.JobClient: map 42% reduce 0%
11/12/30 17:58:44 INFO mapred.JobClient: map 43% reduce 0%
11/12/30 17:58:47 INFO mapred.JobClient: map 44% reduce 0%
11/12/30 17:58:52 INFO mapred.JobClient: map 45% reduce 0%
11/12/30 17:58:59 INFO mapred.JobClient: map 46% reduce 0%
11/12/30 17:59:23 INFO mapred.JobClient: map 48% reduce 0%
11/12/30 17:59:26 INFO mapred.JobClient: map 49% reduce 0%
11/12/30 17:59:32 INFO mapred.JobClient: map 50% reduce 0%
11/12/30 17:59:40 INFO mapred.JobClient: map 51% reduce 0%
11/12/30 17:59:44 INFO mapred.JobClient: map 52% reduce 0%
11/12/30 17:59:46 INFO mapred.JobClient: map 53% reduce 0%
11/12/30 17:59:47 INFO mapred.JobClient: map 54% reduce 0%
11/12/30 17:59:58 INFO mapred.JobClient: map 55% reduce 0%
11/12/30 18:00:11 INFO mapred.JobClient: map 56% reduce 0%
11/12/30 18:00:14 INFO mapred.JobClient: map 58% reduce 0%
11/12/30 18:00:16 INFO mapred.JobClient: map 59% reduce 0%
11/12/30 18:00:20 INFO mapred.JobClient: map 60% reduce 0%
11/12/30 18:00:23 INFO mapred.JobClient: map 61% reduce 0%
11/12/30 18:00:31 INFO mapred.JobClient: map 62% reduce 0%
11/12/30 18:00:50 INFO mapred.JobClient: map 63% reduce 0%
11/12/30 18:00:53 INFO mapred.JobClient: map 65% reduce 0%
11/12/30 18:00:59 INFO mapred.JobClient: map 66% reduce 0%
11/12/30 18:01:10 INFO mapred.JobClient: map 67% reduce 0%
11/12/30 18:01:13 INFO mapred.JobClient: map 68% reduce 0%
11/12/30 18:01:14 INFO mapred.JobClient: map 69% reduce 0%
11/12/30 18:01:17 INFO mapred.JobClient: map 70% reduce 0%
11/12/30 18:01:20 INFO mapred.JobClient: map 71% reduce 0%
11/12/30 18:01:23 INFO mapred.JobClient: map 72% reduce 0%
11/12/30 18:01:37 INFO mapred.JobClient: map 73% reduce 0%
11/12/30 18:01:38 INFO mapred.JobClient: map 74% reduce 0%
11/12/30 18:01:50 INFO mapred.JobClient: map 75% reduce 0%
11/12/30 18:02:07 INFO mapred.JobClient: map 76% reduce 0%
11/12/30 18:02:11 INFO mapred.JobClient: map 77% reduce 0%
11/12/30 18:02:14 INFO mapred.JobClient: map 78% reduce 0%
11/12/30 18:02:17 INFO mapred.JobClient: map 79% reduce 0%
11/12/30 18:02:20 INFO mapred.JobClient: map 80% reduce 0%
11/12/30 18:02:32 INFO mapred.JobClient: map 81% reduce 0%
11/12/30 18:02:44 INFO mapred.JobClient: map 82% reduce 0%
11/12/30 18:02:53 INFO mapred.JobClient: map 83% reduce 0%
11/12/30 18:02:59 INFO mapred.JobClient: map 84% reduce 0%
11/12/30 18:03:05 INFO mapred.JobClient: map 85% reduce 0%
11/12/30 18:03:08 INFO mapred.JobClient: map 87% reduce 0%
11/12/30 18:03:14 INFO mapred.JobClient: map 88% reduce 0%
11/12/30 18:03:20 INFO mapred.JobClient: map 89% reduce 0%
11/12/30 18:03:38 INFO mapred.JobClient: map 90% reduce 0%
11/12/30 18:03:41 INFO mapred.JobClient: map 92% reduce 0%
11/12/30 18:03:47 INFO mapred.JobClient: map 93% reduce 0%
11/12/30 18:03:50 INFO mapred.JobClient: map 94% reduce 0%
11/12/30 18:03:56 INFO mapred.JobClient: map 95% reduce 0%
11/12/30 18:04:05 INFO mapred.JobClient: map 96% reduce 0%
11/12/30 18:04:11 INFO mapred.JobClient: map 97% reduce 0%
11/12/30 18:04:14 INFO mapred.JobClient: map 98% reduce 0%
11/12/30 18:04:23 INFO mapred.JobClient: map 99% reduce 0%
11/12/30 18:04:47 INFO mapred.JobClient: map 100% reduce 0%
11/12/30 18:04:58 INFO mapred.JobClient: Job complete: job_201112290558_0005
11/12/30 18:04:58 INFO mapred.JobClient: Counters: 16
11/12/30 18:04:58 INFO mapred.JobClient: Job Counters
11/12/30 18:04:58 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=4761149
11/12/30 18:04:58 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
11/12/30 18:04:58 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
11/12/30 18:04:58 INFO mapred.JobClient: Launched map tasks=54
11/12/30 18:04:58 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0
11/12/30 18:04:58 INFO mapred.JobClient: File Input Format Counters
11/12/30 18:04:58 INFO mapred.JobClient: Bytes Read=0
11/12/30 18:04:58 INFO mapred.JobClient: File Output Format Counters
11/12/30 18:04:58 INFO mapred.JobClient: Bytes Written=10000000000
11/12/30 18:04:58 INFO mapred.JobClient: FileSystemCounters
11/12/30 18:04:58 INFO mapred.JobClient: FILE_BYTES_READ=113880
11/12/30 18:04:58 INFO mapred.JobClient: HDFS_BYTES_READ=4288
11/12/30 18:04:58 INFO mapred.JobClient: FILE_BYTES_WRITTEN=1180870
11/12/30 18:04:58 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=10000000000
11/12/30 18:04:58 INFO mapred.JobClient: Map-Reduce Framework
11/12/30 18:04:58 INFO mapred.JobClient: Map input records=100000000
11/12/30 18:04:58 INFO mapred.JobClient: Spilled Records=0
11/12/30 18:04:58 INFO mapred.JobClient: Map input bytes=100000000
11/12/30 18:04:58 INFO mapred.JobClient: Map output records=100000000
11/12/30 18:04:58 INFO mapred.JobClient: SPLIT_RAW_BYTES=4288
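
If you want a quick summary of a run like this one, the JobClient log can be parsed with a few lines of Python. The sketch below is a rough illustration and not part of the sample: the `summarize` function and its field names are invented here, and the embedded log is a handful of lines copied from the output above. It measures elapsed time between the first and last log lines (638 seconds), which is a little shorter than the 643 seconds between the Start time and End time in Job Info. The log shows 54 launched map tasks for the 50 requested; it does not say why, so the extra four are reported only as additional attempts.

```python
import re
from datetime import datetime

# A few lines copied from the job log above; paste the full stderr text to use all of it.
SAMPLE_LOG = """\
11/12/30 17:54:20 INFO mapred.JobClient: map 0% reduce 0%
11/12/30 18:04:47 INFO mapred.JobClient: map 100% reduce 0%
11/12/30 18:04:58 INFO mapred.JobClient: Job complete: job_201112290558_0005
11/12/30 18:04:58 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=4761149
11/12/30 18:04:58 INFO mapred.JobClient: Launched map tasks=54
11/12/30 18:04:58 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=10000000000
"""

LINE_RE = re.compile(
    r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) INFO mapred\.JobClient: (.*)$"
)


def summarize(log_text: str, requested_maps: int) -> dict:
    """Derive elapsed time and simple rates from JobClient log lines."""
    timestamps = []
    counters = {}
    for raw in log_text.splitlines():
        match = LINE_RE.match(raw.strip())
        if not match:
            continue  # skip anything that is not a JobClient INFO line
        # Timestamps are written as yy/mm/dd hh:mm:ss.
        timestamps.append(datetime.strptime(match.group(1), "%y/%m/%d %H:%M:%S"))
        name, sep, value = match.group(2).rpartition("=")
        if sep and value.strip().isdigit():
            counters[name.strip()] = int(value)
    if not timestamps:
        raise ValueError("no JobClient log lines found")

    elapsed_s = (max(timestamps) - min(timestamps)).total_seconds()
    return {
        "elapsed_s": elapsed_s,
        # Decimal megabytes (10**6 bytes) written to HDFS per second.
        "mb_per_s": counters["HDFS_BYTES_WRITTEN"] / elapsed_s / 1e6,
        # Map slot time divided by wall-clock time: average maps running at once.
        "avg_busy_map_slots": counters["SLOTS_MILLIS_MAPS"] / 1000 / elapsed_s,
        "extra_map_attempts": counters["Launched map tasks"] - requested_maps,
    }


if __name__ == "__main__":
    for key, value in summarize(SAMPLE_LOG, requested_maps=50).items():
        print(key, round(value, 2))
```

For this run the sketch reports roughly 15.67 MB/s written to HDFS, with about 7.46 map slots busy on average.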

 

 Keywords: Windows Azure, Hadoop, Apache, BigData, Cloud, MapReduce