Azure Hadoop: 10GB GraySort - Terasort (video) (Part 2)

Introduction

Hadoop-based Services for Windows Azure includes several samples you can use for learning and testing. One of them is the 10GB GraySort, a scaled-down version of the Hadoop Terasort benchmark. The sample consists of three jobs; in this video, developer Brad Sarsfield walks you through the second of them, Terasort.

Transcript

Hi, my name is Brad Sarsfield and I’m a Developer on the Hadoop Services for Windows and Windows Azure team.

In the first video in this series, I generated 10GB of data using the teragen example. In this video I’ll run terasort to sort that data. Terasort has two phases: a map phase and then a reduce phase. The map tasks partition the data into 25 ranges and the reduce tasks sort the data and write it to HDFS.

Just like with teragen, I run terasort from the hadoop-examples JAR.
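From the command line, that invocation would look something like this (a sketch only; the exact hadoop-examples JAR file name varies by distribution and version):

    hadoop jar hadoop-examples.jar terasort <input path> <output path>

On the portal, the Create Job page builds this command for me from the fields I fill in.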

  1. To begin I select 10GB GraySort from the Samples page and then click to Deploy it to my cluster.

  2. On the Create Job page, the fields are pre-populated for me, but I need to make a few changes.

    1. First, I name the job Terasort Example.

    2. I change the first parameter to terasort – this is the name of the program that will be run from the hadoop-examples JAR.

    3. The second parameter specifies the number of map tasks and reduce tasks to be executed. Here I introduce a new parameter called mapred.reduce.tasks and set it to 25. I start out with a 2-to-1 ratio of map tasks to reduce tasks and can optimize further from there. This ratio matches the configuration of my nodes – each of my nodes has 2 map slots and 1 reduce slot. If I had a bigger machine with 8 cores, I might configure it to use 6 map tasks and 3 reduce tasks.

    4. The third parameter identifies the input and output files. The terasort sample takes its input from the previously created 10GB-sort-input file and writes its output to a file named 10G-sort-output.

      The command line content I added to the parameters is translated into the Final Command below.
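      Assembled from the parameters above, that final command should look roughly like this (a sketch: the JAR file name is an assumption, the -Dmapred.map.tasks flag is inferred from the 50 maps mentioned later, and the input and output names are taken from the parameter values):

        hadoop jar hadoop-examples.jar terasort -Dmapred.map.tasks=50 -Dmapred.reduce.tasks=25 10GB-sort-input 10G-sort-output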

  3. Execute the Job.

    Each map task takes a file and, based on the contents of that file, assigns its records to 25 buckets, or data-range partitions. It doesn’t produce sorted output for the whole data set yet – it samples the data to figure out the appropriate range boundaries and partitions accordingly. The map task then internally sorts only the data it is responsible for – the data in that one file.

    After the map tasks have completed, the reduce tasks start up. Each reduce task is put in charge of one of those buckets. It reaches out to each node and asks for any data that fits into its bucket. For example, if my bucket covers the numbers 60 through 88, I collect those numbers from each map task’s output – that is my piece of the job. Once all of that data has been read from across the cluster, the reduce task brings it in, sorts it, and writes its part of the final output. All 25 reduce tasks do this in parallel.

    The map task results are stored in temporary files – they are not saved in HDFS and are not persisted past the life of the job.

    The job is in the reduce phase now, working through the reduces. I could configure my cluster to do this work faster – for instance, with more worker nodes it would run faster. But for this video I have only 4 worker nodes dividing up the work.

    When I submit the job and tell it to use 50 maps and 25 reduces, the job scheduler decides on the placement and execution of tasks based on the available slots. Once a task finishes, the next task in the queue begins.

    Each of my medium VM nodes is set to accept 2 map tasks and 1 reduce task. So in this case, 4 worker nodes, each with 2 slots available for map tasks, means I am running 8 map tasks at a time.
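    Those per-node slot counts come from the cluster configuration rather than from the job itself. A minimal sketch of what the relevant mapred-site.xml entries might look like, assuming the classic Hadoop 1.x property names and values matching my medium VMs:

      <property>
        <name>mapred.tasktracker.map.tasks.maximum</name>
        <value>2</value>   <!-- 2 map slots per worker node -->
      </property>
      <property>
        <name>mapred.tasktracker.reduce.tasks.maximum</name>
        <value>1</value>   <!-- 1 reduce slot per worker node -->
      </property>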

    The job scheduler tries to affinitize each task to the placement of its data: it places the task on the same machine as the data when it can, or as close to it as possible.

  4. Once terasort completes, I see all the statistics that were written out. I scroll down and see that all maps and reduces completed. 
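    As a quick check, I could also list the output directory from a Hadoop command prompt on the cluster (a sketch; the path is assumed from the output parameter above):

      hadoop fs -ls 10G-sort-output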

    Now it’s time to validate the sort. Validating the data output is covered in the next video in this series.

    Thank you for watching; I hope you found it helpful.