How To: Upload Data and Use the WordCount Sample with Hadoop Services for Windows Azure (video)
Hadoop-based Services for Windows Azure includes several samples you can use for learning and testing. In this video, developer Brad Sarsfield demonstrates two different ways to upload data to Hadoop-based Services for Windows Azure. After he uploads the data, he uses the included WordCount sample to run a MapReduce program on the uploaded data.
Transcript
Hi, my name is Brad Sarsfield, and I’m a developer on the Hadoop Services for Windows and Windows Azure team.
Today I’m going to show you two different ways to upload data into a Hadoop cluster on Windows Azure. Once the data is uploaded to my cluster, I’ll use one of the samples included with Hadoop Services on Windows Azure to run a WordCount MapReduce job against the new data.
To upload the data, I have many options: I can use the Interactive JavaScript Console, secure FTPS, Azure Blob storage, Amazon S3, or an import from the Azure Data Market. Let’s start with the Interactive JavaScript Console, which I can access from the Hadoop Services on Azure web portal.
Upload data using the JavaScript Console
- The first thing I need to do is create a directory for my data in HDFS inside my cluster. I name the directory example/data.
- To select a file from my local hard drive, I use fs.put(). The name of the local file I am uploading is davinci.txt, and the HDFS destination is my newly created example/data folder.
- Click Upload. That’s it!
- To make sure the file uploaded properly, I run the #ls command to see a directory listing. (The console commands are sketched below.)
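In the Interactive JavaScript Console, that sequence looks roughly like the following sketch. Note that fs.put() takes no arguments here; it opens the "Upload a file" dialog, where you browse to the local file, set the HDFS destination, and click Upload. The exact prompts and output on your cluster will differ.

#mkdir example/data
fs.put()
#ls example/data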
Upload data using secure FTPS
Another way to upload data into HDFS on Windows Azure is via secure FTP. The FTP server runs on the headnode inside Windows Azure. We chose secure FTP because regular FTP puts your credentials over the wire in cleartext. Another security requirement is that the FTP password must be MD5-hashed.
- By default, the FTPS port is closed. To open the port, select the Open Ports tile.
- Toggle the FTPS port.
In the background this opens up the port to my Hadoop cluster’s headnode and allows me to upload files via FTPS.
The FTPS client that we recommend is curl, and we make it easy for you to use curl AND to MD5-hash your password by including a sample PowerShell script. The sample script is included here with the WordCount sample.
- To use the script, I download it to my local box and fill in the appropriate cluster name, my user name, and password. Now, when I open up PowerShell and call the script, it uploads the specified davinci.txt file to the example/data directory in HDFS. (A sketch of such a script follows.)
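The video doesn’t walk through the script itself, but a minimal PowerShell sketch of the same approach looks like this. The host name, port, credentials, and file path below are placeholder values, and the sample script shipped with the WordCount sample may differ:

# Placeholder connection details -- replace these with your own cluster's values
$clusterHost = "yourclustername.cloudapp.net"
$ftpsPort    = 2226            # example value; use the port shown on the Open Ports page
$userName    = "yourusername"
$password    = "yourpassword"
$localFile   = "C:\data\davinci.txt"

# MD5-hash the password, since the FTPS server expects a hashed password
$md5     = [System.Security.Cryptography.MD5]::Create()
$pwBytes = [System.Text.Encoding]::UTF8.GetBytes($password)
$pwHash  = ($md5.ComputeHash($pwBytes) | ForEach-Object { $_.ToString("x2") }) -join ""

# Upload the file to the example/data directory in HDFS using curl over FTPS
# (-k skips certificate validation, -T uploads the local file)
& curl.exe -k -T $localFile --user "${userName}:${pwHash}" "ftps://${clusterHost}:${ftpsPort}/example/data/"

The important pieces are that curl speaks FTPS, so the credentials are encrypted on the wire, and that the password is MD5-hashed before it is sent, which satisfies the second requirement mentioned above.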
And there you have the second way to upload data to your Hadoop cluster on Windows Azure. Now it’s time to deploy the WordCount job.
Deploy the WordCount job
Hadoop on Azure comes with samples I can use for learning and testing. Today I’m going to use the WordCount Java MapReduce sample to count the occurrences of each word in my davinci.txt file. I am going to run this example on my cluster using the Hadoop Examples JAR.
On the WordCount Sample page, select Deploy to your cluster.
The Job Template is prepopulated with the appropriate job name; it attaches the JAR file and sets the parameters that are going to be passed into the job. It even displays the command that will run on the headnode.
So for my job, the parameters are:
- wordcount, to indicate that we are running the WordCount example from the hadoop-examples JAR
- The input file, davinci.txt, and the output, DaVinciAllTopWords
Based on my parameters, the Final Command that will be executed is constructed and displayed below the parameters.
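On my cluster, the constructed command looks something like the line below; the examples JAR file name varies with the Hadoop version installed, and the input path reflects where I uploaded the file earlier:

hadoop jar hadoop-examples.jar wordcount /example/data/davinci.txt DaVinciAllTopWords

The first argument (wordcount) selects the program inside the examples JAR, and the remaining two arguments are the input file and the output location.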
I click Execute Job. My davinci.txt file is sent to the headnode, where the mappers evaluate the text line by line and the reducers sum the counts for each word.
Each map task reads a line from the file and then parses all of the words. The output of the map is a key-value pair for each word and the number one. The reducers then sum up the counts for each word from all of the map outputs and in turn emit each word and its total number of occurrences as the final output.
The Job page displays status. The Standard Errors section contains messages from Hadoop, such as status, statistics, and informational messages. The Output section contains messages generated by the wordcount Java code.
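That Java code follows the classic WordCount pattern. The sketch below is based on the standard Apache Hadoop WordCount example rather than the exact source in the examples JAR, but the mapper and reducer work the same way:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: read one line of text, split it into words,
  // and emit (word, 1) for every word found.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer tokens = new StringTokenizer(value.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: for each word, sum all of the 1s emitted by the mappers
  // and emit (word, total occurrences).
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable count : values) {
        sum += count.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // Driver: wires the mapper and reducer together; args[0] is the
  // input path (davinci.txt) and args[1] is the output location.
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}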
My job completed successfully. I see that a new file called DaVinciAllTopWords has been created.
I used two different methods to upload data to Hadoop Services on Windows Azure and then ran the WordCount job on that data.
Thank you for viewing this tutorial. I hope you found it helpful.