Azure Hadoop Services: Introduction to Interactive JavaScript Console (video)
Introduction
The Microsoft Azure deployment of Hadoop Services for Windows lets you set up a private Hadoop cluster on Azure. One of the included administration/deployment tools is an Interactive Console for JavaScript and Hive. This video introduces the Interactive JavaScript console. Tester David Zhang demonstrates running several JavaScript commands against your Hadoop cluster.
See Also
- More Videos about Hadoop Services on Windows and Windows Azure
- Apache Hadoop Services on Windows - wiki Homepage
- Microsoft's Big Data channel on YouTube
Transcript (edited for readability)
Introduction to the Hadoop Services on Azure Interactive JavaScript Console (video)
Hi, my name is David Zhang and I'm a Tester on the Microsoft Hadoop Services for Windows team. In this video I'll introduce you to the Interactive JavaScript console on Windows Hadoop.
I'm going to start by going to the Hadooponazure.com website and signing myself in. I provisioned a cluster earlier so it's going to bring me back that cluster.
I click on the Interactive Console tile.
This is an interactive console that's sitting on top of my Hadoop cluster. It's powered by JavaScript and running inside my browser. It evaluates simple JavaScript expressions. It also lets me run file system commands, HDFS commands, as if I were on a client console somewhere. For example, if I type a hash (#) followed by a file system command such as ls, the console sends that ls command to the HDFS.
We have a bunch of commands in the browser and in this video I’m going to do a quick walkthrough of those features. I’ll use the WordCount sample to find the top 10 most-common words in the Gutenberg samples that come installed with Hadoop Services on Windows Azure.
Upload the Sample Files to the HDFS
I’ll start by writing a MapReduce program as a JavaScript script.
I open NotePad and I have a pre-prepared script that I paste in here. You can see that this is a very simple MapReduce script.
I have a Mapper and a Reducer defined as functions. What the WordCount sample does is look at the words in the corpus and emit a ‘one’ for each occurrence of each word. The Reducer then sums up all those ones for each word to give you a word count. It's the standard WordCount example, but now it's written in JavaScript, which keeps the program a lot cleaner.
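The script itself is not shown in the transcript. As a rough sketch of what such a Mapper and Reducer could look like, assume the console calls functions named map and reduce, hands each a context object with a write() method, and gives the Reducer an iterator with hasNext() and next(). Those signatures are assumptions, and runLocally is a made-up stand-in that lets the sketch run outside Hadoop:

```javascript
// Mapper: emit (word, 1) for every word occurrence in a line of text.
// The (key, value, context) signature is an assumption, not confirmed by the video.
var map = function (key, value, context) {
    var words = value.split(/[^a-zA-Z]+/);
    for (var i = 0; i < words.length; i++) {
        if (words[i] !== "") {
            context.write(words[i].toLowerCase(), 1);
        }
    }
};

// Reducer: sum the ones emitted for a single word.
var reduce = function (key, values, context) {
    var sum = 0;
    while (values.hasNext()) {
        sum += parseInt(values.next(), 10);
    }
    context.write(key, sum);
};

// Hypothetical local harness (not part of Hadoop) that mimics map, group, reduce.
function runLocally(lines) {
    var grouped = Object.create(null);
    lines.forEach(function (line, index) {
        map(index, line, {
            write: function (word, one) {
                (grouped[word] = grouped[word] || []).push(one);
            }
        });
    });
    var counts = {};
    Object.keys(grouped).forEach(function (word) {
        var position = 0;
        var iterator = {
            hasNext: function () { return position < grouped[word].length; },
            next: function () { return grouped[word][position++]; }
        };
        reduce(word, iterator, {
            write: function (key, total) { counts[key] = total; }
        });
    });
    return counts;
}

console.log(runLocally(["the quick fox", "The lazy dog and the fox"]));
```

On the cluster, Hadoop does the grouping between the two functions; the harness only imitates that step in memory.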
Save this as a file named WordCount.js.
Upload this file to the HDFS.
To do this, there is a command on the Interactive JavaScript console called put so we do fs.put().
This brings up a dialog where we select the file we previously saved.
Press Upload.
You can see the file is now uploaded. If we run ls again, we see that the file is actually on the HDFS.
Next let's also upload the Gutenberg examples onto the HDFS.
Make a directory on the HDFS called gutenberg.
And unfortunately, because the browser only allows you to upload one file at a time, I need to do this three times.
The good thing about this is that like with a normal console, I can press up to get to the previous command. So there's less typing involved.
Upload each of the 3 files into the gutenberg directory I just created.
So now the 3 files are uploaded. Let’s review what I’ve done so far.
- The WordCount.js file is loaded and I've created the gutenberg directory.
- The 3 files are loaded into the gutenberg directory.
- And I check the WordCount JavaScript file that I loaded – I can see it's what I typed (pasted) into NotePad.
Write and Run the MapReduce Query
The next thing I'll do is use the Gutenberg files as input and run the JavaScript MapReduce program on them. But I’m going to do this in such a way that the word counts are sorted in descending order and I keep only the top 10.
I do this by writing a query (see complete query below) that:
a) takes input from the gutenberg directory,
b) runs MapReduce using the WordCount JavaScript,
c) produces 2 columns: the first column is called word, and the 2nd column is called count and is of type long,
d) orders by count descending,
e) takes the top 10,
f) and stores the results in a folder called gbtop10.
js> from("gutenberg").mapReduce("WordCount.js", "word, count:long").orderBy("count DESC").take(10).to("gbtop10")
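To make steps d) and e) concrete, here is a plain-JavaScript sketch of what ordering by count descending and taking the top entries amounts to on an in-memory array. The helper name topN and the sample counts are made up for illustration; the console runs these steps on the cluster, not in the browser.

```javascript
// Hypothetical helper: sort rows by count, largest first, and keep the first n.
function topN(rows, n) {
    return rows
        .slice()                                              // copy so the input is not mutated
        .sort(function (a, b) { return b.count - a.count; })  // descending by count
        .slice(0, n);
}

// Made-up sample rows in the "word, count:long" shape.
var sampleRows = [
    { word: "of", count: 5 },
    { word: "the", count: 9 },
    { word: "and", count: 7 }
];

console.log(topN(sampleRows, 2));
```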
Run the query and, while it is processing, hold SHIFT and click the ViewLog link. This opens a new browser window that displays the job status. In this window we can see that a lot is going on.
- The window pulls in the logs from Pig as the Pig script is executing.
- The query I typed in the console is translated into a set of Pig queries. And Pig translates those Pig queries into a set of MapReduce jobs, and runs each of those jobs on the Hadoop cluster.
- As it's doing that, it's generating some output to the Standard Output which is what I am seeing here in the browser.
- When the job completes, the Pig job displays a Success message.
Verify the Results of the MapReduce Job
- I first verify that the output directory I specified, gbtop10, is in the HDFS. I do this by typing #ls. It’s there.
- Then I type #ls gbtop10 to see the contents of the gbtop10 directory. I see the very familiar part-r-00000 file, which is a very standard naming scheme that MapReduce uses.
- I open this file by typing #cat gbtop10/part-r-00000. I see the top 10 most common words in the Gutenberg text have been found, along with the count.
As expected, "the" is the most common word in the corpus, with 47,430 occurrences.
Convert the data into a JavaScript array
So now that I have these results in the HDFS I can read these results back out into my JavaScript console. fs.read is a function we provide that allows me to do that.
To read from the directory named gbtop10, I type:
js> file = fs.read("gbtop10")
If I don't specify any files, it reads all the files in that directory and concatenates them into one string. And that data is now stored in the variable file.
I can see the data by typing file.data.
To turn that data into a set of JavaScript objects, I use the parse function:
js> data = parse(file.data, "word, count:long")
I tell it to parse the file data and I give it a schema string; notice that it's the same schema string that I used before. I’m saying that the data has 2 columns. The first column is called word; I don't specify a type for it, so it defaults to string. The second column is called count and is of type long.
The function returns a JavaScript array of 10 elements, where each element is a JavaScript object with 2 properties. The first property is called word and it has a string and the second property is called count and it has the integer value of the count.
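As an illustration of what a schema-driven parse involves, the following sketch turns delimited text into an array of objects. It is not the console's parse implementation: the name parseWithSchema is hypothetical, it assumes one record per line with tab-separated fields, and it treats long as an ordinary JavaScript number.

```javascript
// Hypothetical stand-in for the console's parse(); assumes tab-separated fields.
function parseWithSchema(text, schema) {
    // "word, count:long" -> [{name: "word", type: "string"}, {name: "count", type: "long"}]
    var columns = schema.split(",").map(function (part) {
        var pieces = part.trim().split(":");
        return { name: pieces[0], type: pieces[1] || "string" };
    });
    return text
        .split("\n")
        .filter(function (line) { return line.trim() !== ""; })
        .map(function (line) {
            var fields = line.split("\t");
            var row = {};
            columns.forEach(function (column, i) {
                row[column.name] = column.type === "long"
                    ? parseInt(fields[i], 10)
                    : fields[i];
            });
            return row;
        });
}

// Made-up sample text, not output from the demo.
var rows = parseWithSchema("alpha\t3\nbeta\t2\n", "word, count:long");
console.log(rows);
```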
Create a bar graph of the top 10 words
This data is now a JavaScript array, and it has all the standard properties and methods that JavaScript arrays come with.
- When I enter typeof data, JavaScript returns object.
- When I enter data.length, JavaScript returns 10.
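typeof reports object because JavaScript arrays are objects; Array.isArray is the usual way to confirm that a value really is an array. A small standalone illustration with made-up rows:

```javascript
// Made-up rows standing in for the parsed results.
var sample = [{ word: "alpha", count: 3 }, { word: "beta", count: 2 }];

var kind = typeof sample;              // arrays report "object"
var isArray = Array.isArray(sample);   // true only for real arrays
var words = sample.map(function (row) { return row.word; });

console.log(kind, isArray, sample.length, words.join(", "));
```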
The good thing about this is that I can push this into our graphing function, because our graphing functions can take a JavaScript array with this schema and this data.
At the JavaScript prompt, type graph.bar(data)
What I get is a bar graph of the top 10 words in the Gutenberg examples.
This bar graph is made using SVG, which HTML5 browsers can display directly in the page. I can do all sorts of things with it.
Click on the graph to open it in a new window.
Drag the handles to resize the graph.
I can also copy and save the picture as an SVG or a PNG if I want to.
And I can also copy the picture and paste it into another program like Paint or PowerPoint.
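For a sense of how an array of word counts maps onto SVG, here is a minimal sketch that builds bar-chart markup as a string. It is not the console's graph.bar implementation; barChartSvg, the sizes, and the sample rows are all assumptions made for illustration.

```javascript
// Hypothetical sketch: one <rect> per row, scaled to the largest count.
// Assumes each word is plain letters, so no XML escaping is done.
function barChartSvg(rows, width, height) {
    var max = Math.max.apply(null, rows.map(function (r) { return r.count; }));
    var barWidth = width / rows.length;
    var bars = rows.map(function (r, i) {
        var barHeight = max > 0 ? (r.count / max) * height : 0;
        return '<rect x="' + (i * barWidth) +
               '" y="' + (height - barHeight) +
               '" width="' + (barWidth - 2) +
               '" height="' + barHeight + '"><title>' + r.word + '</title></rect>';
    });
    return '<svg xmlns="http://www.w3.org/2000/svg" width="' + width +
           '" height="' + height + '">' + bars.join("") + '</svg>';
}

// Made-up sample rows.
console.log(barChartSvg([{ word: "alpha", count: 4 }, { word: "beta", count: 2 }], 200, 100));
```

Because the result is plain markup, it can be saved as an .svg file or placed inline in a page.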
That's the end of the demo. Thank you for watching.