Azure HDInsight Service: Working With Data
This tutorial covers several techniques for storing and importing data for use in Hadoop MapReduce jobs run with Windows Azure HDInsight Service (formerly Apache™ Hadoop™-based Services for Azure). Apache Hadoop is a software framework that supports data-intensive distributed applications. While Hadoop is designed to store data for such applications with its own distributed file system (HDFS), cloud-based on-demand processing can also use other forms of cloud storage, such as Windows Azure storage. Collecting and importing data in such scenarios is the subject of this tutorial.
You will learn:
- How to use Windows Azure storage in MapReduce jobs.
- How to import SQL Server data with Sqoop.
Using Azure Storage in MapReduce
While HDFS is the natural storage solution for Hadoop jobs, the data a job needs can also reside on large, scalable, cloud-based storage systems such as Windows Azure storage. It is reasonable to expect that Hadoop, when running on Windows Azure, can read data directly from such cloud storage.
In this tutorial you will analyze IIS logs located in Windows Azure storage by using a standard Hadoop streaming MapReduce job. The scenario demonstrates a Windows Azure web role that generates IIS logs using the Windows Azure diagnostic infrastructure. A simple Hadoop job reads the logs directly from Windows Azure storage and finds the 5 most popular URIs (web pages).
Upload IIS Log File to Azure Storage Container
You must have a Windows Azure storage account for storing the IIS log file and the results of the MapReduce job. For instructions, see How to Create a Storage Account. After the account is created, write down the storage account name and the access key; you will use this information later in the tutorial.
To find the storage account name and its key
- Open Internet Explorer, and browse to Windows Azure Management Portal at https://www.windowsazure.com.
- Sign in to your account with your credentials.
- On the left, click STORAGE. You will see a list of your storage accounts.
- Click the storage account that will be used for this tutorial to select it. If you don't have one, create one.
- On the bottom of the page, click MANAGE KEYS.
- Write down the values of STORAGE ACCOUNT NAME and PRIMARY ACCESS KEY.
To simplify this tutorial, create two new containers in your storage account. The iislogsinput container is for the source IIS log file; the iislogsoutput container is for the MapReduce job results.
To create Windows Azure storage containers
- From the Windows Azure Management Portal, click STORAGE (sometimes it appears only as a table icon). You will see a list of your storage accounts.
- Click the storage account that will be used for this tutorial to select it.
- On the top, click CONTAINERS.
- On the bottom, click ADD CONTAINER.
- From New blob container, type or select the following values:
  | Name | Value |
  | --- | --- |
  | NAME | iislogsinput |
  | ACCESS | Private |
- Click the check mark to save the changes.
- Repeat steps 4 through 6 to create another container called iislogsoutput.
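If you prefer to create the containers from code rather than from the portal, the following is a minimal sketch that uses the Windows Azure storage client library. It assumes the Microsoft.WindowsAzure.Storage NuGet package is installed, and the account name and key shown are placeholders for the values you wrote down earlier:

using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;

class CreateContainers
{
    static void Main()
    {
        // Placeholder credentials: replace with your STORAGE ACCOUNT NAME and PRIMARY ACCESS KEY.
        string connectionString =
            "DefaultEndpointsProtocol=https;AccountName=<account-name>;AccountKey=<account-key>";

        CloudStorageAccount account = CloudStorageAccount.Parse(connectionString);
        CloudBlobClient blobClient = account.CreateCloudBlobClient();

        // Create the input and output containers used in this tutorial, if they do not already exist.
        blobClient.GetContainerReference("iislogsinput").CreateIfNotExists();
        blobClient.GetContainerReference("iislogsoutput").CreateIfNotExists();
    }
}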
If you have an IIS log file, you can skip the next procedure.
To generate IIS logs and place them in Windows Azure storage, you can enable Windows Azure Diagnostics in an ASP.NET web role, and configure DiagnosticInfrastructureLogs. For more information on Windows Azure diagnostics, see Enabling Diagnostics in Windows Azure.
To create the web role, you need Visual Studio 2012 or Visual Studio 2010 with the Windows Azure SDK and tools installed.
To create a Windows Azure Cloud service with IIS logs
Launch Visual Studio 2012 as administrator. You can also use Visual Studio 2010.
From the FILE menu, click New and then click Project.
On New Project, type or select the following values:
| Name | Value |
| --- | --- |
| Category | Templates, Visual C#, Cloud |
| Template | Windows Azure Cloud Service |
| Name | WebRoleWithIISLogs |
| Location | C:\CloudProjects |
| Solution name | WebRoleWithIISLogs |

From New Windows Azure Cloud Service, double-click ASP.NET Web Role in the left list to add it to the solution, and then click OK. The default name is WebRole1.
From Solution Explorer, expand WebRole1, and then double-click WebRole.cs to open the file.
At the beginning of the OnStart method, add the following code:
// Configure IIS Logging
DiagnosticMonitorConfiguration diagMonitorConfig = DiagnosticMonitor.GetDefaultInitialConfiguration();
diagMonitorConfig.DiagnosticInfrastructureLogs.ScheduledTransferLogLevelFilter = LogLevel.Information;
diagMonitorConfig.DiagnosticInfrastructureLogs.ScheduledTransferPeriod = TimeSpan.FromMinutes(1);
diagMonitorConfig.Directories.ScheduledTransferPeriod = TimeSpan.FromMinutes(1);
diagMonitorConfig.Directories.DataSources.Add(new DirectoryConfiguration()
{
    Container = "wad-iis-logfiles",
    Path = "logfiles"
});
DiagnosticMonitor.Start("Microsoft.WindowsAzure.Plugins.Diagnostics.ConnectionString", diagMonitorConfig);
From Solution Explorer, expand WebRoleWithIISLogs, and then double-click WebRole1.
On the left, click Settings.
Add the details of your Windows Azure Storage account to the diagnostics connection string.
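The diagnostics connection string generally takes the following form, where the account name and key are placeholders for the values you wrote down earlier:

DefaultEndpointsProtocol=https;AccountName=<storage account name>;AccountKey=<primary access key>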
Press F5 to run the application. When it is open, use it to browse to different pages on the web site. After about one minute, an IIS log will be persisted to Windows Azure storage.
Important: You must leave the application running in the compute emulator for more than one minute, or until the log file is synchronized to the Windows Azure storage account.
Uploading, downloading, and browsing files in Windows Azure Blob storage is easy if you install a blob storage browsing application such as Azure Storage Explorer or CloudBerry Explorer for Windows Azure Blob Storage.
The following steps are for the Azure Storage Explorer application; you can use the same techniques with CloudBerry Explorer, but the steps may differ.
To copy the log file to the iislogsinput container on Windows Azure storage
- Run Azure Storage Explorer.
- Click Add Account.
- Enter the Storage account name and Storage account key for your Windows Azure storage account (you wrote down this information at the beginning of the tutorial), and then click Add Storage Account. A new tab with the storage account name is created.
- Click the tab for the storage account.
- Click wad-iis-logfiles to open the folder. You will see an IIS log file in the folder.
- Click the log file.
- Click Download to download a copy to your local computer.
- Rename the file to iislogs.txt.
- From Azure Storage Explorer, click iislogsinput from the left to select the folder.
- Click Upload.
- Browse to the iislogs.txt file from your local computer, and then click OK.
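As an alternative to a storage browsing tool, you can upload the file from code. The following is a minimal sketch that uses the Windows Azure storage client library; it assumes the Microsoft.WindowsAzure.Storage NuGet package is installed, and the connection string and local file path are placeholders:

using System.IO;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;

class UploadIisLog
{
    static void Main()
    {
        // Placeholder connection string: use your storage account name and key.
        CloudStorageAccount account = CloudStorageAccount.Parse(
            "DefaultEndpointsProtocol=https;AccountName=<account-name>;AccountKey=<account-key>");

        CloudBlobContainer container =
            account.CreateCloudBlobClient().GetContainerReference("iislogsinput");

        // Upload the renamed local log file as a block blob named iislogs.txt.
        CloudBlockBlob blob = container.GetBlockBlobReference("iislogs.txt");
        using (FileStream stream = File.OpenRead(@"C:\temp\iislogs.txt")) // placeholder local path
        {
            blob.UploadFromStream(stream);
        }
    }
}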
Create Hadoop MapReduce Streaming Job
Hadoop Streaming is a utility that lets you create and run MapReduce jobs by creating an executable or a script in any language. Both the mapper and reducer read the input from STDIN and write the output to STDOUT. For more information about Hadoop Streaming, see the Hadoop streaming documentation.
To create a MapReduce streaming job
Open Microsoft Visual Studio 2012 as Administrator.
From the FILE menu, click New, and then click Project.
From New Project, type or select the following values:

| Name | Value |
| --- | --- |
| Category | Templates, Visual C#, Windows |
| Template | Console Application |
| Name | Map |
| Location | C:\CloudProjects |
| Solution name | Map |

Click OK.
From Solution Explorer, double-click Program.cs.
At the beginning of the file, add the following using statement:
using System.IO;
In the Main() function, add the following code:
if (args.Length > 0)
{
    Console.SetIn(new StreamReader(args[0]));
}

var counters = new Dictionary<string, int>();
string line;
while ((line = Console.ReadLine()) != null)
{
    var words = line.Split(' ');
    foreach (var uri in words)
    {
        if ((uri.StartsWith(@"http://")) || (uri.EndsWith(".aspx")) || (uri.EndsWith(".html")))
        {
            if (!counters.ContainsKey(uri))
                counters.Add(uri, 1);
            else
                counters[uri]++;
            Console.WriteLine(string.Format("{0}\t{1}", uri, counters[uri]));
        }
    }
}
The code searches for strings that start with "http://" or end with ".aspx" or ".html". You can customize the code if you want to.
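For example (illustrative values only), if the URI http://contoso.com/default.aspx appears three times in the input, the mapper emits a tab-separated running count for each occurrence:

http://contoso.com/default.aspx	1
http://contoso.com/default.aspx	2
http://contoso.com/default.aspx	3

The reducer you create next keeps only the highest count it sees for each URI, so emitting a running count from the mapper is sufficient.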
From Solution Explorer, right-click Solution 'Map', click Add, and then click New Project.
From Add New Project, type or select the following values:
| Name | Value |
| --- | --- |
| Category | Templates, Visual C#, Windows |
| Template | Console Application |
| Name | Reduce |
| Location | C:\CloudProjects |

Click OK.
From Solution Explorer, expand Reduce, and then double-click Program.cs.
At the top of the file, add the following using statement:
using System.IO;
In the Main() function, add the following code:
if (args.Length > 0)
{
    Console.SetIn(new StreamReader(args[0]));
}

// counter for each uri
var UriCounters = new Dictionary<string, int>();
// list of the uri ordered by the counter value
var topUriList = new SortedList<int, string>();

string line;
while ((line = Console.ReadLine()) != null)
{
    // parse the uri and the number of requests
    var values = line.Split('\t');
    string uri = values[0];
    int numOfRequests = int.Parse(values[1]);

    // save the max number of requests for each uri in UriCounters
    if (!UriCounters.ContainsKey(uri))
        UriCounters.Add(uri, numOfRequests);
    else if (UriCounters[uri] < numOfRequests)
        UriCounters[uri] = numOfRequests;
}

// create the ordered list
foreach (var keyValue in UriCounters)
    if (!topUriList.ContainsKey(keyValue.Value))
        topUriList.Add(keyValue.Value, keyValue.Key);
    else
        topUriList[keyValue.Value] = string.Format("{0} , {1}", topUriList[keyValue.Value], keyValue.Key);

// make the list descending
var lst = topUriList.Reverse().ToArray();

// print the results
for (int i = 0; (i < 5) && (i < lst.Count()); i++)
    Console.WriteLine(string.Format("{0} {1}", lst[i].Key, lst[i].Value));
Press F6 to compile the two projects.
By default, the executable files are stored in the Map\bin\Debug and Reduce\bin\Debug folders under the respective project folders. You must upload the executables to HDFS before you can execute them.
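Because the mapper and reducer are ordinary console applications that read from STDIN and write to STDOUT, you can sanity-check them locally from a command prompt before uploading them. The following is a sketch only; it assumes you run it from a folder that contains copies of iislogs.txt, map.exe, and reduce.exe, and the sort command stands in for the sorting that Hadoop performs between the map and reduce phases:

type iislogs.txt | map.exe | sort | reduce.exe

The output should list up to five URIs with their highest request counts.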
To copy map.exe and reduce.exe to HDFS
Open the Hadoop cluster portal at https://www.hadooponazure.com/.
Click the Interactive Console tile.
Click JavaScript at the top of the page.
In the Interactive JavaScript console, at the js> prompt, type the following command:
fs.put()
From Upload a file, type or select the following values:
| Name | Value |
| --- | --- |
| Source | (Specify the map.exe path and file name. The default folder for map.exe is Map\Map\bin\Debug under the project folder.) |
| Destination | /example/apps/map.exe |

Click Upload.
Repeat steps 4 through 6 to upload reduce.exe to /example/apps/reduce.exe.
In Source, browse to the reduce.exe file. The default folder for reduce.exe is Map\Reduce\bin\Debug under the project folder. In Destination, type /example/apps/reduce.exe. Then click Upload.
From the console, type the following command, and then hit ENTER.
#ls /example/apps
You will see both map.exe and reduce.exe listed.
Create/execute Hadoop Job
From HDInsight, you can connect to Windows Azure storage by using the asv protocol. For example, within Hadoop you would normally get a listing of files within HDFS by using the command-line interface:
#ls /
To access files within Windows Azure Blob storage instead, you can run the following command to list the iislogs.txt file you uploaded to the container earlier in the tutorial:
#ls asv://iislogsinput/
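In general, an ASV path takes the form shown below, where the container lives in the storage account that you configure in the next procedure:

asv://<container name>/<path to blob>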
Before you can access Windows Azure storage, you must configure ASV.
To set up ASV
- Open the Hadoop cluster portal at https://www.hadooponazure.com/.
- Click the Manage Cluster tile. If you are on a page showing your cluster, click the Go to Cluster link first before you can see the Manage Cluster tile.
- Click Set up ASV.
- Specify your Storage Account Name and its Passkey (Primary Access Key or Secondary Access key), and then click Save settings.
To create and execute a new Hadoop job
Open the Hadoop cluster portal at https://www.hadooponazure.com/.
Click the Create Job tile.
From Create Job, type or select the following values:
| Name | Value |
| --- | --- |
| Job Name | IIS Logs |
| JAR File | hadoop-streaming.jar |
| Parameter 0 | -files "hdfs:///example/apps/map.exe,hdfs:///example/apps/reduce.exe" |
| Parameter 1 | -input "asv://iislogsinput/iislogs.txt" -output "asv://iislogsoutput/results.txt" |
| Parameter 2 | -mapper "map.exe" -reducer "reduce.exe" |

The hadoop-streaming.jar file can be downloaded from http://www.java2s.com/Code/Jar/h/Downloadhadoopstreamingjar.htm.
Click Execute Job.
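For reference, the parameters above correspond roughly to the following Hadoop Streaming command line; this is a sketch of what the portal assembles for you, not a command you need to run yourself:

hadoop jar hadoop-streaming.jar -files "hdfs:///example/apps/map.exe,hdfs:///example/apps/reduce.exe" -input "asv://iislogsinput/iislogs.txt" -output "asv://iislogsoutput/results.txt" -mapper "map.exe" -reducer "reduce.exe"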
After the job completes, open the blob results.txt/part-00000 in the iislogsoutput container by using Azure Storage Explorer, and review the results.
Importing SQL Server data with Sqoop
While Hadoop is a natural choice for processing unstructured and semi-structured data such as logs and files, there may also be a need to process structured data stored in relational databases. Sqoop (SQL-to-Hadoop) is a tool that allows you to import structured data into Hadoop and use it in MapReduce and Hive jobs.
In this tutorial, you will install the AdventureWorks community sample databases into a Windows Azure SQL Database server, and then use Sqoop to import the data into HDFS.
To install the AdventureWorks databases into Windows Azure SQL Database
Download the Adventure Works for SQL Database sample database. Follow the installation instructions in the ReadMe.htm file to set up the SQL Database version of AdventureWorks2012.
Open SQL Server Management Studio and connect to the SQL Database Server.
Open the "AdventureWorks2012" database and click the New Query button.
Because Sqoop currently adds square brackets around the table name, add a synonym to support two-part naming for SQL Server tables, and then run the following query:
CREATE SYNONYM [Sales.SalesOrderDetail] FOR Sales.SalesOrderDetail
Run the following query and review its result.
select top 200 * from [Sales.SalesOrderDetail]
To import data using Sqoop
In the Hadoop command prompt, change the directory to "c:\Apps\dist\sqoop\bin" and run the following command. The --table option resolves to the synonym you created earlier, --target-dir specifies the HDFS output directory, and -m 1 runs the import with a single map task:
sqoop import --connect "jdbc:sqlserver://[serverName].database.windows.net;username=[userName]@[serverName];password=;database=AdventureWorks2012" --table Sales.SalesOrderDetail --target-dir /data/lineitemData -m 1
Go to the Hadoop on Windows Azure portal and open the Interactive Console. Run the #lsr command to list the files and directories on your HDFS.
Run the #tail command to view selected results from the part-m-00000 file.
#tail /user/RAdmin/data/SalesOrderDetail/part-m-00000
Summary
In this tutorial, you have seen how various data sources can be used for MapReduce jobs in Hadoop on Windows Azure. Data for Hadoop jobs can reside in cloud storage or in HDFS. You have also seen how relational data can be imported into HDFS by using Sqoop and then used in Hadoop jobs.