HDInsight Services For Windows
This article is the main portal for technical information about HDInsight Services for Windows and related Microsoft technologies. It provides a brief overview of Apache Hadoop, as well as information for the HDInsight Services provided by Microsoft for deployment on both Windows and Windows Azure.
It also provides links to more detailed technical content in various formats.
Note: Contributions are welcome and appreciated. Please feel free to update this and other articles on this Wiki, and to add links to relevant content both from within and outside Microsoft.
Table of Contents
Topics
Content Types
Orientation
Tutorials
Getting Started with HDInsight Services on Windows
Tutorials
Getting Started with HDInsight Services for Windows Azure
Tutorials
Samples on the HDInsight Services for Windows Azure Dashboard
Samples
Tutorials
Using HDInsight Services with other BI Technologies
HowTos
HowTos
Samples
Videos
Audio
Books
Hadoop on Windows and on Windows Azure Best Practices
Guidance
Hadoop Overview
Apache Hadoop is an open source software framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. It consists of two primary components: the Hadoop Distributed File System (HDFS), a reliable, distributed data store, and MapReduce, a parallel and distributed processing system. A Hadoop cluster can be made up of a single node or thousands of nodes.
HDFS is the primary distributed storage used by Hadoop applications. As you load data into a Hadoop cluster, HDFS splits the data into blocks, creates multiple replicas of each block, and distributes the replicas across the nodes of the cluster to enable reliable and extremely rapid computations.
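The block splitting and replication described above can be illustrated with a small in-memory sketch. This is not how HDFS is implemented: the function names, the tiny block size, the node names, and the round-robin placement are all assumptions chosen for illustration, and real HDFS uses its own placement policy.

```python
def split_into_blocks(data, block_size):
    """Cut a byte payload into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, nodes, replication):
    """Assign every block to `replication` distinct nodes.

    Hypothetical round-robin placement, used here only to show that
    each block ends up on several different nodes.
    """
    if replication > len(nodes):
        raise ValueError("replication factor exceeds the number of nodes")
    return {
        block_id: [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
        for block_id in range(num_blocks)
    }


# Illustrative values only: a 1000-byte "file", 256-byte blocks, 4 nodes, 3 replicas.
data = b"x" * 1000
blocks = split_into_blocks(data, block_size=256)
placement = place_replicas(len(blocks), ["node1", "node2", "node3", "node4"], replication=3)

for block_id, holders in placement.items():
    print(block_id, len(blocks[block_id]), holders)
```

Because every block lives on several nodes, losing a single node in this sketch never removes the only copy of a block, which is the property that lets the cluster tolerate failures.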
Hadoop MapReduce is a software framework for writing applications that rapidly process vast amounts of data in parallel on a large cluster of compute nodes. A MapReduce job usually splits the input data set into independent chunks. These chunks are processed by the map tasks running across the nodes of the Hadoop cluster in a completely parallel manner. The framework sorts the outputs of the maps, which then become the input to the reduce tasks. Typically, both the input and the output of the job are stored in a file system. The framework takes care of scheduling tasks, monitoring them, and re-executing failed tasks.
Some of the main advantages of Hadoop are that it can process vast amounts of data (hundreds of terabytes or even petabytes) quickly and efficiently, process both structured and unstructured data, perform the processing where the data is located rather than moving the data to some processing location, and detect and handle failures by design.
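The map, sort, and reduce flow described above can be sketched as a single-process word count. This is a minimal illustration and not the Hadoop API: the function names are hypothetical, and the chunks are processed one after another here, whereas a real cluster runs the map tasks in parallel.

```python
from collections import defaultdict


def map_words(chunk):
    """Map task: emit a (word, 1) pair for every word in one input chunk."""
    for word in chunk.lower().split():
        yield word, 1


def shuffle_and_sort(pairs):
    """Framework step: group the map outputs by key and sort the groups by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_counts(key, values):
    """Reduce task: collapse all values for one key into a single total."""
    return key, sum(values)


def run_job(chunks):
    """Run map over every chunk, then shuffle/sort, then reduce each group."""
    mapped = [pair for chunk in chunks for pair in map_words(chunk)]
    return dict(reduce_counts(key, values) for key, values in shuffle_and_sort(mapped))


# Each string stands in for one independent chunk of the input data set.
counts = run_job(["the quick brown fox", "the lazy dog", "The end"])
print(counts)
```

Each map call only sees its own chunk and each reduce call only sees one key, which is what allows the framework to distribute both phases across nodes.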
There are two other key Apache technologies that are frequently used with Hadoop: Hive and Pig. Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop compatible file systems such as HDFS. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.
Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.
For more details on Apache Hadoop, see http://hadoop.apache.org/.
Learning Apache Hadoop
This section contains links to resources useful in learning Hadoop, such as installation, configuration, and basic how-to information.
Link
Description
The Apache Hadoop home page
Introduction to Apache MapReduce and HDFS [Video]
An introduction to Apache MapReduce and HDFS
A data warehouse system for Hadoop
Introduction to Apache Hive [Video]
An introduction to Apache Hive
A platform for analyzing large data sets
Introduction to Pig [Video]
An introduction to Apache Pig
Learning resources for Apache Mahout
An introduction to Apache Mahout
A scalable machine learning library
How to Contribute to Hadoop Common
Getting Started with HDInsight Services for Windows
The links in this section provide information on deploying and using the Developer Preview of HDInsight Services on Windows.
Link | Description |
Installing the Developer Preview of HDInsight Services on Windows | How to install the Developer Preview of Hadoop on Windows with the Microsoft Web Platform Installer 4.0. |
Getting Started with HDInsight Services for Windows | Tour through the Microsoft HDInsight dashboard and resources for getting started with the developer preview. |
Getting Started with HDInsight Services for Windows Azure
The links in this section provide information on deploying and using Apache Hadoop on the Microsoft Windows Azure Platform. Instead of setting up and managing a Hadoop cluster on Azure by yourself, you can use the HDInsight Services for Windows Azure dashboard that Microsoft has made available at hadooponazure.com. This is a preview of the HDInsight Services for Windows Azure to which you can submit MapReduce jobs to be processed along with the data used in the processing. It enables you to process vast amounts of structured as well as non-structured data easily without worrying about setting up the Hadoop cluster, configuring, maintaining, and managing it manually.
Link
Description
Deployment of Hadoop-based Services on the Windows Azure Portal
A walkthrough for provisioning and using a temporary HDFS cluster on the Hadoop on Windows Azure Portal.
Introduction to HDInsight Services for Windows Azure
A service that deploys and provisions clusters in the cloud, providing a software framework designed to manage, analyze and report on big data.
HDInsight Services for Windows Azure QuickStart: Running Hadoop Jobs
This tutorial shows how to run MapReduce programs in a cluster by using Apache™ Hadoop™-based Services for Windows Azure in two ways.
Working With Data in HDInsight Services for Windows Azure
Outlines several techniques for importing and storing data for use in Hadoop jobs run with Hadoop-based Services for Windows Azure.
Analyzing Twitter Movie Data with Hive in HDInsight Services for Windows Azure
In this tutorial you will query, explore, and analyze data from Twitter using Apache™ Hadoop™-based Services for Windows Azure and a Hive query in Excel. Social web sites are one of the major driving forces for Big Data adoption.
Simple recommendation engine using Apache Mahout
In this tutorial you use the Million Song Dataset to create song recommendations for users based on their past listening habits.
An end-to-end introduction to HDInsight, Map/Reduce, Pig, and Hive.
Samples on the HDInsight Services for Windows Azure Dashboard
This section contains links to the tutorials for the samples that are on the Hadoop on Windows Azure Portal.
Link
Description
The Hadoop on Azure Pi Estimator Sample Tutorial
This tutorial shows how to deploy a MapReduce program with Hadoop on Windows Azure that uses a statistical (quasi-Monte Carlo) method to estimate the value of Pi.
The Hadoop on Azure 10-GB Graysort Sample Tutorial
This tutorial shows how to run a general purpose GraySort on a 10 GB file using Hadoop on Windows Azure.
The Hadoop on Azure C# Streaming Sample Tutorial
This tutorial shows how to use C# programs with the Hadoop streaming interface.
The Hadoop on Azure Mahout Classification Sample
This tutorial illustrates how to use Apache Mahout in Hadoop on Windows Azure to do classification.
The Hadoop on Azure Mahout Clustering Sample
This tutorial illustrates how to use Hadoop on Windows Azure to do cluster analysis with Mahout.
The Hadoop on Azure Pegasus Degree Distribution Sample Tutorial
This tutorial shows how to deploy Pegasus from the Hadoop on Windows Azure portal to compute the degree of each node and the distribution of degrees for a simple 16-node graph.
The Hadoop on Azure Pegasus Page Rank Sample Tutorial
This tutorial shows how to deploy Pegasus from the Hadoop on Windows Azure portal to compute the page rank for a simple 16-node graph.
The Hadoop on Azure Sqoop Import Sample Tutorial
This tutorial shows how to use Sqoop to import data from a SQL database on Windows Azure to a Hadoop on Windows Azure HDFS cluster.
The Hadoop on Azure Wordcount Sample Tutorial
This tutorial shows two ways to use Hadoop on Windows Azure to run a MapReduce program that counts word occurrences in a text.
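The statistical method mentioned for the Pi Estimator sample can be sketched as follows. This is an illustration under stated assumptions, not the sample's source code: the use of a Halton sequence with bases 2 and 3 as the quasi-random point generator, the function names, and the way samples are divided among "map tasks" are all choices made for this sketch.

```python
def halton(index, base):
    """Return element `index` (1-based) of the Halton low-discrepancy sequence."""
    result, fraction = 0.0, 1.0 / base
    while index > 0:
        result += (index % base) * fraction
        index //= base
        fraction /= base
    return result


def count_inside(start, count):
    """One simulated map task: count points that fall inside the inscribed circle."""
    inside = 0
    for i in range(start, start + count):
        # Shift the unit-square point so the circle is centered at the origin.
        x = halton(i, 2) - 0.5
        y = halton(i, 3) - 0.5
        if x * x + y * y <= 0.25:
            inside += 1
    return inside


def estimate_pi(num_maps, samples_per_map):
    """Simulated reduce step: sum the per-task counts and scale the ratio by 4."""
    inside = sum(
        count_inside(1 + m * samples_per_map, samples_per_map)
        for m in range(num_maps)
    )
    return 4.0 * inside / (num_maps * samples_per_map)


print(estimate_pi(num_maps=4, samples_per_map=2500))
```

The ratio of points inside the circle to all points approximates the ratio of the circle's area to the square's area, which is Pi/4. Giving each task its own range of sequence indices keeps the tasks independent, so they could run on different nodes.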
Developing with Hadoop
This section contains information on developing solutions using Hadoop.
Link
Description
A tutorial on using Hadoop 0.18.0
A tutorial on using Map/Reduce
Hadoop Wiki page on the Streaming utility
Using HDInsight Services with other BI Technologies
This section contains information on using Hadoop with other BI technologies.
Link | Description |
How to Connect Excel to Hadoop on Azure via HiveODBC | Explains how to use Excel 2010 to access data in the Hive data warehouse running on Windows Azure by using the Hive ODBC Driver. |
How to Connect Excel PowerPivot to Hive on Azure via HiveODBC | Explains how to use PowerPivot to access data in the Hive data warehouse running on Windows Azure by using the Hive ODBC Driver. |
Leveraging a Hadoop cluster from SQL Server Integration Services (SSIS) | With the explosion of data, the open source Apache™ Hadoop™ Framework is gaining traction thanks to its huge ecosystem that has arisen around the core functionalities of Hadoop distributed file system (HDFS™) and Hadoop Map Reduce. As of today, being able to have SQL Server working with Hadoop™ becomes increasingly important because the two are indeed complementary. For instance, while petabytes of data can be stored unstructured in Hadoop and take hours to be queried, terabytes of data can be stored in a structured way in the SQL Server platform and queried in seconds. This leads to the need to transfer data between Hadoop and SQL Server. |
How To
This section contains a list of Hadoop-related how-to articles.
Link
Description
Hadoop-based Services on Windows Azure How-Tos and FAQs
A collection of common How To topics along with FAQs.
How to Contribute to Hadoop Common
How to count the number of lines in a file
An example of counting the number of lines in a file using Map Reduce
An example of getting distinct values/lines using Map Reduce
Information related to Hadoop-based services on Windows Azure.
How to Run a Job on a Provisioned Hadoop on Windows Azure Cluster
Information about creating Map Reduce jobs on a cluster that has been provisioned on the Hadoop on Windows Azure Portal
Use SQL Azure database as a Hive metastore
Information about using SQL Azure database as a Hive metastore
Code Examples
This section contains a list of Hadoop-related examples.
Link
Description
A tutorial on using Hadoop 0.18.0
A tutorial on using Map/Reduce
How to count the number of lines in a file
An example of counting the number of lines in a file using Map Reduce
An example of getting distinct values/lines using Map Reduce
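As a rough idea of what the line-count and distinct-values examples listed above involve, here is a minimal single-process sketch. It does not reproduce the linked articles: `run_map_reduce`, `count_lines`, and `distinct_lines` are hypothetical helpers that imitate the map, group, and reduce steps in memory.

```python
from collections import defaultdict


def run_map_reduce(records, mapper, reducer):
    """Apply mapper to each record, group values by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in sorted(groups.items())}


def count_lines(lines):
    """Every line maps to the same key, so a single reducer sums all the ones."""
    result = run_map_reduce(
        lines,
        mapper=lambda line: [("lines", 1)],
        reducer=lambda key, values: sum(values),
    )
    return result.get("lines", 0)


def distinct_lines(lines):
    """Each line is its own key, so grouping alone removes the duplicates."""
    result = run_map_reduce(
        lines,
        mapper=lambda line: [(line, None)],
        reducer=lambda key, values: None,
    )
    return list(result)


sample = ["b", "a", "b", "c", "a"]
print(count_lines(sample))
print(distinct_lines(sample))
```

The two jobs differ only in the choice of key: a constant key funnels everything into one total, while using the line itself as the key lets the grouping step do the de-duplication.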
Videos
This section contains a list of Hadoop-related videos.
Link
Description
Introduction to Interactive JavaScript Console
Learn how to use the JS console with your Hadoop cluster.
Introduction to Interactive Hive Console
Learn how to use the Hive console with your Hadoop cluster.
Use Excel Hive Add-in to Access Hive on Windows Azure
Use the Add-in to import data from Hive on Windows Azure.
Use PowerPivot to Access Hive on Windows Azure
Use Excel PowerPivot to access data from Hive on Windows Azure.
An introduction to Apache Hive
An introduction to Apache Pig
Uploading Data and the WordCount Sample
Upload data to Azure cluster and then run the WordCount sample
Run the Pi Estimator Sample
Import data from Marketplace into Hadoop Services for Windows Azure
10GB GraySort Sample - Generate Data
Introduction to the GraySort benchmark and generating test data
10GB GraySort Sample - Sort Data
Running the MapReduce job to sort your data
10GB GraySort Sample - Validate Data
After sorting the data, validate that the operation worked
PowerView, PowerPivot, Hadoop, and Hive
Use PowerView to connect to a Hive sample table in PowerPivot
Audio
This section contains a list of Hadoop-related audio recordings.
Link
Description
.NET Rocks (podcast) episode discussing Hadoop on Azure
.NET Rocks episode 755 (March 2012) with general discussion of Hadoop on Azure.
Books
This section contains a list of Hadoop-related books.
Link
Description
Hadoop: The Definitive Guide, 3rd Edition by Tom White (May 26, 2012)
A comprehensive guide to build and maintain reliable, scalable, distributed systems with Apache Hadoop.
Hadoop on Windows and on Windows Azure Best Practices
Microsoft is planning on providing guidance on best practices in the future. If you have best practices guidance that you'd like to share, please feel free to provide a link to it here.
Some suggestions: it would be great to list best practices around the following topics:
- How to get big data sets into Windows Azure.
- Understanding how the costs work, so that the process can be cost-optimized.
See Also
Another important place to find an extensive amount of Cortana Intelligence Suite related articles is the TechNet Wiki itself. The best entry point is Cortana Intelligence Suite Resources on the TechNet Wiki.