Jaa


Built-in data connectivity

patterns & practices Developer Center

From: Developing big data solutions on Microsoft Azure HDInsight

Excel includes native support for importing data from a wide range of data sources including web sites, OData feeds, relational databases, and more. The data connectivity options available on the Data tab of the ribbon enable business users to specify data source connection information, and select the data to be imported, using a wizard-style interface. The data is imported into a worksheet, and in Excel 2013 it can be added to the workbook data model and used for data visualization in a PivotTable report, PivotChart, Power View report, or Power Map tour (depending on the specific edition of Excel).

The built-in data connectivity capabilities include support for ODBC sources, and are an appropriate option for consuming data from HDInsight when that data is accessible through Hive tables. To use this option, the Hive ODBC driver must be installed on the client computer where Excel will be used.

Note

There is a 32-bit and a 64-bit version of the driver available for download from Microsoft Hive ODBC Driver. You should install the one that matches the version of Windows on the target computer. When you install the 64-bit version it also installs the 32-bit version so you will be able to use it to connect to Hive from both 64-bit and 32-bit applications.

You can simplify the process of connecting to HDInsight by using the Data Sources (ODBC) administrative tool to create a data source name (DSN) that encapsulates the ODBC connection information, as shown in Figure 1. Creating a DSN makes it easier for business users with limited experience of configuring data connections to import data from Hive tables that are defined in HDInsight. If you set up both 32-bit and 64-bit DSNs using the same name, client applications will automatically use the appropriate one.

Figure 1 - Creating an ODBC DSN for Hive

Figure 1 - Creating an ODBC DSN for Hive

After you create a DSN, users can specify it in the data connection wizard in Excel and then select a Hive table or view to be imported into the workbook, as shown in Figure 2.

Figure 2 - Importing a Hive table into Excel

Figure 2 - Importing a Hive table into Excel

Note

The Hive tables shown in Figure 2 are used as an example throughout this section of the guide. These tables contain UK meteorological observation data for 2012, obtained from the UK Met Office Weather Open Data dataset in Azure Marketplace.

Using the built-in data connectivity capabilities of Excel is a good solution when you need to perform simple analysis and reporting based on the results of an HDInsight query, and the data can be encapsulated in Hive tables. You can import the data on any client computer that has the ODBC driver for Hive is installed, and in all editions of Excel.

In Excel 2013 you can add the imported data to the workbook data model and combine it with data from other sources to create analytical mashups. However, in scenarios where complex data modeling is the primary goal, using an edition of Excel that supports PowerPivot offers greater flexibility and modeling functionality. It also makes it easier to create PivotTable Report, PivotChart, Power View, and Power Map visualizations. These can help you create more meaningful and immersive results.

The following table describes specific considerations for using Excel’s built-in data connectivity in the HDInsight use cases and models described in this guide.

Use case

Considerations

Iterative data exploration

Built-in data connectivity in Excel is a suitable choice when the results of the data processing can be encapsulated in a Hive table, or a query with simple joins can be encapsulated in a Hive view, and the volume of data is sufficiently small to support interactive connectivity with tolerable response times.

Data warehouse on demand

When HDInsight is used to create a basic data warehouse containing Hive tables, business users can use the built-in data connectivity in Excel to consume data from those tables for analysis and reporting. However, for complex data models that require multiple related tables and queries with complex joins, PowerPivot may be a better choice.

ETL automation

Most ETL scenarios are designed to transform big data into a suitable structure and volume for storage in a relational data source for further analysis and querying. While Excel may be used to consume the data from the relational data source after it has been transferred from HDInsight, it is unlikely that an Excel workbook would be the direct target for the ETL process.

BI integration

Importing data from a Hive table and combining it with data from a BI data source (such as a relational data warehouse or corporate data model) is an effective way to accomplish report-level integration with an enterprise BI solution. However, in self-service analysis scenarios, advanced users such as business analysts may require a more comprehensive data modeling solution such as that offered by PowerPivot, and can benefit from the ability to share queries, data models, and reports with Power BI for Office 365.

Guidelines for using the Hive ODBC Driver in Excel

When planning to use the native data import functionality in Excel, consider the following guidelines:

  • Install both 32-bit and 64-bit Hive ODBC Drivers and create 32-bit and 64-bit ODBC DSNs with the same name. This enables 32-bit and 64-bit clients to use the same connection string when connecting to Hive.
  • Importing data into a table in a worksheet makes it possible to filter the data, use data bars and conditional formatting, and create charts. Tables in worksheets are automatically included in the workbook data model. However, if you need to define relationships between multiple tables, or create custom columns and aggregations, it may be more efficient to import data directly into a PowerPivot data model.
  • Imported data can be refreshed from the original data source. When importing data from Hive tables you will be able to refresh the tables only while the HDInsight cluster is running.

Next Topic | Previous Topic | Home | Community