Specifying the infrastructure

patterns & practices Developer Center

From: Developing big data solutions on Microsoft Azure HDInsight

Considerations for planning an infrastructure for Hadoop-based big data analysis include deciding whether to use an on-premises or a cloud-based implementation, whether a managed service is a better choice than deploying the chosen framework yourself, and which type of data store and cluster configuration you require. This guide concentrates on HDInsight. However, many of the considerations for choosing an infrastructure option are typically relevant to other solutions and platforms.

The topics discussed here are:

  • Service type and location
  • Data storage
  • Cluster size and storage configuration
  • SLAs and business requirements

Service type and location

Hadoop-based big data solutions are available as both cloud-hosted managed services and as self-installed solutions. If you decide to use a self-installed solution, you can choose whether to deploy this locally in your own datacenter, or in a virtual machine in the cloud. For example, HDInsight is available as an Azure service, but you can install the Hortonworks Hadoop distribution called the Hortonworks Data Platform (HDP) as an on-premises or a cloud-hosted solution on Windows Server. Figure 1 shows the key options for Hadoop-based big data solutions on the Microsoft platform.

Figure 1 - Location options for a Hadoop-based big data solution deployment

Note

You can install big data frameworks such as Hadoop on other operating systems, such as Linux, or you can subscribe to Hadoop-based services that are hosted on these types of operating systems.

Consider the following guidelines when choosing between these options:

  • A managed service such as HDInsight running on Azure is a good choice over a self-installed framework when:
    • You want a solution that is easy to initialize and configure, and where you do not need to install any software yourself.
    • You want to get started quickly by avoiding the time it takes to set up the servers and deploy the framework components to each one.
    • You want to be able to quickly and easily decommission a cluster and then initialize it again without paying for the intermediate time when you don’t need to use it.
  • A cloud-hosted mechanism such as Hortonworks Data Platform running on a virtual machine in Azure is a good choice when:
    • The majority of the data is stored in or obtained through the cloud.
    • You require the solution to be running for only a specific period of time.
    • You require the solution to be available for ongoing analysis, but the workload will vary—sometimes requiring a cluster with many nodes and sometimes not requiring any service at all.
    • You want to avoid the cost in terms of capital expenditure, skills development, and the time it takes to provision, configure, and manage an on-premises solution.
  • An on-premises solution such as Hortonworks Data Platform on Windows Server is a good choice when:
    • The majority of the data is stored or generated within your on-premises network.
    • You require ongoing services with a predictable and constant level of scalability.
    • You have the necessary technical capability and budget to provision, configure, and manage your own cluster.
    • The data you plan to analyze must remain on your own servers for compliance or confidentiality reasons.
  • A pre-configured hardware appliance that supports big data connectivity, such as Microsoft Analytics Platform System with PolyBase, is a good choice when:
    • You want a solution that provides predictable scalability, easy implementation, technical support, and that can be deployed on-premises without requiring deep knowledge of big data systems in order to set it up.
    • You want existing database administrators and developers to be able to seamlessly work with big data without needing to learn new languages and techniques.
    • You want to be able to grow into affordable data storage space, and provide opportunities for bursting by expanding into the cloud when required, while still maintaining corporate services on a traditional relational system.

Note

Choosing a big data platform that is hosted in the cloud allows you to change the number of servers in the cluster (effectively scaling out or scaling in your solution) without incurring the cost of new hardware or having existing hardware underused.

Data storage

When you create an HDInsight cluster, you have the option to create one of two types:

  • Hadoop cluster. This type of cluster combines an HDFS-compatible storage mechanism with the Hadoop core engine and a range of additional tools and utilities. It is designed for performing the usual Hadoop operations such as executing queries and transformations on data. This is the type of cluster that you will see in use throughout this guide.
  • HBase cluster. This type of cluster, which was in preview at the time this guide was written, contains a fully configured installation of the HBase database system. It is designed for use as either a standalone cloud-hosted NoSQL database or, more typically, for use in conjunction with a Hadoop cluster.

The primary data store used by HDInsight for both types of cluster is Azure blob storage, which provides scalable, durable, and highly available storage (for more information see Introduction to Microsoft Azure Storage). Using Azure blob storage means that both types of cluster can offer high scalability for storing vast amounts of data, and high performance for reading and writing data—including the capture of streaming data. For more details see Azure Storage Scalability and Performance Targets.

When to use an HBase cluster?

HBase is an open-source wide column (or column family) data store. It uses a schema to define the data structures, but the schema can contain families of columns rather than just single columns, making it ideal for storing semi-structured data where some columns can be predefined but others are capable of storing differing elements of unstructured data.

HBase is a good choice if your scenario demands:

  • Support for real-time querying.
  • Strictly consistent reads and writes.
  • Automatic and configurable sharding of tables.
  • High reliability with automatic failover.
  • Support for bulk loading data.
  • Support for SQL-like languages through interfaces such as Phoenix.

HBase provides close integration with Hadoop through base classes for connecting Hadoop map/reduce jobs with data in HBase tables; an easy to use Java API for client access; adapters for popular frameworks such as map/reduce, Hive, and Pig; access through a REST interface; and integration with the Hadoop metrics subsystem.

HBase can be accessed directly by client programs and utilities to upload and access data. It can also be accessed using storage drivers, or in discrete code, from within the queries and transformations you execute on a Hadoop cluster. There is also a Thrift API available that provides a lightweight, cross-language interface for HBase alongside the REST interface.

HBase is resource-intensive and will attempt to use as much memory as is available on the cluster. You should not use an HBase cluster for processing data and running queries, with the possible exception of minor tasks where low latency is not a requirement. Instead, HBase is typically installed on a separate cluster, and queried from the cluster containing your Hadoop-based big data solution.

Note

For more information about HBase see the official Apache HBase project website and HBase Bigtable-like structured storage for Hadoop HDFS on the Hadoop wiki site.

Why Azure blob storage?

HDInsight is designed to transfer data very quickly between blob storage and the cluster, for both Hadoop and HBase clusters. Azure datacenters provide extremely fast, high bandwidth connectivity between storage and the virtual machines that make up an HDInsight cluster.

Using Azure blob storage provides several advantages:

  • Running costs are minimized because you can decommission a Hadoop cluster when not performing queries—data in Azure blob storage is persisted when the cluster is deleted and you can build a new cluster on the existing source data in blob storage. You do not have to upload the data again over the Internet when you recreate a cluster that uses the same data. However, although it is possible, deleting and recreating an HBase cluster is not typically a recommended strategy.
  • Data storage costs can be minimized because Azure blob storage is considerably cheaper than many other types of data store (1 TB of locally-redundant storage currently costs around $25 per month). Blob storage can be used to store large volumes of data (up to 500 TB at the time this guide was written) without being concerned about scaling out storage in a cluster, or changing the scaling in response to changes in storage requirements.
  • Data in Azure blob storage is replicated across three locations in the datacenter, so it provides a similar level of redundancy to protect against data loss as an HDFS cluster. Storage can be locally-redundant (replicas are in the same datacenter), globally-redundant (replicated locally and in a different region), or read-only globally-redundant. See Introduction to Microsoft Azure Storage for more details.
  • Data stored in Azure blob storage can be accessed by and shared with other applications and services, whereas data stored in HDFS can only be accessed by HDFS-aware applications that have access to the cluster storage. Azure storage offers import/export features that are useful for quickly and easily transferring data in and out of Azure blob storage.
  • The high speed flat network in the datacenter provides fast access between the virtual machines in the cluster and blob storage, so data movement is very efficient. Tests carried out by the Azure team indicate that blob storage provides near identical performance to HDFS when reading data, and equal or better write performance.
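
As a rough illustration of the storage cost point above, the following Python sketch estimates the monthly cost of keeping a given volume of data in blob storage. The price of $25 per TB per month is the locally-redundant figure quoted in this guide at the time of writing; the function and constant names are hypothetical, and current Azure pricing should be checked before relying on the result.

```python
# Price quoted in this guide at the time of writing; treat it as an assumption.
PRICE_PER_TB_MONTH_USD = 25.0  # locally-redundant blob storage


def monthly_storage_cost(data_tb, price_per_tb=PRICE_PER_TB_MONTH_USD):
    """Return the estimated monthly cost in USD of storing data_tb terabytes."""
    if data_tb < 0:
        raise ValueError("data_tb must be non-negative")
    return data_tb * price_per_tb


if __name__ == "__main__":
    # Print the estimate for a few illustrative data volumes.
    for volume_tb in (1, 40, 250):
        print(volume_tb, "TB:", monthly_storage_cost(volume_tb), "USD per month")
```

Because the data persists when a Hadoop cluster is deleted, this storage charge is the only ongoing cost while no cluster is running.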

Note

Azure blob storage may throttle data transfers if the workload reaches the bandwidth limits of the storage service or exceeds the scalability targets. One solution is to use additional storage accounts. For more information, see the blog post Maximizing HDInsight throughput to Azure Blob Storage on MSDN. For more information about the use of blob storage instead of HDFS for data storage see Use Azure Blob storage with HDInsight.
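
When data is spread across additional storage accounts, jobs on the cluster typically address each location by a fully qualified URI. The following Python sketch builds such URIs, assuming the wasb scheme (wasbs for encrypted transport) used by the HDFS-compatible driver in HDInsight. The helper name and the account and container names are hypothetical.

```python
def wasb_uri(container, account, path="", secure=False):
    """Build a URI for data in Azure blob storage as seen from the cluster.

    Assumes the form wasb[s]://<container>@<account>.blob.core.windows.net/<path>.
    """
    if not container or not account:
        raise ValueError("container and account are required")
    scheme = "wasbs" if secure else "wasb"
    return f"{scheme}://{container}@{account}.blob.core.windows.net/{path.lstrip('/')}"


if __name__ == "__main__":
    # Hypothetical accounts used to spread load and avoid throttling.
    for account in ("logstore1", "logstore2"):
        print(wasb_uri("weblogs", account, "/2014/06/"))
```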

Combining Hadoop and HBase clusters

For most of your solutions you will use an HDInsight Hadoop-based cluster. However, there are circumstances where you might combine both a Hadoop and an HBase cluster in the same solution, or use an HBase cluster on its own. Some of the common configurations are:

  • Use just a Hadoop cluster. Source data can be loaded directly into Azure blob storage or stored using the HDFS-compatible storage drivers in Hadoop. Data processing, such as queries and transformations, executes on this cluster and accesses the data in Azure blob storage using the HDFS-compatible storage drivers.
  • Use a combination of a Hadoop and an HBase cluster (or more than one HBase cluster if required). Data is stored using HBase, and optionally through the HDFS driver in the Hadoop cluster as well. Source data, especially high volumes of streaming data such as that from sensors or devices, can be loaded directly into HBase. Data processing takes place on the Hadoop cluster, but the processes can access the data stored in the HBase cluster and store results there.
  • Use just an HBase cluster. This is typically the choice if you require only a high capacity, high performance storage and retrieval mechanism that will be accessed directly from client applications, and you do not require Hadoop-based processing to take place.

Figure 2 shows these options in schematic form.

Figure 2 - Data storage options for an Azure HDInsight solution

In Figure 2, the numbered data flows are as follows:

  1. Client applications can perform big data processing on the Hadoop cluster. This includes typical operations such as executing Pig, Hive, and map/reduce processing and storing the data in blob storage.
  2. Client applications can access the Azure blob storage where the Hadoop cluster stores its data to upload source data and to download the results. This is possible both when the Hadoop cluster is running and when it is not running or has been deleted. The Hadoop cluster can be recreated over the existing blob storage.
  3. The processes running on the Hadoop cluster can access the HBase cluster using the storage handlers in Hive and Pig queries, the Java API, interfaces such as Thrift and Phoenix, or the REST interface to store and retrieve data in HBase.
  4. Clients can access the HBase cluster directly using the REST interface and APIs to upload data, perform queries, and download the results. This is possible both when the Hadoop cluster is running and when it is not running or has been deleted. However, the HBase cluster must be running.
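
Direct client access through the REST interface can be sketched as follows. This is a minimal illustration, assuming the HBase REST gateway's convention of addressing a row as /table/row/column and returning JSON in which row keys, column names, and values are base64-encoded. The base URL, table name, and function names are hypothetical, and the sketch only builds the request URL and decodes a response body rather than contacting a real cluster.

```python
import base64
import json
from urllib.parse import quote


def rest_row_url(base_url, table, row_key, column=None):
    """Build the URL for reading one row (optionally one column) over REST."""
    url = f"{base_url.rstrip('/')}/{quote(table, safe='')}/{quote(row_key, safe='')}"
    if column:
        # Columns are written as family:qualifier, so keep the colon unescaped.
        url += "/" + quote(column, safe=":")
    return url


def decode_rows(payload):
    """Decode the base64 fields of a JSON response into {row_key: {column: value}}."""
    document = json.loads(payload)
    rows = {}
    for row in document.get("Row", []):
        key = base64.b64decode(row["key"]).decode("utf-8")
        cells = {}
        for cell in row.get("Cell", []):
            column = base64.b64decode(cell["column"]).decode("utf-8")
            cells[column] = base64.b64decode(cell["$"]).decode("utf-8")
        rows[key] = cells
    return rows


if __name__ == "__main__":
    # Hypothetical endpoint; a real request would also send "Accept: application/json".
    print(rest_row_url("https://example.invalid/rest", "readings", "sensor-1", "d:temp"))
```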

Cluster size and storage configuration

Having specified the type of service you will use and the location, the final tasks in terms of planning your solution center on the configuration you require for the cluster. Some of the decisions you must make are described in the following sections.

Cluster size

The amount you are billed for the cluster when using HDInsight (or other cloud-hosted Hadoop services) is based on the cluster size. For an on-premises solution, the cluster size has an impact on internal costs for infrastructure and maintenance. Choosing an appropriate size helps to minimize runtime costs. When choosing a cluster size, consider the following points:

  • Hadoop automatically partitions the data and allocates the jobs to the data nodes in the cluster. Some queries may not take advantage of all the nodes in the cluster. This may be the case with smaller volumes of data, or where the data format prevents partitioning (as is the case for some types of compressed data).
  • Operations such as Hive queries that must sort the results may limit the number of nodes that Hadoop uses for the reduce phase, meaning that adding more nodes will not reduce query execution time.
  • If the volume of data you will process is increasing, ensure that the cluster size you choose can cope with this. Alternatively, plan to increase the cluster size at specific intervals to manage the growth. Typically you will need to delete and recreate the cluster to change the number of nodes, but you can do this for a Hadoop cluster without the need to upload the data again because it is held in Azure blob storage.
  • Use the performance data exposed by the cluster to determine if increasing the size is likely to improve query execution speed. Use historical data on performance for similar types of jobs to estimate the required cluster size for new jobs. For more information about monitoring jobs, see Building end-to-end solutions using HDInsight.
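
The effect of data volume and partitioning on cluster size can be approximated with a short calculation. The Python sketch below is a simplified estimate, not Hadoop's actual split algorithm: it assumes one map task per input split, a split size of 128 MB, and a fixed number of concurrent map tasks per node. All of these values and the function names are assumptions chosen for illustration.

```python
import math


def estimated_map_tasks(file_sizes_mb, split_size_mb=128, splittable=True):
    """Estimate the number of input splits, and so map tasks, for a set of files.

    A file in a non-splittable format (for example, some compressed formats)
    is processed by a single map task regardless of its size.
    """
    if split_size_mb <= 0:
        raise ValueError("split_size_mb must be positive")
    if not splittable:
        return sum(1 for size in file_sizes_mb if size > 0)
    return sum(math.ceil(size / split_size_mb) for size in file_sizes_mb if size > 0)


def useful_nodes(map_tasks, tasks_per_node):
    """Return the node count beyond which extra nodes would sit idle in the map phase."""
    if tasks_per_node <= 0:
        raise ValueError("tasks_per_node must be positive")
    return math.ceil(map_tasks / tasks_per_node)


if __name__ == "__main__":
    files_mb = [1024, 100]
    print(useful_nodes(estimated_map_tasks(files_mb), tasks_per_node=4))
    print(useful_nodes(estimated_map_tasks(files_mb, splittable=False), tasks_per_node=4))
```

Under these assumptions, the same two files can keep three nodes busy when they are splittable but only one node when they are not, which is why adding nodes does not always reduce query execution time.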

Storage requirements

By default HDInsight creates a new storage container in Azure blob storage when you create a new cluster. However, it’s possible to use a combination of different storage accounts with an HDInsight cluster. You might want to use more than one storage account in the following circumstances:

  • When the amount of data is likely to exceed the storage capacity of a single blob storage container.
  • When the rate of access to the blob container might exceed the threshold where throttling will occur.
  • When you want to make data you have already uploaded to a blob container available to the cluster.
  • When you want to isolate different parts of the storage for reasons of security, or to simplify administration.

For details of the different approaches for using storage accounts with HDInsight see Cluster and storage initialization in the section Collecting and loading data into HDInsight of this guide. For details of storage capacity and bandwidth limits see Azure Storage Scalability and Performance Targets.
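
The first two circumstances in the list above lend themselves to a simple estimate of how many storage accounts a solution needs. The following Python sketch takes the larger of the counts implied by capacity and by bandwidth. The function name is hypothetical, and the per-account limits are parameters that should be taken from the current Azure Storage scalability targets rather than from this example.

```python
import math


def storage_accounts_needed(data_tb, peak_gbps, capacity_tb_per_account, gbps_per_account):
    """Estimate the number of storage accounts needed to stay within both limits."""
    if capacity_tb_per_account <= 0 or gbps_per_account <= 0:
        raise ValueError("per-account limits must be positive")
    by_capacity = math.ceil(data_tb / capacity_tb_per_account)
    by_bandwidth = math.ceil(peak_gbps / gbps_per_account)
    # At least one account is always required, even for an empty data set.
    return max(1, by_capacity, by_bandwidth)


if __name__ == "__main__":
    # Illustrative limits only: 500 TB capacity and a hypothetical 10 Gbps per account.
    print(storage_accounts_needed(1200, 5, capacity_tb_per_account=500, gbps_per_account=10))
```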

Maintaining cluster data

There may be cases where you want to be able to decommission and delete a cluster, and then recreate it later with exactly the same configuration and data. HDInsight stores the cluster data in blob storage, and you can create a new cluster over existing blob containers so the data they contain is available to the cluster. Typically you will delete and recreate a cluster in the following circumstances:

  • You want to minimize runtime costs by deploying a cluster only when it is required. This may be because the jobs it executes run only at specific scheduled times, or run on demand when specific requirements arise.
  • You want to change the size of the cluster, but retain the data and metadata so that it is available in the new cluster.

This applies only with a Hadoop-based HDInsight cluster. See Using Hadoop and HBase clusters for information about using an HBase cluster. For details of how you can maintain the data when recreating a cluster in HDInsight see Cluster and storage initialization in the section Collecting and loading data into HDInsight of this guide.
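
The saving from deploying a cluster only when it is required can be estimated with a short calculation. The Python sketch below compares a cluster that runs continuously with one that is created for each scheduled run and deleted afterwards. The node count, hourly rate, and provisioning time are hypothetical values, as are the function names; substitute current HDInsight pricing and your own measurements.

```python
def monthly_compute_cost(nodes, rate_per_node_hour, hours_per_day, days=30):
    """Return the compute cost of running the cluster for hours_per_day each day."""
    return nodes * rate_per_node_hour * hours_per_day * days


def on_demand_saving(nodes, rate_per_node_hour, job_hours_per_day,
                     provisioning_hours_per_day=0.5, days=30):
    """Return the monthly saving from running on demand instead of continuously.

    Provisioning time is billed too, so it is added to the job time.
    """
    always_on = monthly_compute_cost(nodes, rate_per_node_hour, 24, days)
    on_demand = monthly_compute_cost(
        nodes, rate_per_node_hour, job_hours_per_day + provisioning_hours_per_day, days)
    return always_on - on_demand


if __name__ == "__main__":
    # Hypothetical: 8 nodes at $0.50 per node-hour, 1.5 hours of jobs per day.
    print(on_demand_saving(8, 0.50, 1.5))
```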

SLAs and business requirements

Your big data process may be an exercise of experimentation and investigation, or it may be a part of a business process. In the latter case, especially where you offer the solution to your customers as a service, you must ensure that you meet Service Level Agreements (SLAs) and business requirements. Keep in mind the following considerations:

  • Investigate the SLAs offered by your big data solution provider because these will ultimately limit the level of availability and reliability you can offer.
  • Consider if a cluster should be used for a single process, for one or a subset of customers, or for a specific limited workload in order to maintain performance and availability. Sharing a cluster across multiple different workloads can make it more difficult to predict and control demand, and may affect your ability to meet SLAs.
  • Consider how you will manage backing up the data and the cluster information to protect against loss in the event of a failure.
  • Choose an operating location, cluster size, and other aspects of the cluster so that sufficient infrastructure and network resources are available to meet requirements.
  • Implement robust management and monitoring strategies to ensure you maintain the required SLAs and meet business requirements.
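
To see how a provider's SLA limits what you can offer, consider a solution that depends on several services in series. Assuming the services fail independently, the best availability you can achieve is approximately the product of their individual availabilities. The Python sketch below shows the arithmetic; the figures in the example are hypothetical and are not taken from any published Azure SLA.

```python
def composite_availability(*availabilities):
    """Return the combined availability of services that must all be up."""
    result = 1.0
    for availability in availabilities:
        if not 0.0 <= availability <= 1.0:
            raise ValueError("availability must be between 0 and 1")
        result *= availability
    return result


def monthly_downtime_minutes(availability, days=30):
    """Return the downtime per month, in minutes, implied by an availability figure."""
    return (1.0 - availability) * days * 24 * 60


if __name__ == "__main__":
    # Hypothetical figures for a compute service and a storage service.
    combined = composite_availability(0.999, 0.999)
    print(combined, monthly_downtime_minutes(combined))
```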
