Configuration of HBase on Azure HDInsight as a Drill Data Source
NOTE: This post is part of a series on a deployment of Apache Drill on the Azure cloud.
In a previous post, I showed how to connect my Azure-deployed Drill cluster to an Azure HDInsight (Hadoop) cluster via Hive. Azure HDInsight also supports HBase. In this post, I want to tackle how to get the Drill cluster to use Azure HDInsight HBase as its data source.
The Drill documentation on this topic is pretty straightforward. The two big wrinkles I need to work through are that (1) the ZooKeeper nodes, and the Thrift interface they direct Drill to, are not accessible outside the virtual network that houses my Hadoop cluster, and (2) Azure HDInsight (like the Hortonworks Data Platform it's based on) does not use the default znode for HBase. With that in mind, I'll connect my Drill cluster to HBase through the following steps:
- Deploy the Azure HDInsight cluster to a virtual network accessible by Drill
- Configure HBase access within Drill
- Confirm HBase access within Drill
Deploy Azure HDInsight Cluster to an Accessible Virtual Network
My first action is to deploy my Azure HDInsight cluster to a virtual network that's accessible by Drill. This is important because the ZooKeeper nodes, and the Thrift interface they direct Drill to, are not accessible from outside the ring-fence created by the virtual network that surrounds Azure HDInsight.
If I'm deploying a new cluster, I might deploy it into the same virtual network used by Drill. (In this scenario, I might deploy the cluster to a separate subnet within that Virtual Network just for manageability purposes.) If I have an existing Azure HDInsight cluster in a virtual network and have deployed Drill to a separate virtual network, I might configure the two virtual networks to leverage a vnet-to-vnet VPN tunnel per these instructions.
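Before touching the Drill configuration, it can be worth confirming from one of the Drill nodes that the network path really is open. The following is a minimal sketch, not part of the original setup: it simply attempts a TCP connection to each ZooKeeper host on the client port (2181, the value used in the plugin configuration later in this post). The function names `can_reach` and `check_quorum` are my own, chosen for illustration.

```python
import socket


def can_reach(host: str, port: int = 2181, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


def check_quorum(quorum: str, port: int = 2181) -> dict:
    """Map each host in a comma-separated quorum string to its reachability."""
    hosts = [h.strip() for h in quorum.split(",") if h.strip()]
    return {host: can_reach(host, port) for host in hosts}
```

Calling `check_quorum` with the comma-separated list of ZooKeeper hosts returns a dictionary of host names to True or False. A False for every host usually points at the virtual network setup rather than at Drill or HBase.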
Configure HBase Access within Drill
Before I can configure the HBase storage plugin, I need to identify the ZooKeeper nodes that HBase uses, which are not the ZooKeeper nodes Drill itself is using. To do this, I will use the Ambari interface for my Azure HDInsight cluster:
- Open a browser to https://myclustername.azurehdinsight.net with appropriate substitution for myclustername
- At the prompt, log in using the HTTP user name and password established when you configured the HDInsight cluster
- Within the Ambari interface, go to Services | HBase | Configs | Advanced
- Under the Advanced hbase-site subheading, copy the value for hbase.zookeeper.quorum and note the value for zookeeper.znode.parent, which should be /hbase-unsecure.
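The same two values can also be read through the Ambari REST API instead of the web interface. What follows is a rough sketch under a few assumptions: that the cluster exposes the standard Ambari endpoints at the same https://myclustername.azurehdinsight.net address, that the Ambari cluster name matches the HDInsight cluster name, and that the responses have the shape shown in the comments. The helper names are hypothetical and the credentials are the same HTTP user name and password used above.

```python
import base64
import json
import urllib.request


def desired_configs_url(cluster: str) -> str:
    """URL listing the current configuration tags (assumed Ambari endpoint)."""
    base = f"https://{cluster}.azurehdinsight.net/api/v1/clusters/{cluster}"
    return f"{base}?fields=Clusters/desired_configs"


def hbase_site_url(cluster: str, tag: str) -> str:
    """URL for the hbase-site properties stored under the given tag."""
    base = f"https://{cluster}.azurehdinsight.net/api/v1/clusters/{cluster}"
    return f"{base}/configurations?type=hbase-site&tag={tag}"


def current_tag(desired: dict) -> str:
    # Assumed shape: {"Clusters": {"desired_configs": {"hbase-site": {"tag": ...}}}}
    return desired["Clusters"]["desired_configs"]["hbase-site"]["tag"]


def zookeeper_settings(configurations: dict) -> dict:
    # Assumed shape: {"items": [{"type": "hbase-site", "properties": {...}}]}
    props = configurations["items"][0]["properties"]
    wanted = ("hbase.zookeeper.quorum", "zookeeper.znode.parent")
    return {key: props[key] for key in wanted}


def fetch_json(url: str, user: str, password: str) -> dict:
    """GET a URL with HTTP basic authentication and decode the JSON body."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def lookup(cluster: str, user: str, password: str) -> dict:
    """Fetch the current hbase-site tag, then the two ZooKeeper settings."""
    tag = current_tag(fetch_json(desired_configs_url(cluster), user, password))
    return zookeeper_settings(fetch_json(hbase_site_url(cluster, tag), user, password))
```

Calling `lookup("myclustername", user, password)` should return the quorum string and the znode parent in one dictionary, ready to be copied into the plugin configuration. If the response shapes differ on your cluster, the two parsing helpers are the only places that need adjusting.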
Now I can log in to my Drill Web Console, select the Storage page, and click Update on my hbase plugin.
In the resulting window, I can paste the value of my hbase.zookeeper.quorum property into the appropriate position. I also need to add a value for zookeeper.znode.parent under the config property. Both changes appear in this example configuration:
{
  "type": "hbase",
  "config": {
    "hbase.zookeeper.quorum": "zk1-drillh.h3yoplooa.dx.internal.cloudapp.net,zk3-drillh.h3yoplooa.dx.internal.cloudapp.net,zk2-drillh.h3yoplooa.dx.internal.cloudapp.net",
    "hbase.zookeeper.property.clientPort": "2181",
    "zookeeper.znode.parent": "/hbase-unsecure"
  },
  "size.calculator.enabled": false,
  "enabled": true
}
I click Update and then Enable to save this configuration.
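If you would rather generate this configuration than edit it by hand (for example, to avoid a stray space in the quorum list), a small script can assemble it. This is only a sketch: `build_hbase_plugin` is a hypothetical helper of my own, and the host names in the usage example are placeholders, not real nodes. The defaults mirror the values in the configuration shown in this post.

```python
import json


def build_hbase_plugin(quorum: str,
                       znode_parent: str = "/hbase-unsecure",
                       client_port: str = "2181") -> dict:
    """Assemble the hbase storage plugin configuration as a dictionary."""
    # Normalize the quorum: drop surrounding whitespace and empty entries.
    hosts = [h.strip() for h in quorum.split(",") if h.strip()]
    if not hosts:
        raise ValueError("quorum must name at least one ZooKeeper host")
    if not znode_parent.startswith("/"):
        raise ValueError("znode parent must be an absolute path such as /hbase-unsecure")
    return {
        "type": "hbase",
        "config": {
            "hbase.zookeeper.quorum": ",".join(hosts),
            "hbase.zookeeper.property.clientPort": client_port,
            "zookeeper.znode.parent": znode_parent,
        },
        "size.calculator.enabled": False,
        "enabled": True,
    }


if __name__ == "__main__":
    # Placeholder host names for illustration only.
    plugin = build_hbase_plugin("zk1.example.internal, zk2.example.internal")
    print(json.dumps(plugin, indent=2))
```

The printed JSON can be pasted into the plugin window on the Storage page in place of the hand-edited version.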
Confirm HBase Access within Drill
Unlike the Hive storage plugin setup, I found that I needed to restart my Drill cluster before being able to use the HBase plugin. Your mileage may vary, but it may be a good idea to stop and restart the nodes in your Drill cluster before proceeding.
With the Drill cluster restarted and the plugin enabled, I should now be able to query HBase. If I go to the Drill Web Console and click on the Query page, I can query the built-in ambarismoketest table as follows:
SELECT * FROM hbase.`ambarismoketest`;
This should return a single row.
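The same check can be scripted against Drill's REST interface, which is handy after a cluster restart. The sketch below assumes the Web Console is reachable over plain HTTP without authentication at a URL such as http://localhost:8047 (adjust for your deployment), and the helper names are my own. Because Drill returns HBase values as raw bytes, the query uses CONVERT_FROM to make the row key readable.

```python
import json
import urllib.request

# Decode the binary row key so the result is human readable.
SMOKE_TEST_SQL = (
    "SELECT CONVERT_FROM(row_key, 'UTF8') AS rk "
    "FROM hbase.`ambarismoketest`"
)


def build_query_request(drill_url: str, sql: str) -> urllib.request.Request:
    """Build a POST request for Drill's /query.json endpoint."""
    payload = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        drill_url.rstrip("/") + "/query.json",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(drill_url: str, sql: str) -> list:
    """Send the query and return the list of result rows."""
    with urllib.request.urlopen(build_query_request(drill_url, sql)) as response:
        return json.load(response).get("rows", [])
```

Running `run_query("http://localhost:8047", SMOKE_TEST_SQL)` from a machine that can reach a Drill node should hand back the same row the Query page shows.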