HDInsight Hive workload under the covers
The earlier "HDInsight under the covers" post gave an overview of cluster creation and setup. Apache Hive is the most popular component of Hadoop-type clusters. Hive defines a simple SQL-like query language that lets users familiar with SQL query data stored in HDFS. Under the covers, Hive translates the higher-level SQL query into a YARN application (MapReduce or Tez) for the actual execution.
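One way to see this translation is to ask Hive for its execution plan. The sketch below only composes a HiveQL script that selects an engine with the `hive.execution.engine` property and prefixes the query with `EXPLAIN`; it does not contact a cluster. The helper name `explain_script`, the table `hivesampletable`, and its `devicemake` column are assumptions made for illustration.

```python
# Sketch: compose a HiveQL script that asks Hive to print its execution plan.
# The helper name and the sample table/column are illustrative assumptions.

VALID_ENGINES = {"mr", "tez"}  # MapReduce or Tez, the two engines named above


def explain_script(query: str, engine: str = "tez") -> str:
    """Return a HiveQL script that picks an engine and EXPLAINs the query."""
    if engine not in VALID_ENGINES:
        raise ValueError(f"unsupported engine: {engine!r}")
    # Drop surrounding whitespace and a trailing semicolon so we add exactly one.
    statement = query.strip().rstrip(";")
    return (
        f"SET hive.execution.engine={engine};\n"
        f"EXPLAIN {statement};\n"
    )


# Assumed sample table; replace with a table that exists on your cluster.
SAMPLE_QUERY = (
    "SELECT devicemake, COUNT(*) FROM hivesampletable GROUP BY devicemake;"
)

if __name__ == "__main__":
    print(explain_script(SAMPLE_QUERY), end="")
```

The resulting script can be pasted into the Hive CLI or Beeline; the plan Hive prints shows the stages that would run as a YARN application.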
HDInsight ships in multiple versions, and the Azure portal lists the currently available ones. Each cluster version ships with specific component versions. HDInsight uses the HDP (Hortonworks Data Platform) distribution. The table below maps HDInsight versions to HDP versions.
https://hortonworks.com/hdp/whats-new/ lists the specific component versions for an HDP distribution.
The layered stack below shows the popular HDInsight tools and how they interact with the components underneath.
As the layered stack above shows, there are three high-level kinds of interaction:
- CLI: The Hive CLI and Beeline are the command-line tools used to experiment with and validate Hive queries.
- Batch: These interactions go through WebHCat. The HDInsight Jobs SDK is a .NET NuGet package built on top of WebHCat for programmatic query execution. It is the most popular choice among customers for production automation.
- Interactive: Workloads where latency is critical are served by HiveServer. The Microsoft Hive ODBC driver (or any HiveServer Thrift client) allows interaction with HiveServer.
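To make the batch path concrete, here is a minimal sketch of the kind of HTTP request a WebHCat client sends to submit a Hive query. It assumes the cluster's WebHCat endpoint is reachable at `https://<cluster>.azurehdinsight.net/templeton/v1/hive` with HTTP basic authentication, and it uses WebHCat's `execute`, `statusdir`, and `user.name` form parameters. The function name, cluster name, credentials, and status directory are placeholders, and the code only builds the request without sending it.

```python
import base64
import urllib.parse
import urllib.request


def build_hive_request(cluster, user, password, query, statusdir):
    """Build (but do not send) a WebHCat request that submits a Hive query.

    Assumes the HDInsight gateway exposes WebHCat under /templeton/v1 and
    accepts HTTP basic authentication.
    """
    url = f"https://{cluster}.azurehdinsight.net/templeton/v1/hive"
    # WebHCat takes the query and the output location as form fields.
    body = urllib.parse.urlencode({
        "user.name": user,
        "execute": query,
        "statusdir": statusdir,  # folder where stdout/stderr of the job land
    }).encode("utf-8")
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(url, data=body, method="POST")
    request.add_header("Authorization", f"Basic {token}")
    request.add_header("Content-Type", "application/x-www-form-urlencoded")
    return request


if __name__ == "__main__":
    # Placeholder values; nothing is sent over the network here.
    req = build_hive_request(
        "mycluster", "admin", "secret", "SHOW TABLES;", "/example/status"
    )
    print(req.get_method(), req.full_url)
```

Passing the request to `urllib.request.urlopen` would submit the job. WebHCat queues the query as a batch job rather than returning rows directly, which is why results are read from the status directory afterwards.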
In the next blog posts, I will cover the interactions of batch and interactive workloads in detail.
References
- Azure HDInsight Job Management Library
- Visual Studio Hadoop tools for HDInsight
- Microsoft Hive ODBC Driver
- Hive language manual
- WebHCat REST Reference
- HDP details