An Overview of an Apache Drill Topology in Azure
NOTE This post is part of a series on a deployment of Apache Drill on the Azure cloud.
Apache Drill supports both embedded, i.e. single-server, and distributed deployments. To meet my customer's needs, I need both the scale and reliability of a multi-server solution, so I have elected to deliver a distributed deployment.
While Drill recommends you deploy the software to an existing Hadoop cluster, neither Hortonworks nor Cloudera supports Drill, so I've elected to set up an independent ZooKeeper ensemble, i.e. cluster, consisting of three ZooKeeper nodes. Each node has been initially sized as a DS2 V2 Azure VM, which provides a solid-state drive backed OS disk, 7 GB of RAM, and two cores of a 2.4 GHz Intel Haswell processor. (More info on Azure VM sizes here.) In addition, I've decided to add two 128 GB SSD-backed Data Disks to each ZooKeeper VM for ZooKeeper logs and data. I suspect this configuration may be overkill, but until I get further along in testing, I'm heeding the warnings in the ZooKeeper documentation and this O'Reilly book about performance challenges related to logging and data (snapshots).
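To make the ensemble layout concrete, here is a minimal sketch that renders a zoo.cfg for three nodes, with snapshots and transaction logs pointed at separate directories (one per data disk). The host names (zk1 to zk3) and mount paths are hypothetical placeholders, and the timing values are simply the ones from ZooKeeper's sample configuration, not tuned settings from this deployment.

```python
def render_zoo_cfg(hosts, data_dir, data_log_dir, client_port=2181):
    """Return the text of a zoo.cfg for a replicated ZooKeeper ensemble."""
    lines = [
        "tickTime=2000",                # base time unit, in milliseconds
        "initLimit=10",                 # ticks a follower may take to connect and sync
        "syncLimit=5",                  # ticks a follower may lag behind the leader
        f"dataDir={data_dir}",          # snapshots go here
        f"dataLogDir={data_log_dir}",   # transaction logs go here
        f"clientPort={client_port}",
    ]
    # One server.N line per node: 2888 is the quorum port, 3888 the election port.
    for node_id, host in enumerate(hosts, start=1):
        lines.append(f"server.{node_id}={host}:2888:3888")
    return "\n".join(lines) + "\n"


# Hypothetical host names and mount points, for illustration only.
ZK_HOSTS = ["zk1", "zk2", "zk3"]
zoo_cfg = render_zoo_cfg(ZK_HOSTS, "/mnt/zk-data", "/mnt/zk-log")
print(zoo_cfg)
```

Each node would also need a myid file in its dataDir containing the number from its own server.N line.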
For the Drill nodes, I'm actually starting out a little conservatively with four DS4 V2 Azure VMs, providing each node with a solid-state drive backed OS disk, 28 GB of RAM, and 8 CPU cores. As I get into query performance testing, I may size these up or simply add more nodes, depending on what the data tell me about performance on these machines.
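In a distributed deployment, each Drill node finds the others through ZooKeeper, which is configured in drill-override.conf. The sketch below renders that file for a three-node ensemble. The host names are hypothetical, and the cluster ID shown is just Drill's default value, not necessarily what this deployment uses.

```python
def render_drill_override(zk_hosts, cluster_id="drillbits1", zk_port=2181):
    """Return the text of a drill-override.conf pointing at a ZooKeeper ensemble."""
    # zk.connect is a comma-separated list of host:port pairs.
    zk_connect = ",".join(f"{host}:{zk_port}" for host in zk_hosts)
    return (
        "drill.exec: {\n"
        f'  cluster-id: "{cluster_id}",\n'
        f'  zk.connect: "{zk_connect}"\n'
        "}\n"
    )


# Hypothetical ZooKeeper host names, for illustration only.
drill_conf = render_drill_override(["zk1", "zk2", "zk3"])
print(drill_conf)
```

The same file would go on all four Drill nodes, since every node in one cluster shares the cluster ID and the ZooKeeper connection string.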
NOTE For all VMs, I am deploying the Ubuntu Server 14.04 LTS image found in the Azure Marketplace.
In order to control for both planned and unplanned outages, I am assigning the three ZooKeeper nodes to an Availability Set with three fault domains and 7 update domains. The number of update domains is higher than needed, but I chose it to give me headroom for expanding the ensemble. For the Drill nodes, I defined a different Availability Set with three fault domains and the max (20) update domains. For more information about Availability Sets, please see this document.
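As a rough illustration of what these numbers buy, the sketch below spreads each group of VMs across its fault and update domains and reports the worst-case number of nodes lost when a single domain goes down. It assumes simple round-robin placement, which is a simplification for illustration (actual placement is decided by Azure), and the VM names are hypothetical.

```python
from collections import Counter


def place(vm_names, fault_domains, update_domains):
    """Map each VM to a (fault domain, update domain) pair, round-robin."""
    return {
        name: (i % fault_domains, i % update_domains)
        for i, name in enumerate(vm_names)
    }


def worst_case_loss(placement, axis):
    """Largest number of VMs sharing one domain (axis 0 = fault, 1 = update)."""
    counts = Counter(domains[axis] for domains in placement.values())
    return max(counts.values())


# Hypothetical VM names; domain counts are the ones chosen in the text.
zk = place(["zk1", "zk2", "zk3"], fault_domains=3, update_domains=7)
drill = place(["drill1", "drill2", "drill3", "drill4"],
              fault_domains=3, update_domains=20)

zk_quorum = len(zk) // 2 + 1              # ZooKeeper needs a strict majority
zk_fd_loss = worst_case_loss(zk, 0)
drill_fd_loss = worst_case_loss(drill, 0)
drill_ud_loss = worst_case_loss(drill, 1)

print("ZooKeeper quorum size:", zk_quorum)
print("ZooKeeper nodes lost if one fault domain fails:", zk_fd_loss)
print("Drill nodes lost if one fault domain fails:", drill_fd_loss)
print("Drill nodes down during one update domain reboot:", drill_ud_loss)
```

Under this assumption, losing one fault domain takes out a single ZooKeeper node, which still leaves a two-node majority. With four Drill nodes and only three fault domains, two Drill nodes necessarily share a fault domain.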
On the networking side of things, I've decided to deploy all 7 of these servers within an Azure Virtual Network. This will allow them to speak freely to one another without needing to open them to any outside communications. That said, I do want to be able to SSH into each VM, so I have configured each VM to allow inbound SSH on TCP port 22 and have assigned each a public IP address with a friendly fully-qualified name.
Excluding storage for the disks behind the VMs and the public IP addresses and FQDNs, here is a rough diagram of what I am deploying: