Does Hadoop Ecosystem Need Cloud to Survive?

The journey of Hadoop and the Cloud is an interesting one. Both of these revolutionary technologies started around the same time, but they charted their own separate paths. When Hadoop burst onto the scene, the Cloud was just getting started as well. Enterprises starting their Hadoop journey had two choices: build on-premise clusters, or create virtualized infrastructure in the cloud and run a cluster there.

In those early days, there were many reasons to start Hadoop deployments in the on-premise datacenter. Hadoop workloads were "Big Data" workloads, and hence the cloud option was prohibitively expensive. Storage available in the cloud was also not optimal for heavy data-intensive workloads, and network and disk bandwidth and IOPS were not tuned for the kind of heavy workloads Hadoop was being put in service for. In addition, Hadoop primarily used an execution engine called MapReduce, which was built around disk I/O with memory playing a secondary role.

This all worked out great in the early days, when Hadoop was primarily a batch analytics engine built around workloads that took minutes to hours to complete and latency was not a big requirement. There were two problems with this approach, though. One, storage and compute were tied together, which meant that both needed to be upgraded together. If the cluster was maxed out on disks, the only way to expand storage was to add nodes to the cluster, which would make additional disks available. Two, all these clusters were designed with a "mayday" approach: design for the highest peak of the workload and assume everything else will automatically be taken care of. There was no elasticity built into the cluster. It just ran all day and all night, whether utilization was 5% or 95%. Even that approach would have been fine if we were living in a static world. But the type and size of the data were changing every single day, and last year's estimate of peak workload no longer applied to this year's workload. The result was another capital expenditure to expand the cluster for just the 10% of days with peak workloads; for the other 90% of days it remained largely underutilized.
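To make the underutilization concrete, here is a rough back-of-the-envelope sketch. The node counts, the 10%/90% split of peak to normal days, and the per-node cost are made-up assumptions for illustration only, not figures from any real cluster.

```python
# Back-of-the-envelope comparison of a statically sized cluster (provisioned
# for peak) versus elastic compute that scales with daily demand.
# All numbers are illustrative assumptions, not measurements.

PEAK_NODES = 100          # nodes required on the busiest days (assumed)
BASELINE_NODES = 30       # nodes actually needed on a typical day (assumed)
PEAK_DAYS = 36            # roughly 10% of the year sees peak workloads
NORMAL_DAYS = 365 - PEAK_DAYS
COST_PER_NODE_DAY = 20.0  # hypothetical all-in cost of one node for one day

# Static cluster: pay for peak capacity every day of the year.
static_cost = PEAK_NODES * 365 * COST_PER_NODE_DAY

# Elastic cluster: pay only for what each day actually needs.
node_days_used = PEAK_NODES * PEAK_DAYS + BASELINE_NODES * NORMAL_DAYS
elastic_cost = node_days_used * COST_PER_NODE_DAY

static_utilization = node_days_used / (PEAK_NODES * 365)

print(f"Static cluster cost:  {static_cost:,.0f}")
print(f"Elastic cluster cost: {elastic_cost:,.0f}")
print(f"Static utilization:   {static_utilization:.0%}")
```

Under these assumed numbers the peak-sized cluster sits well under half utilized for most of the year, which is the pattern the paragraph above describes.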

Right around 2015, an optimized and improved version of the cloud debuted. Compute capacity increased as premium node types became available, storage capacity and latency improved, and a slew of network and disk bandwidth improvements provided a much-needed boost in throughput. But the real improvement was the decoupling of storage and compute. You no longer needed to have storage and compute on the same nodes: data could sit in a persistent storage layer, and compute would come up only when it was needed to run jobs or queries. This made everyone's life a lot easier. Storage was cheap, so you could store as much data as needed without ever worrying about capacity. Compute was elastic, so you could use as little or as much as you wanted without worrying about your peak workloads. No matter what the workload was, you would be able to scale. That guarantee, that assurance, meant a lot to organizations.
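As a rough sketch of what that decoupling looks like in practice, the snippet below is a minimal PySpark job that reads from and writes to object storage, so the compute cluster can be created for the job and torn down afterwards. The bucket, paths, and column name are hypothetical, and it assumes the cluster has an S3-compatible connector and credentials configured.

```python
# Minimal PySpark sketch of the decoupled model: data lives permanently in
# object storage, while the Spark cluster doing the computing can be
# provisioned for the job and shut down afterwards.
# Bucket and paths are hypothetical; reading s3a:// paths assumes the
# cluster is configured with the hadoop-aws connector and credentials.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("decoupled-storage-demo").getOrCreate()

# Read input that persists in cheap object storage, independent of any cluster.
events = spark.read.parquet("s3a://example-data-lake/events/2023/")

# Run the computation on a transient cluster sized for this job alone.
# "event_date" is an assumed column name for illustration.
daily_counts = events.groupBy("event_date").count()

# Write results back to object storage; nothing is lost when the cluster goes away.
daily_counts.write.mode("overwrite").parquet(
    "s3a://example-data-lake/reports/daily_counts/"
)

spark.stop()
```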

This advancement is really what saved Hadoop from becoming another technology bogged down by upgrade cycles and capital expenditure, and from being relegated to a specialized environment for a specialized workload. Now the Hadoop ecosystem, along with Spark, is a general-purpose compute environment used for everything from data processing to low-latency querying to machine learning and everything in between. That is what makes the Hadoop ecosystem a player in the long run.

A majority of the Hadoop clusters in the world still run in on-premise environments. The time is ripe to rethink that strategy. Before dipping into another capital expenditure budget request, think Cloud for Hadoop. That will save your Hadoop clusters in the long run. When it comes to Hadoop, the mantra should be "Cloud First, Cloud Must".