Thought Leaders in the Cloud: Talking with Rob Gillen, Oak Ridge National Lab Cloud Computing Researcher
Rob Gillen is researching cloud computing technology for the government at Oak Ridge National Laboratory. He also works for Planet Technologies, which recently launched a new cloud practice to assist government and public sector organizations with cloud computing. He has a great blog on cloud computing that goes back seven years, and he also has a lot of presentations and talks up on the web. Rob is also a Windows Azure MVP (Most Valuable Professional).
In this interview we cover:
- The pros and cons of infrastructure-as-a-service
- Maximizing data throughput in the cloud
- Cloud adoption in computational science
- The benefits of containerized computing
- Architecting for the cloud versus architecting for on-premises
Robert Duffner: Could you take a moment to introduce yourself?
Rob Gillen: I am a solutions architect for Planet Technologies and I work in the Computer Science and Mathematics division here at Oak Ridge National Laboratory, and I'm doing work focused on scientific and technical workloads.
Robert: To jump right in, what do you see as advantages and disadvantages for infrastructure and platform-as-a-service, and do you see those distinctions going away?
Rob: Each of those aspects of the technology has different advantages. For many people, the infrastructure-as-a-service platform approach is simpler to start using, because your existing codes run more or less unmodified. Most of those services or offerings don't have requirements with regard to a particular OS.
As we receive more technically-focused offerings of unique network interconnections and so forth, people are able to deploy cloud-based assets that are increasingly similar to their on-premises assets.
We have seen some interesting pickup in platform-as-a-service offerings, particularly from the lower end of scientific computing, among people who have not traditionally been HPC users but maybe have been doing a lot of computing on their local machines and have become machine bound. We've seen tools written and developed that can extend their problems and algorithms directly into the cloud using the APIs that are inherent in platform-as-a-service offerings.
As far as the distinctions going away, I think the days of a particular vendor only offering one or the other will be over soon. If you look at some of the vendors, there's a lot of cross-play across their offerings. Still, I think the distinctions will continue to live on to some degree. Additionally, I don't think that platform-as-a-service offerings will be going away any time soon.
For example, Amazon’s Elastic Compute Cloud (EC2) service is very much an infrastructure-as-a-service play. However, if you look at their Elastic MapReduce product or their Elastic Beanstalk product, both of those are very much platform-as-a-service.
When we compare offerings from our perspective as computational researchers, as you start with the infrastructure offerings, you have a great deal of control from a programmatic standpoint and an infrastructure details standpoint, but you give up a lot of the “magic” traditionally associated with clouds. As you move along the cloud spectrum toward platform as a service, you give up some control, but you gain a lot of magic, in the sense that there are a lot of things you don't have to worry about. So depending on the type of computation you're doing, they have different value to you.
To summarize, I think that individual technologies will continue to grow, but the distinctions at the vendor level will fade over time.
Robert: It seems that, in the current state of the market, infrastructure-as-a-service is better suited to migrate existing applications, and platform-as-a-service is really architecting a whole new type of cloud-based applications. Would you agree with that?
Rob: Mostly, yes. Infrastructure-as-a-service is definitely easier for migrating, although I would want to clarify the second half of your statement. I think it depends on the type of problem you're trying to solve. The platform-as-a-service offerings from any vendor are generally very interesting, but they have constraints, and depending on the type of problem you're trying to solve, those constraints may or may not be acceptable to you.
So, I agree with you, with the caveat that it's not a blanket statement that green-field implementations should always look at platform as a service first – you have to evaluate the suitability of the platform to the problem you are trying to solve.
Robert: You've interacted with government agencies that are looking at the cloud, and you've blogged about your company's launch of GovCloud. What are some of the key differences between government and other uses of the cloud?
Rob: One of the biggest things comes down simply to data privacy and data security. The first thing every customer we talk to about cloud brings up, both inside and outside the government space, is data privacy. While there’s some good reasoning behind that, the reality is that cloud computing vendors often do better there than what the customers can provide themselves, particularly in the private sector. For many of those customers, moving to the cloud gives them increased data security and data privacy.
In some areas of the government, that would also be true (especially in some of the smaller state and local government offices) – cloud vendors might actually have a more secure platform than what they're currently using. But most often there are policy and legal issues that will prevent them from moving into the cloud, even if they want to.
I think some of the major vendors have recently been certified for a base level or what we would call low-security data, allowing public sector customers to put generally available data in the cloud. But anything with any significant sensitivity can't be moved there yet by policy, regardless of the actual appropriateness of the implementation.
That's a major consideration today – which is unfortunate – because as it stands, the federal government has many tasks that could benefit from a cloud computing infrastructure. I get excited when I see progress being made toward breaking down some of those barriers. Certainly, some of those barriers should not and will not go away but there are some that should, and hopefully they will.
Robert: You did a series of blog posts on maximizing data throughput in the cloud. What led you down that path? And was there a scenario where you needed to maximize a file transfer throughput?
Rob: One of the aspects where we think cloud computing can be valuable for scientific problems is in post-processing or post-analysis of work or datasets that were generated on supercomputers.
We took a large selection of climate data generated on Jaguar, which is one of the supercomputers here at Oak Ridge, and we modeled the process of taking that data and moving it into the cloud for post-processing. We looked at different ways to get the data there faster while making sure that data integrity remained high.
We also worked through problems around data publishing, so that once it’s in the cloud, we can make it available in formats that are consumable by others, both within and outside the particular research domain. We're working through the challenge that many scientific domains use domain-specific file formats. For example, climatology folks often use file formats like NetCDF and HDF5. They have particular reasons for using those, but they are not necessarily widely used in other disciplines. Taking that same data and making it available to a wide set of people is difficult if it remains in those native formats.
Therefore, we're looking at how to leverage the infrastructure in the platforms provided by the cloud, whatever data structures they use, to actually serve that data up and make it available to a new and broader audience than has previously been possible.
That was the main problem set that we were working on, and we found some interesting results. With a number of the major providers, we came up with ways to improve data transfer, and it's only getting better as Microsoft, Amazon, and other vendors continue to improve their offerings and make them more attractive for use in the scientific domain.
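A minimal sketch of the kind of integrity-checked, parallel transfer described above. It was written for illustration and is not taken from the Oak Ridge work. The `upload_chunk` function and the in-memory `store` dictionary are hypothetical stand-ins for a real cloud storage call; the general idea, assumed here, is to split the payload into chunks, send them concurrently, and compare a checksum of what arrived against a checksum computed locally.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 4  # bytes; deliberately tiny for the demo, real transfers would use megabytes


def split_chunks(data: bytes, size: int = CHUNK_SIZE):
    """Return a list of (index, chunk) pairs covering the whole payload."""
    return [(i // size, data[i:i + size]) for i in range(0, len(data), size)]


def upload_chunk(store: dict, index: int, chunk: bytes) -> str:
    """Hypothetical stand-in for a cloud upload call.

    Stores the chunk and returns the SHA-256 digest of what was stored,
    mimicking a service that reports a checksum for received data.
    """
    store[index] = chunk
    return hashlib.sha256(store[index]).hexdigest()


def parallel_upload(data: bytes, store: dict, workers: int = 4) -> bool:
    """Upload all chunks concurrently; return True only if every digest matches."""
    chunks = split_chunks(data)
    expected = {i: hashlib.sha256(c).hexdigest() for i, c in chunks}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {i: pool.submit(upload_chunk, store, i, c) for i, c in chunks}
        received = {i: f.result() for i, f in futures.items()}
    return received == expected


def reassemble(store: dict) -> bytes:
    """Rebuild the payload from stored chunks in index order."""
    return b"".join(store[i] for i in sorted(store))


if __name__ == "__main__":
    remote = {}
    payload = b"climate-model-output"
    print("verified:", parallel_upload(payload, remote))
    print("round trip ok:", reassemble(remote) == payload)
```

Sending chunks in parallel is one common way to raise throughput over a high-latency link, and the per-chunk digest means a corrupted piece can be detected and resent without repeating the whole transfer.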
Robert: Data centers are pretty opaque, in the sense that you don't have a lot of visibility into how the technology is implemented. Have you seen instances where cloud performance changes significantly from day to day? And if so, what's your guidance to app developers?
Rob: That issue probably represents the biggest hesitation on the part of the scientists I'm working with, in terms of using the cloud. I'm working in a space where we have some of the biggest and brightest minds when it comes to computational science, and the notion of asking them to use this black box is somewhat laughable to them.
That is why I don't expect, in the near term at least, that we’ll see cloud computing replace some of the specifically tuned hardware like Jaguar, Kraken, or other supercomputers. At the same time, there is a lot of scientific work being done that is not necessarily as execution-time-critical as others. Often, these codes do not benefit from the specialized hardware available in these machines.
There are certain types of simulations that are time-sensitive and communication heavy, meaning for each step of compute that is performed, a comparatively significant amount of communication between nodes is required. In cases like this, some of the general cloud platforms aren’t as good a fit.
I think it's interesting to see some of the cloud vendors realizing that fact and developing platforms that cater to that style of code, as illustrated by some of the cluster computing instances by Amazon and others. That’s important in these cases, since general-purpose cloud infrastructures can introduce unacceptable inconsistencies.
We've also seen a lot of papers published by people doing assessments of infrastructure-as-a-service providers, where they'll look and see that their computational ability changes drastically from day to day or from node to node. Most often, that's attributed to the noisy neighbor problem. When this research is done in smaller scale projects, by university students or others on constrained budgets, they tend to use the small or medium instances offered by whatever cloud vendor is available. In such cases, people are actually competing for resources with others on the same box. In fact, depending on the intensity of their algorithms and the configuration they have selected, they could be fighting with themselves on the same physical nodes, since the cloud provider’s resource allocation algorithm may have placed them on the same physical node.
As people in the scientific space become more comfortable with using the largest available node, they're more likely to have guaranteed full access to the physical box and the underlying substrate. This will improve the consistency of their results. There are still shared assets that, depending on usage patterns, will introduce variability (persistent storage, network, etc.) but using the larger nodes will definitely reduce the inconsistencies – which is, frankly, more consistent with traditional HPC clusters. When you are running on a collection of nodes within a cluster, you have full access to the allocated nodes.
The core issue in this area is to determine what the most applicable or appropriate hardware platform is for a given type of problem. If you're doing a data-parallel app, in which you're more concerned about calendar time or development time than you are about your execution time, a cloud will fit the problem well in many cases. If you're concerned about latency and you have very specific execution-time concerns, the cloud (in its current incarnation, at least) is probably not the right fit.
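One simple way to quantify the run-to-run inconsistency discussed above is the coefficient of variation (standard deviation divided by mean) of repeated timings of the same job. The sketch below is illustrative only: the timing lists are invented numbers, not measurements from any provider, and the 5% threshold is an arbitrary assumption.

```python
import statistics


def coefficient_of_variation(samples):
    """Sample standard deviation divided by the mean of the timings."""
    mean = statistics.mean(samples)
    return statistics.stdev(samples) / mean


def is_consistent(samples, threshold=0.05):
    """Treat a set of timings as consistent if the CV is at or below the threshold."""
    return coefficient_of_variation(samples) <= threshold


if __name__ == "__main__":
    # Invented timings (seconds) for the same job on two hypothetical instance types.
    shared_small_instance = [100, 140, 90, 160, 110]
    dedicated_large_instance = [100, 101, 99, 100, 100]

    for name, runs in [("shared", shared_small_instance),
                       ("dedicated", dedicated_large_instance)]:
        cv = coefficient_of_variation(runs)
        print(f"{name}: CV={cv:.3f} consistent={is_consistent(runs)}")
```

Running the same benchmark at different times of day and on different nodes, then comparing the resulting CV values, gives a rough picture of whether a noisy neighbor or a shared resource is affecting a workload.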
Robert: Back in August of last year, you also posted about containerized computing. What interest do you see in this trend, and what scenarios are right for it?
Rob: That topic aligns very nicely with the one we touched on earlier, about data privacy in the federal space. A lot of federal organizations are building massive data centers. One key need for the sake of efficiency is to get any organization, government or otherwise, to stop doing undifferentiated heavy lifting.
Every organization should focus on where it adds value and, as much as possible, it should allow other people to fill in the holes, whether through subcontracting, outsourcing, or other means. I expect to see more cases down the road where data privacy regulations require operators not only to ensure the placement of data geographically within, say, a particular country’s boundary, but specifically within an area such as my premises, my corporate environment, or a particular government agency.
You can imagine a model wherein a cloud vendor actually drops containerized chunks of the data center inside your fence, so you have physical control over that device, even though it may be managed by the cloud vendor. Therefore, a government agency would not have to develop its own APIs or mechanisms for provisioning or maintenance of the data center – the vendor could provide that. The customer could still benefit from the intrinsic advantages of the cloud, while maintaining physical control over the disks, the locality, and so on.
Another key aspect of containerized approaches to computing is energy efficiency. We’re seeing vendors begin to look at the container as the field-replaceable unit, which allows them to introduce some rather innovative designs within the container. When you no longer expect to be able to swap out individual servers, you can eliminate traditional server chassis (which, beyond making the server “pretty,” simply block airflow and reduce efficiency), consolidate power supplies, and experiment with air cooling or swamp cooling and higher ambient temperatures. The list goes on. We are seeing some very impressive PUE (power usage effectiveness) numbers from various vendors, and we are working to encourage these developments.
There are also some interesting models for being able to bundle very specialized resources and deploy them in non-traditional locations. You can package up a generator, a communications unit, specialized compute resources, and analysis workstations, all in a 40-foot box, and ship it to a remote research location, for example.
Robert: The National Institute of Standards and Technology (NIST) just released a report on cloud computing, where they say, and I quote, "Without proper governance, the organizational computing infrastructure could be transformed into a sprawling, unmanageable mix of insecure services." What are your thoughts on that?
Rob: My first thought is that they're right.
[laughter]
They're actually making a very similar argument to one that’s often made about SharePoint environments. Any SharePoint consultant will tell you that one of the biggest problems they have, which is really both a weakness and strength of the platform, is that it's so easy to get that first order of magnitude set up. In a large corporation, you often hear someone say, “We've got all of these rogue SharePoint installs running across our environment, and they're difficult to manage and control from an IT perspective. We don't have the governance to make sure that they're backed up and all that sort of thing.”
And while I can certainly sympathize with that situation, the flip side is that those rogue installs are solving business problems, and they probably exist because of some sort of impediment to actually getting work done, whether it was policy-based or organizationally based. Most of those organizations just set it up themselves because it was simpler than going through the official procedures.
A similar situation is apt to occur with cloud computing. A lot of people won’t even consider going through months of procurement and validations for policy and security, when they can just go to Amazon and get what they need in 10 minutes with a credit card. IT organizations need to recognize that a certain balance needs to be worked out around that relationship.
I think as we move forward over time, we will work toward an environment where someone can provision an on-premises platform with the same ease that they can go to Amazon, Microsoft, or whoever today for cloud resources. That model will also provide a simple means to address the appropriate security considerations for their particular implementation.
There's tension there, which I think has value, between IT people who want more control and end users who want more flexibility. Finding that right balance is going to be vital for any organization to use cloud successfully.
Robert: How do you see IT creating governance around how an organization uses cloud without sacrificing the agility that the cloud provides?
Rob: Some cloud computing vendors have technologies that allow customers to virtually extend their physical premises into the cloud. If you combine that sort of technology with getting organizational IT to repackage or re-brand the provisioning mechanisms provided by their chosen cloud computing provider, I think you can end up with a very interesting solution.
For example, I could imagine an internal website managed by my IT organization where I could see a catalog of available computing assets, provide our internal charge code, and have that platform provisioned and made available to me with the same ease that I could with an external provider today. In fact, that scenario could actually make the process easier for me than going outside, since I wouldn’t have to use a credit card and potentially a reimbursement mechanism. In this model, the IT organization essentially “white labels” the external vendor’s platform, and layers in the organizational policies and procedures while still benefiting from the massive scale of the public cloud.
Robert: What do you think makes architecting for the cloud different than architecting for on-premises or hosted solutions?
Rob: The answer to that question depends on the domain in which you're working. Many of my cloud computing colleagues work in a general corporate environment, with customers or businesses whose work targets the sweet spot of the cloud, such as apps that need massive horizontal scaling. In those environments, it's relatively straightforward to talk about architecting for the cloud versus not architecting for the cloud, because the lines are fairly clear and solid patterns are emerging.
On the other hand, a lot of the folks I'm working with may have code and libraries that have existed for a decade, if not longer. We still have people who are actively writing in Fortran 77 who would argue that it's the best tool for the job they're trying to accomplish. And while most people who are talking about cloud would laugh at that statement, it's that type of scenario that makes this domain unique.
Most of the researchers we're working with don't think about architecting for the cloud or not, so much as they think in terms of architecting to solve their particular problem. That's where it comes to folks like me and others in our group to help build tools that allow the domain scientist to leverage the power of the cloud without having to necessarily think about or architect for it.
I've been talking to a lot of folks recently about the cloud and where it sits in the science space. I've worked in the hosted providers’ space for over a decade now, and I’ve been heavily involved in doing massive scaling of hosted services such as hosted email (which are now being called “cloud-based services”) for many, many years. There are some very interesting aspects of that from a business perspective, but I don't think that hosted email really captures the essence of cloud computing.
On the next level, you can look at massive pools of available storage or massive pools of available virtual machines and build interesting platforms. This seems to be where many folks are focusing their cloud efforts right now, and while it adds significant value, there’s still more to be gleaned from the cloud.
What gets me excited about architecting for the cloud is that rather than having to build algorithms to fit into a fixed environment, I can build an algorithm that will adjust the environment based on the dynamics of the problem it's solving. That is an interesting shift and a very different way of solving a problem. I can build an algorithm or a solution to a scientific problem that knows what it needs computationally, and as those needs change, it can make a call out to get another couple of nodes, more storage, more RAM, and so on. It’s a game-changer.
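To illustrate the idea of an algorithm that resizes its own environment, here is a small simulation. Everything in it is hypothetical: `SimulatedCluster` stands in for whatever provisioning API a real cloud provider exposes, and the sizing rule (one node per ten pending tasks, capped at eight nodes) is an arbitrary assumption chosen to keep the example readable.

```python
import math
from dataclasses import dataclass


@dataclass
class SimulatedCluster:
    """Hypothetical stand-in for a cloud provisioning API."""
    nodes: int = 1
    max_nodes: int = 8

    def resize(self, target: int) -> int:
        """Clamp the requested size to [1, max_nodes] and apply it."""
        self.nodes = max(1, min(self.max_nodes, target))
        return self.nodes


def nodes_needed(pending_tasks: int, tasks_per_node: int) -> int:
    """Smallest node count that covers the pending work, never below one."""
    return max(1, math.ceil(pending_tasks / tasks_per_node))


def run(workload, cluster: SimulatedCluster, tasks_per_node: int = 10):
    """For each step, let the algorithm request the capacity it currently needs."""
    history = []
    for pending in workload:
        cluster.resize(nodes_needed(pending, tasks_per_node))
        history.append(cluster.nodes)
    return history


if __name__ == "__main__":
    # Pending task counts at successive steps of a hypothetical computation.
    print(run([5, 35, 120, 20, 0], SimulatedCluster()))  # prints [1, 4, 8, 2, 1]
```

The point of the sketch is the direction of control: the computation decides how large its environment should be at each step, instead of being written to fit a fixed allocation.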
Robert: What advice do you have for organizations looking at porting existing apps to the cloud?
Rob: First, they should know that it's not as hard as it sounds. Second, they should take it in incremental steps. There are a number of scenarios and tutorials out there that walk you through different models. Probably the best approach is to take a mature app and consider how to move it to the cloud with the least amount of change. Once they have successfully deployed it to the cloud (more or less unmodified), they can consider what additional changes they can make to the application to better leverage the cloud platform.
A lot of organizations make the mistake of assuming that they need to re-architect applications to move them to the cloud. That can lead them to re-architect some of the key apps they depend on for their business from the ground up. In my mind, taking a number of controlled incremental steps is better than taking a few large ones.
Robert: That seems like a good place to wrap up. Thanks for taking the time to talk today.
Rob: My pleasure.