Should Researchers Perform Biocomputation In The Cloud To Deal With A Biological Data Tsunami?
More data is being produced, analyzed, shared, and stored than ever before. Scientific research, particularly biological sciences like genomics, is one of the more prominent examples of this, with laboratories producing terabytes of data. Scientists - like the researchers described above - use BLAST to sift through large databases, identify new animal species, improve drug effectiveness, produce biofuels, and much more. What the new NCBI BLAST hosted in the cloud does is provide a user-friendly Web interface and access to back-end (and largely out-of-sight) cloud computing on Windows Azure for very large BLAST computations. In more advanced scenarios, scientists will not only be able to conduct BLAST analyses on their private data collections, but also include public data hosted entirely in the cloud, including data from peer-reviewed scientific publications.
When I was in graduate school working on different aspects of genetics with Prof. Tony Long, and later at NYU working on genomics as a postdoctoral fellow, I used BLAST frequently. It was hosted by a part of the National Institutes of Health (NIH) called the National Center for Biotechnology Information (NCBI). NCBI BLAST was slow. You would enter just a single DNA or protein sequence and then "blast" it against the millions of sequences in the public database to find matches, and this could take a while. Users would get a message saying "Results ready in 30 seconds..." and then after 30 seconds get another one saying "Still working... Results ready in 90 seconds..." and so on. And this was 10 years ago.
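For readers curious what that workflow looks like outside the browser, here is a minimal sketch of a single BLAST query submitted programmatically, assuming the Biopython library is installed; the query sequence below is a made-up fragment used purely for illustration, not real data.

```python
# A minimal sketch: submit one DNA sequence to NCBI BLAST and wait for the
# match report. Requires Biopython (pip install biopython). The query
# sequence is a made-up fragment for illustration only.
from Bio.Blast import NCBIWWW, NCBIXML

query_sequence = "AGCTGATCGTAGCTAGCTAGGCTAGCTAGGATCGATCGTAGCTAGCATCG"

# "blastn" compares a nucleotide query against the public "nt" database.
# This call blocks while NCBI runs the search - the same wait behind the
# "Results ready in 30 seconds..." messages described above.
result_handle = NCBIWWW.qblast("blastn", "nt", query_sequence)

# Parse the XML report and print the top-scoring matches.
blast_record = NCBIXML.read(result_handle)
for alignment in blast_record.alignments[:5]:
    best_hsp = alignment.hsps[0]
    print(f"{alignment.title[:60]}  E-value: {best_hsp.expect}")
```

Conceptually, the Web interface does the same thing: one query sequence goes in, and the service compares it against millions of database sequences before returning a ranked report.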
These scientific tools in the cloud help labs on the small end of the scale. “NCBI BLAST on Windows Azure gives all research organizations the same computing resources that traditionally only the largest labs have been able to afford,” said Bob Muglia, a Senior Vice President at Microsoft overseeing all its work in the cloud. “It shows how Windows Azure provides the genuine platform-as-a-service capabilities that technical computing applications need to extract insights from massive data, in order to help solve some of the world’s biggest challenges across science, business and government.”
Now, with much more sophisticated tools to collect biological and other data, scientists are being overwhelmed by a "data tsunami" of sorts. As Dan Reed wrote in a blog post on the issue called The Future of Discovery and the Power of Simplicity, "Simply put, science is in transition from data poverty to data plethora. The implication is that future advantage will accrue to those who can best extract insights from this data tsunami...I believe this will have a transformative, democratizing effect – driving change and creating discovery and innovation opportunities."
More broadly, computer science researchers from different companies are interested in bringing the full force of cloud computing resources to technical specialists in many different science and engineering disciplines. Microsoft Research, for its part, is driving a worldwide program to engage the research community. Microsoft’s technical computing initiative is aimed at bringing supercomputing power and resources – particularly through Windows Azure – for modeling and prediction to more organizations across science, business and government.
Lessons Learned About Large-Scale Computational Research in the Cloud
The application of BLAST on Windows Azure to research conducted by the University of Washington and Children’s Hospital groups mentioned above taught Microsoft Research many important lessons about how to structure large-scale research projects in the cloud. Moreover, most of what was learned is applicable not just to BLAST, but to any parallel jobs run at large scale in the cloud. Here are three lessons learned about large-scale computational research in the cloud (a rough code sketch illustrating all three follows the list):
- Design for failure: Large-scale data-set computation will nearly always result in some sort of failure. In the week-long run of the Children’s Hospital project, there were a number of failures, including individual machines going down and entire data centers being taken offline for regular updates. In each case, Windows Azure produced messages about the failure and had mechanisms in place to make sure jobs were not lost.
- Structure for speed: Structuring individual tasks in an optimal way can significantly reduce the total computation run time. Researchers conducted several test runs before embarking on runs of the whole dataset to make sure that input data was partitioned to get the most use out of each worker node.
- Scale for cost savings: If a few long-running jobs are processing alongside many shorter jobs, it is important to not have idle worker nodes continuing to run up costs once their piece of the job is done. Researchers learned to detect which computers were idle and shut them down to avoid unnecessary costs.
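To make those three patterns concrete, here is a rough, hypothetical sketch in Python - not the actual Microsoft Research or Windows Azure code - of a worker loop that partitions input sequences into chunks, retries a chunk when it fails, and shuts a worker down as soon as no work remains. The chunk size, retry limit, and the run_blast_chunk placeholder are all invented for illustration.

```python
# Hypothetical illustration of the three lessons above, not the actual
# BLAST-on-Azure implementation: partition input for speed, retry on
# failure, and stop workers once they go idle to avoid unnecessary cost.
import queue
import threading

MAX_RETRIES = 3    # design for failure: retry a failed chunk a few times
CHUNK_SIZE = 250   # structure for speed: tune with small test runs first


def partition(sequences, chunk_size=CHUNK_SIZE):
    """Split the full input into evenly sized chunks, one unit of work each."""
    return [sequences[i:i + chunk_size]
            for i in range(0, len(sequences), chunk_size)]


def run_blast_chunk(chunk):
    """Placeholder for the real per-chunk BLAST computation."""
    return ["match report for " + seq for seq in chunk]


def worker(work_queue, results, failures):
    while True:
        try:
            chunk, attempts = work_queue.get_nowait()
        except queue.Empty:
            # Scale for cost savings: nothing left to do, so this worker
            # exits instead of idling and running up compute charges.
            return
        try:
            results.extend(run_blast_chunk(chunk))
        except Exception:
            if attempts + 1 < MAX_RETRIES:
                # Design for failure: requeue the chunk so the job is not lost.
                work_queue.put((chunk, attempts + 1))
            else:
                failures.append(chunk)


def run_job(sequences, num_workers=4):
    work_queue = queue.Queue()
    for chunk in partition(sequences):
        work_queue.put((chunk, 0))

    results, failures = [], []
    threads = [threading.Thread(target=worker,
                                args=(work_queue, results, failures))
               for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, failures
```

In a real cloud deployment each worker would be a separate virtual machine rather than a thread, but the structure of the decisions - how to split the data, what to do when a piece fails, and when to release an idle machine - is the same.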
Working on large-scale projects in the cloud isn't like traditional biological research. Not long ago (when I was in the laboratory), you actually had to understand how to run Linux and use programming languages to control robots and genomic sequencing machines, and then use other languages like R to custom-analyze the data. And some people will still do this. But now science appears to be entering a phase of being aided by new kinds of apps in new form factors, backed by the technology underlying cloud computing to maximize time, effort, and ultimately, the discovery and public good brought from government research grant dollars. And in the long run, such advances help everyone. The cloud is simply accelerating the pace at which discovery comes about.
Images from Wikipedia except DNA, the desk and the rocket, used under Creative Commons.