How Cloud Computing Can Help Handle Big Data

Oct 19, 2012

Last week, I attended the Amazon Web Services (AWS) Public Sector Summit and had the pleasure of listening to some very interesting speakers and meeting some of the smartest people in cloud computing.

The Summit showcased innovative work being done with ‘Big Data in the Cloud’ and I’d like to share some of my thoughts and observations.

First of all, let’s define Big Data, which is a very popular IT buzzword these days. Depending on who you ask, its meaning is interpreted in widely different ways. Much like AWS does, I like to think of it in broad terms. Big Data is essentially any data that hits the limits of conventional technology with respect to volume, velocity, and variety; in other words, large amounts of data that must be stored at scale and accessed rapidly. As my colleague Prem Jadhwani, who was the Commissioner of the TechAmerica Big Data Commission, says, Big Data can not only save money but also improve mission effectiveness.

At the Summit, many real-life case studies demonstrated this improvement of mission effectiveness.  Specifically, research institutions and cutting-edge scientific organizations implemented cloud-based architectures to process Big Data workloads.

By taking the benefits of cloud computing and scaling them to support that data, these organizations were able to maximize their resource utilization while minimizing costs.

For example, one research organization was working on classifying protein-specific cancer cells. Certain cancer cells have specific types of proteins in their DNA makeup that can be targeted by certain drugs or exploited to trigger specific antibodies that help fight the spread of those cancerous cells. The result is an astronomical number of data points to sift through because of the high volume of permutations and combinations of these proteins.

This organization used the power, elasticity, and scalability that cloud computing offers to process these large volumes of data.

The research institution turned to a company called Cycle Computing, which develops supercomputing software that runs on AWS infrastructure. Using simple, commercially available tools, they created clusters of High Performance Computing (HPC) virtual machines with large amounts of RAM and compute capacity. With these HPC clusters and some simple scripts, the organization was able to scale up and down based on the data being ingested and keep resource utilization high. As workloads finished, the HPC clusters were automatically de-provisioned, at which point the organization was no longer charged for them. The data was processed in a number of small relational databases and written to static storage after the workload was complete.
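To make the scale-up, scale-down idea concrete, here is a minimal sketch of the kind of decision such a script might make. It is not Cycle Computing's software or any AWS API; the function names, the jobs-per-node figure, and the node cap are all hypothetical and chosen only for illustration.

```python
import math


def desired_nodes(pending_jobs: int, jobs_per_node: int, max_nodes: int) -> int:
    """Return how many worker nodes the current backlog calls for.

    All parameters are hypothetical: a real system would read the backlog
    from its own job queue and pick limits that fit its budget.
    """
    if pending_jobs <= 0:
        # Nothing left to process: release every node so billing stops.
        return 0
    needed = math.ceil(pending_jobs / jobs_per_node)
    # Never ask for more nodes than the configured cap.
    return min(needed, max_nodes)


def scaling_action(current_nodes: int, target_nodes: int) -> str:
    """Describe the change needed to move from the current size to the target."""
    delta = target_nodes - current_nodes
    if delta > 0:
        return f"provision {delta}"
    if delta < 0:
        return f"deprovision {-delta}"
    return "hold"


if __name__ == "__main__":
    # Walk through a backlog that grows, shrinks, and finally empties.
    nodes = 0
    for backlog in (250, 5000, 40, 0):
        target = desired_nodes(backlog, jobs_per_node=10, max_nodes=100)
        print(f"backlog={backlog}: {scaling_action(nodes, target)}")
        nodes = target
```

The key point is the last step: when the backlog reaches zero the target drops to zero nodes, which is what makes pay-per-use cheaper than owning idle hardware.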

At the end of the day, the research organization was able to process more than 70 TB of data in 15 days, at a cost of around $20,000. By contrast, if they had gone through a traditional hardware procurement to process these workloads, they would have spent tens of millions of dollars. Of course, this hardware would then sit idle until the next project. It would also depreciate in value and require ongoing maintenance.

What is astounding about this case study is that, had they used only the resources available to them (without cloud computing), it would have taken more than 1,000,000 hours, or about 150 years, to process the data.
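For readers who want to check the numbers, the short sketch below runs the back-of-the-envelope arithmetic on the figures quoted in this post (70 TB, 15 days, $20,000, 1,000,000 hours, 150 years). The helper names are my own, and the year length ignores leap years, so treat the results as rough.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years


def hours_to_years(hours: float) -> float:
    """Convert a number of hours into years."""
    return hours / HOURS_PER_YEAR


def years_to_hours(years: float) -> float:
    """Convert a number of years into hours."""
    return years * HOURS_PER_YEAR


def cost_per_tb(total_cost: float, terabytes: float) -> float:
    """Average cost of processing one terabyte."""
    return total_cost / terabytes


def tb_per_day(terabytes: float, days: float) -> float:
    """Average processing throughput in terabytes per day."""
    return terabytes / days


if __name__ == "__main__":
    print(f"1,000,000 hours is about {hours_to_years(1_000_000):.0f} years")
    print(f"150 years is about {years_to_hours(150):,.0f} hours")
    print(f"Cost per TB: about ${cost_per_tb(20_000, 70):.2f}")
    print(f"Throughput: about {tb_per_day(70, 15):.1f} TB per day")
```

Note that 1,000,000 hours alone works out to roughly 114 years, while 150 years is about 1.3 million hours, so the two figures line up only because the estimate was "more than" 1,000,000 hours.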

This is merely one outstanding example of how cloud computing helps handle Big Data workloads.  There are many more compelling cases out there.

Cloud computing can help increase resource utilization, minimize costs, and overcome many traditional IT obstacles. Thanks to the cloud, what was once impossible is now attainable.