

Cloudster: Cost Aware Cluster Management in Cloud Computing

Abstract

A significant driving force behind cloud computing is its potential for executing scientific applications. Traditional large-scale scientific computing applications are typically executed on locally accessible clusters, or possibly on national laboratory supercomputers. However, such machines are often oversubscribed, which causes long wait times (potentially weeks) just to start an application, and this wait time increases with both the number of requested processors and the amount of requested time. The key attraction of scientific cloud computing is that the user can run a job immediately, albeit for a certain cost. Equally important, cloud computing, if fully successful, could allow sites to rid themselves of their local clusters, which have a large total cost of ownership. Traditionally, both computational and computer scientists use metrics such as run time and throughput to evaluate high-performance applications. With the cloud, however, cost becomes an additional critical factor in evaluating alternative application designs. Cloud computing installations generally provide bundled services, each at a different cost, so applications must evaluate different sets of services from different cloud providers to find the lowest-cost alternative that satisfies their particular performance constraints. In the particular case of iPlant, cost and performance are certainly factors: iPlant's funding includes money that can be spent on running jobs on Amazon EC2, the most popular cloud installation. This raises several questions: (1) Which HPC applications will execute efficiently on the cloud? (2) What cloud configuration should be used?
As a first step, we present our analysis of total cost of execution and total turnaround time on EC2 versus national laboratory supercomputers. Previous work has compared the two resources based solely on raw system performance. Our view is that this is too narrow, and that the proper metrics for comparing high-performance clusters to EC2 are turnaround time and cost. In our work, we compare the top-of-the-line EC2 cluster to HPC clusters at Lawrence Livermore National Laboratory (LLNL) on turnaround time and total cost of execution. When measuring turnaround time, we include the expected wait-queue time on the HPC clusters. Our results show that although, as expected, standard HPC clusters are superior in raw performance, EC2 clusters may produce better turnaround times. To estimate cost, we developed a pricing model that sets node-hour prices for the (currently free) LLNL clusters relative to EC2's node-hour prices. We observe that the cost-effectiveness of running an application on a given cluster depends on both raw performance and application scalability. Our work, supported by an Amazon research grant, was accepted at one of the top conferences in High-Performance Computing in Summer 2013.
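
As a rough illustration of this comparison, the sketch below computes turnaround time (expected wait-queue time plus run time) and total cost of execution (node-hours times a per-node-hour price) for two machines. All node counts, times, and prices are illustrative placeholders, not measurements or prices from the paper, and the actual pricing model is more detailed.

```python
# Minimal sketch of the turnaround-time / cost comparison described above.
# All numbers below are illustrative placeholders, not results from the paper.

def turnaround_time(queue_wait_hours, run_hours):
    """Turnaround time = expected wait-queue time + application run time."""
    return queue_wait_hours + run_hours

def execution_cost(nodes, run_hours, price_per_node_hour):
    """Total cost of execution when billed per node-hour."""
    return nodes * run_hours * price_per_node_hour

# Hypothetical scenario: EC2 runs slower but starts immediately, while the
# HPC cluster is faster but the job sits in a wait queue first.
ec2 = {"queue_wait": 0.0,  "run": 5.0, "price": 2.40}   # $/node-hour (placeholder)
hpc = {"queue_wait": 48.0, "run": 3.0, "price": 1.80}   # derived price (placeholder)

for name, c in [("EC2", ec2), ("HPC", hpc)]:
    t = turnaround_time(c["queue_wait"], c["run"])
    cost = execution_cost(64, c["run"], c["price"])
    print(f"{name}: turnaround = {t:.1f} h, cost = ${cost:.2f}")
```

Under assumptions like these, the slower machine can still win on turnaround time whenever the queue wait dominates, which is exactly the trade-off the study quantifies.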
The study presented in the paper shows that, while the cloud has great potential for the broad HPC community, there is confusion about exactly how to use the cloud. This motivates us to develop a system that chooses the optimal application and system configuration on the cloud on the user's behalf. Such a system would allow EC2 users to deploy their applications with little effort; in turn, it would make EC2 a cost-effective choice and attract more users.

EC2 offers many different architectural options to the user, each with a different price and performance, but there is currently no support to help determine which of these options is the best choice given the user's specific criteria, which may be to optimize turnaround time, cost, or a combination of the two. Parameters such as problem size, application scalability, system performance, and resource cost make the problem of choosing the optimal configuration non-trivial. The problem is further complicated by the varying market prices of computational resources, such as the cheaper but variable-cost resources offered by Amazon's EC2 Spot Market. The goal is to meet the user's cost or turnaround-time bound while optimizing the other variable; a sketch of this selection problem appears below.
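The following sketch illustrates the kind of selection such a system would automate: given candidate configurations with estimated run times and node-hour prices, it picks the cheapest one whose predicted run time meets a turnaround-time bound. The instance names, prices, and run-time estimates are assumptions for illustration only; a real system would also have to model spot-price variability and application scalability.

```python
# Minimal sketch: pick the lowest-cost configuration that meets a deadline.
# Candidate configurations, prices, and run-time estimates are hypothetical.

from dataclasses import dataclass

@dataclass
class Config:
    name: str
    nodes: int
    est_run_hours: float        # predicted run time for the given problem size
    price_per_node_hour: float  # on-demand or current spot price

    @property
    def cost(self):
        return self.nodes * self.est_run_hours * self.price_per_node_hour

def cheapest_within_deadline(configs, deadline_hours):
    """Return the lowest-cost configuration whose predicted run time meets the deadline."""
    feasible = [c for c in configs if c.est_run_hours <= deadline_hours]
    return min(feasible, key=lambda c: c.cost) if feasible else None

candidates = [
    Config("16 x cc2.8xlarge (on-demand)", 16, 6.0, 2.40),
    Config("32 x cc2.8xlarge (on-demand)", 32, 3.5, 2.40),
    Config("32 x cc2.8xlarge (spot)",      32, 3.5, 0.90),  # cheaper, price may fluctuate
]

print(cheapest_within_deadline(candidates, deadline_hours=4.0))
```

The symmetric problem, minimizing turnaround time under a cost budget, can be handled the same way by filtering on cost and minimizing predicted run time.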

People

Faculty:
David Lowenthal

Postdoc:
Aniruddha Marathe


Publications

    Aniruddha Marathe, Rachel Harris, David K. Lowenthal, Bronis R. de Supinski, Barry Rountree, Martin Schulz, Xin Yuan.
    A Comparative Study of High-Performance Computing on the Cloud.
    22nd International Symposium on High-Performance Parallel and Distributed Computing (HPDC), June 2013.
    Paper: PDF

    Aniruddha Marathe, Rachel Harris, David K. Lowenthal, Bronis R. de Supinski, Barry Rountree, Martin Schulz.
    Exploiting Redundancy for Cost-Effective, Time-Constrained Execution of HPC Applications on Amazon EC2.
    23rd International Symposium on High-Performance Parallel and Distributed Computing (HPDC), June 2014.
    Paper: PDF


Intellectual Merit

The intellectual merit of the proposal lies in the design and implementation of techniques to determine automatically what cloud resources to purchase for the most cost-effective solution.

Broader Impacts

The broader impact of our proposal is in developing tools and techniques that are broadly applicable to the requirements of the many computational scientists, in both academia and industry, who need clusters for their work. Our research agenda is focused on empowering application developers by reducing their cost without sacrificing performance. More generally, our work can lower the barrier to entry for a new generation of cloud applications. In addition, it may lead cloud providers to improve the way they bundle their services.

Scale of Use

Hundreds of dedicated machines.