The University of Arizona

Events & News

MS Thesis Defense

Category: Lecture
Date: Thursday, May 8, 2014
Time: 8:00 am
Concludes: 9:00 am
Location: Gould-Simpson 701
Details: Review Committee: John Hartman, Larry Peterson & David Lowenthal
Speaker: Binil Benjamin
Title: Master of Science Candidate
Affiliation: Computer Science, University of Arizona

Towards a Cloud Based Interactive System for Genomic Data Analysis

With the advent of cluster computing frameworks such as Apache Hadoop and Apache Spark, cloud computing is becoming the mainstream platform for high-end computation. Computations that were once possible only on HPC clusters are now possible on the cloud, thanks to inexpensive clusters and the efficient abstractions offered by these frameworks. One area that can benefit greatly from this trend is bioinformatics research, where analyzing large data sets with compute-intensive algorithms is a common activity. Data in bioinformatics consists mainly of genetic sequence data, and with the advancement of sequencing technologies, the volume of genomic data that needs to be analyzed efficiently keeps growing. In this study, we consider a data analysis pipeline that compares a set of DNA samples from metagenomics and generates statistical information about their mutual relationships. We made this pipeline interactive with respect to a growing data set with the help of a data abstraction provided by the Apache Spark framework, and evaluated its performance on a cloud platform. With one set of sample data, we saw a 43% reduction in computation time as the cluster grew from 2 to 4 nodes. With the right tools and abstractions, we believe the cloud can become a cost-effective platform for bioinformatics research.
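
For readers who want to relate the reported 43% reduction to the more familiar notions of speedup and scaling efficiency, the arithmetic can be sketched as follows. This is only an illustration: it assumes the 43% figure refers to wall-clock time for the same workload on 2 and then 4 nodes, and the function names are hypothetical rather than taken from the thesis.

```python
def speedup_from_reduction(reduction: float) -> float:
    """Speedup implied by a fractional reduction in run time (0 <= reduction < 1)."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    # New time is (1 - reduction) of the old time, so speedup is its reciprocal.
    return 1.0 / (1.0 - reduction)


def scaling_efficiency(speedup: float, nodes_before: int, nodes_after: int) -> float:
    """Fraction of ideal linear scaling achieved when growing the cluster."""
    ideal_speedup = nodes_after / nodes_before
    return speedup / ideal_speedup


# Assumption: 43% less wall-clock time when going from 2 to 4 nodes.
speedup = speedup_from_reduction(0.43)
efficiency = scaling_efficiency(speedup, nodes_before=2, nodes_after=4)
print(f"speedup: {speedup:.2f}x, efficiency: {efficiency:.0%}")
```

Under these assumptions, doubling the cluster gives roughly a 1.75x speedup, or about 88% of ideal linear scaling.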