Events & News
MS Thesis Defense
Category | Lecture |
Date | Thursday, May 8, 2014 |
Time | 9:00 am |
Concludes | 10:00 am |
Location | Gould-Simpson 701 |
Details | Review Committee: John Hartman, Larry Peterson & David Lowenthal |
Speaker | Illyoung Choi |
Title | Master of Science Candidate |
Affiliation | Computer Science, University of Arizona |
H-Synthesizer: Analyzing Large-Scale Sequence Data in the Cloud
Cloud services are becoming indispensable for business and academic research. Computing resources provided over the Internet by cloud services are always available, scalable, and less expensive than traditional dedicated computing resources. Bioresearchers and medical scientists could benefit from this new paradigm. Because they deal with large genomic data sets, the advantages of cloud services (cost-efficiency, scalability and accessibility) would greatly help their research activities. However, moving their workflow from a workstation or HPC cluster to the cloud is not trivial, because the new environment differs substantially from a traditional computing environment. This paper describes H-Synthesizer, an infrastructure for supporting bioinformatics research in the cloud. The new system consists of computation modules based on Hadoop MapReduce for parallel sequence analysis and a storage layer based on Syndicate, iRODS and HDFS. The computation modules provide fast k-mer search against a group of sequences. The storage layer integrates a traditional sequence repository, iRODS, with Hadoop via Syndicate. By using caches efficiently, the storage layer lets the computation modules process large remote sequence files directly, without manually pre-staging the data. H-Synthesizer showed a 4.7% reduction in computation time compared to an existing technique used for computing k-mer mode on an HPC cluster. It also showed a 19% speedup when the number of nodes was doubled.
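
To illustrate the kind of k-mer processing mentioned in the abstract, here is a minimal single-machine sketch of k-mer counting written in a map/reduce style. It is not H-Synthesizer's code and does not use Hadoop; the function names (`map_kmers`, `reduce_counts`, `kmer_counts`) and the sample sequences are hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import chain


def map_kmers(sequence, k):
    """Map step (illustrative): emit a (k-mer, 1) pair for every window of length k."""
    seq = sequence.upper()
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k], 1


def reduce_counts(pairs):
    """Reduce step (illustrative): sum the emitted counts for each distinct k-mer."""
    counts = Counter()
    for kmer, n in pairs:
        counts[kmer] += n
    return counts


def kmer_counts(sequences, k):
    """Count k-mers across a group of sequences by chaining the map and reduce steps."""
    return reduce_counts(chain.from_iterable(map_kmers(s, k) for s in sequences))


if __name__ == "__main__":
    # Hypothetical toy input; real genomic data would be far larger.
    counts = kmer_counts(["ACGTAC", "CGTA"], k=3)
    for kmer, n in sorted(counts.items()):
        print(kmer, n)
```

In a real MapReduce job the map step would run in parallel over input splits and the framework would group pairs by key before the reduce step; the sketch collapses both into one process to keep the idea visible.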