The University of Arizona

Events & News

MS Thesis Defense

Category: Lecture
Date: Thursday, May 8, 2014
Time: 9:00 am
Concludes: 10:00 am
Location: Gould-Simpson 701
Details: Review Committee: John Hartman, Larry Peterson & David Lowenthal
Speaker: Illyoung Choi
Title: Master of Science Candidate
Affiliation: Computer Science, University of Arizona

H-Synthesizer: Analyzing Large-Scale Sequence Data in the Cloud

Cloud services are becoming indispensable for business and academic research. Computing resources provided over the Internet by cloud services are always available, scalable, and less costly than traditional dedicated computing resources. Bioresearchers and medical scientists could benefit from this new paradigm. Because they work with large genomic data sets, the advantages of cloud services (cost-efficiency, scalability, and accessibility) would greatly help their research activities. However, moving their workflows from a workstation or HPC cluster to the cloud is not trivial, because the new environment differs substantially from traditional computing environments. This paper describes H-Synthesizer, an infrastructure for supporting bioinformatics research in the cloud. The system consists of computation modules based on Hadoop MapReduce for parallel sequence analysis, and storage based on Syndicate, iRODS, and HDFS. The computation modules provide fast k-mer search against a group of sequences. The storage integrates a traditional sequence repository, iRODS, with Hadoop via Syndicate. By using caches efficiently, the storage lets the computation modules process large remote sequence files directly, without manually pre-staging the data. H-Synthesizer showed a 4.7% reduction in computation time compared with an existing technique used for computing k-mer mode on an HPC cluster. It also showed a 19% speedup when the number of nodes was doubled.
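As a rough illustration of the kind of k-mer counting and lookup that a MapReduce-style computation module performs, the sketch below splits the work into a map step (emit every length-k window of a sequence) and a reduce step (sum the counts per k-mer). It is not the H-Synthesizer implementation: it runs in memory in plain Python without Hadoop, Syndicate, or iRODS, and the function names (map_kmers, reduce_counts, count_kmers, search_kmer) are hypothetical names chosen for this example.

```python
from collections import Counter
from itertools import chain


def map_kmers(sequence, k):
    """Map step: emit a (k-mer, 1) pair for every window of length k."""
    sequence = sequence.upper()
    # Sequences shorter than k produce an empty range and emit nothing.
    return [(sequence[i:i + k], 1) for i in range(len(sequence) - k + 1)]


def reduce_counts(pairs):
    """Reduce step: sum the counts emitted for each distinct k-mer."""
    totals = Counter()
    for kmer, count in pairs:
        totals[kmer] += count
    return totals


def count_kmers(sequences, k):
    """Run the map step on each sequence, then reduce all emitted pairs."""
    mapped = chain.from_iterable(map_kmers(seq, k) for seq in sequences)
    return reduce_counts(mapped)


def search_kmer(index, kmer):
    """Look up how often a k-mer occurs across the indexed sequences."""
    return index.get(kmer.upper(), 0)
```

For example, count_kmers(["ACGTAC", "GTAC"], 2) builds a count table for all 2-mers in the two sequences, and search_kmer can then query that table for a single k-mer. In a real Hadoop job the map and reduce steps would run on separate nodes over partitioned input; here they are ordinary function calls.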