A Feedback Guided Interface For Elastic
A Feedback Guided Interface For Elastic
A Feedback Guided Interface For Elastic
elastic computing
ABSTRACT 1. INTRODUCTION
Computer Science plays an important role in today’s Genet- In recent years Computer Science became an essential
ics. New sequencing methods produce an enormous amount part in the field of Genetics. Especially through the advent
of data, pushing genetic laboratories to storage and com- of Next Generation Sequencing (NGS), whereby a human
putational limits. New approaches are needed to eliminate genome (3 billion base pairs/chromosome set) can be se-
these shortcomings and provide possibilities to reproduce quenced in acceptable time, the amount of data is growing
current solutions and algorithms in the area of Bioinformat- significantly, exceeding all known dimensions in Genetics.
ics. In this paper a system is proposed which simplifies the Figure 1 shows a comparison between the reducing DNA se-
access to computational resources and associated compu- quencing costs and Moore’s law. Moore’s law is used as a
tational models of cluster architectures, assists end users in reference to show that computer hardware can currently not
executing and monitoring developed algorithms via a web in- keeping pace with the progress in DNA sequencing. Further-
terface and provides an interface to add future developments more, the amount of complete sequenced individuals is grow-
or any kind of programs. We demonstrate on existing algo- ing exponentially from year to year [11], making new models
rithms how an integretation can be done with little effort, necessary. For instance, to store the data of one complete
making it especially useful for the evaluation and simplified human DNA (Deoxyribonucleic acid) in raw format with 30-
usage of current algorithms. times coverage, 30 terabytes of data is produced.
In the area of Copy Number Variations, a possible cause
Categories and Subject Descriptors for many complex genetic disorders, high throughput algo-
rithms are needed to process and analyze several hundred
H.4 [Information Systems Applications]: Miscellaneous gigabytes of raw input data [16] [6], yielding to a wall time
of up to one week for a typical study size [18]. This remark-
General Terms able increase of data and time causes genetic departments
to consider new ways of importing and storing data as well
Distributed System, Experimentation, Application
as improving performance of current algorithms.
Cluster architectures in connection with associated mod-
Keywords els have the potential to solve this issue, but especially for
Bioinformatics, Hadoop, MapReduce, Cloud computing small departments often gainless and unaffordable. Using
clusters on demand, also referred to Infrastructure as a Ser-
vice (IaaS), builds therefore a good opportunity to circle
these issues. To capitalize the full potential of IaaS, a com-
bination with distribution models like MapReduce [5] is for
specific applications both possible and obvious. Several iso-
lated applications [9], [10], [14] already exist using a dis-
tributed approach for storing data and processing algorithms.
But since no general system is given to execute those solu-
tions, an evaluation and reproducibility is often not feasible.
Scientists need to setup a cluster on their own or using a
provided remote cluster architecture to evaluate a published
23rd GI-Workshop on Foundations of Databases (Grundlagen von Daten- algorithm, being both time wasting and insecure for sensi-
banken), 31.05.2011 - 03.06.2011, Obergurgl, Austria. tive data.
Copyright is held by the author/owner(s).
109
95.263.072 $ Moore’s law
Cost per Genome
100.000.000 $
13.801.124 $
10.000.000 $
Cost per Genome in $ (log scale)
advent of NGS
11.732.535 $
3.063.820 $
1.000.000 $
154.714 $
100.000 $
29.092 $
10.000 $
Sept01 Sept02 Oct03 Apr04 Oct04 Apr05 Oct05 Apr06 Okt06 Apr07 Oct07 Apr08 Oct08 Apr09 Oct09 Apr10 Oct10
Mar02 Mar03 Jan04 Jul04 Jan05 Jul05 Jan06 Jul06 Jan07 Jul07 Jan08 Jul08 Jan09 Jul09 Jan10 Jul10
Date
Figure 1: Comparision of DNA sequencing cost with Moore’s law; Data from [17]
In this paper we present the idea to build an integrated by step and distributes only whole jobs among the cluster.
system for scientists in the area of Bioinformatics to (1) get
access to distributed cluster architectures and execute ex-
isting algorithms, (2) build maintainable and reproducible
3. ARCHITECTURE
workflows and (3) provide an interface to add future de- A modular architecture is suggested in Figure 2, separat-
velopments or any kind of programs to the system without ing the process of instantiate and set up a cluster (Cloud-
detailed IT knowledge. The reminder of this paper is struc- gene) from the process of monitor and run a program (EMI ).
tured as follows: Section 2 gives an overview of the related Based on open source frameworks like Apache Hadoop [2]
work. In section 3 the architecture of our suggested system and Apache Whirr [3], we implemented a prototype to ver-
is explained in more detail with potential case studies in sec- ify our approach. The user utilizes Cloudgene to set up a
tion 4. Section 5 shows necessary future work and the paper cluster architecture to his needs through XML configuration
ends with a conclusion in section 6. files. This allows adding new algorithms dynamically with-
out digging into Cloudgene to deep. A fully operable and
customized cluster is then provided, including all necessary
2. RELATED WORK user data. In a subsequent step EMI (Elastic MapReduce
Cluster solutions guided by a web-interface to execute dis- Interface) is launched on the master node of the cluster.
tributed algorithms like Myrna [9], CrossBow [10] or Cloud- EMI can be seen as an abstraction of the underlying system
Burst [13] already exist. Unfortunately, the user must login architecture from the end user, lies on top of the integrated
to the Amazon Web Services (AWS) console to monitor the programs and allows the user to communicate and interact
progress of executed jobs or to shutdown the cluster after with the cluster as well as receive feedback of currently exe-
execution. Additionaly, a data storage in S3 buckets is of- cuted workflows (see Figure 3). EMI can be disabled in case
ten required and a custom web interface needs to be imple- a program already includes an interface by its own, yield-
mented for every single approach. ing to the most general approach to execute any kind of
Galaxy [7] is a software system which facilitates the creation, developed solution. Both parts can be operated separately
execution and maintainability of pipelines in a fast and user via configuration files with clear defined input and output
friendly way. The platform itself executes the scripts and the variables.
user has the possibility to monitor the progress. Galaxy’s ex-
tension CloudMan [1] provides the possibility to install and 3.1 Cloudgene
execute Galaxy on Amazon EC2 (Elastic Compute Cloud). Amazon provides with its EC2 the currently most devel-
However, the user needs to start the master node manually oped service for public clouds in the area of IaaS. Cloudgene
by using the AWS console and Galaxy does not provide a supports besides EC2 also Rackspace [12] to provide access
native support of Hadoop programs, executes modules step to cluster infrastructure. As mentioned in the introduction
110
a combination with MapReduce is useful: In this paradigm,
the master node chops up data into chunks and distributes
it over all active worker nodes (map step). Subsequently,
the master node reassigns coherent map results to worker
nodes (sort and shuffle) to calculate the final result (reduce
Custom Programs
step). For this project Apache Hadoop’s implementation of
Hadoop EMI CloudBurst ... MapReduce and its distributed file system (HDFS) are used.
Using Whirr as a connector, Cloudgene is able to instance
a full working EC2 or Rackspace cluster for end users with
various defined properties and copies the necessary program
data and configuration files to the cluster. Examples for de-
fined variables could be the desired image, amount and kind
of instances, HDFS options, MapReduce properties and the
user’s SSH public key. Amazon already provides several pre-
Web Container
Cloudgene
111
command line. A simple and clear XML configuration file <emi>
describes the input and output parameters of the program <program>
and contains other relevant information that are necessary <name>CloudBurst</name>
to start the job (see Section 4). In addition to this file, a zip <command>
archive file exists which contains all software relevant data hadoop jar emi/cloudburst/CloudBurst.jar \
(e.g. jar file, meta data, configuration files). With those $input1 $input2 $output1 36 36 3 0 1 240 \
files, EMI automatically generates a web interface in which 48 24 24 128 16
the possibility to set each defined parameter through wiz- </command>
ards and to run the defined job by a single click is provided. <input>
As mentioned earlier, all input data must be put into the ro- <param id="1" type="hdfs">
bust and fault-tolerant HDFS. As this process is very time- <name>Reference Genome</name>
intensive an error prone, EMI supports the user by provid- <default>data/cloudburst/s_suis.br</default>
ing a wizard which enables the import of data from different </param>
sources (FTP, HTTP, Amazon S3 buckets or local file up- </input>
loads). In addition, files defined as output parameters can <input>
be exported and downloaded as a zip archive or can be up- <param id="2" type="hdfs">
loaded to Amazon S3 or FTP servers. EMI supports a multi- <name>Reads</name>
user mode whereby all data by a certain user are password <default>data/cloudburst/100k.br</default>
protected and executed jobs are scheduled through a queue </param>
system. Overall, EMI is fully independent from Cloudgene </input>
and can be installed on a local Hadoop cluster too. <output>
<param id="1" type="hdfs" merge="true">
4. CASE STUDIES <name>Results</name>
<default>data/cloudburst/results</default>
In this section we explain how new programs can be in-
</param>
tegrated into Cloudgene and EMI. Based on two different
</output>
biomedical software solutions we demonstrate the diversity
</program>
and simplicity of our approach.
</emi>
4.1 CloudBurst After the XML file is uploaded to the Cloudgene server,
CloudBurst is a parallel read-mapping algorithm to map the user starts a web browser to (1) login to Cloudgene, (2)
NGS data to the human genome and other reference genomes start up a cluster preconfigured with CloudBurst and (3)
[13]. It is implemented as a MapReduce program using run and monitor jobs with EMI (Figure 4).
Hadoop and can be executed with the following command: Compared to a standard manual approach, this eliminates
error-prone and time-consuming tasks such as (1) setting up
hadoop jar emi/cloudburst/CloudBurst.jar \ a cluster and connecting via the command line onto the mas-
reference_genome reads results 36 36 3 0 1 240 \ ter node, (2) uploading and importing data into HDFS, (3)
48 24 24 128 16 exporting final results from HDFS and downloading them
and (4) executing and reproducing MapReduce jobs with
In order to execute CloudBurst we create a configuration
different configurations via a web interface. This shows,
file for Cloudgene which starts a Hadoop cluster on Amazon
that an easy integration can be done using a simple XML
EC2 with a standard Ubuntu Linux with open Hadoop ports
configuration, supporting and guiding researchers as far as
50030 and 50070. The corresponding XML has the following
possible.
structure:
<cloudgene>
4.2 HaploGrep
<name>CloudBurst</name> HaploGrep is a reliable algorithm implemented in a web
<options> application to determine the haplogroup affiliation of thou-
<option name="provider" value="amazon-aws"/> sands of mitochondrial DNA (mtDNA) profiles genotyped
<option name="image" value="default"/> for the entire mtDNA or any part of it [8]. As HaploGrep
<option name="service" value="hadoop"/> provides its own web interface we do not need to install EMI.
<option name="emi" value="true"/> Since it does not use the Hadoop service either, we note this
<option name="ports" value="50030 50070"/> option in the configuration as well. HaploGrep listens on
</options> the ports 80 (http) and 443 (https), therefore this ports are
</cloudgene> marked as open. The configuration file for Cloudgene with
all requirements looks as follows:
As CloudBurst has no graphical user interface, we install
EMI on the Amazon EC2 cluster and use it for user inter- <cloudgene>
actions. For this purpose the command above with its ar- <name>Haplogrep</name>
guments must be translated into the following configuration <options>
file: <option name="provider" value="amazon-aws"/>
<option name="image" value="default"/>
<option name="service" value="none"/>
<option name="emi" value="false"/>
<option name="ports" value="80 443"/>
112
Figure 4: Workflow based on CloudBurst
113
10–10, Berkeley, CA, USA, 2004. USENIX
Association.
[6] L. Forer, S. Schönherr, H. Weißensteiner, F. Haider,
T. Kluckner, C. Gieger, H. E. Wichmann, G. Specht,
F. Kronenberg, and A. Kloss-Brandstätter. CONAN:
copy number variation analysis software for
genome-wide association studies. BMC
Bioinformatics, 11:318, 2010.
[7] J. Goecks, A. Nekrutenko, J. Taylor, E. Afgan,
G. Ananda, D. Baker, D. Blankenberg,
R. Chakrabarty, N. Coraor, J. Goecks, G. Von Kuster,
R. Lazarus, K. Li, A. Nekrutenko, J. Taylor, and
K. Vincent. Galaxy: a comprehensive approach for
supporting accessible, reproducible, and transparent
computational research in the life sciences. Genome
Biol., 11:R86, 2010.
[8] A. Kloss-Brandstättter, D. Pacher, S. Schönherr,
H. Weißensteiner, R. Binna, G. Specht, and
F. Kronenberg. HaploGrep: a fast and reliable
algorithm for automatic classification of mitochondrial
DNA haplogroups. Hum. Mutat., 32:25–32, Jan 2011.
[9] B. Langmead, K. D. Hansen, and J. T. Leek.
Cloud-scale RNA-sequencing differential expression
analysis with Myrna. Genome Biol., 11:R83, 2010.
[10] B. Langmead, M. C. Schatz, J. Lin, M. Pop, and S. L.
Salzberg. Searching for SNPs with cloud computing.
Genome Biol., 10:R134, 2009.
[11] R. E. Mills et al. Mapping copy number variation by
population-scale genome sequencing. Nature,
470:59–65, Feb 2011.
[12] Rackspace. http://www.rackspace.com.
[13] M. C. Schatz. CloudBurst: highly sensitive read
mapping with MapReduce. Bioinformatics,
25:1363–1369, Jun 2009.
[14] M. C. Schatz. The missing graphical user interface for
genomics. Genome Biol., 11:128, 2010.
[15] L. Shi et al. The balance of reproducibility, sensitivity,
and specificity of lists of differentially expressed genes
in microarray studies. BMC Bioinformatics, 9 Suppl
9:S10, 2008.
[16] K. Wang, M. Li, D. Hadley, R. Liu, J. Glessner,
S. F. A. Grant, H. Hakonarson, and M. Bucan.
PennCNV: An integrated hidden Markov model
designed for high-resolution copy number variation
detection in whole-genome SNP genotyping data.
Genome Research, 17(11):1665–1674, Nov. 2007.
[17] Wetterstrand, K. A. DNA Sequencing Costs: Data
from the NHGRI Large-Scale Genome Sequencing
Program Available:
http://www.genome.gov/sequencingcosts; Accessed
04/11/11.
[18] H. E. Wichmann, C. Gieger, and T. Illig.
KORA-gen–resource for population genetics, controls
and a broad spectrum of disease phenotypes.
Gesundheitswesen, 67 Suppl 1:26–30, Aug 2005.
114