A feedback guided interface for elastic computing

Sebastian Schönherr1,2,∗, Lukas Forer1,2,∗, Hansi Weißensteiner1,2,
Florian Kronenberg2, Günther Specht1, Anita Kloss-Brandstätter2

∗ contributed equally

1 Databases and Information Systems, Institute of Computer Science, University of Innsbruck, Austria, [email protected]
2 Division of Genetic Epidemiology, Department of Medical Genetics, Molecular and Clinical Pharmacology, Innsbruck Medical University, Austria, [email protected]

ABSTRACT
Computer Science plays an important role in today's Genetics. New sequencing methods produce an enormous amount of data, pushing genetic laboratories to storage and computational limits. New approaches are needed to eliminate these shortcomings and provide possibilities to reproduce current solutions and algorithms in the area of Bioinformatics. In this paper a system is proposed which simplifies the access to computational resources and associated computational models of cluster architectures, assists end users in executing and monitoring developed algorithms via a web interface and provides an interface to add future developments or any kind of programs. We demonstrate on existing algorithms how an integration can be done with little effort, making the system especially useful for the evaluation and simplified usage of current algorithms.

Categories and Subject Descriptors
H.4 [Information Systems Applications]: Miscellaneous

General Terms
Distributed System, Experimentation, Application

Keywords
Bioinformatics, Hadoop, MapReduce, Cloud computing

1. INTRODUCTION
In recent years Computer Science has become an essential part of the field of Genetics. Especially through the advent of Next Generation Sequencing (NGS), whereby a human genome (3 billion base pairs per chromosome set) can be sequenced in acceptable time, the amount of data is growing significantly, exceeding all previously known dimensions in Genetics. Figure 1 shows a comparison between the falling DNA sequencing costs and Moore's law. Moore's law is used as a reference to show that computer hardware currently cannot keep pace with the progress in DNA sequencing. Furthermore, the number of completely sequenced individuals is growing exponentially from year to year [11], making new models necessary. For instance, storing the data of one complete human DNA (deoxyribonucleic acid) sample in raw format with 30-times coverage produces 30 terabytes of data.

In the area of Copy Number Variations, a possible cause of many complex genetic disorders, high-throughput algorithms are needed to process and analyze several hundred gigabytes of raw input data [16], [6], yielding a wall time of up to one week for a typical study size [18]. This remarkable increase in data and time causes genetic departments to consider new ways of importing and storing data as well as improving the performance of current algorithms.

Cluster architectures in connection with associated models have the potential to solve this issue, but they are often unprofitable and unaffordable, especially for small departments. Using clusters on demand, also referred to as Infrastructure as a Service (IaaS), is therefore a good opportunity to circumvent these issues. To capitalize on the full potential of IaaS, a combination with distribution models like MapReduce [5] is both possible and obvious for specific applications. Several isolated applications [9], [10], [14] already exist that use a distributed approach for storing data and processing algorithms. But since no general system exists to execute those solutions, evaluation and reproducibility are often not feasible. Scientists need to set up a cluster on their own or use a provided remote cluster architecture to evaluate a published algorithm, which is both time-wasting and insecure for sensitive data.

23rd GI-Workshop on Foundations of Databases (Grundlagen von Datenbanken), 31.05.2011 - 03.06.2011, Obergurgl, Austria.
Copyright is held by the author/owner(s).

Figure 1: Comparison of DNA sequencing cost with Moore's law; data from [17]. (The chart plots cost per genome in $ on a log scale from Sept 2001, about $95,263,072, to Oct 2010, about $29,092, with the advent of NGS marked.)
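The gap that Figure 1 illustrates can be roughly quantified from the chart's endpoint values (about $95,263,072 per genome in September 2001 versus $29,092 in October 2010; these numbers are read off the figure, so approximate). A short awk sketch, our illustration rather than a calculation from the paper, compares the observed cost halvings with the halvings Moore's law predicts for the same period:

```shell
# Compare the halvings in sequencing cost per genome (Figure 1 endpoints)
# with the number of halvings Moore's law (one per 18 months) predicts
# over the same ~9.1 years. The dollar values are read off the chart.
awk 'BEGIN {
  halvings = log(95263072 / 29092) / log(2)  # observed cost halvings
  moore    = (9.1 * 12) / 18                 # halvings Moore predicts
  printf "cost halvings: %.1f, Moore halvings: %.1f\n", halvings, moore
}'
```

Sequencing cost halved almost twice as often as Moore's law would predict for hardware, which is exactly the point of the comparison in the figure.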

In this paper we present the idea to build an integrated system for scientists in the area of Bioinformatics to (1) get access to distributed cluster architectures and execute existing algorithms, (2) build maintainable and reproducible workflows and (3) provide an interface to add future developments or any kind of programs to the system without detailed IT knowledge. The remainder of this paper is structured as follows: Section 2 gives an overview of related work. In Section 3 the architecture of our suggested system is explained in more detail, with potential case studies in Section 4. Section 5 shows necessary future work and the paper ends with a conclusion in Section 6.

2. RELATED WORK
Cluster solutions guided by a web interface to execute distributed algorithms, like Myrna [9], CrossBow [10] or CloudBurst [13], already exist. Unfortunately, the user must log in to the Amazon Web Services (AWS) console to monitor the progress of executed jobs or to shut down the cluster after execution. Additionally, data storage in S3 buckets is often required and a custom web interface needs to be implemented for every single approach.
Galaxy [7] is a software system which facilitates the creation, execution and maintainability of pipelines in a fast and user-friendly way. The platform itself executes the scripts and the user has the possibility to monitor the progress. Galaxy's extension CloudMan [1] provides the possibility to install and execute Galaxy on Amazon EC2 (Elastic Compute Cloud). However, the user needs to start the master node manually by using the AWS console, and Galaxy does not provide native support for Hadoop programs; it executes modules step by step and distributes only whole jobs among the cluster.

3. ARCHITECTURE
A modular architecture is suggested in Figure 2, separating the process of instantiating and setting up a cluster (Cloudgene) from the process of monitoring and running a program (EMI). Based on open source frameworks like Apache Hadoop [2] and Apache Whirr [3], we implemented a prototype to verify our approach. The user utilizes Cloudgene to set up a cluster architecture according to his needs through XML configuration files. This allows adding new algorithms dynamically without digging into Cloudgene too deeply. A fully operable and customized cluster is then provided, including all necessary user data. In a subsequent step EMI (Elastic MapReduce Interface) is launched on the master node of the cluster. EMI can be seen as an abstraction of the underlying system architecture from the end user; it lies on top of the integrated programs and allows the user to communicate and interact with the cluster as well as receive feedback on currently executed workflows (see Figure 3). EMI can be disabled in case a program already includes an interface of its own, yielding the most general approach to execute any kind of developed solution. Both parts can be operated separately via configuration files with clearly defined input and output variables.

3.1 Cloudgene
Amazon provides with its EC2 the currently most developed service for public clouds in the area of IaaS. Besides EC2, Cloudgene also supports Rackspace [12] to provide access to cluster infrastructure. As mentioned in the introduction,

a combination with MapReduce is useful: In this paradigm, the master node chops up data into chunks and distributes them over all active worker nodes (map step). Subsequently, the master node reassigns coherent map results to worker nodes (sort and shuffle) to calculate the final result (reduce step). For this project, Apache Hadoop's implementation of MapReduce and its distributed file system (HDFS) are used.

Figure 2: Architecture of the suggested system including Cloudgene and EMI. (The diagram shows a web container hosting Cloudgene with Restlet, ExtJS, Whirr and a config manager with XML access, connected to custom programs such as Hadoop EMI and CloudBurst.)

Using Whirr as a connector, Cloudgene is able to instantiate a fully working EC2 or Rackspace cluster for end users with various defined properties and copies the necessary program data and configuration files to the cluster. Examples of defined variables are the desired image, the number and type of instances, HDFS options, MapReduce properties and the user's SSH public key. Amazon already provides several predefined images for all sorts of use cases which can be used with Cloudgene (e.g. http://www.cloudbiolinux.com). Cloudgene takes over the customization of predefined images and installs services like MapReduce, in our case included in Cloudera's distribution of Apache Hadoop [4]. The cluster configuration is defined in an XML-based file format, including all necessary information for a successful cluster boot. Cloudgene routinely checks if new configurations are added and offers the possibility to execute newly defined programs. Since EC2 uses a pay-per-use model, end users must provide their Amazon Access ID and Secret Key, which is transferred via Cloudgene to Amazon in a secure way. Alternatively, Cloudgene can also be launched on any machine with Java installed, eliminating the transfer via our server.

Cloudgene solves one important issue and gives genetic departments access to computational power and storage. A still unresolved problem is the lack of a graphical user interface to control jobs derived from command line based applications. Especially the need to put enormous amounts of local data into HDFS has to be considered. To overcome these shortcomings, a user interface (EMI) was designed.
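The map, sort-and-shuffle, and reduce phases described above can be imitated with ordinary Unix pipes. The following word-count one-liner is only an illustration of the paradigm (it is not Cloudgene or EMI code and involves no Hadoop):

```shell
# map: emit one (word, 1) pair per word; shuffle: sort groups equal keys;
# reduce: sum the counts per key -- the same three phases Hadoop runs
# distributed over the worker nodes of a cluster.
printf 'a b a\nb a\n' |
  tr ' ' '\n' |           # map: one word per line
  sed 's/$/ 1/' |         # map: attach the count 1
  sort |                  # sort and shuffle: group equal keys
  awk '{c[$1] += $2} END {for (k in c) print k, c[k]}'   # reduce: sum
```

Here the input "a b a / b a" yields the counts a=3 and b=2; in Hadoop the same logic is spread over many machines, with HDFS holding the chunks.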

3.2 Efficient MapReduce Interface (EMI)

Running Hadoop MapReduce programs on a cluster requires the execution of several non-trivial steps: First, the user must upload all input data to the master node, copy the data into the proprietary HDFS, run the Hadoop MapReduce job, export the results from the file system and finally download them to the local workstation. For researchers without expertise in Computer Science, these tasks turn out to be very challenging. For this purpose we developed EMI, which facilitates the execution, monitoring and evaluation of MapReduce jobs. A web interface, which runs on the master node of the cluster, enables the execution of jobs through well-structured wizards, setting all required parameters step by step. As several studies have shown, reproducibility of data analysis is one of the greatest problems in biomedical publications [15]. For this purpose the execution of a MapReduce job with its parameters and input data is logged; thus a fast comparison of experiments with different settings is possible. Moreover, the user always has full control over the execution of each job and can monitor its current progress and status. All running jobs are listed, whereby the progress of the map and reduce phases is displayed separately. Since using resources from Amazon costs money, EMI informs the user about the uptime of the cluster and the number of rented instances (Figure 4).

Figure 3: Workflow of the system including Cloudgene and EMI

The modular architecture enables a fast integration of any Hadoop job which could normally be executed through the command line.
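For illustration, the manual sequence that EMI automates (importing input into HDFS, running the job, exporting the results) can be sketched as a small shell wrapper. The function below is our sketch, not EMI code; the jar name and HDFS paths are illustrative placeholders, while hadoop fs -put/-get and hadoop jar are the standard Hadoop command line calls:

```shell
# Sketch of the manual steps EMI replaces (not EMI code; jar name and
# HDFS paths are illustrative placeholders).
run_hadoop_job() {
  jar=$1; input=$2; output=$3
  hadoop fs -put "$input" "in/$input"     || return 1   # import into HDFS
  hadoop jar "$jar" "in/$input" "$output" || return 1   # run MapReduce job
  hadoop fs -get "$output" results/       || return 1   # export from HDFS
}
# e.g.: run_hadoop_job CloudBurst.jar reads.br out
```

EMI wraps exactly this kind of sequence behind wizards, so that a failed put or a forgotten export cannot silently invalidate an experiment.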
A simple and clear XML configuration file describes the input and output parameters of the program and contains other relevant information that is necessary to start the job (see Section 4). In addition to this file, a zip archive exists which contains all software-relevant data (e.g. jar file, meta data, configuration files). With these files, EMI automatically generates a web interface which provides the possibility to set each defined parameter through wizards and to run the defined job with a single click. As mentioned earlier, all input data must be put into the robust and fault-tolerant HDFS. As this process is very time-intensive and error-prone, EMI supports the user by providing a wizard which enables the import of data from different sources (FTP, HTTP, Amazon S3 buckets or local file uploads). In addition, files defined as output parameters can be exported and downloaded as a zip archive or uploaded to Amazon S3 or FTP servers. EMI supports a multi-user mode whereby all data of a certain user are password protected and executed jobs are scheduled through a queue system. Overall, EMI is fully independent from Cloudgene and can also be installed on a local Hadoop cluster.

4. CASE STUDIES
In this section we explain how new programs can be integrated into Cloudgene and EMI. Based on two different biomedical software solutions we demonstrate the diversity and simplicity of our approach.

4.1 CloudBurst
CloudBurst is a parallel read-mapping algorithm to map NGS data to the human genome and other reference genomes [13]. It is implemented as a MapReduce program using Hadoop and can be executed with the following command:

hadoop jar emi/cloudburst/CloudBurst.jar \
reference_genome reads results 36 36 3 0 1 240 \
48 24 24 128 16

In order to execute CloudBurst we create a configuration file for Cloudgene which starts a Hadoop cluster on Amazon EC2 with a standard Ubuntu Linux image and open Hadoop ports 50030 and 50070. The corresponding XML has the following structure:

<cloudgene>
<name>CloudBurst</name>
<options>
<option name="provider" value="amazon-aws"/>
<option name="image" value="default"/>
<option name="service" value="hadoop"/>
<option name="emi" value="true"/>
<option name="ports" value="50030 50070"/>
</options>
</cloudgene>

As CloudBurst has no graphical user interface, we install EMI on the Amazon EC2 cluster and use it for user interactions. For this purpose the command above with its arguments must be translated into the following configuration file:

<emi>
<program>
<name>CloudBurst</name>
<command>
hadoop jar emi/cloudburst/CloudBurst.jar \
$input1 $input2 $output1 36 36 3 0 1 240 \
48 24 24 128 16
</command>
<input>
<param id="1" type="hdfs">
<name>Reference Genome</name>
<default>data/cloudburst/s_suis.br</default>
</param>
</input>
<input>
<param id="2" type="hdfs">
<name>Reads</name>
<default>data/cloudburst/100k.br</default>
</param>
</input>
<output>
<param id="1" type="hdfs" merge="true">
<name>Results</name>
<default>data/cloudburst/results</default>
</param>
</output>
</program>
</emi>

After the XML file is uploaded to the Cloudgene server, the user starts a web browser to (1) log in to Cloudgene, (2) start up a cluster preconfigured with CloudBurst and (3) run and monitor jobs with EMI (Figure 4).
Compared to a standard manual approach, this eliminates error-prone and time-consuming tasks such as (1) setting up a cluster and connecting via the command line to the master node, (2) uploading and importing data into HDFS, (3) exporting final results from HDFS and downloading them and (4) executing and reproducing MapReduce jobs with different configurations via a web interface. This shows that an easy integration can be done using a simple XML configuration, supporting and guiding researchers as far as possible.

4.2 HaploGrep
HaploGrep is a reliable algorithm implemented in a web application to determine the haplogroup affiliation of thousands of mitochondrial DNA (mtDNA) profiles genotyped for the entire mtDNA or any part of it [8]. As HaploGrep provides its own web interface, we do not need to install EMI. Since it does not use the Hadoop service either, we note this option in the configuration as well. HaploGrep listens on ports 80 (http) and 443 (https), therefore these ports are marked as open. The configuration file for Cloudgene with all requirements looks as follows:

<cloudgene>
<name>Haplogrep</name>
<options>
<option name="provider" value="amazon-aws"/>
<option name="image" value="default"/>
<option name="service" value="none"/>
<option name="emi" value="false"/>
<option name="ports" value="80 443"/>

Figure 4: Workflow based on CloudBurst

</options>
</cloudgene>

After the cluster setup is finalized, Cloudgene returns a web address which points to the installed instance of HaploGrep.

5. FUTURE WORK
One of the biggest advantages of IaaS is the ability to change the number of needed data nodes on demand. Thus, the next version of Cloudgene is conceived to provide functions for adding and removing instances during runtime. Currently, clusters started with Cloudgene are not data-persistent, which leads to data loss after a shutdown is performed. For this purpose we plan to store all results on persistent Amazon EBS volumes. Furthermore, a simple user interface for Hadoop is not only useful for the end user but also for developers. It supports them during the whole prototyping and testing process of novel MapReduce algorithms by highlighting performance bottlenecks. Thus, we plan to implement time measurements of the map, reduce and shuffle phases and to visualize them in an intuitive chart. Additionally, Hadoop plans in its next generation to support programming paradigms alternative to MapReduce, which is particularly important for applications (e.g. K-Means) where custom frameworks outperform MapReduce by an order of magnitude.

6. CONCLUSION
We presented a software system for running and maintaining elastic computer clusters. Our approach combines the individual steps of setting up a cluster into a user-friendly system. Its modular architecture enables a fast integration of any Hadoop job which could previously only be executed through the command line. By hiding the low-level informatics, it is an ideal system for researchers without deeper knowledge in Computer Science. Moreover, our system is not restricted to the life sciences and can be used in nearly every application area. Overall, it is a first approach to narrow the gap between cloud computing and usability.

7. ACKNOWLEDGMENTS
Sebastian Schönherr was supported by a scholarship from the University of Innsbruck (Doktoratsstipendium aus der Nachwuchsförderung, MIP10/2009/3). Hansi Weißensteiner was supported by a scholarship from the Autonomous Province of Bozen/Bolzano (South Tyrol). The project was supported by the Amazon Research Grant. We thank the Whirr mailing list, especially Tom White and Andrei Savu, for their assistance.

8. REFERENCES
[1] E. Afgan, D. Baker, N. Coraor, B. Chapman, A. Nekrutenko, and J. Taylor. Galaxy CloudMan: delivering cloud compute clusters. BMC Bioinformatics, 11 Suppl 12:S4, 2010.
[2] Apache Hadoop. http://hadoop.apache.org.
[3] Apache Whirr. http://incubator.apache.org/whirr/.
[4] Cloudera. http://www.cloudera.com/.
[5] J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. In OSDI'04: Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation, pages
[6] L. Forer, S. Schönherr, H. Weißensteiner, F. Haider,
T. Kluckner, C. Gieger, H. E. Wichmann, G. Specht,
F. Kronenberg, and A. Kloss-Brandstätter. CONAN:
copy number variation analysis software for
genome-wide association studies. BMC
Bioinformatics, 11:318, 2010.
[7] J. Goecks, A. Nekrutenko, J. Taylor, E. Afgan,
G. Ananda, D. Baker, D. Blankenberg,
R. Chakrabarty, N. Coraor, G. Von Kuster,
R. Lazarus, K. Li, and K. Vincent. Galaxy: a
comprehensive approach for supporting accessible,
reproducible, and transparent computational research
in the life sciences. Genome Biol., 11:R86, 2010.
[8] A. Kloss-Brandstätter, D. Pacher, S. Schönherr,
H. Weißensteiner, R. Binna, G. Specht, and
F. Kronenberg. HaploGrep: a fast and reliable
algorithm for automatic classification of mitochondrial
DNA haplogroups. Hum. Mutat., 32:25–32, Jan 2011.
[9] B. Langmead, K. D. Hansen, and J. T. Leek.
Cloud-scale RNA-sequencing differential expression
analysis with Myrna. Genome Biol., 11:R83, 2010.
[10] B. Langmead, M. C. Schatz, J. Lin, M. Pop, and S. L.
Salzberg. Searching for SNPs with cloud computing.
Genome Biol., 10:R134, 2009.
[11] R. E. Mills et al. Mapping copy number variation by
population-scale genome sequencing. Nature,
470:59–65, Feb 2011.
[12] Rackspace. http://www.rackspace.com.
[13] M. C. Schatz. CloudBurst: highly sensitive read
mapping with MapReduce. Bioinformatics,
25:1363–1369, Jun 2009.
[14] M. C. Schatz. The missing graphical user interface for
genomics. Genome Biol., 11:128, 2010.
[15] L. Shi et al. The balance of reproducibility, sensitivity,
and specificity of lists of differentially expressed genes
in microarray studies. BMC Bioinformatics, 9 Suppl
9:S10, 2008.
[16] K. Wang, M. Li, D. Hadley, R. Liu, J. Glessner,
S. F. A. Grant, H. Hakonarson, and M. Bucan.
PennCNV: An integrated hidden Markov model
designed for high-resolution copy number variation
detection in whole-genome SNP genotyping data.
Genome Research, 17(11):1665–1674, Nov. 2007.
[17] K. A. Wetterstrand. DNA sequencing costs: data
from the NHGRI Large-Scale Genome Sequencing
Program. http://www.genome.gov/sequencingcosts;
accessed 04/11/11.
[18] H. E. Wichmann, C. Gieger, and T. Illig.
KORA-gen–resource for population genetics, controls
and a broad spectrum of disease phenotypes.
Gesundheitswesen, 67 Suppl 1:26–30, Aug 2005.
