Bio in For Matics
ABSTRACT
The emerging biochip technology has made it possible to simultaneously study the expression (activity level) of thousands of genes or proteins in a single laboratory experiment. However, to extract relevant biological knowledge from biochip experimental data, it is critical not only to analyze the experimental data but also to cross-reference and correlate these large volumes of data with information available in external biological databases accessible online. We address this problem in a comprehensive system for knowledge management in bioinformatics called e2e. To the biologist or to biological applications, e2e exposes a common semantic view of the interrelationships among biological concepts in the form of an XML representation called eXpressML, while internally it can use any data integration solution to retrieve data and return results corresponding to the semantic view. We have implemented an e2e prototype that enables a biologist to analyze her gene expression data, supplied in GEML or taken from a public site such as Stanford, and to discover knowledge through operations such as querying relevant annotated data represented in eXpressML, using pathway data from KEGG, publication data from Medline, and protein data from SWISS-PROT.
[Figure. Labels: Client; Sequence Alignment; Medical Literature Summarization; Biochemical Pathway Recognition; Biochips; Query Interface ("Biologist Cares Only Here"); Wrappers; External Sources (biochip, pathway, literature, sequence).]
1. INTRODUCTION
DNA-based microarray technologies [14] have been used extensively to generate the expression levels of all or most of the genes of several organisms under a variety of experimental conditions. Specialized repositories and data warehousing projects are being built (e.g., NCBI's Gene Expression Omnibus (GEO, footnote 1) and the Stanford Microarray Database (SMD, footnote 2)) to store the vast quantities of data that are being generated by the biochips. A biologist starts by analyzing the gene expression data for insightful patterns among clusters of genes. Once a gene cluster is obtained, the biologist's main interest lies in finding the underlying biological mechanisms and functions that cause these genes to be co-expressed, and in assigning biological significance to the cluster. The biological relations among these genes span multidisciplinary islands of biology. Downstream annotation involves combining expression data with other sources of information to improve the range and quality of the conclusions that can be drawn. However, related biomedical data sources [3] are numerous, so our focus is to develop an infrastructural framework for building knowledge discovery tools for microarrays that can leverage related, diverse, and continuously updated online data.
General Terms
Management, Performance
Keywords
Knowledge Management, Bioinformatics, Biochips
Lead contact. Author names appear in alphabetic order.
# Formerly with IBM.
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
CIKM'02, November 4-9, 2002, McLean, Virginia, USA.
Copyright 2002 ACM 1-58113-492-4/02/0011 ...$5.00.
1 http://www.ncbi.nlm.nih.gov/geo/
2 http://genome-www4.Stanford.EDU/MicroArray/SMD/
[Figure: A Biological Environment for Knowledge Discourse and Relevant Tool Flows. Labels: Microarray Data; Clustered Genes; Novel KM (Knowledge Management) Apps; Pathways Analysis; Protein Structure Analysis; Biomedical Literature Summarization; Sequence Data Analysis; Chemical Compound Analysis; Visualization; Semantic Domain Model; Query.]
2. BACKGROUND
links and not invoking the sources directly. SRS [7] is an example of this approach. Though the link-driven approach is very convenient for non-expert users and provides limited keyword search over the content of a source, it does not scale well and has no cross-source capabilities. Another approach is view integration, in which a virtual global schema is created in a common data model from the descriptions of the individual sources, so that the user can declaratively pose queries on the common data model that may span the content of multiple sources. The system automatically works out how data has to be retrieved from the different sources [12]. A variation of view integration is the warehousing approach, in which the global schema is instantiated, i.e., all data of interest in remote sources is replicated and maintained locally for predictable performance. Examples are IBM's DiscoveryLink [9] and Kleisli [5], which provide powerful querying capabilities but fail to provide the in-depth analyses offered by the point solutions.
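The view-integration idea described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the design of DiscoveryLink, Kleisli, or e2e: the source names, field names, and sample records are all hypothetical, and each "source" is an in-memory list standing in for a remote database behind a wrapper.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List

# Hypothetical in-memory stand-ins for two remote sources with different schemas.
PROTEIN_SOURCE = [
    {"acc": "P00001", "gene_name": "geneA", "desc": "example kinase"},
    {"acc": "P00002", "gene_name": "geneB", "desc": "example transporter"},
]
PATHWAY_SOURCE = [
    {"pathway_id": "path01", "members": ["geneA", "geneC"]},
    {"pathway_id": "path02", "members": ["geneB"]},
]


@dataclass
class Wrapper:
    """Pairs a source name with a function that emits global-schema records."""
    name: str
    fetch: Callable[[str], Iterable[Dict[str, str]]]


def protein_wrapper(gene: str):
    # Translate the protein source's native rows into the global schema.
    for row in PROTEIN_SOURCE:
        if row["gene_name"] == gene:
            yield {"gene": gene, "kind": "protein", "value": row["acc"]}


def pathway_wrapper(gene: str):
    # Translate the pathway source's native rows into the global schema.
    for row in PATHWAY_SOURCE:
        if gene in row["members"]:
            yield {"gene": gene, "kind": "pathway", "value": row["pathway_id"]}


class Mediator:
    """Answers a query on the global schema by fanning out to every wrapper."""

    def __init__(self, wrappers):
        self.wrappers = list(wrappers)

    def annotations(self, gene: str) -> List[Dict[str, str]]:
        results = []
        for wrapper in self.wrappers:
            for record in wrapper.fetch(gene):
                results.append({**record, "source": wrapper.name})
        return results


mediator = Mediator([
    Wrapper("proteins", protein_wrapper),
    Wrapper("pathways", pathway_wrapper),
])
```

Calling `mediator.annotations("geneA")` returns records from both sources in one uniform shape; the caller never sees the native field names. A warehousing variant would run the same wrappers ahead of time and store their output locally instead of fetching on each query.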
A biologist wants to access only relevant data that she can easily correlate in pursuit of understanding the biochip assay. Hence, what is needed is semantic integration, in which the user sees domain concepts such as proteins and pathways while infrastructural artifacts such as source names (SWISS-PROT, KEGG, etc.) and attribute fields (protein id, etc.) are handled transparently by the system. A system related to our definition of semantic integration is TAMBIS [8], in which a common ontology of about 1900 terms is constructed to describe the concepts and relationships in molecular biology. Users interact with TAMBIS in the ontological realm, while the system internally maps their requests to source schemas using Kleisli [5] as its data integration middleware. However, TAMBIS is not targeted at microarrays and does not provide the full spectrum of query and analytical capabilities (breadth) needed for making biological knowledge discoveries from biochip data.
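To make the distinction concrete, the following Python sketch shows one way a request phrased in domain concepts could be resolved to source names and attribute fields behind the scenes. It is an assumption-laden illustration, not how TAMBIS or e2e performs the mapping: the table `CONCEPT_MAP`, the functions `resolve` and `plan`, and the field name `pathway_name` are hypothetical.

```python
# Hypothetical mapping from (concept, attribute) to (source, field).
# The user only ever names the left-hand side.
CONCEPT_MAP = {
    ("protein", "identifier"): ("SWISS-PROT", "protein_id"),
    ("pathway", "name"): ("KEGG", "pathway_name"),
}


def resolve(concept: str, attribute: str):
    """Return the (source, field) pair that backs a domain-level request."""
    try:
        return CONCEPT_MAP[(concept, attribute)]
    except KeyError:
        raise LookupError(
            f"no source registered for {concept}.{attribute}"
        ) from None


def plan(requests):
    """Group domain-level requests by the source that can answer them."""
    by_source = {}
    for concept, attribute in requests:
        source, field = resolve(concept, attribute)
        by_source.setdefault(source, []).append(field)
    return by_source
```

For example, `plan([("protein", "identifier"), ("pathway", "name")])` groups the two requests under SWISS-PROT and KEGG respectively, so the source names appear only in the internal plan and never in the user's query.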
3. E2E FRAMEWORK
We now discuss e2e, in which the semantic relationships among biological concepts are represented in the XML representation eXpressML [1] and analytical KM tools can work from this
3.2 KM Layer
The KM layer consists of two types of applications: (a) tools for detecting gene expression patterns, which support clustering, classification, and visualization of biochip experimental data, and (b) downstream annotation tools, which combine expression data with other sources of information to improve the range and quality of the conclusions that can be drawn. Below, we describe some of the implemented front-end tools in detail, but note that new tools can be built that take as input a group of genes and, optionally, a subset of the data represented in eXpressML.
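As a rough sketch of such a tool, the Python fragment below takes a group of genes plus an XML annotation document and reports the pathways shared by every gene in the group. The XML layout (`annotations`, `gene`, `pathway`) is purely hypothetical and is not the eXpressML schema, which is defined in [1]; the function name `shared_pathways` is likewise an assumption made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation document; NOT the real eXpressML schema.
SAMPLE = """
<annotations>
  <gene id="geneA"><pathway>path01</pathway></gene>
  <gene id="geneB"><pathway>path01</pathway><pathway>path02</pathway></gene>
  <gene id="geneC"><pathway>path03</pathway></gene>
</annotations>
"""


def shared_pathways(genes, xml_text):
    """Return the sorted pathways annotated on every gene in the group."""
    wanted = set(genes)
    if not wanted:
        return []
    root = ET.fromstring(xml_text)
    per_gene = {}
    for node in root.findall("gene"):
        gene_id = node.get("id")
        if gene_id in wanted:
            per_gene[gene_id] = {p.text for p in node.findall("pathway")}
    if set(per_gene) != wanted:
        return []  # at least one gene has no annotation record
    return sorted(set.intersection(*per_gene.values()))
```

The point of the sketch is the interface shape: the tool depends only on a gene list and an annotation document, not on which external source supplied the annotations.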
4. KM APPLICATIONS
http://www.geml.org
http://www.mged.org/Workgroups/MAGE/mage.html
http://db.cis.upenn.edu/Kweelt/
The biomedical literature databases are a rich source of information from various disciplines of the biomedical sciences. Text mining of these databases can be used to augment, confirm, or discover biologically significant information for gene clusters spanning different biological domains. The main challenges in handling biomedical citations are: (1) Querying on even a small cluster of genes retrieves tens of thousands of documents. (2) The use of multiple names and conventions in referring to genes makes it difficult to cross-reference documents with gene names. (3) Non-uniform nomenclature and language usage for the same biological concepts make text mining of the retrieved citations difficult. (4) The interrelations among biological processes across multiple biological domains are highly complex and parallel.
We have developed a specialized text-mining system called the MedMeSH summarizer [10] that provides a summary of the citations pertaining to a group of genes in a given cluster. The MedMeSH summarizer uses PubMed as the literature database and provides an automated document extraction and summarization solution. PubMed, the most widely used biomedical literature database, has more than 11 million citations (since 1960), and about 30,000 new citations are added each month. The user is required to provide only a list of genes (a gene cluster) as input. The output is a summary of the documents that shows the most important MeSH terms describing the whole cluster and covers all biological domains.
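The input and output shape of such a summarizer can be illustrated with a small Python sketch. This is not the MedMeSH algorithm, which is described in [10]; it swaps in a deliberately simple stand-in that ranks each term by the number of genes in the cluster having at least one citation indexed with it. The citation data, the function name `summarize`, and the parameter `top_k` are hypothetical.

```python
from collections import Counter

# Hypothetical data: each gene maps to its citations, and each citation
# is the set of MeSH-style terms it is indexed with.
CITATIONS = {
    "geneA": [{"Apoptosis", "Humans"}, {"Apoptosis", "Cell Cycle"}],
    "geneB": [{"Apoptosis", "Humans"}],
    "geneC": [{"Cell Cycle", "Humans"}],
}


def summarize(genes, citations, top_k=3):
    """Rank terms by how many genes in the cluster have a citation with them."""
    gene_counts = Counter()
    for gene in genes:
        terms = set()
        for citation_terms in citations.get(gene, []):
            terms |= citation_terms
        gene_counts.update(terms)  # each gene counts a term at most once
    # Sort by descending gene count, breaking ties alphabetically.
    ranked = sorted(gene_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_k]
```

On the sample data the generic term "Humans" ranks first because it is attached to every gene, which shows why a plain count is only a starting point: a practical summarizer would presumably need some weighting to favor terms that are specific to the cluster.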
6. REFERENCES
[1] Adak, S., Srivastava, B., Kankar, P., and Kurhekar, M. 2002.
A Common Data Representation for Organizing and
Managing Annotations of Biochip Expression Data. IBM
Research Report RI02017. Available at
http://domino.watson.ibm.com/library/CyberDig.nsf/Home
[2] Adak, S., Batra, V., Bhardwaj, D., Kamesam, P., Kankar,
P., Kurhekar, M., and Srivastava, B. 2002. Bioinformatics for
Microarrays. IBM Research Report RI02016. Available at
http://domino.watson.ibm.com/library/CyberDig.nsf/Home
[3] Baxevanis, A. 2001. The Molecular Biology Database
Collection: an updated compilation of biological database
resources. Nucleic Acids Research, Vol. 29, No. 1.
[4] Brazma, A., Jonassen, I., Vilo, J., and Ukkonen, E. (1998).
Predicting gene regulatory elements in silico on a genomic
scale. Genome Research, 8:1202-1215.
[5] Buneman, P., Davidson, S., Hart, K., Overton, C., and
Wong, L. (1995). A Data Transformation System for
Biological Data Sources. Proc. VLDB, pp. 158-169.
[6] Eisen, M., Spellman, P., Brown, P., and Botstein, D. (1998).
Cluster analysis and display of genome-wide expression
patterns. Proc. Natl Acad Sci USA, 95:14863-14868.
[7] Etzold, T., and Argos, P. (1993). SRS: An Indexing and
Retrieval Tool for Flat File Data Libraries. Computer
Application of Biosciences, 9:49-57.
[8] Goble, C., Stevens, R., Ng, G., Bechhofer, S., Paton, N.,
Baker, P., Peim, M., and Brass, A. (2001). Transparent
Access to Multiple Bioinformatics Information Sources. IBM
Systems Journal, Vol. 40, No.2, pp 532-551.
[9] Haas, L., Schwarz, P., Kodali, P., Kotlar, E., Rice, J., and
Swope, W. (2001). DiscoveryLink: A system for integrated
access to life sciences data sources. IBM Systems Journal,
Volume 40, Number 2, 2001.
[10] Kankar, P., Adak, S., Sarkar, A., Murari, K. and Sharma,
G. (2002). MedMeSH Summarizer: Text Mining for Gene
Clusters. In Proc. of the SIAM Conf. in Data Mining.
[11] Kurhekar, M., Adak, S., Jhunjhunwala, S., and
Raghupathy, K. (2002). Genome-wide pathway analysis and
visualization using gene expression data. In Proc. of the
Pacific Symposium of Biocomputing.
[12] Levy, A. 1998. Combining Artificial Intelligence and
Databases for Data Integration. At http://citeseer.nj.nec.com
[13] Robie, D., Chamberlin, D. and Florescu, D. (2001). Quilt:
an XML Query Language.
http://www.almaden.ibm.com/cs/people/chamberlin/quilt euro.html
[14] Shalon, D., Smith, S. and Brown, P. (1996). A DNA
microarray system for analyzing complex DNA samples using
two-color fluorescent probe hybridization. Genome Research,
6:639-645.
[15] Srivastava, B. 2002. Using Planning for Query
Decomposition in Bioinformatics. Sixth Intl. Conf. on AI
Planning & Scheduling (AIPS-02), Workshop on Is There
Life Beyond Operator Sequencing? Exploring Real World
Planning.
5.
In this paper, we presented a comprehensive bioinformatics KM framework called e2e which provides a uniform window to biochip data and related annotations. We demonstrated an e2e prototype that gives an early glimpse of the
wide potential of an integrated KM solution for bioinfor-