The Kleisli Approach to Data Transformation and Integration
Susan B. Davidson
Limsoon Wong
University of Pennsylvania
Kent Ridge Digital Labs
[email protected]
[email protected]
February 7, 2001
Abstract
Kleisli is a data transformation and integration system that can be used for any application where the
data is typed, but has proven especially useful for bioinformatics applications. It extends the conventional flat
relational data model supported by the query language SQL to a complex object data model supported by
the collection programming language CPL. It also opens up the closed nature of commercial relational data
management systems, offering an easily extensible system that performs complex transformations on autonomous
data sources that are heterogeneous and geographically dispersed. This paper describes some implementation
details and example applications of Kleisli.
1 Introduction
The Kleisli system [14, 32, 33] is an advanced broad-scale integration technology that has proven very useful
in the bioinformatics arena. Many bioinformatics problems require access to data sources that are large, highly
heterogeneous and complex, constantly evolving, and geographically dispersed. Solutions to these problems
usually involve many steps and require information to be passed smoothly (and usually transformed) between the
steps. Kleisli is designed to handle these requirements directly by providing a high-level query language, CPL,
that can be used to express complicated transformations across multiple data sources in a clear and simple way.
Many key ideas in the Kleisli system are influenced by functional programming research, as well as database query
language research. Its high-level query language CPL is a functional programming language that has a built-in
notion of "bulk" data types suitable for database programming and has many built-in operations required for
modern bioinformatics. Kleisli is itself implemented on top of the functional programming language Standard ML
of New Jersey (SML). Even the data format that Kleisli uses to exchange information with the external world is
derived from ideas in type inference.
This paper provides an overview of the Kleisli system, a summary of its impact on query language theory, a
description of its SML-based implementation, a description of its handling of relational databases, as well as
some sample applications in the biomedical arena. The organization of the paper is as follows. Section 2 offers
an overview of the architecture, data model, and query language (CPL) of Kleisli. Section 3 is a discussion of
Kleisli's type system and self-describing data exchange format and their impact on data integration in a dynamic
heterogeneous environment. Section 4 describes how monads give rise to Kleisli's internal abstract representation
of queries and simple optimization rules. Section 5 explains how higher-order functions give rise to a simple
implementation of Kleisli's powerful optimizer. Section 6 gives details on Kleisli's optimization with respect to
the most important class of external data sources, viz. relational databases. Section 7 discusses the impact
of Kleisli on bioinformatics data integration. In particular, the first Kleisli query written for this purpose is
reproduced here to illustrate the smoothness of Kleisli's interface to relational and non-relational bioinformatics
sources and to show its optimizations. Section 8 shows how to use Kleisli to turn a flat relational database system
into a complex object store to warehouse complex biological data. Section 9 demonstrates the use of Kleisli to
access multiple external data sources and external data analysis functions for querying protein patents. Finally,
Section 10 uses a clinical database to demonstrate Kleisli's ability to perform "window" queries. Such queries are
very clumsy to write in SQL, since the data must first be segmented and then each segment analyzed separately.
2 Quick Tour of Kleisli
We describe here the complex object data model of Kleisli and the high-level query language supported by
Kleisli, called CPL (the Collection Programming Language). Let us begin with the architecture of the system, as
depicted in the figure below.
[Figure: the Kleisli architecture. The CPL-Kleisli core consists of the CPL module, Type Module, NRC Module,
Optimizer, Complex Object Library, and the Driver and Primitive Managers. Drivers (e.g., Sybase, ASN.1, OPM,
ACeDB, BLAST) communicate through pipes or shared memory with remote servers such as GSDB, GenBank,
GDB, and NCBI-BLAST.]
Kleisli is extensible in several ways. It can be used to support other high-level query languages by replacing
the CPL module. Kleisli can also be used to support many different types of external data sources by adding
new drivers, which forward Kleisli's requests to these sources and translate their replies into Kleisli's exchange
format. The version of Kleisli that forms the backbone of the ConnectivityEngine(TM) of GeneticXchange Inc.
(www.geneticXchange.com) contains over sixty drivers for many popular bioinformatics systems, including Sybase,
Oracle, Entrez [27], WU-BLAST2 [1], Gapped BLAST [3], ACEDB [31], etc. The optimizer of Kleisli can also be
customized by different rules and strategies.
When a query is submitted to Kleisli, it is first processed by the CPL Module, which translates it into an equivalent
expression in the abstract calculus NRC. NRC is based on that described in [9], and is chosen as the internal
query representation because it is easy to manipulate and amenable to machine analysis. The NRC expression is
then analyzed by the Type Module to infer the most general valid type for the expression, and is passed to the
Optimizer Module. Once optimized, the NRC expression is compiled by the NRC Module into calls to the
Complex Object Library. The resulting compiled code is then executed, accessing drivers and external primitives
as needed through pipes or shared memory. The Driver and Primitive Managers keep information on external
sources and primitives and the wrapper/interface routines. The Complex Object Library contains routines for
manipulating complex objects, such as code for set intersection and code for set iteration.
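To make this flow concrete, the following SML sketch mirrors the pipeline just described. The module and
function names (CPL.toNRC, TypeModule.infer, Optimizer.rewrite, NRCModule.compile) are hypothetical
stand-ins for the corresponding Kleisli modules, not the actual API.

fun runQuery (cplText: string) =
  let val nrc       = CPL.toNRC cplText           (* CPL Module: translate CPL to NRC *)
      val _         = TypeModule.infer nrc        (* infer the most general valid type *)
      val optimized = Optimizer.rewrite nrc       (* apply rule bases and strategies *)
      val run       = NRCModule.compile optimized (* compile to Complex Object Library calls *)
  in run ()                                       (* execute, reaching drivers and primitives *)
  end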
The data model underlying Kleisli is a complex object type system that goes beyond the "sets of records" or
"flat relations" type system of relational databases [13]. It allows arbitrarily nested records, sets, lists, bags, and
variants. A variant is also called a tagged union type, and represents a type that is "either this or that". The
collection or "bulk" types (sets, bags, and lists) are homogeneous. In order to mix objects of different types in
a set, bag, or list, it is necessary to inject these objects into a variant type.
The simultaneous availability of sets, bags, and lists in Kleisli deserves some comment. In a relational database,
the sole bulk data type is the set. Having only one bulk data type presents at least two problems in real-life
applications. Firstly, the particular bulk data type may not be a natural model of real data. Secondly, the
particular bulk data type may not be an efficient model of real data. For example, if we are restricted to the flat
relational data model, the GenPept report in Example 2.1 below must necessarily be split into many separate
tables in order to be losslessly stored in a relational database. The resulting multi-table representation of the
GenPept report is conceptually unnatural and operationally inefficient. A person querying the resulting data must
pay the mental overhead of understanding both the original GenPept report and its badly-fragmented multi-table
representation. He may also have to pay the performance overhead of having to re-assemble the original GenPept
report from its fragmented multi-table representation to answer queries.
Example 2.1 The GenPept report is the format chosen by the US National Center for Biotechnology Information
to present amino acid sequence information. While an amino acid sequence is a string of letters, certain regions
and positions of the string are of special biological interest, such as binding sites, domains, and so on. The feature
table of a GenPept report is the part of the GenPept report that documents the positions of these regions of special
biological interest, as well as annotations or comments on these regions. The following type represents the feature
table of a GenPept report from Entrez [27].
(#uid:num, #title:string,
#accession:string, #feature:{(
#name:string, #start:num, #end:num,
#anno:[(#anno_name:string, #descr:string)])})
It is an interesting type because one of its fields (#feature) is a set of records, one of whose fields (#anno) is in
turn a list of records. More precisely, it is a record with four fields: #uid, #title, #accession, and #feature.
The first three of these store values of types num, string, and string respectively. The #uid field uniquely
identifies the GenPept report. The #feature field is a set of records, which together form the feature table of the
corresponding GenPept report. Each of these records has four fields: #name, #start, #end, and #anno. The first
three of these have types string, num, and num respectively. They represent the name, start position, and end
position of a particular feature in the feature table. The #anno field is a list of records. Each of these records has
two fields, #anno_name and #descr, both of type string. These records together represent all annotations on the
corresponding feature.
□
In general, the types are freely formed by the syntax:
t ::= num | string | bool | {t} | {|t|} | [t] | (l1: t1, ..., ln: tn) | <l1: t1, ..., ln: tn>
Here num, string, and bool are the base types. The other types are constructors and build new types from
existing types. The types {t}, {|t|}, and [t] respectively construct set, bag, and list types from type t. The type
(l1: t1, ..., ln: tn) constructs record types from types t1, ..., tn. The type <l1: t1, ..., ln: tn> constructs variant
types from types t1, ..., tn. The flat relations of relational databases are basically sets of records, where each field
of the records is of a base type; in other words, relational databases have no bags, no lists, no variants, no nested
sets, and no nested records. Values of these types can be explicitly constructed in CPL as follows, assuming the
e's are values of appropriate types: (l1: e1, ..., ln: en) for records; <l: e> for variants; {e1, ..., en} for sets; {|e1,
..., en|} for bags; and [e1, ..., en] for lists.
Example 2.2 The feature table of GenPept report 131470, a tyrosine phosphatase 1C sequence, is shown below.
(#uid:131470, #accession:"131470",
#title:"... (PTP-1C)...", #feature:{(
#name:"source", #start:0, #end:594, #anno:[
(#anno_name:"organism", #descr:"Mus musculus"),
(#anno_name:"db_xref", #descr:"taxon:10090")]),
...})
The particular feature displayed above goes from amino acid 0 to amino acid 594, which is actually the entire
sequence, and has two annotations. The first annotation indicates that this amino acid sequence is derived from
a mouse DNA sequence. The second is a cross-reference to the US National Center for Biotechnology Information
taxonomy database.
□
The schemas and structures of all popular bioinformatics databases, flat files, and software are easily mapped
into this data model. At the high end of data structure complexity are Entrez [27] and ACEDB [31], which
contain deeply nested mixtures of sets, bags, lists, records, and variants. At the low end of data structure
complexity are the relational database systems [13] such as Sybase and Oracle, which contain flat sets of records.
Currently, Kleisli gives access to over sixty of these and other bioinformatics sources. The reason for this ease of
mapping bioinformatics sources to Kleisli's data model is that they are all inherently composed of combinations
of sets, bags, lists, records, and variants. We can directly and naturally map sets to sets, bags to bags, lists to
lists, records to records, and variants to variants in Kleisli's data model, without having to make any (type)
declaration beforehand.
We now come to CPL, the primary query language of Kleisli. An interesting feature of the syntax of CPL
is its use of the comprehension syntax [8, 30]. An example of a typical comprehension in CPL syntax is
{x * x | \x <- S, odd(x)}, which returns a set consisting of the squares of all odd numbers in the set S.
This is similar to the notation found in functional languages, the main difference being that the binding
occurrence of x is indicated by preceding it with a backslash, and that the expression returns a set rather than a
list. As in functional languages, \x <- S is called a "generator", and odd(x) is called a "filter". Rather than giving
the complete syntax, we illustrate CPL with a few examples on a set of feature tables DB.
Example 2.3 The query below extracts the titles and features of those elements of DB whose titles contain
tyrosine as a substring.
{ (#title: x.#title, #feature: x.#feature) | \x <- DB, x.#title string-islike "%tyrosine%" };
□
This query is a simple project-select query. A project-select query is a query that operates on one (flat) relation
or set. Thus the transformation that such a query can perform is limited to selecting some elements of the relation
and extracting or projecting some fields from these elements. Except for the fact that the source data and the
result may not be in first normal form, these queries can be expressed in a relational query language. However,
CPL can perform more complex restructurings, such as nesting and unnesting, not found in common relational
database languages like SQL, as shown in the following examples.
Example 2.4 The following query flattens DB completely. The syntax \a <--- f.#anno has a similar meaning to
\x <- DB, but works on lists instead of sets. Thus it binds a to each item in the list f.#anno.
{(#title:x.#title, #feature:f.#name, #start:f.#start, #end:f.#end,
#anno-name:a.#anno_name, #anno-descr:a.#descr)
| \x <- DB, \f <- x.#feature, \a <--- f.#anno};
□
Example 2.5 This query demonstrates how to do nesting in CPL. The subquery DB' is the restructuring of DB
by pairing each entry with its source organism. The subquery ORG then extracts all organism names. The main
query groups entries in DB' by organism names. It also sorts the output list by alphabetical order of organism
names, i.e. [u | \u <- ORG] converts the set ORG into a duplicate-free sorted list.
let \DB' == {(#entry:x, #organism:a.#descr)
| \x <- DB, \f <- x.#feature, \a <--- f.#anno,
a.#anno_name = "organism"} in
let \ORG == {y.#organism | \y <- DB'}
in [(#organism:z, #entries: {v.#entry | \v <- DB', v.#organism = z}) | \z <--- [u | \u <- ORG]];
□
The inspiration for CPL came from [6], where structural recursion was presented as a query language. However,
structural recursion has two difficulties. The first is that not every syntactically correct structural recursion
program is logically well defined [7]. The second is that structural recursion has too much expressive power
because it can express queries that require exponential time and space.
In the context of databases, which are typically very large, programs (queries) are usually restricted to those which
are "practical" in the sense that they are in a low complexity class such as LOGSPACE, PTIME, or TC0. In
fact, one may even want to prevent any query that has worse than O(n log n) complexity, unless one is confident
that the query optimizer has a high probability of optimizing the query to no more than O(n log n) complexity.
Database query languages such as SQL are therefore designed in such a way that joins are easily recognized, since
joins are the only operations in a typical database query language that require O(n²) complexity if evaluated
naively.
Thus Tannen and Buneman suggested a natural restriction on structural recursion to reduce its expressive power
and to guarantee its well-definedness. Their restriction cuts structural recursion down to homomorphisms on the
commutative idempotent monoid of sets, revealing a telling correspondence to monads [30]. A nested relational
calculus, which is denoted here by NRC, was then designed around this restriction [9]. NRC is essentially the
simply-typed lambda calculus extended by a construct for building records, a construct for decomposing records
by field selection, a construct for building sets, and a construct for decomposing sets by means of the restriction
on structural recursion. Specifically, the construct for decomposing sets is ⋃{e1 | x ∈ e2}, which forms a set
by taking the big union of e1[o/x] over each o in the set e2. NRC (suitably extended) is implemented by the
NRC Module of Kleisli and is the abstract counterpart of CPL, a la Wadler's equations relating monads and
comprehensions [30].
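Spelled out as an equation (merely restating the description above), the big-union construct behaves as follows:

\[
  \bigcup \{\, e_1 \mid x \in e_2 \,\}
  \;=\; \bigcup_{o \in e_2} e_1[o/x]
  \;=\; e_1[o_1/x] \cup \cdots \cup e_1[o_n/x],
  \qquad \text{where } e_2 = \{o_1, \ldots, o_n\}.
\]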
The expressive power of NRC and its extensions is studied in [28, 15, 19, 9, 29]. The impact of these and
other theoretical results on the design of CPL and Kleisli is that CPL adopts NRC(Q, +, ·, −, ÷, Σ, =)
as its core, while allowing full-fledged recursion and other operators to be imported easily as needed into the
system. NRC(Q, +, ·, −, ÷, Σ, =) captures all standard nested relational queries in a high-level manner
that is easy for automated optimizer analysis. It is also easy to translate a more user-friendly surface syntax such
as the comprehension syntax or the SQL select-from-where syntax into NRC(Q, +, ·, −, ÷, Σ, =). It is thus
a very suitable core.
3 Type Inference and Self-Describing Exchange Format
In a dynamic heterogeneous environment such as that of bioinformatics, many different database and software
systems are used. They often do not have anything that can be thought of as an explicit database schema. Further
compounding the problem is that research biologists demand flexible access and queries in ad-hoc combinations.
Thus, a query system that aims to be a general integration mechanism in such an environment must satisfy four
conditions. First, it must not count on the availability of schemas. It must be able to compile any query submitted
based solely on the structure of that query. Second, it must have a data model that the external database and
software systems can easily translate to, without doing a lot of type declarations. Third, it must shield existing
queries from evolution of the external sources as much as possible. For example, an extra field appearing in an
external database table must not necessitate the recompilation or rewriting of existing queries over that data
source. Fourth, it must have a data exchange format that is straightforward to use, so that it does not demand
too much programming effort or contortion to capture the variety of structures of output from external databases
and software.
Three of these requirements are addressed by features of CPL's type system. CPL has polymorphic record types
that allow, for example,

\R => {x.#name | \x <- R, x.#salary > 1000}

which defines a function that returns the names of people in R earning more than a thousand dollars. This function
is applicable to any R that has at least the #name and the #salary fields, thus allowing the input source some
freedom to evolve. CPL also has variant types that allow, for example, the following value:
{ <#name: "John">, <#zip-code: 119613> }
This set contains objects of very different structures: a string carrying a #name tag and a number carrying a
#zip-code tag. This feature is particularly useful in handling ASN.1-formatted [18] data from Entrez, one of the
most important and most complex sources of DNA sequences, as it contains a profusion of variant types.

In addition, CPL does not require any type to be declared at all. The type and meaning of any CPL program
can always be completely inferred from its structure without the use of any schema or type declaration. This
makes it possible to logically plug in any data source without doing any form of schema declaration, at a small
acceptable risk of run-time errors if the inferred type and the actual structure are not compatible. This is an
important feature because most of our data sources do not have explicit schemas, while a few have extremely
large schemas that take many pages to write down (for example, the ASN.1 schema of Entrez [22]), making it
impractical to have any form of declaration.
We now come to the fourth requirement. A data exchange format is an agreement on how to lay out data in a data
stream or message when the data is exchanged between two systems. In our case, it is the format for exchanging
data between Kleisli and all the bioinformatics sources. The data exchange format of Kleisli corresponds
one-to-one to Kleisli's data model. It provides for records, variants, sets, bags, and lists, and it allows these data
types to be freely composed. In fact, the data exchange format completely adopts the syntax of value construction
in CPL, as described in the previous section. Recall that CPL programs contain no type declarations. A CPL
compiler has to figure out whether a CPL program has a principal typing scheme. This kind of type inference is
possible because every construct in CPL has an unambiguous most general type. In particular, the value
construction syntax is such that it is possible to inspect only the first several symbols to figure out local type
constraints on the corresponding value, as each value constructor is unambiguous. For example, if a {| bracket is
seen, it is immediately clear that it is a bag; and if a ( bracket is seen, it is immediately clear that it is a record.
Thus, by adopting the value construction syntax of CPL as the data exchange format, the latter becomes
self-describing.
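A minimal SML sketch of this idea is shown below: the first character or two of a value determine its
constructor, so a reader needs no schema. The datatype of "kinds" and the function name are hypothetical
simplifications, not Kleisli's actual reader.

datatype kind = SET | BAG | LIST | RECORD | VARIANT | SCALAR

(* Dispatch on the first one or two characters of the exchange stream. *)
fun kindOf (#"{" :: #"|" :: _) = BAG     (* {| ... |} *)
  | kindOf (#"{" :: _)         = SET     (* { ... } *)
  | kindOf (#"[" :: _)         = LIST    (* [ ... ] *)
  | kindOf (#"(" :: _)         = RECORD  (* (l1: ...) *)
  | kindOf (#"<" :: _)         = VARIANT (* <l: ...> *)
  | kindOf _                   = SCALAR  (* num, string, or bool *)

(* For example, kindOf (explode "{|1,2|}") evaluates to BAG. *)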
A self-describing exchange format is one in which there is no need to define in advance the structure of the
objects being exchanged. That is, there is no fixed schema and no type declaration. In a sense, each object being
exchanged carries its own description. A self-describing format has the important property that, no matter how
complex the object being exchanged is, it can be easily parsed and reconstructed without any schema information.
To appreciate this advantage, one should look at the ISO ASN.1 standard [18] for open systems interconnection.
It is not easy to exchange ASN.1 objects because, before we can parse any ASN.1 object, we first need to parse
the schema that describes its structure, making it necessary to write two complicated parsers instead of one
simple parser.
4 Kleisli Triples and Abstract Syntax
Let us now consider the restricted form of structural recursion which corresponds to the presentation of monads
by Kleisli [30, 9]. It is the combinator ext(·)(·) obeying the following three equations:

ext(f)({}) = {}
ext(f)({o}) = f(o)
ext(f)(A ∪ B) = ext(f)(A) ∪ ext(f)(B)

Thus, ext(f)(R) is equivalent to the ⋃{f(x) | x ∈ R} construct of NRC. The direct correspondence in CPL is
ext{e1 | \x <- e2}, which is interpreted as ext(f)(e2), where f(x) = e1. This combinator is a key operator
in the Complex Object Library of Kleisli and is at the heart of NRC, the abstract representation of queries
in the implementation of CPL. It earns its central position in the Kleisli system because it offers tremendous
practical and theoretical convenience.
Its practical convenience is best seen in the issue of abstract syntax in the implementation of a database query
language. The abstract syntax is the internal representation of a query and is usually manipulated by code
generators; the better abstract syntax is the one that is easier to analyse. It must not be confused with the surface
syntax, which is what the usual database programmer programs in; the better surface syntax is the one that is
easier to read. It is worth contrasting the ext construct with the comprehension syntax here. With regard to surface
syntax, CPL adopts the comprehension syntax because it is easier to read than the ext construct. For example,
the Cartesian product of two sets is expressed using the comprehension syntax as

{(x, y) | \x <- R, \y <- S}

In contrast, it is expressed using the ext construct as

ext{ext{{(x,y)} | \y <- S} | \x <- R}

which is more convoluted. However, the advantage of the comprehension syntax more or less ends here. With
regard to abstract syntax, the situation is exactly the opposite! Comprehensions are easy for the human
programmer to read and understand. However, they are in fact extremely inconvenient for automatic analysis and
are thus a poor candidate as an abstract representation of queries. This difference is illustrated below by a pair of
contrasting examples in implementing optimization rules.
A well-known optimization rule is vertical loop fusion [17], which corresponds to the physical notion of getting
rid of intermediate data and the logical notion of quantifier elimination. Such an optimization on queries in the
comprehension syntax can be expressed informally as

{e | G1, ..., Gn, \x <- {e' | H1, ..., Hm}, J1, ..., Jk} ⇝ {e[e'/x] | G1, ..., Gn, H1, ..., Hm, J1[e'/x], ..., Jk[e'/x]}
Such a rule in comprehension form is very simple to grasp. Basically, the intermediate set built by the
comprehension {e' | H1, ..., Hm} has been eliminated, in favour of generating the x on the fly. In practice it is
quite messy to implement the rule above. In writing that rule, the informal "..." denotes any number of generators
and filters in a comprehension. When it comes to actually implementing it, a nasty traversal routine must be
written to skip over the non-applicable Gi in order to locate the applicable \x <- {e' | H1, ..., Hm} and Ji.
Let us now consider the ext construct. As pointed out by Wadler [30], any comprehension can be translated into
this construct. Its effect on the optimization rule for vertical loop fusion is dramatic. This optimization is now
expressed as

ext{e1 | \x <- ext{e2 | \y <- e3}} ⇝ ext{ext{e1 | \x <- e2} | \y <- e3}
The informal and troublesome "..." no longer appears. Such a rule can be coded up straightforwardly in almost
any implementation language. A similar simplification is also observed in proofs using structural induction. For
comprehension syntax, when one comes to the case for comprehension, one must introduce a secondary induction
proof based on the number of generators and filters in the comprehension, whereas the ext construct does not
give rise to such complication. A related saving, pointed out to us by Wadler, is that comprehensions require two
kinds of terms, expressions and qualifiers, whereas the ext formulation requires only one kind of term, expressions.
In order to illustrate this point more concretely, it is necessary to introduce some detail from the implementation
of the Kleisli system. Recall from the introductory section that Kleisli is implemented on top of the Standard
ML of New Jersey (SML). The type SYN of SML objects that represent queries in Kleisli is declared in the NRC
Module mentioned in Section 2. The data constructors that are relevant to our discussion are:
type VAR = int   (* Variables, represented by int *)
type SVR = int   (* Server connections, represented by int *)
type CO  = ...   (* Representation of complex objects *)

datatype SYN = ...
  | EmptySet                       (* { } *)
  | SngSet of SYN                  (* { E } *)
  | UnionSet of SYN * SYN          (* E1 {+} E2 *)
  | ExtSet of SYN * VAR * SYN      (* ext{ E1 | \x <- E2 } *)
  | IfThenElse of SYN * SYN * SYN  (* if E1 then E2 else E3 *)
  | Read of SVR * real * SYN       (* process E using S; the real is the
                                      request priority assigned by the optimizer *)
  | Variable of VAR                (* x *)
  | Binary of (CO * CO -> CO) * SYN * SYN
                                   (* construct for caching static objects;
                                      allows the optimizer to insert some code
                                      for doing dynamic optimization *)
All SML objects that represent optimization rules in Kleisli are functions and they have type RULE:
type RULE = SYN -> SYN option
If an optimization rule r can be successfully applied to rewrite an expression e to an expression e', then r(e) =
SOME(e'). If it cannot be successfully applied, then r(e) = NONE.

We return to the rule on vertical loop fusion. As promised earlier, we have a very simple implementation:
Example 4.1 Vertical loop fusion.

fun Vertfusion (ExtSet (E1, x, ExtSet (E2, y, E3))) = SOME (ExtSet (ExtSet (E1, x, E2), y, E3))
  | Vertfusion _ = NONE
□
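As an illustration of the rule in action, the following hypothetical snippet (using the SYN constructors above,
with the integers 1, 2, and 3 standing for the variables x and y and a set-valued variable R) applies Vertfusion
to the encoding of ext{ {x} | \x <- ext{ {y} | \y <- R } }:

val query = ExtSet (SngSet (Variable 1), 1,
                    ExtSet (SngSet (Variable 2), 2, Variable 3))
val fused = Vertfusion query
(* fused = SOME (ExtSet (ExtSet (SngSet (Variable 1), 1, SngSet (Variable 2)),
                         2, Variable 3)) *)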
5 Higher-Order Functions and Optimization
Another idea that we have exploited in implementing Kleisli is the use of higher-order functions. There are many
advantages and conveniences of higher-order functions, besides allowing the expression of better algorithms as
discussed in [29]. We use the implementation of the Kleisli query optimizer module for illustration here. The
optimizer consists of an extensible number of phases. Each phase is associated with a rule-base and a rule
application strategy. A large number of rule application strategies are supported. The more familiar ones include
BottomUpOnce, which applies rules to rewrite an expression tree from leaves to root in a single pass; TopDownOnce,
which applies rules to rewrite an expression tree from root to leaves in a single pass; MaxOnce, which applies rules
to the largest redexes in a single pass; and so on, together with their multi-pass versions.
By exploiting higher-order functions, all of these rule application strategies can be decomposed into a "traversal"
component that is common to all strategies and a very simple "control" component that is special to each
strategy. In short, higher-order functions can generate all these strategies extremely simply, resulting in a very
small optimizer core. To give some idea of how this is done, some SML code fragments from the optimizer
module mentioned in Section 2 are presented below.

The "traversal" component is a higher-order function that is shared by all strategies:
val Decompose: (SYN -> SYN) -> SYN -> SYN
Recall that SYN is the type of SML objects that represent query expressions. The Decompose function accepts
a rewrite rule r and a query expression Q. Then it applies r to all immediate subtrees of Q to rewrite these
immediate subtrees. Note that it does not touch the root of Q and it does not traverse Q; it just nonrecursively
rewrites immediate subtrees using r. It is therefore very straightforward and looks like this:
fun Decompose r (SngSet N) = SngSet(r N)
| Decompose r (UnionSet(N,M)) = UnionSet(r N, r M)
| Decompose r (ExtSet(N,x,M)) = ExtSet(r N, x, r M)
| ...
A rule application strategy S is a function having the following type:

val S: RULEDB -> SYN -> SYN

The precise definition of the type RULEDB is not important to our discussion at this point and is deferred until
later. Such a function takes in a rule base R and a query expression Q and optimizes it to a new query expression
Q' by applying rules in R according to the strategy S.

Assume that Pick: RULEDB -> RULE is an SML function that takes a rule base R and a query expression Q and
returns NONE if no rule is applicable, and SOME(Q') if some rule in R can be applied to rewrite Q to Q'. Then the
"control" components of all the strategies mentioned earlier can be generated in a very simple way.
Example 5.1 The MaxOnce strategy applies rules to maximal subtrees. It starts by trying the rules on the root of
the query expression. If no rule can be applied, it moves down one level along all paths and tries again. But as
soon as a rule can be applied along a path, it stops at that level for that path. In other words, it applies each rule
at most once along each path from the root to the leaves. Here is its "control" component:

fun MaxOnce RDB Qry = case Pick RDB Qry
                        of SOME ImprovedQry => ImprovedQry
                         | NONE => Decompose (MaxOnce RDB) Qry
□
Example 5.2 The BottomUpOnce strategy applies rules in a leaves-to-root pass. It tries to rewrite each node at
most once as it moves towards the root of the query expression. Here is its "control" component:
fun BottomUpOnce RDB Qry =
let fun Pass SubQry =
let val BetterSubQry = Decompose Pass SubQry
in case Pick RDB BetterSubQry
of SOME EvenBetterSubQry => EvenBetterSubQry
| NONE => BetterSubQry end
in Pass Qry end
□
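For comparison, a sketch of the TopDownOnce strategy mentioned earlier is given below. Assuming the same
Pick and Decompose, it rewrites the root first and then descends into the (possibly rewritten) subtrees, touching
each node at most once in the pass. This is a plausible reconstruction, not code from the Kleisli release.

fun TopDownOnce RDB Qry =
  let val BetterQry = case Pick RDB Qry
                        of SOME ImprovedQry => ImprovedQry
                         | NONE => Qry
  in Decompose (TopDownOnce RDB) BetterQry end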
Let us now present an interesting class of rules that requires the use of multiple rule application strategies. The
scope of rules like the vertical loop fusion in the previous section is the entire query. In contrast, this class
of rules has two parts. The inner part is "context sensitive" and its scope is limited to certain components of the
query. The outer part scopes over the entire query to identify contexts where the inner part can be applied. The
two parts of the rule can be applied using completely different strategies.
A rule base RDB is represented in our system as an SML record of type
type RULEDB = {
DoTrace: bool ref,
Trace: (rulename -> SYN -> SYN -> unit) ref,
Rules: (rulename * RULE) list ref }
The Rules field of RDB stores the list of rules in RDB together with their names. The Trace field of RDB
stores a function f that is to be used for tracing the usage of the rules in RDB. The DoTrace field of RDB
stores a flag to indicate whether tracing is to be done. If tracing is indicated, then whenever a rule of name N in
RDB is applied successfully to transform a query Q to Q', the trace function is invoked as f N Q Q' to record a
trace. Normally, this simply means a message like "Q is rewritten to Q' using the rule N" is printed. However,
the trace function f is allowed to carry out considerably more complicated activities.
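For instance, a plain printing trace function of the type expected by the Trace field might look like the sketch
below, where ppSYN is a hypothetical pretty-printer for SYN:

fun PrintTrace name oldQry newQry =
  print ("Rule " ^ name ^ " rewrote " ^ ppSYN oldQry ^
         " to " ^ ppSYN newQry ^ "\n")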
It is possible to exploit trace functions to achieve sophisticated transformations in a simple way. An example is
the rule that rewrites if e1 then ... e1 ... else e3 to if e1 then ... true ... else e3. The inner part of this rule
rewrites e1 to true. The outer part of this rule identifies the context and scope of the inner part of this rule:
limited to the then-branch. This example is very intuitive to a human being. In the then-branch of a conditional,
all subexpressions that are identical to the test predicate of the conditional must eventually evaluate to true.
However, such a rule is not so straightforward to express to a machine. The informal "..." are again in the way.
Fortunately, rules of this kind are straightforward to implement in our system.
Example 5.3 The if-then-else absorption rule is expressed by the AbsorbThen rule below. The rule has three
clauses. The first clause says that the rule should not be applied to an IfThenElse whose test predicate is already
a Boolean constant, because it would lead to non-termination otherwise. The second clause says that the rule
should be applied to all other forms of IfThenElse. The third clause says that the rule is not applicable in any
other situation.
fun AbsorbThen (IfThenElse(Bool _,_,_)) = NONE
  | AbsorbThen (IfThenElse(E1,E2,E3)) =
      let fun Then E = if SyntaxTools.Equiv E1 E then SOME(Bool true) else NONE
      in case ContextSensitive Then TopDownOnce E2
           of SOME E2' => SOME(IfThenElse(E1,E2',E3))
            | NONE => NONE end
  | AbsorbThen _ = NONE
The second clause is the meat of the implementation. The inner part of the rewrite of if e1 then ... e1 ... else e3
to if e1 then ... true ... else e3 is captured by the function Then, which rewrites any e identical to e1 to true.
This function is then supplied as the rule to be applied using the TopDownOnce strategy within the scope of the
then-branch ... e1 ..., using the ContextSensitive rule generator given below.
fun ContextSensitive Rule Strategy Qry =
  let val Changed = ref false              (* This flag is set if Rule is applied *)
      val RDB = {                          (* Set up a context-sensitive rule base *)
            DoTrace = ref true,
            Trace = ref (fn _ => fn _ => fn _ => Changed := true),
                                           (* Changed is true if Rule is used *)
            Rules = ref [("", Rule)] }
      val OptimizedQry = Strategy RDB Qry  (* Apply Rule using Strategy. *)
  in if !Changed then SOME OptimizedQry else NONE end
This ContextSensitive rule generator is reused for many other context-sensitive optimization rules, such as the
rule for migrating projections to external relational database systems.
□
6 Optimization of Queries on Relational Databases
Relational database systems are the most powerful data sources that Kleisli interfaces to. These database systems
are themselves equipped with the ability to perform sophisticated transformations expressed in SQL. A good
optimizer should aim to migrate as many operations from Kleisli to these systems as possible. There are four
main optimizations that are useful in this context: the migration of projections, selections, and joins on a single
database; and the migration of selections across two databases. The Kleisli optimizer has four different rules to
exploit these four opportunities. We show them below.
Let us begin with the rule for migrating projections. A special case of this rule is to rewrite {x.#name | \x
<- process "select * from T x where 1 = 1" using A} to { x.#name | \x <- process "select name =
x.name from T x where 1 = 1" using A}, assuming A connects to an SQL database. In the original query, the
entire table T has to be retrieved. In the rewritten query, only one column of that table has to be retrieved.
More generally, if x is from a relational database system and every use of x is in the context of a field projection
x.#l, these projections can be "pushed" to the relational database so that unused fields are not retrieved and
transferred.
Example 6.1 The rule for migrating projections to a relational database is implemented by MigrateProj below.
The rule requires a function FullyProjected x N that traverses an expression N to determine whether x is
always used within N in the context of a field projection and to determine which fields are being projected; it
returns NONE if x is not always used in such a context; otherwise, it returns SOME L, where the list L contains
all the fields being projected. This function is implemented in a simple way using the ContextSensitive rule
generator from Example 5.3.
fun FullyProjected x N =
let val (Count, Projs) = (ref 0, ref [])
fun FindProjs (Variable y) = (if x = y then inc Count else (); NONE)
| FindProjs (Proj (L, Variable y)) =
(if x = y then Projs := L :: (!Projs) else (); NONE)
| FindProjs _ = NONE
in ContextSensitive FindProjs BottomUpOnce N;
if length (!Projs) = !Count then SOME (!Projs) else NONE
end
Recall from Section 4 that process M using S is represented in the NRC Module as a SYN object Read(S, p, M),
where p is a priority to be assigned by Kleisli. The MigrateProj rule is defined below. The function SQL.PushProj
is one of the many support routines available in the current release of Kleisli that handle manipulation of SQL
queries and other SYN abstract syntax objects.
fun MigrateProj (ExtSet (N, x, Read (S, p, String M))) =
if Annotations.IsSQL S
(* test if S connects to a SQL server *)
then case FullyProjected x N (* test if x is always in a projection *)
of SOME Projs => SOME (ExtSet (N, x, Read (S, p, String (SQL.PushProj Projs M))))
| NONE => NONE
else NONE
| MigrateProj _ = NONE
□
Second is the rule for migrating selections. A special case of this rule is to rewrite {x | \x <- process "select *
from EMP e where 1 = 1" using A, x.#name = "peter"} to {x | \x <- process "select * from EMP e
where e.name = 'peter'" using A}. In the original query, the entire table EMP has to be retrieved from the
relational database A so that Kleisli can filter for Peter's record. In the rewritten query, the record for Peter
is retrieved directly without retrieving any other records. More generally, if x is from a relational database and
there are some equality tests x.#l = c, then we should push as many of these tests to the database as possible.
Example 6.2 The rule for migrating selections to a relational database is implemented by MigrateSelect below.
The rule requires a function FlattenTests Ok N that traverses a tower of if-then-else's in N to extract equality
tests satisfying the check Ok for migration; it returns a triple (Pve, Nve, N'), where Pve is a list of tests to be
`and'ed together, Nve is a list of tests to be negated and `and'ed together, and N' is the remaining (transformed)
unflattenable part of the tower; it satisfies if (⋀ Pve) ∧ ¬(⋁ Nve) then N' else {} = N. This function is
implemented in a simple way as follows.
fun FlattenTests Ok (IfThenElse (C, N, EmptySet)) =
let val (Pve, Nve, N') = FlattenTests Ok N
in if Ok C then (C::Pve, Nve, N') else (Pve, Nve, IfThenElse (C, N', EmptySet)) end
| FlattenTests Ok (IfThenElse (C, EmptySet, M)) =
let val (Pve, Nve, M') = FlattenTests Ok M
in if Ok C then (Pve, C::Nve, M') else (Pve, Nve, IfThenElse (C, EmptySet, M')) end
| FlattenTests Ok (ExtSet (N, x, M)) =
let val (Pve, Nve, N') = FlattenTests Ok N
in (Pve, Nve, ExtSet (N', x, M)) end
| FlattenTests Ok N = ([], [], N)
The MigrateSelect rule is defined below. The function SQL.PushTest is one of the many support routines
available in the current release of Kleisli that handle manipulation of SQL queries and other SYN abstract syntax
objects. The function call SQLDict.ExtractLONG S C uses the catalog of the relational database S to determine
which of the fields in the output of the SQL query C are BLOB-like; a BLOB-like field is usually a very large
string on which equality tests are not supported by the underlying database S. The function CheckPushTest uses
the output of SQLDict.ExtractLONG to produce a function that tests whether an expression is an equality test
that can be migrated.
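A simplified sketch of what such a produced function might look like is given below. Equal and Const are
assumed constructors of SYN, and the actual CheckPushTest in Kleisli may differ; the idea is that a test is
accepted only if every projected field belongs to a variable registered in Hash and is not BLOB-like.

fun CheckPushTest Hash (Equal (L, R)) =
      let fun okSide (Proj (l, Variable y)) =
                (case IntHashTable.find Hash y
                   of SOME Forbidden => not (Forbidden l)  (* reject BLOB-like fields *)
                    | NONE => false)
            | okSide (Const _) = true
            | okSide _ = false
      in okSide L andalso okSide R end
  | CheckPushTest _ _ = false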
fun MigrateSelect (ExtSet (N, x, Read (S, p, String M))) =
if Annotations.IsSQL S
(* test if S is a relational db *)
then
let val Forbidden = SQLDict.ExtractLONG S((!SQL.Mk) M) (* can't migrate tests on BLOB-like fields *)
val Hash = IntHashTable.mkTable (1, Error.Goofed "Oops!")
val _ = IntHashTable.insert Hash (x, Forbidden)
val (Pve, Nve, N') = FlattenTests (CheckPushTest Hash) N
val Pve = CvtTest S Pve
(* convert list to tower of if-then-else *)
val Nve = CvtTest S Nve
(* convert list to tower of if-then-else *)
in if (null Pve) andalso (null Nve)
then NONE
else SOME (ExtSet (N', x, Read (S, p, String (SQL.PushTest Pve Nve x M)))) end
else NONE
| MigrateSelect _ = NONE
□
Third is the rule for migrating joins. A special case of this rule is to rewrite { y.#mgr | \x <- process "select *
from EMP e where 1 = 1" using A, \y <- process "select * from DEPT d where 1 = 1" using A,
x.#name = "peter", y.#name = x.#dept } to { y.#mgr | \y <- process "select mgr: d.mgr from EMP
e, DEPT d where e.name = 'peter' and d.name = e.dept" using A }. In the original query, the entire
table EMP has to be retrieved once and the entire table DEPT has to be retrieved n times, where n is the cardinality
of EMP. In other words, n + 1 requests have to be made on the relational database A. In the rewritten query,
only one request is made on the relational database A to retrieve the record in the join of EMP and DEPT matching
Peter. More generally, if x and y are from the same relational database and are always used in the context of a
field projection and there are some equality tests x.#l1 = y.#l2, then this is a join that we should push to that
database. The advantages are that only one request is made, instead of n + 1; only matching records in the join
are retrieved, instead of entire tables; and the underlying relational database system now also has more context
information to perform a better optimization.
Example 6.3 The rule for migrating joins to a relational database is implemented by MigrateJoin below. The
function SQL.PushJoin is one of the many support routines available in the current release of Kleisli that handle
manipulation of SQL queries and other SYN abstract syntax objects.
fun MigrateJoin (ExtSet (N, x, Read (S, p, String M))) =
if Annotations.IsSQL S
(* test if S is a SQL server *)
then case FullyProjected x N (* test if x is always in a projection *)
of SOME ProjO =>
(* ProjO are projections on the outer relation *)
let val Forbidden = SQLDict.ExtractLONG S ((!SQL.Mk) M)
(* can't migrate tests on BLOB-like fields *)
val Hash = IntHashTable.mkTable (10, Error.Goofed "Oops!")
val _ = IntHashTable.insert Hash (x, Forbidden)
val (PO, NO, N') = FlattenTests (CheckPushTest Hash) N
(* PO, NO are tests on the outer relation
that can be migrated. N' is what's left
after migration *)
in case CheckJoin Hash x S N'
of SOME (PI, NI, ProjI, ExtSet (U, v, Read (_, _, String W))) =>
(* W is the inner relation of the join.
PI, NI are tests on W that can be migrated.
ProjI are projections on W. U is what's
left after migration *)
SOME (ExtSet (Rename x v U, x, Read (S, p, String (
SQL.PushJoin x v ProjO ProjI (PO @ PI) (NO @ NI) M W))))
| _ => NONE end
| NONE => NONE
else NONE
| MigrateJoin _ = NONE
Most of the work is done by the function call CheckJoin Hash x S N', which traverses N' to look for a
subexpression that uses the same database S to serve as the inner relation of the join. If such a relation exists, it
returns SOME (PI, NI, ProjI, ExtSet (U, v, Read (S, q, String W))) such that the set W can be used as the
inner relation of the join, the join condition is (⋀ PI) ∧ ¬(⋁ NI), ProjI stores the projections in which the inner
join variable v occurs, and U is an expression corresponding to the remaining operations that cannot be migrated.

□
Fourth is the migration of selections across two relational databases. An example is the following rewrite of
{ y.#mgr | \x <- process "select * from EMP e where 1 = 1" using A, \y <- process "select *
from DEPT d where 1 = 1" using B, x.#name = "peter", y.#name = x.#dept } to { y.#mgr | \x <-
process "select dept: e.dept from EMP e where e.name = 'peter'" using A, \y <- process "select
mgr: d.mgr from DEPT d where d.name ='" ^ x.#dept ^ "'" using B}. Here A and B are two different
relational databases, so we cannot use MigrateSelect to push the test x.#dept = y.#name to B. The reason is
that B, being a database different from A, has no access to the value of each instance of x.#dept. To enable such a
migration, the value of each instance of x.#dept must be passed dynamically to B, as shown in the rewritten query
above where x.#dept is dynamically concatenated into the SQL query select mgr: d.mgr from DEPT d where
d.name = ... to be passed to B. Note that, in general, x does not need to come from a relational database; we
simply need to look for equality tests involving the variable of the inner relation that we can migrate.
Example 6.4 The rule for migrating selections dynamically across two relational databases is implemented by
MigrateSelectDyn below. The function SQL.PushTestDyn is one of the many support routines available in the
current release of Kleisli that handle manipulation of SQL queries and other SYN abstract syntax objects.
fun MigrateSelectDyn (ExtSet (N, x, Read (S, p, String M))) =
if Annotations.IsSQL S
(* test if S is a relational database *)
then
let val Vars = SyntaxTools.FV N (* Vars are free variables in N *)
val Forbidden = SQLDict.ExtractLONG S ((!SQL.Mk) M)
(* can't migrate BLOB-like fields *)
val Hash = IntHashTable.mkTable (1, Error.Goofed "Oops!")
val _ = IntHashTable.insert Hash (x, Forbidden)
fun Ins y = IntHashTable.insert Hash (y, fn _ => false)
val _ = IntSet.app (fn y => if y = x then () else Ins y) Vars
val Vars' = IntSet.difference (Vars, IntSet.singleton x)
(* Vars' are free variables of the entire
expression. If there is a x.#l = y.#l'
for any y in Vars' inside N, then we
may have something to migrate! *)
val (Pve, Nve, N') = FlattenTests (fn N =>
(CheckPushTest Hash N) andalso (IntSet.member(SyntaxTools.FV N, x))) N
(* Pve, Nve are tests that can be migrated
dynamically. N' is what's left. *)
in if null Pve andalso null Nve
then NONE
else SOME (ExtSet (N', x, Read (S, p, Binary (
fn (X,Y)=> COString.Mk (SQL.PushTestDyn Pve Nve Vars' x (COString.Km X) (CvtTestCO S Y)),
String M,
Record (Record.MkTuple (map Variable (IntSet.listItems Vars'))))))) end
else NONE
| MigrateSelectDyn _ = NONE
The use of the Binary (f, E, V) construct above is notable. When the Kleisli engine encounters this construct,
it effectively executes f(E, V) using the values of E and V at that point. In our example, E happens to be the
original SQL query, V happens to store the values on which equality tests are to be performed, and f dynamically
pushes V into E!

□
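A one-line sketch of this semantics, with eval as a hypothetical interpreter from SYN to complex objects (CO),
would be:

fun evalBinary (eval: SYN -> CO) (Binary (f, E, V)) = f (eval E, eval V)
  | evalBinary _ _ = raise Fail "not a Binary node"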
Besides the four rules above, there is also a rule for reordering joins on two relational databases. While we
do not provide here the implementation of the reordering rule in Kleisli, let us use an example to explain this
optimization. Consider the join { y.#mgr | \x <- process "select * from EMP e where 1 = 1" using A,
\y <- process "select * from DEPT d where 1 = 1" using B, y.#name = x.#dept }. We could optimize
it as { y.#mgr | \x <- process "select dept: e.dept from EMP e where 1 = 1" using A, \y <-
process "select mgr: d.mgr from DEPT d where d.name = '" ^ x.#dept ^ "'" using B}. But we could
also optimize it as { y.#mgr | \y <- process "select mgr: d.mgr, name: d.name from DEPT d where
1 = 1" using B, \x <- process "select 1 from EMP e where e.dept ='" ^ y.#name ^ "'" using A}.
Assume that there is an index on the name field of DEPT; then the first optimization is good, because for each
x.#dept, the cost of looking up the corresponding manager in DEPT from B would be log n, where n is the
cardinality of DEPT. However, if such an index does not exist, that cost would be n instead of log n. In such a
case, if an index happens to exist for the dept field of EMP in A, the second optimization would have been much
better. More generally, in a nested loop, the order should be swapped if the outer relation is larger than the inner
relation and there is an index on the selected field of the outer relation which has good selectivity.
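The arithmetic behind this heuristic can be summarized as follows, writing m for the cardinality of EMP and
n for the cardinality of DEPT:

\[
  \mathrm{cost}_{\text{EMP outer}} =
    \begin{cases}
      O(m \log n) & \text{if DEPT.name is indexed,}\\
      O(mn)       & \text{otherwise;}
    \end{cases}
  \qquad
  \mathrm{cost}_{\text{DEPT outer}} =
    \begin{cases}
      O(n \log m) & \text{if EMP.dept is indexed,}\\
      O(mn)       & \text{otherwise.}
    \end{cases}
\]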
7 A DOE "Impossible" Query
Having seen the optimizations for queries that involve relational database sources, we now show a sample
bioinformatics query that benefits significantly from these optimizations. In fact, it is the very first bioinformatics
query implemented in Kleisli in 1994 [14], and was one of the so-called "impossible" queries of a US Department
of Energy Bioinformatics Summit Report (www.gdb.org/Dan/DOE/whitepaper/contents.html). The query was
to find, for each gene located on a particular cytogenetic band of a particular human chromosome, as many of its
non-human homologs as possible. Basically, the query means that for each gene in a particular position in the
human genome, find DNA sequences from non-human organisms that are similar to it.

In 1994, the main database containing cytogenetic band information was GDB [24], which was a Sybase
relational database. In order to find homologs, the actual DNA sequences were needed, as was the ability to
compare them. Unfortunately, that database did not keep actual DNA sequences. The actual DNA
sequences were kept in another database called GenBank [10]. At the time, access to GenBank was provided
through the ASN.1 version of Entrez [27], which was an extremely complicated retrieval system. Entrez also kept
precomputed homologs of GenBank sequences.

So this query needed the integration of GDB (a relational database located in Baltimore) and Entrez (a
non-relational "database" located in Bethesda). The query first extracted the names of genes on the desired
cytogenetic band from GDB, and then accessed Entrez for homologs of these genes. Finally, these homologs were
filtered to retain the non-human ones. This query was considered "impossible" as there was at that time no system
that could work across the bioinformatics sources involved, due to their heterogeneity, complexity, and
geographical locations. Given the complexity of this query, the CPL query given in [14] was remarkably short.
Since then, Kleisli has been used to power many bioinformatics applications [4, 12, 11, etc.].
Example 7.1 The query mentioned is shown below.¹
sybase-add (#name:"gdb", ...);
readfile locus from "locus_cyto_location" using gdb-read;
readfile eref from "object_genbank_eref" using gdb-read;
{(#accn: g.#genbank_ref, #nonhuman-homologs: H)
| \c <- locus, c.#chrom_num = "22",
\g <- eref, g.#object_id = c.#locus_id,
\H == { u
| \u <- na-get-homolog-summary(g.#genbank_ref),
not(u.#title string-islike "%Human%"),
not(u.#title string-islike "%H.sapien%")},
not (H = { })}
The first three lines connect to GDB and map two tables in GDB into Kleisli. After that, these two tables can be
referenced within Kleisli as if they were two locally defined sets, locus and eref. The next nine lines extract from
these tables the accession numbers of genes on Chromosome 22, use the Entrez function na-get-homolog-summary
to obtain their homologs, and filter these homologs for non-human ones.
Besides the obvious smoothness of integration of the two data sources, this query is also remarkably efficient.
On the surface, it seems to fetch the locus table in its entirety once and the eref table in its entirety n times
from GDB, as a naive evaluation of the comprehension would be two nested loops iterating over these two tables.
In reality, fortunately, the Kleisli optimizer is able to migrate the join, selection, and projections on these two
tables into a single efficient access to GDB using the optimization rules from Section 6. Furthermore, the accesses
to Entrez are also automatically made concurrent.
□

¹ Those who have read [14] will notice that the SQL flavor of the original implementation [14] has completely
vanished. This is because the current version of Kleisli has made significant advancements in interfacing with
relational databases.
Since the query above, Kleisli and its components have been used in a number of bioinformatics projects, such as
GAIA at the University of Pennsylvania (www.cis.upenn.edu/gaia2), TAMBIS at the University of Manchester
[4], and FIMM at Kent Ridge Digital Labs [26]. It has also been used in constructing databases in
pharmaceutical/biotechnology companies such as SmithKline Beecham, Schering-Plough, GlaxoWellcome,
Genomics Collaborative, Signature Biosciences, etc. Kleisli is also the backbone of GeneticXchange Inc.
(www.geneticxchange.com).
8 Warehousing of GenPept Reports
Besides the ability to query, assemble, and transform data from remote heterogeneous sources, it is also important
to be able to conveniently warehouse the data locally. The reasons to create local warehouses are several: (1) it
increases efficiency; (2) it increases availability; (3) it reduces the risk of unintended "denial of service" attacks on
the original sources; and (4) it allows more careful data cleansing that cannot be done on the fly.
The warehouse should be efficient to query and easy to update. Equally important in the biology arena is that
the warehouse should model the data in a conceptually natural form. Although a relational database system
is efficient for querying and easy to update, its native data model of flat tables forces us to unnaturally and
unnecessarily fragment our data in order to fit it into third normal form.
Kleisli does not have its own native database management system. Instead, Kleisli has the ability to turn many
kinds of database systems into an updatable store conforming to its complex object data model. In particular,
Kleisli can use flat relational database management systems such as Sybase, Oracle, MySQL, etc. as its
updatable complex object store. It can even use all of these systems simultaneously!
We illustrate this power of Kleisli using the example of GenPept reports. Kleisli provides several functions to
access GenPept reports remotely from Entrez [27]: aa-get-uid-general, which retrieves unique identifiers of
GenPept reports matching a search string; aa-get-seqfeat-general, which retrieves GenPept reports matching
a search string; aa-get-seqfeat-by-uid, which retrieves the GenPept report corresponding to a given unique
identifier; and so on. The National Center for Biotechnology Information imposes a quota on how many times a
foreign user can access Entrez in a day. Thus, it is prudent and desirable to incrementally "warehouse"
GenPept reports into a local database.
Example 8.1 Create a warehouse of GenPept reports. Initialize it to reports on protein tyrosine phosphatases.
! connect to our Oracle database system
oracle-cplobj-add (#name: "db", ...);
! create a table to store GenPept reports
db-mktable (#table: "genpept", #schema: (#uid: "NUMBER", #detail: "LONG"));
! initialize it with PTP data
writefile { (#uid: x.#uid, #detail: x) | \x <- aa-get-seqfeat-general "PTP"}
to "genpept" using db-write;
! index the uid field for fast access
db-mkindex (#table: "genpept", #index: "genpeptindex", #schema: "uid");
! let's use it now to see the title of report 131470
readfile GenPept from "genpept" using db-read;
{ x.#detail.#title | \x <- GenPept, x.#uid = 131470};
In this example, a table genpept is created in our Oracle database system. This table has two columns: uid
for recording the unique identifier and detail for recording the GenPept report. A LONG data type is used
for the detail column of this table. However, recall from Example 2.2 that each GenPept report is a highly
nested complex object. There is therefore a "mismatch" between LONG (which is essentially a big uninterpreted
string) and the complex structure of a GenPept report. This mismatch is resolved by Kleisli, which automatically
performs the appropriate encoding and decoding. Thus, as far as the Kleisli user is concerned, x.#detail has the
type of a GenPept report as given in Example 2.1, and he can ask for the title of a report as straightforwardly as
x.#detail.#title.
□
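As a further usage sketch (reusing only functions already introduced above), the fully decoded complex object can be pulled out of the warehouse and used like any other GenPept report in CPL:

! retrieve the decoded GenPept report for uid 131470 from the local warehouse
readfile GenPept from "genpept" using db-read;
{ x.#detail | \x <- GenPept, x.#uid = 131470 };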
Normally, when the daily quota for accessing Entrez is exhausted, aa-get-seqfeat-by-uid returns the empty set. However, it is possible to configure Kleisli so that a testable "null" value is returned instead. It is then useful to regularly examine the local warehouse and update "null" values to their proper GenPept records.
Example 8.2 A query to check for "null" values in the local warehouse and to replace them2 with proper GenPept reports. This query should be run regularly whenever new Entrez quota becomes available.
oracle-cplobj-add (#name: "db", ...);
readfile GenPept from "genpept" using db-read;
{ db-update (#table: "genpept", #selector: (#uid: y.#uid), #replacement: (#detail: y))
| \x <- GenPept, x.#detail = null, \y <- aa-get-seqfeat-by-uid (x.#uid)};
2 The oracle-cplobj-add ... and the readfile GenPept ... parts are not needed if the local warehouse is already connected.
□
It would also be convenient to provide a function my-aa-get-seqfeat-by-uid that first looks up the local warehouse for GenPept reports, instead of going straight to Entrez. It would be even more useful if it also automatically updated the warehouse whenever it needed to fetch new reports from Entrez.
Example 8.3 Implement a "memoized" version of aa-get-seqfeat-by-uid. It uses the before(E; f) construct of Kleisli, which ensures that the expression E is first evaluated to some value v, and then evaluates and returns f(v). Also implement a "memoized" version of aa-get-seqfeat-general.
oracle-cplobj-add (#name: "db", ...);
readfile GenPept from "genpept" using db-read;
primitive my-aa-get-seqfeat-by-uid == \u => { x.#detail | \x <- GenPept, x.#uid = u} before (\X =>
if X = { }
! the desired GenPept report was never fetched
then { x | \x <- aa-get-seqfeat-by-uid (u),
_ <- db-insert (#table: "genpept", #replacement: (#uid: x.#uid, #detail: x)) }
else if X = { null } ! quota ran out last time we tried to fetch
then { x | \x <- aa-get-seqfeat-by-uid (u),
_ <- db-update (#table:"genpept", #selector:(#uid: x.#uid), #replacement:(#detail: x))}
else X);
primitive my-aa-get-seqfeat-general == \s =>
{ x | \u <- aa-get-uid-general (s), \x <- my-aa-get-seqfeat-by-uid (u) };
□
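As a usage sketch, the memoized functions are drop-in replacements for the remote ones. For example, the title of report 131470 can be obtained through the warehouse-first route as follows; Entrez is contacted only on a cache miss.

! warehouse-first lookup; falls back to Entrez (and updates the warehouse) on a miss
{ x.#title | \x <- my-aa-get-seqfeat-by-uid (131470) };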
9 A Protein Patent Query
In this section, we demonstrate Kleisli in the context of querying protein patents. We use Kleisli to tie together the following sources to answer queries on protein patents that are considerably more demanding than simple free-text search: (1) the protein section of the Entrez system at the National Center for Biotechnology Information [27]; (2) the BLAST sequence homology service at the National Center for Biotechnology Information [2]; (3) the WU-BLAST2 sequence homology software from Washington University [1]; (4) the Isite system at the US Patent and Trademark Office (http://patents.uspto.gov); and (5) the structural classification of proteins database SCOP at the Cambridge MRC Laboratory of Molecular Biology [21].
Consider a pharmaceutical company that has a large choice of protein sequences to work on (i.e., to determine their functions). A criterion for selection is patent potential. A process involving many data sources and steps is required, as depicted in the figure below.
[Figure: Data flow of the protein patent query. The user's query sequence is matched against the SCOP domains to find its superfamily and that superfamily's representative sequences; the representatives are checked against the patented sequences obtained from Entrez and USPTO; unpatented representatives are compared against NR to find the remaining unpatented sequences in the superfamily, while patented representatives lead to patent abstracts and claims; both kinds of results are returned to the user.]
At the initial stage, we know only the amino acid sequences of these proteins and very little else. A question at this point would be: which of the sequences have already been patented? Existing patent search systems are IR systems that rely on English words. These systems suffer from the dichotomy of recall vs. precision [16]: they either return only the highly relevant information at the expense of missing a large portion of it, or return most of the highly relevant information at the expense of also returning a lot of irrelevant information. So using the standard interface to patent search systems is laborious and not always fruitful. Furthermore, at this early stage of the search, we do not even have an English word to search with; we have only the actual amino acid sequences! Thus, we need more reliable technology for comparing our protein sequences to those already patented.
There are, however, two things in our favour. First, protein patents are generally based on protein function and primary sequence structure (i.e., the linear string over the 20-letter amino acid alphabet). Second, tools that can reliably identify homology at the primary sequence structure level are available [25, 5]. Therefore, if patented protein sequences can be extracted from some database and prepared in a form suitable for such tools to operate on, we have a means to reliably identify which of our sequences have not yet been patented. We obtain the patented sequences from the protein section of Entrez [27]. These are "warehoused" locally for greater efficiency. We then use WU-BLAST2 [1] to compare our sequences against this warehouse for primary sequence structure homology.
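A minimal CPL sketch of this screening step is given below. Here OUR-SEQS (a set of records carrying a #seq field) is a hypothetical placeholder, while patent-blast and PSCORE-PATENT are the WU-BLAST2 connection and error threshold set up exactly as in Example 9.1.

! keep only our sequences that have no patented homologue within the threshold;
! OUR-SEQS is a hypothetical placeholder for the company's candidate sequences
{ s | \s <- OUR-SEQS,
      {} = { x | \x <- process s.#seq using patent-blast,
                 x.#pscore <= PSCORE-PATENT } };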
After the unpatented protein sequences have been identified, the second question is: which of these have the potential for wider patent claims? To understand this question it is necessary to recognize that protein patents are generally granted on a sequence and its function. While we do not know the functions of our proteins, because we have not worked on them yet, we know that proteins of the same evolutionary origin tend to have similar functions even if they differ significantly in their primary structure [20]. Using the terminology of SCOP, these proteins are in the same superfamily [21]. So one way to identify protein sequences that have the potential for wider patent claims is to find those having a large number of unpatented sequences in the same superfamily. Homology searching algorithms based on primary structure are generally not sufficiently sensitive to detect the majority of sequences in a typical superfamily [25], as the primary structures of distant members of the family are likely to have mutated significantly. So we need tools for homology at the tertiary structure level. No reliable automatic tools for this purpose exist at the moment, because structural similarity at the tertiary level does not necessarily imply similarity in function. Nevertheless, reliable manually constructed databases of superfamilies exist; a very nice one is the SCOP database [21]. Therefore, if we screen our unpatented sequences against SCOP, we can pull out other representative sequences in their superfamilies and check which ones have already been patented, thus identifying superfamilies with good potential. We use WU-BLAST2 for the screening.
After unpatented representatives of superfamilies have been found, it is still necessary to use them to fish out the rest of the unpatented members of the superfamilies. This step can be accomplished by using BLAST [2] to remotely compare the representatives against the huge nonredundant protein database (NR) curated at the National Center for Biotechnology Information.
Having found these potentially good protein sequences, we are ready to work on them and hopefully patent our results. We thus need to ask the third question: what is the relevant prior art? Retrieving the texts of patented sequences in the same superfamilies as our proteins is very helpful here. This step is complementary to the previous step and can be carried out using exactly the same technology.
Example 9.1 We describe the query to find unpatented sequences in the same superfamily as a user-supplied protein sequence, as it is the most complicated of the three questions mentioned above. The program is shown below.
webblast-blastp(#name: "nr-blast", #db: "nr", #level: 2);
! line 1
localblast-blastp(#name: "patent-blast", #db: "patent-aa/blast/fasta");
localblast-blastp(#name: "scop-blast", #db: "scop-aa/blast/fasta")
seqindex-scanseq(#name:"scop-index", #index:"scop-aa/seqindex", #level: 1);
scop-add "scop";
readfile scop-summary from "scop-aa/data/summary" using stdin;
! line 6
primitive scop-accn2uid == (set-index' (#accession, scop-summary)).#eq;
materialize "scop-accn2uid";
{(#title: z.#title, #accession: z.#accession, #uid: z.#uid,
#class: i.#desc.#cl, #fold: i.#desc.#cf, #superfamily: i.#desc.#sf,
#family: i.#desc.#fa, #protein: i.#desc.#dm, #species: i.#desc.#sp, ! line 11
#scop-pscore: x.#pscore, #nr-pscore: z.#p-n)
| \x <- process SEQ using scop-blast, x.#pscore <= PSCORE-SCOP,
\i <- process <#sidinfo: x.#accession> using scop,
\sf <- process <#numsid: i.#type.#sf> using scop,
\sfuid <- scop-accn2uid (sf),
! line 16
\y <- process <#get: sfuid.#uid> using scop-index,
{} = { x | \x <- process y.#seq using patent-blast, x.#pscore <= PSCORE-PATENT },
\z <- process y.#seq using nr-blast, z.#p-n <= PSCORE-NR };
As several data sources are used, we must first connect to them. We establish a connection nr-blast to BLAST at the National Center for Biotechnology Information for searching its NR database (line 1). The concurrency level of this connection is set to 2 for more efficient parallel access. We establish a connection patent-blast to WU-BLAST2 for searching against our local warehouse of patented proteins (line 2). We establish a connection scop-blast to WU-BLAST2 for searching against our local warehouse of SCOP representative sequences (line 3). We establish a connection scop-index to the index of SCOP representative sequences (line 4). Both warehouses and the index were constructed previously using Kleisli. We establish a connection scop to the SCOP classification database (line 5). These different connections to SCOP are needed because the SCOP database (line 5) contains only names and classifications but not the actual representative sequences. We keep our sequences using a proprietary sequence indexing technology, SeqIndex [23]. This index (line 4) allows us to quickly retrieve a sequence given an identifier or a pattern. Unfortunately, WU-BLAST2 requires the sequences to be stored in a different format; hence we need the third connection to SCOP (line 3). We must also deal with one more problem: the identifiers used by SCOP (lines 3, 5) are different from the identifiers used by SeqIndex (line 4). We therefore need a map between these identifiers, which is captured in the relation scop-summary (line 6). For quick access, we create a memory-resident index scop-accn2uid to map SCOP identifiers to SeqIndex identifiers (lines 7-8).
After setting up connections to the various data sources as described above, we are ready to issue our query to retrieve information about unpatented sequences in the same superfamily as our protein sequence SEQ. The information returned includes the title, accession, unique identifier, and classification of these sequences (lines 9-11). Also returned with each unpatented protein sequence is its pscore with respect to SCOP and NR (line 12). The pscore, reported by BLAST and WU-BLAST2 [2, 1], is a reliable estimate of the probability that the corresponding hit is a false positive. E.g., if BLAST returns a hit with a pscore of 0.001 to your sequence, then there is a one in a thousand chance of the hit being a fluke.
Let us step through the body of the program. Given the user sequence SEQ, we compare it against the representative sequences in SCOP; we keep only those hits x whose pscore is within the error threshold PSCORE-SCOP (line 13). Since each hit x is good, the superfamily of x can be taken as the superfamily of the input sequence SEQ. We find the superfamily of x by simply asking SCOP to return the SCOP classification i of x (line 14). The name and identifier of x's superfamily are stored in the #desc.#sf and #type.#sf fields of i, respectively. Next, we need to fish out all representatives of that superfamily from SCOP. SCOP gives us their SCOP identifiers sf (line 15). We convert these identifiers into the unique identifiers sfuid of the SeqIndex where the sequences are kept (line 16). The SeqIndex is then accessed to give us the actual representative sequences y (line 17). Each representative sequence y is compared against the patented sequences. We retain those that have no hits within the error threshold PSCORE-PATENT (line 18). These are representatives of our superfamily that are dissimilar to every patented sequence. We compare them against the NR database to fish out all sequences z that are similar to them within the error threshold PSCORE-NR (line 19). These are all the desired unpatented sequences in our superfamily. □
Although the details of this process may seem confusing and require knowledge of the available biomedical data resources, it should be clear that a high-level technology such as Kleisli [32, 33, 14] greatly simplifies the process of developing interesting integrated systems. Furthermore, the Kleisli/CPL programs that access multiple distributed databases and software tools are clear and concise, easily written, and easily modifiable. The ability to return information on protein patents as shown above goes well beyond the reach of existing patent information servers based on standard IR systems.
10 A Clinical Data Query
We discuss one last example application of Kleisli, this time in the context of querying clinical data. Clinical data usually involve historical records of patient visits. Queries over such data often involve the concept of a "window", which is difficult to express in SQL.
We illustrate this point using a clinical database of Hepatitis-B patients. The database has a table BLD_IMG of type {(#patient: string, #date: num, #ALT: num, ...)}.3 Each time a patient's ALT level is measured, a new record is added to this table. One typical query over this data might be: "Find patients who have been tracked for at least 300 days and whose ALT level is always between 70 and 200 units." Such a query is easily expressed in SQL using the GROUP-BY and HAVING constructs and the aggregate functions MIN and MAX. However, another typical query might be: "Find patients whose ALT levels have stayed between 70 and 200 units for a period of at least 300 days." This query is problematic for SQL.
The challenges posed by the second query are the following. First, the records must be grouped by patient. Second, the records in each group must be sorted in chronological order. Third, the sorted records in each group must be segmented into subgroups such that within each subgroup either all ALT levels are between 70 and 200 units or all ALT levels are outside this range; furthermore, the date span of each subgroup must not overlap that of another subgroup. Fourth, the records of a patient are returned if he has a subgroup that spans at least 300 days and all ALT levels in that subgroup are between 70 and 200 units. The first, second, and fourth steps are straightforward in SQL. Unfortunately, the third step cannot be done in SQL; it is precisely this segmentation that the split function of Example 10.1 below implements.
This query is also not expressible in the Kleisli system using just the core calculus NRC(Q, +, ·, −, ÷, Σ, =Q).
However, after a combinator for structural recursion [6] is imported into Kleisli, the query becomes expressible.
The combinator is list-sri, which corresponds to the fold operation in functional programming languages.
3 As we do not want to complicate our discussion with arithmetic on dates, we treat the value of the date attribute in our patient database as an integer counting the number of days elapsed since a particular fixed time point.
Example 10.1 We can define a function split in terms of list-sri so that split (d, c) S returns a list [(x1, Y1), ..., (xn, Yn)] such that the concatenation of Y1, ..., Yn is S sorted chronologically (by the date field d); x1, ..., xn alternate between true and false; and c(yj) = xi for each yj in Yi. In the implementation below, s2l converts a set into a list and list-gensort sorts a list using a given ordering.
primitive split == (\date, \check) =>
(list-sri ((\x, \y) =>
if y = []                          ! first record starts the first group
then [ (x.check, [x]) ]
else if y.list-head.#1 = x.check   ! same truth value: extend the current group
then (y.list-head.#1, x +] y.list-head.#2) +] y.list-tail
else (x.check, [x]) +] y, [])) o   ! truth value flips: start a new group
(list-gensort ((\x, \y) => x.date > y.date)) o s2l;
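To make the behaviour of split concrete, here is a small hypothetical trace; the records r1 to r4 and their check values are invented for illustration, with r1, r2, r3, r4 in chronological order.

! hypothetical trace: check(r1) = check(r2) = check(r4) = true, check(r3) = false,
! and r1, r2, r3, r4 are in chronological order:
!   split (#date, check) { r3, r1, r4, r2 }
!     = [ (true, [r1, r2]), (false, [r3]), (true, [r4]) ]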
Then the query "Find patients whose ALT levels have stayed between 70 and 200 units for a period of at least 300 days" can be defined as follows.
! access the hepatitis-B database on Oracle
oracle-cplobj-add (#name: "hepB", ...);
readfile BLD_IMG from "BLD_IMG" using hepB-read;
! define the "window" based on ALT level
primitive window == split (#date, \x => (x.#ALT > 70) andalso (x.#ALT <= 200));
! compute the answer
{ p | \k <- set-unique { x.#patient | \x <- BLD_IMG },
\P == { y | \y <- BLD_IMG, y.#patient = k },
{()} = { () | \g <--- window (P), g.#1,
(g.#2.list-rev.list-head.#date - g.#2.list-head.#date) >= 300},
\p <- P };
Incidentally, this implementation also demonstrates the seamless integration of Kleisli operations, such as the "window" above, with those of a relational database. □
11 Conclusion
The Kleisli system and its high-level query language CPL embody many advances that have been made in database query languages and in functional programming. Together they represent a substantial deployment of functional programming in an industrial-strength prototype that has made a significant impact on data integration in bioinformatics. Indeed, since the early Kleisli prototype was first applied to bioinformatics, it has been used to efficiently solve many data integration problems in the field. To date, we know of no other system that can express general bioinformatics queries as succinctly as Kleisli does with CPL.
There are several key ideas behind the success of the system. The first is its use of a complex object data model in which sets, bags, lists, records, and variants can be flexibly combined. The second is its use of a high-level query language (CPL) that allows these objects to be easily manipulated. The third is its use of a self-describing data exchange format, which serves as a simple conduit to external data sources. The fourth is its query optimizer, which is capable of many powerful optimizations. The last-but-not-least reason behind the success of the system is the choice of SML as its implementation platform, which enables a remarkably compact implementation consisting of about 45,000 lines of SML code. We have no doubt that, without this robust functional programming platform, implementing Kleisli would have demanded much more effort.
We would like to end this paper by acknowledging the contributions to Kleisli by our colleagues. The first prototype of Kleisli/CPL was designed and implemented in 1994, while Wong was at the University of Pennsylvania. Peter Buneman, Val Tannen, Leonid Libkin, and Dan Suciu contributed to the query language theory and foundational issues of CPL. Chris Overton introduced us to problems in bioinformatics. Kyle Hart helped us apply Kleisli to the first bioinformatics integration problem ever solved by Kleisli. Wong re-designed and re-implemented the entire system in 1995, when he returned to the Institute of Systems Science (now renamed Kent Ridge Digital Labs, following its corporatization). The new system, which is in production use in the pharmaceutical industry, has many new implementation ideas, much higher performance, much greater robustness, and much better support for bioinformatics. Desai Narasimhalu supported its development in Singapore. Oliver Wu, Jing Chen, and Jiren Wang added much to its bioinformatics support under funding from the Singapore Economic Development Board. Finally, S. Subbiah was responsible for taking Kleisli to market: it is now available commercially from GeneticXchange Inc. (www.geneticxchange.com).
References
[1] S. F. Altschul and W. Gish. Local alignment statistics. Methods in Enzymology, 266:460–480, 1996.
[2] S. F. Altschul et al. Basic local alignment search tool. Journal of Molecular Biology, 215:403–410, 1990.
[3] S. F. Altschul et al. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Research, 25(17):3389–3402, 1997.
[4] P. G. Baker et al. TAMBIS: Transparent access to multiple bioinformatics information sources. Intelligent Systems for Molecular Biology, 6:25–34, 1998.
[5] G. J. Barton and M. J. E. Sternberg. Evaluation and improvements in the automatic alignment of protein sequences. Protein Engineering, 1:89–94, 1987.
[6] V. Tannen et al. Structural recursion as a query language. In Proc. 3rd International Workshop on Database Programming Languages, pages 9–19, 1991.
[7] V. Tannen and R. Subrahmanyam. Logical and computational aspects of programming with sets/bags/lists. In LNCS 510: Proc. 18th International Colloquium on Automata, Languages, and Programming, pages 60–75, 1991.
[8] P. Buneman et al. Comprehension syntax. SIGMOD Record, 23(1):87–96, 1994.
[9] P. Buneman et al. Principles of programming with complex objects and collection types. Theoretical Computer Science, 149(1):3–48, 1995.
[10] C. Burks et al. GenBank. Nucleic Acids Research, 20 Supplement:2065–2069, 1992.
[11] J. Chen et al. Using Kleisli to bring out features in BLASTP results. Genome Informatics, 9:102–111, 1998.
[12] J. Chen et al. A protein patent query system powered by Kleisli. In Proc. ACM SIGMOD International Conference on Management of Data, pages 593–595, June 1998.
[13] E. F. Codd. A relational model of data for large shared data banks. Communications of the ACM, 13(6):377–387, 1970.
[14] S. Davidson et al. BioKleisli: A digital library for biomedical researchers. International Journal of Digital Libraries, 1(1):36–53, 1997.
[15] G. Dong et al. Local properties of query languages. In Proc. 6th International Conference on Database Theory, pages 140–154, 1997.
[16] W. B. Frakes and R. Baeza-Yates. Information Retrieval: Data Structures and Algorithms. Prentice Hall, 1992.
[17] A. Goldberg and R. Paige. Stream processing. In Proc. ACM Symposium on LISP and Functional Programming, pages 53–62, 1984.
[18] ISO. Standard 8824. Information Processing Systems. Open Systems Interconnection. Specification of Abstract Syntax Notation One (ASN.1), 1987.
[19] L. Libkin and L. Wong. Query languages for bags and aggregate functions. Journal of Computer and System Sciences, 55(2):241–272, 1997.
[20] H. Lodish et al. Molecular Cell Biology. W. H. Freeman, New York, 1995.
[21] A. Murzin et al. SCOP: A structural classification of proteins database for the investigation of sequences and structures. Journal of Molecular Biology, 247:536–540, 1995.
[22] National Center for Biotechnology Information, National Library of Medicine, Bethesda, MD. NCBI ASN.1 Specification, 1992. Revision 2.0.
[23] H.-H. Pang et al. S-Hash: An indexing scheme for approximate subsequence matching in large sequence databases. Technical report, Institute of Systems Science, Heng Mui Keng Terrace, Singapore 119597, 1997.
[24] P. Pearson et al. The GDB human genome data base anno 1992. Nucleic Acids Research, 20:2201–2206, 1992.
[25] W. R. Pearson. Comparison of methods for searching protein sequence databases. Protein Science, 4:1145–1160, 1995.
[26] C. Schoenbach et al. FIMM, a database of functional molecular immunology. Nucleic Acids Research, 28(1):222–224, 2000.
[27] G. D. Schuler et al. Entrez: Molecular biology database and retrieval system. Methods in Enzymology, 266:141–162, 1996.
[28] D. Suciu. Bounded fixpoints for complex objects. Theoretical Computer Science, 176(1–2):283–328, 1997.
[29] D. Suciu and L. Wong. On two forms of structural recursion. In LNCS 893: Proc. 5th International Conference on Database Theory, pages 111–124, 1995.
[30] P. Wadler. Comprehending monads. Mathematical Structures in Computer Science, 2:461–493, 1992.
[31] S. Walsh et al. ACEDB: A database for genome information. Methods of Biochemical Analysis, 39:299–318, 1998.
[32] L. Wong. The functional guts of the Kleisli query system. In Proc. 5th ACM SIGPLAN International Conference on Functional Programming, pages 1–10, 2000.
[33] L. Wong. Kleisli, a functional query system. Journal of Functional Programming, 10(1):19–56, 2000.