Journal of Cheminformatics: Orchem - An Open Source Chemistry Search Engine For Oracle

Download as pdf or txt
Download as pdf or txt
You are on page 1of 11

Journal of Cheminformatics

Software Open Access


OrChem - An open source chemistry search engine for Oracle®
Mark Rijnbeek* and Christoph Steinbeck

Address: Cheminformatics and Metabolism, European Bioinformatics Institute (EBI), Wellcome Trust Genome Campus, Cambridge, UK
Email: Mark Rijnbeek* - [email protected]; Christoph Steinbeck - [email protected]
* Corresponding author

Published: 22 October 2009 Received: 31 July 2009


Accepted: 22 October 2009
Journal of Cheminformatics 2009, 1:17 doi:10.1186/1758-2946-1-17
This article is available from: http://www.jcheminf.com/content/1/1/17
© 2009 Rijnbeek and Steinbeck; licensee BioMed Central Ltd.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0),
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract
Background: Registration, indexing and searching of chemical structures in relational databases is
one of the core areas of cheminformatics. However, little detail has been published on the inner
workings of search engines and their development has been mostly closed-source. We decided to
develop an open source chemistry extension for Oracle, the de facto database platform in the
commercial world.
Results: Here we present OrChem, an extension for the Oracle 11G database that adds
registration and indexing of chemical structures to support fast substructure and similarity
searching. The cheminformatics functionality is provided by the Chemistry Development Kit.
OrChem provides similarity searching with response times in the order of seconds for databases
with millions of compounds, depending on a given similarity cut-off. For substructure searching, it
can make use of multiple processor cores on today's powerful database servers to provide fast
response times in equally large data sets.
Availability: OrChem is free software and can be redistributed and/or modified under the terms
of the GNU Lesser General Public License as published by the Free Software Foundation. All
software is available via http://orchem.sourceforge.net.

Introduction now co-developed with collaborators world-wide as well


Registration, indexing and searching of chemical struc- as NMRShiftDB, an NMR database which provides a
tures in relational databases is one of the core areas of MySQL-based open source system for the registration and
cheminformatics [1,2]. Research on the topic goes back to searching of chemical compounds in a relational database
the 1960s and probably before that [3]. However, little [7,8]. In the meantime, three projects dedicated to the
detail has been published on the inner workings of search development of chemical registry and search capabilities
engines and developments have been mostly closed- for PostgreSQL (project PGChem) [9], MySQL (project
source. This has led to the situation that despite more than MyChem) [10] and Oracle (project OrChem) [11] have
thirty years of research and publications very few open ref- been established under the umbrella of the ChemiSQL-
erence code is available for use and study. The cheminfor- Project (pronounced "chemiscule") [12]. Here we report
matics open source community [4] has been working about our development of OrChem, an open source soft-
since the mid 1990s to overcome this problematic situa- ware package which adds registration and chemical search
tion. Our group has contributed to this by creating and capabilities to the Oracle 11G database system. OrChem
developing the Chemistry Development Kit (CDK) [5,6], is a mix of PL/SQL and Java that executes inside the data-

Page 1 of 11
(page number not for citation purposes)
Journal of Cheminformatics 2009, 1:17 http://www.jcheminf.com/content/1/1/17

base. Figure 1 gives an overview of OrChem's main com- and so quickly implement chemistry extensions for Ora-
ponents, showing how the user interacts with OrChem via cle. OrChem works in the same way, using the CDK where
PL/SQL packages that call out to so called "Java Stored possible.
Procedures" [13]. Starting with Oracle 11 g there is a just-
in-time (JIT) compiler for the Oracle JVM environment. A Fingerprinting
just-in-time (JIT) compiler is a program that converts Java OrChem uses chemical fingerprints to find compounds by
bytecode into machine language instructions which substructure or similarity criteria. Fingerprints are bitsets
makes Java run much faster than when the bytecode is in which each bit indicates the presence or absence of a
executed by an interpreter. particular chemical aspect. During a similarity search the
fingerprints are used to calculate a Tanimoto measure
OrChem is built on top of the Chemistry Development [14]. A Tanimoto similarity measure between two binary
Kit (CDK) [5,6] and depends on this Java library in fingerprints is defined by the ratio of the number of com-
numerous ways. For example, compounds are represented mon bits set to one to the total number of bits set to one
internally as CDK molecule objects, the CDK's I/O pack- in the two fingerprints [15]. For substructure searching the
age is used to retrieve compound data, and its subgraph fingerprint has a different function: it is used to screen
isomorphism algorithms are used for substructure valida- possible candidates before a computationally more
tion. OrChem adds its own Java layer on top of the CDK expensive isomorphism test.
to implement fast database storage and retrieval. With the
CDK loaded into Oracle, a large cheminformatics library For both substructure and similarity searching OrChem
becomes readily available to PL/SQL. With little effort uses the same fingerprint that currently measures around
developers can build database functions around the CDK 800 bits in size and uses both structural keys and hashed

Figure 1 overview
OrChem
OrChem overview.

Page 2 of 11
(page number not for citation purposes)
Journal of Cheminformatics 2009, 1:17 http://www.jcheminf.com/content/1/1/17

values. With structural keys each position in the bit string • Character 1: {"6","5"} indicates if the ring triplet
corresponds to a pre-defined structure or molecular fea- contains strictly 6-size rings or otherwise (up to three)
ture. A hashed fingerprint is produced by generating all 5-size rings
possible linear paths of connected molecules containing
between one and a defined number of atoms, and project- • Character 2: {"S","D"} indicates bond nature, S
ing the hash values of the resulting strings onto the small means single bonds only, D double bonds (possibly
set of bits in the fingerprint in a deterministic manner aromatic) also present
[16].
• Character 3: {"L","H"} indicates connectivity, H
In the OrChem fingerprint the first approximately one means some atom participates in all 3 rings (tightly
hundred bits are reserved for hashing three-atom SMILES coupled)
strings. If a compound contains for example "C-C:S" then
bit number 20 will be set, as the calculated hash value is • Character 4: {"L","M","H"} indicates shared bonds
20 for this particular example. A hash value is calculated between rings in the triplet: Low = 0,1,2 Med = 3,4
consistently for each pattern encountered. We only take High = 5..n
three-atom SMILES into account because these yield a dis-
tinct yet relatively small amount of possible combina- The aim of these descriptors is to characterize particular
tions, all of which can be properly hashed into the limited constellations of common sized rings, to identify struc-
amount of bits reserved for the hashed key in the finger- tures that look normal purely based on the ring sizes but
print. may be in fact rare due to the way the rings are connected
together in a cluster.
Hashing in general decreases the accuracy of Tanimoto
scoring because different chemical aspects will hash to the Future OrChem releases may see the structural key further
same bit position. However the benefit of hashing lies in extended for improvement of specific search types. For
flagging relatively rare patterns. Rare SMILES patterns now, in summary the fingerprint captures the following
would normally not be part of a structural key because aspects:
assigning an explicit bit would be wasteful, but by hashing
all size three patterns into a range of bit positions the • hashed three-atom SMILES strings
infrequent patterns still leave their mark and this can
speed up substructure searches. The OrChem fingerprint • element counts
was at first only structural and structure searches for an
unusual SMILES pattern like "O:C:O" saw many com- • atom pairs
pounds being screened in vain. The current fingerprint
hashes "O:C:O" to a bit position and this helps to narrow • atom nearest neighbours
down the set of candidates significantly.
• common SMILES patterns
Coming after the hashed part, the structural key in the
OrChem fingerprint starts around bit position one hun- • individual rings: size, aromaticity, elements
dred. The structural key was initially based on the
PubChem fingerprint [17,18] but the current version dif- • ring clusters: cluster size, what ring sizes in the clus-
fers from it in various ways. OrChem for instance flags ter
specific SMILES patterns that have proved discriminating
for searching compounds like those in ChEMBL [19]. • 5/6 size ring triplets: how are rings connected, how
Numerous bits are reserved to capture ring aspects: clus- many atoms and bonds are shared
ters of rings are detected and aspects of these clusters are
fingerprinted, like the occurrence of a ring size three • long carbon trails: longer series of (non aromatic)
attached to a ring size five. With regards to the common carbon molecules
rings of size five and six, OrChem creates textual descrip-
tors for every connected set of three of these encountered • infrequent SMILES patterns (grouped per bit)
and incorporates this information into the fingerprint.
Example textual descriptors are "5DLM" and or "6SHH", Figure 2 shows the frequency distribution of OrChem fin-
the meaning being as follows: gerprint bits for all compounds in the ChEMBL database.
The dual nature of the fingerprint is clearly visible: the first
part shows a randomly distributed hashed key, followed
by the ordered structural key. Ordering the structural key

Page 3 of 11
(page number not for citation purposes)
Journal of Cheminformatics 2009, 1:17 http://www.jcheminf.com/content/1/1/17

Figure 2 fingerprint distribution in Chembl database


OrChem
OrChem fingerprint distribution in Chembl database.

by occurrence of bits makes sense because the bits get from table(orchem_simsearch.search
indexed using composite B-tree indexes. Clustering rare
bits at the tail of the fingerprint makes the corresponding ('CCCCCC[C@H]1CC[C@H]2CCCC[C@]12C','SMILE
indexes small and provide an obvious pick for the Oracle S',0.7))}
optimizer.
At the center of the similarity search is a table called
Similarity searching orchem_fingprint_simsearch, pictured in Table 1.
With the fingerprint in place OrChem can perform fast The implemented algorithm uses the bitmap indexed col-
similarity searches. The algorithm for similarity searching umn bit_count to first inspect compounds for which
is taken from a paper by S. Joshua Swamidass and Pierre the number of bits set to one is the same as that of query
Baldi [15]. molecule. After that it works its way through compounds
with a bit count close to that of the query until it is done,
The algorithm proves to work well and allows the search that is until the minimum similarity has been reached or
to break out when a minimal given similarity is reached, until the result set size satisfies the user. Column fp in the
whilst completeness of the output is guaranteed. For similarity table stores a byte array representation of the
Orchem the similarity search has been implemented as a fingerprint. This raw data column allows the similarity
Java stored procedure that returns an array of results. The
current function accepts SMILES and Molfiles for a query Table 1: Database table with fingerprints optimized for similarity
and a cut-off value between 0 and 1 to indicate minimum searching
required similarity. Optional arguments allow the user to
TABLE orchem_fingprint_simsearch
indicate a cap to indicate the maximum number of results
required, and whether or not to display debugging infor-
id VARCHAR2(32)
mation. The select statement below shows a query exam-
ple to find compounds with a similarity of seventy percent bit_count NUMBER(4)
or more to the given SMILES string:
fp RAW(100)
select *

Page 4 of 11
(page number not for citation purposes)
Journal of Cheminformatics 2009, 1:17 http://www.jcheminf.com/content/1/1/17

search to quickly read bytes from the database, convert structure search for "P = O" will have OrChem create a
these to Java bitsets and use those for bitwise comparisons SQL clause using the three bits that are set in the finger-
to obtain a Tanimoto similarity score. print of "P = O". The pre-screening query will resemble:

Substructure searching select ...


Prescreening
Chemical fingerprints can be used to quickly pre-screen from orchem_fingprint_subsearch
candidates likely to contain a given query as a substructure
[20]. Due to the degeneracy of bits in a fingerprint, this where bit472='1' and bit477='1' and
leads to false positives. The screened set of candidates thus bit480='1'}
needs further inspection by an isomorphism algorithm to
detect if the substructure is truly present or not. A sub- The meaning of the bits is not relevant here, the point is
structure search is therefore a two step process - ideally the that the OrChem prescreening process is essentially done
first step uses the fingerprint to screen accurately so that through a SQL clause filtering for indexed bit columns
the computationally more expensive second step will that correspond to bits set in the fingerprint of the query.
have a high ratio of positive verifications. The other way The bitwise comparison is thus replaced entirely by a sin-
around, if the first step is not efficient, the second step will gle SQL statement with no need for computationally
have to inspect too many compounds that don't contain expensive Java.
the substructure at all.
Exact search for subgraph isomorphism
The screening process is in essence a bitwise comparison For each compound that passes the prescreening process,
between the two fingerprints of a query and a candidate a graph matching algorithm needs to establish whether
structure. Any bit set to true in the query fingerprint must the query indeed occurs as a substructure or not. Mole-
be set to true in a candidate fingerprint. OrChem imple- cules can easily be interpreted as graphs with atoms being
ments this comparison using a dedicated database table nodes and bonds being edges. The graph matching in
that is partly listed in Table 2. Database table Orchem is done by a CDK Java implementation of the VF2
orchem_fingprint_subsearch has a separate col- algorithm [21]. VF2 is a graph matching algorithm that
umn for each bit in the fingerprint; this is different from has been shown to perform better than the Ullmann algo-
the table for similarity searching that has a single raw data rithm for small graphs, and much better than Ullmann for
column to store an entire fingerprint. The reason behind large graphs. Compared to the original VF algorithm, VF2
this difference is that for similarity searching the bitwise lowers the memory requirements from O(n2) to O(n)
operation is done in Java with binary bitsets, whereas the with respect to the number of nodes in the graphs.
substructure screening uses the separate bit columns to
construct a dynamic SQL clause instead. This bit-column VF2 is a fast backtracking algorithm that tries to match
based "where" clause instructs the database to select com- each node in a query graph to a node in a target graph.
pounds for which bits in the fingerprint set to "1" include OrChem further improves VF2 performance by reorganiz-
those set to "1" in the query structure. For instance a sub- ing the query graph beforehand by sorting the nodes
(atoms) and edges (bonds). A primary sort is done based
Table 2: Database table with fingerprints optimized for on the uniqueness of elements and a secondary sort on
substructure searching
bond connectivity. Essentially the sort moves the rare
TABLE orchem_fingprint_subsearch nodes up to be matched first: if the query structure is
C16H14N2O9S then it is best to let VF2 try to match sul-
id VARCHAR2(80) fur first, then the nitrogens, then the oxygens et cetera. The
secondary sort on bond connectivity can be useful if the
atoms VARCHAR2(4000) query does not contain distinctive atoms but has distinc-
tive structures such as ring groups in which some atoms
bonds VARCHAR2(4000) are more connected than others.
bit0 VARCHAR2(1)
The performance of the Java VF2 implementation mainly
depends on the complexity and characteristics of the
bit1 VARCHAR2(1)
molecular input graphs. The algorithm may need to get
etc...
recursively deep, or may need to explore many possible
mappings, making the graph matching algorithm the
bit799 VARCHAR2(1) computationally expensive part of the substructure
search.

Page 5 of 11
(page number not for citation purposes)
Journal of Cheminformatics 2009, 1:17 http://www.jcheminf.com/content/1/1/17

Query example ilar tests on the PubChem sample (3.5 million com-
The developer can submit an OrChem substructure search pounds) show a longer cache "warm up" time and give
through package orchem_subsearch: throughput times in the order of several seconds or
longer, again mostly depending on the similarity cut-off.
select * Figure 3 illustrates performance for searching the
PubChem sample and also shows a table with average
from table (orchem_subsearch.search numbers with regards to query throughput time and result
set sizes.
('S:C:C','SMILES',1000))
Substructure searching
The example shows a substructure search for SMILES As described before, substructure searching is done as a
"S:C:C" with a break out after 1000 results through an two-step process with a prescreening step followed by a
optional argument that emulates the ROWNUM column. VF2 isomorphism check. In the isomorphism algorithm
Under the hood the SMILES string will be fingerprinted, a each screened database compound needs to be material-
prescreen query will be constructed based on it and then ized as a CDK molecule to be able to compare it to the
executed with the Oracle Optimizer picking the most suit- query structure. It can however be expensive to materialize
able index available for the bit columns. Each candidate thousands of molecules on the fly using database Molf-
will be verified using VF2 and only valid superstructures iles, particularly with regards to calculation of aromaticity.
will be returned. The substructure search function is pipe- Instead, OrChem stores data for each molecule in col-
lined, so rows are returned iteratively as they are produced umns atoms and bonds of table
without having to wait for the entire collection to be con- orchem_fingprint_subsearch during the finger-
structed. This allows developers to build web interfaces printing process. These columns provide a quick way to
refreshing at a constant interval while presenting the materialize a basic CDK molecule to be passed into the
results returned so far to the user. VF2 algorithm. The data structures used are quite straight-
forward, for instance with data in column atom "C O"
Performance interpreted as: "atom 0 is Carbon, atom 1 is Oxygen" and
General bond column "0 1 D Y" then implying "there is a bond
OrChem has been tested extensively on a release of the between C (atom 0) and O (atom 1) that is double (D)
ChEMBL database (former Starlite) with around 420,000 and aromatic is true (Y)". In this way, CDK molecules can
compounds. Additional tests were done on a random be generated very fast without the need for calculating any
PubChem data sample of 3.5 million compounds in order properties during the search.
to assess performance on a larger data set. All tests were
done on Oracle 11.1.0.7 installed on an eight CPU quad Parallel substructure searching
core 32 Gb RedHat Linux server at the EBI. With regards to Oracle allows parallel execution of table functions [13]
performance, it is good to keep in mind that the initial run and this feature can be used to speed up Or-Chem sub-
of any Java stored procedure incurs a lot of overhead. Even structure searching. A parallel table function returns a col-
a first call to a simple 'helloWorld()' program takes a lection and is executed in a two-stage operation. First, one
while to complete, but following calls then respond set of slave processes partitions the data as directed in the
immediately. Orchem's search performance improves function's declaration; then a second set of slave scans
after a few queries have been issued and the fingerprint executes the table function in parallel on the partitioned
data starts to accumulate in the buffer cache where data data. The following statement illustrates the concept, with
can be accessed faster than by reading from disk. The f being the function to be run in parallel, taking a ref cur-
tables that support similarity and substructure searching sor input argument to partition the data.
are created with the cache option, so the more memory
assigned to the SGA the better the performance will be. select *

Similarity searching from table(f(cursor(select * from tab)))


Similarity searches typically show a more consistent per-
formance than substructure searches. To obtain a similar- Although the VF2 isomorphism algorithm is an ideal can-
ity score it is sufficient to compare fingerprint bitsets, and didate for a parallel approach, going parallel is not always
the time to complete a similarity search depends mainly the best solution. OrChem can execute non-parallel queries
on the required minimum Tanimoto similarity. Tests on relatively fast providing the query structure itself has suffi-
ChEMBL show query throughput times in the order of cient unique features. The prescreening step for a discrim-
split-seconds to seconds for high (>0.75) similarity, inating query will normally yield a small set of database
improving once the fingerprints start getting cached. Sim- candidate compounds that can quickly be scanned with

Page 6 of 11
(page number not for citation purposes)
Journal of Cheminformatics 2009, 1:17 http://www.jcheminf.com/content/1/1/17

Similarity Average compounds found Average throughput time (seconds)


0.5 14,541 18.3
0.65 1,936 10.6
0.8 226 5.8
0.95 13 1.4

Similarity
Figure 3 search throughput time for different minimal Tanimoto similarity
Similarity search throughput time for different minimal Tanimoto similarity. Explanation of graph: one hundred
randomly sampled compounds were used to query for all similar compounds, repeated for different minimum Tanimoto simi-
larities. Searches were done in the PubChem sample of 3.5 million compounds. As the similarity cutoff increases, performance
goes up: finding all compounds similar to compound X with at least a Tanimoto similarity of 0.8 (= 80% similar) is faster than
finding (many more) compounds that are 50% similar.

VF2. In such as case a non-parallel search can actually run case of Orchem this means that table
faster than a parallel search because there is not much VF2 orchem_fingprint_subsearch must be accessed
workload and no parallel overhead is incurred. But most with full table scans, but this is not always the best option.
importantly, the non-parallel search can use the regular B- When queries are done on a database with several million
tree indexes on the bit columns in table compounds the indexed approach can outperform the
orchem_fingprint_subsearch whereas a parallel parallel approach easily if the index can be used to quickly
query can not. narrow down a small set of candidates. Furthermore the
performance of a parallel query depends on a number of
In general, to parallelize any query Oracle must access at factors, one of them being the hardware on which the
least one of the tables through a full table scan or an index Oracle instance runs, with the more CPU cores the better.
through a range scan involving multiple partitions. In the The performance also depends on the size of the database,

Page 7 of 11
(page number not for citation purposes)
Journal of Cheminformatics 2009, 1:17 http://www.jcheminf.com/content/1/1/17

the type of the query (specific/generic) and the volume of


the result set. Tests at the EBI show that parallel queries in
general perform well on the ChEMBL dataset, even if a
query is very specific and an indexed approach would be
quicker. This can be explained by the fact that the specific
EBI hardware can run very fast full table scans on a dataset
the size of ChEMBL. For larger databases this no longer
holds true, as full scans become quite expensive even
when run in parallel.

The invocation of a parallel substructure search needs to


be done in two steps due to the underlying implementa-
tion - possibly in a later release this can be simplified.
Below is an example query as done on the command line.
First the substructure search is quickly set up, in this case
for SMILES "C:O:C:N". A key is retrieved and this key is
then used to perform the actual parallel search:

> var key number;

> exec :key :=

orchem_subsearch_par.setup('C:O:C:N','SMI
LES') Figurecompounds
Performance
million 4 of substructure search for "C:C:C:C:N" on 3.5
Performance of substructure search for "C:C:C:C:N"
> select * from table on 3.5 million compounds.

(orchem_subsearch_par.search(:key,1000)

The parallel table function works best for general queries rare bit combination, without being able to use any index
with high volumes of data being processed and many pos- for this. Finally in figure 6 we have a graph with through-
itive verifications being returned, the VF2 workload put times for a sampled set of sixty random structures
shared over several slave processes. This can be observed used to do a parallel substructure searching with different
when selecting the first n results for a common substruc- breakout values. The graph shows quite a spread in total
ture pattern like "C:C:C:C:N" with n set to a high value. throughput time, with the time to return up to 100,000
Figure 4 shows a graph for a "C:C:C:C:N" query done verified results varying between 20 seconds and more
against the PubChem sample: the parallel function iden- than three minutes, all depending on the complexity of
tifies the first 5000 results in 3.5 seconds whereas the sin- the query, the quality of the fingerprint and the occur-
gle process takes 22.5 seconds. The parallel approach rence rate of the structure. Either way users can see results
clearly benefits from doing the VF2 isomorphism checks streamed back before completion because data is piped
spread over multiple parallel processes, which becomes back as soon as it becomes available.
even more obvious when n is further increased to 25000.
However to add to the picture, figure 5 shows what hap- In conclusion the given examples show how the database
pens with a similar test for "C:O:C:O". This structure hap- content and the nature of the query determine the differ-
pens to occur only once in the entire data set, and the non- ence in performance between the parallel and the non-
parallel function can use a fast index scan on the bit col- parallel substructure search. Which one is fastest depends
umns of orchem_fingprint_subsearch to quickly mainly on the size of the database, the number of results
find three possible candidates with the right combination requested (limited or all) and the size of the prescreened
of fingerprint bits. One compound is then verified by VF2 compound set. The smaller the pre-screened set is the less
to be a superstructure of "C:O:C:O" (compound is shown attractive a parallel full scan normally is - but it is hard to
in the graph). The parallel approach at the other hand find exact rules. On top of that performance also depends
needs to make full table scans, albeit in parallel, and takes on other factors such as the server hardware and the qual-
more than forty seconds to find the same single result. The ity of the query's fingerprint. To make the search process
benefit of fast parallel VF2 execution does not apply now more user friendly, we plan to find heuristic rules to be
and the elapse time is spent scanning rows that meet the incorporated into a generic search function that will

Page 8 of 11
(page number not for citation purposes)
Journal of Cheminformatics 2009, 1:17 http://www.jcheminf.com/content/1/1/17

Figure 5
Performance of substructure search for "C:O:C:O" on 3.5 million compounds
Performance of substructure search for "C:O:C:O" on 3.5 million compounds.

decide for the user whether to take a parallel approach or installation is complete and OrChem can perform com-
not. pound searches.

OrChem Installation OrChem future development


This paragraph briefly describes the OrChem installation A number of features have been identified to be imple-
process. A complete step-by-step description of the instal- mented in coming releases:
lation can be found in manuals on the project's web pages
[11]. The first step is to create a new schema and necessary • The current version of OrChem expects the user's
database elements (tables, indexes, packages) and to load compound table to have MDL Molfiles to work with.
required Java libraries into it. Users have to provide details This will be extended to include other common data
of the base table containing the actual compound data, formats such as CML.
presumed to be present in another schema. Next a proce-
dure needs to run to create a fingerprint for every com- • OrChem should be able to deal with R-group and
pound. Each fingerprint captures the chemical aspects of SMARTS queries and to ignore bond order on request.
a compound and is stored in the database. The amount of These features have in common that wildcards should
time it takes to create all fingerprints depends on the be allowed to widen the search, and the challenge will
amount and complexity of the compounds and on the be to not only interpret these wildcards but also to
capacity of the database server. As a performance indica- keep performance acceptable.
tion, at the EBI we fingerprinted over 400,000 compounds
in an hour, running a parallel process with DBMS_JOB on • At present, OrChem development has not put partic-
an Oracle instance hosted on a multi-processor Linux ular emphasis on stereochemical searches. OrChem
server. Once the fingerprinting procedure has finished, needs to be able to distinguish between stereoisomers

Page 9 of 11
(page number not for citation purposes)
Journal of Cheminformatics 2009, 1:17 http://www.jcheminf.com/content/1/1/17

Figure substructure
Parallel 6 search throughput time on 3.5 million compounds
Parallel substructure search throughput time on 3.5 million compounds. Explanation of graph: sixty sampled sub-
structures were used as a query for a parallel substructure search in the PubChem sample of 3.5 million compounds. The same
searches were repeated with different maximum number of rows requested: 1000 (bottom), 10,000 (middle) and 100,000
(top). The graph displays the total throughput time in seconds but users can view the intermediate output generally faster as
the search function spools ('pipes') results as they become available. Fastest throughput times are observed for query struc-
tures that commonly exist in compounds, resulting in high a success ratio of the VF2 isomorphism algorithm.

and provide substructure search criteria related to ster- cheminformatics functionality but is only in its first
eoisomerism. release and by no means a finished product. Developers
who are interested to further extend OrChem are invited
• More of the CDK needs to be exposed through PL/ to participate through Sourceforge to make it a truly col-
SQL API's, for example CDK functions to calculate laborative project. We also encourage users to submit sug-
QSAR/QSPR descriptors or chemical properties. gestions and requests for functionality through the
mailing list "[email protected]".
• The similarity search will benefit from an option to
run in parallel, and more types of similarity scoring Availability and requirements
should be added to complement the current default Project name: OrChem
Tanimoto scoring.
Project home page: http://orchem.sourceforge.net/
• Along the same lines we might want to experiment
with different types of fingerprints. Operating system: Platform independent

• Pharmacophore searches could be added to take into language: Java 1.5 or higher
account 3D arrangements of molecular features.
Database system: Oracle 11 g (with JRE 1.5)
Conclusion
We have presented OrChem, an open source extension for Comm. restrictions: None
the Oracle 11G database platform that adds registration
and indexing of chemical structures to support fast sub-
structure and similarity searching. OrChem provides core

Page 10 of 11
(page number not for citation purposes)
Journal of Cheminformatics 2009, 1:17 http://www.jcheminf.com/content/1/1/17

License time. Journal of Chemical Information and Modeling 2007,


47(2):302-317.
OrChem is released under the terms of the GNU Lesser 16. Leach AR, Gillet VJ: An Introduction to Chemoinformatics.
General Public License as published by the Free Software 2007:255.
Foundation; either version 2.1 of the License, or (at your 17. Sayers E, Barrett T, Benson D, Bryant S: Database resources of
the national center for biotechnology information. Nucleic
option) any later version. Acids Research 2009, 37:D5-D15.
18. Definitions of PubChem fingerprints 2004 [ftp://
ftp.ncbi.nlm.nih.gov/pubchem/specifications/
Competing interests pubchem_fingerprints.txt].
The authors declare that they have no competing interests. 19. The ChEMBL project at the European Bioinformatics Insti-
tute 2004 [http://www.ebi.ac.uk/chembl].
20. Barnard J: Substructure searching methods: Old and new. Jour-
Authors' contributions nal of Chemical Information and Modeling 1993, 33(4):532-538.
MR designed and implemented the software and drafted 21. Cordella L, Foggia P, Sansone C, Vento M: An improved algorithm
most of the manuscript. CS guided the development proc- for matching large graphs. Proc of the 3rd IAPR-TC-15 International
Workshop ... 2001.
ess and provided expertise on usage of the CDK. Both
authors performed testing of the application, and
approved the final manuscript.

Acknowledgements
We are grateful to John Overington and Bissan Al-Lazikani for making the
ChEMBL database available to us for testing. We thank the Pubchem team
for assembling this great collection of molecular structures and for opening
up their fingerprint definition. Our particular thanks go to the Chemistry
Development Kit (CDK) community for helping us to build this great
resource. Finally we'd like to thank Federico Paoli from Siena Biotech for
participating in OrChem development.

References
1. Hagadone T, Michael S: Integrating chemical structures into an
extended relational database system. Chemical Structures 2: The
International Language of Chemistry: Proceedings of the 2nd International
Conference 1995:257-269.
2. Hagadone T, Schulz M: Capturing Chemical Structure Informa-
tion in a Relational Database System: The Chemical Soft-
ware Component Approach. Journal of Chemical Information and
Computer Sciences 1995, 35(5):879-884.
3. Frome J: Searching Chemical Structures. Journal of Chemical Doc-
umentation 1964:43-45.
4. Guha R, Howard MT, Hutchison GR, Murray-Rust P, Rzepa H, Stein-
beck C, Wegner J, Willighagen EL: The Blue Obelisk - Interoper-
ability in Chemical Informatics. Journal of Chemical Information
and Modelling 2005, 46(3):991-998.
5. Steinbeck C, Han YQ, Kuhn S, Horlacher O, Luttmann E, Willighagen
E: The Chemistry Development Kit (CDK): An open-source
Java library for chemo- and bioinformatics. Journal of chemical
information and computer sciences 2003, 43(2):493-500.
6. Steinbeck C, Hoppe C, Kuhn S, Guha R, Willighagen EL: Recent
Developments of The Chemistry Development Kit (CDK) -
An Open-Source Java Library for Chemo- and Bioinformat-
ics. Current pharmaceutical design 2006, 12(17):2111-2120.
7. Steinbeck C, Krause S, Kuhn S: NMRShiftDB - Constructing a
free chemical information system with open-source compo-
nents. Journal of chemical information and computer sciences 2003,
43(6):1733-1739. Publish with ChemistryCentral and every
8. Steinbeck C, Kuhn S: NMRShiftDB - compound identification
and structure elucidation support through a free commu- scientist can read your work free of charge
nity-build web database. Phytochemistry 2004, 65(19):2711-2717. Open access provides opportunities to our
9. The PGChem project 2004 [http://pgfoundry.org/projects/
pgchem/]. colleagues in other parts of the globe, by allowing
10. The MyChem project 2004 [http://mychem.sourceforge.net]. anyone to view the content free of charge.
11. The OrChem project 2004 [http://orchem.sourceforge.net]. W. Jeffery Hurst, The Hershey Company.
12. The ChemiSQL umbrella project for Open Source chemical
search engines 2004 [http://chemdb.sourceforge.net]. available free of charge to the entire scientific community
13. Oracle 11.1 database Java developer's guide (B31225-03)
[http://tinyurl.com/yh5kb7r] peer reviewed and published immediately upon acceptance
14. Willett P: Chemical Similarity Searching. Journal of Chemical cited in PubMed and archived on PubMed Central
Information and Modeling 1998, 38(6):983-996. yours you keep the copyright
15. Swamidass S, Baldi P: Bounds and algorithms for fast exact
searches of chemical fingerprints in linear and sublinear Submit your manuscript here:
http://www.chemistrycentral.com/manuscript/

Page 11 of 11
(page number not for citation purposes)

You might also like