Journal of Cheminformatics: Orchem - An Open Source Chemistry Search Engine For Oracle
Journal of Cheminformatics: Orchem - An Open Source Chemistry Search Engine For Oracle
Journal of Cheminformatics: Orchem - An Open Source Chemistry Search Engine For Oracle
Address: Cheminformatics and Metabolism, European Bioinformatics Institute (EBI), Wellcome Trust Genome Campus, Cambridge, UK
Email: Mark Rijnbeek* - [email protected]; Christoph Steinbeck - [email protected]
* Corresponding author
Abstract
Background: Registration, indexing and searching of chemical structures in relational databases is
one of the core areas of cheminformatics. However, little detail has been published on the inner
workings of search engines and their development has been mostly closed-source. We decided to
develop an open source chemistry extension for Oracle, the de facto database platform in the
commercial world.
Results: Here we present OrChem, an extension for the Oracle 11G database that adds
registration and indexing of chemical structures to support fast substructure and similarity
searching. The cheminformatics functionality is provided by the Chemistry Development Kit.
OrChem provides similarity searching with response times in the order of seconds for databases
with millions of compounds, depending on a given similarity cut-off. For substructure searching, it
can make use of multiple processor cores on today's powerful database servers to provide fast
response times in equally large data sets.
Availability: OrChem is free software and can be redistributed and/or modified under the terms
of the GNU Lesser General Public License as published by the Free Software Foundation. All
software is available via http://orchem.sourceforge.net.
Page 1 of 11
(page number not for citation purposes)
Journal of Cheminformatics 2009, 1:17 http://www.jcheminf.com/content/1/1/17
base. Figure 1 gives an overview of OrChem's main com- and so quickly implement chemistry extensions for Ora-
ponents, showing how the user interacts with OrChem via cle. OrChem works in the same way, using the CDK where
PL/SQL packages that call out to so called "Java Stored possible.
Procedures" [13]. Starting with Oracle 11 g there is a just-
in-time (JIT) compiler for the Oracle JVM environment. A Fingerprinting
just-in-time (JIT) compiler is a program that converts Java OrChem uses chemical fingerprints to find compounds by
bytecode into machine language instructions which substructure or similarity criteria. Fingerprints are bitsets
makes Java run much faster than when the bytecode is in which each bit indicates the presence or absence of a
executed by an interpreter. particular chemical aspect. During a similarity search the
fingerprints are used to calculate a Tanimoto measure
OrChem is built on top of the Chemistry Development [14]. A Tanimoto similarity measure between two binary
Kit (CDK) [5,6] and depends on this Java library in fingerprints is defined by the ratio of the number of com-
numerous ways. For example, compounds are represented mon bits set to one to the total number of bits set to one
internally as CDK molecule objects, the CDK's I/O pack- in the two fingerprints [15]. For substructure searching the
age is used to retrieve compound data, and its subgraph fingerprint has a different function: it is used to screen
isomorphism algorithms are used for substructure valida- possible candidates before a computationally more
tion. OrChem adds its own Java layer on top of the CDK expensive isomorphism test.
to implement fast database storage and retrieval. With the
CDK loaded into Oracle, a large cheminformatics library For both substructure and similarity searching OrChem
becomes readily available to PL/SQL. With little effort uses the same fingerprint that currently measures around
developers can build database functions around the CDK 800 bits in size and uses both structural keys and hashed
Figure 1 overview
OrChem
OrChem overview.
Page 2 of 11
(page number not for citation purposes)
Journal of Cheminformatics 2009, 1:17 http://www.jcheminf.com/content/1/1/17
values. With structural keys each position in the bit string • Character 1: {"6","5"} indicates if the ring triplet
corresponds to a pre-defined structure or molecular fea- contains strictly 6-size rings or otherwise (up to three)
ture. A hashed fingerprint is produced by generating all 5-size rings
possible linear paths of connected molecules containing
between one and a defined number of atoms, and project- • Character 2: {"S","D"} indicates bond nature, S
ing the hash values of the resulting strings onto the small means single bonds only, D double bonds (possibly
set of bits in the fingerprint in a deterministic manner aromatic) also present
[16].
• Character 3: {"L","H"} indicates connectivity, H
In the OrChem fingerprint the first approximately one means some atom participates in all 3 rings (tightly
hundred bits are reserved for hashing three-atom SMILES coupled)
strings. If a compound contains for example "C-C:S" then
bit number 20 will be set, as the calculated hash value is • Character 4: {"L","M","H"} indicates shared bonds
20 for this particular example. A hash value is calculated between rings in the triplet: Low = 0,1,2 Med = 3,4
consistently for each pattern encountered. We only take High = 5..n
three-atom SMILES into account because these yield a dis-
tinct yet relatively small amount of possible combina- The aim of these descriptors is to characterize particular
tions, all of which can be properly hashed into the limited constellations of common sized rings, to identify struc-
amount of bits reserved for the hashed key in the finger- tures that look normal purely based on the ring sizes but
print. may be in fact rare due to the way the rings are connected
together in a cluster.
Hashing in general decreases the accuracy of Tanimoto
scoring because different chemical aspects will hash to the Future OrChem releases may see the structural key further
same bit position. However the benefit of hashing lies in extended for improvement of specific search types. For
flagging relatively rare patterns. Rare SMILES patterns now, in summary the fingerprint captures the following
would normally not be part of a structural key because aspects:
assigning an explicit bit would be wasteful, but by hashing
all size three patterns into a range of bit positions the • hashed three-atom SMILES strings
infrequent patterns still leave their mark and this can
speed up substructure searches. The OrChem fingerprint • element counts
was at first only structural and structure searches for an
unusual SMILES pattern like "O:C:O" saw many com- • atom pairs
pounds being screened in vain. The current fingerprint
hashes "O:C:O" to a bit position and this helps to narrow • atom nearest neighbours
down the set of candidates significantly.
• common SMILES patterns
Coming after the hashed part, the structural key in the
OrChem fingerprint starts around bit position one hun- • individual rings: size, aromaticity, elements
dred. The structural key was initially based on the
PubChem fingerprint [17,18] but the current version dif- • ring clusters: cluster size, what ring sizes in the clus-
fers from it in various ways. OrChem for instance flags ter
specific SMILES patterns that have proved discriminating
for searching compounds like those in ChEMBL [19]. • 5/6 size ring triplets: how are rings connected, how
Numerous bits are reserved to capture ring aspects: clus- many atoms and bonds are shared
ters of rings are detected and aspects of these clusters are
fingerprinted, like the occurrence of a ring size three • long carbon trails: longer series of (non aromatic)
attached to a ring size five. With regards to the common carbon molecules
rings of size five and six, OrChem creates textual descrip-
tors for every connected set of three of these encountered • infrequent SMILES patterns (grouped per bit)
and incorporates this information into the fingerprint.
Example textual descriptors are "5DLM" and or "6SHH", Figure 2 shows the frequency distribution of OrChem fin-
the meaning being as follows: gerprint bits for all compounds in the ChEMBL database.
The dual nature of the fingerprint is clearly visible: the first
part shows a randomly distributed hashed key, followed
by the ordered structural key. Ordering the structural key
Page 3 of 11
(page number not for citation purposes)
Journal of Cheminformatics 2009, 1:17 http://www.jcheminf.com/content/1/1/17
by occurrence of bits makes sense because the bits get from table(orchem_simsearch.search
indexed using composite B-tree indexes. Clustering rare
bits at the tail of the fingerprint makes the corresponding ('CCCCCC[C@H]1CC[C@H]2CCCC[C@]12C','SMILE
indexes small and provide an obvious pick for the Oracle S',0.7))}
optimizer.
At the center of the similarity search is a table called
Similarity searching orchem_fingprint_simsearch, pictured in Table 1.
With the fingerprint in place OrChem can perform fast The implemented algorithm uses the bitmap indexed col-
similarity searches. The algorithm for similarity searching umn bit_count to first inspect compounds for which
is taken from a paper by S. Joshua Swamidass and Pierre the number of bits set to one is the same as that of query
Baldi [15]. molecule. After that it works its way through compounds
with a bit count close to that of the query until it is done,
The algorithm proves to work well and allows the search that is until the minimum similarity has been reached or
to break out when a minimal given similarity is reached, until the result set size satisfies the user. Column fp in the
whilst completeness of the output is guaranteed. For similarity table stores a byte array representation of the
Orchem the similarity search has been implemented as a fingerprint. This raw data column allows the similarity
Java stored procedure that returns an array of results. The
current function accepts SMILES and Molfiles for a query Table 1: Database table with fingerprints optimized for similarity
and a cut-off value between 0 and 1 to indicate minimum searching
required similarity. Optional arguments allow the user to
TABLE orchem_fingprint_simsearch
indicate a cap to indicate the maximum number of results
required, and whether or not to display debugging infor-
id VARCHAR2(32)
mation. The select statement below shows a query exam-
ple to find compounds with a similarity of seventy percent bit_count NUMBER(4)
or more to the given SMILES string:
fp RAW(100)
select *
Page 4 of 11
(page number not for citation purposes)
Journal of Cheminformatics 2009, 1:17 http://www.jcheminf.com/content/1/1/17
search to quickly read bytes from the database, convert structure search for "P = O" will have OrChem create a
these to Java bitsets and use those for bitwise comparisons SQL clause using the three bits that are set in the finger-
to obtain a Tanimoto similarity score. print of "P = O". The pre-screening query will resemble:
Page 5 of 11
(page number not for citation purposes)
Journal of Cheminformatics 2009, 1:17 http://www.jcheminf.com/content/1/1/17
Query example ilar tests on the PubChem sample (3.5 million com-
The developer can submit an OrChem substructure search pounds) show a longer cache "warm up" time and give
through package orchem_subsearch: throughput times in the order of several seconds or
longer, again mostly depending on the similarity cut-off.
select * Figure 3 illustrates performance for searching the
PubChem sample and also shows a table with average
from table (orchem_subsearch.search numbers with regards to query throughput time and result
set sizes.
('S:C:C','SMILES',1000))
Substructure searching
The example shows a substructure search for SMILES As described before, substructure searching is done as a
"S:C:C" with a break out after 1000 results through an two-step process with a prescreening step followed by a
optional argument that emulates the ROWNUM column. VF2 isomorphism check. In the isomorphism algorithm
Under the hood the SMILES string will be fingerprinted, a each screened database compound needs to be material-
prescreen query will be constructed based on it and then ized as a CDK molecule to be able to compare it to the
executed with the Oracle Optimizer picking the most suit- query structure. It can however be expensive to materialize
able index available for the bit columns. Each candidate thousands of molecules on the fly using database Molf-
will be verified using VF2 and only valid superstructures iles, particularly with regards to calculation of aromaticity.
will be returned. The substructure search function is pipe- Instead, OrChem stores data for each molecule in col-
lined, so rows are returned iteratively as they are produced umns atoms and bonds of table
without having to wait for the entire collection to be con- orchem_fingprint_subsearch during the finger-
structed. This allows developers to build web interfaces printing process. These columns provide a quick way to
refreshing at a constant interval while presenting the materialize a basic CDK molecule to be passed into the
results returned so far to the user. VF2 algorithm. The data structures used are quite straight-
forward, for instance with data in column atom "C O"
Performance interpreted as: "atom 0 is Carbon, atom 1 is Oxygen" and
General bond column "0 1 D Y" then implying "there is a bond
OrChem has been tested extensively on a release of the between C (atom 0) and O (atom 1) that is double (D)
ChEMBL database (former Starlite) with around 420,000 and aromatic is true (Y)". In this way, CDK molecules can
compounds. Additional tests were done on a random be generated very fast without the need for calculating any
PubChem data sample of 3.5 million compounds in order properties during the search.
to assess performance on a larger data set. All tests were
done on Oracle 11.1.0.7 installed on an eight CPU quad Parallel substructure searching
core 32 Gb RedHat Linux server at the EBI. With regards to Oracle allows parallel execution of table functions [13]
performance, it is good to keep in mind that the initial run and this feature can be used to speed up Or-Chem sub-
of any Java stored procedure incurs a lot of overhead. Even structure searching. A parallel table function returns a col-
a first call to a simple 'helloWorld()' program takes a lection and is executed in a two-stage operation. First, one
while to complete, but following calls then respond set of slave processes partitions the data as directed in the
immediately. Orchem's search performance improves function's declaration; then a second set of slave scans
after a few queries have been issued and the fingerprint executes the table function in parallel on the partitioned
data starts to accumulate in the buffer cache where data data. The following statement illustrates the concept, with
can be accessed faster than by reading from disk. The f being the function to be run in parallel, taking a ref cur-
tables that support similarity and substructure searching sor input argument to partition the data.
are created with the cache option, so the more memory
assigned to the SGA the better the performance will be. select *
Page 6 of 11
(page number not for citation purposes)
Journal of Cheminformatics 2009, 1:17 http://www.jcheminf.com/content/1/1/17
Similarity
Figure 3 search throughput time for different minimal Tanimoto similarity
Similarity search throughput time for different minimal Tanimoto similarity. Explanation of graph: one hundred
randomly sampled compounds were used to query for all similar compounds, repeated for different minimum Tanimoto simi-
larities. Searches were done in the PubChem sample of 3.5 million compounds. As the similarity cutoff increases, performance
goes up: finding all compounds similar to compound X with at least a Tanimoto similarity of 0.8 (= 80% similar) is faster than
finding (many more) compounds that are 50% similar.
VF2. In such as case a non-parallel search can actually run case of Orchem this means that table
faster than a parallel search because there is not much VF2 orchem_fingprint_subsearch must be accessed
workload and no parallel overhead is incurred. But most with full table scans, but this is not always the best option.
importantly, the non-parallel search can use the regular B- When queries are done on a database with several million
tree indexes on the bit columns in table compounds the indexed approach can outperform the
orchem_fingprint_subsearch whereas a parallel parallel approach easily if the index can be used to quickly
query can not. narrow down a small set of candidates. Furthermore the
performance of a parallel query depends on a number of
In general, to parallelize any query Oracle must access at factors, one of them being the hardware on which the
least one of the tables through a full table scan or an index Oracle instance runs, with the more CPU cores the better.
through a range scan involving multiple partitions. In the The performance also depends on the size of the database,
Page 7 of 11
(page number not for citation purposes)
Journal of Cheminformatics 2009, 1:17 http://www.jcheminf.com/content/1/1/17
orchem_subsearch_par.setup('C:O:C:N','SMI
LES') Figurecompounds
Performance
million 4 of substructure search for "C:C:C:C:N" on 3.5
Performance of substructure search for "C:C:C:C:N"
> select * from table on 3.5 million compounds.
(orchem_subsearch_par.search(:key,1000)
The parallel table function works best for general queries rare bit combination, without being able to use any index
with high volumes of data being processed and many pos- for this. Finally in figure 6 we have a graph with through-
itive verifications being returned, the VF2 workload put times for a sampled set of sixty random structures
shared over several slave processes. This can be observed used to do a parallel substructure searching with different
when selecting the first n results for a common substruc- breakout values. The graph shows quite a spread in total
ture pattern like "C:C:C:C:N" with n set to a high value. throughput time, with the time to return up to 100,000
Figure 4 shows a graph for a "C:C:C:C:N" query done verified results varying between 20 seconds and more
against the PubChem sample: the parallel function iden- than three minutes, all depending on the complexity of
tifies the first 5000 results in 3.5 seconds whereas the sin- the query, the quality of the fingerprint and the occur-
gle process takes 22.5 seconds. The parallel approach rence rate of the structure. Either way users can see results
clearly benefits from doing the VF2 isomorphism checks streamed back before completion because data is piped
spread over multiple parallel processes, which becomes back as soon as it becomes available.
even more obvious when n is further increased to 25000.
However to add to the picture, figure 5 shows what hap- In conclusion the given examples show how the database
pens with a similar test for "C:O:C:O". This structure hap- content and the nature of the query determine the differ-
pens to occur only once in the entire data set, and the non- ence in performance between the parallel and the non-
parallel function can use a fast index scan on the bit col- parallel substructure search. Which one is fastest depends
umns of orchem_fingprint_subsearch to quickly mainly on the size of the database, the number of results
find three possible candidates with the right combination requested (limited or all) and the size of the prescreened
of fingerprint bits. One compound is then verified by VF2 compound set. The smaller the pre-screened set is the less
to be a superstructure of "C:O:C:O" (compound is shown attractive a parallel full scan normally is - but it is hard to
in the graph). The parallel approach at the other hand find exact rules. On top of that performance also depends
needs to make full table scans, albeit in parallel, and takes on other factors such as the server hardware and the qual-
more than forty seconds to find the same single result. The ity of the query's fingerprint. To make the search process
benefit of fast parallel VF2 execution does not apply now more user friendly, we plan to find heuristic rules to be
and the elapse time is spent scanning rows that meet the incorporated into a generic search function that will
Page 8 of 11
(page number not for citation purposes)
Journal of Cheminformatics 2009, 1:17 http://www.jcheminf.com/content/1/1/17
Figure 5
Performance of substructure search for "C:O:C:O" on 3.5 million compounds
Performance of substructure search for "C:O:C:O" on 3.5 million compounds.
decide for the user whether to take a parallel approach or installation is complete and OrChem can perform com-
not. pound searches.
Page 9 of 11
(page number not for citation purposes)
Journal of Cheminformatics 2009, 1:17 http://www.jcheminf.com/content/1/1/17
Figure substructure
Parallel 6 search throughput time on 3.5 million compounds
Parallel substructure search throughput time on 3.5 million compounds. Explanation of graph: sixty sampled sub-
structures were used as a query for a parallel substructure search in the PubChem sample of 3.5 million compounds. The same
searches were repeated with different maximum number of rows requested: 1000 (bottom), 10,000 (middle) and 100,000
(top). The graph displays the total throughput time in seconds but users can view the intermediate output generally faster as
the search function spools ('pipes') results as they become available. Fastest throughput times are observed for query struc-
tures that commonly exist in compounds, resulting in high a success ratio of the VF2 isomorphism algorithm.
and provide substructure search criteria related to ster- cheminformatics functionality but is only in its first
eoisomerism. release and by no means a finished product. Developers
who are interested to further extend OrChem are invited
• More of the CDK needs to be exposed through PL/ to participate through Sourceforge to make it a truly col-
SQL API's, for example CDK functions to calculate laborative project. We also encourage users to submit sug-
QSAR/QSPR descriptors or chemical properties. gestions and requests for functionality through the
mailing list "[email protected]".
• The similarity search will benefit from an option to
run in parallel, and more types of similarity scoring Availability and requirements
should be added to complement the current default Project name: OrChem
Tanimoto scoring.
Project home page: http://orchem.sourceforge.net/
• Along the same lines we might want to experiment
with different types of fingerprints. Operating system: Platform independent
• Pharmacophore searches could be added to take into language: Java 1.5 or higher
account 3D arrangements of molecular features.
Database system: Oracle 11 g (with JRE 1.5)
Conclusion
We have presented OrChem, an open source extension for Comm. restrictions: None
the Oracle 11G database platform that adds registration
and indexing of chemical structures to support fast sub-
structure and similarity searching. OrChem provides core
Page 10 of 11
(page number not for citation purposes)
Journal of Cheminformatics 2009, 1:17 http://www.jcheminf.com/content/1/1/17
Acknowledgements
We are grateful to John Overington and Bissan Al-Lazikani for making the
ChEMBL database available to us for testing. We thank the Pubchem team
for assembling this great collection of molecular structures and for opening
up their fingerprint definition. Our particular thanks go to the Chemistry
Development Kit (CDK) community for helping us to build this great
resource. Finally we'd like to thank Federico Paoli from Siena Biotech for
participating in OrChem development.
References
1. Hagadone T, Michael S: Integrating chemical structures into an
extended relational database system. Chemical Structures 2: The
International Language of Chemistry: Proceedings of the 2nd International
Conference 1995:257-269.
2. Hagadone T, Schulz M: Capturing Chemical Structure Informa-
tion in a Relational Database System: The Chemical Soft-
ware Component Approach. Journal of Chemical Information and
Computer Sciences 1995, 35(5):879-884.
3. Frome J: Searching Chemical Structures. Journal of Chemical Doc-
umentation 1964:43-45.
4. Guha R, Howard MT, Hutchison GR, Murray-Rust P, Rzepa H, Stein-
beck C, Wegner J, Willighagen EL: The Blue Obelisk - Interoper-
ability in Chemical Informatics. Journal of Chemical Information
and Modelling 2005, 46(3):991-998.
5. Steinbeck C, Han YQ, Kuhn S, Horlacher O, Luttmann E, Willighagen
E: The Chemistry Development Kit (CDK): An open-source
Java library for chemo- and bioinformatics. Journal of chemical
information and computer sciences 2003, 43(2):493-500.
6. Steinbeck C, Hoppe C, Kuhn S, Guha R, Willighagen EL: Recent
Developments of The Chemistry Development Kit (CDK) -
An Open-Source Java Library for Chemo- and Bioinformat-
ics. Current pharmaceutical design 2006, 12(17):2111-2120.
7. Steinbeck C, Krause S, Kuhn S: NMRShiftDB - Constructing a
free chemical information system with open-source compo-
nents. Journal of chemical information and computer sciences 2003,
43(6):1733-1739. Publish with ChemistryCentral and every
8. Steinbeck C, Kuhn S: NMRShiftDB - compound identification
and structure elucidation support through a free commu- scientist can read your work free of charge
nity-build web database. Phytochemistry 2004, 65(19):2711-2717. Open access provides opportunities to our
9. The PGChem project 2004 [http://pgfoundry.org/projects/
pgchem/]. colleagues in other parts of the globe, by allowing
10. The MyChem project 2004 [http://mychem.sourceforge.net]. anyone to view the content free of charge.
11. The OrChem project 2004 [http://orchem.sourceforge.net]. W. Jeffery Hurst, The Hershey Company.
12. The ChemiSQL umbrella project for Open Source chemical
search engines 2004 [http://chemdb.sourceforge.net]. available free of charge to the entire scientific community
13. Oracle 11.1 database Java developer's guide (B31225-03)
[http://tinyurl.com/yh5kb7r] peer reviewed and published immediately upon acceptance
14. Willett P: Chemical Similarity Searching. Journal of Chemical cited in PubMed and archived on PubMed Central
Information and Modeling 1998, 38(6):983-996. yours you keep the copyright
15. Swamidass S, Baldi P: Bounds and algorithms for fast exact
searches of chemical fingerprints in linear and sublinear Submit your manuscript here:
http://www.chemistrycentral.com/manuscript/
Page 11 of 11
(page number not for citation purposes)