Chemoinformatics As A Theoretical Chemistry Discipline: Review


Review

DOI: 10.1002/minf.201000100

Chemoinformatics as a Theoretical Chemistry Discipline


Alexandre Varnek*[a] and Igor I. Baskin[b]
Contribution to the 2nd Strasbourg Summer School on Chemoinformatics, VVF Obernai, France, 20–24 June 2010

© 2011 Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim. Mol. Inf. 2011, 30, 20–32

Abstract: Here, chemoinformatics is considered as a theoretical chemistry discipline complementary to quantum chemistry and force-field molecular modeling. These three fields are compared with respect to molecular representation, inference mechanisms, basic concepts and application areas. A chemical space, a fundamental concept of chemoinformatics, is considered with respect to complex relations between chemical objects (graphs or descriptor vectors). Statistical Learning Theory, one of the main mathematical approaches in structure-property modeling, is briefly reviewed. Links between chemoinformatics and its “sister” fields – machine learning, chemometrics and bioinformatics – are discussed.
Keywords: Chemoinformatics · Chemical space · Similarity · Computational learning theory

[a] A. Varnek
Laboratoire d’Infochimie, UMR 7177 CNRS, Université de Strasbourg
4, rue B. Pascal, Strasbourg 67000, France
*e-mail: [email protected]
[b] I. I. Baskin
Department of Chemistry, Moscow State University
Moscow 119991, Russia

1 Introduction

Chemoinformatics, a young field incorporating several “old” fields (QSAR and chemical databases development),[1] is approaching maturity.[2–10] Indeed, it is widely applied in academia and industry (especially in the drug design area), it is taught in many universities at the undergraduate and graduate level, and there are several specialized international journals, as well as many international meetings being held every year. At the same time, it has still not been recognized as an individual scientific discipline, but is mostly considered as an interface between chemistry and informatics, or as a collection of methods and tools specifically oriented toward drug design. This is clearly seen from the early definitions of chemoinformatics suggested by Brown, Paris, Gasteiger, and Faulon (Table 1). In fact, any scientific discipline should satisfy some obvious requirements: it should be based on its own concepts and approaches, and its differences from and complementarity to related disciplines must be clearly identified.

One of the ultimate applications of chemoinformatics is the development of models linking chemical structure and various molecular properties. This logically relates chemoinformatics with two other modeling approaches – quantum chemistry and force-field simulations. These three complementary fields differ with respect to the form of their molecular models, their basic concepts, inference mechanisms and domains of application (Table 2). Unlike the molecular models used in quantum mechanics (ensembles of nuclei and electrons) and force field molecular modeling (ensembles of “classical” atoms and bonds), chemoinformatics treats molecules as molecular graphs or related descriptor vectors with associated features (physicochemical properties, biological activity, 3D geometry, etc.) (Figure 1). The ensemble of graphs or descriptor vectors forms a chemical space in which some relations between the objects must be defined. Unlike real physical space, a chemical space is not unique: each ensemble of graphs and descriptors defines its own chemical space. Thus, chemoinformatics could be defined as a scientific field based on the representation of molecules as objects (graphs or vectors) in a chemical space.

Here, we attempt to define chemoinformatics as a theoretical chemistry discipline by characterizing its fundamental concepts and underlining its links with some “sister” disciplines. First, we present theoretical chemistry as an ensemble of three complementary disciplines: chemoinformatics, quantum chemistry and force field simulations. Then, we discuss two fundamental concepts of chemoinformatics: chemical space and statistical learning theory. Finally, some relations of chemoinformatics with machine learning, bioinformatics and chemometrics are discussed.

Figure 1. Chemoinformatics: from objects to major applications. Notice that for each Chemoinformatics Object (graph, descriptor vector in the input or in the feature space) there exist associated machine-learning approaches: graph based, vector-based or kernel-based methods, respectively.

2 Complementarities of the Chemoinformatics, Quantum Chemistry and Force Field Approaches

2.1 Basic Molecular Models

Differences and complementarities of three theoretical chemistry disciplines – Chemoinformatics, Quantum Chemistry and Force Field approach – are directly related to the way they represent molecular structures, i.e., their basic molecular models (Table 2). Quantum chemistry (QC) explicitly considers ensembles of electrons and nuclei which are described by the Schrödinger wave equation. Since this equation can only be solved analytically for atoms with one electron, in practice various approximate methods (commonly Hartree–Fock or Density Functional Theory) are used. Since even such calculations are time-consuming, they are usually performed on single molecules or reactions in the gas phase, or on relatively small ensembles of molecules. The Force Field (FF) approach considers “classical” atoms and bonds and it uses empirical equations to calculate the molecular potential energy as a sum of terms corresponding to both bonding and nonbonding interactions. This approach can be easily coupled with classical mechanics, allowing one to calculate molecular trajectories (Molecular Dynamics simulations), or with statistical mechanics in order to generate Boltzmann ensembles (Monte-Carlo simulations), or, simply, with optimization techniques (Molecular Mechanics).[11] Due to the simplicity of the basic molecular model and potential energy equations, Force Field methods can be applied to rather large systems containing many thousands of atoms (proteins, solutions, etc.). Chemoinformatics considers a molecule as a graph or an ensemble of descriptors generated from this graph. A set of molecules forms a chemical space for which the relationships between the objects themselves, on one hand, and between their chemical structures and related properties, on the other hand, are established using two main mathematical approaches: graph theory and statistical learning. Due to the rapidity of such calculations, these structure-property relationships can be applied to fast

Alexandre Varnek got his PhD in physical chemistry from the Institute of Inorganic and General Chemistry of the Russian Academy of Sciences, Moscow. In 1988–1995, he was Associate Professor in theoretical chemistry at the Moscow Mendeleyev University of Chemical Technology. In 1995, Alexandre joined the University of Strasbourg, France, where he holds the position of a Professor in theoretical chemistry, head of the laboratory on chemoinformatics and the director of the master courses on chemoinformatics. His research interests focus on the development of new approaches and tools for virtual screening and “in silico” design of new compounds and chemical reactions.

Igor Baskin received his PhD in organic chemistry (1990) and habilitation in mathematical and quantum chemistry (2010) from Lomonosov Moscow State University, Russia. After holding several positions at the Semenov Institute of Chemical Physics and Zelinsky Institute of Organic Chemistry of the Russian Academy of Sciences, Moscow, he joined in 2001 the Chemistry Department of Lomonosov Moscow State University, where since 2005 he holds the position of a Leading Scientist. He is regularly engaged as Visiting Scientist and Invited Professor at the University of Strasbourg, France. He has published more than 100 articles related to SAR/QSAR/QSPR methodology, artificial neural networks, medicinal chemistry, as well as molecular modeling of biological receptors and supramolecular systems. Igor Baskin has been a member of the International Academy of Mathematical Chemistry since 2009. His current work focuses on the application of advanced machine learning approaches in chemoinformatics.


Table 1. Different definitions of chemoinformatics as a field.

Frank Brown[5]: The use of information technology and management has become a critical part of the drug discovery process. Chemoinformatics is the mixing of those information resources to transform data into information and information into knowledge for the intended purpose of making better decisions faster in the area of drug lead identification and organization.

Greg Paris[45]: Chemoinformatics is a generic term that encompasses the design, creation, organization, management, retrieval, analysis, dissemination, visualization, and use of chemical information.

Johann Gasteiger[2]: Chemoinformatics is the application of informatics methods to solve chemical problems.

Jean-Loup Faulon and Andreas Bender[8]: Chemoinformatics is the field of handling chemical information.

This work: Chemoinformatics is a field based on the representation of molecules as objects (graphs or vectors) in a chemical space.

Table 2. Interrelations between three branches of theoretical chemistry.

Molecular model
  Quantum chemistry: Electrons and nuclei
  Force field based molecular modeling: Atoms and bonds
  Chemoinformatics: Graphs and descriptor vectors

Inference mechanism
  Quantum chemistry: Deductive ≫ inductive
  Force field based molecular modeling: Deductive ≈ inductive
  Chemoinformatics: Deductive ≪ inductive

Typically applied to
  Quantum chemistry: Individual species or ensemble of a few species
  Force field based molecular modeling: Individual species, complex system representing an ensemble of many species
  Chemoinformatics: Ensemble of species (both for knowledge extraction and predictions), individual species (for predictions only)

Basic concept
  Quantum chemistry: Wave/particle dualism
  Force field based molecular modeling: Classical mechanics
  Chemoinformatics: Chemical space

Basic mathematical approaches
  Quantum chemistry: Schrödinger equation and approximate methods (HF, DFT, …)
  Force field based molecular modeling: Force field method and its implementation in molecular mechanics, molecular dynamics, Monte-Carlo and free energy perturbation techniques
  Chemoinformatics: Statistical learning, graph theory
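
As an illustration of the force-field molecular model described in Section 2.1 (the potential energy as a sum of terms for bonding and nonbonding interactions), the following minimal Python sketch may help. The harmonic bond term, the 12-6 Lennard-Jones term, and all function names and parameter values are assumptions chosen for illustration; the text does not prescribe any particular functional form, and real force fields contain further terms (angles, torsions, electrostatics).

```python
import math


def bond_energy(r, r0, k):
    """Harmonic bond-stretch term (assumed form): 0.5 * k * (r - r0)^2."""
    return 0.5 * k * (r - r0) ** 2


def lennard_jones(r, epsilon, sigma):
    """12-6 Lennard-Jones term (assumed form) for one nonbonded atom pair."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)


def potential_energy(coords, bonds, nonbonded):
    """Sum the bonded and nonbonded terms for one geometry.

    coords:    list of (x, y, z) atom positions
    bonds:     list of (i, j, r0, k) tuples, one per "classical" bond
    nonbonded: list of (i, j, epsilon, sigma) tuples, one per nonbonded pair
    """
    energy = 0.0
    for i, j, r0, k in bonds:
        energy += bond_energy(math.dist(coords[i], coords[j]), r0, k)
    for i, j, epsilon, sigma in nonbonded:
        energy += lennard_jones(math.dist(coords[i], coords[j]), epsilon, sigma)
    return energy


# Hypothetical three-atom chain with made-up parameters.
coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
bonds = [(0, 1, 1.0, 100.0), (1, 2, 1.0, 100.0)]
nonbonded = [(0, 2, 0.1, 2.0)]
total = potential_energy(coords, bonds, nonbonded)
```

Because every term is a cheap closed-form expression of interatomic distances, the cost of one energy evaluation grows only with the number of listed pairs, which is consistent with the applicability of such models to systems of many thousands of atoms.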

screening of large databases. Any property for which a sufficient number of experimental data is available can be modeled in chemoinformatics, whereas this is not always so for QC and FF approaches.

Thus, Chemoinformatics, Quantum Chemistry and Force Field approaches are interrelated areas. Indeed, QC influenced the development of many popular molecular connectivity indices such as E-state, whereas molecular mechanics is an indispensable part of 3D shape descriptors generation. On the other hand, various machine-learning methods can be used to fit the parameters in some QC and FF approaches.

Nonetheless, QC, FF and chemoinformatics are different, if highly complementary, approaches. Each has its own application area, its advantages and problems. A good knowledge of all these areas is beneficial for a theoretical chemist to enable selection of the most suitable tools for a particular task.

2.2 Inference in Chemoinformatics

One of the main distinctions of chemoinformatics from QC and FF concerns the inference (learning) mechanism. Quantum chemical studies are a typical example of deductive inference, where a general physical model is applied to particular molecules. In chemoinformatics, the logic of inference is different, because it is generally not based on existing physical theories. Chemoinformatics considers the world too complex to be a priori described by any set of rules. The incompleteness of our knowledge changes the inference paradigm: instead of searching for exact solutions, chemoinformatics applies plausible reasoning quantified by probability theory.[13] The rules (models) in chemoinformatics are not explicitly taken from rigorous physical models, but learned inductively from the data. Thus, in inductive learning, the models are the result of generalization of patterns in the data. More general models have a greater chance to be predictive. Various approaches to assess the generalization ability of models have been suggested in the statistical learning theory[14–15] that is the mathematical basis of modeling in chemoinformatics.

It should be noted that the inductive learning approach is also used to some extent in QC and FF methods. In quantum chemistry, the parameterization of the electron density functional[16] and pseudopotentials[17–18] is often based on empirical parameters fitted to experimental data, as is the case in numerous semi-empirical methods.[8, 19] The number of these parameters is sometimes so great that some quantum chemical methods, like DFT with the functional M06[20] or B97D,[21] can be considered to define a sort


of “Schrödinger force field”.[16] In Force-Field simulations, inductive learning is at least as important as deductive, since potential energy calculations involve many empirical parameters.

3 Fundamentals of Chemoinformatics

For the objects in chemical space, chemoinformatics builds its models using two main mathematical approaches: graph theory and statistical learning. While these mathematical methods can be applied to other fields, the chemical space is a particular concept of chemoinformatics describing a way to handle ensembles of chemical structures.

3.1 Chemical Space Paradigm

As pointed out by C. Lipinski and A. Hopkins, “chemical space can be viewed as being analogous to the cosmological universe in its vastness, with chemical compounds populating space instead of stars”.[20] Any attempt even to count the number of chemical compounds which potentially could be synthesized leads to combinatorial explosion and yields an absolutely unrealistic number estimated as more than 10^60,[22] which exceeds the number of elementary particles in the cosmological universe. Clearly, this number is so huge that it is impossible not only to synthesize these molecules but even to generate computationally their structures. The goal of chemoinformatics is to find a rational way of representing this literally infinite chemical space and to navigate in this space. Efficient strategies for navigating chemical space are crucially important for the development of new biologically active compounds and the design of new drugs for medicine.[20] This is due to the fact that biologically active compounds of a certain type are not distributed evenly over the whole chemical space, but form very compact regions in it, like galaxies in the cosmological universe.[20] This is certainly true for any other chemical property. A special term, chemography, analogous to geography, has even been suggested for the art of navigating in chemical space.[23]

Although the expression “Chemical space” is widely used in the chemoinformatics literature, it is still not well defined. Generally speaking, the notion of “space” stands for a set of objects with some particular properties and some relationships between them (metric). Below, we consider two types of chemical objects (graphs and descriptor vectors), different metrics, and related chemical spaces.

3.1.1 Representation of Chemical Objects in Chemoinformatics

In chemoinformatics, the molecules are treated as informational objects, identifying their structure and properties. Generally, two main types of objects are used: graphs and descriptor vectors. In a vertex- and edge-labeled undirected graph, the vertices and edges correspond to atoms and chemical bonds, respectively. The vertex labels identify symbols of chemical elements, whereas the edge labels characterize the bond type. The label corresponds either to the bond order in molecules or to some special bond types in more complex systems. For instance, different types of “coordination” bonds can be defined for supramolecular systems, whereas “dynamic” bonds corresponding to chemical transformations can be used to encode chemical reactions.[24] More complex chemical systems, like polymers or mixtures, can be described by ensembles of graphs.

For several practical purposes, more generalized representations of chemical structures are needed. For example, for pharmacophore analysis, the graph vertices can be labeled as pharmacophoric centers (H-donors, H-acceptors, cation, anion, aliphatic, aromatic), while the separation of two centers can be depicted by an edge labeled by the value of the 2D or 3D distance.[25] In Markush structures used for patent searches, a graph vertex can stand for several types of either individual atoms or whole substructures (e.g., substituents).[26] The same is true for substructure queries used for searching chemical databases.[27]

Consideration of some complex chemical objects reveals, however, some limitations of graph theory to code chemical structures and their ensembles. Instead, hypergraphs[28] have been suggested as a more adequate mathematical model to encode stereochemical information and multicenter bonds. However, hypergraphs are much more difficult objects to operate compared to graphs, and, therefore, their use is still very limited.

Another popular representation of molecular structure is based on molecular descriptors defined by Todeschini and Consonni as “… the final result of a logical and mathematical procedure which transforms chemical information encoded within a symbolic representation of a molecule into a useful number or the result of some standardized experiment.”[29] This molecular representation is extremely popular in chemoinformatics because: (a) various descriptors can be generated from one and the same molecular graph, thus describing different facets of the information hidden in the graph; (b) it is invariant to any renumbering of graph vertices; (c) most of the descriptors are easily interpretable; (d) inductive transfer of knowledge can be performed via descriptors;[30] and, (e) descriptors define a vector space which is mathematically much easier to handle compared to the graph-based space. Descriptor vectors can be prepared not only for individual molecules but for more complex systems like chemical reactions[24] or multicomponent mixtures.[31] Nowadays, more than 5000 descriptors of different types have been reported.[29] They are used for database processing (as screens or fingerprints), for building SAR/QSAR/QSPR models, in similarity searching, clustering, etc.

At the same time, several weak points of molecular descriptors should be mentioned: (a) If descriptors are not well selected, in the resulting chemical space two different molecules can be superposed on one point; (b) The


number of existing descriptors is very large and despite numerous variable selection techniques reported in the literature,[32] there is always a risk of selecting irrelevant and redundant descriptors; (c) A serious drawback of molecular descriptors is the loss of reciprocity with the molecular structure. Indeed, the reverse reconstruction of molecular graphs from descriptors is a very difficult and, in some cases, impossible task known in QSAR as the “inverse” problem.[33–34] From the practical point of view, it concerns generation of molecular structures possessing desired property values. Attempts to solve this problem have been reported by Gordeeva et al.,[35] Skvortsova et al.,[36] and Faulon et al.[37] who observed some degeneracy of solutions, when several chemical structures corresponded to one set of molecular descriptor values. As pointed out in [38], this prevents a reverse engineering of chemical structures from molecular descriptors, but, on the other hand, can be useful to safely exchange chemical information in the form of molecular descriptors.

3.1.2 Chemical Similarity as a Metric of Chemical Space

By definition, a metric is a function which defines a distance between the elements of a set. For all x, y, z, this function must satisfy the following conditions: (i) d(x, y) ≥ 0 (nonnegativity); (ii) d(x, y) = d(y, x) (symmetry) and, (iii) d(x, z) ≤ d(x, y) + d(y, z) (triangle inequality). Strictly speaking, the distance d(x, z) is a dissimilarity measure which is zero for identical elements and increases with the decrease of similarity between them. Thus, it can be defined as distance = 1 − similarity. Some similarity measures are briefly considered below.

Molecular similarity (or chemical similarity) is one of the most basic concepts in chemoinformatics.[39–40] It is widely used in virtual screening and in silico design of new compounds. Such studies are based on the similar property principle which states that similar compounds have similar properties.[39] In application to classification problems this means that similar chemical compounds tend to belong to the same class (e.g., possessing similar biological activity), whereas as applied to regression problems it means that the approximating function should be as smooth as possible. It should also be pointed out that molecular similarity always depends on the choice of descriptors and methods to compare molecular graphs.

Chemical similarity measures described in the literature can be calculated from (a) molecular graphs; (b) descriptor vectors; (c) molecular fields; they can also be assessed from (d) kernels, and (e) unsupervised or (f) supervised modeling studies. This classification is rather fuzzy, and some similarity measures belong simultaneously to several classes. Some details are given below.

A similarity measure based on the size of the maximum common subgraph (MCS) for a pair of graphs is perhaps the most well-known graph-based similarity measure. Due to the relative complexity and inefficiency of computational algorithms to search for an MCS,[41] this approach, however, is rarely used to perform a similarity search[42] or to cluster chemical databases.[43]

Another type of graph-based similarity measure is that of graph kernels which assign to each pair of graphs a positive real number characterizing similarity.[44–45] They are used to map a graph-based chemical space to a vector (feature) space in which the structure–property model is built. This approach has been successfully used in SAR and QSAR.[46]

The most popular similarity measures are based on fixed-sized descriptor vectors. These are various types of distances (Euclidean, Manhattan, Mahalanobis, Minkowski) measuring molecular dissimilarity or some indices (Tanimoto, Dice, cosine, Tversky, etc.) measuring similarity. These measures are widely discussed in the literature, e.g., see the review paper by Willett[47] and references therein.

Several approaches have been developed to compare molecular fields. The Carbo index is computed by integrating overlaps of electronic densities of two molecules assessed using quantum-chemical approaches.[48–51] The SEAL index[52] is used to assess an alignment of steric and electrostatic fields of the molecules. Since any molecular field could be represented as a descriptor vector based on the field value on the grid points, a similarity measure can be simply calculated as the product of two vectors.

Similarity measures for which all matrices of values are positive semidefinite (all their eigenvalues are larger than or equal to zero) are called “Mercer kernels”, or simply “kernels”. Generally, kernels are used to project the objects (graphs or vectors) into a Hilbert “feature space”, in which a similarity measure between these objects is equal to the dot product of their projections. A dot product of vectors, which can be viewed as the cosine similarity measure for normalized vectors, is the simplest type of kernel.

Unsupervised machine-learning methods of nonlinear neighborhood-preserving projections of data can also be used to assess similarity. A typical example is mapping to Self-Organizing (Kohonen) Maps, SOM,[53] where the similarity is measured as a distance between different cells. This offers the possibility to use SOMs for property predictions[54] and in virtual screening.[55]

If several QSAR models are simultaneously applied to predict a property for a series of compounds, the similarity can be assessed in the “models’ space”. Indeed, for each compound, one can form a vector based on the prediction results. A dot product of these vectors can be considered as a measure of the similarity of two molecules. This approach has been used by Tetko in the ASNN (Associative Neural Networks) method.[56]

Generally, similarity measures could be used both for similarity-based predictions and similarity searching.[39] Similarity-based prediction approaches in the initial descriptor space are based on the k nearest neighbors method (kNN). However, kernel similarity measures implemented in kernel-based machine learning methods lead generally to more
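
The relation distance = 1 − similarity and the metric conditions (i)–(iii) of Section 3.1.2 can be checked numerically on a small example. The sketch below assumes binary fingerprints stored as sets of "on" bit positions and uses the Tanimoto index; the fingerprints and function names are hypothetical, and a brute-force check over a handful of objects only illustrates the axioms, it does not prove them.

```python
from itertools import product


def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of 'on' bits."""
    if not a and not b:
        return 1.0  # convention assumed here for two empty fingerprints
    return len(a & b) / len(a | b)


def distance(a, b):
    """Dissimilarity defined as 1 - similarity."""
    return 1.0 - tanimoto(a, b)


def satisfies_metric_axioms(items, dist, tol=1e-12):
    """Brute-force check of conditions (i)-(iii) on every triple of items."""
    for x, y, z in product(items, repeat=3):
        if dist(x, y) < 0:                          # (i) nonnegativity
            return False
        if abs(dist(x, y) - dist(y, x)) > tol:      # (ii) symmetry
            return False
        if dist(x, z) > dist(x, y) + dist(y, z) + tol:  # (iii) triangle inequality
            return False
    return True


# Hypothetical fingerprints of four compounds.
fingerprints = [
    frozenset({1, 2, 3}),
    frozenset({2, 3, 4}),
    frozenset({1, 5}),
    frozenset({2, 3}),
]

is_metric_on_sample = satisfies_metric_axioms(fingerprints, distance)
```

For set-based fingerprints the quantity 1 − Tanimoto is the Jaccard distance, which is known to satisfy the triangle inequality in general, so the check is expected to succeed for any sample of this kind.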


computationally efficient and predictive models.[44] Both in similarity-based prediction methods and in querying large chemical databases, the computational efficiency largely depends on whether a given similarity measure defines a metric in chemical space.[57]

For most similarity measures, the metric axioms (i)–(iii) are valid, and, therefore, they can be perceived as distances in chemical space.

3.1.3 Navigation in Graph-based Chemical Space

In principle, each ensemble of molecular graphs forms a discrete metric topological space. Its topology is defined by a set of all its possible subsets, where the simplest discrete metric gives the distance 0 if two chemical objects are equivalent, i.e. corresponding chemical graphs are isomorphic to each other, and 1 otherwise. This simplest metric is however not useful in practical applications, because in such space all distinct objects are equally similar to each other. A more flexible relationship between graphs can be expressed as a degree of their mutual similarity/dissimilarity. In particular, this relationship can be established by mapping an ensemble of graphs onto a descriptor vector space followed by an assessment of standard similarity measures.

The three main approaches used to describe a set of molecular graphs and to navigate in this space are: (a) substructure-based, (b) superstructure-based, and (c) mutation-based.

In the substructure-based approach a special “navigation” graph is usually constructed. It can be used for the visualization of chemical databases, exploring relations between compounds and discovering unexplored regions in the chemical space. In the navigation graph, the nodes correspond to individual molecular graphs and edges correspond to some transition rules. Bemis and Murcko have considered transitions from an unlabelled graph (framework) to a labeled graph (full chemical structure).[58–59] They invented the concept of molecular frameworks,[58–59] used to organize the structural data by grouping the atoms of each drug molecule into ring, linker, framework, and side chain atoms. Thus, a huge database can be described by a limited number of frameworks. In the “scaffold tree” graph approach of Schuffenhauer et al.,[60–61] transitions are allowed between a molecular graph and its subgraph. It has been demonstrated that this type of navigation graph allows one to perform an efficient and intuitive activity mapping, visualization and navigation of the chemical space defined by a given library, which in turn leads to building correlations with bioactivity and further compound design.[62] Thus, the hierarchical scaffold classification proposed in [61] helps to chart biologically relevant chemical space using data on natural products. The idea of a “scaffold tree” is implemented in the open source “Scaffold Hunter” software,[63] an interactive tool for navigation in chemical space, which facilitates recognition of complex structural relationships associated with bioactivity.

To represent relationships in analogous series of compounds having the same scaffold and different substitution patterns, multilayer-rooted “combinatorial analogue graphs” (CAGs) have been proposed by Peltason et al.[19] These graphical representations hierarchically organize compounds according to substitution patterns and are annotated with SARI discontinuity scores[64] in order to account for SAR discontinuity at the level of functional groups. The approach makes it possible to identify undersampled regions and highlight key substitution patterns which determine the SAR of a compound series. An alternative way to visualize SARs in analogous series with a common scaffold is offered by the “SAR maps” invented by Agrafiotis et al.[65] In a “SAR map”, each series is rendered as a rectangular matrix of cells, each representing a unique combination of substituents (i.e., a unique compound). Color-coding the cells by their potency easily identifies SAR patterns.

Pollock et al.[66] introduced the scaffold topology approach, which represents a connected graph with the minimum number of nodes and edges required to fully describe its ring structure. An algorithm for systematic generation of scaffold topologies allows one to analyze systematically all scaffold topologies for up to eight-ring molecules and four-valence atoms, thus providing coverage of the lower portion of the chemical space of small molecules.[66] Scaffold topology distributions were analyzed for several of the most popular chemical structure databases with a huge number of compounds, both real and virtual, and many interesting features were found.[67] It is claimed that “scaffold topologies can be the first step toward an efficient coarse-grained classification scheme of the molecules found in chemical databases”.[67]

In the superstructure-based approach, each individual molecular graph is considered as a subgraph of a common supergraph corresponding to the ensemble of individual graphs.[68] Although this approach is limited to relatively small congeneric sets of compounds, it has been found very suitable to build QSAR models, as demonstrated in the positional analysis by Magee,[69–70] the DARC/CALPHI system by Mercier et al.,[71] the MTD-PLS approach of Kurunczi et al.,[72–74] and the MFTA approach by Palyulin et al.[68, 75] For each individual chemical structure, the occupancies of supergraph nodes or local physicochemical descriptors of atoms matching these nodes form a fixed-size descriptor vector used in machine-learning methods as an input.

An alternative mutation-based approach to travel in graph-based chemical space has been suggested by van Deursen et al.[76] They represent a chemical space as a graph in which vertices correspond to individual molecules and edges correspond to structural mutations: change of atom type; inversion of stereochemical configuration at chiral centers, removal and addition of atom; saturation and unsaturation of bond; bond rearrangement; and aromatic ring addition. Traveling in such space from one active molecule to another one, one can discover along the
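
A toy version of the mutation-based navigation described in Section 3.1.3 might look as follows. This is a sketch under strong simplifying assumptions and not the method of the cited authors: molecules are reduced to short strings of atom symbols, the only mutation is a change of atom type, and a breadth-first search stands in for whatever search strategy is used in practice. It returns a shortest trajectory between two structures, whose intermediate vertices play the role of the structures found along the way.

```python
from collections import deque

ATOM_TYPES = ("C", "N", "O")  # toy alphabet of atom types (an assumption)


def mutations(molecule):
    """Yield every chain that differs from `molecule` by one atom-type change."""
    for pos, atom in enumerate(molecule):
        for new_atom in ATOM_TYPES:
            if new_atom != atom:
                yield molecule[:pos] + new_atom + molecule[pos + 1:]


def trajectory(start, target):
    """Breadth-first search for a shortest mutation path from start to target.

    Vertices of the implicit graph are molecules, edges are single mutations.
    Returns the list of molecules on the path, or None if target is unreachable.
    """
    previous = {start: None}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if current == target:
            path = []
            while current is not None:  # walk back to the start
                path.append(current)
                current = previous[current]
            return path[::-1]
        for neighbor in mutations(current):
            if neighbor not in previous:
                previous[neighbor] = current
                queue.append(neighbor)
    return None


path = trajectory("CCO", "NCN")
intermediates = path[1:-1]  # structures met between the two end points
```

With only atom-type changes, chains of different length are never connected, so `trajectory` returns `None` for them; the richer mutation set listed in the text (atom addition and removal, bond changes, ring addition) would be needed to connect such structures.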


trajectory a certain number of novel structures which can be further analyzed in the context of lead optimization. A similar approach has been reported by Bishop et al.,[77] who suggested the use of chemical reactions as structural mutations connecting, in the chemical space, known organic compounds taken from the Beilstein database. The supergraph created in such a way enabled the authors to select a set of the “most useful compounds” from which the majority of chemical compounds can be synthesized.

3.1.4 Navigation in Descriptor-Based Chemical Space

Descriptor-based chemical space is a multidimensional space in which molecules are represented as vectors. Two main approaches – dimensionality reduction and clustering – are used to facilitate the navigation in this space.

Dimensionality reduction is achieved in classical multivariate data analysis by the Principal Component Analysis (PCA) procedure.[78–79] In PCA, several features (called “principal components”) corresponding to the principal inertia axes of the “cloud” of data points in the initial descriptor space are used as axes of a new low-dimensional space, onto which the initial data points are projected. Such projection occurs with the minimal loss of information and, therefore, maximal conservation of the neighborhood relationships between data points. Thus, representation of the data points in the resulting low-dimensional space can be considered as a “navigation map” of the descriptor space. This idea has been implemented in the ChemGPS (chemical global positioning system) technique,[23] which positions chemical structures in drug-like chemical space (drug space). This makes this approach, as well as the related ChemGPS-NP[80–81] tool, a well-suited reference system to compare multiple libraries and to keep track of previously explored regions of the chemical descriptor space.[23]

Although the axes of the PCA “navigation map” are orthogonal, the corresponding latent variables are statistically independent only for a Gaussian distribution of data points. Since this distribution in the descriptor space is usually strongly non-Gaussian, this can hamper the chemical interpretability of particular latent variables and reduce the usefulness of the whole “navigation map”. To solve this problem, Independent Component Analysis (ICA) has been suggested.[82–85] It has been demonstrated that the application of ICA instead of PCA yields chemically more readily interpretable latent variables.[86]

Hierarchical cluster analysis represents an alternative approach to navigate in the descriptor space. The resulting dendrogram gives a clear picture of the neighborhood relations between chemical objects, although for a large number of compounds it becomes too burdensome.

The combined application of dimensionality reduction and clustering methods is realized in Kohonen Self-Organizing Maps (SOMs).[53] In SOMs, the dimensionality reduction is achieved by embedding a net of neurons onto a 2D surface. The SOMs provide more efficient solutions than PCA, because the former are more suitable to analyze complex topological structures of the descriptor space. The ability of SOMs to build “navigation maps” for visualizing chemical space has been demonstrated on GPCR ligands,[54] toxic compounds,[87] inhibitors of P-glycoprotein[88] and different organic reactions.[89]

A set of chemical structures can be presented as a graph in which the vertices correspond to individual molecules and the edges connecting them correspond to certain neighborhood relations.[90] This technique has been used to represent relationships between different classes of drug molecules,[91] to elucidate similarity relationships within the sets of active compounds,[92] and to explore structure-selectivity relationships.[93]

Hierarchical clustering techniques using some similarity measures also offer the possibility of analyzing large chemical data sets. Thus, Agrafiotis et al.[94] have used radial clusterograms, different segments of which are color-coded by biological activity or any other user-defined property.

To characterize structure–activity landscapes in the descriptor-based chemical space, SARI and SALI indices have been suggested. The SARI index[64] globally characterizes structure–activity landscapes. It consists of two terms: the continuity score, which measures the potency-weighted structural diversity, and the discontinuity score, calculated as the average potency difference among similar pairs of molecules. The SALI index[95] is local, considering two related molecules, and it is often used to quantify “activity cliffs”.[96]

3.2 Modeling Background

The two main mathematical approaches used in chemoinformatics are graph theory and computational learning theory. Whilst the chemical applications of graphs are described in numerous books and review articles (e.g., see Bonchev[97]), the latter is described mostly in the data-mining literature. Here, we give some general information about some basic concepts of computational learning theory.

3.2.1 Computational Learning Theory

In recent years, in statistical modeling there has been a shift from the classical statistical paradigm of “model parameterization” to a new paradigm of “predictive flexible modeling”. The first paradigm supposes that the functional dependence between the input and output data is established from some external knowledge and the goal of the statistical study is to find a few independent free parameters by fitting to experimental data. This usually requires a certain number of experimental observations per free parameter. Unfortunately, this requirement can be met only in very few cases, e.g., within the classical Hansch-Fujita approach based on three descriptors only.[98] The aim of the second paradigm is to build models with maximal predictive performance by fitting to experimental data rather flexible families of functions involving large numbers of intercorrelated parameters. Such a setup is evidently much more appropriate for most chemoinformatics studies. The first attempts to implement the second paradigm in the framework of so-called nonparametric statistical analysis failed because of the “curse of dimensionality” (which required a huge number of observations, growing exponentially with the number of free parameters).[99] Nonetheless, early works on predictive modeling were successfully carried out using completely heuristic methodologies of artificial neural networks[100–101] and decision trees.[102] For the first time, a strong theoretical background to build statistical models using finite (even small) data sets was developed by Vapnik in his Statistical Learning Theory (SLT).[14] This approach, together with that developed later as the PAC (Probably Approximately Correct) theory by Valiant[15] and the MDL (Minimum Description Length) concept by Rissanen,[103] constitutes the basis of modern computational learning theory.

According to SLT, the goal of statistical study is to choose from a given set of functions f(x, q) the “best” one f(x, q*) with the minimum value of the risk functional R[f], which is defined as an expected prediction error on new data taken from the same distribution as the training set (i.e., the mean prediction performance on all possible test sets). Here x denotes the variables (descriptors in QSAR studies) and q the adjustable parameters. Another important characteristic is the empirical risk functional Remp[f], which is defined as an error on the training set (fitting error). For regression tasks, Remp is usually calculated as:

$$R_{\mathrm{emp}}[f] = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - f(x_i, q)\right)^2 \qquad (1)$$

Here i denotes the observations (compounds in QSAR studies) in the training set, N is the size of the training set, and yi is the response value in the i-th observation (the property value of the i-th compound in QSAR studies). According to Vapnik,[14, 99] for the classification tasks the risk can be estimated as

$$R[f] \le R_{\mathrm{emp}}[f] + c(h, N) \qquad (2)$$

Here, the complexity term c(h,N) characterizes the flexibility of the set of functions f(x, q) to fit experimental data. It increases with the Vapnik-Chervonenkis (VC) dimension h and decreases with the number of data N. It follows from Equation 2 that in order to obtain a predictive model, one should minimize both the empirical risk Remp (i.e., fitting error) and the complexity term.

In fact, the notion of complexity is related to the smoothness of functions for regression tasks. If f is not flexible enough, the complexity term is small, but Remp could be large (underfitting). A too complex (flexible) f perfectly fits the data, thus reducing Remp. On the other hand, such an f could fit not only a trend but also noise in the data, thus increasing the complexity term (overfitting). Thus, to minimize the risk R[f], one should find a compromise between Remp and the complexity term in Equation 2. This can be achieved by introducing some trade-off parameters that depend on the particular machine learning method. For example, these include the number of descriptors in multiple linear regression models with variable selection; the ridge value in ridge regression; the number of “leaves” in decision trees; the number of iterations in neural networks; the parameter k in kNN; and the C and ν parameters in SVM calculations. These parameters should be optimized in order to achieve the best prediction performance of the model.

One of the most interesting conclusions of SLT is that the value of the complexity term does not directly depend on the number of free parameters q in the function class f, the flexibility (capacity, complexity) of which is measured by the VC dimension h. The value of h can be considered as an “effective” number of free parameters. (Note that h is equal to the number of free parameters in classical multiple linear regression without descriptor selection.)

According to SLT, h is controlled by the trade-off parameter used to simultaneously minimize both terms in Equation 2. This offers an opportunity to build models with any (even very large) number of variables using kernel approaches, which approximate nonlinear functional dependencies of any form by projecting descriptors onto a feature space of any (even infinite) dimensionality and building linear models in this feature space.[44]

Nowadays, computational learning theory represents a quickly developing area. Thus recently, a Bayesian learning approach to predictive flexible modeling has been described.[104] Instead of one single model (as in SLT), it considers the whole statistical distribution of models weighted by their ability to fit data, thus allowing one to make probabilistic predictions by averaging over this distribution. This approach has come to be rather popular in chemoinformatics: its implementations in Bayesian Neural Networks,[105] Gaussian Processes,[106] and Bayesian Networks[107] have been recently published.

3.2.2 Different Facets of Statistical Modeling

It should be pointed out that the range of application of different statistical (machine learning) methods in chemoinformatics is currently very wide (Figure 2). Most of the existing machine learning approaches can provisionally be divided into two large families: supervised and unsupervised machine learning. (Some other approaches – semisupervised, active and multi-instance learning – are very rarely used in chemistry so far.)

The goal of supervised learning in chemistry is to predict physicochemical properties and biological activities of chemical compounds. The quantitative prediction of real-valued properties is performed by regression models,


Figure 2. Different approaches to model description.
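
As an illustrative aside to Section 3.2.1 (not part of the original review), the role of a trade-off parameter can be sketched numerically. The snippet assumes a linear model family f(x, q) = x·q fitted by ridge regression on synthetic “descriptors”; the data, the ridge values and the function names `empirical_risk` and `fit_ridge` are hypothetical choices made for this example only. It evaluates the empirical risk of Equation 1 on a small training set and, as a stand-in for the risk R[f], the mean squared error on a large held-out sample.

```python
import numpy as np


def empirical_risk(y, y_pred):
    """Mean squared error over the N observations (Equation 1)."""
    return float(np.mean((y - y_pred) ** 2))


def fit_ridge(X, y, ridge):
    """Closed-form ridge solution q = (X^T X + ridge * I)^(-1) X^T y."""
    n_descriptors = X.shape[1]
    return np.linalg.solve(X.T @ X + ridge * np.eye(n_descriptors), X.T @ y)


# Synthetic stand-in for a small QSAR data set: 30 "compounds", 20 "descriptors",
# of which only the first three actually influence the property (an assumption).
rng = np.random.default_rng(0)
n_train, n_test, n_descriptors = 30, 2000, 20
q_true = np.zeros(n_descriptors)
q_true[:3] = [1.0, -2.0, 0.5]

X_train = rng.normal(size=(n_train, n_descriptors))
y_train = X_train @ q_true + rng.normal(scale=1.0, size=n_train)
X_test = rng.normal(size=(n_test, n_descriptors))
y_test = X_test @ q_true + rng.normal(scale=1.0, size=n_test)

# The ridge value plays the role of the trade-off parameter discussed in the text.
results = []
for ridge in (0.001, 1.0, 10.0, 100.0):
    q = fit_ridge(X_train, y_train, ridge)
    r_emp = empirical_risk(y_train, X_train @ q)  # fitting error on the training set
    r_test = empirical_risk(y_test, X_test @ q)   # held-out estimate of R[f]
    results.append((ridge, r_emp, r_test))
    print(f"ridge={ridge:>7}: R_emp={r_emp:.3f}  held-out error={r_test:.3f}")
```

For ridge regression the training error Remp cannot decrease as the ridge value grows, whereas the held-out error typically passes through a minimum at some intermediate value; where that minimum lies depends on the data, so no particular output is claimed here.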

whereas qualitative predictions (“active” or “inactive”?) are assessed in classification models. The most popular regression methods currently used in chemoinformatics applications are multiple linear regression (MLR), partial least squares (PLS), neural networks, support vector regression (SVR), and kNN, whereas naïve Bayes, support vector machines (SVM), neural networks and classification trees (especially the Random Forest method[108]) are widely used for classification. There are also ranking models,[109] in which ranking order instead of property values is predicted, and models with structured output,[110] in which predicted values belong to classes of any complexity. Models of the latter two types can be built using some special modifications of SVM.

Unsupervised learning describes the data and reveals their hidden patterns. The most important tasks treated by unsupervised modeling approaches are: (a) cluster analysis (data reduction); (b) dimensionality reduction; (c) novelty (outlier) detection. All these tasks can be perceived as particular cases of data density estimation. Many standard algorithms for both nonhierarchical (e.g., k-means) and hierarchical clustering are used. The most popular algorithms for dimensionality reduction are PCA (Principal Component Analysis) and ICA (Independent Component Analysis). Tasks (a) and (b) are solved simultaneously in the Kohonen Self-Organizing Maps (SOMs),[53] which are intensively used for the purposes of visualization and analysis of the chemical space. The ability of several machine learning methods, such as one-class SVM,[44] to tackle the problem of novelty detection is currently used to define the applicability domains of QSAR/QSPR models[111] as well as in virtual screening experiments.[112]

With respect to data description, two types of models – primal and dual – can be identified. Primal models are based on the direct use of descriptors, whereas dual models are based on measures describing similarity relationships between chemical structures. Kernels represent the most useful types of such measures; they can be computed both from molecular descriptors and by direct comparison of chemical structures. Both primal and dual approaches can be used within supervised and unsupervised modeling tasks.

Finally, statistical models can be built for a net of mutually related models, in which their predictive performance can be leveraged due to the Inductive Learning Transfer phenomenon,[30, 113] in the framework of the Multi-Task Learning and Feature Net approaches.[30]

4 Relations of Chemoinformatics with the “Sister” Disciplines

4.1 Chemoinformatics and Machine Learning

Although machine learning is widely used for structure-property modeling, chemoinformatics can be considered as a very specific area of its application. The specificity of chemoinformatics results from (i) the nature of chemical objects, (ii) the complexity of the chemical universe and (iii) the possibility of taking extra knowledge into account.

The basic chemical object is a graph (or hypergraph), rather than a simple fixed-size vector of numbers as in the typical applications in mathematical statistics and machine learning. This dictates the need to apply graph theory, to develop novel descriptors and structured graph kernels, and to apply machine learning methods capable of dealing with structured discrete data.

The second important distinction comes from the fact that the chemical data result from an explorative process in a huge chemical space rather than from specially organized sampling. Hence, they cannot be considered as a representative, independent and identically distributed sample from a well-defined distribution. Thus, special approaches are


needed to treat this problem: various strategies to explore chemical space, the “applicability domain” concept, the active learning approach, etc.

Finally, one can use the relationships between different properties issued from physicochemical theory. (For example, the Arrhenius law could be particularly useful when modeling rate constants.) These relationships could be integrated into the chemoinformatics workflow as external knowledge.

4.2 Chemoinformatics and Chemometrics

Massart[114] has defined chemometrics as “a chemical discipline that applies mathematics, statistics and formal logic (a) to design and select optimal experimental procedures; (b) to provide maximum relevant chemical information by analyzing chemical data; and (c) to obtain knowledge about chemical systems”. Generally, chemometrics requires no information about chemical structure and, therefore, it overlaps with chemoinformatics only in the area of application of machine learning methods. It is widely used in experiment design, chemical engineering, analytical chemistry and treatment of spectra – fields where an exhaustive treatment of multivariate data is needed.

4.3 Chemoinformatics and Bioinformatics

Unlike chemoinformatics, which deals with “chemical size” molecules, bioinformatics uses computational tools to study the structure and function of biomolecules (proteins, nucleic acids). This is a broad field mostly involving 3D (force field and quantum mechanics calculations) and 1D (sequence alignment) modeling. In the latter, a biomolecule is represented as a string of characters (building blocks). Graph and fixed-size vector models used in chemoinformatics are very rarely used in bioinformatics. In this sense, chemo- and bioinformatics are “complementary”. On the other hand, there are many examples of interpenetration of these fields. Thus, in docking calculations, protein structures could be generated by bioinformatics tools, whereas some scoring functions involve vector representation of ligands.

Another way to combine bio- and chemoinformatic approaches is related to the construction of protein-ligand descriptors or fingerprints based on available 3D information about protein-ligand complexes. Thus, Tropsha et al.[115] developed CoLiBRI descriptors calculated for a pseudomolecule constructed from interacting atoms of the protein and the ligand. Marcou and Rognan[116] have developed “interaction fingerprints” accounting for eight interaction types for each protein atom interacting with the ligand: hydrophobic; aromatic (face to face); aromatic (edge to face); H-bond (protein donor atom); H-bond (protein acceptor atom); ionic (positively charged protein atom); ionic (negatively charged protein atom); metal complexation. Langer et al.[117] have reported a technique to build pharmacophoric ligand models based on the analysis of 3D protein-ligand structures.

A promising way to describe ligand–receptor complexes concerns construction of protein-ligand kernels (PLK) as products of “chemical” ligand–ligand (LLK) and “biological” protein–protein kernels (PPK). The resulting feature space for PLK is a tensor product of the feature spaces corresponding to LLK and PPK. Machine learning models involving PLK are based on the idea that similar ligands bind to similar proteins. Using these kernels, one can predict the binding potency both of different ligands with respect to a given protein and of different proteins with respect to a given ligand. Several articles describing PPK have been published. Erhan et al. combined “chemical” kernels based on MOE descriptors and “biological” kernels based on protein-ligand “interaction fingerprints”.[118] Faulon et al.[119] used the signature molecular descriptors to calculate “chemical” and “biological” Tanimoto kernels. Jacob and Vert[120] combined a Tanimoto kernel for the ligands and several types of kernels for the proteins. In particular, for PPK they compared either protein sequences or EC numbers. Bajorath et al.[121] used a linear kernel for the ligands and protein-protein kernels calculated from a sequence identity matrix.

5 Conclusions

Here, chemoinformatics has been described as a fundamental theoretical chemistry discipline complementary to quantum chemistry and force-field molecular modeling. Chemoinformatics represents molecules as graphs or descriptor vectors whose ensembles form, respectively, graph-based or descriptor-based spaces. Chemical similarity measures or hierarchical relationships between graphs are used as metrics in the chemical space. Chemoinformatics uses two main mathematical approaches – graph theory and statistical learning theory; the latter is briefly described here.

In this paper, we have not aimed to describe all facets of chemoinformatics, but have attempted to delineate some important points identifying this field as an independent scientific discipline. This view is probably incomplete. However, we hope it will initiate a discussion which in any case could be useful for the chemoinformatics community.

Acknowledgements

We thank Prof. L. Morin-Allory, Dr. P. Vayer and Dr. G. Marcou for fruitful discussion and Prof. J. Harrowfield for his help and advice.

References

[1] J. Gasteiger, Anal. Bioanal. Chem. 2006, 384, 57–64.
[2] J. Gasteiger, T. Engel, Chemoinformatics: A Textbook, Wiley-VCH, Weinheim, 2003.


[3] J. Gasteiger, Handbook of Chemoinformatics: From Data to Knowledge, Wiley-VCH, Weinheim, 2003.
[4] J. Bajorath, Mol. Divers. 2002, 5, 305–313.
[5] N. Brown, Computing Surveys 2006.
[6] W. L. Chen, J. Chem. Inf. Model. 2006, 46, 2230–2255.
[7] T. Engel, J. Chem. Inf. Model. 2006, 46, 2267–2277.
[8] J.-L. Faulon, A. Bender, Handbook of Chemoinformatics Algorithms, CRC Press, Boca Raton, 2010.
[9] B. Nathan, ACM Comput. Surv. 2009, 41, 1–38.
[10] I. Baskin, A. Varnek, in Chemoinformatics Approaches to Virtual Screening (Eds: A. Varnek, A. Tropsha), RSC Publisher, Cambridge, 2008, pp. 1–43.
[11] A. R. Leach, Molecular Modelling Principles and Applications, 2nd ed., Prentice Hall, Upper Saddle River, 2001.
[12] N. Brown, ACM Comput. Surv. 2009, 41, 1–38.
[13] E. T. Jaynes, Probability Theory. The Logic of Science, Cambridge University Press, Cambridge, 2003.
[14] V. Vapnik, Statistical Learning Theory, Wiley-Interscience, New York, 1998.
[15] L. G. Valiant, Commun. ACM 1984, 27, 1134–1142.
[16] A. Nicholls, in ACS Fall 2009 National Meeting & Exposition, 2009, p. CINF55.
[17] T. R. Cundari, M. T. Benson, M. L. Lutz, S. O. Sommerer, Rev. Comput. Chem. 1996, 8, 145–202.
[18] G. Frenking, I. Antes, M. Böhme, S. Dapprich, A. W. Ehlers, V. Jonas, A. Neuhaus, M. Otto, R. Stegmann, A. Veldkamp, S. F. Vyboishchikov, Rev. Comput. Chem. 1996, 8, 63–144.
[19] L. Peltason, N. Weskamp, A. Teckentrup, J. Bajorath, J. Med. Chem. 2009, 52, 3212–3224.
[20] Y. Zhao, D. G. Truhlar, Theor. Chem. Acc. 2008, 120, 215–241.
[21] S. Grimme, J. Comput. Chem. 2006, 27, 1787–1799.
[22] C. M. Dobson, Nature 2004, 432, 824–828.
[23] T. I. Oprea, J. Gottfries, J. Comb. Chem. 2001, 3, 157–166.
[24] A. Varnek, D. Fourches, F. Hoonakker, V. P. Solov’ev, J. Comput. Aided Mol. Des. 2005, 19, 693–703.
[25] T. Langer, R. D. Hoffman, Pharmacophores and Pharmacophore Searches, Wiley-VCH, Weinheim, 2000.
[26] J. M. Barnard, J. Chem. Inf. Comput. Sci. 1991, 31, 64–68.
[27] U. Schoch-Grübler, Online Inform. Rev. 1990, 14, 95–108.
[28] C. Berge, Hypergraphs, Elsevier, Amsterdam, 1989.
[29] R. Todeschini, V. Consonni, Handbook of Molecular Descriptors, Wiley-VCH, Weinheim, 2000.
[30] A. Varnek, C. Gaudin, G. Marcou, I. Baskin, A. K. Pandey, I. V. Tetko, J. Chem. Inf. Model. 2009, 49, 133–144.
[31] N. M. Halberstam, I. I. Baskin, V. A. Palyulin, N. S. Zefirov, Dokl. Chem. (Engl. Transl.) 2002, 384, 140–143.
[32] D. J. Livingstone, D. W. Salt, Rev. Comput. Chem. 2005, 21, 287–348.
[33] I. I. Baskin, E. V. Gordeeva, R. O. Devdariani, N. S. Zefirov, V. A. Palyulin, M. I. Stankevich, Dokl. Akad. Nauk. SSSR 1989, 307, 613–617 [Chem].
[34] M. I. Skvortsova, I. I. Baskin, V. A. Palyulin, O. L. Slovokhotova, N. S. Zefirov, in AIP Conf. Proc. 330. E.C.C.C.1 Comput. Chem. F.E.C.S. Conf., Nancy, France (Eds: F. Bernardi, J.-L. Rivail), AIP Press, Woodbury, New York, 1995, pp. 486–499.
[35] E. V. Gordeeva, M. S. Molchanova, N. S. Zefirov, Tetrahedron Comput. Methodol. 1990, 3, 389–415.
[36] M. I. Skvortsova, I. I. Baskin, O. L. Slovokhotova, V. A. Palyulin, N. S. Zefirov, J. Chem. Inf. Comput. Sci. 1993, 33, 630–634.
[37] J.-L. Faulon, C. J. Churchwell, D. P. Visco, Jr., J. Chem. Inf. Comput. Sci. 2003, 43, 721–734.
[38] J. L. Faulon, W. M. Brown, S. Martin, J. Comput. Aided Mol. Des. 2005, 19, 637–650.
[39] A. M. Johnson, G. M. Maggiora, Concepts and Applications of Molecular Similarity, Wiley, New York, 1990.
[40] N. Nikolova, J. Jaworska, QSAR Comb. Sci. 2003, 22, 1006–1026.
[41] J. W. Raymond, P. Willett, J. Comput. Aided Mol. Des. 2002, 16, 521–533.
[42] T. R. Hagadone, J. Chem. Inf. Model. 1992, 32, 515–521.
[43] I. L. Ruiz, C. G. Garcia, M. A. Gomez-Nieto, J. Chem. Inf. Model. 2005, 45, 1178–1194.
[44] B. Schölkopf, A. J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, MIT Press, Cambridge, MA, USA, 2002.
[45] G. Paris, (August 1999 Meeting of the American Chemical Society), quoted by W. Warr at http://www.warr.com/warrzone.htm.
[46] M. Rupp, G. Schneider, Mol. Inf. 2010, 29, 266–273.
[47] P. Willett, J. M. Barnard, G. M. Downs, J. Chem. Inf. Comput. Sci. 1998, 38, 983–996.
[48] E. Besalu, X. Girones, L. Amat, R. Carbo-Dorca, Acc. Chem. Res. 2002, 35, 289–295.
[49] X. Fradera, L. Amat, E. Besalu, R. Carbo-Dorca, Quant. Struct.-Act. Rel. 1997, 16, 25–32.
[50] A. Gallegos, R. Carbo-Dorca, R. Ponec, K. Waisser, Int. J. Pharm. 2004, 269, 51–60.
[51] X. Girones, L. Amat, R. Carbo-Dorca, SAR QSAR Environ. Res. 1999, 10, 545–556.
[52] S. K. Kearsley, G. M. Smith, Tetrahedron Comput. Methodol. 1990, 3, 615–633.
[53] T. Kohonen, Self-Organizing Maps, Springer, 2001.
[54] M. von Korff, M. Steger, J. Chem. Inf. Comput. Sci. 2004, 44, 1137–1147.
[55] D. Hristozov, T. I. Oprea, J. Gasteiger, J. Chem. Inf. Model. 2007, 47, 2044–2062.
[56] I. V. Tetko, J. Chem. Inf. Comput. Sci. 2002, 42, 717–728.
[57] T. G. Kristensen, J. Math. Chem. 2010, 48, 287–289.
[58] G. W. Bemis, M. A. Murcko, J. Med. Chem. 1996, 39, 2887–2893.
[59] G. W. Bemis, M. A. Murcko, J. Med. Chem. 1999, 42, 5095–5099.
[60] A. Schuffenhauer, P. Ertl, S. Roggo, S. Wetzel, M. A. Koch, H. Waldmann, J. Chem. Inf. Model. 2007, 47, 47–58.
[61] M. A. Koch, A. Schuffenhauer, M. Scheck, S. Wetzel, M. Casaulta, A. Odermatt, P. Ertl, H. Waldmann, Proc. Natl. Acad. Sci. USA 2005, 102, 17272–17277.
[62] S. Renner, W. A. L. Van Otterlo, M. Dominguez Seoane, S. Möcklinghoff, B. Hofmann, S. Wetzel, A. Schuffenhauer, P. Ertl, T. I. Oprea, D. Steinhilber, L. Brunsveld, D. Rauh, H. Waldmann, Nature Chem. Biol. 2009, 5, 585–592.
[63] S. Wetzel, K. Klein, S. Renner, D. Rauh, T. I. Oprea, P. Mutzel, H. Waldmann, Nature Chem. Biol. 2009, 5, 581–583.
[64] L. Peltason, J. Bajorath, J. Med. Chem. 2007, 50, 5571–5578.
[65] D. K. Agrafiotis, M. Shemanarev, P. J. Connolly, M. Farnum, V. S. Lobanov, J. Med. Chem. 2007, 50, 5926–5937.
[66] S. N. Pollock, E. A. Coutsias, M. J. Wester, T. I. Oprea, J. Chem. Inf. Model. 2008, 48, 1304–1310.
[67] M. J. Wester, S. N. Pollock, E. A. Coutsias, T. K. Allu, S. Muresan, T. I. Oprea, J. Chem. Inf. Model. 2008, 48, 1311–1324.
[68] E. V. Radchenko, V. A. Palyulin, N. S. Zefirov, in Chemoinformatics Approaches to Virtual Screening (Eds: A. Varnek, A. Tropsha), RSC, 2008, pp. 150–181.
[69] P. S. Magee, Quant. Struct.-Act. Rel. 1990, 9, 202–215.
[70] P. S. Magee, in QSAR: Rational Approaches to the Design of Bioactive Compounds (Eds: C. Silipo, A. Vittoria), Elsevier, Amsterdam, 1991.


[71] C. Mercier, V. Fabart, Y. Sobel, J. E. Dubois, J. Med. Chem. 1991, 34, 934–942.
[72] L. Kurunczi, E. Seclaman, T. I. Oprea, L. Crisan, Z. Simon, J. Chem. Inf. Model. 2005, 45, 1275–1281.
[73] L. Kurunczi, M. Olah, T. I. Oprea, C. Bologa, Z. Simon, J. Chem. Inf. Comput. Sci. 2002, 42, 841–846.
[74] T. I. Oprea, L. Kurunczi, M. Olah, Z. Simon, SAR QSAR Environ. Res. 2001, 12, 75–92.
[75] V. A. Palyulin, E. V. Radchenko, N. S. Zefirov, J. Chem. Inf. Comput. Sci. 2000, 40, 659–667.
[76] R. Van Deursen, J. L. Reymond, ChemMedChem 2007, 2, 636–640.
[77] K. J. M. Bishop, R. Klajn, B. A. Grzybowski, Angew. Chem. Int. Ed. 2006, 45, 5348–5354.
[78] K. Varmuza, in Handbook of Chemoinformatics. From Data to Knowledge (Ed: J. Gasteiger), Wiley-VCH, Weinheim, 2003, pp. 1098–1133.
[79] I. T. Jolliffe, Principal Component Analysis, 2nd ed., Springer, Heidelberg, 2002.
[80] J. Larsson, J. Gottfries, L. Bohlin, A. Backlund, J. Nat. Prod. 2005, 68, 985–991.
[81] J. Larsson, J. Gottfries, S. Muresan, A. Backlund, J. Nat. Prod. 2007, 70, 789–794.
[82] A. Hyvarinen, Acta Polytech. Sc. Ma. 1997, 88.
[83] A. Hyvarinen, E. Oja, Neural Comput. 1997, 9, 1483–1492.
[84] A. Hyvarinen, E. Oja, Neural Networks 2000, 13, 411–430.
[85] J. Chen, X. Z. Wang, J. Chem. Inf. Comput. Sci. 2001, 41, 992–1001.
[86] M. G. Gustafsson, J. Chem. Inf. Model. 2005, 45, 1244–1255.
[87] P. Mazzatorta, M. Vracko, A. Jezierska, E. Benfenati, J. Chem. Inf. Comput. Sci. 2003, 43, 485–492.
[88] Y.-H. Wang, Y. Li, S.-L. Yang, L. Yang, J. Chem. Inf. Model. 2005, 45, 750–757.
[89] H. Satoh, O. Sacher, T. Nakata, L. Chen, J. Gasteiger, K. Funatsu, J. Chem. Inf. Comput. Sci. 1998, 38, 210–219.
[90] A. Tropsha, D. Fourches, Chem. Central J. 2009, 3.
[91] J. Hert, M. J. Keiser, J. J. Irwin, T. I. Oprea, B. K. Shoichet, J. Chem. Inf. Model. 2008, 48, 755–765.
[92] M. Wawer, L. Peltason, N. Weskamp, A. Teckentrup, J. Bajorath, J. Med. Chem. 2008, 51, 6075–6084.
[93] L. Peltason, Y. Hu, J. Bajorath, ChemMedChem 2009, 4, 1864–1873.
[94] D. K. Agrafiotis, D. Bandyopadhyay, M. Farnum, J. Chem. Inf. Model. 2007, 47, 69–75.
[95] R. Guha, J. H. Van Drie, J. Chem. Inf. Model. 2008, 48, 646–658.
[96] G. M. Maggiora, J. Chem. Inf. Model. 2006, 46, 1535–1535.
[97] D. Bonchev, D. H. Rouvray, Chemical Graph Theory. Introduction and Fundamentals, Gordon and Breach, New York, 1991, p. 300.
[98] C. Hansch, T. Fujita, J. Am. Chem. Soc. 1964, 86, 1616–1626.
[99] V. Cherkassky, F. Mulier, Learning from Data: Concept, Theory and Methods, 2nd ed., Wiley, Hoboken, New Jersey, 2007.
[100] J. Zupan, J. Gasteiger, Neural Networks in Chemistry, Wiley-VCH, Weinheim, 1999.
[101] I. I. Baskin, V. A. Palyulin, N. S. Zefirov, Methods Mol. Biol. 2008, 458, 137–158.
[102] L. Breiman, J. Friedman, C. J. Stone, R. A. Olshen, Classification and Regression Trees, Chapman & Hall/CRC, Wadsworth, CA, 1984.
[103] J. Rissanen, Ann. Stat. 1983, 11, 416–431.
[104] C. M. Bishop, Pattern Recognition and Machine Learning, Springer, New York, 2006.
[105] F. R. Burden, D. A. Winkler, J. Med. Chem. 1999, 42, 3183–3187.
[106] O. Obrezanova, G. Csanyi, J. M. R. Gola, M. D. Segall, J. Chem. Inf. Model. 2007, 47, 1847–1857.
[107] A. Abdo, B. Chen, C. Mueller, N. Salim, P. Willett, J. Chem. Inf. Model. 2010, 50, 1012–1020.
[108] L. Breiman, Mach. Learn. 2001, 45, 5–32.
[109] S. Agarwal, D. Dugar, S. Sengupta, J. Chem. Inf. Model. 2010, 50, 716–731.
[110] T. Joachims, T. Hofmann, Y. Yue, C. N. Yu, Commun. ACM 2009, 52, 97–104.
[111] I. I. Baskin, N. Kireeva, A. Varnek, Mol. Inf. 2010, 29, 581–587.
[112] N. Fechner, A. Jahn, G. Hinselmann, A. Zell, J. Cheminformatics 2010, 2.
[113] I. I. Baskin, N. I. Zhokhova, V. A. Palyulin, A. N. Zefirov, N. S. Zefirov, Dokl. Chem. (Engl. Transl.) 2009, 427, 172–175.
[114] D. L. Massart, Handbook of Chemometrics and Qualimetrics, Elsevier, New York, 1998.
[115] S. Oloff, S. Zhang, N. Sukumar, C. Breneman, A. Tropsha, J. Chem. Inf. Model. 2006, 46, 844–851.
[116] G. Marcou, D. Rognan, J. Chem. Inf. Model. 2007, 47, 195–207.
[117] C. Laggner, G. Wolber, J. Kirchmair, D. Schuster, T. Langer, in Chemoinformatics Approaches to Virtual Screening (Eds: A. Varnek, A. Tropsha), RSC Publisher, Cambridge, 2008, pp. 76–101.
[118] D. Erhan, P.-J. L’Heureux, S. Y. Yue, Y. Bengio, J. Chem. Inf. Model. 2006, 46, 626–635.
[119] J. L. Faulon, M. Misra, S. Martin, K. Sale, R. Sapra, Bioinformatics 2008, 24, 225–233.
[120] L. Jacob, J. P. Vert, Bioinformatics 2008, 24, 2149–2156.
[121] H. Geppert, J. Humrich, D. Stumpfe, T. Gaertner, J. Bajorath, J. Chem. Inf. Model. 2009, 49, 767–779.

Received: August 31, 2010
Accepted: January 14, 2011
Published online: January 24, 2011
