Novel Hybrid Ultrafast Shape Descriptor Method For Use in Virtual Screening
All content following this page was uploaded by John Blayney Owen Mitchell on 04 June 2014.
Address: Unilever Centre for Molecular Science Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge, CB2
1EW, UK
Email: Edward O Cannon - [email protected]; Florian Nigsch - [email protected]; John BO Mitchell* - [email protected]
* Corresponding author
Abstract
Background: We have introduced a new Hybrid descriptor composed of the MACCS key
descriptor encoding topological information and Ballester and Richards' Ultrafast Shape
Recognition (USR) descriptor. The latter is calculated from the moments of the distribution of
the interatomic distances; in this work we also included higher moments than in the original
implementation.
Results: The performance of this Hybrid descriptor is assessed using Random Forest and a dataset
of 116,476 molecules. Our dataset includes 5,245 molecules in ten classes from the 2005 World
Anti-Doping Agency (WADA) dataset and 111,231 molecules from the National Cancer Institute
(NCI) database. In a 10-fold Monte Carlo cross-validation this dataset was partitioned into three
distinct parts for training, optimisation of an internal threshold that we introduced, and validation
of the resulting model. The standard errors obtained were used to assess statistical significance of
observed improvements in performance of our new descriptor.
Conclusion: The Hybrid descriptor was compared to the MACCS key descriptor, USR with the
first three (USR), four (UF4) and five (UF5) moments, and a combination of MACCS with USR
(three moments). The MACCS key descriptor was not combined with UF5, due to similar
performance of UF5 and UF4. Superior performance in terms of all figures of merit was found for
the MACCS/UF4 Hybrid descriptor with respect to all other descriptors examined. These figures
of merit include recall in the top 1% and top 5% of the ranked validation sets, precision, F-measure,
area under the Receiver Operating Characteristic curve and Matthews Correlation Coefficient.
Chemistry Central Journal 2008, 2:3 http://journal.chemistrycentral.com/content/2/1/3
date have used machine learning methods to classify prohibited substances into their respective categories using chemical descriptors calculated by computers or data collated from analytical laboratory instruments. An alternative approach is to use a machine learning algorithm to virtually screen a database of compounds. Here the objective is to rank molecules based on their probability of being active relative to a reference structure or set of structures and this method has been used in the past by the pharmaceutical industry to search for novel lead compounds and identify compounds potentially appropriate for a given receptor. There are a number of different methods that can be used for virtual screening of databases of chemical compounds. However, this work involves similarity-based virtual screening.

Similarity-based virtual screening is founded on the similar property principle [5], which states that molecules with similar structures often exhibit similar properties and biological activity. Similarity-based virtual screening is a technique used to rank the compounds of large databases based on how similar they are to one, or several, reference molecules of pharmacological interest. The rank assigned to a molecule reflects its probability of being active. Molecules in the top few ranks of a sorted database are expected to be very similar to the biologically active query molecules in the training set and are thus assigned a higher probability of being active. One of the main applications of similarity-based virtual screening is to help decide which compounds should be taken forward for in vitro screening. Other related applications include identifying which compounds should be purchased from an external vendor or which libraries to synthesise [6].

There are a number of important steps in conducting a similarity-based virtual screening experiment, the first of which is to define the representation of the molecules in chemical space using descriptors. Descriptors are usually defined by their dimensionality. One-dimensional descriptors are properties such as molecular weight and log P [7]. Two-dimensional descriptors are derived from the connection table [8], whilst three-dimensional descriptors use geometric information from molecular structures in three-dimensional space [9]. Probably the most commonly used descriptors are those based on two-dimensional structure [6,10]. Such descriptors are usually binary in nature and typically encode the presence or absence of substructural fragments, a prime example being the MACCS key descriptor [11]. Hashed fingerprints are also commonly used, and differ from structural key descriptors in that they do not use a predefined dictionary, but incorporate patterns, often made up of atom types, augmented atoms and atom paths. The Daylight fingerprint [12] of length 1024 bits is an example of a hashed fingerprint.

Such descriptors have been reported by Hert et al. [13,14] and Bender et al. [15] to perform well in the domain of similarity-based virtual screening. In contrast, until fairly recently, it has been accepted that three dimensional methods do not perform as well as existing two dimensional methods solely in terms of the number of actives retrieved [16], if the actives have a larger common substructure to each other than to the negatives [17]. Most three dimensional methods have performed less well in the past due to the fact that three dimensional descriptors have to deal with translational and rotational variance in addition to a potentially large number of conformations.

Previous studies have shown that one of the key features in discriminating active from inactive molecules is molecular shape [18,19]. However, as with many other three dimensional methods, it is often said that the problem of calculating molecular shape is not only a very challenging task, it is also time consuming. A recent shape descriptor proposed by Ballester and Richards [20] called Ultrafast Shape Recognition (USR) has been shown to avoid the alignment problem, and to be up to 1500 times faster to calculate than other current methodologies. The shape descriptor makes the assumption that a molecule's shape can be uniquely defined by the relative position of its atoms and that three-dimensional shape can be characterised by one-dimensional distributions. They compared the performance of USR to the EigenSpectrum Shape Fingerprints (EShape3D) in the Molecular Operating Environment (MOE) [21] by visualising the shapes of the top few hits in the ranked database and found that similar shapes were retrieved for both methods.

Recent work by Baber et al. [22] has shown that combining two and three-dimensional methods can improve virtual screening performance. Baber et al. have used consensus scoring in ligand based virtual screening with a set of two and three dimensional structural and pharmacophore based descriptors and found that consensus scores generally worked better than single scores. The improvement in performance was attributed to additional information relevant to ligand-receptor binding.

The second stage in any virtual screen is to decide the number of bioactive reference compounds. Past virtual screening studies have mainly been concerned with the use of a single bioactive reference structure. However more recent studies have used multiple bioactive references [13]. A common way to rank molecules is to select a training set of actives and inactives for the training of a classification method in order to predict the likelihood of unseen molecules in a test set being active; the molecules are then ranked based on this likelihood. A number of machine learning methods have been used: support vector machine [23], k-nearest neighbour [24], binary kernel
discrimination [13], neural networks [25] and the naive Bayes classifier [15]. More recently, the Random Forest classification algorithm has successfully been used to screen a database of ~8,000 Chinese herbal substances for potential inhibitors of several therapeutically important molecular targets [26].

The final stage is to assess the performance of the descriptor and classification method. Common methods used in the machine learning and information retrieval community are: recall, precision, the F-measure, Matthews Correlation Coefficient and the area under the Receiver Operating Characteristic curve (AUC).

In this paper, we report the use of a novel Hybrid descriptor that combines both two and three-dimensional information. The novel descriptor is composed of the MACCS key 166 bit packed descriptor, which is binary in nature (composed of 0s and 1s) and the USR descriptor which is based on 12 floating point numbers. We have extended the USR descriptor to include 4 additional floating point numbers from the calculation of the fourth moment (kurtosis) of the interatomic distance distributions. Concatenated together, this makes a descriptor of 182 components.

0101110011...MACCS(166) + 2.567...USR(16)

We have used the Random Forest classifier to conduct a virtual screen and rank molecules taken from the WADA 2005 dataset and the National Cancer Institute (NCI) database based on their probability of being active. We have assessed the Hybrid descriptor's performance against the USR descriptor (with three moments), the USR descriptor with four and five moments (UF4, UF5) and the MACCS key descriptor on an external validation set and report: the recall of actives in the top 1% and top 5% of the validation sets, precision, the F-measure, Matthews Correlation Coefficient and the area under the Receiver Operating Characteristic curve of the ranked validation sets. Details of the performance measures can be found below.

Methods
Dataset and preprocessing
All 249,071 three dimensional CORINA [27] generated structures were taken from the publicly available 1999 National Cancer Institute database [28]. Duplicates were removed, leaving 236,936 unique structures, which were then filtered for drug-likeness, using a Lipinski filter [29] in MOE [21], leaving 111,694 structures. A further 463 metal ion complexes were filtered from the database as no bits were set in the MACCS key descriptor for these compounds. This left a dataset of 111,231 NCI compounds. These compounds were combined with the original 5,245 compounds taken from the 2005 World Anti-Doping Agency (WADA) dataset [1] to form a final dataset of 116,476 compounds. The WADA classes are composed of molecules explicitly on the prohibited list and of molecules of similar biological activity and chemical structure taken from the MDDR database [11] (version 2003.1). The WADA dataset is composed of 10 different activity classes: beta blockers (P2), anabolic agents (S1), hormones and related substances (S2), β-2 agonists (S3), agents with anti-estrogenic activity (S4), diuretics and other masking agents (S5), stimulants (S6), narcotics (S7), cannabinoids (S8) and glucocorticosteroids (S9). The breakdown of the WADA activity classes is given in Table 1 and in the supplementary information (Additional file 1). Pictures of the most and least representative molecules for each prohibited class can also be found in the supplementary information (see Additional files 2 and 3). For the purpose of this work, the NCI compounds were assumed to be inactive, with no NCI compound being present in the WADA dataset after an initial screen that compared canonical SMILES strings.

Table 1: WADA class & number of molecules
WADA Class   Number of Molecules
P2           239
S1           47
S2           272
S3           367
S4           928
S5           1,000
S6           804
S7           195
S8           1,000
S9           26
Allowed      367
Total        5,245

Conformer generation
Based on the original work of Ballester and Richards [20], only one low energy conformation per molecule was used. They showed, by taking one query molecule and generating 292 additional conformations, that the changes in the results from their Ultrafast Shape Recognition as a function of conformer were negligible. It has been shown [30] that taking more conformers into account can improve performance. However, to retain the descriptor's Ultrafast property, only the CORINA [27] generated conformation has been used.

Descriptor
MACCS key descriptor
The MACCS key 166 bit descriptor was originally created by Molecular Design Limited (MDL) [11], and is a two dimensional substructure descriptor which encodes
example than a negative example if one from each class were picked at random.

Computational time
The wall clock time measures the time in seconds required to calculate shape descriptors for the 116,476 molecules used in this work.

Standard error of the mean
The standard error (S.E) of the mean

S.E = \frac{\hat{\sigma}}{\sqrt{n}}

with σ̂ the standard deviation of n independent runs, has been calculated over the ten Random Forest runs for the different descriptor methods, WADA prohibited classes and all performance measures.

Results and discussion
MACCS key descriptor
We find that the MACCS key descriptor is able to recall a large percentage of each WADA prohibited class in the top 1% and top 5% of the ranked validation sets, with values as high as 96% for the P2 class for the recall in the top 1%, see Table 2. The precision values are fairly high across the board with values ranging from 0.53 for the S8 class to 0.93 for P2. The MACCS key descriptor, however, performs poorly for the S1 and S9 classes. In the case of S1, a large number of false positives are predicted, giving a poor precision value of 0.07. The precision obtained for S9 using MACCS is also disappointing, with MACCS keys failing to predict true positives, as illustrated by the low recall values in the top 1% and top 5% of the ranked validation sets. The lowest of the AUC values is found to be 0.55 for S1, with intermediate values for other classes, and up to practically unity (perfect case scenario) for the P2 class. It should be noted that all AUC values have been rounded to two decimal places, and that the values for the P2 class are slightly below 1.00. The F-measure and Matthews Correlation Coefficient values are all moderately high, with six of the classes returning values greater than 0.6. Standard error results averaged over the ten runs are detailed in Table 3. All values are low, indicating consistency between the runs.

Shape descriptors
Computational time
It is not surprising that adding more moments increases the computational time required to calculate the descriptors for the 116,476 molecules in this dataset. The time taken for the USR descriptor to be calculated was 1,432 seconds, UF4 1,528 seconds and UF5 2,006 seconds. All results are based on the use of a 1.06 GHz Athlon processor and the implementation of the method in the Python programming language. Clearly the difference of less than 100 seconds is marginal between USR and UF4. However, the computational time increases by an extra 478 seconds (31%) upon addition of the fifth central moment.

Performance
The results (Table 2) for USR, UF4 and UF5 as stand-alone methods are all fairly comparable and are worse than the MACCS key fingerprint on all measures. If one were to gauge performance based on recall in the top 1% of the ranked validation sets, the UF4 descriptor is the best method for 6 out of the 10 classes and in 6 out of 10 classes for the recall in the top 5%. With regards to precision, the UF5 descriptor is better than the USR descriptor for 7 of the classes, but only two classes (S2 and S5) when compared to UF4. The UF5 descriptor achieves the highest AUC values for 8 classes. When considering the F-measure, USR is worse than UF4 for 7 classes and UF5 is worse than UF4 for 6 classes. The Matthews Correlation Coefficient gives similar information to the F-measure and hence the results are very alike, with the UF5 being better than the USR for 7 out of the ten classes and better than the UF4 for only one class.

Hybrid descriptor
In order to enhance the performance of the MACCS key descriptor we have combined it with UF4, the best performing shape descriptor based on computational performance and time to calculate, to form a Hybrid descriptor. The Hybrid descriptor combines the 166 bits of the MACCS key descriptor with the 12 components of the Ultrafast shape descriptor and four extra components from the fourth moment, to form a descriptor of length 182. The descriptor is composed of discrete (0s or 1s) data from the MACCS key descriptor and continuous data from the UF4 descriptor.

The results show a significant improvement in performance over MACCS when the Hybrid descriptor is used. This is particularly true for the values averaged over all classes, where the Hybrid descriptor is the best descriptor for all performance measures. The only exception was the S5 class, which consistently showed the MACCS key descriptor to give better results than the other methods. One possible explanation is that the molecules in the S5 class have very diverse shapes, so that the UF4 descriptor contributes noise, hence the slight drop in performance relative to using MACCS on its own. MACCS was also combined with USR, this combination performing slightly worse than the MACCS-UF4 Hybrid descriptor we propose in this work. The standard error values for the 95% confidence level for the runs averaged over the ten runs and ten classes are detailed in Table 3, and support
Table 2: Performance measures. Percentage of actives recalled in the top 1% and top 5% of the ranked validation sets, precision of
predicted positives, area under the Receiver Operating Characteristic curve, F-measure and the Matthews Correlation Coefficient. All
results are calculated over ten different runs and are based on the validation sets.
Descriptor P2 S1 S2 S3 S4 S5 S6 S7 S8 S9 Average
Recall 1% Hybrid 100.00 76.36 90.29 94.73 85.86 57.60 91.74 88.96 40.92 55.00 78.15
MACCS 96.17 50.83 51.03 43.08 82.49 63.96 90.50 80.41 41.88 0.00 60.04
USR 34.58 70.91 51.03 43.08 46.98 13.28 32.39 25.00 8.76 6.67 33.27
UF4 42.88 70.00 62.94 52.53 54.05 16.72 30.80 27.87 8.40 16.67 38.29
UF5 69.06 41.45 32.17 25.59 15.78 4.07 12.34 21.05 3.22 0.00 22.47
Recall 5% Hybrid 100.00 85.45 95.44 97.47 95.82 78.12 96.37 93.13 69.80 63.33 87.49
MACCS 96.67 59.17 76.47 59.45 93.27 80.40 94.63 85.92 65.64 4.17 71.58
USR 61.19 80.91 76.47 59.45 69.87 32.72 54.88 45.42 28.56 23.33 53.28
UF4 73.39 77.27 83.82 65.82 74.01 38.76 53.98 42.19 29.04 41.67 58.00
UF5 84.21 49.35 48.20 34.93 24.55 11.39 22.41 38.30 11.39 3.75 32.85
Precision Hybrid 0.94 0.80 0.91 0.80 0.79 0.62 0.90 0.78 0.58 0.42 0.75
MACCS 0.93 0.07 0.87 0.77 0.67 0.62 0.89 0.71 0.53 0.00 0.61
USR 0.12 0.55 0.23 0.38 0.50 0.13 0.36 0.41 0.03 0.00 0.27
UF4 0.32 0.69 0.17 0.51 0.71 0.08 0.51 0.49 0.27 0.00 0.38
UF5 0.21 0.52 0.37 0.44 0.64 0.16 0.49 0.27 0.04 0.00 0.32
AUC Hybrid 1.00 0.89 0.96 0.87 0.79 0.67 0.94 0.90 0.70 0.68 0.84
MACCS 1.00 0.55 0.96 0.75 0.67 0.75 0.91 0.89 0.74 0.83 0.81
USR 1.00 0.69 0.65 0.61 0.58 0.54 0.56 0.59 0.54 0.57 0.63
UF4 1.00 0.77 0.64 0.72 0.60 0.55 0.57 0.63 0.53 0.55 0.66
UF5 1.00 0.94 0.84 0.79 0.81 0.71 0.76 0.76 0.71 0.53 0.78
F-measure Hybrid 0.91 0.54 0.82 0.72 0.71 0.45 0.83 0.70 0.27 0.22 0.62
MACCS 0.91 0.11 0.79 0.67 0.63 0.51 0.84 0.63 0.29 0.00 0.54
USR 0.11 0.27 0.17 0.23 0.35 0.06 0.22 0.08 0.04 0.00 0.15
UF4 0.16 0.49 0.23 0.36 0.42 0.07 0.19 0.09 0.03 0.00 0.20
UF5 0.10 0.31 0.20 0.23 0.36 0.07 0.22 0.07 0.03 0.00 0.16
MCC Hybrid 0.91 0.58 0.83 0.73 0.71 0.47 0.83 0.71 0.32 0.26 0.63
MACCS 0.91 0.13 0.79 0.68 0.63 0.52 0.84 0.64 0.32 0.00 0.55
USR 0.12 0.32 0.18 0.25 0.37 0.08 0.24 0.13 0.08 0.00 0.18
UF4 0.19 0.51 0.25 0.38 0.46 0.09 0.24 0.15 0.08 0.02 0.24
UF5 0.13 0.34 0.23 0.26 0.40 0.09 0.26 0.10 0.06 0.00 0.19
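The figures of merit in Table 2 can be illustrated with a minimal Python sketch. It assumes the standard textbook definitions of precision, F-measure (taken here as the balanced F1), Matthews Correlation Coefficient and AUC, since the exact formulas used by the authors are not reproduced in this text. All function names are hypothetical, and recall in the top fraction is returned as a fraction rather than the percentage shown in the table.

```python
import math

def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels (1 = active, 0 = inactive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0

def f_measure(tp, fp, fn):
    # Assumption: balanced F1, the harmonic mean of precision and recall.
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if p + r else 0.0

def mcc(tp, fp, fn, tn):
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def recall_in_top(scores, y_true, fraction):
    """Fraction of all actives found in the top `fraction` of the ranked list."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    cutoff = max(1, int(len(ranked) * fraction))  # assumption: round down, at least 1
    total = sum(y_true)
    return sum(label for _, label in ranked[:cutoff]) / total if total else 0.0

def auc(scores, y_true):
    """Probability that a random active outranks a random inactive (ties count 0.5)."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The pairwise AUC above is quadratic in the number of molecules and is meant only to make the definition explicit; a rank-based formula would be the practical choice for a dataset of this size.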
our finding that the Hybrid descriptor is the best descriptor. The Hybrid descriptor was found to be statistically significantly better than all other descriptors across all performance measures at the 95% confidence level.

Conclusion
We have introduced a novel Hybrid descriptor for virtual screening of databases of chemical structures, which is quick to calculate, robust, and incorporates both two and three-dimensional information. The Hybrid descriptor gives better performance than either MACCS keys or the shape descriptors presented in this work based on: the recall in the top 1% and top 5% of the validation set, the positive precision, F-measure, Matthews Correlation Coefficient and the area under the Receiver Operating Characteristic curve. Incorporating an additional central moment, the kurtosis, into Ballester and Richards' [20] Ultrafast Shape Recognition descriptor, significantly improved its performance. The addition of the fifth central moment, however, does not improve the performance of UF4 sufficiently to justify the increased computational expense.

Methods
Ultrafast shape recognition
Ballester and Richards' USR descriptor and the UF4 shape descriptor were implemented in Python [37].
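The authors' Python implementation is not reproduced in this text, so the following is only a minimal sketch of how a USR-style descriptor and its UF4 extension might be computed. It assumes the four reference points of Ballester and Richards' scheme (the molecular centroid, the atom closest to it, the atom farthest from it, and the atom farthest from that atom) and uses the mean plus raw central moments of each distance distribution. The moment normalisation, the function names and the toy coordinates are all assumptions made for illustration.

```python
import math

def _centroid(coords):
    n = len(coords)
    return tuple(sum(atom[i] for atom in coords) / n for i in range(3))

def _moments(distances, n_moments):
    """Mean followed by the central moments of order 2..n_moments."""
    n = len(distances)
    mean = sum(distances) / n
    out = [mean]
    for k in range(2, n_moments + 1):
        out.append(sum((d - mean) ** k for d in distances) / n)
    return out

def shape_descriptor(coords, n_moments=3):
    """Return 4 * n_moments floats: 12 for USR (3 moments), 16 for UF4 (4 moments)."""
    ctd = _centroid(coords)
    cst = min(coords, key=lambda a: math.dist(a, ctd))  # atom closest to the centroid
    fct = max(coords, key=lambda a: math.dist(a, ctd))  # atom farthest from the centroid
    ftf = max(coords, key=lambda a: math.dist(a, fct))  # atom farthest from fct
    descriptor = []
    for ref in (ctd, cst, fct, ftf):
        distances = [math.dist(atom, ref) for atom in coords]
        descriptor.extend(_moments(distances, n_moments))
    return descriptor

def hybrid_descriptor(maccs_bits, coords):
    """Concatenate 166 MACCS bits with the 16 UF4 components (182 values in total)."""
    return [float(bit) for bit in maccs_bits] + shape_descriptor(coords, n_moments=4)

# Toy example: five made-up atom positions and an all-zero placeholder bit vector.
toy_coords = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (0.0, 1.2, 0.0),
              (0.0, 0.0, 0.9), (1.0, 1.0, 1.0)]
toy_hybrid = hybrid_descriptor([0] * 166, toy_coords)
```

Because only interatomic distances enter the calculation, the result does not change when the molecule is translated or rotated, which is why no alignment step is needed. The MACCS bits themselves would have to come from a cheminformatics toolkit and are represented here by a placeholder.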
Table 3: The standard error of the mean. Percentage of actives recalled in the top 1% and top 5% of the ranked validation sets,
precision of predicted positives, area under the Receiver Operating Characteristic curve, F-measure and the Matthews Correlation
Coefficient. All results are calculated over ten different runs and are based on the validation sets. The standard error values at the 95%
confidence level were calculated over the ten different runs and ten classes.
Descriptor P2 S1 S2 S3 S4 S5 S6 S7 S8 S9 SE 95%
Recall 1% Hybrid 0.000 3.090 1.319 1.020 0.812 1.010 0.566 1.585 0.962 4.339 3.924
MACCS 0.255 1.944 3.142 1.092 4.183 1.000 0.381 1.330 4.379 0.000 5.679
USR 2.652 4.242 1.612 1.369 0.972 0.710 0.837 2.106 0.665 2.722 4.025
UF4 2.115 5.080 1.934 1.482 0.764 0.616 0.765 1.374 0.637 5.556 4.171
UF5 0.015 0.018 0.017 0.015 0.005 0.002 0.006 0.011 0.002 0.000 4.131
Recall 5% Hybrid 0.000 2.010 0.774 0.614 0.450 0.949 0.386 1.164 0.794 3.333 2.520
MACCS 0.000 2.307 1.373 0.442 2.836 0.912 0.220 1.037 6.719 2.668 5.395
USR 1.722 3.442 1.886 1.231 1.403 0.749 1.169 1.195 0.966 6.667 3.959
UF4 1.805 3.105 1.404 1.609 0.733 0.917 1.023 2.633 1.040 5.693 3.772
UF5 0.012 0.027 0.010 0.019 0.004 0.003 0.007 0.016 0.003 0.027 4.677
Precision Hybrid 0.004 0.020 0.059 0.099 0.005 0.009 0.034 0.009 0.012 0.037 0.033
MACCS 0.010 0.014 0.018 0.043 0.026 0.022 0.008 0.030 0.036 0.000 0.064
USR 0.005 0.034 0.015 0.018 0.010 0.015 0.013 0.033 0.002 0.000 0.038
UF4 0.017 0.017 0.007 0.019 0.009 0.005 0.020 0.033 0.028 0.000 0.048
UF5 0.066 0.104 0.056 0.061 0.022 0.028 0.035 0.068 0.015 0.000 0.042
AUC Hybrid 0.000 0.009 0.002 0.005 0.003 0.002 0.001 0.003 0.004 0.019 0.024
MACCS 0.000 0.002 0.002 0.004 0.002 0.002 0.001 0.003 0.003 0.015 0.027
USR 0.000 0.011 0.004 0.003 0.001 0.001 0.001 0.006 0.001 0.011 0.027
UF4 0.000 0.010 0.004 0.004 0.002 0.000 0.001 0.007 0.001 0.015 0.028
UF5 0.000 0.011 0.008 0.013 0.004 0.004 0.006 0.028 0.007 0.025 0.025
F-measure Hybrid 0.003 0.016 0.035 0.052 0.003 0.002 0.027 0.006 0.004 0.019 0.047
MACCS 0.006 0.021 0.008 0.015 0.012 0.009 0.007 0.017 0.010 0.000 0.061
USR 0.003 0.014 0.009 0.004 0.003 0.003 0.003 0.007 0.001 0.000 0.022
UF4 0.006 0.015 0.006 0.010 0.003 0.003 0.003 0.006 0.001 0.001 0.033
UF5 0.016 0.059 0.020 0.019 0.016 0.008 0.013 0.019 0.006 0.000 0.024
MCC Hybrid 0.914 0.580 0.828 0.727 0.711 0.468 0.830 0.708 0.316 0.257 0.044
MACCS 0.006 0.025 0.008 0.015 0.011 0.009 0.007 0.016 0.010 0.000 0.060
USR 0.116 0.315 0.183 0.252 0.372 0.081 0.237 0.130 0.075 0.001 0.023
UF4 0.188 0.511 0.251 0.383 0.455 0.085 0.244 0.150 0.077 0.019 0.033
UF5 0.019 0.064 0.013 0.018 0.010 0.007 0.011 0.025 0.010 0.001 0.026
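As a small worked illustration of the standard error of the mean reported in Table 3, the sketch below divides the standard deviation of a set of independent runs by the square root of the number of runs. Using the sample standard deviation (n - 1 in the denominator) is an assumption, as is the function name; the example values are made up and are not results from this study.

```python
import math
import statistics

def standard_error(values):
    """Standard error of the mean: standard deviation divided by sqrt(n)."""
    # Assumption: sample standard deviation (n - 1 denominator).
    return statistics.stdev(values) / math.sqrt(len(values))

# Made-up recall values from ten hypothetical runs (not data from the paper).
example_runs = [90.1, 91.4, 89.7, 90.8, 92.0, 90.3, 91.1, 89.9, 90.6, 91.2]
example_se = standard_error(example_runs)
```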