Artificial Intelligence in Chemistry and Drug Design
https://doi.org/10.1007/s10822-020-00317-x
EDITORIAL
Journal of Computer-Aided Molecular Design (2020) 34:709–715

* Richard Lewis
  [email protected]
* Torsten Luksch
  [email protected]
* Daniel Reker
  [email protected]

1 BenevolentAI, 4-8 Maple Street, London W1T 5HD, UK
2 Novartis Institutes for BioMedical Research, 4056 Basel, Switzerland
3 Syngenta Crop Protection AG, 4332 Stein, Switzerland
4 Koch Institute for Integrative Cancer Research and MIT-IBM Watson AI Lab, Massachusetts Institute of Technology, Cambridge, MA 02142, USA
5 Division of Gastroenterology, Hepatology and Endoscopy, Department of Medicine, Harvard Medical School, Brigham and Women’s Hospital, Boston, MA, 02115, USA

Introduction

The discovery of molecular structures with desired properties for applications in drug discovery, crop protection, or chemical biology is among the most impactful scientific challenges. However, given the complexity of biological systems and the associated cost for experiments and trials, molecular design is also scientifically very challenging, prone to failure, inherently expensive and time consuming [1, 2]. To improve our odds and the timelines in this process, and to identify good starting points, unbiased incorporation of knowledge through continuous analysis of literature and patents from different scientific fields is required [3]. The number of yearly publications is increasing, and a good collaboration between scientific experts across disciplines is required to fully evaluate the potential of a hypothesis. The theoretical space of chemistry, even when limited by molecular size, is huge [4] and dramatically exceeds what we can assess experimentally and even computationally. How can we navigate it efficiently and select molecules that satisfy the multiple parameters that need to be optimized and that are synthetically accessible [5]? The number of existing data points at the beginning of a project is low. How can we enrich projects in short time frames with informative molecules and data that are subsequently used to drive the design?

With these questions in mind, it comes as no surprise that data mining and statistics have been integrated into molecular discovery and design pipelines to provide computational support in the prioritization of molecular hypotheses [6, 7]. Machine learning algorithms have been part of the routine toolbox of computational and medicinal chemists for decades. The recent increase in applications and coverage of these methodologies has been attributed to advances in computational power, the growing amount of digitized research data, and an increasing theoretical understanding of the algorithms and their shortcomings. However, given the gradual character of these evolutions, it might be counterintuitive to expect a dramatic revolution of molecular design. Nevertheless, extravagant claims have been made for the ability of Artificial Intelligence (AI) to accelerate the design process [8, 9]; how well founded are these claims? While there is unquestionably a lot of potential in novel computational tools, it is important to scrutinize them and compare their performance to already existing methods, to objectively distinguish real progress from promotion. Only such careful evaluations will enable us to shed light on whether novel artificial intelligence methods contribute to an evolution or a revolution of the established scientific discipline of computer-assisted molecular design [10].

The historical context of machine learning in molecular design

Machine learning and AI are not new to researchers in computer-assisted molecular design. The pioneering work of Hansch and Fujita [6], as well as Free and Wilson [7], established the field of quantitative structure–activity relationship (QSAR) modelling. In their groundbreaking work, they used focused datasets as small as a series of a dozen chemical derivatives to fit equations that would anticipate fairly complex phenotypic effects such as toxicity [11]. Spurred by this success, a large research area has emerged that focuses specifically on (a) identifying approaches to describe chemical structures in more detail, to capture the characteristics that govern their properties, such as pharmacophores and three-dimensional structure but also autonomously learned representations [12, 13], and (b) deriving increasingly complex mathematical relationships that aim at describing the causal relationship between these chemical characteristics and the biological properties of interest for predictive purposes [14, 15]. Through an increasing amount of structural information [16], as well as data generation through combinatorial libraries and high-throughput screening, first applications of more complex machine learning models became feasible. However, the excitement and promise were soon followed by disenchantment. The growing field of QSAR learnt hard lessons in the 1990s about model validation, control experiments and other pitfalls [17]. Specifically, the overly broad application of computational models as hard filters for data sets that had not been covered in the training data led to an increasing disappointment in this technology.

With increasing understanding of the algorithmic principles and their statistical interpretation, the concept of domains of applicability was introduced [18–20]. Such predictive confidence estimates enabled computational drug hunters to increase the transparency of the capabilities of their tools as well as adjust expectations. This led to an increasing number of successful applications of machine learning to drug discovery and design across academia and industry in the 2000s, which slowly rebuilt the trust of the community and led to a sustained growth of their use. By 2015, thanks to computational advances such as the broad inclusion of GPUs in modern computing frameworks and the increasing amount of available RAM, the training of larger and deeper neural nets had become feasible. At the famous Kaggle challenge, a team from Toronto used a Deep Neural Net [21] to win a SAR challenge set by Merck. This competition is commonly perceived as a turning point in which a complex deep learning AI method outperformed other machine learning approaches and therefore arrived as a useful tool for computational molecular design. Deep Learning can trace its roots back to the 1960s, in its theoretical form at least, with the work of Ivakhnenko and Lapa [22]. AI can trace its roots even further back to a workshop that was run at Dartmouth College in 1956. Despite AI’s long history, which is typically longer than many imagine, the field has had a number of ‘winters’ in which expectations did not match reality. This has led to a number of setbacks for the field and it has taken time to recover from these. While multiple promising applications of AI now exist to derive molecular descriptors and understand their relationship to biological properties, these methods are inherently linked to big data. These algorithms are typically very data hungry before they can provide useful solutions; as a bonus, they provide unprecedented opportunities to navigate large datasets.

Big data and navigation in chemical space

Analysis of very big chemical datasets is a major research area that can profit from the application of modern machine learning and AI-based methods. For many years the only larger public chemical data set available was the “NCI Open Database” [23], released in 1999 and containing about 250,000 molecules. This database was used as a test case for validation of numerous “classical” cheminformatics methods and virtual screening techniques. The advent of the PubChem [24] and later ChEMBL [25] databases considerably increased the amount of publicly available chemical data for model training and validation. PubChem currently contains more than 100 million unique compounds. ChEMBL, in its current 26th release, holds information on nearly 2 million compounds, 13 thousand targets, and 16 million relationships between these compounds and targets. Another useful source of public chemical data is the ZINC database [26], which provides information about more than 230 million commercially available compounds. All three data sources offer user-friendly web interfaces, but since the data may be downloaded and processed locally, they have also been used for the development of several novel analysis and visualization tools [27, 28]. Recently, two new experimental developments have increased the amount of available data by several orders of magnitude. One of these technologies is DNA-based library synthesis [29], where a single library can contain tens or even hundreds of millions of molecules. The introduction of so-called “readily available” virtual libraries, currently offered by several compound vendors, became another important factor in increasing the resolution of possible molecular solutions: the virtual molecules in these libraries are enumerated using exclusively validated synthetic protocols and available building blocks, thereby enabling the vendor to guarantee delivery of picked molecules in a relatively short time. The number of molecules in these libraries is reaching billions [30]. With these developments in mind, the community is expecting further increases in available chemical matter, so that in the next decades we are likely to witness datasets with several billion compound structures. This is an exponential growth, comparable with Moore’s law describing the increase in computer processing power, that will push the number of synthetically accessible molecules towards the size of the virtual chemistry database GDB-17 with 166 billion structures [4] and thereby enable the fine-tuned selection of molecular prototypes, provided the amount of data can be handled appropriately.

Classical cheminformatics methods often struggle with such very big data sets, although some recent
developments are promising [30–32]. Novel machine learning and AI-based approaches can help by adaptively navigating vast chemical spaces and autonomously focusing on the most promising regions. In this special issue, several such approaches are described: in the study by Varnek and colleagues [33], Generative Topographic Mapping, a sophisticated dimensionality reduction method, was used to compare molecules in the company archive of a large pharmaceutical company with over 8 million commercially available samples. The method was enhanced by an AutoZoom function that focuses on the heavily populated areas of chemical space and automatically extracts substructures that represent these dense regions well. The methodology was used to identify sets of commercial molecules that maximally enhance the chemical space covered by molecules already available in the investigated company archive. Such approaches enable the adaptive enrichment of compound sets.

Following an orthogonal approach, Tetko and colleagues [34] describe a focused library generator that is able to generate molecules with a higher chance of exhibiting desired properties. The generator is based on a long short-term memory (LSTM) recurrent deep neural network, with results directed towards a specific target by a reinforcement learning process. As a proof of concept, Mdmx inhibitors were chosen as the objective for the presented study. The generated molecules were further refined by pharmacophore screening and molecular dynamics simulations. Additionally (and something that fortunately has become more commonplace in computational molecular design research), the source code of the generator is available on GitHub, which will allow other researchers to adapt it and use it in their own projects. Taken together, such adaptive approaches will improve the ability of research teams to navigate billions of possible structures to find molecular solutions that are sufficiently optimal for practical applications, if the predictive algorithms are powerful enough and sufficiently validated.

Practical considerations for AI-based molecular design

The field of machine learning and AI has moved from theoretical studies to real-world applications. The field of cheminformatics, and especially QSAR, has always been an early adopter of statistical methods and machine learning, but in the past few years the development of novel algorithms in this area has drastically increased. Besides more conventional models like Random Forest, Gradient Boosted Trees, or Gaussian Processes, which have been applied very successfully in the past [35], novel techniques like deep neural nets (DNNs), convolutional neural nets (CNNs) or recurrent neural nets (RNNs) have been increasingly recognized as valuable additions to the toolbox of chemoinformaticians [14, 15, 21, 36–38]. CNNs are especially attractive in this regard as they offer a different, data-driven way to extract molecular features [39, 40]. The promise of these novel techniques originates not only from slightly higher performance metrics in retrospective evaluations but, even more importantly, from an inherent ability to process unstructured data as well as to navigate and manipulate the “latent” space. This has led to a series of specialized AI tools that can perform tasks that are not possible with “traditional” machine learning algorithms (see for example References [9, 41, 42]). Another series of publications has shown the ability of deep neural nets to use matrices of experimental observations (multitasking) rather than vectors to improve predictive accuracy [43, 44]; this is especially useful for noisy and smaller data sets, for which data collection experiments are time-consuming and expensive, for example in ADMET predictions [45–49]. Directly tackling this challenge is also possible with one-shot learning [50], which enables learning from a low amount of data that is potentially better curated compared to high-throughput data. Conversely, to further combat low data limits and autonomously enable data generation, a new direction is the automation of experiments and “closing the loop” in the design-make-test-analysis (DMTA) cycle typically used in drug discovery programs [51]. Active learning [52] is being applied with increasing popularity to the analysis part of the DMTA cycle. This technique assists in selecting the most “interesting” compounds (most commonly the compounds that will help to improve the model) to test in the next cycle. The new results are then fed back into the system to improve model prediction quality and to rapidly increase the applicability domain of the model [53]. The design part of the DMTA cycle has received more attention, with generative chemistry methods well to the fore. Multiple new de novo design models based on RNNs [54–56], variational autoencoder (VAE) architectures [57–59] or generative adversarial networks (GAN) [60, 61] have been developed recently (see also Ref. [62]). Most of these models are trained on molecule structures from large public compound collections like ChEMBL [25] or PubChem [24] (to ensure “druglikeness”) and are able to generate completely novel molecules according to an objective function, for example, similarity to a given input structure or fitting to constraints in certain properties like logP or activity against a protein target. For the “make” part of the DMTA cycle, retro-synthesis, reaction condition and reactivity prediction have been the focus of the new DNN-based models [41, 63–66]. Here, substantial progress has been made in all areas, thanks both to access to more experimental data [67, 68] and to sophisticated techniques like Monte Carlo Tree Search (MCTS), which helps to identify the most likely synthetic routes in retro-synthesis planning using deep neural networks and symbolic AI [41]. In this special issue, Ghiandoni and colleagues present a novel reaction-based de novo
design algorithm [69] adapting previously published work on reaction vectors [70, 71] to optimise molecular structures that are likely to be more synthetically tractable. Using a recommender system, the authors demonstrate that their new methodology successfully prioritises the most relevant reaction vectors; this reduces the possibility of combinatorial explosion in the number of solutions while simultaneously ensuring that the probability of successful synthesis is high.

QSAR modelling has also concentrated on interpretability to assist the design part of DMTA; this assumes that the design is being carried out or supervised by skilled human experts. AI models are rather complex in terms of their representations of molecules. For that reason they are often treated as black boxes, and interpretation or understanding of what exactly is learned remains difficult [72]. The paper in this special issue from Webel et al. demonstrates the impact of deep learning on the area of identifying cytotoxic substructures in a large corpus of data [73]. Here, the authors use Deep Taylor Decomposition to identify these toxicophores in the training set so that one can more easily diagnose the structural drivers of toxicity. Such interpretability will increase confidence in novel methodological developments and facilitate the implementation of such methods into established molecular design pipelines.

In an industrial setting, an important aspect is making all these novel machine-learning models and technologies operational: this includes deployment, access, reproducibility, monitoring and maintenance. In addition, these new machine-learning systems bring novel technical challenges in industrial settings which often are not directly obvious [74]. Green and colleagues [75] discuss how these novel methods can be made accessible to a broad range of scientists in GSK and how a smart design of the system can help with maintenance and deployment. Their system, called BRADSHAW, integrates methods for chemical structure generation, experimental design, active learning and cheminformatics tools to allow automated molecular design in the DMTA cycle. Due to the very modular design of their system, they can incorporate many of these novel methods and models. In a retrospective case study they show how the system can be used successfully in lead optimization for the design of MMP12 inhibitors.

Control experiments: is AI really doing better?

In recent years there has been a resurgence of interest in, and demonstrated impact of, Artificial Intelligence in a number of domains [9, 76, 77]. The biggest impact in recent years has been the advent of publicly available Deep Learning algorithms for processing image data and pattern recognition through the ImageNet [78] competition, leading to a victory for Deep Learning in 2012. The recent advances, especially in Deep Learning, have led to a huge quantity of research conducted in this area and published online in preprints and peer-reviewed articles. Of particular interest here is the great quantity of research directed at challenges in chemistry and, specifically, drug discovery and materials chemistry. Given the increasing importance of these new machine-learning methods in a plethora of fields, researchers are trying to better understand how these models work [79, 80]. As might be expected, these models have a high risk of learning something different from what was intended [81, 82]. Much work has still to be done to make these methods resilient to noise (brittleness) or overfitting [83]. The latter, i.e. memorization of training data by these models, can lead to a reduced performance on prospective data in the best case but also to security issues in the worst case [84, 85]. For these reasons, the establishment of a strong tool kit for validation of these models is crucial (see for example [86–88]). In this special issue, Lee and coworkers [89] have investigated a recent study on large-scale comparison of deep learning models with more traditional methods on bioactivity prediction tasks [43]. They show how critical it is to choose the right metrics for benchmarking, with regard to data distribution and data biases, to enable a fair comparison of the methods. Furthermore, they suggest using precision and recall statistics in conjunction with the common area under the receiver-operator curve (AUC–ROC). Finally, they report challenges in interpreting scaffold-splitting cross-validation results. They conclude that more research needs to be done on proper validation procedures for these models used in the field of chemoinformatics.

Conclusions

As is evident from the information covered in this perspective and by the plethora of scientific and media outlets, many opportunities now exist for the development of novel computational methods, data-driven workflows and algorithmic tools that lead to a higher degree of automation and improve the efficacy of certain components in the drug design process [37]. A particular focus lies on assisting the selection of which experiment to carry out next [52]. The tight integration of artificial intelligence into pharmaceutical, chemical, and crop protection research is inevitable and has the potential to significantly improve the efficiency and efficacy in molecular discovery.

Although slight increases in retrospective accuracy are unlikely to qualitatively change the ability of machine learning to support the drug discovery and development pipeline [10], we anticipate an enthusiasm for this technology, coupled to technological and algorithmic advances, to significantly further the field and increase the contribution of
computational tools in the chemical sciences. A possible inflection point for the field will be the concurrent progress initiated by the convergence of multiple AI branches, such as natural language processing, computer vision, and robotics. This might very well amplify the increase in available information, change our ability to automate and increase reproducibility of experiments, as well as accelerate our understanding of the inner workings of AI. We are still a very long way from a completely in silico discovery process; the need to perform experiments is still vital.

With these advantages in mind, novel challenges will occur. First and foremost, similar to the emergence of applicability domains, a consensus among the community needs to be reached about what the appropriate controls are for validating and assessing novel AI tools [90]. Specifically relevant will be the proper implementation of adversarial controls to reduce the risk of overfitting, brittleness, and other classical machine learning challenges [84, 91], which are easily overlooked with increasing model complexity. Another important challenge that arises with increasingly complex models will be the potential for attacks or simply non-robust predictive behavior [85, 92]. This is a recurrent hot topic in deep learning research and its implications for novel computational tools in molecular design will need to be carefully considered.

In this special issue, we have carefully picked a selection of classical challenges in computer-assisted molecular design and have invited some of the leading scientists in their respective disciplines to contribute studies that propose avant-garde computational approaches to address these challenges and evaluate and contextualize their potential to accelerate drug discovery. We expect that this special issue will provide an overview of the possibilities that these novel tools hold, but also provide important examples of proper quality control, validation, and domain of applicability assessments. We hope that this will serve as a compendium to stir further discussions and guide the future development of novel AI tools to guide molecular design.

Acknowledgements We would like to specially thank all the authors of this special issue for their great contributions and all the reviewers for their valuable and critical feedback to ensure high-quality publications.

Author contributions All authors contributed equally.

References

1. Mullard A (2014) New drugs cost US$2.6 billion to develop. Nat Rev Drug Discov 13:877–877
2. Kola I, Landis J (2004) Can the pharmaceutical industry reduce attrition rates? Nat Rev Drug Discov 3:711–715
3. Searls DB (2005) Data integration: challenges for drug discovery. Nat Rev Drug Discov 4:45–58
4. Ruddigkeit L, van Deursen R, Blum LC, Reymond J-L (2012) Enumeration of 166 billion organic small molecules in the chemical universe database GDB-17. J Chem Inf Model 52:2864–2875
5. Lipinski C, Hopkins A (2004) Navigating chemical space for biology and medicine. Nature 432:855–861
6. Hansch C, Fujita T (1964) ρ-σ-π Analysis. A method for the correlation of biological activity and chemical structure. J Am Chem Soc 86:1616–1626
7. Free SM Jr, Wilson JW (1964) A mathematical contribution to structure-activity studies. J Med Chem 7:395–399
8. Zhavoronkov A, Ivanenkov YA, Aliper A et al (2019) Deep learning enables rapid identification of potent DDR1 kinase inhibitors. Nat Biotechnol 37:1038–1040
9. Stokes JM, Yang K, Swanson K et al (2020) A deep learning approach to antibiotic discovery. Cell 180:688–702.e13
10. Morrison C (2019) AI developers tout revolution, drugmakers talk evolution. Nat Biotechnol. https://doi.org/10.1038/d41587-019-00033-4
11. Holzgrabe U (1994) QSAR: Hansch analysis and related approaches, H. Kubiny, VCH, Weinheim 1993. 232 Seiten, 60 Abb. und 32 Tab. 158,– DM. ISBN 3-527-30035-X. Pharm Unserer Zeit 23:192–193
12. Todeschini R, Consonni V (2000) Methods and principles in medicinal chemistry. Handbook of molecular descriptors. Wiley-VCH, Weinheim
13. Yang K, Swanson K, Jin W et al (2019) Are learned molecular representations ready for prime time? Massachusetts Institute of Technology, Cambridge
14. Vamathevan J, Clark D, Czodrowski P et al (2019) Applications of machine learning in drug discovery and development. Nat Rev Drug Discov 18:463–477
15. Chen H, Engkvist O, Wang Y et al (2018) The rise of deep learning in drug discovery. Drug Discov Today 23:1241–1250
16. Lewis RA (2005) A general method for exploiting QSAR models in lead optimization. J Med Chem 48:1638–1648
17. Dearden JC, Cronin MTD, Kaiser KLE (2009) How not to develop a quantitative structure-activity or structure-property relationship (QSAR/QSPR). SAR QSAR Environ Res 20:241–266
18. Varnek A, Baskin I (2012) Machine learning methods for property prediction in chemoinformatics: Quo Vadis? J Chem Inf Model 52:1413–1437
19. Fechner N, Jahn A, Hinselmann G, Zell A (2010) Estimation of the applicability domain of kernel-based machine learning models for virtual screening. J Cheminform 2:2
20. Sheridan RP, Feuston BP, Maiorov VN, Kearsley SK (2004) Similarity to molecules in the training set is a good discriminator for prediction accuracy in QSAR. J Chem Inf Comput Sci 44:1912–1928
21. Ma J, Sheridan RP, Liaw A et al (2015) Deep neural nets as a method for quantitative structure-activity relationships. J Chem Inf Model 55:263–274
22. Ivakhnenko AG, Lapa VG (1967) Cybernetics and forecasting techniques. American Elsevier Pub. Co., New York
23. Voigt JH, Bienfait B, Wang S, Nicklaus MC (2001) Comparison of the NCI open database with seven large chemical structural databases. J Chem Inf Comput Sci 41:702–712
24. Kim S, Chen J, Cheng T et al (2019) PubChem 2019 update: improved access to chemical data. Nucleic Acids Res 47:D1102–D1109
25. Mendez D, Gaulton A, Bento AP et al (2019) ChEMBL: towards direct deposition of bioassay data. Nucleic Acids Res 47:D930–D940
26. Sterling T, Irwin JJ (2015) ZINC 15 – ligand discovery for everyone. J Chem Inf Model 55:2324–2337
27. Reymond J-L (2015) The chemical space project. Acc Chem Res 48:722–730
28. Borrel A, Kleinstreuer NC, Fourches D (2018) Exploring drug 52. Reker D, Schneider G (2015) Active-learning strategies in com-
space with ChemMaps.com. Bioinformatics 34:3773–3775 puter-assisted drug discovery. Drug Discov Today 20:458–465
29. Goodnow RA, Dumelin CE, Keefe AD (2017) DNA-encoded 53. Reker D, Schneider P, Schneider G (2016) Multi-objective active
chemistry: enabling the deeper sampling of chemical space. Nat machine learning rapidly improves structure-activity models
Rev Drug Discov 16:131–147 and reveals new protein-protein interaction inhibitors. Chem Sci
30. Hoffmann T, Gastreich M (2019) The next level in chemical space 7:3919–3927
navigation: going far beyond enumerable compound libraries. 54. Segler MHS, Kogej T, Tyrchan C, Waller MP (2018) Generat-
Drug Discov Today 24:1148–1156 ing focused molecule libraries for drug discovery with recurrent
31. NextMove Software|SmallWorld. Available at https: //www.nextm neural networks. ACS Cent Sci 4:120–131
ovesoftware.com/smallworld.html. Accessed 24 May 2019 55. Olivecrona M, Blaschke T, Engkvist O, Chen H (2017) Molecular
32. Walters WP (2019) Virtual chemical libraries. J Med Chem de-novo design through deep reinforcement learning. J Chemin-
62:1116–1124 form 9:48
33. Lin A, Beck B, Horvath D et al (2019) Diversifying chemical 56. Ertl P, Lewis R, Martin E, Polyakov V (2017) In silico genera-
libraries with generative topographic mapping. J Comput Aided tion of novel, drug-like chemical matter using the LSTM neural
Mol Des. https://doi.org/10.1007/s10822-019-00215-x network. arXiv preprint arXiv:171207449
34. Xia Z, Karpov P, Popowicz G, Tetko IV (2019) Focused library 57. Winter R, Montanari F, Noé F, Clevert D-A (2019) Learning
generator: case of Mdmx inhibitors. J Comput Aided Mol Des. continuous and data-driven molecular descriptors by translating
https://doi.org/10.1007/s10822-019-00242-8 equivalent chemical representations. Chem Sci 10:1692–1701
35. Sheridan RP, Wang WM, Liaw A et al (2016) Extreme gradient 58. Gómez-Bombarelli R, Wei JN, Duvenaud D et al (2018) automatic
boosting as a method for quantitative structure-activity relation- chemical design using a data-driven continuous representation of
ships. J Chem Inf Model 56:2353–2360 molecules. ACS Cent Sci 4:268–276
36. Sanchez-Lengeling B, Aspuru-Guzik A (2018) Inverse molecu- 59. Jin W, Barzilay R, Jaakkola T (2018) Junction tree variational
lar design using machine learning: generative models for matter autoencoder for molecular graph generation. arXiv preprint
engineering. Science 361:360–365 arXiv:180204364
37. Schneider P, Walters WP, Plowright AT et al (2019) Rethinking 60. Sanchez-Lengeling B, Outeiral C, Guimaraes GL, Aspuru-Guzik
drug design in the artificial intelligence era. Nat Rev Drug Discov. A (2017) Optimizing distributions over molecular space An objec-
https://doi.org/10.1038/s41573-019-0050-3 tive-reinforced generative adversarial network for inverse-design
38. de Almeida AF, de Almeida AF, Moreira R, Rodrigues T (2019) chemistry (ORGANIC). ChemRxiv. https://doi.org/10.26434/
Synthetic organic chemistry driven by artificial intelligence. Nat chemrxiv.5309668.v2
Rev Chem 3:589–604 61. Prykhodko O, Johansson S, Kotsias P-C et al (2019) A de novo
39. Kearnes S, McCloskey K, Berndl M et al (2016) Molecular graph molecular generation method using latent vector based generative
convolutions: moving beyond fingerprints. J Comput Aided Mol adversarial network. J Cheminform 11:74
Des 30:595–608 62. Elton DC, Boukouvalas Z, Fuge MD, Chung PW (2019) Deep
40. Yang K, Swanson K, Jin W et al (2019) Analyzing Learned learning for molecular design—a review of the state of the art.
Molecular Representations for Property Prediction. J Chem Inf Mol Syst Design Eng 4:828–849
Model 59:3370–3388 63. Coley CW, Green WH, Jensen KF (2018) Machine learning in
41. Segler MHS, Preuss M, Waller MP (2018) Planning chemical computer-aided synthesis planning. Acc Chem Res 51:1281–1289
syntheses with deep neural networks and symbolic AI. Nature 64. Engkvist O, Norrby P-O, Selmi N et al (2018) Computational
555:604–610 prediction of chemical reactions: current status and outlook. Drug
42. Méndez-Lucio O, Baillif B, Clevert D-A et al (2020) De novo Discov Today 23:1203–1218
generation of hit-like molecules from gene expression signatures 65. Gao H, Struble TJ, Coley CW et al (2018) Using machine learning
using artificial intelligence. Nat Commun 11:10 to predict suitable conditions for organic reactions. ACS Cent Sci
43. Mayr A, Klambauer G, Unterthiner T et al (2018) Large-scale 4:1465–1476
comparison of machine learning methods for drug target predic- 66. Coley CW, Jin W, Rogers L et al (2019) A graph-convolutional
tion on ChEMBL. Chem Sci 9:5441–5451 neural network model for the prediction of chemical reactivity.
44. Whitehead TM, Irwin BWJ, Hunt P et al (2019) Imputation of Chem Sci 10:370–377
assay bioactivity data using deep learning. J Chem Inf Model 67. Lowe DM (2012) Extraction of chemical structures and reactions
59:1197–1204 from the literature. PhD thesis, University of Cambridge, Cambridge
45. Montanari F, Kuhnke L, Ter Laak A, Clevert D-A (2020) Mod- 68. Reaxys. In: Reaxys. Available at www.reaxys.com. Accessed 1
eling physico-chemical ADMET endpoints with multitask graph Jan 2020
convolutional networks. Molecules 25:44 69. Ghiandoni GM, Bodkin MJ, Chen B et al (2020) Enhancing
46. Ramsundar B, Liu B, Wu Z et al (2017) Is multitask deep learning reaction-based de novo design using a multi-label reaction class
practical for pharma? J Chem Inf Model 57:2068–2076 recommender. J Comput Aided Mol Des. https://doi.org/10.1007/
47. Wenzel J, Matter H, Schmidt F (2019) Predictive multitask deep s10822-020-00300-6
neural network models for ADME-Tox properties: learning from 70. Patel H, Bodkin MJ, Chen B, Gillet VJ (2009) Knowledge-based
large data sets. J Chem Inf Model 59:1253–1268 approach to de novo design using reaction vectors. J Chem Inf
48. Xu Y, Ma J, Liaw A et al (2017) Demystifying multitask deep Model 49:1163–1184
neural networks for quantitative structure-activity relationships. 71. Hristozov D, Bodkin M, Chen B et al (2012) ChemInform
J Chem Inf Model 57:2490–2504 abstract: validation of reaction vectors for de novo design. Chem-
49. Zhou Y, Cahya S, Combs SA et al (2019) Exploring tunable Inform 43:50
hyperparameters for deep neural networks with industrial ADME 72. Sheridan RP (2019) Interpretation of QSAR models by coloring
data sets. J Chem Inf Model 59:1005–1016 atoms according to changes in predicted activity: how robust is
50. Altae-Tran H, Ramsundar B, Pappu AS, Pande V (2017) Low data it? J Chem Inf Model 59:1324–1337
drug discovery with one-shot learning. ACS Cent Sci 3:283–293 73. Webel HE, Kimber TB, Radetzki S et al (2020) Revealing cyto-
51. Schneider G (2018) Automating drug discovery. Nat Rev Drug toxic substructures in molecules using deep learning. J Comput
Discov 17:97–113 Aided Mol Des. https://doi.org/10.1007/s10822-020-00310-4
Journal of Computer-Aided Molecular Design (2020) 34:709–715 715
74. Sculley D, Holt G, Golovin D et al (2015) Hidden technical debt in machine learning systems. Adv Neural Inf Process Syst 2:2503–2511
75. Green DVS, Pickett S, Luscombe C et al (2019) BRADSHAW: a system for automated molecular design. J Comput Aided Mol Des. https://doi.org/10.1007/s10822-019-00243-7
76. Cui J, Zhang H, Han H et al (2018) Improving 2D face recognition via discriminative face depth estimation. 2018 International Conference on Biometrics (ICB)
77. Cha KH, Petrick N, Pezeshk A et al (2020) Evaluation of data augmentation via synthetic images for improved breast mass detection on mammograms using deep learning. J Med Imag (Bellingham) 7:012703
78. Fei-Fei L, Deng J, Li K (2010) ImageNet: constructing a large-scale image database. J Vision 9:1037–1037
79. Samek W, Müller K-R (2019) Towards explainable artificial intelligence. In: Explainable AI: interpreting, explaining and visualizing deep learning. Springer, Cham, pp 5–22
80. Alber M, Lapuschkin S, Seegerer P et al (2019) iNNvestigate neural networks. J Mach Learn Res 20:1–8
81. Sieg J, Flachsenberg F, Rarey M (2019) In need of bias control: evaluating chemical data for machine learning in structure-based virtual screening. J Chem Inf Model 59:947–961
82. Lapuschkin S, Wäldchen S, Binder A et al (2019) Unmasking Clever Hans predictors and assessing what machines really learn. Nat Commun 10:1096
83. Heaven D (2019) Why deep-learning AIs are so easy to fool. Nature 574:163–166
84. Wallach I, Heifets A (2018) Most ligand-based classification benchmarks reward memorization rather than generalization. J Chem Inf Model 58:916–932
85. Carlini N, Liu C, Kos J et al (2018) The secret sharer: measuring unintended neural network memorization & extracting secrets. arXiv preprint arXiv:1802.08232
86. Wu Z, Ramsundar B, Feinberg EN et al (2018) MoleculeNet: a benchmark for molecular machine learning. Chem Sci 9:513–530
87. Brown N, Fiscato M, Segler MHS, Vaucher AC (2019) GuacaMol: benchmarking models for de novo molecular design. J Chem Inf Model 59:1096–1108
88. Raschka S (2018) Model evaluation, model selection, and algorithm selection in machine learning. arXiv preprint arXiv:1811.12808
89. Robinson MC, Glen RC, Lee AA (2020) Validating the validation: reanalyzing a large-scale comparison of deep learning and machine learning models for bioactivity prediction. J Comput Aided Mol Des. https://doi.org/10.1007/s10822-019-00274-0
90. Walters WP, Murcko M (2020) Assessing the impact of generative AI on medicinal chemistry. Nat Biotechnol 38:143–145
91. Chuang KV, Keiser MJ (2018) Adversarial controls for scientific machine learning. ACS Chem Biol 13:2819–2821
92. Eykholt K, Evtimov I, Fernandes E et al (2018) Robust physical-world attacks on deep learning visual classification. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition

Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.