Predicting Clinically Promising Therapeutic Hypotheses Using Tensor Factorization
Predicting Clinically Promising Therapeutic Hypotheses Using Tensor Factorization
Predicting Clinically Promising Therapeutic Hypotheses Using Tensor Factorization
Abstract
Background: Determining which target to pursue is a challenging and error-prone first step in developing a
therapeutic treatment for a disease, where missteps are potentially very costly given the long-time frames and high
expenses of drug development. With current informatics technology and machine learning algorithms, it is now
possible to computationally discover therapeutic hypotheses by predicting clinically promising drug targets based
on the evidence associating drug targets with disease indications. We have collected this evidence from Open
Targets and additional databases that covers 17 sources of evidence for target-indication association and
represented the data as a tensor of 21,437 × 2211 × 17.
Results: As a proof-of-concept, we identified examples of successes and failures of target-indication pairs in clinical
trials across 875 targets and 574 disease indications to build a gold-standard data set of 6140 known clinical
outcomes. We designed and executed three benchmarking strategies to examine the performance of multiple
machine learning models: Logistic Regression, LASSO, Random Forest, Tensor Factorization and Gradient
Boosting Machine. With 10-fold cross-validation, tensor factorization achieved AUROC = 0.82 ± 0.02 and AUPRC
= 0.71 ± 0.03. Across multiple validation schemes, this was comparable or better than other methods.
Conclusion: In this work, we benchmarked a machine learning technique called tensor factorization for the
problem of predicting clinical outcomes of therapeutic hypotheses. Results have shown that this method can
achieve equal or better prediction performance compared with a variety of baseline models. We demonstrate
one application of the method to predict outcomes of trials on novel indications of approved drug targets.
This work can be expanded to targets and indications that have never been clinically tested and proposing
novel target-indication hypotheses. Our proposed biologically-motivated cross-validation schemes provide
insight into the robustness of the prediction performance. This has significant implications for all future
methods that try to address this seminal problem in drug discovery.
Keywords: Drug target discovery, Clinical trial outcomes, Tensor factorization
Background more than half of clinical trials still fail due to lack of ef-
Drug discovery and development often begin with a drug ficacy, i.e., modulating the target’s activity did not pro-
target, through which the drug exerts its therapeutic vide a statistically significant benefit to patients [1, 2].
effect in patients with a certain disease or clinical condi- Target selection is critical in drug discovery given the
tion (termed as an indication). A target is a broad term long-time frame and high expense of drug development.
which includes many biological entities such as proteins, Often drug targets come from research publication
genes, and RNA, whose modulation (increase or de- where evidence is generated to support a hypothesis that
crease in activity) can provide a therapeutic benefit to a inhibition or activation of a target may result in a thera-
patient. Although selecting an efficacious drug target is peutic effect for a specific disease indication. For
the first and most important step in drug development, example, amyloid precursor protein is a target suggested
for Alzheimer’s Disease (AD). A piece of important evi-
* Correspondence: [email protected] dence to support this hypothesis is that familial AD
1
Computational Biology, GSK R&D, 1250 S. Collegeville Road, UP12-200, patients commonly have genetic mutations in the corre-
Collegeville, PA, USA
Full list of author information is available at the end of the article sponding gene which lead to the production and
© The Author(s). 2019 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and
reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to
the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver
(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Yao et al. BMC Bioinformatics (2019) 20:69 Page 2 of 12
deposition in the brain of increased amounts of amyloid evidence sources. Though Open Targets contains over
beta peptide, a major characteristic of AD [3]. With 2.8 million associations, that is still only 0.08% of the
current informatics technology, it is now possible to possible combinations covered by this data, suggesting
construct online repositories that aggregate existing that a great deal of association evidence (99.92%) is still
knowledge about the association evidence linking poten- to be determined by biomedical researchers and clini-
tial targets with disease indications. Open Targets [4] is cians. Traditional paradigms of machine learning algo-
such a platform that provides drug discovery researchers rithms, learning a mapping from input features
with multiple evidence types including genetic associ- (biological evidence) to output prediction (clinical out-
ation, pathways, animal models and drugs, that connect comes), may be inadequate in this context. We explored
targets with indications for validating potential thera- if tensor factorization is useful in the analysis of this
peutic hypotheses. At the same time, these online know- sparse biological dataset.
ledge repositories are also amenable to computational Tensor extends the concept of a matrix to a multidi-
analysis to discover drug target hypotheses using ma- mensional array where each dimension corresponds to
chine learning. one “axis”, called mode, of a tensor [5]. Data in many
One major challenge in framing this problem from a applications can be naturally organized into a tensor for-
machine learning perspective is that there are very few mat. Figure 1a shows a three-mode tensor representing
positive examples (0.005% of target-indication hypoth- different types of evidence associating targets with dis-
eses included in Open Targets have approved drugs). ease indications and one extra “slice” represents clinical
However, any insights gleaned from the limited number outcomes. Tensor factorization decomposes a tensor
of pursued targets may be useful in delivering new medi- into factor matrices that compactly store information
cines with lower attrition rates. In this paper, we collated encoded in a tensor and integrate interaction across dif-
historical outcomes of clinical trials and determined if ferent modes even when a large portion of entries of a
these clinical outcomes can be predicted retrospectively tensor is missing [5]. This technique has a wide range of
using multiple machine learning models built on existing applications such as in recommendation systems [6],
evidence of the targets’ biological association with indi- knowledge graph systems [7] and multiple biomedical
cations. A successful prediction model provides an un- domains [8].
derstanding of how informative the evidence is for There are several lines of previous work that are re-
clinical success, and is also capable of generating new lated to this paper. One line of work is focused on an-
target-indication hypotheses with a higher potential of swering the question of what makes a good drug target
being developed into successful medicines. by investigating features of targets that are correlated
Another difficulty in building such a model is that not with clinical successes in the context of genetics [9], tis-
all biological evidence is available for every pair of target sue mRNA expression [10], human protein interactome
and indication due to reasons such as technological limi- [11] and publication trends [12]. Another line of work is
tation and limited disease coverage. For example, as of focused on disease gene prediction, where the goal is to
June 2017, Open Targets contained 26,122 targets, 9150 predict genes mechanistically involved in a given disease
diseases with 2,857,732 positive associations from 15 [13–17]. Our work is different from these efforts in that
a b
Fig. 1 Data representation and model benchmark schematic. a Tensor representation of the dataset. The last “slice” matrix represents the clinical
outcomes of target-indication pairs. b Illustration of three schemes of benchmarking models on predicting clinical outcomes. Each matrix represents the
clinical outcomes of targets (rows) and indications (columns). Grey and green cells are target-indications pairs used for training and testing, respectively.
Blank cells represent unknown clinical outcomes of target-indication pairs
Yao et al. BMC Bioinformatics (2019) 20:69 Page 3 of 12
we leverage a novel computational method to integrate Table 2 Six sources of target-only categorical attributes
multiple evidence types and directly assess the models’ Attribute Type Sources (# of categories)
performance of predicting clinical outcomes of drug tar- Mutation Tolerance ExAc_LoF(3), ExAc_Missense (3), RVIS (3), Mouse
get hypotheses. Protein identity (2)
In the following sections, we first describe the dataset Other Target Target Location (7), Target Topology (5)
we have collected and then introduce the basic formula- Characteristics
tion of matrix factorization, a special form of tensor Genes were broken into non-overlapping categories based on available data.
Genes were classified as tolerant, intolerant and unclassified based on data
factorization. Then we explain our selection of a specific from the Exome Aggregation Consortium [51] and the percentile rank of
algorithm of tensor factorization based on characteristics Residual Variation Intolerance Score [52]. Genes were based on the
of our data. We then discuss how we design experiments identification of > = 75% protein homology between human and mouse, data
downloaded from BioMart [53]. Target Location and Topology were derived
to benchmark the method against a series of baseline from a review of information from Gene Ontology, InterPro, PFAM,
models under three scenarios of drug discovery. We dem- and UniProt
onstrate that the model can capture known biological
mechanisms of human diseases and can identify oppor- animal models manifest phenotypes that are concord-
tunities of approved drug targets to novel indications. ant with the human disease?). Besides these 17 associ-
ation evidence sources, we also collected information
Methods about properties of targets from six sources (Table 2)
Data collection and processing as previous studies have found that successful
We created a dataset which combined clinical out- FDA-approved drugs are enriched with targets with
comes from the commercial database Pharmaprojects properties that are independent of disease indications
[18] with evidence from Open Targets [4] and other [19, 20]. The collected evidence covers the data space
sources (Tables 1, 2). In total, we collected 17 associ- of 21,437 targets, 2211 indications, and 17 evidence
ation evidence sources that connect potential targets sources.
with disease indications. These 17 association evi- For clinical outcome data, if at least one drug asset for
dence sources can be further grouped into seven evi- a given target-indication pair was identified as success-
dence types: Genetics (is germline mutation in the ful, then the target-indication pair was classified as Suc-
target associated with the disease?), somatic mutations ceeded. Of the remaining target-indication pairs, if at
(is somatic mutation in the target associated with the least one asset had a clinical failure then it was classified
disease, typically cancer?), pathways (is the target part as a Clinical Failure. Open Targets presents evidence
of a pathway involved in the disease?), mRNA disease from each individual source as a numerical value for a
expression (does the target’s expression significantly target-indication pair, with a positive value representing
change in the disease?), mRNA tissue overexpression the strength of evidence. To simplify the further colla-
(is the target’s expression overexpressed in tion of target-indication evidence with target-only attri-
disease-related tissues?), literature (is there association butes (Table 2), we converted numerical evidence value
between the target and the indication identified into binary values: 1 indicates a positive association, 0
through text mining of scientific literature?) and ani- means that there is no association and unknown evi-
mal models (does the knockout of the target in dence is represented as null. We encoded categorical
data, typically present in target-only attributes, as mul-
Table 1 17 sources of target-indication evidence tiple binary values with each category converted into a
Evidence Type Sources
binary value, i.e., having the property or not having the
property. Here, we analyzed data mapped to the 574
Animal models Phenodigm
non-cancer indications (a subset of 2211 indications)
Genetics European Variant Archive, Uniprot, Uniprot
with at least one clinical outcome and the corresponding
literature, GWAS catalog, STOPGAP [9]
875 targets (a subset of 21,437 targets). Oncology indica-
Somatic mutation Cancer Gene Census, European Variant Archive
somatic tions were excluded, as studies have observed that fea-
tures of successful targets for cancer differ from features
Literature Europe PMC, TERMITE
of successful targets for other indications [21, 22], more-
mRNA disease Expression Atlas, Internal expression data
expression
over, cancer trials fail more frequently than trials for
other indications. [23]
mRNA tissue GeneLogic, GTEx [49], Human Protein Atlas [50]
overexpression
Matrix factorization
Pathways Reactome, Metabase
The clinical outcomes of existing target-indication pairs
Evidence data were obtained from Open Targets [1] except for TERMITE:
www.scibite.com/products/termite; GeneLogic: GeneLogic Division, Ocimum
can be represented in a matrix format as R ∈ ℝM × N,
Biosolutions, Inc., Internal expression data, and those explicitly referenced where the M rows represent targets and the N columns
Yao et al. BMC Bioinformatics (2019) 20:69 Page 4 of 12
represent indications. Rij = 1 if there is at least one drug Given these three aspects, we investigated a method
that modulates target i and is marketed for indication j. based on tensor factorization, called Macau, that is cap-
Rij = 0 if all the drugs modulating target i are reported able of naturally handling all the three aspects in a uni-
failed for indication j in the clinic (from Phase I to Phase fied Bayesian framework and was originally used to
III). For target-indication pairs that have no outcomes in predict drug-protein interaction [27]. Tensor extends
the clinic, the corresponding Rij is empty. The goal is to the matrix concept to a multidimensional array, where
predict clinical outcomes for all possible pairs of targets each dimension corresponds to one mode of a tensor.
and indications i.e. fill out the empty Rij′s. Thus, we can Our data can be organized into a three-mode tensor: tar-
treat the problem as completing the target-indication get × indication × evidence T ∈RMNK ;where one entry
matrix of clinical outcomes. Matrix completion problem tijk indicates the association score in kth evidence be-
has been widely studied in the machine learning com- tween target i and indication j and one “slice” of the ten-
munity in the context of recommendation systems [6, sor corresponding to one evidence source organized as a
24]. A famous application is Netflix’s movie recommen- matrix. M, N, K are the number of targets, indications
dation system, where each user has ratings on a small and evidence sources, respectively. To predict clinical
number of movies and the task is to recommend movies outcomes, we appended the clinical outcome matrix R
for each user based on existing ratings of other users as one extra “slice” to the evidence tensor (Fig. 1a) and
with similar patterns of movie ratings. Matrix factorized the resulting tensor X ∈RMNðK þ1Þ . Similar
factorization is recognized as one of the more successful to matrix factorization, tensor factorization decomposes
methods for this task [6, 25, 26]. The method assumes a tensor into a series of low-dimensional latent factor
that the true completed matrix is of low rank and can be matrices where each matrix represents one mode of the
approximated by a product of two low-dimensional la- tensor. One direct way to decompose a three-mode ten-
tent factor matrices that represent rows and columns of sor is to assume that each entry xijk can be expressed as
a matrix in a joint D-dimensional latent space, i.e. R ≈ the sum of the elementwise product of three
UT V, where U ¼ fui gM i¼1 ∈R
DM
, V ¼ fv j gNj¼1 ∈RDN and low-dimensional vectors: ui, vj.and ek , representing ith
ui ∈ ℝ , vj ∈ ℝ are column vectors of U and V, respect-
D D target, jth indication and kth evidence (including the clin-
ively. The predicted entries in Rij is achieved by the ical outcomes), respectively in a joint latent factor space,
P
inner product of ui and vj. Learning of U and V can be i.e. xijk ≈ D d¼1 udi vdj edk , where D is the dimensionality of
formulated as an optimization problem by minimizing the latent factors. The latent factors can be further orga-
the mean squared error between observed and predicted
nized in three factor matrices: U ¼ fui gM
i¼1 ∈ℝ
DM
, V
entries. To avoid overfitting, regularization on the latent D
þ1
factor matrices is added to the minimization problem ¼ fv j gNj¼1 ∈ℝ DN , E ¼ fek gKk¼1 ∈ℝ DðK þ1Þ and ui ∈ ℝ ,
D D
that can be solved by methods such as stochastic gradi- vj ∈ ℝ ,ek ∈ ℝ are column vectors of U, V and E, re-
ent descent and alternating least square [6]. spectively. Here the eK + 1 column of E corresponds with
the latent factor of the clinical outcome. The prediction
of any entry xijk of the tensor can be achieved by the
Bayesian tensor factorization
sum of the elementwise product of the low-dimensional
Many matrix-factorization based methods have been
vectors of target ui, indication vjand evidence ek. Since
proposed for recommendation systems. To choose an
the factorized tensor included the clinical outcome
appropriate method to predict clinical outcomes, we
matrix, we can use the low-dimensional vector corre-
considered three aspects of our problem. First, some of
sponding to the clinical outcome to perform prediction,
the evidence is target-indication specific such as human
i.e. the predicted outcome of Tmodulating target i for the
genetic evidence for each disease, and this has been sug-
treatment of indication j is 1 (ui ∘ vj ∘ eK + 1), where 1 is
gested as related to clinical outcome [9]. Second, in our
an all one vector and ∘ is the elementwise product.
data, there are several target-only attributes independent
The specified tensor factorization method we chose is
of indications, such as target protein location, tolerance
based on Bayesian probabilistic modeling, which as-
of mutation. Thus, the chosen method should also take
sumes each observed cell of the tensor X is a random
target-only information into consideration. Third, in
variable following a normal distribution, xijk N ð1T ðui
drug discovery, it is not uncommon that targets or indi-
cations that have never been tested in clinical trials. In ∘v j ∘ek Þ; α−1 Þ, where α is the precison of the normal dis-
the case of movie recommendation systems, this corre- tribution. In this model, the mean of the normal distri-
sponds to recommending movies to users who have not bution is determined by the three low-dimensional
rated any movies in the system or recommending new latent factors: ui, vj.and ek. Each latent factor is assumed
movies that do not have any ratings in the system. The to have a Gaussian prior with a Gaussian-Wishart hyper
chosen method should be able to handle this situation. prior placed on its hyperparameters: ui N ðμtarget þ ΒT
Yao et al. BMC Bioinformatics (2019) 20:69 Page 5 of 12
a model ranking a randomly chosen positive example family are split across training and test set, though
higher than a randomly chosen negative example and is the same drug may bind to multiple members in each
commonly used in assessing the performance of models set. This leads to an overestimate of prediction accur-
for binary classification tasks. AUROC treats positive and acy for truly independent and novel targets. A similar
negative examples equally, this metric is of limited value effect may relate indications with different subtypes,
when the number of positive examples is relatively low. as drug targets are often tried against many closely
Given the low success rate in drug development, we chose related diseases.
AUPRC as the primary evaluation metric as it focuses on To mitigate this problem and obtain an unbiased esti-
the performance of positive examples. Here the precision mation of predictive capacity, we designed two bench-
is the proportion of correctly predicted positives out of all mark experiments, where a group of similar targets
predicted positives and recall is the proportion of correctly (Fig. 1b, panels 2 and 3) or indications is held out as a
predicted positives out of all positives. test set and models trained on the other target or indi-
cation groups, respectively, are evaluated against the
Results held-out set. We categorized targets into ten target
Model benchmark results classes largely derived from the ChEMBL hierarchical
We performed a standard cross-validation experiment target classification system [30], and grouped indica-
to benchmark various types of machine learning tions into eight clusters that are based on MeSH hier-
models (Fig. 1b, panel 1). The best model is the archy and co-occurrence frequency in the literature
matrix factorization (MF) model (AUROC = 0.83 ± 0.02, (see Additional file 1). In the leave-one-target-class-out
AUPRC = 0.77 ± 0.02) (Fig. 2), which only factorizes the cross-validation experiment (Fig. 2), the performance of
clinical outcomes matrix without considering any other MF decreases dramatically as there is no information in
evidence in the dataset. Due to the highly-correlated the training set to predict the clinical outcomes of the
structure within the clinical outcomes of target-indication held-out target class. All the other methods perform
pairs, the standard way of randomly splitting them into similarly and the overall performance is not as good as
training and test sets may overestimate the predictability in the standard cross-validation setting. This implies
of clinical outcomes. This may explain the high perform- that it is difficult to predict candidate indications for
ance of MF; knowing which targets have succeeded targets that have not been assessed in clinical trials. In
against which indications in the training data may provide the leave one disease cluster out validation experiment,
enough information to predict the outcome status of new the performance of MF again dropped below that of the
indications for these targets. Many drug targets are from other methods as there is no information about clinical
the same gene family, and in a random training-test outcomes of the held-out disease clusters in the train-
split, it is likely that targets within the same gene ing step.
Fig. 2 Benchmark performance of models. Prediction performance comparison in three benchmark schemes in terms of Area Under Receiving
Operation Curve (AUROC, Top) and Area Under Precision Recall Curve (AUPRC, Bottom). Error bars are calculated from cross-validation (LR:
Logistic Regression; GBM: Gradient Boosting Machine; RF: Random Forest; MF: Matrix Factorization; BTF: Bayesian Tensor Factorization)
Yao et al. BMC Bioinformatics (2019) 20:69 Page 7 of 12
However, the Bayesian tensor factorization (BTF) LASSO). Although MF performed relatively well in the
model scored as the best model in the disease group standard validation case, its performance was inconsist-
benchmark (AUROC = 0.73 ± 0.05, AUPRC = 0.58 ± 0.09) ent among validation settings. BTF combined both evi-
and the second to best model in standard dence and inter-relationship among targets and
cross-validation (AUROC = 0.82 ± 0.02, AUPRC = 0.71 ± indications and performed consistently well in all three
0.03). It is counter-intuitive that BTF does not validation scenarios. In addition to AUROC and AUPRC,
out-perform the MF method in the standard we also evaluated performance using F-score, preci-
cross-validation case, as it incorporated more data. MF sion@30, and recall@30 (see Additional file 4), but the
approach may be taking maximum advantage of the comparison across methods was not affected.
highly-related nature of the outcomes, given the poor
performance of MF in the target class and disease group Leave one out experiments
benchmarks. MF also only needs to learn latent factors One advantage of this leave one target/disease group out
to explain the clinical outcomes, while BTF needs to validation scheme is that we can assess how trained
learn latent factors to explain the clinical outcomes and models can be generalized to groups of targets/diseases
all the evidence as well, which is inherently a more diffi- that the models have never trained on before. Figure 3
cult task. shows the prediction performance of the six models on
In general, the performance of models that explicitly the held-out target classes (Fig. 3a) and disease clusters
use evidence as predictors did not vary too much across (Fig. 3b). In the leave-one-target-class-out case, the pre-
three validation settings. Among these models, diction performance averaged over the six models varies
ensemble-based methods (RF and GBM) worked slightly between target classes (AUPRC ranges from 0.24 to 0.58;
better than linear model-based methods (LR and AUROC ranges from 0.53 to 0.68). Specifically, we
Fig. 3 Benchmark performance of leave one out experiments. Model performance on predicting clinical outcomes of target classes (a) and
disease clusters (b) in the leave-one-out experiments in terms of Area Under Receiving Operation Curve (AUROC, x-axis) and Area Under Precision
Recall Curve (AUPRC, y-axis). 95% confidence interval is calculated using 1000 bootstraps. Dotted lines mark the AUROC (vertical) and AUPRC
(horizontal) of a random guess, which is 0.5 and the fraction of positives in the testing set, respectively. The percentage of target-indication pairs
in each held-out set is listed after the pipe symbol (|) in the titles. (LR: Logistic Regression; GBM: Gradient Boosting Machine; RF: Random Forest;
MF: Matrix Factorization; BTF: Bayesian Tensor Factorization)
Yao et al. BMC Bioinformatics (2019) 20:69 Page 8 of 12
notice that the models perform consistently poorly for factors can capture inter-relationship among indica-
transcriptional factor targets and miscellaneous en- tions. t-SNE is a dimension reduction technique used
zymes, which implies that these target classes are quite to visualize high-dimensional dataset where similar
different from the other target classes. On the other points in high dimensional space are transformed to
hand, most models perform relatively well in protease neighboring points in a low dimensional space and
targets. We note the performance is consistent among dissimilar points are transformed to distant points in
models within each target class, but this low variability the low dimensional embedding. Figure 4a shows the
is not repeated in the leave-one-disease-cluster-out case, two-dimensional t-SNE embedding of the 574 indica-
where the prediction performance shows higher variabil- tions with at least one clinical outcome, where three
ity among disease clusters. For example, the BTF model distinct clusters are present on the map. We further
performs better than the other models in the metabolic, checked the MeSH annotations of the diseases in
GI and urologic and oral disease clusters, and performs each cluster and found that the three clusters are
as well as any other model in the other disease clusters. enriched with three distinct disease categories includ-
ing Central nervous system diseases, Digestive system
Latent factors capture disease relationship within three diseases, and Hemic & lymphatic diseases, respect-
disease areas ively. Interestingly, auto-immune diseases, such as
After benchmarking the performance of the BTF rheumatoid arthritis, asthma, psoriasis and Crohn’s
model in the cross-validation experiments, we fitted disease that manifest in different organs are localized
the model to the whole dataset. We chose 11 latent in the same neighboring area on the map. This im-
factors (see Additional file 1). Before using the fitted plies that latent factors are able to capture the intrin-
BTF model to make any predictions, we explored sic relationships of diseases within these disease areas.
whether the latent factors learned from the model are For the rest of the diseases, we did not observe dis-
biologically meaningful so that we can increase our tinct clustering patterns using t-SNE. This could
trust in the prediction made by the model. To do so, either be because the latent factors are not rich
we reduced the 11 latent factors to two dimensions enough to capture the relationship among these dis-
using t-SNE [35] to visualize how indications are dis- eases or these diseases are inherently interconnected
tributed and examined whether the learned latent by sharing similar pathological mechanisms [36].
a b
Fig. 4 Validation of BTF model prediction. a t-SNE visualization of indications based on the latent factors learned in BTF model. Each dot
represents one indication and the size of the dot is proportional to the number of targets that have been clinically failed. The inserted pie charts
show diseases composition of representative clusters of indications in the 2D visualization. 2D embedding was obtained by using perplexity = 30
in t-SNE and the visualization is consistent using different perplexity values in the range from 10 to 50. b BTF prediction scores of target-
indication pairs in Phase I-III clinical trials. The numbers are P-values (Wilcoxon rank sum tests) from comparing prediction scores of target-
indication pairs between any two phases
Yao et al. BMC Bioinformatics (2019) 20:69 Page 9 of 12
Prediction scores of target-indication pairs under clinical for IL6, a cytokine with a wide variety of biological func-
trials tions. It induces the acute phase response and is in-
To validate the prediction made by the BTF model, we volved in the final differentiation of B-cells into
chose 1246 novel target-indication pairs that were in Ig-secreting cells in lymphocyte and monocyte differen-
clinical trials (Phase I-III) at the time when we collected tiation. It acts on B-cells, T-cells, hepatocytes,
the data (May 2016), and thus did not have clinical out- hematopoietic progenitor cells and is required for the
come readouts. We compared the prediction scores gen- generation of T(H)17 cells. It also acts as a myokine and
erated by the BTF model on these target-indication pairs is discharged into the bloodstream after muscle contrac-
and noticed that the prediction scores of later phase tion and acts to increase the breakdown of fats and to
pairs are significantly higher than those of earlier phase improve insulin resistance [39]. Genetic polymorphism
pairs (Fig. 4b), which recapitulates the observation that of IL6 has been shown to be significantly associated with
drugs in later phases on average have a higher likelihood a form of psoriatic arthritis [40], and serum IL6 is con-
of approval [37]. Since we did not include phase infor- sidered as a biomarker for assessing disease activity in
mation of these target-indication pairs when training the patients with psoriasis, as well as for predicting respon-
model, these pairs serve as an independent test set and siveness of joint symptoms to biologic treatment [41].
the results increase our confidence in the predictions of Another target of interest is angiotensin II receptor type
the model. 1 (AGTR1), an important effector controlling blood pres-
Next, we conducted a literature search on the top 63 sure and volume in the cardiovascular system. It has been
hypotheses of the 1246 pairs based on a prediction score approved for many cardiovascular indications such as
threshold, which corresponds with 0.8 precision and heart failure, myocardial infarction, and hypertension. The
0.27 recall in the standard cross-validation experiment. predicted indication for AGTR1 is hypercholesterolemia,
We list 15 of these 63 hypotheses along with a relevant also known as high cholesterol. AGTR1 antagonism im-
literature reference in Table 3; the complete list of 63 proves hypercholesterolemia-associated endothelial dys-
can be found in Additional file 1. function [42] and attenuates the inflammatory and
As an example, interleukin 6 (IL6) is an approved drug thrombogenic responses to hypercholesterolemia in ve-
target for giant lymph node hyperplasia (Table 3). Our nules [43]. Significant association of AGTR1 polymorph-
results suggest that the current trials for psoriatic arth- ism with hypercholesterolemia was also observed in
ritis, which includes a Phase IIb trial of a monoclonal hypertension patients [44].
antibody against this protein [38], have a greater than
random chance of success. Psoriatic arthritis is chronic Discussion
inflammatory arthritis that is associated with psoriasis In this paper, we focused on the problem of predicting
and thus somewhat related to the successful indication clinically promising therapeutic hypotheses using
associative knowledge of targets and indications. We existing association evidence between drug targets and
compared tensor factorization with other traditional ma- disease indications. We illustrate that the method can
chine learning methods in a variety of benchmarking ex- achieve equal or better prediction performance com-
periments and identified two interesting findings from pared with a variety of baseline models across three sce-
the evaluation of this method: 1) the latent factors narios of drug discovery, and the learned model can
learned from the model align with known biological re- capture the known biological mechanism of human dis-
lationships among three human disease areas, and 2) the eases. Furthermore, we demonstrated an application of
method can be applied to different scenarios of drug dis- the method to predict outcomes of trials on novel indi-
covery and achieves competitive prediction performance. cations of approved drug targets. Future work includes
However, there are some limitations worth discussing expanding this method to targets and indications that
before deploying tensor factorization to propose novel previously have never been clinically tested and propos-
target-indication hypotheses. First, the model relies on ing novel target-indication hypotheses that can be devel-
the available compilation of evidence sources. Open Tar- oped into medicines with predicted high probabilities of
gets provided us with a good foundation, but clearly, success.
more sources could be gathered. Second, we treated
every clinical failure equally. Our preliminary analysis Additional files
has shown that some target-indications pairs have been
tried multiple times and are still being pursued clinically, Additional file 1: Supplementary material. (PDF 490 kb)
while some failed only once and were never tested again. Additional file 2: Target-indication association evidence data file.
Although the probabilistic framework of the model can (ZIP 227 kb)
potentially mitigate this problem, the model does not ex- Additional file 3: Target-only attribute data file. (ZIP 5 kb)
plicitly differentiate definitive failures from those that Additional file 4: Results of benchmark experiments. (XLSX 11 kb)
Publisher’s Note 24. Ma H, Yang H, Lyu MR, King I: Sorec: social recommendation using
Springer Nature remains neutral with regard to jurisdictional claims in probabilistic matrix factorization. In: Proceedings of the 17th ACM
published maps and institutional affiliations. conference on information and knowledge management: 2008. ACM: 931–
940.
Author details 25. Mnih A, Salakhutdinov RR. Probabilistic matrix factorization. In: Advances in
1
Computational Biology, GSK R&D, 1250 S. Collegeville Road, UP12-200, neural information processing systems; 2008. p. 1257–64.
Collegeville, PA, USA. 2Genetics, GSK R&D, 1250 S. Collegeville Road, 26. Salakhutdinov R, Mnih A: Bayesian probabilistic matrix factorization using
UP12-200, Collegeville, PA, USA. Markov chain Monte Carlo. In: Proceedings of the 25th international
conference on machine learning: 2008. ACM: 880–887.
Received: 26 September 2018 Accepted: 30 January 2019 27. Simm J, Arany A, Zakeri P, Haber T, Wegner JK, Chupakhin V, Ceulemans H,
Moreau Y: Macau: scalable bayesian multi-relational factorization with side
information using MCMC. arXiv preprint arXiv:150904610 2015.
28. Julia implementation of Bayesian tensor factorization algorithm [https://
References
github.com/jaak-s/BayesianDataFusion.jl].
1. Arrowsmith J, Miller P. Trial watch: phase II and phase III attrition rates 2011-
29. Gelman A, Carlin JB, Stern HS, Dunson DB, Vehtari A, Rubin DB. Bayesian
2012. Nat Rev Drug Discov. 2013;12(8):569.
Data Analysis, vol. 2. Boca Raton, FL: CRC press; 2014.
2. Cook D, Brown D, Alexander R, March R, Morgan P, Satterthwaite G,
30. Bento AP, Gaulton A, Hersey A, Bellis LJ, Chambers J, Davies M, Krüger FA,
Pangalos MN. Lessons learned from the fate of AstraZeneca's drug pipeline:
Light Y, Mak L, McGlinchey S. The ChEMBL bioactivity database: an update.
a five-dimensional framework. Nat Rev Drug Discov. 2014;13(6):419–31.
Nucleic Acids Res. 2014;42(D1):D1083–90.
3. Bertram L, Tanzi RE. Thirty years of Alzheimer's disease genetics: the
31. Friedman J, Hastie T, Tibshirani R. Regularization paths for generalized linear
implications of systematic meta-analyses. Nat Rev Neurosci. 2008;9:768.
models via coordinate descent. J Stat Softw. 2010;33(1):1.
4. Koscielny G, An P, Carvalho-Silva D, Cham JA, Fumis L, Gasparyan R, Hasan
32. Friedman JH. Greedy function approximation: a gradient boosting machine.
S, Karamanis N, Maguire M, Papa E. Open targets: a platform for therapeutic
Ann Stat. 2001:1189–232.
target identification and validation. Nucleic Acids Res. 2016;45(D1):D985–94.
33. Chen T, Guestrin C: Xgboost: A scalable tree boosting system. In:
5. Kolda TG, Bader BW. Tensor decompositions and applications. SIAM Rev.
Proceedings of the 22nd ACM SIGKDD international conference on
2009;51(3):455–500.
knowledge discovery and data mining: 2016. ACM: 785–794.
6. Koren Y, Bell R, Volinsky C. Matrix factorization techniques for recommender
systems. Computer. 2009;42(8). 34. Mazumder R, Hastie T, Tibshirani R. Spectral regularization algorithms for
7. Nickel M, Murphy K, Tresp V, Gabrilovich E. A review of relational machine learning large incomplete matrices. J Mach Learn Res. 2010;(11):2287–322.
learning for knowledge graphs. Proc IEEE. 2016;104(1):11–33. 35. Maaten Lvd HG. Visualizing data using t-SNE. J Mach Learn Res. 2008;9(Nov):
8. Luo Y, Wang F, Szolovits P. Tensor factorization toward precision medicine. 2579–605.
Brief Bioinform. 2017;18(3):511–4. 36. Menche J, Sharma A, Kitsak M, Ghiassian SD, Vidal M, Loscalzo J, Barabási A-
9. Nelson MR, Tipney H, Painter JL, Shen J, Nicoletti P, Shen Y, Floratos A, L. Uncovering disease-disease relationships through the incomplete
Sham PC, Li MJ, Wang J. The support of human genetic evidence for interactome. Science. 2015;347(6224):1257601.
approved drug indications. Nat Genet. 2015;47(8):856. 37. Hay M, Thomas DW, Craighead JL, Economides C, Rosenthal J. Clinical
10. Rouillard AD, Hurle MR, Agarwal P. Systematic interrogation of diverse Omic development success rates for investigational drugs. Nat Biotechnol. 2014;
data reveals interpretable, robust, and generalizable transcriptomic features 32:40.
of clinically successful therapeutic targets. PLoS Comput Biol. 2018;14(5): 38. Mease PJ, Gottlieb AB, Berman A, Drescher E, Xing J, Wong R, Banerjee S.
e1006142. The efficacy and safety of clazakizumab, an anti–interleukin-6 monoclonal
11. Sun J, Zhu K, Zheng WJ, Xu H. A comparative study of disease genes and antibody, in a phase IIb study of adults with active psoriatic arthritis.
drug targets in the human protein interactome. BMC Bioinformatics. 2015; Arthritis Rheumatol. 2016;68(9):2163–73.
16(5):S1. 39. Uniprot entry of IL6 [http://www.uniprot.org/uniprot/P05231].
12. Heinemann F, Huber T, Meisel C, Bundschus M, Leser U. Reflection of 40. Cubino N, Montilla C, Usategui-Martín R, Cieza-Borrela C, Carranco T, Calero-
successful anticancer drug development processes in the literature. Drug Paniagua I, Quesada A, Cañete J, Queiro R, Sánchez M. Association of IL1Β
Discov Today. 2016;21(11):1740–4. (−511 a/C) and IL6 (−174 G> C) polymorphisms with higher disease activity
13. Moreau Y, Tranchevent L-C. Computational tools for prioritizing candidate and clinical pattern of psoriatic arthritis. Clin Rheumatol. 2016;35(7):1789–94.
genes: boosting disease gene discovery. Nat Rev Genet. 2012;13:523. 41. Muramatsu S, Kubo R, Nishida E, Morita A. Serum interleukin-6 levels in
14. Ghiassian SD, Menche J, Barabási A-L. A DIseAse MOdule detection response to biologic treatment in patients with psoriasis. Mod Rheumatol.
(DIAMOnD) algorithm derived from a systematic analysis of connectivity 2017;27(1):137–41.
patterns of disease proteins in the human Interactome. PLoS Comput Biol. 42. Wassmann S, Hilgers S, Laufs U, Böhm M, Nickenig G. Angiotensin II type 1
2015;11(4):e1004120. receptor antagonism improves hypercholesterolemia-associated endothelial
15. Carson MB, Lu H. Network-based prediction and knowledge mining of dysfunction. Arterioscler Thromb Vasc Biol. 2002;22(7):1208–12.
disease genes. BMC Med Genet. 2015;8(2):S9. 43. Petnehazy T, Stokes KY, Russell JM, Granger DN. Angiotensin II type-1
16. Yang P, Li X, Chua H-N, Kwoh C-K, Ng S-K. Ensemble positive unlabeled receptor antagonism attenuates the inflammatory and thrombogenic
learning for disease gene identification. PLoS One. 2014;9(5):e97079. responses to hypercholesterolemia in venules. Hypertension. 2005;45(2):
17. Chen C, Tong H, Xie L, Ying L, He Q: FASCINATE: fast cross-layer 209–15.
dependency inference on multi-layered networks. In: Proceedings of the 44. Morisawa T, Kishimoto Y, Kitano M, Kawasaki H, Hasegawa J. Influence of
22nd ACM SIGKDD International Conference on Knowledge Discovery and angiotensin II type 1 receptor polymorphism on hypertension in patients
Data Mining; San Francisco, California, USA. 2939784: ACM 2016: 765–774. with hypercholesterolemia. Clin Chim Acta. 2001;304(1):91–7.
18. Pharmaprojects Database [https://citeline.com/products/pharmaprojects]. 45. Costa PR, Acencio ML, Lemke N: A machine learning approach for genome-
19. Yao L, Rzhetsky A. Quantitative systems-level determinants of human genes wide prediction of morbid and druggable human genes based on systems-
targeted by successful drugs. Genome Res. 2008;18(2):206–13. level data. In: BMC Genomics: 2010. BioMed Central: S9.
20. Bull SC, Doig AJ. Properties of protein drug target classes. PLoS One. 2015; 46. Yang P, Li X-L, Mei J-P, Kwoh C-K, Ng S-K. Positive-unlabeled learning for
10(3):e0117955. disease gene identification. Bioinformatics. 2012;28(20):2640–7.
21. Mitsopoulos C, Schierz AC, Workman P, Al-Lazikani B. Distinctive behaviors 47. Emig D, Ivliev A, Pustovalova O, Lancashire L, Bureeva S, Nikolsky Y,
of Druggable proteins in cellular networks. PLoS Comput Biol. 2015;11(12): Bessarabova M. Drug target prediction and repositioning using an
e1004597. integrated network-based approach. PLoS One. 2013;8(4):e60618.
22. Mora A, Donaldson IM. Effects of protein interaction data integration, 48. Barabási A-L, Gulbahce N, Loscalzo J. Network medicine: a network-based
representation and reliability on the use of network properties for drug approach to human disease. Nat Rev Genet. 2010, 12:56.
target prediction. BMC Bioinformatics. 2012;13(1):294. 49. Lonsdale J, Thomas J, Salvatore M, Phillips R, Lo E, Shad S, Hasz R, Walters G,
23. Harrison RK. Phase II and phase III failures: 2013–2015. Nat Rev Drug Discov. Garcia F, Young N, et al. The genotype-tissue expression (GTEx) project. Nat
2016;15:817. Genet. 2013;45:580.
Yao et al. BMC Bioinformatics (2019) 20:69 Page 12 of 12