HHS Public Access
Author manuscript
KDD. Author manuscript; available in PMC 2018 February 08.
Published in final edited form as:
KDD. 2017 August ; 2017: 1633–1642. doi:10.1145/3097983.3098057.
MOLIERE: Automatic Biomedical Hypothesis Generation System
Justin Sybrandt1, Michael Shtutman2, and Ilya Safro1
1Clemson University, School of Computing, Clemson SC, USA
2University of South Carolina, Drug Discovery and Biomedical Sciences, Columbia SC, USA
Abstract
Hypothesis generation is becoming a crucial time-saving technique which allows biomedical
researchers to quickly discover implicit connections between important concepts. Typically, these
systems operate on domain-specific fractions of public medical data. MOLIERE, in contrast, utilizes
information from over 24.5 million documents. At the heart of our approach lies a multi-modal
and multi-relational network of biomedical objects extracted from several heterogeneous datasets
from the National Center for Biotechnology Information (NCBI). These objects include but are not
limited to scientific papers, keywords, genes, proteins, diseases, and diagnoses. We model
hypotheses using Latent Dirichlet Allocation applied on abstracts found near shortest paths
discovered within this network, and demonstrate the effectiveness of MOLIERE by performing
hypothesis generation on historical data. Our network, implementation, and resulting data are all
publicly available for the broad scientific community.
1 Introduction
Vast amounts of biomedical information accumulate in modern databases such as
MEDLINE [3], which currently contains the bibliographic data of over 24.5 million medical
papers. These ever-growing datasets impose a great difficulty on researchers trying to survey
and evaluate new information in the existing biomedical literature, even when advanced
ranking methods are applied. On the one hand, the vast quantity and diversity of available
data has inspired many scientific breakthroughs. On the other hand, as the set of searchable
information continues to grow, it becomes impossible for human researchers to query and
understand all of the data relevant to a domain of interest.
In 1986 Swanson hypothesized that novel discoveries could be found by carefully studying
the existing body of scientific research [45]. Since then, many groups have attempted to
mine the wealth of public knowledge. Efforts such as Swanson’s own Arrowsmith generate
hypotheses by finding concepts which implicitly link two queried keywords. His method and
others are discussed at length in Section 1.3. Ideally, an effective hypothesis generation
system greatly increases the productivity of researchers. For example, imagine that a
medical doctor believed that stem cells could be used to repair the damaged neural pathways
of stroke victims (as some did in 2014 [22]). If no existing research directly linked stem
cells to stroke victims, this doctor would typically have no choice but to follow his/her
intuition. Hypothesis generation allows this researcher to quickly learn the likelihood of
such a connection by simply running a query. Our hypothetical doctor may query the topics
stem cells and stroke for example. If the system returned topics such as paralysis then not
only would the doctor’s intuition be validated, but he/she would be more likely to invest in
exploring such a connection. In this manner, an intelligent hypothesis generation system can
increase the likelihood that a researcher’s study yields usable new findings.
1.1 Our Contribution
We introduce a deployed system, MOLIERE [47], with the goal of generating more usable
results than previously proposed hypothesis generation systems. We develop a novel method
for constructing a large network of public knowledge and devise a query process which
produces human readable text highlighting the relationships present between nodes.
To the best of our knowledge, MOLIERE is the first hypothesis generation system to utilize
the entire MEDLINE data set. By using state-of-the-art tools, such as ToPMine [16] and
FastText [9], we are able to find novel hypotheses without restricting the domain of our
knowledge network or the resulting vocabulary when creating topics. As a result, MOLIERE
is more generalized and yet still capable of identifying useful hypotheses.
We provide our network and findings online for others in the scientific community [47].
Additionally, to aid interested biomedical researchers, we supply an online service where
users can request specific query results at http://jsybran.people.clemson.edu/mForm.php.
Furthermore, MOLIERE is entirely open-source in order to facilitate similar projects. See
https://github.com/JSybrandt/MOLIERE for the code needed to generate and query the
MOLIERE knowledge network.
In the following paper we describe our process for creating and querying a large knowledge
network built from MEDLINE and other NCBI data sources. We use natural language
processing methods, such as Latent Dirichlet Allocation (LDA) [8] and topical phrase
mining [16], along with other data mining techniques to conceptually link together abstracts
and biomedical objects (such as biomedical keywords and n-grams) in order to form our
network. Using this network we can run shortest path queries to discover a pathway between
two concepts which are non-trivially connected. We then find clouds of documents around
these pathways which contain knowledge representative of the path as a whole. PLDA+, a
scalable implementation of LDA [28], allows us to quickly find topic models in these clouds.
Unlike similar systems, we do not restrict PLDA+ to any set vocabulary. Instead, by using
topical phrase mining, we identify meaningful n-grams in order to improve the performance,
flexibility, and understandability of our LDA models. These models result in both
quantitative and qualitative connections which human researchers can use to inform their
decision making.
We evaluate our system by running queries on historical data in order to discover landmark
findings. For example, using data published on or before 2009, we find strong evidence that
the protein Dead Box RNA Helicase 3 (DDX3) can be applied to treat cancer. We also verify
the ability of MOLIERE to make predictions similar to previous systems with restricted LDA
[49].
1.2 Our Method in Summary
We focus on the domain of medicine because of the large wealth of public information
provided by the National Library of Medicine (NLM). MEDLINE is a database containing
over 24.5 million references to medical publications dating all the way back to the late 1800s
[3]. Over 23 million of these references include the paper’s title and abstract text. In addition
to MEDLINE, the NLM also maintains the Unified Medical Language System (UMLS)
which is comprised of three main resources: the metathesaurus, the semantic network, and
the SPECIALIST natural language processing (NLP) tools. These resources, along with the
rest of our data, are described in section 2.1.
Our knowledge base starts as XML files provided by MEDLINE, from which we extract
each publication’s title, document ID, and abstract text. We first process these results with
the SPECIALIST NLP toolset. The result is a corpus of text which has standardized
spellings (for example “colour” becomes “color”), no stop words (including medical specific
stop words such as Not Otherwise Specified (NOS)), and other characteristics which
improve later algorithms on this corpus. Then we use ToPMine to identify multi-word
phrases from that corpus such as “asthma attack,” allowing us to treat phrases as single
tokens [16]. Next, we send the corpus through FastText, the most recent word2vec
implementation, which maps each unique token in the corpus to a vector [30]. We can then
fit a centroid to each publication and use the Fast Library for Approximate Nearest
Neighbors (FLANN) to generate a nearest neighbors graph [32]. The result is a network of
MEDLINE papers, each of which is connected to other papers sharing a similar topic. This
network, combined with the UMLS metathesaurus and semantic network, constitutes our
full knowledge base. The network construction process is described in greater detail in
Section 2.
With our network, a researcher can query for the connections between two keywords. We
find the shortest path between the two keywords in the knowledge network, and extend this
path to identify a significant set of related abstracts. This subset contains many documents
which, due to our network construction process, all share common topics. We perform topic
modeling on these documents using PLDA+ [28]. The result is a set of plain text topics
which represent different concepts which likely connect the two queried keywords. More
information about the query process is detailed in Section 3.
We use landmark historical findings in order to validate our methods. For example, we show
the implicit link between Venlafaxine and HTR1A, and the involvement of DDX3 on Wnt
signaling. These queries and results are detailed in Section 4. In Sections 5 and 6 we discuss
challenges and open research questions we have uncovered during our work.
1.3 Related Work
The study and exploration of undiscovered public knowledge began in 1986 with Swanson’s
landmark paper [45]. Swanson hypothesized that fragments of information from the set of
public knowledge could be connected in such a way as to shed light on new discoveries.
With this idea, Swanson continued his research to develop Arrowsmith, a text-based search
application meant to help doctors make connections from within the MEDLINE data set [38,
44, 46]. To use Arrowsmith, researchers supply two UMLS keywords which are used to
find two sets of abstracts, A and C. The system then attempts to find a set B ≈ A∩C.
Assuming sets A and C do not overlap initially, implicit textual links are used to expand both
sets until some sizable set B is discovered. The experimental process was computationally
expensive, and queries were typically run on a subset of the MEDLINE data set (according
to [46] around 1,000 documents).
Spangler has also been a driving force in the field of hypothesis generation and mining
undiscovered public knowledge. His textbook [41] details many text mining techniques as
well as an example application related to hypothesis generation in the MEDLINE data set.
His research in this field has focused on p53 kinases and how these undiscovered
interactions might aid drug designers [42, 41]. His method leverages unstructured text
mining techniques to identify a network of entities and relationships from medical text. Our
work differs from this paradigm by utilizing the structured UMLS keywords, their known
connections, and mined phrases. We do, however, rely on similar unstructured text mining
techniques, such as FastText and FLANN, to make implicit connections between the
abstracts.
Rzhetsky and Evans notice that current information gathering methods struggle to keep up
with the growing wealth of forgotten and hard to find information [17]. Their work in the
field of hypothesis generation has included a study on the assumptions made when
constructing biomedical models [15] and digital representations of hypotheses [40].
Divoli et al. analyze the assumptions made in medical research [15]. They note that
scientists often reach contradictory conclusions due to differences in each person’s underlying
assumptions. The study in [15] highlights the variance of these preconceptions by surveying
medical researchers on the topic of cancer metastasis. Surprisingly, 27 of the 28 researchers
surveyed disagree with the textbook process of cancer metastasis. When asked to provide the
“correct” metastasis scenario, none of the surveyed scientists agree. Divoli’s study highlights
a major problem for hypothesis generation. Scientists often disagree, even in published
literature. Therefore, a hypothesis generation system must be able to produce reliable results
from a set of contradicting information.
In [40], Soldatova and Rzhetsky describe a standardized way to represent scientific
hypotheses. By creating a formal and machine readable standard, they envision a collection
of hypotheses which clearly describes the full spectrum of existing theories on a given topic.
Soldatova and Rzhetsky extend existing approaches by representing hypotheses as logical
statements which can be interpreted by Adam, a robot scientist capable of starting one
thousand experiments a day. Adam is successful, in part, because they model hypotheses as
an ontology which allows for Bayesian inference to govern the likelihood of a specific
hypothesis being correct.
DiseaseConnect, an online system that allows researchers to query for concepts intersecting
two keywords, is a notable contribution to hypothesis generation [27]. This system, proposed
by Liu et al., is similar to both our system and Arrowsmith [39] in its focus on UMLS
keywords and MEDLINE literature mining. Unlike our system, Liu et al. restrict
DiseaseConnect to simply 3 of the 130 semantic types. They supplement this subset with
concepts from the OMIM [19] and GWAS [6] databases, two genome specific data sets.
Still, their network size is approximately 10% of the size of MOLIERE. DiseaseConnect uses
its network to identify diseases which can be grouped by their molecular mechanisms rather
than symptoms. The process of finding these clusters depends on the relationships between
different types of entities present in the DiseaseConnect network. Users can view subnetworks relevant to their query online and related entities are displayed alongside the
network visualization.
Barabási et al. improve upon the network analytic approach to understand biomedical data in
both their work on the disease network [19] as well as their more generalized symptoms-disease network [51]. In the former [19], the authors construct a bipartite network of disease
phenomes and genomes, which they refer to as the Diseasome. Their inspiration is an
observation that genes which are related to similar disorders are likely to be related
themselves. They use the Diseasome to create two projected networks, the human disease
network (HDN), and the Disease Gene Network (DGN). In the latter [51], they construct a
more generalized human symptoms disease network (HSDN) by using both UMLS
keywords and bibliographic data. HSDN consists of data collected from a subset of
MEDLINE consisting of only abstracts which contained at least one disease as well as one
symptom, a subset consisting of approximately 850,000 records. From this set, Goh et al.
calculated keyword co-occurrence statistics in order to build their network. They validate
their approach using 1,000 randomly selected MEDLINE documents and, with the help of
medical experts, manually confirm that the relationship described in a document is reflected
meaningfully in HSDN. Ultimately, Goh et al. find strong correlations between the
symptoms and genes shared by common diseases.
Bio-LDA is a modification of LDA which limits the set of keywords to the set present in
UMLS [49]. This reduction improves the meaning and readability of topics generated by
LDA. Wang et al. also show in this work that their method can imply connections between
keywords which do not show up in the same document. For example, they note that
Venlafaxine and HTR1A both appear in the same topic even though both do not appear in
the same abstract. We explore and repeat these findings in Section 4.2.
1.4 Related and Incorporated Technologies
FastText is the most recent implementation of word2vec from Mikolov et al. [30, 31, 23, 9].
Word2vec is a method which utilizes the skip-gram model to identify the relationships
between words by analyzing word usage patterns. This process maps plain text words into a
high dimensional vector space for use in data mining applications. Similar words are often
grouped together, and the distances between words can reveal relationships. For example,
the distance between the words “Man” and “Woman” is approximately the same as the
distance between “King” and “Queen”. FastText improves upon this idea by leveraging
sub-strings in long, rarely occurring words.
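The "Man"/"Woman" and "King"/"Queen" observation can be illustrated with a minimal sketch. The three-dimensional vectors below are hand-picked toy values, not output from word2vec or FastText, and the helper names are hypothetical; the point is only to show what "approximately the same distance" means for word vectors.

```python
import math

# Toy 3-dimensional embeddings, hand-picked for illustration only.
# Real word2vec/FastText vectors are learned and have hundreds of dimensions.
toy_vectors = {
    "man": (1.0, 0.0, 0.5),
    "woman": (1.0, 1.0, 0.5),
    "king": (3.0, 0.0, 2.0),
    "queen": (3.0, 1.0, 2.0),
}


def offset(a, b):
    """Component-wise difference between the vectors of words b and a."""
    return tuple(y - x for x, y in zip(toy_vectors[a], toy_vectors[b]))


def distance(a, b):
    """Euclidean (L2) distance between the vectors of words a and b."""
    return math.sqrt(sum(c * c for c in offset(a, b)))


d_gender = distance("man", "woman")
d_royal = distance("king", "queen")
print(d_gender, d_royal)
```

In learned embeddings the two distances are only approximately equal; the toy values here are constructed so that the offsets coincide exactly.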
ToPMine, a project from El-Kishky et al., is focused on discovering multi-word phrases
from a large corpus of text [20]. This project intelligently groups unigrams together to create
n-gram phrases for later use in text mining algorithms. By using a bag-of-words topic model,
ToPMine groups unigrams based on their co-occurrence rate as well as their topical
similarity using a process they call Phrase LDA.
Latent Dirichlet Allocation [8] is the most common topic modeling process and PLDA+ is
a scalable implementation of this algorithm [20, 28]. Developed by Zhiyuan Liu et al.,
PLDA+ quickly identifies groups of words and phrases which all relate to a similar concept.
Although it is an open research question as to how best to interpret these results, simple
qualitative analysis allows for “ballpark” estimations. For instance, it may take a medical
researcher to wholly understand the topics generated from abstracts related to two keywords,
but anyone can identify that all words related to a concept of interest occur in the same topic.
Results like this show that LDA has distinguished the presence of a concept in a body of
text.
2 Knowledge Network Construction
In order to discover hypotheses we construct a large weighted multi-layered network of
biomedical objects extracted from NLM data sets. Using this network, we run shortest-centroid-path queries (see Section 3) whose results serve as an input for hypothesis mining.
The wall clock time needed to complete this network construction pipeline is depicted in
Figure 1 (see details in Section 4.4). Omitted from this figure is the time spent preprocessing
the initial abstract text due to its embarrassingly parallel nature.
2.1 Data Sources
The NLM maintains multiple databases of medical information which are the main source of
our data. This includes MEDLINE [3], a source containing the metadata of approximately
24.5 million medical publications since the late 1800s. Most of these MEDLINE records
include a paper’s title, authors, publication date, and abstract text.
In addition to MEDLINE, the NLM maintains UMLS [2], which in turn provides the
metathesaurus as well as a semantic network. The metathesaurus contains two million
keywords along with all known synonyms (referred to as “atoms”) used in medical text. For
example, the keyword “RNA” has many different synonyms such as “Ribonucleinicum
acidum”, “Ribonucleic Acid”, and “Gene Products, RNA” to name a few. These
metathesaurus keywords form a network comprised of multi-typed edges. For example, an
edge may represent a parent - child or a broader concept - narrower concept relationship.
RNA has connections to terms such as “Nucleic Acids” and “DNA Synthesizers”. Lastly,
each keyword holds a reference to an object in the semantic network. RNA is an instance of
the “Nucleic Acid, Nucleoside, or Nucleotide” semantic type.
The UMLS semantic network is comprised of approximately 130 semantic types and is
connected in a similar manner as the metathesaurus. For example, the semantic type “Drug
Delivery Device” has an “is a” relationship with the “Medical Device” type, and has a
“contains” relationship with the “Clinical Drug” type.
MEDLINE, the metathesaurus, and the semantic network are represented in our network as
different layers. Articles which contain full text abstracts are represented as the abstract
layer nodes $\mathcal{A}$, keywords from the metathesaurus are represented as nodes in the keyword
layer $\mathcal{K}$, and items from the semantic network are represented as nodes in the semantic layer
$\mathcal{S}$.
2.2 Network Topology
We define a weighted undirected graph underlying our network $\mathcal{N}$ as $G = (V, E)$, where
$V = \mathcal{A} \cup \mathcal{K} \cup \mathcal{S}$ is the union of the abstract, keyword, and semantic layers. The
construction of $G$ was governed by two major goals. Firstly, the shortest
path between two indirectly related keywords should likely contain a significant number of
nodes in the abstract layer $\mathcal{A}$. If instead, this shortest path contained only $\mathcal{K}-\mathcal{K}$
(keyword-to-keyword) edges, we would limit
ourselves to known information contained within the UMLS metathesaurus. Secondly,
conceptual distance between topics should be represented as the distance between two nodes
in $\mathcal{N}$. This implies that we can determine the similarity between $i, j \in V$ by the weight of
their shortest path. If $ij \in E$, this would imply that there exists a previously known relationship
between $i$ and $j$. We are instead interested in connections between distant nodes, as these
potentially represent unknown information. Below we describe the construction of each
layer in $\mathcal{N}$.
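One way to picture a layered, weighted, undirected network of this kind is sketched below. This is an illustrative data structure only, not the MOLIERE implementation; the class name, method names, node identifiers, and edge weights are all hypothetical.

```python
from collections import defaultdict


class LayeredGraph:
    """Weighted undirected graph whose nodes are tagged with a layer name."""

    def __init__(self):
        self.layer = {}               # node id -> "abstract" | "keyword" | "semantic"
        self.adj = defaultdict(dict)  # node id -> {neighbor id: edge weight}

    def add_node(self, node, layer):
        self.layer[node] = layer

    def add_edge(self, u, v, weight):
        # Undirected: store the weight in both directions.
        self.adj[u][v] = weight
        self.adj[v][u] = weight

    def path_weight(self, path):
        """Sum of edge weights along a node sequence (smaller means more similar)."""
        return sum(self.adj[a][b] for a, b in zip(path, path[1:]))

    def abstracts_on(self, path):
        """Nodes of the path that belong to the abstract layer."""
        return [n for n in path if self.layer[n] == "abstract"]


# Hypothetical toy network: two keywords linked only through two abstracts.
g = LayeredGraph()
for node, layer in [("K1", "keyword"), ("K2", "keyword"),
                    ("A1", "abstract"), ("A2", "abstract")]:
    g.add_node(node, layer)
g.add_edge("K1", "A1", 0.5)
g.add_edge("A1", "A2", 0.25)
g.add_edge("A2", "K2", 0.25)

path = ["K1", "A1", "A2", "K2"]
print(g.path_weight(path), g.abstracts_on(path))
```

Because K1 and K2 share no direct edge in this toy example, their relationship is only visible through the abstracts on the path, which is the situation the two design goals aim to produce.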
2.3 Abstract Layer
When connecting abstracts ($\mathcal{A}-\mathcal{A}$ edges), we want to ensure that two nodes $i, j \in \mathcal{A}$ with
similar content are likely neighbors in the abstract layer $\mathcal{A}$. In order to do this, we turned to the
UMLS SPECIALIST NLP toolset [1] as well as ToPMine [16] and FastText [9, 23]. Our
process for constructing $\mathcal{A}$ is summarized in Figure 2.
First, we extract all titles, abstracts, and associated document ID (referred to as PMID within
MEDLINE) from the raw MEDLINE files. We then process these combined titles and
abstracts with the SPECIALIST NLP toolset to standardize spelling, strip stop words,
convert to ASCII, and perform a number of other data cleaning processes. We then use
ToPMine to generate meaningful n-grams and further clean the text. This process finds
tokens that appear frequently together, such as newborn and infants, and combines them into
a single token newborn_infants. Cleaning and combining tokens in this manner greatly
increases the performance of FastText, the next tool in our pipeline.
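The effect of combining frequently co-occurring tokens can be sketched as follows. This is a deliberately simplified stand-in based on raw adjacent-pair counts, not ToPMine's phrase mining algorithm; the function name, the `min_count` threshold, and the toy documents are assumptions made for illustration.

```python
from collections import Counter


def merge_frequent_bigrams(docs, min_count=2):
    """Join adjacent token pairs seen at least `min_count` times corpus-wide.

    Simplified illustration only: ToPMine uses a statistical significance
    test and supports longer phrases, which this sketch does not attempt.
    """
    pair_counts = Counter()
    for tokens in docs:
        pair_counts.update(zip(tokens, tokens[1:]))
    frequent = {pair for pair, n in pair_counts.items() if n >= min_count}

    merged_docs = []
    for tokens in docs:
        out, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in frequent:
                out.append(tokens[i] + "_" + tokens[i + 1])  # e.g. newborn_infants
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        merged_docs.append(out)
    return merged_docs


docs = [
    ["newborn", "infants", "require", "care"],
    ["care", "for", "newborn", "infants"],
    ["infants", "sleep"],
]
merged = merge_frequent_bigrams(docs)
print(merged)
```

After merging, `newborn_infants` is a single token, so a downstream embedding model learns one vector for the phrase rather than two unrelated ones.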
When running ToPMine, we keep the minimum phrase frequency and the maximum number
of words per phrase set to their default values. We also keep the topic modeling component
disabled. On our available hardware, the MEDLINE data set can be processed in
approximately thirteen hours without topic modeling, but does not finish within three days if
topic modeling is enabled. Because the resulting phrases are of high quality even without the
topic modeling component, we accept this quality vs. time trade off. It is also important to
note that we modify the version of ToPMine distributed by El-Kishky in [16] to allow
phrases containing numbers, such as gene names like p53.
Next, FastText maps each token in our corpus to a vector $\upsilon \in \mathbb{R}^d$, allowing us to fit a
centroid per abstract $i \in \mathcal{A}$. Using a sufficiently high-dimensional space ensures a good
separation between vectors. In other words, each abstract $i \in \mathcal{A}$ is represented in $\mathbb{R}^d$ as
$c_i = \frac{1}{k} \sum_{j=1}^{k} x_j$, where $x_j$ are FastText vectors of $k$ keywords in $i$.
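A minimal sketch of the centroid computation follows, assuming the centroid of an abstract is the arithmetic mean of its token vectors. The two-dimensional vectors and token names are toy values chosen for illustration; real FastText vectors would be loaded from a trained model.

```python
def centroid(vectors):
    """Arithmetic mean of k equal-length vectors, computed per dimension."""
    k = len(vectors)
    d = len(vectors[0])
    return [sum(v[j] for v in vectors) / k for j in range(d)]


# Toy 2-dimensional token vectors (hypothetical values, not FastText output).
token_vectors = {
    "asthma_attack": [1.0, 0.0],
    "inhaler": [0.0, 1.0],
    "child": [2.0, 2.0],
}
abstract_tokens = ["asthma_attack", "inhaler", "child"]

c = centroid([token_vectors[t] for t in abstract_tokens])
print(c)
```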
We choose to use the skipgram model to train FastText and reduce the minimum word
count to zero. Because our data preprocessing and ToPMine have already stripped low
support words, we accept that any n-gram seen by FastText is important. Following
examples presented in [30, 31, 23] and others, we set the dimensionality of our vector space
d to 500. This is consistent with published examples of similar size, for example the Google
news corpus processed in [30]. Lastly, we increase the word neighborhood and number of
possible sub-words from five to eight in order to increase data quality.
Finally, we used FLANN [32] to create a nearest neighbors graph from all $i \in \mathcal{A}$ in order to
establish $\mathcal{A}-\mathcal{A}$ edges in $E$. This requires that we presuppose a number of expected nearest
neighbors per abstract k. We set this tunable parameter to ten initially and noticed that this
value seemed appropriate. By studying the distances between connected abstracts, we
observed that most abstracts had a range of very close and relatively far “nearest neighbors”.
For our purposes in these initial experiments, we kept k = 10 and saw promising results. Due
to time and resource limitations, we were unable to explore higher values of k in this study,
but we are currently planning experiments where k = 100 and k = 1000. It is important to
note that the resulting network will have $\approx k(2.3 \times 10^7)$ edges, so there is a considerable
trade-off between quality vs. space and time complexity.
After experimenting with both L2 and normalized cosine distances, we observed that the L2
distance metric performs significantly better for establishing connections between centroids.
Unfortunately, we cannot utilize the k-tree optimization in FLANN along with non-normalized cosine distance, making it computationally infeasible for a dataset of our size. This
is because the k-tree optimization requires an agglomerative distance metric. Lastly, we
scale edges to the [0, 1] interval in order to relate them to other edges within the network.
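The nearest neighbors graph and the final rescaling step can be sketched on a handful of points. The code below is an exact brute-force search used as a stand-in for FLANN's approximate search, and it assumes min-max scaling for mapping weights to [0, 1], which the text does not specify; all names and data are hypothetical.

```python
import math


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_edges(centroids, k):
    """Return {(i, j): distance} with i < j, linking each point to its k nearest.

    Brute force is O(n^2) and only suitable for toy inputs; FLANN replaces
    this with an approximate index for corpora with millions of abstracts.
    """
    edges = {}
    for i, ci in enumerate(centroids):
        dists = sorted((l2(ci, cj), j) for j, cj in enumerate(centroids) if j != i)
        for dist, j in dists[:k]:
            edges[(min(i, j), max(i, j))] = dist  # undirected, so deduplicate
    return edges


def scale_to_unit_interval(edges):
    """Min-max scale edge weights to [0, 1] (assumed scaling scheme)."""
    lo, hi = min(edges.values()), max(edges.values())
    span = (hi - lo) or 1.0  # avoid division by zero when all weights match
    return {e: (w - lo) / span for e, w in edges.items()}


centroids = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
edges = scale_to_unit_interval(knn_edges(centroids, k=1))
print(edges)
```

Each point contributes at most k edges, so the graph holds at most k times n edges, in line with the edge count estimate given for the full network.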
2.4 Keyword Layer
The keyword layer $\mathcal{K}$ is imported from the UMLS metathesaurus. Each keyword is referenced by its
UMLS concept unique identifier (CUI). This layer links keywords which share already known connections.
These known connections are $\mathcal{K}-\mathcal{K}$ edges. The metathesaurus connections link related
words; for example, the keyword “Protein p53” C0080055 is related to “Tumor Suppressor
Proteins” C0597611 and “Li-Fraumeni Syndrome” C0085390 among others. There exist 14
different types of connections between keywords representing relationships such as parent - child or broader concept - narrower concept. We assign each a weight in the [0, 1] interval
corresponding to its relevance, and then scale all weights by a constant factor $\sigma$ so the
average $\mathcal{A}-\mathcal{A}$ edge is stronger than the average $\mathcal{K}-\mathcal{K}$ edge. The result is that a path
between two indirectly related concepts will more likely include a number of abstracts. We
selected $\sigma = 2$, but more study is needed to determine the appropriate edge weights within
the keyword layer.
2.5 $\mathcal{A}-\mathcal{K}$ Connections
In order to create edges between the abstract layer $\mathcal{A}$ and the keyword layer $\mathcal{K}$, we used a simple metric of term frequency-inverse document frequency (tf-idf). UMLS provides not only a list of keywords, but all
known synonyms for each keyword. For example, the keyword Color C0009393 has the
American spelling, the British spelling, and the pluralization of both defined as synonyms.
Therefore we used the raw text abstracts and titles (before running the SPECIALIST NLP
tools) to calculate tf-idf. In order to quickly count all occurrences of UMLS keywords across
all synonyms, we implemented a simple parser. This was especially important because many
keywords in UMLS are actually multi-word phrases such as “Clustered Regularly
Interspaced Short Palindromic Repeats” (a.k.a. CRISPR) C3658200.
In order to count these keywords, we construct a parse tree from the set of synonyms. Each
node in the tree contains a word, a set of CUIs, and a set of children nodes, with the
exception of the root which contains the null string. We build this tree by parsing each
synonym word by word. For each word, we either create a new node in the tree, or traverse
to an already existing child node. We store each synonym’s CUI in the last node in its parse
path. Then, to parse a document, we simply traverse the parse tree. This can be done in
parallel over the set of abstracts. For each word in an abstract, we move from the current tree
node to a child representing the same word. If none exists, we return to the root node. At
each step of this traversal, we record the CUIs present at each visited node. In this manner,
we get a count of each CUI present in each abstract. Our next pass aggregates these counts
to discover the total number of usages per keyword across all abstracts. We calculate tf-idf
per keyword per abstract. Because our network’s weights represent distance, we take the
inverse of tf-idf to find the weight for an $\mathcal{A}-\mathcal{K}$ edge. This is done simply by dividing a
CUI’s count across all abstracts by its count in a particular abstract. By calculating weights
this way, abstracts which use a keyword more often will have a lower weight, and therefore,
a shorter distance. We scale the edge weights to the $[0, \sigma]$ interval so that these edges are
comparable to those within the $\mathcal{A}$ and $\mathcal{K}$ layers.
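The parse tree and the weighting step described above can be sketched as follows. This is an illustrative reading of the description, not the authors' parser: whitespace tokenization, retrying a word from the root after a failed match, and min-max scaling to the $[0, \sigma]$ interval are all assumptions, and the function names and toy abstracts are hypothetical. The two CUIs are the ones quoted in the text.

```python
from collections import Counter


class TrieNode:
    def __init__(self):
        self.children = {}  # word -> TrieNode
        self.cuis = set()   # CUIs whose synonym ends at this node


def build_trie(synonyms):
    """Build the parse tree from (phrase, cui) pairs, one word per level."""
    root = TrieNode()
    for phrase, cui in synonyms:
        node = root
        for word in phrase.lower().split():
            node = node.children.setdefault(word, TrieNode())
        node.cuis.add(cui)  # store the CUI at the last node of the parse path
    return root


def count_cuis(root, text):
    """Count CUI occurrences in one abstract by walking the tree word by word."""
    counts = Counter()
    node = root
    for word in text.lower().split():
        if word not in node.children:
            node = root  # no continuation: return to the root
        # Assumption: after a reset, the same word may begin a new match.
        node = node.children.get(word, root)
        counts.update(node.cuis)
    return counts


def abstract_keyword_weights(counts_by_abstract, sigma=2.0):
    """Weight = corpus-wide count / in-abstract count, scaled to [0, sigma]."""
    totals = Counter()
    for counts in counts_by_abstract.values():
        totals.update(counts)
    raw = {(a, cui): totals[cui] / n
           for a, counts in counts_by_abstract.items()
           for cui, n in counts.items()}
    lo, hi = min(raw.values()), max(raw.values())
    span = (hi - lo) or 1.0
    return {edge: sigma * (w - lo) / span for edge, w in raw.items()}


synonyms = [
    ("color", "C0009393"),
    ("colour", "C0009393"),
    ("clustered regularly interspaced short palindromic repeats", "C3658200"),
]
abstracts = {
    "a1": "colour changes with clustered regularly interspaced short palindromic repeats",
    "a2": "color and colour",
}
root = build_trie(synonyms)
counts_by_abstract = {a: count_cuis(root, t) for a, t in abstracts.items()}
weights = abstract_keyword_weights(counts_by_abstract)
print(weights)
```

Abstract `a2` mentions the color keyword twice and `a1` only once, so the `a2` edge receives the smaller weight, that is, the shorter distance.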
2.6 Semantic Layer
The UMLS supplies a companion network referred to as the semantic network. This network
consists of semantic types, which are overarching concepts. These “types” are similar to the
function of a “type” in a programming language. In other words, a semantic type is a conceptual entity
embodied by instantiations of that type. In the UMLS network, elements of the keyword layer $\mathcal{K}$ are analogous
to the instantiations of semantic types. While there are over two million elements of $\mathcal{K}$, there
are approximately 130 elements in the semantic layer $\mathcal{S}$. For example, the semantic type Disease or Syndrome
T047 is defined as “A condition which alters or interferes with a normal process, state, or
activity of an organism” [2]. There are thousands of keywords, such as “influenza”
C0021400 that are instances of this type.
The $\mathcal{S}-\mathcal{S}$ edges within the semantic layer are connected similarly to the $\mathcal{K}-\mathcal{K}$ edges of the keyword layer. The overall structure is
hierarchical with “Event” T051 and “Entity” T071 being the most generalized semantic
types. Cross cutting connections are also present and can take on approximately fifty
different forms. These cross cutting relations also form a hierarchy of relationship types. For
example, “produces” T144 is a more specific relation than its parent “brings about” T187.
We initially included $\mathcal{S}$ in our network by linking each keyword to its corresponding
semantic type. Unfortunately, in our early results we found that many shortest paths
traversed through the semantic layer $\mathcal{S}$ rather than through the abstract layer $\mathcal{A}$. For example, if we were interested in two
diseases, it was possible that the shortest path would simply travel to the “Disease or
Syndrome” T047 type. This ultimately degraded the performance of our hypothesis
generation system. As a result we removed this layer, but further study may find that
careful choice of $\mathcal{S}-\mathcal{S}$ and $\mathcal{K}-\mathcal{S}$ connection weights may make $\mathcal{S}$ more useful. This is
further discussed in Section 5.
3 Query Process
The process of running a query within MOLIERE is summarized in Figure 3. Running a query
starts with the user selecting two nodes $i, j \in V$ (typically, but not necessarily, keywords $i, j \in \mathcal{K}$). For
example, a query searching for the relationship between “stem cells” and “strokes” would be
input as keyword identifiers C0038250 and C1263853, respectively. This process simplifies
our query process, but determining a larger set of keywords and abstracts which best
represents a user’s search query is a future work direction.
After receiving two query nodes i and j, we find a shortest path between them, (ij)s, using
Dijkstra’s algorithm. These paths typically are between three and five nodes long and
contain up to three abstracts (unless the nodes are truly unrelated, see Section 4.1). We
observed that when (ij)s contains only two or three nodes in , that the ij relationship is
clearly well studied because it was solely supplied by the UMLS layer . We are more
interested in paths containing abstracts because these represent keyword pairs whose
relationships are less well-defined. Still, the abstracts we find along these shortest paths
alone are not likely to be sufficient to generate a hypothesis.
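The shortest-path step can be illustrated with a minimal sketch, assuming the network is stored as a dictionary of weighted adjacency maps. The function name, the toy node labels ("K:" for keywords, "A:" for abstracts), and the edge weights below are illustrative assumptions and are not taken from the system described in this paper.

```python
import heapq


def shortest_path(graph, source, target):
    """Dijkstra's algorithm. Returns (distance, path), or (inf, []) if unreachable."""
    dist = {source: 0.0}
    prev = {}
    done = set()
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue  # stale heap entry
        done.add(u)
        if u == target:
            break
        for v, w in graph.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if target not in done:
        return float("inf"), []
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    path.reverse()
    return dist[target], path


# Toy undirected network with made-up weights (hypothetical labels).
toy = {
    "K:stem_cells": {"A:1": 0.5, "K:therapy": 1.0},
    "A:1": {"K:stem_cells": 0.5, "A:2": 0.25},
    "A:2": {"A:1": 0.25, "K:stroke": 0.5},
    "K:therapy": {"K:stem_cells": 1.0, "K:stroke": 1.0},
    "K:stroke": {"A:2": 0.5, "K:therapy": 1.0},
}
distance, path = shortest_path(toy, "K:stem_cells", "K:stroke")
print(distance, path)
```

In this toy network the cheapest route passes through two abstracts rather than through the keyword-only route, which is the kind of path the text above describes as more interesting.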
3.1 Hypothesis Modeling
Broadening (ij)s consists of two main phases, the results of which are depicted in Figure 4.
First, we select S, the set of all abstract nodes on (ij)s. These abstracts along the path (ij)s represent papers
which hold key information relating two unconnected keywords. We find a neighborhood
around S using a weighted breadth-first traversal, selecting the closest 1,000 abstracts to S.
We will call this set N. Because the abstract layer was constructed as a nearest neighbors graph, it is likely
that the concepts contained in N will be similar to the concepts contained in S, which
increases the likelihood that important concepts will be detected by PLDA+ later in the
pipeline.
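One way to read "weighted breadth-first traversal" is as a multi-source, cheapest-first expansion from all nodes of S at once. The sketch below follows that reading and makes two further assumptions that the text does not state: the traversal stays on abstract-to-abstract edges, and the seed abstracts themselves are not counted toward the limit. All names are hypothetical.

```python
import heapq


def is_abstract(node):
    # Toy convention for this sketch only: abstract nodes are labeled "A:...".
    return node.startswith("A:")


def closest_abstracts(graph, seeds, limit=1000):
    """Return up to `limit` abstracts nearest to the seed set, closest first."""
    best = {s: 0.0 for s in seeds}
    heap = [(0.0, s) for s in seeds]
    heapq.heapify(heap)
    done = set()
    found = []
    while heap and len(found) < limit:
        d, u = heapq.heappop(heap)
        if u in done:
            continue  # stale heap entry
        done.add(u)
        if u not in seeds:
            found.append(u)
        for v, w in graph.get(u, {}).items():
            if not is_abstract(v):
                continue  # assumption: stay inside the abstract layer
            nd = d + w
            if nd < best.get(v, float("inf")):
                best[v] = nd
                heapq.heappush(heap, (nd, v))
    return found
```

Calling `closest_abstracts(graph, S, limit=1000)` would then play the role of the set N, under the assumptions listed above.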
Next, we identify abstracts which contain information pertaining to the keyword-to-keyword connections
present in (ij)s. We do so in order to identify abstracts which likely contain concepts which a
human reader could use to understand the known relationship between two connected
keywords. We start by traversing (ij)s to find keywords α and β such that α and β are adjacent in (ij)s.
From there, we find a set of abstracts C = {c : cα ∈ E ∧ cβ ∈ E}. That is, C is a subset of
abstracts containing both keywords α and β. Because (ij)s can have many edges between
keywords, and because thousands of abstracts can contain the same two keywords, it is
important to limit the size of C.
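The construction of C amounts to intersecting the abstract neighborhoods of α and β. The sketch below illustrates this on a dictionary-of-adjacency-maps graph. The text does not say how the size of C is limited, so the cap used here (keep the abstracts with the smallest combined edge weight to α and β) is purely an assumption, as are the function name and the "K:"/"A:" labels.

```python
def shared_abstracts(graph, path, cap=None):
    """For each adjacent keyword pair on `path`, list abstracts linked to both.

    Abstracts are ordered by the sum of their edge weights to the two keywords
    and truncated to `cap` entries (cap=None keeps all of them).
    """
    result = {}
    for a, b in zip(path, path[1:]):
        if not (a.startswith("K:") and b.startswith("K:")):
            continue  # only keyword-to-keyword steps are considered
        near_a = graph.get(a, {})
        near_b = graph.get(b, {})
        common = [c for c in near_a if c.startswith("A:") and c in near_b]
        common.sort(key=lambda c: near_a[c] + near_b[c])
        result[(a, b)] = common[:cap]
    return result
```

With `cap=None` the function returns the full set C for each pair; a finite cap bounds the number of abstracts each keyword-to-keyword edge can contribute.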
This process creates a set of around 1,300 abstract nodes. This set will typically contain around
15,000–20,000 words and is large enough for PLDA+ to find topics. We run PLDA+ and
request 20 topics. We find this provides a sufficient spread in our resulting data sets. The
trained model generated by PLDA+ is what is eventually returned by our query process.
For our experiments, we often must process tens of thousands of results and thus must train
topic models quickly. This is most apparent when running a one-to-many query such as the
drug repurposing example in Section 4.3. Additionally, the training corpus returned from a MOLIERE
query is often only a couple thousand documents large. As a result, we set the number of
topics and the number of iterations to relatively small values, 20 and 100 respectively.
Because we store intermediary results, it is trivial to retrain a topic model if the preliminary
result seems promising.
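To make the two parameters concrete, the following is a small, single-threaded collapsed Gibbs sampler for LDA written with the standard library only. It is not PLDA+ and does not reproduce its parallelization; it only shows where the number of topics (20) and the number of iterations (100) enter. The hyperparameters alpha and beta, the seed, and all names are assumptions made for this sketch.

```python
import random


def train_lda(docs, num_topics=20, iterations=100, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA; `docs` is a list of token lists.

    Returns (topic_word, doc_topic): per-topic word counts and per-document
    topic counts after the final sweep.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    topic_word = [dict() for _ in range(num_topics)]
    topic_total = [0] * num_topics
    doc_topic = [[0] * num_topics for _ in docs]
    assign = []
    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            topic_word[z][w] = topic_word[z].get(w, 0) + 1
            topic_total[z] += 1
            doc_topic[d][z] += 1
        assign.append(zs)
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assign[d][i]
                # Remove the token's current assignment from the counts.
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                doc_topic[d][z] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k].get(w, 0) + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = z
                topic_word[z][w] = topic_word[z].get(w, 0) + 1
                topic_total[z] += 1
                doc_topic[d][z] += 1
    return topic_word, doc_topic


def top_words(topic_word, topic, n=5):
    """The n most frequent words assigned to a topic."""
    counts = [(w, c) for w, c in topic_word[topic].items() if c > 0]
    counts.sort(key=lambda wc: (-wc[1], wc[0]))
    return [w for w, _ in counts[:n]]
```

Because each sweep touches every token once, the cost grows linearly with the number of iterations and the number of topics, which is why small values keep training fast on a corpus of a couple thousand documents.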
The process of analyzing a topic model and uncovering a human interpretable sentence to
describe a hypothesis is still a pressing open problem. The process as stated here does have
some strong benefits which are apparent in Section 4. These include the ability to find
correlations between medical objects, such as between a drug and multiple genes. In Section
6 we explain our initial plans to improve the quality of results which can be deduced from
these topic models.
4 Experiments
We conduct two major validation efforts to demonstrate our system’s potential for
hypothesis generation. For each of these experiments we use the same set of parameters for
our trained model and network weights. Our initial findings show our choices, detailed in
Section 2, to be robust. We plan to refine these choices with methods described in Section 6.
We repeat an experiment done by Wang et al. in [49] wherein we discover the implicit
connections between the drug Venlafaxine and the genes HTR1A and HTR2A. We also
perform a large scale study of Dead Box RNA Helicase 3 (DDX3) and its connection to
cancer adhesion and metastasis. Each of these experiments is described in greater detail in
the following sections. In this paper, we deliberately do not evaluate our experiments with
extremely popular objects such as p53. These objects are so highly connected within our network that
hypothesis generation involving these keywords is easy for many different methods.
4.1 Network Profile
We conduct our experiments on a very large knowledge graph which has been constructed
according to Section 2. We initially created a network containing information dating up
to and including 2016. This network consists of 24,556,689 nodes and 989,169,295 edges.
Overall, the largest strongly connected component contains 99.8% of
our network. The average degree of a node in the network is 79.65, and we observe a high clustering
coefficient of 0.283. These metrics cause us to expect that the shortest path between two
nodes will be very short. Our experiments agree, showing that most shortest paths are
between three and six nodes long.
4.2 Venlafaxine to HTR1A
Wang et al. in [49] use a similar topic modeling approach, and find during one of their
experiments that Venlafaxine C0078569 appears in the same topic as the HTR1A and
HTR2A genes (C1415803 and C1825553 respectively). When looking into these results,
they find a stronger association between Venlafaxine and HTR1A. This finding is important
because Venlafaxine is used to treat depressive disorder and anxiety, which HTR1A and
HTR2A have been thought to affect, but as of 2009 no abstract contains this link. As a result,
this implicit connection is difficult to detect with many existing methods.
Results—As a result of running two queries, Venlafaxine to HTR1A, and Venlafaxine to
HTR2A, we can corroborate the findings of Wang et al. in [49]. We find that neither pair of
keywords is directly connected or connected through a single abstract. Nevertheless, phrases
such as “long term antidepressant treatment,” “action antidepressants,” and “antidepressant
drugs” are all prominent keywords in the HTR1A query. Meanwhile, the string “depress”
only occurs four times in unrelated phrases with the HTR2A results. The distribution of
depression related keywords from both queries can be seen in Figure 5.
Similarly, our results for HTR1A contain a single topic holding the phrases “anxiogenic,”
“anxiety disorders,” “depression anxiety disorders,” and “anxiolytic response.” In contrast,
our HTR2A results do not contain any phrases related to anxiety. The distribution of anxiety
related keywords from both queries can be seen in Figure 6.
Our findings agree with those of Wang et al., who found that a small association score of
0.34 between Venlafaxine and HTR1A indicates a connection which is likely related to
depressive disorder and anxiety. The association score between Venlafaxine and HTR2A, in
contrast, is a much higher 4.0. This indicates that the connection between these two
keywords is much weaker.
4.3 Drug Repurposing and DDX3’s Anti-Tumor Applications
Many genes are active in multiple cellular processes and in many cases they are found to be
active outside of the original area in which the gene was initially discovered. The prediction
of new processes is especially important for repurposing existing drugs (or drug target
genes) to a new application [5, 33, 4]. As an example, the drugs developed for the treatment
of infectious diseases were recently repurposed for cancer treatment. Extending applications
of existing drugs provides a tremendous opportunity for the development of cost-effective
treatments for cancers and other life-threatening diseases.
To estimate the predictive value of our system for the discovery of new applications of small
molecules, we select Dead Box RNA Helicase 3 (DDX3) C2604356. DDX3 is a member
of the Dead-box RNA helicase family and was initially discovered to be a regulator of transcription and
propagation of Human Immunodeficiency Virus (HIV) as well as ribosomal biogenesis.
Initially, DDX3 was a target for the development of anti-viral therapy for AIDS
treatment [25, 29].
More recently, DDX3 activity was found to be involved in cancer development and progression
mainly through regulation of the Wnt signaling pathway [13, 50] and associated regulation
of cell-cell and cell-matrix adhesion, tumor cell invasion, and metastasis [12, 43, 48, 24].
Currently, DDX3 is an established target for anti-tumor drug development [10, 37, 11] and
represents a case for repurposing target anti-viral drugs into the application area of anti-tumor therapy.
To test this hypothesis, we analyze the data available on and before 12/31/2009, when no
published indication of links between DDX3 and Wnt signaling was available. We
compare DDX3 to all UMLS keywords containing the text “signal transduction”,
“transcription”, “adhesion”, “cancer”, “development”, “translation”, or “RNA” in their
synonym list. This search results in 9,905 keywords over which we query for relationships to
DDX3. From this large set of results we personally analyze a subset of important pairs.
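The keyword selection described above can be sketched as a substring filter over each keyword's synonym list. The sketch assumes a mapping from keyword identifier to its synonyms and uses case-sensitive matching as the most literal reading of "containing the text"; whether case was normalized is not stated here. The function name and data layout are hypothetical.

```python
SEARCH_TERMS = (
    "signal transduction",
    "transcription",
    "adhesion",
    "cancer",
    "development",
    "translation",
    "RNA",
)


def select_keywords(synonyms_by_id, terms=SEARCH_TERMS):
    """Return identifiers whose synonym list contains any search term as a substring."""
    selected = []
    for keyword_id, synonyms in synonyms_by_id.items():
        if any(term in synonym for synonym in synonyms for term in terms):
            selected.append(keyword_id)
    return selected
```

Each selected identifier would then be paired with DDX3 to form one query.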
Results—In our generated dataset, we found the following text groupings within topics:
“substrate adhesion,” “RGD cell adhesion domain,” “cell adhesion factor,” and “focal adhesion
kinase,” which are indicative of cell-matrix adhesion. The topics “cell-cell adhesion,”
“regulation of cell-cell adhesion,” and “cell-adhesion molecules” indicate the involvement of
DDX3 in cell-cell adhesion regulation. The involvement of adhesion is associated with
topics related to tumor dissemination: “Collaborative staging metastasis evaluation Cancer,”
“metastasis adhesion protein, human,” and “metastasis associated in colon cancer 1” (selected
from among other similar topics).
The results above suggest that through analysis of the ≤2009 dataset we can predict the
involvement of DDX3 in tumor cell dissemination through the effects of cell-cell and cell-matrix adhesion. Next, we analyzed whether it would be possible to gain insight into the
mechanisms of DDX3-dependent regulation of Wnt signaling. As shown recently, DDX3
involvement in Wnt signaling is based on the regulation of casein kinase epsilon, which affects
phosphorylation of the disheveled protein. Although we cannot predict the exact mechanism
of DDX3 based on the ≤2009 dataset, the existence of multiple topics of signal-transduction
associated kinases, like “CELL ADHESION KINASE”, “activation by organism of defense-related host MAP kinase-mediated signal transduction pathway”, and “modulation of defense-related symbiont mitogen-activated protein kinase-mediated signal transduction pathway by
organism”, suggests the ability of DDX3 to regulate kinase activities and kinase-regulated
pathways.
4.4 Experimental Setup
We performed all experiments on a single node within Clemson’s Palmetto supercomputing
cluster. To perform our experiments and construct our network, we use an HP DL580
containing four Intel Xeon x7542 chips. This 24 core node has 500 GB of memory and
access to a large ZFS-based file system where we stored experimental data.
For the DDX3 queries, we initially searched for all (ij)s where i = DDX3 and j is any keyword. This
resulted in 1,350,484 shortest paths with corresponding abstract clouds. We used PLDA+ to
construct models for all of these paths. Discovering all (ij)s completed in almost 10 hours of
CPU time, and training the respective models completed in slightly over 68 hours of CPU
time. We ran PLDA+ in parallel, resulting in a wall time of only 12 hours. As mentioned
previously, this large dataset was filtered to the 9,905 paths we are interested in.
We generate the results for the Venlafaxine experiments in one hour of CPU time, which is
mostly spent loading our very large network and then running Dijkstra’s algorithm. After
this, the two resulting PLDA+ models were trained in parallel within a minute.
5 Deployment Challenges
In this section we detail the challenges which we have faced and expect to
encounter while creating our system and deploying it to the research community.
Dynamic Information Updates
The process of creating our network is computationally expensive and for the purposes of
validation we must create multiple instances of our network representing different points in
time. Initially we would have liked to create these multiple instances from scratch, starting
from the MEDLINE archival distribution and rebuilding the network from there.
Unfortunately, this proved infeasible because creating a single network is a time consuming
process. Instead, we filter our network by removing abstracts and keywords which were
published after our selected date. Additionally, the act of adding information to our network,
such as extending the 2016 network to create a 2017 network, is not straightforward. Ideally,
adding a small number of abstracts or keywords should be a fast and dynamic process which
only affects localized regions of the network. If this were so, our deployed system could take
advantage of new ideas and connections as soon as they are published.
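The date-based filtering used to build historical networks can be sketched as follows, assuming each node carries a publication year and the graph is a dictionary of weighted adjacency maps. The function name, the year granularity, and the choice to keep undated nodes are assumptions made for illustration.

```python
def snapshot(graph, published, cutoff_year):
    """Copy `graph` without nodes published after `cutoff_year`.

    Edges touching a removed node are dropped as well. Nodes missing from
    `published` are treated as undated and kept (an assumption of this sketch).
    """
    keep = {
        node
        for node in graph
        if published.get(node, cutoff_year) <= cutoff_year
    }
    return {
        u: {v: w for v, w in neighbors.items() if v in keep}
        for u, neighbors in graph.items()
        if u in keep
    }
```

Filtering in this way reuses the already constructed network, so a historical instance costs one pass over the nodes and edges instead of a full rebuild.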
A deployed system could support dynamic updates with an amortized approach. Using
previously created FastText and ToPMine models, new documents could be fitted into an
existing network with suitably high performance. Of course, if a new document introduced a
new keyword or phrase, we would be unable to detect it initially. After some threshold of
new documents had been added to the network, we could then rerun the entire network
construction process to ensure that new keywords, phrases, and concepts would be properly
placed in the network.
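The amortized strategy can be sketched as a thin wrapper that attaches each new document with the existing models and triggers a full rebuild once a threshold of additions is reached. The class, the `toy_build` and `toy_attach` stand-ins, and the threshold value are hypothetical; the real construction and attachment steps would involve the FastText and ToPMine models mentioned above.

```python
class AmortizedNetwork:
    """Attach documents cheaply; rebuild from scratch every `threshold` additions."""

    def __init__(self, documents, build, attach, threshold):
        self.documents = list(documents)
        self.build = build          # full construction: documents -> network
        self.attach = attach        # cheap update: (network, document) -> None
        self.threshold = threshold
        self.pending = 0
        self.rebuilds = 0
        self.network = build(self.documents)

    def add(self, document):
        self.documents.append(document)
        self.pending += 1
        if self.pending >= self.threshold:
            self.network = self.build(self.documents)
            self.pending = 0
            self.rebuilds += 1
        else:
            self.attach(self.network, document)


def toy_build(documents):
    # Stand-in for full construction: learn the vocabulary, then link documents to words.
    vocab = {word for doc in documents for word in doc.split()}
    return {"vocab": vocab, "links": {doc: set(doc.split()) for doc in documents}}


def toy_attach(network, document):
    # Stand-in for the fast path: words unseen at build time cannot be linked yet.
    network["links"][document] = set(document.split()) & network["vocab"]
```

In the toy version, a word that first appears in an attached document stays invisible until the next rebuild, mirroring the limitation noted above for new keywords and phrases.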
Query Platform and Performance
Initially, we expected to use a graph database to make the query process easier. We surveyed
a selection of graph databases and found that Neo4j [14] provides a powerful query language
as well as a platform capable of holding our billion-edge network. Unfortunately, Neo4j
does not easily support weighted shortest path queries. Although some user suggestions did
hint that it may be possible, the process requires leveraging edge labels and custom Java
procedures in a way that did not seem scalable. In place of Neo4j, we implemented
Dijkstra’s shortest path algorithm in C++ using skew heaps as the internal priority queue.
This implementation was chosen to minimize memory usage while maximizing speed and
readability. Because we implemented Dijkstra’s algorithm ourselves, we can also combine
the process of finding a shortest path and finding all neighboring abstracts for all keywords
from a specific source. With only these high level optimizations, we were able to generate
over 1,350,000 shortest paths and abstract neighborhoods in under ten hours, but generating
a single result takes slightly over one hour.
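For readers unfamiliar with skew heaps, the following Python sketch shows the data structure and its use as the priority queue in Dijkstra's algorithm. It illustrates the idea only; it is not the C++ implementation described above, and all names are hypothetical. The recursive merge is adequate for small examples but would need an iterative form for very large heaps.

```python
class _Node:
    __slots__ = ("key", "item", "left", "right")

    def __init__(self, key, item):
        self.key, self.item = key, item
        self.left = self.right = None


def _merge(a, b):
    """Merge two skew heaps and return the root of the result."""
    if a is None:
        return b
    if b is None:
        return a
    if b.key < a.key:
        a, b = b, a
    # Merge into the right subtree, then swap the children (the "skew" step).
    a.left, a.right = _merge(a.right, b), a.left
    return a


class SkewHeap:
    def __init__(self):
        self._root = None
        self._size = 0

    def __len__(self):
        return self._size

    def push(self, key, item):
        self._root = _merge(self._root, _Node(key, item))
        self._size += 1

    def pop(self):
        root = self._root
        self._root = _merge(root.left, root.right)
        self._size -= 1
        return root.key, root.item


def dijkstra_distances(graph, source):
    """Distances from `source` to every reachable node, with SkewHeap as the queue."""
    dist = {}
    heap = SkewHeap()
    heap.push(0.0, source)
    while len(heap):
        d, u = heap.pop()
        if u in dist:
            continue  # stale entry: u was already settled at a smaller distance
        dist[u] = d
        for v, w in graph.get(u, {}).items():
            if v not in dist:
                heap.push(d + w, v)
    return dist
```

A skew heap stores only two child pointers per entry and needs no balance information, which fits the stated goal of keeping memory usage low while remaining easy to read.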
6 Lessons Learned and Open Problems
Specialized LDA
During the last two decades there have been a number of significant attempts to design automatic
hypothesis generation systems [41, 44, 49]. However, most of these improve their
performance by restricting either their information space or the size of their dictionary. For
example, specialized versions of LDA such as Bio-LDA [49] uncover latent topics using a
dictionary that gives a priority to special terms. We find that such approaches are helpful
when general language may significantly outweigh a specialized language. However,
phrase mining approaches that recover n-grams, such as [16], produce accurate methods
without limiting the dictionary.
Hypothesis Viability and Novelty Assessment
Intuitively, a strong connection between two concepts in our network means that there exists a
significant amount of research that covers a path between them. Similar observations are
valid for LDA, i.e., latent topics are likely to describe well known facts. As a result, the most
meaningful connections and interpretable topical inference are discovered with latent
keywords that are among the most well known concepts. However, real hypotheses are not
necessarily described using the most latent keywords in such topic models. In many cases,
the keywords required for a successful and interpretable hypothesis start to appear among
the 20–30 most latent topical keywords. Thus, a major open problem is the process by
which one should select a combination of keywords and topics in order to represent a viable
hypothesis. This problem is also linked to the problem of assessing the viability of a
generated hypothesis.
These problems, as well as the problem of hypothesis novelty assessment, can be partially
addressed by using Dynamic Topic Modeling (DTM) [7]. Our preliminary experiments
with scalable time-dependent clustered LDA [21], which significantly accelerates DTM,
demonstrate a potential to discover dynamic topics in MEDLINE. The dynamic topics are
typically more realistic than those that can be discovered in the static network. This
significantly simplifies the assessment of viability and topic noise elimination.
Incorporating the Semantic Layer
In Section 2.6 we describe the process in which we evaluated the UMLS semantic network
and found that it worsened our resulting shortest path queries. Further work could improve
the contribution that the semantic layer has on our overall network, possibly allowing it to define the overall
structure of our knowledge graph. In order to do this, one would likely need to take into
account the hierarchy of relationship types present in this network, as well as the relative
relationship each element in the semantic layer has with its connection in the keyword layer. Ultimately, these different
relationships would need to inform a weighting scheme that balances the over-generalizations
that the semantic layer introduces. For example, it may be useful to understand that two
keywords are both diseases, but it is much less useful to understand that two keywords are
“entities”.
Learning the Models of Hypothesis Generation
There is surprisingly little research focused on addressing the process of biomedical research
and how that process evolves over time. We would like to model the process of discovery
formation, taking into account the information context surrounding and preceding a
discovery. We believe we could do so by reverse engineering existing discoveries in order to
discover factors which altered the steps in a scientist’s research pipeline. Several promising
observations in this direction have been made by Foster et al. [18], who examined this
through Bourdieu’s field theory of science by developing a typology of research strategies
on networks extracted from MEDLINE. However, instead of reverse engineering their
models, they separate innovation steps from those that are more traditional in the research
pipeline.
Dynamic Keyword Discovery
One of the limitations we found when performing our historical queries is the delay between
the first major uses of a keyword and its appearance in the UMLS metathesaurus. Initially,
we planned to study the relationship between “CRISPR” C3658200 and “genome editing”
C4279981. To our surprise, many keywords related to this query did not exist in our
historical networks between 2009 and 2012, despite their frequent usage in cutting-edge
research during that time. To further confuse the issue, although the keyword “CRISPR” did
not appear in the UMLS releases on or before 2012, keywords containing “CRISPR” as a
substring, such as “CRISPR element metabolism” C1752766, do appear. We find this to be
contradictory, and these inconsistencies highlight the limitation of relying so strongly
on keyword databases. Going forward, we plan to devise a way to extend a provided
keyword network, utilizing semantic connections we can find within the MEDLINE
document set. Projects like [42] have already shown this method can work in domains of
smaller scales with good results. The challenge will be to extend this method to perform
well when used on the entire MEDLINE data set.
Improving Performance of Algorithms with Graph Reordering Techniques
Cache-friendly layouts of graphs are known to generally accelerate the performance of the
path and abstract retrieval algorithms which we apply. Moreover, it is desirable to consider
this type of acceleration in order to make our system more suitable for regular modern
desktops. This is an important consideration as memory is not expected to be a major
bottleneck after the network is constructed. We propose to rearrange the network nodes by
minimizing such objectives as the minimum logarithmic or linear arrangements [36, 34]. On
the mixture of edge types present in our network, we anticipate an improvement of at least
20% in the number of cache misses according to [35].
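The two arrangement objectives can be written down directly. For a node ordering π, the linear arrangement cost sums w(u,v)·|π(u) − π(v)| over all edges, and the logarithmic variant sums w(u,v)·log2|π(u) − π(v)|. The sketch below only evaluates these costs for a given ordering; it does not implement the multilevel minimization of [36, 34], and the function names and the edge-list format are assumptions.

```python
import math


def linear_arrangement_cost(edges, order):
    """Sum of w * |pos(u) - pos(v)| over weighted edges (u, v, w)."""
    pos = {node: i for i, node in enumerate(order)}
    return sum(w * abs(pos[u] - pos[v]) for u, v, w in edges)


def log_arrangement_cost(edges, order):
    """Sum of w * log2|pos(u) - pos(v)| over weighted edges; assumes no self-loops."""
    pos = {node: i for i, node in enumerate(order)}
    return sum(w * math.log2(abs(pos[u] - pos[v])) for u, v, w in edges)


# Toy path graph a-b-c-d with unit weights.
path_edges = [("a", "b", 1.0), ("b", "c", 1.0), ("c", "d", 1.0)]
good = ["a", "b", "c", "d"]   # neighbors sit next to each other
poor = ["a", "c", "b", "d"]   # two of the three edges span a gap
print(linear_arrangement_cost(path_edges, good), linear_arrangement_cost(path_edges, poor))
```

Orderings with lower cost place connected nodes closer together in memory, which is the property that cache-friendly layouts aim for.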
Mass Evaluation
We note that evaluation techniques are largely an issue in the state of the art of hypothesis
generation. While some works feature large scale evaluation performed by many human
experts, a majority, this work included, are restricted to only a couple of promising results to
justify the system. In order to better evaluate and compare hypothesis generation techniques
we must devise a common and large scale suite of historical hypotheses. We are currently
evaluating whether a ground-truth network, like the drug-side-effect network SIDER [26],
can be a good source of such hypotheses. For example, if we identify a set of recently added
connections within SIDER, and predict a substantial percentage of those connections using
MOLIERE, then we may be more certain of our performance.
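The proposed check reduces to measuring what fraction of the newly added ground-truth connections also appear among the predicted ones. A minimal sketch, assuming both sets are given as undirected pairs of identifiers (the function name and input format are hypothetical):

```python
def recovered_fraction(new_links, predicted_links):
    """Share of newly added ground-truth links that were also predicted."""
    truth = {frozenset(pair) for pair in new_links}
    if not truth:
        return 0.0
    predicted = {frozenset(pair) for pair in predicted_links}
    return len(truth & predicted) / len(truth)
```

Using `frozenset` makes the comparison independent of the order in which the two endpoints of a link are listed.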
New Domains of Interest
We have considered other domains on which MOLIERE may perform well. These include
generating hypotheses regarding economics, patents, narrative fiction, and social
interactions. These are all domains where a hypothesis would involve finding new
relationships between distinct entities. We contrast this with domains such as mathematics
where the entity-relationship network is much less clear, and logical approaches from the
field of automatic theorem proving are more applicable.
7 Conclusions
In this study we describe a deployed biomedical hypothesis generation system, MOLIERE,
that can discover relationship hypotheses among biomedical objects. This system utilizes
information which exists in MEDLINE and other NLM datasets. We validate MOLIERE on
landmark discoveries using carefully filtered historical data. Unlike several other hypothesis
generation systems, we do not restrict the information retrieval domain to a specific
language or a subset of scientific papers since this method can lose an unpredictable amount
of information. Instead, we use recent text mining techniques that allow us to work with the
full heterogeneous data at scale. We demonstrate that MOLIERE successfully generates
hypotheses and recommend using it to advance biomedical knowledge discovery. Going
forward, we note a number of directions along which we can improve MOLIERE as well as
many existing hypothesis generation systems.
Acknowledgments
We would like to thank Dr. Lihn Ngo for his help in using the Palmetto supercomputer which ran our experiments,
and Cong Qiu for initial experiments with Neo4j.
References
1. Specialist nlp tools, 2006.
2. Umls reference manual, 2009.
3. Pubmed, 2016.
4. Allison, Malorye. Ncats launches drug repurposing program. 2012
5. Andronis, Christos, Sharma, Anuj, Virvilis, Vassilis, Deftereos, Spyros, Persidis, Aris. Literature
mining, ontologies and information visualization for drug repurposing. Briefings in bioinformatics.
2011; 12(4):357–368. [PubMed: 21712342]
6. Barrenas, Fredrik, Chavali, Sreenivas, Holme, Petter, Mobini, Reza, Benson, Mikael. Network
properties of complex human disease genes identified through genome-wide association studies.
PloS one. 2009; 4(11):e8090. [PubMed: 19956617]
7. Blei, David M., Lafferty, John D. Proceedings of the 23rd international conference on Machine
learning. ACM; 2006. Dynamic topic models; p. 113-120.
8. Blei, David M., Ng, Andrew Y., Jordan, Michael I. Latent dirichlet allocation. Journal of machine
Learning research. Jan.2003 3:993–1022.
9. Bojanowski, Piotr, Grave, Edouard, Joulin, Armand, Mikolov, Tomas. Enriching word vectors with
subword information. arXiv:1607.04606. 2016
10. Bol, Guus M., Vesuna, Farhad, Xie, Min, Zeng, Jing, Aziz, Khaled, Gandhi, Nishant, Levine,
Anne, Irving, Ashley, Korz, Dorian, Tantravedi, Saritha, et al. Targeting ddx3 with a small
molecule inhibitor for lung cancer therapy. EMBO molecular medicine. 2015; 7(5):648–669.
[PubMed: 25820276]
11. Bol, Guus Martinus, Xie, Min, Raman, Venu. Ddx3, a potential target for cancer treatment.
Molecular cancer. 2015; 14(1):188. [PubMed: 26541825]
12. Chen HH, Yu HI, Cho WC, Tarn WY. Ddx3 modulates cell adhesion and motility and cancer cell
metastasis via rac1-mediated signaling pathway. Oncogene. 2015; 34(21):2790–2800. [PubMed:
25043297]
13. Cruciat, Cristina-Maria, Dolde, Christine, de Groot, Reinoud EA., Ohkawara, Bisei, Reinhard,
Carmen, Korswagen, Hendrik C., Niehrs, Christof. Rna helicase ddx3 is a regulatory subunit of
casein kinase 1 in wnt–β-catenin signaling. Science. 2013; 339(6126):1436–1441. [PubMed:
23413191]
14. Neo4J Developers. Neo4j. Graph NoSQL Database [online]. 2012
15. Divoli, Anna, Mendonça, Eneida A., Evans, James A., Rzhetsky, Andrey. Conflicting biomedical
assumptions for mathematical modeling: the case of cancer metastasis. PLoS Comput Biol. 2011;
7(10):e1002132. [PubMed: 21998558]
16. El-Kishky, Ahmed, Song, Yanglei, Wang, Chi, Voss, Clare R., Han, Jiawei. Scalable topical phrase
mining from text corpora. Proceedings of the VLDB Endowment. 2014; 8(3):305–316.
17. Evans, James A., Rzhetsky, Andrey. Advancing science through mining libraries, ontologies, and
communities. Journal of Biological Chemistry. 2011; 286(27):23659–23666. [PubMed: 21566119]
18. Foster, Jacob G., Rzhetsky, Andrey, Evans, James A. Tradition and innovation in scientists research
strategies. American Sociological Review. 2015; 80(5):875–908.
19. Goh, Kwang-Il, Cusick, Michael E., Valle, David, Childs, Barton, Vidal, Marc, Barabási, Albert-László. The human disease network. Proceedings of the National Academy of Sciences. 2007;
104(21):8685–8690.
20. Griffiths, Thomas L., Steyvers, Mark. Finding scientific topics. Proceedings of the National
academy of Sciences. 2004; 101(suppl 1):5228–5235.
21. Gropp, Chris, Herzog, Alexander, Safro, Ilya, Wilson, Paul W., Apon, Amy W. Scalable dynamic
topic modeling with clustered latent dirichlet allocation (clda). arXiv:1610.07703. 2016
22. Hao, Lei, Zou, Zhongmin, Tian, Hong, Zhang, Yubo, Zhou, Huchuan, Liu, Lei. Stem cell-based
therapies for ischemic stroke. BioMed research international. 2014; 2014:468748. [PubMed:
24719869]
23. Joulin, Armand, Grave, Edouard, Bojanowski, Piotr, Mikolov, Tomas. Bag of tricks for efficient
text classification. arXiv:1607.01759. 2016
24. Krol, Jacek, Krol, Ilona, Patricia, Claudia, Alvarez, Patino, Fiscella, Michele, Hierlemann,
Andreas, Roska, Botond, Filipowicz, Witold. A network comprising short and long noncoding rnas
and rna helicase controls mouse retina architecture. Nature communications. 2015; 6
25. Kwong, Ann D., Rao, B Govinda, Jeang, Kuan-Teh. Viral and cellular rna helicases as antiviral
targets. Nature reviews Drug discovery. 2005; 4(10):845–853. [PubMed: 16184083]
26. Lee, Sejoon, Lee, Kwang H., Song, Min, Lee, Doheon. Building the process-drug–side effect
network to discover the relationship between biological processes and side effects. BMC
bioinformatics. 2011; 12(2):S2.
27. Liu, Chun-Chi, Tseng, Yu-Ting, Li, Wenyuan, Wu, Chia-Yu, Mayzus, Ilya, Rzhetsky, Andrey, Sun,
Fengzhu, Waterman, Michael, Chen, Jeremy JW., Chaudhary, Preet M., et al. Diseaseconnect: a
comprehensive web server for mechanism-based disease–disease connections. Nucleic acids
research. 2014; 42(W1):W137–W146. [PubMed: 24895436]
28. Liu, Zhiyuan, Zhang, Yuzhou, Chang, Edward Y., Sun, Maosong. Plda+: Parallel latent dirichlet
allocation with data placement and pipeline processing. ACM Transactions on Intelligent Systems
and Technology (TIST). 2011; 2(3):26.
29. Maga, Giovanni, Falchi, Federico, Garbelli, Anna, Belfiore, Amalia, Witvrouw, Myriam, Manetti,
Fabrizio, Botta, Maurizio. Pharmacophore modeling and molecular docking led to the discovery of
inhibitors of human immunodeficiency virus-1 replication targeting the human cellular aspartic
acid-glutamic acid-alanine-aspartic acid box polypeptide 3. Journal of medicinal chemistry. 2008;
51(21):6635–6638. [PubMed: 18834110]
30. Mikolov, Tomas, Chen, Kai, Corrado, Greg, Dean, Jeffrey. Efficient estimation of word
representations in vector space. arXiv:1301.3781. 2013
31. Mikolov, Tomas, Sutskever, Ilya, Chen, Kai, Corrado, Greg S., Dean, Jeff. Distributed
representations of words and phrases and their compositionality. Advances in neural information
processing systems. 2013:3111–3119.
32. Muja, Marius, Lowe, David G. Scalable nearest neighbor algorithms for high dimensional data.
IEEE Transactions on Pattern Analysis and Machine Intelligence. 2014; 36(11):2227–2240.
[PubMed: 26353063]
33. Oprea TI, Mestres J. Drug repurposing: far beyond new targets for old drugs. The AAPS journal.
2012; 14(4):759–763. [PubMed: 22826034]
34. Safro I, Ron D, Brandt A. Graph minimum linear arrangement by multilevel weighted edge
contractions. Journal of Algorithms. 2006; 60(1):24–41.
35. Safro, Ilya, Hovland, Paul D., Shin, Jaewook, Strout, Michelle Mills. Improving random walk
performance. CSC. 2009:108–112.
36. Safro, Ilya, Temkin, Boris. Multiscale approach for the network compression-friendly ordering. J.
Discrete Algorithms. 2011; 9(2):190–202.
37. Samal, Sabindra K., Routray, Samapika, Veeramachaneni, Ganesh Kumar, Dash, Rupesh,
Botlagunta, Mahendran. Ketorolac salt is a newly discovered ddx3 inhibitor to treat oral cancer.
Scientific reports. 2015; 5:9982. [PubMed: 25918862]
38. Smalheiser, Neil R., Swanson, Don R. Using arrowsmith: a computer-assisted approach to
formulating and assessing scientific hypotheses. Computer methods and programs in biomedicine.
1998; 57(3):149–153. [PubMed: 9822851]
39. Smalheiser, Neil R., Torvik, Vetle I., Zhou, Wei. Arrowsmith two-node search interface: A tutorial
on finding meaningful links between two disparate sets of articles in medline. Computer methods
and programs in biomedicine. 2009; 94(2):190–197. [PubMed: 19185946]
40. Soldatova, Larisa N., Rzhetsky, Andrey. Representation of research hypotheses. Journal of
biomedical semantics. 2011; 2(2):S9.
41. Spangler, Scott. Accelerating Discovery: Mining Unstructured Information for Hypothesis
Generation. Vol. 37. CRC Press; 2015.
42. Spangler, Scott, Wilkins, Angela D., Bachman, Benjamin J., Nagarajan, Meena, Dayaram, Tajhal,
Haas, Peter, Regenbogen, Sam, Pickering, Curtis R., Comer, Austin, Myers, Jeffrey N., et al.
Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and
data mining. ACM; 2014. Automated hypothesis generation based on mining scientific literature;
p. 1877-1886.
43. Sun, Mianen, Song, Ling, Zhou, Tong, Gillespie, G Yancey, Jope, Richard S. The role of ddx3 in
regulating snail. Biochimica et Biophysica Acta (BBA)-Molecular Cell Research. 2011; 1813(3):
438–447. [PubMed: 21237216]
44. Swanson, Don, Smalheiser, Neil. Link analysis of MEDLINE titles as an aid to scientific discovery;
Proceedings of the AAAI Fall Symposium on Artificial Intelligence and Link Analysis; 1998. p.
94-97.
45. Swanson, Don R. Undiscovered public knowledge. The Library Quarterly. 1986; 56(2):103–118.
46. Swanson, Don R., Smalheiser, Neil R. Implicit text linkage between MEDLINE records: using Arrowsmith as an aid to scientific discovery. Library trends. 1999; 48(1):48.
47. Sybrandt, Justin, Safro, Ilya. MOLIERE: Automatic biomedical hypothesis generation system.
MEDLINE knowledge graph. implementation. 2017
48. van Voss, Marise R Heerma, Schrijver, Willemijne AME., ter Hoeve, Natalie D., Hoefnagel,
Laurien D., Manson, Quirine F., van der Wall, Elsken, Raman, Venu, van Diest, Paul J., et al.
Dutch Distant Breast Cancer Metastases Consortium. The prognostic effect of DDX3 upregulation
in distant breast cancer metastases. Clinical & Experimental Metastasis. 2016:1–8. [PubMed:
27988895]
49. Wang, Huijun, Ding, Ying, Tang, Jie, Dong, Xiao, He, Bing, Qiu, Judy, Wild, David J. Finding
complex biological relationships in recent PubMed articles using Bio-LDA. PLoS ONE. 2011;
6(3):e17243. [PubMed: 21448266]
50. Yim, Daniel GR., Virshup, David M. Unwinding the Wnt action of casein kinase 1. Cell research.
2013; 23(6):737. [PubMed: 23567556]
51. Zhou, XueZhong, Menche, Jörg, Barabási, Albert-László, Sharma, Amitabh. Human symptoms–
disease network. Nature communications. 2014; 5
Figure 1.
Running times of each network construction phase. All phases ran on the single node
described in Section 4.4. Not shown: initial text processing, which was handled by a large
array of small nodes.
Figure 2.
MOLIERE network construction pipeline.
Figure 3.
MOLIERE query pipeline.
Figure 4.
Process of extending a path to a cloud of abstracts.
Figure 5.
Distribution of depression-related n-grams from Venlafaxine queries.
Figure 6.
Distribution of anxiety-related n-grams from Venlafaxine queries.