Knowledge Based Systems (Sci)
a School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
b Medical Record Room, The 2nd Affiliated Hospital of Harbin Medical University, Harbin 150086, China
Learning and inference in knowledge-based probabilistic model for medical diagnosis
Abstract
with a probabilistic model, we propose a methodology for the creation of a medical knowledge
network (MKN) in medical diagnosis. When a set of symptoms is activated for a specific
patient, we can generate a ground medical knowledge network composed of symptom nodes
and potential disease nodes. By Incorporating a Boltzmann machine into the potential functio n
of a Markov network, we investigated the joint probability distribution of the MKN. In order
to deal with numerical symptoms, a multivariate inference model is presented that uses
conditional probability. In addition, the weights for the knowledge graph were efficie ntly
learned from manually annotated Chinese Electronic Medical Records (CEMRs). In our
experiments, we found numerically that the optimum choice of the quality of disease node and
the expression of symptom variable can improve the effectiveness of medical diagnosis. Our
experimental results comparing a Markov logic network and the logistic regression algorithm
on an actual CEMR database indicate that our method holds promise and that MKN can improve diagnostic accuracy.
1. Introduction
The World Health Organization (WHO) reports that 422 million adults have diabetes, and
1.5 million deaths are directly attributed to diabetes each year [1]. Additionally, the number of
deaths caused by cardiovascular diseases (CVDs) and cancer is estimated to be 17.5 million
and 8.2 million, respectively [2]. The WHO report on cancer shows that new cases of cancer
will increase by 70 percent over the next two decades. In the face of this situation, researchers
have begun to pay more attention to health care. According to existing studies, more than 30
percent of cancer deaths could be prevented by early diagnosis and appropriate treatment [3].
Because an accurate diagnosis contributes to a proper choice of treatment and subsequent cure,
medical diagnosis plays a great role in the improvement of health care. Consequently, a means
to provide an effective intelligent diagnostic method to assist clinicians by reducing costs and
improving the accuracy of diagnosis has been a critical goal in efforts to enhance the patient experience.
Classification is one of the most widely researched topics in medical diagnosis. The
general model classifies a set of symptom data into one of several predefined categories of
disease for cases of medical diagnosis. A decision tree [4-5] is a classic algorithm in the medical
classification domain, one that uses the information entropy method; however, it is sensitive to
inconsistencies in the data. The support vector machine [6-8] has a solid theoretical basis for
the classification task; because of its efficient selection of features, it has higher predictive
accuracy than decision trees. Bayesian networks [9-10], which are based on Bayesian theory
[11-12], describe the dependence relationship between the symptom variables and the disease
variables; these can be used in medical diagnosis. Other diagnostic models include neural
networks (NN) [13-15], fuzzy logic (FL) [16-17], and genetic algorithms (GAs) [18-20]. Each algorithm has its own advantages and disadvantages.
Existing studies have mainly focused on exploring effective methods for improving the
accuracy of disease classification. However, these methods often ignore the importance of the medical knowledge itself. Although the Markov logic network is an inferential model based on first-order logic rules, it applies only to binary features, which conflicts with the numerical character of symptoms. In this paper, we focus on combining medical
knowledge with a novel probabilistic model to assist clinicians in making intelligent diagnoses.
(1) In order to obtain medical knowledge from Chinese Electronic Medical Records
(CEMRs), we adopted techniques for the recognition of named entities and entity relationships.
By mapping named entities and entity relationships into sentences, we built a medical knowledge base in first-order logic.
(2) We mapped the first-order knowledge base into a knowledge graph. This graph is composed of medical entities (nodes) and entity relationships (edges). Furthermore, the graph can also be an intuitive reflection of the inferential structure of
the knowledge.
(3) We developed a novel probabilistic model for medical diagnosis that is based on Markov network theory. To meet the requirements of multivariate features in medical diagnosis, we incorporated a Boltzmann machine into the potential function of the Markov network. The model can simultaneously handle both binary and numerical indexes of symptoms.
(4) By a numerical comparison with other diagnostic models for CEMRs, we found that
our probabilistic model is more effective for diagnosing several diseases according to the
measures of precision for the first 10 results (P@10), recall for the first 10 results (R@10), and average discounted cumulative gain (DCG).
The rest of this paper is organized as follows. In Section 2, we introduce Chinese Electronic
Medical Records and the knowledge graph. In Section 3, we review the fundamentals of
Markov networks and Markov logic networks. In Section 4, the knowledge-based probabilistic model is proposed, together with the derivation of learning and inference. In Section 5, we further evaluate the effectiveness and
accuracy of our probabilistic model for medical diagnosis. Finally, we conclude this paper and outline future work.
Electronic medical records (EMRs) [22] are a systematized collection of patient health
information in a digital format. As the crucial carrier of recorded medical activity, EMRs
contain significant medical knowledge [23-24]. Therefore, for this study we adopted Chinese
Electronic Medical Records (CEMRs) in free-form text as the primary source of medical
knowledge. These CEMRs, which have had protected health information (PHI) [25] removed,
come from the Second Affiliated Hospital of Harbin Medical University, and we have obtained
the usage rights for research. These CEMRs mainly include five kinds of free-form text:
discharge summary, progress note, complaints of the patient, disease history of the patient, and
the communication log. Considering the abundance of medical knowledge and the difficulty of
Chinese text processing, we chose the discharge summary and the progress note as the source
for knowledge extraction. The structures of the discharge summary and progress note are shown in Figs. 1 and 2.

[Fig. 1 fields: characteristics of case, diagnostic basis, differential diagnosis, treatment plan.]
Fig. 1. Sample of discharge summary from the Second Affiliated Hospital of Harbin Medical University.

[Fig. 2 fields: treatment process, discharge condition, therapeutic effect, physician's advice.]
Fig. 2. Sample of progress note from the Second Affiliated Hospital of Harbin Medical University.
2.2. Corpus
The recognition of named entities [26] and entity relationships [27] is an important aspect
in the extraction of medical knowledge from CEMRs. Referencing the medical concept
annotation guideline and the assertion annotation guideline given by Informatics for Integrating
Biology and the Bedside (i2b2) [28], we have drawn on the guidelines for CEMRs [29] and
manually annotated the named entity and entity relationship of 992 CEMRs as the resource of
medical knowledge. For this diagnostic task, this study only kept “symptom” entities, “disease”
entities, and the “indication” relationship. The “indication” relationship holds when the related
“symptom” indicates that the patient suffers from the related “disease.” In addition, there are
three modifiers for “symptom” entities, namely present, absent, and possible.
The medical knowledge obtained from the 992 CEMRs can be comprehended as a set of first-order logic rules among “symptom” entities, “disease” entities, and “indication” relationships, from which a medical knowledge base may be constructed. However, the medical knowledge base lacks an intuitive structure. We therefore built a knowledge graph, with the “disease” and “symptom” entities as nodes and the “indication” relationships as edges. As the reliability of a piece of medical knowledge increases, the weight of the corresponding edge grows. The topology of the knowledge graph is shown in Fig. 3.
The nodes in the knowledge graph are divided into two different colors according to the
type of entity, the red nodes and the green nodes representing “symptom” entities and “disease”
entities, respectively. This graph contains 173 kinds of disease and 508 kinds of symptom.

A Markov logic network (MLN) combines first-order logic with a probabilistic graph model for solving problems of complexity and uncertainty, building on the methodology of Markov networks (MNs) [30]. From a first-order-logic point of view, it can briefly present uncertainty rules and can tolerate incomplete and contradictory knowledge.
A Markov network, which is a model for the joint distribution of a set of variables $X = (X_1, X_2, \ldots, X_n)$, provides the theoretical basis for a Markov logic network. The joint distribution is given by

$$P(X = x) = \frac{1}{Z} \prod_k \phi_k(x_{\{k\}}) \qquad (1)$$

where $x_{\{k\}}$ is the state of the $k$th clique and $Z$, known as the normalization function, is given by $Z = \sum_{x} \prod_k \phi_k(x_{\{k\}})$. The most widely used method for approximate inference in MNs is Markov chain Monte Carlo (MCMC), and Gibbs sampling in particular. Another popular alternative is belief propagation.
A medical knowledge base is a set of rules in first-order logic. Rules are composed using
four types of symbols: constants, variables, functions, and predicates. A term is any expression representing an object in the domain; an atom is a predicate applied to a tuple of terms. A ground atom is an atom all of whose arguments are ground terms. A possible world assigns a truth value to each possible ground atom.
A Markov logic network can be considered as a template for generating Markov networks.
Given different sets of constants, it will generate different Markov networks. According to the
definition of a Markov network, the joint distribution of a Markov logic network is given by
$$P(X = x) = \frac{1}{Z} \exp\Big(\sum_i \omega_i n_i(x)\Big) = \frac{1}{Z} \prod_i \phi_i(x_{\{i\}})^{\,n_i(x)} \qquad (2)$$

where $n_i(x)$ is the number of true groundings of the $i$th rule $R_i$ in $x$; $\omega_i$ is the weight for $R_i$; $x_{\{i\}}$ is the state of the atoms appearing in $R_i$; and $\phi_i(x_{\{i\}}) = e^{\omega_i}$. Because MLN focuses only on binary features, the ground atoms take discrete values $x \in \{0, 1\}$.
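As a toy illustration of Eq. (2), the unnormalized weight of a possible world is $\exp(\sum_i \omega_i n_i(x))$; the sketch below uses one hypothetical rule over two ground atoms (the atoms, rule, and weight are not from the paper):

```python
import math
from itertools import product

# Toy illustration of Eq. (2): two binary ground atoms (cough, flu) and one
# hypothetical rule "cough => flu" with weight 1.5; n_i(x) counts its true
# groundings in a world.
weights = [1.5]

def n_true(world):
    cough, flu = world
    # the implication is false only when cough = 1 and flu = 0
    return [0 if (cough == 1 and flu == 0) else 1]

worlds = list(product([0, 1], repeat=2))
unnorm = {w: math.exp(sum(wt * n for wt, n in zip(weights, n_true(w))))
          for w in worlds}
Z = sum(unnorm.values())                  # normalization function
probs = {w: u / Z for w, u in unnorm.items()}
# the rule-violating world (1, 0) receives the lowest probability
```

Worlds that violate the weighted rule are not impossible, merely less probable, which is the key difference from pure first-order logic.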
To apply MLN in medical diagnosis, the atom is considered as the medical entity. When
the medical entity presents an explicit condition for a patient, the corresponding atom of this
entity in MLN is assigned the value 1; otherwise, 0. By this mapping method, we are able to
convert medical knowledge into the binary rules of the MLN. As medical knowledge
accumulates, an MLN will be built, and the maximum probability model of MLN can be used
for medical diagnosis. Given a series of symptoms, the risk probability for a specific disease is calculated by conditional inference over the ground network. Computing this conditional probability can be transformed into a satisfiability problem, in which an assignment of the variables must be found.
4. Methodology
Although MLN can be used for medical diagnosis, it is only suitable for binary rules. The
reason is that the values of $n_i(x, y)$ are uncountable when $x$ is a continuous variable. Thus,
MLN is inefficient for multivariate rules. In the health care field, the indexes of symptoms are
often expressed in numeric form. Thus, the existing MLN methodology has some obvious
shortcomings for numeric-based diagnosis. This section addresses that problem. By changing the form of the potential function, we incorporate the continuous variable $x$ into the joint distribution of the MN; a conditional probability model for inference is then deduced via a Boltzmann machine, and a learning model for calculating the weight of each rule is proposed.
Based on the previously mentioned knowledge graph, we propose a model for handling
numeric-based diagnosis, combining the knowledge graph with the theoretical basis of Markov
networks. The novel theoretical framework is named the medical knowledge network (MKN).
Definition 1. An MKN $L$ is a set of pairs $(R_i, \omega_i)$, where $R_i$ is a piece of medical knowledge in first-order logic and $\omega_i$ is the reliability of $R_i$. Together with a finite set of constants $C = \{c_1, c_2, \ldots, c_n\}$, it defines a ground medical knowledge network $M_{L,C}$ as follows:
1. $M_{L,C}$ contains one multivariate node for each possible grounding of each medical entity appearing in $L$. The value of the node is the quantified indicator of the symptom entity or disease entity.
2. $M_{L,C}$ contains one weight for each piece of medical knowledge. This weight is the $\omega_i$ associated with $R_i$ in $L$.
Each ground rule corresponds to a clique, and the potential function describes the state of the corresponding clique. Therefore, the potential function of MKN can also be regarded as an energy-based function. Incorporating the quantified indicator of each entity into the potential function is an important step for numeric-based diagnosis. From statistical physics, we can express the potential function as

$$\phi(D) = \exp(-\Phi(D))$$

where $\Phi(D)$ is often called an energy function. The set $D$ is the state of the “symptom” entity and the “disease” entity in a ground rule. This form corresponds to the unrestricted Boltzmann machine [32], which is one of the earliest types of Markov network. The energy function associated with the “indication” relationship is defined by a particularly simple pairwise form:

$$\Phi(D) = \sum_{ij} \phi_{ij}(x_i, x_j) = -\sum_{ij} \omega_{ij} x_i x_j \qquad (5)$$

where $x_i$ and $x_j$ represent the value of the “symptom” entity and the “disease” entity, respectively, and $\omega_{ij}$ weights the contribution to the energy function. According to Eq. (1), the joint distribution of the MKN is

$$P(X = x) = \frac{1}{Z} \prod_{r_i \in R} \phi_i(D) = \frac{1}{Z} \prod_{r_i \in R} \exp(-\Phi(D)) = \frac{1}{Z} \exp\Big(\sum_{r_i \in R} \omega_i x_{r_i}^{s} x_{r_i}^{d}\Big) \qquad (6)$$

where $x_{r_i}^{s}$ and $x_{r_i}^{d}$ are the values of the “symptom” entity and the “disease” entity in the $i$th ground rule. A Boltzmann machine also contains parameters $u_i$ that encode individual node potentials. These activated individual variables will stress the effect of the “symptom” entity in the energy function. The rewritten probability formula is given as

$$P(X = x) = \frac{1}{Z} \exp\Big(\sum_{r_i \in R} \big(\omega_i x_{r_i}^{s} x_{r_i}^{d} + u_{x_{r_i}^{s}} x_{r_i}^{s}\big)\Big) \qquad (7)$$
As can be seen, when a “symptom” entity is activated, the factor of the corresponding
individual node potential will be considered a major component of the model; this is exactly
consistent with a clinical diagnosis that is based on symptoms. In this paper, we adopt the
Gaussian potential function (GPF) as the individual node potential, which is expressed as

$$u_{x_{r_i}^{s}} = \sum_{j=1}^{n} m_{x_j^{d}}\, e^{-\big(d_{x_{r_i}^{s} x_j^{d}} / \theta\big)^{2}} \qquad (8)$$
where $d_{x_{r_i}^{s} x_j^{d}}$ represents the distance between node $x_{r_i}^{s}$ and its neighboring node $x_j^{d}$ in the knowledge graph, the influence factor $\theta$ is used for controlling the influence range of each node, and $m_{x_j^{d}}$ is the quality of node $x_j^{d}$. The final probability model is defined as the following distribution:
$$P(X = x) = \frac{1}{Z} \exp\bigg(\sum_{r_i \in R} \Big(\omega_i x_{r_i}^{s} x_{r_i}^{d} + \sum_{j=1}^{n} m_{x_j^{d}}\, e^{-\big(d_{x_{r_i}^{s} x_j^{d}} / \theta\big)^{2}} x_{r_i}^{s}\Big)\bigg) \qquad (9)$$
Using Definition 1 and the deduced joint distribution, Algorithm 1 provides the procedure for constructing an MKN.
Input: List EMR: a list of the electronic medical records for training.
Output: Network: a medical knowledge network.
Begin
1: Initialize the lists of nodes Set node and list of edges Set edge in MKN.
2: Extract the entities and relationships from the List EMR → Rules = {rule1 ,
rule2 , …, rulen }.
3: Initialize the weights for Rules by a fixed value ω.
4: for rulei ∈ Rules do
5: Initialize the symptom node Nodesymptom and disease node Nodedisease.
6: Parse the symptom predicate and the disease predicate from rulei →
Nodesymptom and Nodedisease.
7: Set node ← Nodesymptom.
8: Set node ← Nodedisease.
9: Define the relationship Edgei between Nodesymptom and Nodedisease.
10: Add Edgei to Set edge, and assign ω as the weight of Edgei.
11: end for
12: Call PageRank(Set node, Set edge).
13: After calculating the PageRank of all nodes, the MKN is built: Set node, Set edge
→ Network.
14: return Network.
Through traversing the rules and parsing the predicates, a medical knowledge network can
be implemented. The PageRank function is used as the quality of each node. To characterize
the reliability of medical knowledge, we set up a fixed value ω for the initial network.
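Algorithm 1 can be sketched as follows; the rule list, the fixed initial weight ω, and the plain power-iteration PageRank are simplifying assumptions for illustration, not the authors' implementation:

```python
# Sketch of Algorithm 1 under simplifying assumptions: rules arrive as
# (symptom, disease) pairs, every edge gets the fixed initial weight omega,
# and node quality is a plain power-iteration PageRank (all names hypothetical).
rules = [("cough", "flu"), ("fever", "flu"), ("fever", "pneumonia")]
omega = 0.5  # fixed initial reliability, Algorithm 1 step 3

nodes, weights = set(), {}
for s, d in rules:
    nodes.update([s, d])          # steps 5-8: collect symptom/disease nodes
    weights[(s, d)] = omega       # steps 9-10: add the weighted edge

# adjacency in the undirected knowledge graph
adj = {n: set() for n in nodes}
for s, d in weights:
    adj[s].add(d)
    adj[d].add(s)

# step 12: PageRank by power iteration (damping 0.85) as node quality m
pr = {n: 1.0 / len(nodes) for n in nodes}
for _ in range(50):
    pr = {n: 0.15 / len(nodes) + 0.85 * sum(pr[nb] / len(adj[nb]) for nb in adj[n])
          for n in nodes}
# hub nodes ("flu", "fever") end up with higher quality than leaf nodes
```

The resulting `pr` values play the role of the node quality $m$ used in Eqs. (8) and (9).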
4.2. Inference
Medical inference can answer the two common generic clinical questions: “What is the
probability that rule R1 holds given rule R2 ?” and “What is the probability of disease D1
given the symptom vector S1 ?” In response to the first problem, we can answer by computing
$$P(R_1 \mid R_2, L, C) = P(R_1 \mid R_2, M_{L,C}) = \frac{P(R_1 \wedge R_2 \mid M_{L,C})}{P(R_2 \mid M_{L,C})} = \frac{\sum_{x \in \mathcal{X}_{R_1} \cap\, \mathcal{X}_{R_2}} P(X = x \mid M_{L,C})}{\sum_{x \in \mathcal{X}_{R_2}} P(X = x \mid M_{L,C})} \qquad (10)$$

where $\mathcal{X}_{R_i}$ is the set of worlds in which $R_i$ holds, and $P(x \mid M_{L,C})$ is given by Eq. (9). Through
the free combinations of pairs of disconnected atoms in MKN, some new rules will be derived
by inference. When the probability of a new rule exceeds a certain threshold, it can be
concluded that the new rule is reliable under the current base. Rule inference not only helps
enrich the knowledge base but is also a self-learning mechanism for the MKN.
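Eq. (10) can be illustrated by enumerating the worlds of a small ground network; the atoms, rules, and weights below are hypothetical, and the individual node potentials are omitted for brevity:

```python
import math
from itertools import product

# Sketch of Eq. (10): P(R1 | R2) by enumerating the worlds of three binary
# atoms (a, b, c) under a toy log-linear model; rules, weights, and the
# omission of individual node potentials are all simplifying assumptions.
w_ab, w_bc = 1.0, 0.5                     # weights of ground rules a-b, b-c

def score(world):
    a, b, c = world
    return math.exp(w_ab * a * b + w_bc * b * c)

worlds = list(product([0, 1], repeat=3))

def holds_r1(w):                          # R1: both a and b hold
    return w[0] == 1 and w[1] == 1

def holds_r2(w):                          # R2: b holds
    return w[1] == 1

num = sum(score(w) for w in worlds if holds_r1(w) and holds_r2(w))
den = sum(score(w) for w in worlds if holds_r2(w))
p_r1_given_r2 = num / den                 # ratio of Eq. (10); Z cancels
```

Because the normalization function appears in both numerator and denominator, it cancels, which is why only unnormalized world scores are needed.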
The second inference question is what is usually meant by “disease diagnosis.” On the
condition that the patient has a given symptom vector, we can predict the risk probability for a
specific disease. This can be classified as a typical problem of conditional probability. The risk probability is calculated by

$$P(Y = y \mid B_l = b_l) = \frac{\exp\Big(\sum_{r_i \in R_l}\big(\omega_i x_{r_i} y_{r_i} + \sum_{j=1}^{n} m_{y_j}\, e^{-(d_{x_{r_i} y_j}/\theta)^{2}} x_{r_i}\big)\Big)}{\sum_{y' \in \{0,1\}} \exp\Big(\sum_{r_i \in R_l}\big(\omega_i x_{r_i} y' + \sum_{j=1}^{n} m_{y_j}\, e^{-(d_{x_{r_i} y_j}/\theta)^{2}} x_{r_i}\big)\Big)} \qquad (11)$$

where $R_l$ is the set of ground rules in which disease $y$ appears, and $b_l$ is the Markov
blanket of $y$. The Markov blanket of a node is the minimal set of nodes that renders it independent of the remaining network; this is simply the set of that node's neighbors in the knowledge graph. Corresponding to the $i$th ground rule, $y_{r_i}$ is the value (0 or 1) of disease $y$.
In contrast with the MLN diagnostic model, MKN avoids the complexity problem of $n_i(x, y)$ and incorporates the quantitative value of the symptom, $x_{r_i}$, into the diagnostic model.
Following Eq. (11), we propose a disease diagnostic algorithm based on MKN. To provide a reliable diagnosis, we need to calculate the risk of each disease. Given the evidence, Algorithm 2 generates a list of potential diseases sorted by diagnostic probability.
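The scoring step of Eq. (11) for one candidate disease might be sketched as follows; all weights, symptom values, and node qualities are hypothetical:

```python
import math

# Sketch of the scoring step behind Eq. (11) for one candidate disease y.
# Numbers are hypothetical: omega is the learned rule weight, x the symptom
# value, m the PageRank quality of a neighboring disease node; d = 1 and
# theta = 1 as in the parameter analysis of Section 5.
THETA = 1.0

def risk(evidence):
    """evidence: list of (omega, x, m) for the ground rules containing y."""
    def energy(y_val):
        total = 0.0
        for omega, x, m in evidence:
            gpf = m * math.exp(-(1.0 / THETA) ** 2)  # Gaussian node potential
            total += omega * x * y_val + gpf * x
        return math.exp(total)
    # P(y = 1 | Markov blanket): normalize over both settings of y
    return energy(1) / (energy(0) + energy(1))

# Two activated symptoms indicating y; the GPF terms cancel in the ratio,
# so the risk reduces to a logistic function of the weighted symptom sum.
r = risk([(0.8, 1.0, 0.3), (0.6, 1.0, 0.3)])
```

Ranking `risk` over all candidate diseases yields the sorted diagnosis list that Algorithm 2 returns.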
4.3. Learning
A learning model is proposed for the calculation of the weight for each piece of medical
knowledge from the Chinese Electronic Medical Records. In this study, we adopted the
gradient descent method. Assuming independence among diseases, the learning model first defines the pseudo-likelihood

$$P^{*}(Y = y) = \prod_{l=1}^{m} P\big(Y_l = y_l \mid M_{L,C}\big) \qquad (12)$$
where $m$ is the dimension of the disease vector $y$. By Eq. (12), the derivative of the log-likelihood function with respect to the weight for the $i$th rule is

$$\frac{\partial}{\partial \omega_i} \log P^{*}(Y = y) = \frac{\partial}{\partial \omega_i} \sum_{l=1}^{m} \log P\big(Y_l = y_l \mid M_{L,C}\big) = \sum_{l=1}^{m} \frac{\partial}{\partial \omega_i} \log P\big(Y_l = y_l \mid M_{L,C}\big) \qquad (13)$$
The direct calculation of $\frac{\partial}{\partial \omega_i} \log P(Y_l = y_l \mid M_{L,C})$ is intractable, so we construct the derivative of the log-likelihood explicitly. From Eq. (9), the normalization function is

$$Z = \sum_{y' \in \mathcal{Y}} \exp\bigg(\sum_{r_i \in R}\Big(\omega_i x_{r_i} y'_{r_i} + \sum_{j=1}^{n} m_{y_j}\, e^{-(d_{x_{r_i} y_j}/\theta)^{2}} x_{r_i}\Big)\bigg) \qquad (14)$$

so that

$$\log P(Y = y \mid M_{L,C}) = \sum_{r_i \in R}\Big(\omega_i x_{r_i} y_{r_i} + \sum_{j=1}^{n} m_{y_j}\, e^{-(d_{x_{r_i} y_j}/\theta)^{2}} x_{r_i}\Big) - \log Z \qquad (15)$$

$$\frac{\partial}{\partial \omega_i} \log P(Y = y \mid M_{L,C}) = x_{r_i} y_{r_i} - \frac{1}{Z}\sum_{y' \in \mathcal{Y}} \exp\bigg(\sum_{r_i \in R}\Big(\omega_i x_{r_i} y'_{r_i} + \sum_{j=1}^{n} m_{y_j}\, e^{-(d_{x_{r_i} y_j}/\theta)^{2}} x_{r_i}\Big)\bigg)\, x_{r_i} y'_{r_i} = x_{r_i} y_{r_i} - \sum_{y' \in \mathcal{Y}} P(Y = y' \mid M_{L,C})\, x_{r_i} y'_{r_i} \qquad (16)$$

where $\mathcal{Y}$ is the set of all possible values of $y$, and $P(Y = y' \mid M_{L,C})$ is given by Eq. (9).
By substituting Eq. (16) into Eq. (13), the derivative of the log-likelihood with respect to the weight for the $i$th rule can be naturally calculated. We get the final expression

$$\frac{\partial}{\partial \omega_i} \log P^{*}(Y = y) = \sum_{l=1}^{m}\Big(x_{r_i} y_{r_i} - \sum_{y' \in \mathcal{Y}} P(Y = y' \mid M_{L,C})\, x_{r_i} y'_{r_i}\Big) \qquad (17)$$

The weight is then updated iteratively by gradient ascent with learning rate $\eta$:

$$\omega_{i,t} = \omega_{i,t-1} + \eta \cdot \frac{\partial}{\partial \omega_i} \log P^{*}(Y = y)\Big|_{t-1} \qquad (18)$$
The detailed procedure for the weight learning model is presented in Algorithm 3.
8: //sij represents the symptom value of evidencej for the ith rule
9: //dij represents the disease value of evidencej for the ith rule
10: end for
11: 𝜔𝑖,𝑡 = 𝜔𝑖,𝑡−1 + 𝜂 ∙ 𝑠𝑙𝑜𝑝𝑒.
12: end while
13: weight i ← 𝜔𝑖,𝑡 .
14: end for
15: return Weights.
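The update of Eqs. (17)-(18) can be sketched for a single rule weight; the one-disease-per-record setting and the data below are hypothetical simplifications:

```python
import math

# Sketch of Eqs. (17)-(18) for a single rule weight omega, under a strong
# simplification: one binary disease y per training record and the individual
# node potentials folded away, so that P(y' = 1) = sigmoid(omega * x).
# The records below are hypothetical.
def slope(omega, records):
    total = 0.0
    for x, y in records:                          # x: symptom value, y: label
        p1 = 1.0 / (1.0 + math.exp(-omega * x))   # model P(y' = 1)
        total += x * y - p1 * x                   # observed minus expected count
    return total

records = [(1.0, 1), (2.0, 1), (1.0, 0)]
omega, eta = 0.0, 0.1                             # eta: learning rate
for _ in range(100):
    omega += eta * slope(omega, records)          # Eq. (18) update step
# omega settles where the observed and expected counts match (slope near 0)
```

The gradient has the classic "observed minus expected" form of log-linear models, which is why the update drives the model statistics toward the empirical ones.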
In summary, we adopt the log-likelihood function and the gradient descent method to learn the weight vector. Fortunately, the gradient of the pseudo-log-likelihood can be calculated from the joint probability distribution in finite time. Mapping the evidence to its Markov blanket also reduces the cost of inference.
5. Experiments
To evaluate the proposed model, we conducted experiments using actual CEMRs. Based on the knowledge graph concept described in Section
2.2, we built an MKN for medical diagnosis. We chose the manually annotated 992 CEMRs
with the help of medical professionals and only kept the discharge summary and the progress
note as the source of knowledge. In the annotating process, we classified the entities into five
categories: disease, type of disease, symptom, test, and treatment; only the disease entity and
the symptom entity were extracted to complete the diagnostic task. Additionally, owing to the
lack of numerical indexes for symptoms in our CEMRs, we adopted modifiers for the symptom,
namely present, possible, and absent, to represent the symptom variable x, corresponding to 2,
1, and 0, respectively. Although the modifier of the symptom is not a continuous variable, it is sufficient to verify the multivariate inference model.
After the MKN was constructed, we randomly selected 300 untagged CEMRs as the test
corpus, and conditional random fields (CRFs) were used to automatically recognize the disease
entities and symptom entities. Based on the symptom entities on each CEMR, we inferred the
diagnosis result and ascertained whether there was consistency between the diagnosed diseases and the actual diseases recorded in each CEMR.
The description and analysis of the experiments are mainly concerned with three aspects:
the parameter analysis, the weight learning, and the relative effectiveness of MKN compared with other diagnostic models.
In this section, we focus on the optimum choices for the parameter values. In Eq. (11), $d_{x_{r_i} y_j}$ is the distance between $x_{r_i}$ and its neighboring node $y_j$ in the knowledge graph; therefore, we define $d_{x_{r_i} y_j} = 1$. The influence factor $\theta$ represents the control range of each
node. If a symptom atom and a disease atom appear in a common rule, they have an interactio n
with each other, and the two atoms in each rule can be represented as two adjacent nodes in the
knowledge graph. Since we naturally assume that the symptom node affects only its nearest neighboring nodes, we set $\theta = 1$.
The terms $m_{y_j}$, the quality of the disease node, and $x_{r_i}$, the symptom variable in the $i$th rule, are both uncertainty parameters. The selection of expressions for $m_{y_j}$ and $x_{r_i}$ will affect the accuracy of MKN in diagnosing disease. To begin, we experimented with three classical measures for $m_{y_j}$: PageRank, degree, and betweenness centrality, and we used discounted cumulative gain (DCG) [33] as the indicator to measure the accuracy of the diagnosis result:

$$\mathrm{DCG}_P = rel_1 + \sum_{i=2}^{P} \frac{rel_i}{\log_2 i} \qquad (19)$$

where $rel_i$ represents the relevance of the $i$th disease in the diagnosis result; a correct diagnosis at position $i$ contributes $rel_i = 1$, and an incorrect one contributes 0.
As structural differences between the discharge summary and the progress note can lead to different numbers of symptoms for the same patient, two experiments were used to distinguish
between them. Fig. 4 shows the DCG scores (y-axes) plotted against the serial numbers of 40
discharge summaries and 260 progress notes (x-axes) for the three measures of quality for the
disease node.
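Eq. (19) is straightforward to compute; a short sketch:

```python
import math

# Eq. (19) computed directly: rels[0] is rel_1, and the discounted sum starts
# at rank i = 2, matching the paper's indexing.
def dcg(rels):
    return rels[0] + sum(r / math.log2(i) for i, r in enumerate(rels[1:], start=2))

assert dcg([1, 0, 0, 0]) == 1.0                           # correct disease ranked first
assert abs(dcg([0, 0, 1, 0]) - 1 / math.log2(3)) < 1e-12  # discounted at rank 3
```

A single correct disease at the top of the ranking therefore yields exactly 1, which explains why most CEMRs score 1 in Fig. 4.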
Fig. 4. DCG (discounted cumulative gain) for discharge summaries and progress notes using the three measures of quality for the disease node.
Although the results show that the curves of the DCG scores are irregular, the DCG score
is 1 in most cases. From the DCG descriptions, the reason is that most of the CEMRs have only
one actual disease, and our model ranks this disease at the top of the diagnosis result. We also
observe that the effectiveness of the PageRank-based MKN is better than that of the other methods for most records.
In order to describe the diagnosis result more directly, we adopt a second measure, R@10,
which is the recall for the first 10 results. If $m$ actual diseases appear in a CEMR and $k$ of them appear in the first 10 diagnosis results, then R@10 $= k/m$. Fig. 5 shows the distribution of R@10, with the blue, green, and red bars showing the results using PageRank, betweenness centrality, and
degree, respectively. Under PageRank, the recall for nearly half the records is 1.0. By contrast,
the results for betweenness centrality and degree are unsatisfactory because they have higher
proportions with a recall of 0.0 and lower proportions with 1.0. Considering both DCG and R@10, we conclude that PageRank is more suitable to use as the quality of the disease node $m_{y_j}$.
Fig. 5. Distribution of R@10 (recall for first 10 results) for discharge summaries and progress notes using the three measures of quality for the disease node.
The second uncertainty parameter is $x_{r_i}$, the quantitative value of the symptom. Although the discrete modifier of the symptom could be employed as the representation of $x_{r_i}$ in this paper, it would not be the best choice for processing continuous values in the future. If the continuous value of a symptom is used directly as $x_{r_i}$, normalization across different symptoms becomes an important factor that could cause undesirable results. Therefore, we seek a representation of $x_{r_i}$ that not only satisfies the requirements for discrete values but is also suitable for continuous values. The sigmoid function is a typical normalization method; however, its domain does not match the value range for symptoms. For the diagnostic task, we designed an improved sigmoid function to express $x_{r_i}$, the quantitative value of the symptom. The improved sigmoid function is defined as follows:
$$S(x) = \frac{2}{1 + e^{-(x - x_{normal})^{2}}} - 1 \qquad (21)$$
where $x$ is the value of the symptom, and $x_{normal}$ is the normal value, corresponding to absent (and represented by 0) in this paper. By the characteristic of a sigmoid function, we can map any deviation from the normal value into a fixed range.
To further answer what kinds of representation of xri can improve the accuracy of the
diagnosis results, we continued with experiments comparing the sigmoid function, our
improved sigmoid, and discrete modifiers of the symptom. The experimental results are shown
in Figs. 6 and 7.
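The improved sigmoid of Eq. (21) can be sketched directly (the default `x_normal = 0` mirrors the choice above):

```python
import math

# The improved sigmoid of Eq. (21): zero at the normal value and approaching 1
# as the symptom deviates from normal in either direction.
def improved_sigmoid(x, x_normal=0.0):
    return 2.0 / (1.0 + math.exp(-(x - x_normal) ** 2)) - 1.0

assert improved_sigmoid(0.0) == 0.0                       # absent maps to 0
assert 0.0 < improved_sigmoid(1.0) < improved_sigmoid(2.0) < 1.0
assert improved_sigmoid(2.0) == improved_sigmoid(-2.0)    # symmetric about normal
```

With the discrete modifiers absent/possible/present mapped to 0, 1, and 2, this yields roughly 0, 0.46, and 0.96, so the same function covers both the discrete and the future continuous setting.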
Fig. 6. DCG (discounted cumulative gain) for discharge summaries and progress notes using the three representations of the symptom variable.
Fig. 7. Distribution of R@10 (recall for first 10 results) for discharge summaries and progress notes using the three representations of the symptom variable.
Following the same evaluation criteria, the results shown in Fig. 7 indicate that the performance of the modifier-based variable is consistent with that of the improved sigmoid function, whereas in Fig. 6 the performance of the improved sigmoid function is shown to be a little better than that of the other methods. Hence, we conclude that the improved sigmoid function not only can handle the continuous symptom variable but also performs well in the discrete field. In summary, considering MKN's ability to migrate between continuous variables and discrete variables, the improved sigmoid function should be employed as the expression of the symptom variable.
If a piece of medical knowledge exists in the knowledge base, it is assumed to be true; otherwise, it is false. In other words, the inference of the MKN
depends completely on the existing medical knowledge. To test the effectiveness of the
learning method, we compared four types of weighting: constant weighting, typical MLN-based weighting, nonnegative MLN-based weighting, and MKN-based weighting. Because negative weights would be generated by MLN, violating the credibility assumption of medical knowledge, we rewrote the learning program Tuffy [34], an open-source MLN inference engine, to ensure that the learned weights would be nonnegative: when a negative weight is learned, we replace it with the current minimum positive weight. We then compared the four types of weights to check whether the weighting might influence the diagnosis results. Tables 1 and 2 summarize
the results for the discharge summaries and the progress notes, respectively, showing P@10,
precision for the first 20 results (P@20), R@10, and average DCG.
Table 1
Results for the discharge summaries.

Weight Type              P@10     P@20     R@10     DCG-AVG
Constant Weight of 0.5 0.875 0.9 0.62 1.06
Constant Weight of 1 0.875 0.9 0.62 1.06
MLN Weight 0.8056 0.8611 0.5233 0.822
Positive MLN Weight 0.8485 0.9091 0.51 0.8402
MKN Weight 0.9 0.95 0.67 1.0983
P@10 = precision for first 10 results; P@20 = precision for first 20 results; R@10 = recall
for first 10 results; DCG-AVG = average discounted cumulative gain.
MLN = Markov logic network; MKN = medical knowledge network.
Table 2
Results for the progress notes.

Weight Type              P@10     P@20     R@10     DCG-AVG
Constant Weight of 0.5 0.7538 0.8115 0.5949 0.7909
Constant Weight of 1 0.7538 0.8115 0.5949 0.7909
MLN Weight 0.6473 0.7593 0.5006 0.6743
Positive MLN Weight 0.7409 0.8455 0.5442 0.7348
MKN Weight 0.7615 0.8692 0.6337 0.8269
P@10 = precision for first 10 results; P@20 = precision for first 20 results; R@10 = recall
for first 10 results; DCG-AVG = average discounted cumulative gain.
MLN = Markov logic network; MKN = medical knowledge network.
We can see that the constant weights of 0.5 and 1 give exactly the same results, demonstrating that the diagnosis results are not influenced when all weights are scaled equally. Furthermore, the positive MLN-based weighting is better than the typical MLN-based weighting by all evaluation criteria, whether for discharge summaries or progress notes. This indicates that the results of diagnosis are significantly improved by positivizing the negatively learned weights, which accords with the credibility assumption for the reliability of our medical knowledge. Finally, we experimentally conclude that the weight learning methods, in order of effectiveness, are MKN-based, constant, positive MLN-based, and typical MLN-based.
After determining the uncertainty parameters and the type of weights, we compared three
diagnostic systems: MLN, MKN, and the logistic regression algorithm (LR). Fig. 8 shows the
DCG curves and the distribution of R@10 using all CEMRs. MKN is clearly more accurate
than the other methods, demonstrating the promise of this approach. According to the DCG
scores, LR performs well in some CEMRs but very poorly in others; its recall values are
uniformly poor. Although LR is used in diagnosing single diseases in some studies, it hardly
applies to diagnosis of multiple diseases, especially when a patient’s symptoms are sparse.
Compared with LR, MLN performs better in the DCG, even surpassing MKN for some records.
However, there is a greater difference between MLN and MKN in R@10. We believe that the
theoretical mechanism of MLN being based on rough binary logic is the main cause of the poor
effect; the binary atoms cannot precisely capture the degree of seriousness of the symptoms.
As a result of this defect, the actual top diseases cannot be ranked in the top 10 to the extent
possible, although these diseases are indeed detected, which is reflected in the DCG. On the
other hand, this testifies to the advantage of MKN’s multivariate atoms for medical diagnosis.
Fig. 8. DCG and R@10 for all CEMRs (Chinese Electronic Medical Records): MLN
(Markov logic network), MKN (medical knowledge network), and LR (logistic regression
algorithm).
Further, we calculated the average DCG and R@10 values for each of the three diagnostic systems; the results are summarized in Table 3.
Table 3
Overall, the best-performing diagnostic method is MKN, but about 20 percent of CEMRs are still completely misdiagnosed. The most likely reason is that the medical knowledge base, built from only 992 annotated CEMRs, is small. As medical knowledge accumulates, we believe the usefulness of MKN as an intelligent system will continue to grow.
In this work, a medical knowledge graph composed of “disease” nodes and “symptom” nodes was constructed. Building on the theory of Markov networks and Markov logic networks, we developed a novel probabilistic model, called the medical knowledge network. To handle numeric-based diagnosis, the model adopts the energy function of a Boltzmann machine as its potential function. The learning and inference procedures were then rigorously derived. In contrast to a Markov logic network, the medical
knowledge network admits ternary rules and even continuous numeric rules, rather than being limited to binary rules. In our experiments, PageRank and our improved sigmoid function were used as the quality of a disease node and the expression of a symptom variable, respectively. Empirical tests on actual records show that MKN can improve diagnostic accuracy. Comparisons with other algorithms also demonstrated the effectiveness and promise of MKN.
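To make the modeling idea in the conclusion concrete, here is a minimal sketch of a Boltzmann-machine energy used as an unnormalized potential over a tiny ground network, with symptom variables allowed to take graded values in [0, 1] (e.g., a sigmoid of a numeric measurement) rather than the binary atoms of a Markov logic network. The network size, weights, and biases are toy assumptions, not the paper's learned parameters.

```python
import math

def energy(states, weights, biases):
    """Boltzmann-machine energy: E(x) = -(sum_{i<j} w_ij x_i x_j + sum_i b_i x_i)."""
    n = len(states)
    pairwise = sum(weights[i][j] * states[i] * states[j]
                   for i in range(n) for j in range(i + 1, n))
    return -(pairwise + sum(b * x for b, x in zip(biases, states)))

def potential(states, weights, biases):
    """Unnormalized potential exp(-E(x)); the joint distribution of the ground
    network is the normalized product of such potentials."""
    return math.exp(-energy(states, weights, biases))

# Toy ground network: node 0 is a disease, nodes 1-2 are its symptoms.
w = [[0.0, 0.8, 0.5],
     [0.8, 0.0, 0.0],
     [0.5, 0.0, 0.0]]
b = [0.1, 0.0, 0.0]

mild   = potential([1.0, 0.2, 0.1], w, b)  # weak symptom expressions
severe = potential([1.0, 0.9, 0.7], w, b)  # strong symptom expressions
print(severe > mild)  # stronger graded symptoms give the disease a higher score
```

Because the symptom states are continuous, the potential rises smoothly with symptom severity, which a binary-atom formulation cannot express.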
There remains ample space for future work, which falls into three main areas:
Knowledge base: We plan to annotate more records and structure more medical knowledge to enrich the knowledge graph.
Inference: We plan to test the effectiveness of the numeric rules once suitable test data are available, identifying and exploiting opportunities for inference across the whole knowledge base.
Learning: We plan to develop new learning algorithms to replace the pseudo-log-likelihood function, study dynamic approaches to weight learning, and build MKNs from sparse and incomplete data.
Acknowledgements
The Chinese Electronic Medical Records used in this paper were provided by the Second
Affiliated Hospital of Harbin Medical University. We would like to thank the reviewers for
their detailed reviews and insightful comments, which have helped to improve the quality of
this paper.
References
[3] World Health Organization, Cancer. [Online] Available: http://www.who.int/cancer/en/.
[4] A.T. Azar, S.M. El-Metwally, Decision tree classifiers for automated medical diagnosis,
Neural Computing and Applications. 23(7-8) (2013) 2387-2403.
[5] D. Lavanya, K.U. Rani, Ensemble decision tree classifier for breast cancer data,
International Journal of Information Technology Convergence and Services. 2(1) (2012)
17.
[6] B. Jin, Y.C. Tang, Y.Q. Zhang, Support vector machines with genetic fuzzy feature transformation for biomedical data classification, Information Sciences. 177(2) (2007) 476-489.
[7] M. Peker, A decision support system to improve medical diagnosis using a combination of
k-medoids clustering based attribute weighting and SVM, Journal of medical systems.
[9] Y.Y. Wee, W.P. Cheah, S.C. Tan, et al., A method for root cause analysis with a Bayesian
belief network and fuzzy cognitive map, Expert Systems with Applications. 42(1) (2015)
468-487.
[10] A.C. Constantinou, N. Fenton, W. Marsh, et al., From complex questionnaire and interviewing data to intelligent Bayesian network models for medical decision support,
[12] X. Liu, R. Lu, J. Ma, et al., Privacy-preserving patient-centric clinical decision support
system on naive Bayesian classification, IEEE journal of biomedical and health
informatics. 20(2) (2016) 655-668.
[13] A. Bhardwaj, A. Tiwari, Breast cancer diagnosis using genetically optimized neural
network model, Expert Systems with Applications. 42(1) (2015) 4611-4620.
[14] F. Amato, A. López, E.M. Peña-Méndez, et al., Artificial neural networks in medical
diagnosis, Journal of applied biomedicine. 11(2) (2013) 47-58.
[15] S. Palaniappan, R. Awang, Intelligent heart disease prediction system using data mining
[17] N.A. Korenevskiy, Application of Fuzzy Logic for Decision-Making in Medical Expert
Systems, Biomedical Engineering. 49(1) (2015) 46-49.
[18] C.A. Pena-Reyes, M. Sipper, Evolutionary computation in medicine: an overview,
Artificial Intelligence in Medicine. 19(1) (2000) 1-23.
[19] A. Jain, Medical Diagnosis using Soft Computing Techniques: A Review, International
[24] S.L. Ting, S.K. Kwok, A.H.C. Tsang, et al., A hybrid knowledge-based approach to supporting the medical prescription for general practitioners: Real case in a Hong Kong medical center, Knowledge-Based Systems. 24(3) (2011) 550-563.
[26] J. Kazama, K. Torisawa, Exploiting Wikipedia as external knowledge for named entity
recognition, Proceedings of the 2007 Joint Conference on Empirical Methods in Natural
Language Processing and Computational Natural Language Learning (EMNLP-CoNLL).
(2007) 698-707.
[27] M. Song, W.C. Kim, D. Lee, et al., PKDE4J: Entity and relation extraction for public
knowledge discovery, Journal of biomedical informatics. 57(1) (2015) 320-332.
[28] I2B2, “Informatics for Integrating Biology & the Bedside.” [Online] Available: https://www.i2b2.org/.
[29] WILAB-HIT, “Resources.” [Online] Available: https://github.com/WILAB-HIT/Resources/.
[30] J. Pearl, Probabilistic reasoning in intelligent systems: networks of plausible inference,
Morgan Kaufmann (2014).
[31] E. Ising, Beitrag zur theorie des ferromagnetismus, Zeitschrift für Physik A Hadrons and
Nuclei. 31(1) (1925) 253-258.
[32] G.E. Hinton, T.J. Sejnowski, Optimal perceptual inference, Proceedings of the IEEE
conference on Computer Vision and Pattern Recognition. (1983) 448-453.
[33] E. Yilmaz, E. Kanoulas, J.A. Aslam, A simple and efficient sampling method for
estimating AP and NDCG, Proceedings of the 31st annual international ACM SIGIR
conference on Research and development in information retrieval (2008).
Figure Captions
Fig. 1. Sample of discharge summary from the Second Affiliated Hospital of Harbin Medical
University.
Fig. 2. Sample of progress note from the Second Affiliated Hospital of Harbin Medical
University.
Fig. 4. DCG (discounted cumulative gain) for discharge summaries and progress notes using
Fig. 5. Distribution of R@10 (recall for first 10 results) for discharge summaries and progress
Fig. 6. DCG (discounted cumulative gain) for discharge summaries and progress notes using
Fig. 7. Distribution of R@10 (recall for first 10 results) for discharge summaries and progress
Fig. 8. DCG and R@10 for all CEMRs (Chinese Electronic Medical Records): MLN (Markov
logic network), MKN (medical knowledge network), and LR (logistic regression algorithm).