Learning Continuous-Time Bayesian Networks in Relational Domains: A Non-Parametric Approach
Tushar Khot
Kristian Kersting
Sriraam Natarajan
School of Informatics
and Computing
Indiana University
Bloomington, IN 47408
Abstract
Many real world applications in medicine, biology,
communication networks, web mining, and economics,
among others, involve modeling and learning structured
stochastic processes that evolve over continuous time.
Existing approaches, however, have focused on propositional domains only. Without extensive feature engineering, it is difficult, if not impossible, to apply
them within relational domains, where we may have
a varying number of objects and relations among them.
We therefore develop the first relational representation
called Relational Continuous-Time Bayesian Networks
(RCTBNs) that can address this challenge. It features
a nonparametric learning method that allows for efficiently learning the complex dependencies and their
strengths simultaneously from sequence data. Our experimental results demonstrate that RCTBNs can learn
as effectively as state-of-the-art approaches for propositional tasks while modeling relational tasks faithfully.
Introduction
Modeling structured stochastic processes that evolve over
time is an important and challenging task with applications
in many fields ranging from surveillance to activity recognition to reliability monitoring of communication networks
to treatment planning in biomedicine. Classical AI solutions
such as Dynamic Bayesian networks (Murphy 2002) discretize time into fixed intervals and perform the modeling on these intervals. While discretization is indeed often
reasonable, there are domains such as medicine in which no
natural discretization is available; if we discretize time too
finely, the learning problem can quickly become intractable,
but if we discretize time too coarsely, we lose information.
Therefore it is not surprising that models over finite
spaces but across continuous time have been proposed. The
most popular ones fall under the category of Continuous-Time Markov Processes. They are described by an initial
distribution over the states of the model and a matrix that
specifies the rate of transition between states. Their successor, Continuous-Time Bayesian Networks (CTBNs), makes
the problem more tractable by factorizing the rate matrix so
that conditional independencies can be exploited (Nodelman, Shelton, and Koller 2002). The models can be learned
efficiently from data, see e.g. (Weiss, Natarajan, and Page
2012). Without extensive feature engineering, however, it is
difficult, if not impossible, to apply CTBNs to relational
domains, in which, e.g., there is a varying number of heterogeneous objects and relations among them. Many of today's
datasets, however, are inherently relational and have no natural time slices. Consider an Electronic Health Record (EHR).
It typically contains demographic information, prescription
history, lab tests, diagnoses, along with imaging data and
possibly in the near future, genetic (SNPs) data as well. Another example is the human-to-X communication network
where users typically call many different people, use a multitude of apps, take pictures, listen to music, and so on. Statistical Relational Learning (SRL) has proven successful in such domains by combining the power of first-order
logic and probability (Getoor and Taskar 2007). As far as
we are aware, there are no continuous-time SRL approaches.
It is possible to model relational CTBNs in CTPPL (Pfeffer 2009), a general purpose probabilistic programming language for processes over continuous time. However, the use
of a relational language allows us to develop a more powerful structure learning approach.
Consequently, we develop Relational Continuous-Time
Bayesian Networks (RCTBNs). They extend SRL towards
modeling in continuous time by lifting CTBNs to relational data. The syntax and semantics are based on Bayesian
Logic Programs (BLP) (Kersting and De Raedt 2002), and
the use of the logical notation allows for an expressive representation that does not fix the number of features in advance
yet results in a homogeneous process. Although already interesting on its own, we go one step further. Based on Friedman's functional-gradient boosting (Friedman 2001), we
develop a non-parametric learning approach for RCTBNs
that simultaneously learns the dependencies between the
trajectories in the data and the parameters that quantify
these dependencies. Our extensive experimental evaluation
demonstrates that RCTBNs are comparable to state-of-the-art methods on propositional data but can handle relations
faithfully, where propositional methods either fail or need to
be engineered specifically for each task.
To summarize, we make the following key contributions:
(1) We present the first continuous-time relational model. (2)
We develop a non-parametric learning algorithm for learning these models. (3) We prove the convergence properties
of our algorithm. (4) Finally, we show that our algorithm
can even model propositional data created by other methods
effectively while faithfully modeling relational data.
We proceed as follows. After briefly reviewing CTBNs,
we lift them to the relational case and prove that the resulting RCTBNs are homogeneous. We then show how to learn
RCTBNs from data and prove convergence. Before concluding, we present our experimental results.
Background
While we will introduce relational concepts on-the-fly, let us
briefly review CTBNs and functional gradient boosting.
Each variable $X_i$ is associated with a conditional intensity matrix (CIM) for every instantiation $\mathbf{pa}_i^j$ of its parent set $\mathbf{Pa}(X_i)$:
$$
Q_{X_i \mid \mathbf{pa}_i^j} =
\begin{bmatrix}
-q_{x_i^1 \mid \mathbf{pa}_i^j} & q_{x_i^1 x_i^2 \mid \mathbf{pa}_i^j} & \cdots & q_{x_i^1 x_i^{r_i} \mid \mathbf{pa}_i^j} \\
q_{x_i^2 x_i^1 \mid \mathbf{pa}_i^j} & -q_{x_i^2 \mid \mathbf{pa}_i^j} & \cdots & q_{x_i^2 x_i^{r_i} \mid \mathbf{pa}_i^j} \\
\vdots & \vdots & \ddots & \vdots \\
q_{x_i^{r_i} x_i^1 \mid \mathbf{pa}_i^j} & q_{x_i^{r_i} x_i^2 \mid \mathbf{pa}_i^j} & \cdots & -q_{x_i^{r_i} \mid \mathbf{pa}_i^j}
\end{bmatrix}
$$
where $r_i$ denotes the number of states the discrete variable $X_i$ can take, $q_{x_i^k \mid \mathbf{pa}_i^j} = \sum_{k' \neq k} q_{x_i^k x_i^{k'} \mid \mathbf{pa}_i^j}$, and $q_{x_i^k \mid \mathbf{pa}_i^j}, q_{x_i^k x_i^{k'} \mid \mathbf{pa}_i^j} > 0$. The CIM is factored into two components: $q_{X_i}$, an exponential distribution over when the next transition will occur, and $\theta_{X_i} = q_{X_i^k X_i^{k'}} / q_{X_i^k}$ ($k' \neq k$), a multinomial distribution over the next state. For example, $\theta_{x_i^1 x_i^2 \mid \mathbf{pa}_i^j}$ is the probability of $X_i$ transitioning from state 1 to state 2 given its parents' value $\mathbf{pa}_i^j$. If $X_i$ is in state $k$, it stays in this state for an amount of time that is exponentially distributed with parameter $q_{x_i^k \mid \mathbf{pa}_i^j}$, given that its parents' value is $\mathbf{pa}_i^j$. The expected time to transition is $E[t] = 1/q_{x_i^k \mid \mathbf{pa}_i^j}$. Upon transitioning, $X_i$ shifts from its $k$-th state to its $k'$-th state with probability $q_{x_i^k x_i^{k'} \mid \mathbf{pa}_i^j} / q_{x_i^k \mid \mathbf{pa}_i^j}$. For details, we refer to the original CTBN work (Nodelman, Shelton, and Koller 2002).
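To make the two-component factorization concrete, the following is a minimal Python sketch (our own illustration, not code from the paper; the CIM values are made up) that samples a dwell time and a next state from one row of a CIM:

```python
import numpy as np

def sample_transition(Q, k, rng=np.random.default_rng()):
    """Sample (dwell_time, next_state) for a variable currently in state k,
    given the CIM Q for one parent configuration.

    Q[k, k] = -q_{x^k|pa} is the negative total exit rate of state k;
    the off-diagonal entries Q[k, k'] are the rates of jumping to k'."""
    q_exit = -Q[k, k]                            # total rate of leaving state k
    dwell_time = rng.exponential(1.0 / q_exit)   # expected dwell time is 1/q
    probs = Q[k].copy()
    probs[k] = 0.0
    probs /= q_exit                              # multinomial over next states: q_{kk'}/q_k
    next_state = rng.choice(len(probs), p=probs)
    return dwell_time, next_state

# Example: a two-state variable (e.g., low/high) under one parent configuration.
Q = np.array([[-0.5, 0.5],
              [ 2.0, -2.0]])
print(sample_transition(Q, k=0))
```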
Instead of a single probability distribution for $X_i$ given a specific instantiation of its parents as in BNs, CTBNs have one CIM for every configuration of its parent set. Since the number of configurations of $\mathbf{Pa}(X_i)$ equals $v_i = \prod_{X_t \in \mathbf{Pa}(X_i)} r_t$, there are $v_i$ conditional intensity matrices for $X_i$ in a CTBN model, and each CIM has $r_i \times r_i$ parameters. Because of the constraint $q_{x_i^k \mid \mathbf{pa}_i^j} = \sum_{k' \neq k} q_{x_i^k x_i^{k'} \mid \mathbf{pa}_i^j}$, the total number of independent parameters for a CTBN model over the set $\mathbf{X}$ is $\sum_{X_i \in \mathbf{X}} v_i (r_i \times r_i - r_i)$. For an individual variable $X_i$, its trajectory can be viewed as a conditional Markov process, which is inhomogeneous because the intensities vary with time. It is worth noting that the CIM is not a direct function of time but rather a function of the current values of its parent variables, which themselves evolve with time. However, the global process over the variable set $\mathbf{X}$ is a homogeneous Markov process. In fact, any CTBN can be converted to a homogeneous Markov process by amalgamating (Nodelman 2007) all its CIMs into a joint intensity matrix. Simply put, amalgamation is the summation over the expansions of all the CIMs over the entire variable set. Within the joint intensity matrix, all intensities corresponding to two simultaneous changes are zero because two variables cannot transition at the same instant.
Finally, it suffices to construct a CTBN with two components: the initial distribution (denoted $P_X^0$), which is specified as a Bayesian network $B$ over $\mathbf{X}$, and a continuous transition model, specified by (i) a directed graph $G$ similar to a BN structure but possibly with cycles, and (ii) a set of CIMs $Q_{X_i \mid \mathbf{Pa}(X_i)}$ for each variable $X_i \in \mathbf{X}$.
Functional gradient boosting starts from an initial potential $\psi_0$ and iteratively adds corrections. In iteration $m$, every training example $x_i$ yields a regression example consisting of the example and the gradient of the log-likelihood w.r.t. $\psi_{m-1}(x_i)$, i.e., $\big\langle x_i, \Delta_m(x_i) = \frac{\partial LL(\mathbf{x})}{\partial \psi_{m-1}(x_i)} \big\rangle$. To generalize from these regression examples, a regression function $\hat{\psi}_m$ (generally a regression tree) is learned to fit the gradients $\Delta_m$. The final model $\psi_m = \psi_0 + \Delta_1 + \cdots + \Delta_m$ is a sum over these regression trees.
This work has been extended to relational models (Natarajan et al. 2012; Karwath, Kersting, and Landwehr 2008;
Sutton et al. 2000; Natarajan et al. 2011) by replacing the
propositional trees with relational regression trees (Blockeel
and Raedt 1998). If we assume the probability of an example ($y_i$) given its parents/neighbors ($x_i$) is a sigmoid of $\psi$, the functional gradient of each example $\langle x_i, y_i \rangle$ w.r.t. $\psi$ is
$$
\frac{\partial \log P(y_i; x_i)}{\partial \psi(y_i = 1; x_i)} \;=\; I(y_i = 1; x_i) - P(y_i = 1; x_i),
$$
where I is the indicator function that is 1 if yi = 1 and 0
otherwise. The expression is simply the adjustment required
to match the predicted probability with the true label of the
example. If the example is positive and the predicted probability is smaller than 1, this gradient is positive indicating
that the predicted probability should move towards 1. Conversely, if the example is negative and the predicted probability is larger than 0, the gradient is negative, driving the
value the other way.
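For illustration, a minimal sketch (ours, not the authors' implementation) of how these pointwise gradients become the regression targets that the next relational regression tree is fit to:

```python
import math

def sigmoid(psi):
    return 1.0 / (1.0 + math.exp(-psi))

def functional_gradients(examples, psi):
    """Compute the RFGB-style gradient I(y=1) - P(y=1 | x) for each example.

    `examples` is a list of (x, y) pairs with y in {0, 1}; `psi` maps an
    example x to its current potential value. The returned pairs
    (x, gradient) are the regression examples the next tree is fit to."""
    grads = []
    for x, y in examples:
        p = sigmoid(psi(x))                     # current predicted probability
        grads.append((x, (1 if y == 1 else 0) - p))
    return grads

# Toy usage: a constant initial potential psi_0 = 0 gives P = 0.5 everywhere,
# so positives get gradient +0.5 and negatives -0.5.
examples = [("mary", 1), ("tom", 0)]
print(functional_gradients(examples, psi=lambda x: 0.0))
```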
Relational CTBNs
We now introduce Relational CTBNs (RCTBNs) and then
develop an efficient learning approach for them.
As a running example, Diabetes(x, t) denotes that person x develops diabetes at time t, and FamilyMember(x, y) denotes that x is a family member of y. A sample data set could be:
Training examples
Diabetes(mary, t0 ).
Diabetes(ann, t2 ).
Diabetes(tom, t3 ).
Diabetes(john, t4 ).
Background Knowledge
FamilyMember(ann, mary).
FamilyMember(eve, mary).
FamilyMember(ian, tom).
FamilyMember(jack, bob).
FamilyMember(bob, mary).
FamilyMember(tom, mary).
This can be captured by a rule of the form
$$\forall x, t_1, \exists y, t:\; \neg Diabetes(x, t_1) \wedge FamilyMember(y, x) \wedge Diabetes(y, t) \wedge t_1 < t < t_2 \;\Rightarrow\; Diabetes(x, t_2) \sim \exp(q_1).$$
It states that for all persons ($x$) who do not have diabetes at time $t_1$, if one of their family members ($y$) develops diabetes at $t > t_1$, then the probability that $x$ will have diabetes at time $t_2$ follows the exponential distribution specified at the end of the rule (with parameter $q_1$). Another rule is:
$$\forall x, t_1, \nexists y, t:\; \neg Diabetes(x, t_1) \wedge FamilyMember(y, x) \wedge Diabetes(y, t) \wedge t_1 < t < t_2 \;\Rightarrow\; Diabetes(x, t_2) \sim \exp(q_2).$$
This rule contains different conditions from the first rule and
says that if no family members develop diabetes in the time
interval t1 to t2 , the probability that the person will have
diabetes is an exponential distribution with parameter q2 .
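To make the role of the two clauses concrete, the following sketch (hypothetical helper names and parameter values, not the paper's implementation) selects which exponential parameter governs a person's segment:

```python
FAMILY = {("ann", "mary"), ("eve", "mary"), ("ian", "tom"),
          ("jack", "bob"), ("bob", "mary"), ("tom", "mary")}
DIABETES_ONSET = {"mary": 0.0, "ann": 2.0, "tom": 3.0, "john": 4.0}

def intensity_for(person, t1, t2, q1=0.8, q2=0.05):
    """Return the exponential rate governing `person` developing diabetes by t2:
    q1 if some family member developed diabetes in (t1, t2), q2 otherwise
    (q1 and q2 are made-up parameter values)."""
    for member, relative in FAMILY:              # FamilyMember(member, relative)
        if relative == person:
            onset = DIABETES_ONSET.get(member)
            if onset is not None and t1 < onset < t2:
                return q1                        # first clause fires
    return q2                                    # second clause fires

print(intensity_for("john", t1=1.0, t2=5.0))     # q2: no family members listed for john
print(intensity_for("mary", t1=1.0, t2=5.0))     # q1: ann and tom develop diabetes
```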
It is clear that, if we had used a propositional CTBN,
the dimension of this feature space (the number of CIMs)
would be exponential in the number of family members
the individual could have. In turn, when the data set is
rich in F amilyM ember relationships, this can greatly
deteriorate the learning efficiency as observed by Weiss
et al. (2012). Hence, following the standard observations in SRL (Getoor and Taskar 2007), we exploit the ability to
tie parameters using logic. More formally:
Definition. An RCTBN clause consists of two components: a first-order Horn clause that defines the effect of a set of influencing predicates on a target predicate, and a conditional intensity matrix (CIM) that models the probability of transition of the target given the influences.
Definition. An RCTBN program is a set of RCTBN clauses.
Note that there could be multiple clauses with the same
head, indicating the target trajectory depends on multiple
parent trajectories. Second, since we are in the relational setting, there can be many instances that satisfy a single rule for
a single target. For instance, in the above rule, there can be
multiple family members (i.e., several instances of y) for the current person x, and there can be multiple rules for predicting the onset of diabetes. One is tempted to use various combining rules, see e.g. (Natarajan et al. 2009), for combining the resulting distributions.
Learning RCTBNs
We now develop a non-parametric approach for learning
RCTBNs from data. Note that if the training data is complete, then during the entire trajectory of X, the values of
Pa(X) are known. In turn, it is explicit at any time point
which conditional intensity matrix is dominating the dynamics of X. Thus, the likelihood of the parameters can be decomposed into the product of the likelihood of the individual
parameters (Nodelman, Shelton, and Koller 2003). This parameter modularity along with the structure modularity of
CTBNs allow the parameters of CTBNs to be learned locally. Besides, we notice that the likelihood function is also
decomposable over individual transitions along the trajectories because of the memoryless property of the exponential
distribution. These lay out the theoretical foundation for applying the non-parametric RFGB approach.
Examples
CVD(John, t0).
CVD(John, t1).
CVD(John, t2).
CVD(John, t3).
CVD(John, t4).
CVD(John, t5).
CVD(John, t6).
Facts
BP(John, t0, low).
Diabetes(John, t0, false).
BP(John, t1, high).
BP(John, t2, low).
Diabetes(John, t3, true).
BP(John, t4, high).
BP(John, t6, low).
Note that each training example is a segment of a trajectory that includes at most one transition of the target predicate. For example, Figure 1 shows the trajectories of BloodPressure (BP), Diabetes, and CVD for patient John. Let us assume that the goal is to predict the transition of CVD from low to high. As can be observed, every feature/predicate transitions at different time points (a fundamental assumption of CTBNs). This can be represented in logical notation, as we show in the bottom half of Figure 1.
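For illustration, here is a small sketch (our own, with a hypothetical input format) of how such a trajectory can be cut into training segments, each ending either with a transition of the target predicate (a positive example) or without one (a negative example):

```python
def make_segments(target_changes, feature_changes, t_end):
    """Split the observation window [0, t_end) into segments at every time
    point where the target or any feature changes value (hypothetical input
    format: lists of change times). Each segment records whether it ends
    with a transition of the target (positive) or not (negative)."""
    cuts = sorted(set(target_changes) | set(feature_changes) | {t_end})
    segments, start = [], 0.0
    for t in cuts:
        if t <= start:
            continue
        segments.append({"start": start, "end": t, "duration": t - start,
                         "positive": t in target_changes})
        start = t
    return segments

# CVD(John) transitions at t3; BP and Diabetes change at t1, t2, t4, and t6.
print(make_segments(target_changes=[3.0],
                    feature_changes=[1.0, 2.0, 4.0, 6.0], t_end=7.0))
```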
Now, we first define the transition distribution from the
perspective of each segment:
For positive examples ($x_i^k \to x_i^{k'}$):
$$
P(x_i^k \to x_i^{k'} \mid \mathbf{pa}_i^j) \;=\; \Big(1 - \exp\big(-q_{x_i^k \mid \mathbf{pa}_i^j}\, T[x_i^k \to x_i^{k'} \mid \mathbf{pa}_i^j]\big)\Big)\, \frac{q_{x_i^k x_i^{k'} \mid \mathbf{pa}_i^j}}{q_{x_i^k \mid \mathbf{pa}_i^j}} \qquad (1)
$$
For negative examples ($x_i^k \to x_i^{k}$):
$$
P(x_i^k \to x_i^{k} \mid \mathbf{pa}_i^j) \;=\; \exp\big(-q_{x_i^k \mid \mathbf{pa}_i^j}\, T[x_i^k \to x_i^{k} \mid \mathbf{pa}_i^j]\big)
$$
where $T[\cdot]$ denotes the duration of the segment. To keep the intensities positive, we represent each of them by an exponentiated potential, $q_{x_i^k x_i^{k'} \mid \mathbf{pa}_i^j} = e^{\psi_{x_i^k x_i^{k'} \mid \mathbf{pa}_i^j}}$, so that $q_{x_i^k \mid \mathbf{pa}_i^j} = \sum_{k' \neq k} e^{\psi_{x_i^k x_i^{k'} \mid \mathbf{pa}_i^j}}$. Since the target predicate is boolean, there is a single $k' \neq k$, hence $q_{x_i^k \mid \mathbf{pa}_i^j} = e^{\psi_{x_i^k x_i^{k'} \mid \mathbf{pa}_i^j}}$, and the negative examples now only contain one case where no transition happened.
So, the gradient function can be derived as
$$
\frac{\partial LL}{\partial \psi_{x_i^k x_i^{k'} \mid \mathbf{pa}_i^j}} \;=\; \frac{\partial LL}{\partial q_{x_i^k x_i^{k'} \mid \mathbf{pa}_i^j}} \; e^{\psi_{x_i^k x_i^{k'} \mid \mathbf{pa}_i^j}},
$$
which gives, for positive examples ($x_i^k \to x_i^{k'}$):
$$
\frac{\partial LL}{\partial \psi_{x_i^k x_i^{k'} \mid \mathbf{pa}_i^j}} \;=\; \frac{\exp\big(-e^{\psi_{x_i^k \mid \mathbf{pa}_i^j}}\, T[x_i^k \to x_i^{k'} \mid \mathbf{pa}_i^j]\big)\; e^{\psi_{x_i^k \mid \mathbf{pa}_i^j}}\, T[x_i^k \to x_i^{k'} \mid \mathbf{pa}_i^j]}{1 - \exp\big(-e^{\psi_{x_i^k \mid \mathbf{pa}_i^j}}\, T[x_i^k \to x_i^{k'} \mid \mathbf{pa}_i^j]\big)} \qquad (2)
$$
Writing $prob^{+} = 1 - \exp\big(-e^{\psi_{x_i^k \mid \mathbf{pa}_i^j}}\, T\big)$ for the predicted probability of the target transitioning within the segment, the gradient becomes, for positive examples ($x_i^k \to x_i^{k'}$):
$$
\frac{\partial LL}{\partial \psi_{x_i^k x_i^{k'} \mid \mathbf{pa}_i^j}} \;=\; -\frac{(1 - prob^{+})\, \ln(1 - prob^{+})}{prob^{+}} \qquad (3)
$$
and, for negative examples ($x_i^k \to x_i^{k}$):
$$
\frac{\partial LL}{\partial \psi_{x_i^k x_i^{k'} \mid \mathbf{pa}_i^j}} \;=\; \ln(1 - prob^{+}).
$$
This presents the weight updates of the examples in each
iteration based on the current model.
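As a sanity check of Equations (2) and (3) (a sketch with arbitrary values, not the authors' code), the following computes the per-segment gradient both directly from $\psi$ and via $prob^{+}$; the two forms agree:

```python
import math

def grad_via_psi(psi, T, positive):
    """Gradient of the segment log-likelihood w.r.t. psi (Equation (2) form)."""
    q = math.exp(psi)                            # intensity q = e^psi
    if positive:
        return math.exp(-q * T) * q * T / (1.0 - math.exp(-q * T))
    return -q * T                                # d/dpsi of log exp(-qT)

def grad_via_prob(psi, T, positive):
    """Same gradient expressed through prob+ = 1 - exp(-e^psi T) (Equation (3))."""
    prob = 1.0 - math.exp(-math.exp(psi) * T)
    if positive:
        return -(1.0 - prob) * math.log(1.0 - prob) / prob
    return math.log(1.0 - prob)

psi, T = 0.3, 1.5                                # arbitrary example values
for positive in (True, False):
    print(grad_via_psi(psi, T, positive), grad_via_prob(psi, T, positive))
```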
Plugging (3) into RFGB results in a convergent learner. More precisely:
Theorem 2. There exist global maxima for the log-likelihood function.
Proof. The Hessian of the log-likelihood w.r.t. $\psi_{x_i^k x_i^{k'} \mid \mathbf{pa}_i^j}$ equals, for positive examples ($x_i^k \to x_i^{k'}$):
$$
H \;=\; \frac{\big(prob^{+} + \ln(1 - prob^{+})\big)\,\big(prob^{+} - 1\big)\,\ln(1 - prob^{+})}{(prob^{+})^2} \qquad (4)
$$
and, for negative examples ($x_i^k \to x_i^{k}$):
$$
H \;=\; \ln(1 - prob^{+}).
$$
Since $prob^{+} \in [0, 1]$, we can derive the value ranges: $\ln(1 - prob^{+}) \in (-\infty, 0]$, $\big(prob^{+} + \ln(1 - prob^{+})\big) \in (-\infty, 0]$, $prob^{+} - 1 \in [-1, 0]$, and $(prob^{+})^2 \in [0, 1]$. Hence the Hessian w.r.t. $\psi_{x_i^k x_i^{k'} \mid \mathbf{pa}_i^j}$ is always non-positive, for any $\psi_{x_i^k x_i^{k'} \mid \mathbf{pa}_i^j}$. So the log-likelihood is a concave function of $\psi_{x_i^k x_i^{k'} \mid \mathbf{pa}_i^j}$, which has global maxima.
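A quick numerical check of Equation (4) (our own verification script, not part of the paper) confirms that the Hessian is non-positive over a grid of $prob^{+}$ values:

```python
import numpy as np

p = np.linspace(1e-6, 1 - 1e-6, 1000)                           # prob+ strictly inside (0, 1)
H_pos = (p + np.log(1 - p)) * (p - 1) * np.log(1 - p) / p**2    # positive examples, Eq. (4)
H_neg = np.log(1 - p)                                           # negative examples
assert (H_pos <= 0).all() and (H_neg <= 0).all()
print(H_pos.max(), H_neg.max())                                 # both maxima are <= 0
```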
Theorem 3. RFGB for CTBNs converges when the predictions reach the true values. In other words, RFGB for CTBNs will converge to the global maxima.
Proof. For positive examples, let $a = 1 - prob^{+}$. Based on (3) and L'Hôpital's rule, $\lim_{prob^{+} \to 1} \frac{\partial LL}{\partial \psi}$ equals
$$
\lim_{a \to 0^{+}} \frac{-a \ln a}{1 - a} \;=\; \lim_{a \to 0^{+}} \frac{-\ln a}{\frac{1}{a} - 1} \;=\; \lim_{a \to 0^{+}} \frac{-1/a}{-1/a^2} \;=\; \lim_{a \to 0^{+}} a \;=\; 0.
$$
For negative examples,
$$
\lim_{prob^{+} \to 0} \ln(1 - prob^{+}) \;=\; 0.
$$
Hence the gradients vanish exactly when the predicted probabilities match the true values, and by the concavity established in Theorem 2, RFGB for CTBNs converges to the global maxima.
Figure 2: Left three columns, propositional domains (Q1): The results show no statistically significant difference between RCTBNs and state-of-the-art methods. Right column, relational experiment (Q2): (Upper frame) RCTBN model of CVD. (Bottom frame) The results show that RCTBNs can capture relations faithfully. As there is no continuous-time relational model to compare against, the results are compared against the ground truth. Note that the ground truth does not have AUC = 1.0 because, when we generated the data using forward sampling, the values were sampled based on the transition intensity calculated by $P = 1 - e^{-qt}$. The AUC of the ground truth is calculated based on these generating transition probabilities instead of the deterministic values (i.e., 0 for positive examples and 1 for negative examples).
For every segment in which a transition occurs over the target predicate, we generate a regression example using the gradients for positive examples from (3). If the target predicate does not transition in this segment, we treat it as a negative example and compute the gradient accordingly.
Experiments
Our intention is to investigate whether RCTBNs truly generalize CTBNs to relational domains by addressing the following two questions:
(Q1) How does RCTBN compare to CTBNs on propositional domains?
(Q2) Can RCTBN-RFGB learn relational models faithfully
where CTBNs would fail or need extensive feature engineering?
To this aim we evaluated RCTBN-RFGB on propositional
and relational data sets and compared it to the state-of-the-art where possible. More precisely, we employed
three standard CTBN datasets from prior literature,
i.e. the Drug model (Nodelman, Shelton, and Koller
2003), MultiHealth model (Weiss, Natarajan, and Page
2012) and S100 model (Weiss, Natarajan, and Page
2012). In the drug domain, the data is generated for a length of 10, the target predicate is JointPain(X, T), and the other variables include the trajectories of Eating, Fullstomach, Hungry, Uptake, Concentration, Barometer, and Drowsy. The MultiHealth data has the target predicate Atherosclerosis(X, T) and trajectories of Gender, Smoking, Age, Glucose, BMI,
BP, ChestPain, AbnormalECG, MI, Troponin, among others.
Conclusion
We proposed the first relational model that can faithfully
model continuous time. Our model is inspired by the successes of the CTBN and SRL communities. In addition, we adapt and refine the leading learning algorithm for SRL models to the temporal case. The key innovation is that the conditional intensity is defined as a monotonic function of the RFGB function value; hence, maximizing w.r.t. the function values also maximizes the log-likelihood of the trajectories.
Our resulting algorithm is non-parametric and can learn the
dependencies and the parameters together. Our initial results
are promising and show that we can indeed recover the original structure and faithfully model the relational trajectories.
Our current learning approach is discriminative and we
are only learning for a single query. Extending this to generative learning and modeling jointly over multiple predicates is our next step. More applications of the algorithm to
complex tasks such as Electronic Health Records, gene experimental databases, network flow modeling, and monitoring the reliability of human-to-X communication systems,
among others, with asynchronous events are exciting directions. Adaptation of different types of combination functions
is another important direction. Finally, improving inference
is essential to scale up our learning algorithm to a large number of tasks and is also an interesting direction for future
research.
Acknowledgments SY and SN gratefully acknowledge
National Science Foundation grant no. IIS-1343940. KK acknowledges the support of the DFG Collaborative Research
Center SFB 876, project B4. The authors would like to thank
David Page and Jeremy Weiss for their inputs, Jeremy Weiss
for the mfCTBN code and inputs, and Christoph Ide for the
discussion on communication networks.
³ Note that our work is the first dynamic relational model to handle continuous time; see also the discussion in the introduction. Other formalisms exist that handle continuous variables, but not continuous time, which in our case is not a variable by itself but rather an argument of all the predicates. Hence, our relational baseline is the ground truth.
References
Blockeel, H., and Raedt, L. D. 1998. Top-Down Induction
of First-Order Logical Decision Trees. Artif. Intell. 101(1-2):285–297.
El-Hay, T.; Friedman, N.; Koller, D.; and Kupferman, R.
2006. Continuous Time Markov Networks. In UAI.
Friedman, J. 2001. Greedy function approximation: A gradient boosting machine. Annals of Statistics 29.
Getoor, L., and Taskar, B. 2007. Introduction to Statistical
Relational Learning. Adaptive computation and machine
learning. MIT Press.
Getoor, L.; Friedman, N.; Koller, D.; and Pfeffer, A. 2001.
Learning probabilistic relational models. In Dzeroski, S.,
and Lavrac, N., eds., Relational Data Mining. Springer-Verlag.
Karwath, A.; Kersting, K.; and Landwehr, N. 2008. Boosting relational sequence alignments. In ICDM.
Kersting, K., and De Raedt, L. 2002. Basic Principles of
Learning Bayesian Logic Programs. Technical report, Institute for Computer Science, University of Freiburg.
Murphy, K. 2002. Dynamic Bayesian Networks: Representation, Inference and Learning. Ph.D. Dissertation, UC
Berkeley, Computer Science Division.
Natarajan, S.; Tadepalli, P.; Dietterich, T. G.; and Fern, A.
2009. Learning first-order probabilistic models with combining rules. Annals of Mathematics and AI.
Natarajan, S.; Joshi, S.; Tadepalli, P.; Kristian, K.; and Shavlik, J. 2011. Imitation learning in relational domains: A
functional-gradient boosting approach. In IJCAI.
Natarajan, S.; Khot, T.; Kersting, K.; Guttmann, B.; and
Shavlik, J. 2012. Gradient-based boosting for statistical relational learning: The relational dependency network case.
Machine Learning.
Nodelman, U.; Shelton, C.; and Koller, D. 2002. Continuous
Time Bayesian Networks. In Proceedings of the Eighteenth
Conference on Uncertainty in Artificial Intelligence (UAI),
378–387.
Nodelman, U.; Shelton, C.; and Koller, D. 2003. Learning
Continuous Time Bayesian Networks. In Proc. Nineteenth
Conference on Uncertainty in Artificial Intelligence (UAI),
451–458.
Nodelman, U. 2007. Continuous Time Bayesian Networks.
Ph.D. Dissertation, Stanford University.
Pearl, J. 1988. Probabilistic reasoning in intelligent systems: Networks of plausible inference. Morgan Kaufmann
Publishers Inc.
Pfeffer, A. 2009. CTPPL: A continuous time probabilistic
programming language. In Proceedings of the 21st International Joint Conference on Artificial Intelligence (IJCAI),
1943–1950.
Sutton, R.; McAllester, D.; Singh, S.; and Mansour, Y. 2000.
Policy gradient methods for reinforcement learning with
function approximation. In NIPS.
Weiss, J. C.; Natarajan, S.; and Page, D. 2012. Multiplicative
Forests for Continuous-Time Processes. In NIPS, 467–475.