Week 4
ST301/ST413, Chapter 4
20 October, 2022
4. Bayesian Networks
Question: How do we structure problems when they get big?
Answer: Build the model around a logical structure!
Note we are not focusing, at this stage, on learning from data; here
we just concentrate on building models that might make sense to
describe the problem under consideration. The next chapter will
combine this with learning from data.
4.1 Rational Thinking About Relevance
I We describe a problem through a set of variables
I Natural to describe situation in terms of relevance between
variables
I If we measure a variable X , is this information relevant to
forecast another variable Y ?
I For a statistician, knowing X is of no “relevance” to Y is
equivalent to saying that Y is independent of X
I For example, if Y ∈ {Y1, Y2} where Yd, d = 1, 2, are the rewards
after taking decision d, and X is irrelevant for Y, denoted
Y ⊥ X, then there is no point in measuring X
I Saying X is irrelevant to predictions about Y does not require
us to commit to any quantitative statement: it is purely
qualitative.
Irrelevance and independence are too simple to express complex
dependence relationships between many variables. But conditional
analogues work!
Let (X , Y , Z ) be (vectors of) variables.
Definition A measurement X is irrelevant for predicting Y given the
measurement Z (written Y ⊥ X|Z) if we believe that once we learn
Z then X will provide no extra useful information to predict Y.
I In Bayesian statistics we express our beliefs as a joint
probability mass or density function p(x, y, z) on (X, Y, Z).
I So Y ⊥ X|Z iff the conditional density p(y|x, z) of Y|X, Z
doesn't depend on x
I This is equivalent to p(y|x, z) = p(y|z) for all (x, y, z), or
equivalently to p(x, y|z) = p(x|z)p(y|z).
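This condition can be checked numerically for discrete variables. Below is a minimal sketch in Python (not from the slides), assuming the joint mass function is stored as a NumPy array indexed by (x, y, z); the function name cond_independent and the example numbers are purely illustrative.

import numpy as np

def cond_independent(p_xyz, tol=1e-12):
    # Check Y ⊥ X | Z for a joint p(x, y, z) stored as an array indexed
    # [x, y, z]: the conditional p(y | x, z) must not depend on x.
    p_xz = p_xyz.sum(axis=1, keepdims=True)        # p(x, z)
    p_y_given_xz = p_xyz / p_xz                    # p(y | x, z)
    # Independence holds iff every slice over x is identical
    return np.allclose(p_y_given_xz, p_y_given_xz[:1], atol=tol)

# Example: X and Y are generated independently given Z, so Y ⊥ X | Z holds
p_z = np.array([0.3, 0.7])
p_x_given_z = np.array([[0.2, 0.8], [0.6, 0.4]])   # rows indexed by z
p_y_given_z = np.array([[0.5, 0.5], [0.1, 0.9]])
p = np.einsum('z,zx,zy->xyz', p_z, p_x_given_z, p_y_given_z)
print(cond_independent(p))                         # True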
I Symmetry
X ⊥ Y |Z ⇒ Y ⊥ X |Z
I Decomposition
X ⊥ Y, W |Z ⇒ X ⊥ Y |Z
I Weak union
X ⊥ Y, W |Z ⇒ X ⊥ W |Z, Y
I Contraction
X ⊥ Y |Z and X ⊥ W |Z, Y ⇒ X ⊥ Y, W |Z
I Perfect decomposition
X ⊥ Y |Z and X ⊥ W |Z, Y ⇔ X ⊥ Y, W |Z
I Symmetry states that, in any state of knowledge Z , if Y tells
us nothing new about X then X tells us nothing new about Y .
I Decomposition asserts that if two combined items of
information are judged irrelevant to X , then each separate item
is irrelevant as well.
I Weak union states that learning irrelevant information Y can’t
help the irrelevant information W become relevant to X .
I Contraction states that if we judge W irrelevant to X after
learning some irrelevant information Y , then W must have
been irrelevant before we learned Y .
I Perfect decomposition is the combination of decomposition,
weak union and contraction.
I These axioms define a structure called a semi-graphoid
I It is easy to prove that these axioms are satisfied if we interpret
conditional irrelevance as conditional independence...
Proof of symmetry
Suppose Y ⊥ X |Z, i.e. p(y|x, z) = p(y|z) for all (x, y, z). Then, whenever p(y|z) > 0,
p(x|y, z) = p(x, y|z)/p(y|z) = p(x|z)p(y|x, z)/p(y|z) = p(x|z),
so p(x|y, z) does not depend on y, i.e. X ⊥ Y |Z.
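A similar one-line argument (added here as a sketch, not from the original slides) gives the decomposition property, written in LaTeX with the same notation:

% Decomposition: X ⊥ Y,W | Z  ⇒  X ⊥ Y | Z, reading conditional
% irrelevance as conditional independence and marginalising over w.
\[
  p(x \mid y, w, z) = p(x \mid z)\ \forall (x,y,w,z)
  \;\Longrightarrow\;
  p(x \mid y, z)
  = \sum_{w} p(x \mid y, w, z)\, p(w \mid y, z)
  = p(x \mid z) \sum_{w} p(w \mid y, z)
  = p(x \mid z).
\]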
By the chain rule, any joint mass function factorises as
p(x) = p(x1)p(x2|x1)p(x3|x1, x2) · · · p(xn|x1, ..., xn−1)
Suppose that, for each i, the DM splits the previously listed variables
{X1, ..., Xi−1} into a set X_{Qi} and the remaining variables X_{Ri}, and
asserts the irrelevance statement
Xi ⊥ X_{Ri} |X_{Qi}
Then p(xi|x1, ..., xi−1) = p(xi|x_{Qi}), so the joint factorises as
p(x) = ∏_{i=1}^{n} p(xi|x_{Qi}).
Definition
The Directed Acyclic Graph (DAG) associated with (n − 1)
irrelevance statements like those above has nodes {X1, X2, ..., Xn}
and an edge from Xi to Xj if Xi ∈ X_{Qj}. The elements of X_{Qj} are
the parents of Xj, and Xj is a child of each Xi ∈ X_{Qj}. Any DAG
whose associated irrelevance statements are deemed true by the DM
is valid. A Bayesian Network (BN) is a DAG together with its
associated factorised mass function.
Example
Suppose measurements (X1, X2, X3, X4, X5) have mass function
p(x1, x2, x3, x4, x5) = p(x1)p(x2|x1)p(x3|x2)p(x4|x3, x2)p(x5|x3)
The associated DAG is X1 → X2 → X3 → X5 together with the edges
X2 → X4 and X3 → X4.
Note:
1. The DAG of a given set of irrelevance statements is always
acyclic because parents are listed before children.
2. Absence of edges is informative. Complete graphs say nothing
(i.e. the graph with an edge from Xj to Xi for all 1 ≤ j < i ≤ n).
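The factorisation in the five-variable example above can be written directly in code. Here is a minimal Python sketch (not from the slides), assuming all variables are binary; the conditional probability tables are illustrative numbers, not elicited values.

import numpy as np

# Conditional probability tables for p(x1)p(x2|x1)p(x3|x2)p(x4|x3,x2)p(x5|x3)
p1 = np.array([0.6, 0.4])                        # p(x1)
p2_1 = np.array([[0.7, 0.3], [0.2, 0.8]])        # p(x2|x1), rows indexed by x1
p3_2 = np.array([[0.5, 0.5], [0.9, 0.1]])        # p(x3|x2), rows indexed by x2
p4_32 = np.array([[[0.6, 0.4], [0.1, 0.9]],
                  [[0.3, 0.7], [0.5, 0.5]]])     # p(x4|x3,x2), indexed [x3,x2,x4]
p5_3 = np.array([[0.3, 0.7], [0.8, 0.2]])        # p(x5|x3), rows indexed by x3

def joint(x1, x2, x3, x4, x5):
    # Joint mass function read off the BN factorisation
    return (p1[x1] * p2_1[x1, x2] * p3_2[x2, x3]
            * p4_32[x3, x2, x4] * p5_3[x3, x5])

# Sanity check: the probabilities of all 2^5 configurations sum to one
total = sum(joint(*cfg) for cfg in np.ndindex(2, 2, 2, 2, 2))
print(round(total, 6))   # 1.0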
Example
The DAG for the Naive Bayes model has an edge Y → Xi from Y to
each Xi, i = 1, ..., m, and no other edges.
I This is the model we used in the introduction chapter, with Y
being disease and Xi the symptoms
I How many parameters are there in this model?
I Let n be the number of diseases (i.e. the number of levels of Y) and
m the number of binary symptoms (so each Xi has only two levels)
I Naive Bayes model needs n − 1 probabilities to specify p(Y )
and for each disease y it needs m probabilities to specify
p(X1 , ..., Xm |Y = y ). Total is mn + n − 1 probabilities.
I In the full model (no irrelevance) we need n − 1 probabilities
for p(Y) and, for each disease y, 2^m − 1 probabilities for
p(X1, ..., Xm|Y = y). Total is n·2^m − 1 probabilities.
I For example, with m = 8 binary symptoms and n = 10 illnesses,
Naive Bayes needs 89 probabilities while the full model needs 2,559.
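These counts are easy to verify; a quick Python check (not from the slides):

def naive_bayes_params(n, m):
    # n - 1 probabilities for p(Y), plus m per disease for the p(Xi|Y = y)
    return n - 1 + n * m

def full_model_params(n, m):
    # n - 1 probabilities for p(Y), plus 2^m - 1 per disease
    return n - 1 + n * (2**m - 1)

print(naive_bayes_params(10, 8))   # 89
print(full_model_params(10, 8))    # 2559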
Example: Hidden Markov Model
[Figure: DAG with a chain C1 → C2 → · · · → CT of hidden states and an
edge Ck → Xk to each observation.]
p(X1, ..., XT, C1, ..., CT) = p(C1) ∏_{k=2}^{T} p(Ck|Ck−1) ∏_{k=1}^{T} p(Xk|Ck)
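This factorisation is straightforward to evaluate. A minimal Python sketch (not from the slides), assuming discrete hidden states and observations; the transition and emission numbers are illustrative only.

import numpy as np

init = np.array([0.5, 0.5])            # p(C1)
trans = np.array([[0.9, 0.1],          # p(Ck|Ck-1), rows indexed by previous state
                  [0.2, 0.8]])
emit = np.array([[0.7, 0.3],           # p(Xk|Ck), rows indexed by hidden state
                 [0.1, 0.9]])

def hmm_joint(x, c):
    # p(x1..xT, c1..cT) = p(c1) * prod p(ck|ck-1) * prod p(xk|ck)
    prob = init[c[0]]
    for k in range(1, len(c)):
        prob *= trans[c[k - 1], c[k]]
    for k in range(len(c)):
        prob *= emit[c[k], x[k]]
    return prob

print(hmm_joint(x=[0, 1, 1], c=[0, 0, 1]))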
Recall the example DAG X1 → X2 → X3 → X5 with X2 → X4 and X3 → X4.
Question Do two identical DAGs say different things depending on
the order in which we elicit the relationships?
Answer No! If two DAGs are identical then they represent
equivalent collections of irrelevance statements!
More remarkably, we can discover all logically deducible irrelevance
statements directly from the DAG. First we need to define some
terms from graph theory.
I Let G denote a directed graph. A vertex X is a parent of a
vertex Y , and Y is a child of X in G, iff there is a directed
edge X → Y from X to Y in G.
I Vertex Z is an ancestor of Y in G if Z = Y or there is a
directed path in G from Z to Y .
I Let X be a subset of the vertices V(G) of G. The ancestral set
of X, denoted by A(X), is the set of all the vertices in V(G) that
are ancestors of a vertex in X.
I The ancestral graph G(A(X)) is the subgraph of G induced by the
vertices in A(X), i.e. G restricted to A(X) and the edges between them.
I The moralised graph G^M of a DAG is obtained by adding an
undirected edge between any two parents of a common child
("marrying the parents"); the skeleton of a graph is the undirected
graph obtained by dropping the directions of all its edges.
Note, for example, that {X5} separates {X1} and {X2} in the
skeleton of G, but NOT in the skeleton of G^M(A).
The d-Separation Theorem
This allows all valid deductions to be read directly from the DAG.
It can be proved using only the properties at the start of this
chapter, and so applies even outside a Bayesian framework.
Theorem
Let A, B, C be any three disjoint subsets of {1, 2, . . . , n}, and G be
a valid DAG with vertices V(G) = {X1, X2, . . . , Xn}. If X_B
separates X_C from X_A in the skeleton of the moralised graph
G^M(A(X_{A∪B∪C})) of the ancestral graph G(A(X_{A∪B∪C})), then
X_C ⊥ X_A |X_B
[Figure: an example DAG G on vertices X1, ..., X9, with X7 a parent of
both X8 and X9.]
[Figure: the ancestral graph of the relevant vertices, its moralised
version, and the resulting skeleton, from which the separation
statements in the theorem can be read.]
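The construction in the theorem (take the ancestral graph, moralise it, take the skeleton, then check separation) can be automated. A Python sketch (not from the slides), representing a DAG as a dict mapping each vertex to the set of its parents; all names are illustrative.

from collections import deque

def ancestors(parents, targets):
    # All vertices with a directed path into the target set (targets included)
    seen, stack = set(targets), list(targets)
    while stack:
        for p in parents.get(stack.pop(), set()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

def d_separated(parents, A, B, C):
    anc = ancestors(parents, set(A) | set(B) | set(C))
    # Build the skeleton of the moralised ancestral graph as an undirected
    # adjacency map over the ancestral set
    adj = {v: set() for v in anc}
    for v in anc:
        pa = parents.get(v, set()) & anc
        for p in pa:                         # original edges, now undirected
            adj[v].add(p); adj[p].add(v)
        for p in pa:                         # "marry" parents of a common child
            for q in pa:
                if p != q:
                    adj[p].add(q); adj[q].add(p)
    # B separates A from C iff no path from A reaches C while avoiding B
    blocked = set(B)
    frontier = deque(set(A) - blocked)
    reached = set(frontier)
    while frontier:
        for w in adj[frontier.popleft()] - blocked - reached:
            reached.add(w)
            frontier.append(w)
    return reached.isdisjoint(C)

# Example: in the chain X1 -> X2 -> X3, X3 is irrelevant to X1 given X2
chain = {'X1': set(), 'X2': {'X1'}, 'X3': {'X2'}}
print(d_separated(chain, {'X1'}, {'X2'}, {'X3'}))   # True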
Directed local Markov property
I This is a special case of d-separation: given its parents, a node
is independent of its non-descendants.
I For all variables A ∈ V:
A ⊥ (V − de(A) − A)|pa(A)
where pa(A) denotes the parents of A and de(A) its descendants.
Markov blanket property
I This is another special case of d-separation.
I The Markov blanket mb(V) of a node V is defined as the set of
its parents, its children, and the other parents of its children
I Given its Markov blanket, a node is independent of the rest of
the network
I For all variables A ∈ V:
A ⊥ (V − mb(A) − A)|mb(A)
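A small Python sketch of this definition (not from the slides), again with the DAG stored as a dict mapping each vertex to its set of parents; the example DAG is the five-variable one used earlier in the chapter.

def markov_blanket(parents, v):
    # Parents, children, and the other parents of the children of v
    children = {w for w, pa in parents.items() if v in pa}
    co_parents = {p for w in children for p in parents[w]} - {v}
    return parents[v] | children | co_parents

# X1 -> X2 -> X3 -> X5, with X2 -> X4 and X3 -> X4
dag = {'X1': set(), 'X2': {'X1'}, 'X3': {'X2'},
       'X4': {'X2', 'X3'}, 'X5': {'X3'}}
print(markov_blanket(dag, 'X3'))   # {'X2', 'X4', 'X5'}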
4.2.4 Building a Bayesian network
Example
A DM needs to predict the costs of programmes of work. An
employee gives a ballpark estimate B of the cost of each potential
scheme. DM produces a detailed estimate E and then decides the
programme he puts out to tender. The lowest bid T gets work. The
final cost charged is ultimately O.
I DM says he uses E to estimate probabilities for the winning tender
price T
I This in turn is used to predict final cost O
I Therefore his DAG is
B → E → T → O
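Given this chain, the DM's elicited structure asserts, for instance, that once the winning tender price T is known, B and E are irrelevant to the final cost O. Using the illustrative d_separated helper sketched earlier in the chapter:

# Does {T} separate {B, E} from {O} in the elicited chain?
chain = {'B': set(), 'E': {'B'}, 'T': {'E'}, 'O': {'T'}}
print(d_separated(chain, {'B', 'E'}, {'T'}, {'O'}))   # True: O ⊥ B, E | T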
Queries of the Analyst
B → E → T → O
[Figure: the elicited DAG B → E → T → O augmented with two further
variables G and I.]
[Figure: the four possible DAGs relating X and Y through a third
variable Z.]
[Figure: two different DAGs, G and G′, that both correspond to the same
graph P.]
Markov network
If we moralise a DAG G and then take the skeleton, we obtain the
equivalent Markov network M:
[Figure: the DAG G and its equivalent Markov network M, obtained by
moralising G and taking the skeleton.]
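The moralise-then-take-skeleton step is easy to express in code. A short Python sketch (not from the slides), producing the set of undirected edges of M from a DAG given as a dict of parent sets.

def markov_network(parents):
    edges = set()
    for child, pa in parents.items():
        for p in pa:                       # keep every DAG edge, undirected
            edges.add(frozenset({p, child}))
        for p in pa:                       # marry parents of a common child
            for q in pa:
                if p != q:
                    edges.add(frozenset({p, q}))
    return edges

dag = {'X1': set(), 'X2': {'X1'}, 'X3': {'X2'},
       'X4': {'X2', 'X3'}, 'X5': {'X3'}}
print(sorted(tuple(sorted(e)) for e in markov_network(dag)))
# [('X1', 'X2'), ('X2', 'X3'), ('X2', 'X4'), ('X3', 'X4'), ('X3', 'X5')]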
I A Markov network (also known as a Markov Random Field, MRF) is
another very popular type of graphical model used to represent
dependencies (although less popular than the BN)
I A Markov network is less informative than a Bayesian network:
many Bayesian networks have the same equivalent Markov
network
I Both have pros and cons when it comes to modelling a specific
problem
I Neither Bayes nor Markov had anything to do with the
development of these models!
I We will focus on BNs, but there are very strong links between
these two types of graphical models