Week 4


Bayesian Statistics and Decision Theory

ST301/ST413, Chapter 4

20 October, 2022
4. Bayesian Networks
Question: How do we structure problems when they get big?
Answer: Build the model around a logical structure!
Note we are not focusing, at this stage, on learning from data; here
we just concentrate on building models that might make sense to
describe the problem under consideration. The next chapter will
combine this with learning from data.
4.1 Rational Thinking About Relevance
I We describe a problem through a set of variables
I Natural to describe situation in terms of relevance between
variables
I If we measure a variable X , is this information relevant to
forecast another variable Y ?
I For a statistician, knowing X is of no “relevance” to Y is
equivalent to saying that Y is independent of X
I For example, if Y ∈ {Y1 , Y2 } where Yd , d = 1, 2, is the reward
after taking decision d, and X is irrelevant for Y , written
Y q X , then there is no point in measuring X
I Saying X is irrelevant to predictions about Y does not require
us to commit to any quantitative statement: it is purely
qualitative.
Irrelevance and independence are too simple to express complex
dependence relationships between many variables. But conditional
analogues work!
Let (X , Y , Z ) be (vectors of) variables.
Definition A measurement X is irrelevant for predicting Y given the
measurement Z (written Y q X |Z ) if we believe that once we learn
Z then X will provide no extra useful information to predict Y .
I In Bayesian statistics we express our beliefs as a joint
probability mass or density function p(x , y , z) on (X , Y , Z ).
I So Y q X |Z iff the conditional density p(y |x , z) of Y |X , Z
doesn’t depend on x
I This is equivalent to ∀(x , y , z),

p(y |x , z) = p(y |z)

I Equivalently ∀(x , y , z),

p(x , y , z) = p(y |z)p(x |z)p(z)

I Here we consider that conditional irrelevance of information in


(X , Y , Z ) is the same as conditional independence of the
random variables (X , Y , Z ), respectively, but there are other
ways to define irrelevance
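To make this concrete, here is a minimal Python sketch that checks Y q X |Z numerically by testing whether p(y |x , z) = p(y |z) wherever p(x , z) > 0. The toy pmf below is invented purely for illustration (it is built to factorise as p(z)p(x |z)p(y |z), so the check returns True).

from itertools import product

# Toy joint pmf over three binary variables, constructed so that Y q X | Z holds
p_z = {0: 0.4, 1: 0.6}
p_x_given_z = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}   # p_x_given_z[z][x]
p_y_given_z = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.9, 1: 0.1}}   # p_y_given_z[z][y]
p = {(x, y, z): p_z[z] * p_x_given_z[z][x] * p_y_given_z[z][y]
     for x, y, z in product([0, 1], repeat=3)}

def y_indep_x_given_z(p, tol=1e-12):
    """Return True iff p(y | x, z) = p(y | z) for every (x, y, z) with p(x, z) > 0."""
    for x, y, z in p:
        p_xz = sum(p[(x, yy, z)] for yy in (0, 1))                        # p(x, z)
        p_yz = sum(p[(xx, y, z)] for xx in (0, 1))                        # p(y, z)
        pz = sum(p[(xx, yy, z)] for xx in (0, 1) for yy in (0, 1))        # p(z)
        if p_xz > 0 and abs(p[(x, y, z)] / p_xz - p_yz / pz) > tol:
            return False
    return True

print(y_indep_x_given_z(p))   # True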
Axioms of irrelevance
I Symmetry
X q Y |Z ⇔ Y q X |Z

I Decomposition

X q Y , W |Z ⇒ X q Y |Z

I Weak union

X q Y , W |Z ⇒ X q W |Z , Y

I Contraction

X q Y |Z and X q W |Z , Y ⇒ X q Y , W |Z

I Perfect decomposition

X q Y |Z and X q W |Z , Y ⇔ X q Y , W |Z
I Symmetry states that, in any state of knowledge Z , if Y tells
us nothing new about X then X tells us nothing new about Y .
I Decomposition asserts that if two combined items of
information are judged irrelevant to X , then each separate item
is irrelevant as well.
I Weak union states that learning irrelevant information Y can’t
help the irrelevant information W become relevant to X .
I Contraction states that if we judge W irrelevant to X after
learning some irrelevant information Y , then W must have
been irrelevant before we learned Y .
I Perfect decomposition is the combination of decomposition,
weak union and contraction.
I These axioms define a structure called a semi-graphoid
I It is easy to prove that these axioms are satisfied if we consider
that conditional irrelevance is equal to conditional
independence. . .
Proof of symmetry

X q Y |Z ⇔ p(x , y |z) = p(x |z)p(y |z)


and
Y q X |Z ⇔ p(y , x |z) = p(y |z)p(x |z)
Since p(x , y |z) = p(y , x |z), the two conditions on the right are identical, so each statement implies the other.
Proof of decomposition

X q Y , W |Z ⇒ p(x , y , w |z) = p(x |z)p(y , w |z)

⇒ Σw p(x , y , w |z) = p(x |z) Σw p(y , w |z)     (summing over w )

⇒ p(x , y |z) = p(x |z)p(y |z) ⇒ X q Y |Z


Proof of weak union

X q Y , W |Z ⇒ p(x , y , w |z) = p(x |z)p(y , w |z)


⇒ p(y |z)p(x , w |z, y ) = p(x |z)p(y |z)p(w |y , z)
⇒ p(x , w |z, y ) = p(x |z)p(w |y , z)
⇒ p(x , w |z, y ) = p(x |z, y )p(w |y , z) ⇒ X q W |Z , Y
The last line uses p(x |z) = p(x |z, y ) because X q Y |Z obtained by
decomposition of X q Y , W |Z .
Proof of contraction
We have X q Y |Z and X q W |Z , Y and need to show X q Y , W |Z
By chain rule:

p(x , y , w |z) = p(x |z)p(y |z, x )p(w |x , y , z)

= p(x |z)p(y |z)p(w |y , z)     (using X q Y |Z and then X q W |Z , Y )
= p(x |z)p(y , w |z)


⇒ X q Y , W |Z

I We have proved all properties above


I You can now tackle Exercise Sheet 4.
4.2 Bayesian Networks
I Bayesian Network
I Bayes Net
I Probabilistic Expert System
4.2.1 What a Bayesian Network does
I Graph evocative and understandable by DM, so feels owned
I Provides a faithful picture of links the DM believes exist between
features of the problem
I Corresponds to a set of statements about relevance which obey
the properties above, so graph has a logical integrity
I Admits embellishment into full probability description
I Can be used to guide Bayesian learning and fast computation
4.2.2 DAGs and Bayesian Networks
Let X1 , X2 , ..., Xn be discrete random variables with probability mass
function p(x). Then using the chain rule:

p(x) = p(x1 )p(x2 |x1 )p(x3 |x1 , x2 )...p(xn |x1 , .., xn−1 )

Suppose that for 2 ≤ i ≤ n

p(xi |x1 , .., xi−1 ) = p(xi |x Qi )

where x Qi ⊆ {x1 , ..., xi−1 }. Let x Ri = {x1 , ..., xi−1 } \x Qi .


Then these factorisations imply for 2 ≤ i ≤ n,

Xi q X Ri |X Qi
Definition
The Directed Acyclic Graph (DAG) associated with (n − 1)
irrelevance statements like those above has nodes {X1 , X2 , ..., Xn }
and an edge from Xi to Xj if Xi ∈ X Qj . The elements of X Qj are
the parents of Xj , and Xj is a child of each Xi ∈ X Qj . Any DAG
which has associated irrelevance statements deemed true by the DM
is valid. A Bayesian Network (BN) is a DAG together with its
associated factorised mass function.
Example
Suppose measurements (X1 , X2 , X3 , X4 , X5 ) have mass function

p(x1 , x2 , x3 , x4 , x5 ) = p(x1 )p(x2 |x1 )p(x3 |x2 )p(x4 |x3 , x2 )p(x5 |x3 )

p(x2 |x1 ) fn of (x1 , x2 ) =⇒ X Q2 = {X1 }, X R2 = ∅


p(x3 |x1 , x2 ) fn of (x2 , x3 ) =⇒ X Q3 = {X2 }, X R3 = {X1 }
p(x4 |x1 , x2 , x3 ) fn of (x2 , x3 , x4 ) =⇒ X Q4 = {X2 , X3 }, X R4 = {X1 }
p(x5 |x1 , x2 , x3 , x4 ) fn of (x3 , x5 ) =⇒ X Q5 = {X3 }, X R5 = {X1 , X2 , X4 }

X1 → X2 → X3 → X5
↓ .
X4
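A convenient computer representation of this BN is a dictionary of parent sets X Qi together with one conditional probability table per node; the joint mass function is then just the product of the corresponding table entries. Below is a minimal Python sketch for the example DAG, with arbitrary placeholder numbers standing in for the (unspecified) conditional probabilities.

parents = {"X1": [], "X2": ["X1"], "X3": ["X2"], "X4": ["X2", "X3"], "X5": ["X3"]}

# cpt[v][(value of v, values of its parents)] = conditional probability.
# All variables are taken to be binary and the numbers are placeholders only.
cpt = {
    "X1": {(0, ()): 0.6, (1, ()): 0.4},
    "X2": {(x2, (x1,)): (0.7 if x2 == x1 else 0.3) for x1 in (0, 1) for x2 in (0, 1)},
    "X3": {(x3, (x2,)): (0.8 if x3 == x2 else 0.2) for x2 in (0, 1) for x3 in (0, 1)},
    "X4": {(x4, (x2, x3)): (0.9 if x4 == x2 * x3 else 0.1)
           for x2 in (0, 1) for x3 in (0, 1) for x4 in (0, 1)},
    "X5": {(x5, (x3,)): 0.5 for x3 in (0, 1) for x5 in (0, 1)},
}

def joint(x):
    """p(x) = p(x1) p(x2|x1) p(x3|x2) p(x4|x3, x2) p(x5|x3)."""
    prob = 1.0
    for v, pa in parents.items():
        prob *= cpt[v][(x[v], tuple(x[u] for u in pa))]
    return prob

print(joint({"X1": 1, "X2": 1, "X3": 0, "X4": 0, "X5": 1}))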
Note:
1. The DAG of a given set of irrelevance statements is always
acyclic because parents are listed before children.
2. Absence of edges is informative. A complete graph says nothing (i.e. the
graph with all edges from Xj to Xi , 1 ≤ j < i ≤ n).
Example
The DAG for the Naive Bayes model has Y as the sole parent of every symptom, i.e. Y → Xi for each i (and no edges between the Xi ).
I This is the model we used in the introduction chapter, with Y
being disease and Xi the symptoms
I How many parameters are there in this model?
I Let n be number of diseases (ie number of levels in Y ) and m
be number of binary symptoms (so each Xi has only two levels)
I Naive Bayes model needs n − 1 probabilities to specify p(Y )
and for each disease y it needs m probabilities to specify
p(X1 , ..., Xm |Y = y ). Total is mn + n − 1 probabilities.
I In the full model (no irrelevance) we need n − 1 probabilities
for p(Y ) and for each disease y 2m − 1 probabilities for
p(X1 , ..., Xm |Y = y ). Total is 2m n − 1 probabilities.
I For example with m = 8 binary symptoms and n = 10 illnesses,
Naive Bayes needs 89 probabilities, full model needs 2,559.
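The two parameter counts are quick to verify; the small sketch below simply evaluates the formulas mn + n − 1 and 2^m n − 1 for the quoted figures m = 8, n = 10.

def naive_bayes_params(m, n):
    # n - 1 probabilities for p(Y), plus m probabilities p(X_i = 1 | Y = y) per disease
    return (n - 1) + m * n

def full_model_params(m, n):
    # n - 1 probabilities for p(Y), plus 2**m - 1 probabilities per disease; equals 2**m * n - 1
    return (n - 1) + (2**m - 1) * n

print(naive_bayes_params(8, 10))   # 89
print(full_model_params(8, 10))    # 2559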
Example: Hidden Markov Model
C1 → C2 → C3 → · · · → CT−1 → CT        (hidden Markov chain)
↓      ↓      ↓                  ↓          ↓
X1    X2    X3    · · ·     XT−1    XT        (emissions)

p(X1 , ..., XT , C1 , ..., CT ) = p(C1 ) ∏_{k=2}^{T} p(Ck |Ck−1 ) ∏_{k=1}^{T} p(Xk |Ck )

I Ct is discrete and evolves according to a Markov chain


I Xt can be discrete, continuous, multivariate, etc
I Full study of this model (also called state-space model) in
ST337
Example of Hidden Markov Model
I We can build a HMM where pressure (high vs low) is the
hidden state and rain (yes or no) is the emission

I Since the emissions are distributed according to Bernoulli
distributions, this model is called a Bernoulli-HMM
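As an illustration of the factorisation, the sketch below evaluates the joint probability of a short hidden-state/rain sequence for a Bernoulli-HMM; the initial, transition and emission probabilities are invented numbers, not part of the notes.

# Hidden states: 0 = low pressure, 1 = high pressure.  Emissions: 0 = no rain, 1 = rain.
p_c1 = [0.5, 0.5]                        # p(C_1 = c)
p_trans = [[0.7, 0.3], [0.2, 0.8]]       # p(C_k = j | C_{k-1} = i) = p_trans[i][j]
p_emit = [[0.2, 0.8], [0.9, 0.1]]        # p(X_k = x | C_k = c)    = p_emit[c][x]

def hmm_joint(c, x):
    """p(x_1..x_T, c_1..c_T) = p(c_1) prod_{k=2}^T p(c_k|c_{k-1}) prod_{k=1}^T p(x_k|c_k)."""
    prob = p_c1[c[0]]
    for k in range(1, len(c)):
        prob *= p_trans[c[k - 1]][c[k]]
    for k in range(len(c)):
        prob *= p_emit[c[k]][x[k]]
    return prob

print(hmm_joint(c=[0, 0, 1], x=[1, 1, 0]))   # low, low, high pressure; rain, rain, dry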
4.2.3 Interpreting a DAG as irrelevance statements
Theorem
If DAG G∗ is derived from a valid DAG G by adding an edge from Xj
to Xi , 1 ≤ j < i ≤ n, then G∗ is also valid.
Proof.
Conditional statements of G and G∗ coincide except that

X Qi∗ = (X Qi , Xj )

(X Ri∗ , Xj ) = X Ri

Since G is valid we have

Xi q X Ri |X Qi

Using the second equality above, this is equivalent to:

Xi q (X Ri∗ , Xj )|X Qi

By weak union we get

Xi q X Ri∗ |(X Qi , Xj )

Using the first equality above, we end up with:
Xi q X Ri∗ |X Qi∗ ⇒ G∗ valid

Another factorisation - the same DAG
If in the problem above we elicit in the order (X1 , X2 , X3 , X5 , X4 ):

p(x2 |x1 ) is fn of (x1 , x2 )


p(x3 |x1 , x2 ) is fn of (x2 , x3 )
p(x5 |x1 , x2 , x3 ) is fn of (x3 , x5 )
p(x4 |x1 , x2 , x3 , x5 ) is fn of (x2 , x3 , x4 )

The DAG is the same as in the earlier example, i.e.

X1 → X2 → X3 → X5
↓ .
X4
Question: Do two identical DAGs say different things depending on
the order in which we elicit the relationships?
Answer: No! If two DAGs are identical then they represent
equivalent collections of irrelevance statements!
More remarkably, we can discover all logically deducible irrelevance
statements directly from the DAG. First we need to define some
terms from graph theory.
I Let G denote a directed graph. A vertex X is a parent of a
vertex Y , and Y is a child of X in G, iff there is a directed
edge X → Y from X to Y in G.
I Vertex Z is an ancestor of Y in G if Z = Y or there is a
directed path in G from Z to Y .
I Let X be a subset of the vertices V (G) of G. The ancestral set
of X - denoted by A(X) - is set of all the vertices in V (G) that
are ancestors of a vertex in X.
I The ancestral graph

G(A(X)) = (V (G(A(X))), E (G(A(X))))

has vertex set V (G(A(X))) = A(X) and edge set

E (G(A(X))) = {e = Xe → Ye ∈ E (G) : Xe , Ye ∈ A(X)}

I A graph is mixed if it has directed and undirected edges.
I The moralised graph G M of G has the same vertex set and set
of directed edges as G but has an undirected edge between any
two vertices Xi , Xj ∈ V (G) when there is no directed edge
between them in G but they are parents of the same child.
I If G M = G then G is said to be decomposable.
I The skeleton S(H) of a mixed graph H is one with the same
vertex set as H and an undirected edge between Xi and Xj if
and only if there is a directed or undirected edge between Xi
and Xj in H. (So just make directed edges undirected.)
I Let A, B, C be any 3 disjoint subsets of {1, 2, . . . , n} and
X A , X B , X C the corresponding sets of the vertices V (S) of an
undirected graph S. Then X B is said to separate X C from X A
in S iff every path from an Xa ∈ X A to an Xc ∈ X C passes
through at least one Xb ∈ X B .
Example
X1 → X5 → X6
% %
X2 → X4 → X7
↑ &
G X3 X8

Find ancestor set of {X5 , X7 }


A = A({X5 , X7 }) = {X1 , X2 , X3 , X4 , X5 , X7 }
X1 → X5
%
X2 → X4 → X7

G(A) X3
Moralise G
X1 → X5 → X6
| % | %
X2 → X4 → X7
 ↑ &
GM X3 X8
Moralise G(A)
X1 → X5
| %
X2 → X4 → X7
 ↑
G M (A) X3

Note for example that {X5 } separates {X1 } and {X2 } in the
skeleton of G, but NOT in the skeleton of G M (A).
The d-Separation Theorem
This allows all valid deductions to be read directly from the DAG.
It can be proved using only the properties at the start of this
chapter, and so applies even outside a Bayesian framework.
Theorem
Let A, B, C be any three disjoint subsets of {1, 2, . . . , n}, and G be
a valid DAG whose vertices V (G) = {X1 , X2 , . . . , Xn }. If X B
separates X C from X A in the skeleton of the moralised graph
G M (A(X A∪B∪C )) of the ancestral graph G(A(X A∪B∪C )), then

X C q X A |X B

On the other hand, if the graphical condition above doesn’t hold


then X C q X A |X B is not asserted (but not excluded either).
Example

X8 ← X7 → X9
↑ % ↑ %
X5 X6
↑ ↑
X4 X3
↑ % ↑
X2 ← X1

Check whether (X4 , X5 ) q (X3 , X6 )|(X2 , X7 ) or X4 q X6 |(X2 , X5 ) are


valid deductions.
Ancestor set of {X2 , X3 , X4 , X5 , X6 , X7 } is
{X1 , X2 , X3 , X4 , X5 , X6 , X7 }. So moralising ancestral graph and
forming the skeleton gives us

X7 X7
% ↑  |
X5 − X6 X5 − X6
↑ ↑ to | |
X4 X3 X4 X3
↑ % ↑ |  |
X2 ← X1 X2 − X1

I (X2 , X7 ) do not block all paths between (X4 , X5 ) and (X3 , X6 )


I The path X5 − X6 does not pass through either X2 or X7
I So cannot conclude from this DAG that
(X4 , X5 ) q (X3 , X6 )|(X2 , X7 )
On the other hand, to check X4 q X6 |X2 , X5 note that this
collection of variables has ancestor set {X1 , X2 , X3 , X4 , X5 , X6 }.
Now it is not necessary to moralise X5 − X6 because their child is
not in the ancestor set.
So here the skeleton we need is simply

X5 X6
| |
X4 X3
|  |
X2 − X1

Clearly (X2 , X5 ) separates all paths between X4 and X6 . So we can


conclude from this that X4 q X6 |X2 , X5 .
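The recipe used in this example (restrict to the ancestral set, moralise, take the skeleton, check separation) is mechanical, so it can be coded directly. Below is an illustrative Python sketch, applied not to the nine-node graph above but to the five-node example DAG of Section 4.2.2 (X1 → X2 → X3 → X5 with X2 → X4 and X3 → X4); the breadth-first search simply refuses to walk through the conditioning set.

from collections import deque

dag = {"X1": [], "X2": ["X1"], "X3": ["X2"], "X4": ["X2", "X3"], "X5": ["X3"]}  # node -> parents

def ancestors(nodes):
    """Ancestral set A(nodes): the nodes themselves plus all vertices with a directed path into them."""
    result, stack = set(nodes), list(nodes)
    while stack:
        for p in dag[stack.pop()]:
            if p not in result:
                result.add(p)
                stack.append(p)
    return result

def moral_skeleton(keep):
    """Skeleton of the moralised graph restricted to the vertex set `keep`."""
    adj = {v: set() for v in keep}
    for child in keep:
        pa = [p for p in dag[child] if p in keep]
        for p in pa:                                  # drop edge directions
            adj[p].add(child); adj[child].add(p)
        for i in range(len(pa)):                      # "marry" the parents of each child
            for j in range(i + 1, len(pa)):
                adj[pa[i]].add(pa[j]); adj[pa[j]].add(pa[i])
    return adj

def separated(A, B, C):
    """Does X_B separate X_C from X_A in the moral skeleton of the ancestral graph of A, B, C?"""
    adj = moral_skeleton(ancestors(set(A) | set(B) | set(C)))
    seen, queue = set(A), deque(A)
    while queue:
        v = queue.popleft()
        if v in C:
            return False
        for w in adj[v]:
            if w not in seen and w not in B:          # paths may not pass through B
                seen.add(w); queue.append(w)
    return True

print(separated({"X1"}, {"X3"}, {"X5"}))   # True:  X1 q X5 | X3 is a valid deduction
print(separated({"X1"}, {"X4"}, {"X5"}))   # False: X1 q X5 | X4 cannot be deduced this way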
Alternative definition of d-separation.
A path p is said to be d-separated by a set of nodes Z if and only if
1. p contains a chain i → m → j or a fork i ← m → j such that
the middle node m is in Z , or
2. p contains an inverted fork (or collider) i → m ← j such that
the middle node m is not in Z and no descendant of m is in Z .
A set Z is said to d-separate X from Y if and only if Z d-separates
every path from a node in X to a node in Y .
Local Markov property.
I This property is implied by the construction of the Bayesian
Network
I It is also a special case of the d-separation theorem
I Each variable is conditionally independent of its
non-descendants given its parent variables
I For all variables A ∈ V :

A q (V − de(A) − A)|pa(A)
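For the five-node example DAG this says, for instance, that X5 q (X1 , X2 , X4 )|X3 . Reusing the dag dictionary from the separation sketch above (illustrative code, not part of the notes), the non-descendants and parents of a node are easy to list:

def descendants(node):
    """All vertices reachable from `node` along directed edges."""
    children = {v: [c for c in dag if v in dag[c]] for v in dag}
    out, stack = set(), [node]
    while stack:
        for c in children[stack.pop()]:
            if c not in out:
                out.add(c); stack.append(c)
    return out

node = "X5"
non_descendants = set(dag) - descendants(node) - {node}
print(sorted(non_descendants - set(dag[node])), "are irrelevant to", node, "given parents", dag[node])
# ['X1', 'X2', 'X4'] are irrelevant to X5 given parents ['X3']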
Markov blanket property.
I This is another special case of d-separation.
I The Markov blanket mb(V ) of a node V is defined as the set of
its parents, its children and the other parents of its children
I Given its Markov blanket, a node is independent of the rest of
the network
I For all variables A ∈ V :

A q (V − mb(A) − A)|mb(A)
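With the same dag dictionary as above (again an illustrative helper, not course code), the Markov blanket is just parents ∪ children ∪ the children's other parents:

def markov_blanket(node):
    children = [c for c in dag if node in dag[c]]
    mb = set(dag[node]) | set(children)               # parents and children
    for c in children:                                # co-parents: other parents of each child
        mb |= set(dag[c])
    mb.discard(node)
    return mb

print(markov_blanket("X3"))   # {'X2', 'X4', 'X5'}: parent X2, children X4 and X5, co-parent X2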
4.2.4 Building a Bayesian network
Example
A DM needs to predict the costs of programmes of work. An
employee gives a ballpark estimate B of the cost of each potential
scheme. The DM produces a detailed estimate E and then decides the
programme he puts out to tender. The lowest bid T wins the work. The
final cost ultimately charged is O.
I DM says he uses E to estimate the probability of the winning
tender price T
I This in turn is used to predict final cost O
I Therefore his DAG is

B → E → T → O
Queries of the Analyst

B → E → T → O

Two irrelevance statements given as queries to DM: T q B|E and


O q (B, E )|T .
I “Can you think of any scenario where, when estimating the
tender price, the ballpark estimate might help refine a submitted
detailed estimate?”, and
I “Are there any scenarios where the detailed or ballpark estimate
might provide additional useful information about the final cost
over and above that provided by the winning tender price?”
The DM recalls that when the tender price is low, the final cost is
often higher than the tender price reflects.
Reason: In periods of economic recession contractors submit
artificially low tenders to secure work. Then they find many spurious
“add-ons” to boost O.
New DAG reflecting this comment
Let I be an index of the amount of work available in the market

B → E → T → O
↑ %
G I

I (B, E ) q I - the estimates do not take into account the variation in
the market as reflected by I.
I T q B| (I, E ) - tender price is independent of ballpark estimate
once the detailed estimate and the index - reflecting any
deflation or inflation because of lack or abundance of work - is
taken into account.
I O q (B, E )| (I, T ) two original estimates are uninformative
about the final cost once availability of work is factored in.
Another question: “If you had to resurrect the rough estimate of a
project because it had been lost, and you had your detailed estimate
available, would the detailed estimate alone be sufficient to
resurrect the rough estimate as accurately as possible?” (note
T q B|E ⇒ B q T |E )
4.2.5 Equivalent DAGs
I Two different DAGs can make the same irrelevance statements.
I For example the first three DAGs below embody just the
irrelevance statement X q Y |Z (⇔ Y q X |Z ).

X ← Z → Y        X → Z → Y        X ← Z ← Y        X → Z ← Y

I The fourth graph however is not equivalent! It represents
X q Y but not X q Y |Z .
I Two DAGs imply the same irrelevance statements iff they share
the same “pattern”.
Definition
The pattern P of a DAG G = (V (G), E (G)) is a mixed graph with
the same vertex set V (G) in which each directed edge Xi → Y ∈ E (G)
is replaced by the undirected edge Xi − Y iff there is no other parent
Xj of Y which is not connected to or from Xi by an edge.
Alternate phrasing: All edges lose their direction EXCEPT those
going from multiple unmarried parents of a child to that child.
Pattern is also sometimes called CPDAG (completed partially
directed acyclic graph) or equivalence class.
Theorem
Two DAGs are equivalent iff they have the same skeleton and the same
configurations of unmarried parents, i.e. they have the same pattern.
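This criterion is easy to check by machine: compare the skeletons and the sets of "unmarried parent" configurations (v-structures). A minimal sketch, again representing each DAG as a node → parents dictionary; the three-node DAGs used to test it are the fork, chain and collider from the slide above.

from itertools import combinations

def skeleton(dag):
    return {frozenset((p, c)) for c in dag for p in dag[c]}

def v_structures(dag):
    """Triples (Xi, child, Xj) in which Xi and Xj are unmarried parents of the child."""
    skel, out = skeleton(dag), set()
    for child, pa in dag.items():
        for a, b in combinations(sorted(pa), 2):
            if frozenset((a, b)) not in skel:         # the two parents are not joined by an edge
                out.add((a, child, b))
    return out

def equivalent(g1, g2):
    return skeleton(g1) == skeleton(g2) and v_structures(g1) == v_structures(g2)

fork     = {"Z": [], "X": ["Z"], "Y": ["Z"]}      # X <- Z -> Y
chain    = {"X": [], "Z": ["X"], "Y": ["Z"]}      # X -> Z -> Y
collider = {"X": [], "Y": [], "Z": ["X", "Y"]}    # X -> Z <- Y
print(equivalent(fork, chain))      # True: same skeleton, no v-structures
print(equivalent(fork, collider))   # False: the collider has v-structure X -> Z <- Y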
Example
DAGs G and G 0 have the same pattern P and so are equivalent.

X8 ← X7 → X9 X8 − X7 − X9
↑ % ↑ % | % ↑ 
X5 X6 X5 X6
↑ ↑ to | |
X4 X3 X4 X3
↑ % ↑ |  |
X2 ← X1 G X2 − X1 P
and

X8 ← X7 → X9 X8 − X7 − X9
↑ % ↑ % | % ↑ 
X5 X6 X5 X6
↑ ↑ to | |
X4 X3 X4 X3
↑ . ↓ |  |
X2 → X1 G0 X2 − X1 P
Markov network
If we moralise a DAG G and then take the skeleton, we obtain the
equivalent Markov network M:

X8 ← X7 → X9 X8 − X7 − X9
↑ % ↑ % |  | 
X5 X6 X5 − X6
↑ ↑ to | |
X4 X3 X4 X3
↑ % ↑ |  |
X2 ← X1 G X2 − X1 M
I A Markov network (aka Markov Random Field, MRF) is another
popular type of graphical model used to represent dependencies
(though less widely used than BNs)
I A Markov network is less informative than a Bayesian network:
many Bayesian networks have the same equivalent Markov
network
I Both have pros and cons when it comes to modelling a specific
problem
I Neither Bayes nor Markov had anything to do with the
development of these models!
I We will focus on BN, but very strong links between these two
types of graphical models
