BookSlides 6A Probability-Based Learning PDF
1 Big Idea
2 Fundamentals
  Bayes’ Theorem
  Bayesian Prediction
  Conditional Independence and Factorization
3 Standard Approach: The Naive Bayes’ Classifier
4 Summary
Big Idea
Figure: A game of find the lady: (a) the cards dealt face down on a table; and (b) the initial likelihoods of the queen ending up in each position.
Figure: A game of find the lady: (a) the cards dealt face down on a table; and (b) a revised set of likelihoods for the position of the queen based on evidence collected.
Figure: A game of find the lady: (a) the set of cards after the wind blows over the one on the right; and (b) the revised likelihoods for the position of the queen based on this new evidence.
Figure: A game of find the lady: the final positions of the cards in the game.
We can use estimates of likelihoods to determine the most likely prediction that should be made. More importantly, we revise these predictions as we collect data and whenever extra evidence becomes available.
Fundamentals
Bayes’ Theorem
P(X | Y) = (P(Y | X) × P(X)) / P(Y)
Example
After a yearly checkup, a doctor informs their patient that there is both bad news and good news. The bad news is that the patient has tested positive for a serious disease and that the test the doctor used is 99% accurate (i.e., the probability of testing positive when a patient has the disease is 0.99, as is the probability of testing negative when a patient does not have the disease). The good news, however, is that the disease is extremely rare, striking only 1 in 10,000 people.
Applying Bayes’ Theorem, with d for having the disease and t for testing positive:

P(d | t) = (P(t | d) × P(d)) / P(t)

where, by the total probability rule, P(t) = P(t | d) × P(d) + P(t | ¬d) × P(¬d) = 0.99 × 0.0001 + 0.01 × 0.9999 ≈ 0.0101, so

P(d | t) = (0.99 × 0.0001) / 0.0101 = 0.0098
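As a quick check, here is a minimal Python sketch of this calculation (the variable names are illustrative, not from the slides); it computes P(t) by the total probability rule and then applies Bayes’ Theorem:

    # Diagnostic test example
    p_d = 0.0001            # prior: the disease strikes 1 in 10,000 people
    p_t_given_d = 0.99      # P(test positive | disease)
    p_t_given_not_d = 0.01  # P(test positive | no disease), since the test is 99% accurate

    # Total probability: P(t) = P(t|d)P(d) + P(t|not d)P(not d)
    p_t = p_t_given_d * p_d + p_t_given_not_d * (1 - p_d)

    # Bayes' Theorem: P(d|t) = P(t|d)P(d) / P(t)
    p_d_given_t = p_t_given_d * p_d / p_t

    print(round(p_t, 4))          # ~0.0101
    print(round(p_d_given_t, 4))  # ~0.0098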
Bayes’ Theorem follows from the product rule. Dividing both sides of P(X | Y) × P(Y) = P(Y | X) × P(X) by P(Y):

(P(X | Y) × P(Y)) / P(Y) = (P(Y | X) × P(X)) / P(Y)
⇒ P(X | Y) = (P(Y | X) × P(X)) / P(Y)
Like any probability, the result of Bayes’ Theorem satisfies:

0 ≤ P(X | Y) ≤ 1
Σ_i P(X_i | Y) = 1.0

where the X_i are the possible outcomes of X.
Bayesian Prediction
Chain Rule
P(q[1], ..., q[m]) = P(q[1]) × P(q[2] | q[1]) × ... × P(q[m] | q[m-1], ..., q[2], q[1])
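The chain rule can be verified numerically. Below is a small Python sketch on a hypothetical joint distribution over three binary events (the probabilities are invented purely for illustration): the product of the chained conditionals recovers the joint probability.

    from itertools import product

    # A hypothetical joint distribution over three binary events A, B, C.
    # The numbers are illustrative; they only need to sum to 1.
    joint = {outcome: 1 / 8 for outcome in product([True, False], repeat=3)}
    joint[(True, True, True)] = 0.2
    joint[(False, False, False)] = 0.05
    total = sum(joint.values())
    joint = {k: v / total for k, v in joint.items()}  # renormalise to sum to 1

    def prob(partial):
        """Probability of all outcomes consistent with the partial assignment."""
        return sum(p for outcome, p in joint.items()
                   if all(outcome[i] == v for i, v in partial.items()))

    # Chain rule: P(a, b, c) = P(a) * P(b | a) * P(c | b, a)
    a, b, c = True, True, False
    chained = (prob({0: a})
               * prob({0: a, 1: b}) / prob({0: a})
               * prob({0: a, 1: b, 2: c}) / prob({0: a, 1: b}))
    print(abs(chained - joint[(a, b, c)]) < 1e-12)  # True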
For a diagnosis example with evidence h (headache), ¬f (no fever), and v (vomiting) and target M (meningitis), the query is: P(M | h, ¬f, v) = ?
P(m) = |{d5, d8, d10}| / |{d1, d2, d3, d4, d5, d6, d7, d8, d9, d10}| = 3/10 = 0.3
P(h, ¬f, v) = |{d3, d4, d6, d7, d8, d10}| / |{d1, d2, d3, d4, d5, d6, d7, d8, d9, d10}| = 6/10 = 0.6
P(h, ¬f, v | m) = ?
P(m | h, ¬f, v) = 0.3333
P(¬m | h, ¬f, v) = 0.6667
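The same numbers can be reproduced with a short Python sketch that works directly from the instance sets listed above; the likelihood P(h, ¬f, v | m) comes from the overlap between the two sets ({d8, d10}).

    all_ids = {f"d{i}" for i in range(1, 11)}
    meningitis = {"d5", "d8", "d10"}                   # instances where m holds
    evidence = {"d3", "d4", "d6", "d7", "d8", "d10"}   # instances where h, not f, v hold

    p_m = len(meningitis) / len(all_ids)                        # 0.3
    p_e = len(evidence) / len(all_ids)                          # 0.6
    p_e_given_m = len(meningitis & evidence) / len(meningitis)  # 2/3

    # Bayes' Theorem: P(m | h, not f, v) = P(h, not f, v | m) * P(m) / P(h, not f, v)
    p_m_given_e = p_e_given_m * p_m / p_e
    print(round(p_m_given_e, 4))      # 0.3333
    print(round(1 - p_m_given_e, 4))  # 0.6667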
P(m | h, f, ¬v) = ?
P(¬m | h, f, ¬v) = ?
P(m | h, f, ¬v) = (P(h | m) × P(f | h, m) × P(¬v | f, h, m) × P(m)) / P(h, f, ¬v)
               = (0.6666 × 0 × 0 × 0.3) / 0.1
               = 0
P(¬m | h, f, ¬v) = (P(h | ¬m) × P(f | h, ¬m) × P(¬v | f, h, ¬m) × P(¬m)) / P(h, f, ¬v)
                = (0.7143 × 0.2 × 1.0 × 0.7) / 0.1
                = 1.0
P(m | h, f, ¬v) = 0
P(¬m | h, f, ¬v) = 1.0
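A Python sketch that plugs in the factor values shown above makes the problem concrete: a single zero-valued factor in the chain-rule expansion drives the whole posterior to zero, no matter what the other factors say.

    p_evidence = 0.1         # P(h, f, not v)

    # Factors for the m case, as given above
    p_h_given_m = 0.6666
    p_f_given_hm = 0.0       # zero: this combination never occurs in the training data
    p_nv_given_fhm = 0.0
    p_m = 0.3

    # Factors for the not-m case
    p_h_given_nm = 0.7143
    p_f_given_hnm = 0.2
    p_nv_given_fhnm = 1.0
    p_nm = 0.7

    p_m_given_e = p_h_given_m * p_f_given_hm * p_nv_given_fhm * p_m / p_evidence
    p_nm_given_e = p_h_given_nm * p_f_given_hnm * p_nv_given_fhnm * p_nm / p_evidence
    print(p_m_given_e)             # 0.0
    print(round(p_nm_given_e, 2))  # 1.0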
Curse of Dimensionality
As the number of descriptive features grows, the number of potential conditioning events grows exponentially. Consequently, the size of the dataset must grow exponentially as each new descriptive feature is added to ensure that, for any conditional probability, there are enough instances in the training dataset matching the conditions for the resulting probability estimate to be reasonable.
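A quick Python illustration of the growth (assuming, for simplicity, that every descriptive feature is binary): the number of distinct conditioning events doubles with every feature added, so the data needed to cover them grows exponentially.

    # With m binary descriptive features there are 2**m distinct combinations
    # of feature values that a probability can be conditioned on.
    for m in (1, 5, 10, 20, 30):
        print(m, 2 ** m)   # 2, 32, 1024, 1048576, 1073741824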
If two events X and Y are independent, then:

P(X | Y) = P(X)
P(X, Y) = P(X) × P(Y)

Recall that when two events are dependent, these rules are:

P(X | Y) = P(X, Y) / P(Y)
P(X, Y) = P(X | Y) × P(Y) = P(Y | X) × P(X)
Conditional Independence and Factorization
If X is conditionally independent of Y given knowledge of a third event Z, then:

P(X | Y, Z) = P(X | Z)
P(X, Y | Z) = P(X | Z) × P(Y | Z)

X and Y are dependent:
P(X | Y) = P(X, Y) / P(Y)
P(X, Y) = P(X | Y) × P(Y) = P(Y | X) × P(X)

X and Y are independent:
P(X | Y) = P(X)
P(X, Y) = P(X) × P(Y)
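A numeric sanity check in Python on a hypothetical distribution (the numbers are invented for illustration): Z influences both X and Y, so X and Y are dependent in the joint, yet once we condition on Z the equalities above hold.

    from itertools import product

    # Hypothetical generative model: Z causes both X and Y, and
    # X and Y are conditionally independent given Z.
    p_z = {1: 0.5, 0: 0.5}
    p_x_given_z = {1: 0.9, 0: 0.2}   # P(X=1 | Z=z)
    p_y_given_z = {1: 0.8, 0: 0.3}   # P(Y=1 | Z=z)

    def bern(p_true, value):
        return p_true if value == 1 else 1 - p_true

    joint = {(x, y, z): p_z[z] * bern(p_x_given_z[z], x) * bern(p_y_given_z[z], y)
             for x, y, z in product([0, 1], repeat=3)}

    def prob(x=None, y=None, z=None):
        return sum(p for (xx, yy, zz), p in joint.items()
                   if (x is None or xx == x)
                   and (y is None or yy == y)
                   and (z is None or zz == z))

    # Marginally, X and Y are dependent: P(X, Y) != P(X) * P(Y)
    print(round(prob(x=1, y=1), 4), round(prob(x=1) * prob(y=1), 4))  # 0.39 vs 0.3025

    # Conditioned on Z, they are independent: P(X | Y, Z) == P(X | Z)
    print(round(prob(x=1, y=1, z=1) / prob(y=1, z=1), 4),
          round(prob(x=1, z=1) / prob(z=1), 4))                       # 0.9 and 0.9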
If the descriptive features q[1], ..., q[m] are conditionally independent of one another given the target level t = l, the chain rule factorizes as:

P(q[1], ..., q[m] | t = l)
  = P(q[1] | t = l) × P(q[2] | t = l) × ... × P(q[m] | t = l)
  = ∏_{i=1}^{m} P(q[i] | t = l)
Applying Bayes’ Theorem with this factorization gives the naive Bayes’ prediction:

P(t = l | q[1], ..., q[m])
  = (∏_{i=1}^{m} P(q[i] | t = l)) × P(t = l) / P(q[1], ..., q[m])
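A minimal sketch of this scoring rule in Python (the function and argument names are illustrative): since the denominator P(q[1], ..., q[m]) is the same for every candidate level l, the numerators are computed and then normalised.

    def naive_bayes_posteriors(priors, conditionals, query):
        """priors:       {level: P(t = l)}
        conditionals: {level: {feature: {value: P(feature = value | t = l)}}}
        query:        {feature: value}
        Returns normalised posteriors P(t = l | query) under the
        conditional independence assumption."""
        scores = {}
        for level, prior in priors.items():
            score = prior
            for feature, value in query.items():
                score *= conditionals[level][feature].get(value, 0.0)
            scores[level] = score
        total = sum(scores.values())
        # Normalising plays the role of dividing by P(q[1], ..., q[m]).
        return {level: s / total for level, s in scores.items()} if total else scores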
Using this factorization, the earlier query now receives a non-zero probability for both outcomes:
P(m | h, f, ¬v) = 0.1948
P(¬m | h, f, ¬v) = 0.8052
A Worked Example
ID   CREDIT HISTORY   GUARANTOR/COAPPLICANT   ACCOMMODATION   FRAUD
1    current          none                    own             true
2    paid             none                    own             false
3    paid             none                    own             false
4    paid             guarantor               rent            true
5    arrears          none                    own             false
6    arrears          none                    own             true
7    current          none                    own             false
8    arrears          none                    own             false
9    current          none                    rent            false
10   none             none                    own             true
11   current          coapplicant             own             false
12   current          none                    own             true
13   current          none                    rent            true
14   paid             none                    own             false
15   arrears          none                    own             false
16   current          none                    own             false
17   arrears          coapplicant             rent            false
18   arrears          none                    free            false
19   arrears          none                    own             false
20   paid             none                    own             false
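As a sketch of how the naive Bayes’ probabilities are estimated from this table, the Python below computes the priors and per-feature conditionals by counting rows and then scores a hypothetical query (the query values CH=paid, GC=none, ACC=rent are chosen here purely for illustration):

    from collections import Counter, defaultdict

    # Rows of the table above: (CREDIT HISTORY, GUARANTOR/COAPPLICANT, ACCOMMODATION, FRAUD)
    data = [
        ("current", "none", "own", True), ("paid", "none", "own", False),
        ("paid", "none", "own", False), ("paid", "guarantor", "rent", True),
        ("arrears", "none", "own", False), ("arrears", "none", "own", True),
        ("current", "none", "own", False), ("arrears", "none", "own", False),
        ("current", "none", "rent", False), ("none", "none", "own", True),
        ("current", "coapplicant", "own", False), ("current", "none", "own", True),
        ("current", "none", "rent", True), ("paid", "none", "own", False),
        ("arrears", "none", "own", False), ("current", "none", "own", False),
        ("arrears", "coapplicant", "rent", False), ("arrears", "none", "free", False),
        ("arrears", "none", "own", False), ("paid", "none", "own", False),
    ]
    features = ("CH", "GC", "ACC")

    # Priors P(FRAUD = l) and conditional counts for P(feature = value | FRAUD = l)
    class_counts = Counter(row[-1] for row in data)
    priors = {label: n / len(data) for label, n in class_counts.items()}
    cond = {label: defaultdict(Counter) for label in class_counts}
    for *values, label in data:
        for feature, value in zip(features, values):
            cond[label][feature][value] += 1

    def p(feature, value, label):
        return cond[label][feature][value] / class_counts[label]

    # Hypothetical query, for illustration only
    query = {"CH": "paid", "GC": "none", "ACC": "rent"}
    scores = {label: priors[label] for label in priors}
    for label in scores:
        for feature, value in query.items():
            scores[label] *= p(feature, value, label)
    total = sum(scores.values())
    for label, score in scores.items():
        print(label, round(score / total, 4))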
Summary
P(t | d) = (P(d | t) × P(t)) / P(d)