Lec 02 Bayesian Decision Theoryv 2024

Random Variables
Outline
• Notation: Probability, Probability Mass, Probability Density
• Jointly random variables
• Conditional Probability and Independence
• Expectation
• Covariance Matrix
Notation: Probability
If an experiment is run an infinite number of times, the probability of
event A is the fraction of those times on which event A occurs.
Axiom 1: every event 𝐴 has a non-negative probability.

Pr 𝐴 ≥ 0
Axiom 2: If an event Ω always occurs, we say it has probability 1.
Pr(Ω) = 1
Axiom 3: probability measures behave like set measures.
Pr 𝐴⋁𝐵 = Pr 𝐴 + Pr 𝐵 − Pr(𝐴⋀𝐵)
Axiom 3: probability measures behave like set measures.
Area of the whole rectangle is 𝑃 Ω = 1.
Area of Area of
this circle
A B
this circle
is 𝑃 𝐴 . is 𝑃 𝐵 .
Area of their intersection is 𝑃 𝐴⋀𝐵 .

Area of their union is 𝑃 𝐴⋁𝐵 = 𝑃 𝐴 + 𝑃 𝐵 − 𝑃(𝐴⋀𝐵)
Notation: Random Variables
A random variable is a function that summarizes the output of an
experiment. We use capital letters to denote random variables.
• Example: every Friday, Maria brings a cake to her daughter’s pre-
school. 𝑋 is the number of children who eat the cake.
We use a small letter to denote a particular outcome of the

experiment.
• Example: for the last three weeks, each week, 5 children had cake,
but this week, only 4 children had cake. Estimate 𝑃(𝑋 = 𝑥) for all
possible values of 𝑥.
Notation: 𝑃(𝑋 = 𝑥) is a number, but 𝑃 𝑋 is a
distribution
• 𝑃(𝑋 = 4) or 𝑃(4) is the probability mass or probability density of the
outcome “𝑋 = 4.” For example:
1
𝑃(𝑋 = 4) =
4
• 𝑃 𝑋 is the complete distribution, specifying 𝑃(𝑋 = 𝑥) for all

possible values of 𝑥. For example:
𝑥 4 5
𝑃 𝑋 = 𝑃(𝑥) 1 3
4 4
Discrete versus Continuous RVs
• 𝑋 is a discrete random variable if it can only take countably many different
values.
• Example: 𝑋 is the number of people living in a randomly selected city
𝑋 ∈ {1,2,3,4, … }
• Example: 𝑋 is the first word on a randomly selected page
𝑋 ∈ the, and, of, bobcat, …
• Example: 𝑋 is the next emoji you will receive on your cellphone
𝑋 ∈ {😀 , 😃 , 😄 , 😁 , 😆 , 😅 , 🤣, … }
• 𝑋 is a continuous random variable if it can take uncountably many
different values
• Example: 𝑋 is the energy of the next object to collide with Earth
𝑋 ∈ ℝ! (the set of all positive real numbers)
Probability Mass Function (pmf) is a type of
probability
• If 𝑋 is a discrete random variable, then 𝑃(𝑋) is its probability mass
function (pmf).
• A probability mass is just a probability. 𝑃 𝑋 = 𝑥 = Pr 𝑋 = 𝑥 is the
just the probability of the outcome “𝑋 = 𝑥.” Thus:
0≤𝑃 𝑋=𝑥
1 = 6 𝑃(𝑋 = 𝑥)
!
Probability Density Function (pdf) is NOT a
probability
• If 𝑋 is a density random variable, then 𝑃(𝑋) is its probability density
function (pdf).
• A probability density is NOT a probability. Instead, we define it as a
"
density (𝑃 𝑋 = 𝑥 = Pr 𝑋 ≤ 𝑥 ). Thus:
"!
0≤𝑃 𝑋=𝑥
$
1 = 7 𝑃 𝑋 = 𝑥 𝑑𝑥
#$
Outline
• Expectation
Jointly Random Variables
• Two or three random variables are “jointly random” if they are both
outcomes of the same experiment.
• For example, here are the temperature (X , in °C), and precipitation
(Y, symbolic) for six days in Urbana:
X=Temperature (°C) Y=Precipitation
January 11 4 cloud
January 12 1 cloud
January 13 -2 snow
January 14 -3 cloud
January 15 -3 clear
January 16 4 rain
Joint Distributions
Based on the data on prev slide, here is an estimate of the joint
distribution of these two random variables:
Y
P(X=x,Y=y)
snow rain cloud clear
-3 0 0 1/6 1/6
-2 1/6 0 0 0
X
1 0 0 1/6 0
4 0 1/6 1/6 0
Notation: Vectors and Matrices
• A normal-font capital letter is a random variable, which is a function
mapping from the outcome of an experiment to a measurement
• A normal-font small letter is a scalar instance
• A boldface small letter is a vector instance
• A boldface capital letter is a matrix instance
Notation: Vectors and Matrices
𝑃(𝑋 = 𝒙) is the probability that random variable X takes the value of the
vector 𝒙. This is just a shorthand for the joint distribution of 𝑥! , ⋯ , 𝑥" :
𝑥!
𝒙= ⋮ , 𝑃 𝑋 = 𝒙 = 𝑃 𝑋! = 𝑥! , ⋯ , 𝑋" = 𝑥"
𝑥"
Similarly, for a random matrix, we could write:
𝑥!,! ⋯ 𝑥!,"
𝑿= ⋮ ⋱ ⋮ , 𝑃 𝑋 = 𝑿 = 𝑃 𝑋!,! = 𝑥!,! , … , 𝑋$," = 𝑥$,"
𝑥$,! ⋯ 𝑥$,"
Marginal Distributions
Suppose we know the joint distribution 𝑃 𝑋, 𝑌 . We want to find the
two marginal distributions 𝑃(𝑋) :
• If the unwanted variable is discrete, we marginalize by adding:
𝑃 𝑋 = 6 𝑃 𝑋, 𝑌 = 𝑦
%
• If the unwanted variable is continuous, we marginalize by integrating:

𝑃 𝑋 = 7 𝑃 𝑋, 𝑌 = 𝑦 𝑑𝑦
Marginal Distributions
Here are the marginal distributions of the two weather variables:
𝑃(𝑋, 𝑌) snow rain cloud clear 𝑃(𝑋)
-3 0 0 1/6 1/6 1/3
-2 1/6 0 0 0 1/6
1 0 0 1/6 0 0
4 0 1/6 1/6 0 1/3
𝑃(𝑌) 1/6 1/6 1/2 1/6

Outline
• Expectation
Joint and Conditional distributions
• 𝑃 𝑋, 𝑌 is the probability (or pdf) that 𝑋 = 𝑥 and 𝑌 = 𝑦, over all 𝑥
and 𝑦. This is called their joint distribution.
• 𝑃(𝑌|𝑋) is the probability (or pdf) that 𝑌 = 𝑦 happens, given that 𝑋 =

𝑥 happens, over all 𝑥 and 𝑦. This is called the conditional
distribution of 𝑌 given 𝑋.
Joint probabilities are usually given in the problem statement
Area of the whole rectangle is Pr Ω = 1.
Suppose Suppose
Pr 𝐴 = 0.4
A B Pr 𝐵 = 0.2
Suppose Pr 𝐴⋀𝐵 = 0.1

Conditioning events change our knowledge!
For example, given that A is true…
Most of the events in this rectangle are no longer possible!
Only the
events
inside this
circle are
now
A A B𝐴⋀𝐵
possible.
Conditioning events change our knowledge! The probability of
B, given A, is the
For example, given that A is true… size of the event
𝐴⋀𝐵, expressed as
a fraction of the
If A always occurs, then
size of the event
by the axioms of
A:
probability, the
probability of
A
Pr 𝐵 𝐴 =
A=T
is 1.
𝐴⋀𝐵
Pr(𝐴⋀𝐵)
We can
Pr(𝐴)
say that
Pr(𝐴|𝐴) = 1.
Joint and Conditional distributions of random
variables
• 𝑃 𝑋, 𝑌 is the joint probability distribution over all possible
outcomes 𝑃 𝑋 = 𝑥, 𝑌 = 𝑦 .
• 𝑃(𝑋|𝑌) is the conditional probability distribution of outcomes
𝑃(𝑋 = 𝑥|𝑌 = 𝑦).
• The conditional is the joint divided by the marginal:
𝑃 𝑋 = 𝑥, 𝑌 = 𝑦
𝑃 𝑋=𝑥𝑌=𝑦 =
𝑃 𝑌=𝑦
Conditional is the joint divided by the marginal:
1 1 1
𝑃 𝑋, 𝑌 = cloud 0
𝑃 𝑋 𝑌 = cloud = = 6 6 6
𝑃 𝑌 = cloud 1/2
snow rain cloud clear P(X|Y=cloud)

-3 0 0 1/6 1/6 1/3
-2 1/6 0 0 0 0
1 0 0 1/6 0 1/3
4 0 1/6 1/6 0 1/3
Joint = Conditional×Marginal
𝑃 𝑋, 𝑌 = 𝑃 𝑋 𝑌 𝑃(𝑌)
Independent Random Variables
Two random variables are said to be independent
if:
𝑃 𝑋𝑌 =𝑃 𝑋
In other words, knowing the value of 𝑌 tells you
nothing about the value of 𝑋.
… and a more useful definition of
independence…
Plugging the definition of independence,
𝑃 𝑋𝑌 =𝑃 𝑋 ,
…into the “Joint = Conditional×Marginal” equation,
…gives us a more useful definition of independence.
Definition of Independence: Two random variables, X and Y, are
independent if and only if
𝑃 𝑋, 𝑌 = 𝑃 𝑋 𝑃(𝑌)
Independent events
Independent events occur with equal probability, regardless of whether
or not the other event has occurred:
Pr 𝐴|𝐵 = Pr 𝐴
Pr(𝐴⋀𝐵) = Pr 𝐴 Pr(𝐵)
A B
Outline
• Expectation
Expectation
• The expected value of a function is its weighted average, weighted by
its pmf or pdf.
• If 𝑋 and 𝑌 are discrete, then
𝐸 𝑓(𝑋, 𝑌) = 6 𝑓 𝑥, 𝑦 𝑃(𝑋 = 𝑥, 𝑌 = 𝑦)
!,%
• If 𝑋 is continuous, then
$
𝐸 𝑓(𝑋, 𝑌) = A 𝑓 𝑥, 𝑦 𝑃 𝑋 = 𝑥, 𝑌 = 𝑦 𝑑𝑥𝑑𝑦
#$
Outline
• Expectation
Covariance
The covariance of two random
variables is the expected product of
their deviations:
Covar 𝑋, 𝑌
=𝐸 𝑋−𝐸 𝑋 𝑌−𝐸 𝑌
Two zero-mean random variables, with variances of

25, and with various values of covariance.
Public domain image,
https://commons.wikimedia.org/wiki/File:Varianz.gif
Covariance Matrix
Suppose 𝑋 = [𝑋! , … , 𝑋" ]% is a random vector. Its matrix of variances and
covariances (a.k.a. covariance matrix) is
Var 𝑋! ⋯ Covar 𝑋! , 𝑋"

% ⋮ ⋱ ⋮
Σ = 𝐸 𝑋 − 𝐸[𝑋] 𝑋 − 𝐸[𝑋] =
Covar 𝑋! , 𝑋" ⋯ Var 𝑋"
𝐸 𝑋! − 𝐸[𝑋! ] & ⋯ 𝐸 𝑋! − 𝐸[𝑋! ] 𝑋" − 𝐸[𝑋" ]

= ⋮ ⋱ ⋮
𝐸 𝑋! − 𝐸[𝑋! ] 𝑋" − 𝐸[𝑋" ] ⋯ 𝐸 𝑋" − 𝐸[𝑋" ] &
Summary
• Probability Mass and Probability Density
𝑑
𝑃 𝑋 = 𝑥 = Pr 𝑋 = 𝑥 … or … 𝑃 𝑋=𝑥 = Pr 𝑋 ≤ 𝑥
𝑑𝑥
𝑃 𝑋 = 𝒙 = 𝑃 𝑋! = 𝑥! , ⋯ , 𝑋" = 𝑥"
𝑃 𝑋 𝑌 = 𝑃 𝑋 ⟺ 𝑃 𝑋, 𝑌 = 𝑃 𝑋 𝑃(𝑌)
• Expectation
𝐸 𝑓(𝑋, 𝑌) = 7 𝑓 𝑥, 𝑦 𝑃(𝑋 = 𝑥, 𝑌 = 𝑦) … or …
#,% '
𝐸 𝑓(𝑋, 𝑌) = 9 𝑓 𝑥, 𝑦 𝑃 𝑋 = 𝑥, 𝑌 = 𝑦 𝑑𝑥𝑑𝑦
&'
• Mean, Variance and Covariance
Σ = 𝐸 𝑋 − 𝐸[𝑋] 𝑋 − 𝐸[𝑋] (
Decision Theory
Outline
• Decision Theory
• Minimum Probability of Error
• Bayes’ Rule
• Accuracy, Error Rate, and the Bayes Error Rate
• Confusion Matrix, Precision & Recall, Sensitivity & Specificity
• Train, Dev, and Test Corpora
Decision Theory
• Suppose we have an experiment with two random variables,
X and Y.
• X is something we can observe, like the words in an email.
• Y is something we can’t observe, but we want to know. For
example, Y=1 means the email is spam (junk mail), Y=0 means it’s
ham (desirable mail).
• Can we train an AI to read the email, and determine
whether it’s spam or not?
Decision Theory
• 𝑌 = the correct label
• 𝑌 = the correct label as a random variable (“in general”)
• 𝑦 = the label observed in a particular experiment (“in particular”)
• 𝑓(𝑋) = the decision that we make, after observing the datum, 𝑋
• 𝑓(𝑋) = the function applied to random variable 𝑋 (“in general”)
• 𝑓(𝑥) = the function applied to a particular value of 𝑥 (“in particular”)
Deciding how to Decide: Loss and Risk
• Suppose that deciding 𝑓(𝑥), when the correct label is 𝑌 = 𝑦, costs us
a certain amount of money (or prestige, or safety, or points, or
whatever) – call that the loss, 𝑙(𝑓(𝑥), 𝑦)
• In general, we would like to lose as few points as possible (negative
losses are good…)
• Define the risk, 𝑅(𝑓), to be the expected loss incurred by using the
decision rule 𝑓(𝑋):
𝑅(𝑓) = 𝐸 𝑙(𝑓 𝑋 , 𝑌) = - - 𝑙 𝑓 𝑥 , 𝑦 𝑃(𝑋 = 𝑥, 𝑌 = 𝑦)

! "
Minimum-Risk Decisions
• If we want to the smallest average loss (the smallest risk), then our
decision rule should be
𝑓 = argmin 𝑅(𝑓)
• In other words, for each possible 𝑥, we find the value of 𝑓(𝑥) that
minimizes our expected loss given that 𝑥, and that is the 𝑓(𝑥) that
our algorithm should produce.
Outline
• Decision Theory
• Bayes’ Rule
Zero-One Loss
Suppose that 𝑓(𝑥) is an estimate of the correct label, and
• We lose one point if 𝑓(𝑥) ≠ 𝑦
• We lose zero points if 𝑓(𝑥) = 𝑦
1 𝑓(𝑥) ≠ 𝑦
𝑙(𝑓(𝑥), 𝑦) = 6
0 𝑓 𝑥 =𝑦
Then the risk is
𝑅 𝑓 = 𝐸 𝑙 𝑓 𝑋 , 𝑌 = Pr(𝑓(𝑋) ≠ 𝑌)
Minimum Probability of Error
We can minimize the probability of error by designing f(x) so
that 𝑓(𝑥) = 1 when 𝑌 = 1 is more probable, and 𝑓(𝑥) = 0
when 𝑌 = 0 is more probable.
1 𝑃 𝑌 = 1 𝑋 = 𝑥 > 𝑃(𝑌 = 0|𝑋 = 𝑥)

𝑓(𝑥) = 6
0 𝑃 𝑌 = 1 𝑋 = 𝑥 < 𝑃(𝑌 = 0|𝑋 = 𝑥)
MPE = MAP
• The “minimum probability of error” (MPE) decision rule is
the rule that chooses f(X) in order to minimize the
probability of error:
𝑓(𝑥) = argmin 𝑃(Error|𝑋 = 𝑥)
• The “maximum a posteriori” (MAP) decision rule is the rule

that chooses 𝑓(𝑋) in order to maximize the a posteriori
probability:
𝑓(𝑥) = argmax 𝑃(𝑌 = 𝑓(𝑥)|𝑋 = 𝑥)
• Those two decision rules are the same. MPE = MAP.

Outline
• Decision Theory
• Bayes’ Rule
The Bayesian Scenario
• Let’s use 𝑥~𝑋 to mean that 𝑥 is an instance of random
variable 𝑋, and similarly 𝑦~𝑌.
• In order to minimize the probability of error, we just need to
know 𝑃(𝑌 = 𝑦|𝑋 = 𝑥) for every pair of values 𝑥~𝑋 and
𝑦~𝑌. Then we choose 𝑓(𝑥) = argmax 𝑃(𝑌 = 𝑦|𝑋 = 𝑥).
!
Example: spam detection
• But how can we estimate 𝑃(𝑌 = 𝑦|𝑋 = 𝑥)?
• The prior probability of spam might be obvious. If 80% of all email on the internet
is spam, that means that
𝑃 𝑌 = 1 = 0.8, 𝑃 𝑌 = 0 = 0.2
• The probability of X given Y is also easy. Suppose we have a database full of

sample emails, some known to be spam, some known to be ham. We count how
often any word occurs in spam vs. ham emails, and estimate:
𝑃 𝑋 = 𝑥 𝑌 = 1 = frequency of the words 𝑥 in emails known to be spam
𝑃 𝑋 = 𝑥 𝑌 = 0 = frequency of the words 𝑥 in emails known to be ham
• Now we have 𝑃 𝑋 = 𝑥 𝑌 = 𝑦 and 𝑃 𝑌 = 𝑦 . How do we get 𝑃(𝑌 = 𝑦|𝑋 = 𝑥)?

By Unknown -
[2][3], Public
Domain,
Bayes’ Rule
https://commons.
wikimedia.org/w/i
ndex.php?curid=1
4532025
Rev. Thomas Bayes
(1702-1761)
The reverend Thomas Bayes solved this problem for us in 1763. His proof is really
just the definition of conditional probability, applied twice in a row:
𝑃(𝑋 = 𝑥, 𝑌 = 𝑦)
𝑃 𝑌=𝑦𝑋=𝑥 =
𝑃(𝑋 = 𝑥)
𝑃 𝑋 = 𝑥 𝑌 = 𝑦 𝑃(𝑌 = 𝑦)
=
𝑃 𝑋=𝑥
The four Bayesian probabilities
𝑃(𝑌 = 𝑦)𝑃 𝑋 = 𝑥 𝑌 = 𝑦
𝑃(𝑋 = 𝑥)
This equation shows the relationship among four probabilities. This equation has
become so world-famous, since 1763, that these four probabilities have standard
universally recognized names that you need to know:
• 𝑃 𝑌 = 𝑦 𝑋 = 𝑥 is the a posteriori (after-the-fact) probability, or posterior
• 𝑃(𝑌 = 𝑦) is the a priori (before-the-fact) probability, or prior
• 𝑃 𝑋 = 𝑥 𝑌 = 𝑦 is the likelihood
• 𝑃(𝑋 = 𝑥) is the evidence
Bayes’ rule: the posterior equals the prior times the likelihood over the evidence.
MPE = MAP using Bayes’ rule
• MPE = MAP: to minimize the probability of error, design f(X) so that
𝑓(𝑥) = argmax 𝑃(𝑌 = 𝑦|𝑋 = 𝑥)
!
• Bayes’ rule:
𝑃(𝑌 = 𝑦)𝑃 𝑋 = 𝑥 𝑌 = 𝑦
𝑃(𝑋 = 𝑥)
• Putting the two together:
𝑃(𝑌 = 𝑦)𝑃 𝑋 = 𝑥 𝑌 = 𝑦
𝑓(𝑥) = argmax
! 𝑃(𝑋 = 𝑥)
= argmax 𝑃(𝑌 = 𝑦)𝑃 𝑋 = 𝑥 𝑌 = 𝑦

!
Outline
• Decision Theory
• Bayes’ Rule
Accuracy
When we train a classifier, the metric that we usually report is
“accuracy.”
# tokens correctly classiKied

Accuracy =
# tokens total
Error Rate
Equivalently, we could report error rate, which is just 1-accuracy:
# tokens incorrectly classiKied

Error Rate =
# tokens total
Bayes Error Rate
The “Bayes Error Rate” is the smallest possible error rate of any
classifier with labels 𝑦 and features 𝑥:
Error Rate = - 𝑃 𝑋 = 𝑥 min 𝑃(𝑌 ≠ 𝑦|𝑋 = 𝑥)

!
"
It’s called the “Bayes error rate” because it’s the error rate of the
Bayesian classifier.
Outline
• Decision Theory
• Bayes’ Rule
The problem with accuracy
• In most real-world problems, there is one class label that is much more
frequent than all others.
• Words: most words are nouns
• Animals: most animals are insects
• Disease: most people are healthy
• It is therefore easy to get a very high accuracy. All you need to do is write a
program that completely ignores its input, and always guesses the majority
class. The accuracy of this classifier is called the “chance accuracy.”
• It is sometimes very hard to beat the chance accuracy. If chance=90%, and
your classifier gets 89% accuracy, is that good, or bad?
The solution: Confusion Matrix
Confusion Matrix =
• 𝑚, 𝑛 th element is
• the number of
tokens of the 𝑚th
class
• that were labeled,
by the classifier, as
belonging to the
𝑛th class.
Plaintext versions of the Miller & Nicely matrices, posted by
Dinoj Surendran,
http://people.cs.uchicago.edu/~dinoj/research/nicely.html
Confusion matrix for a binary classifier
Suppose that the correct label is Classified As:
either 0 or 1. Then the confusion 0 1
matrix is just 2x2.
Correct Label:
0
For example, in this box, you would 1

write the # tokens of class 1 that were
misclassified as class 0
False Positives & False Negatives
• TP (True Positives) = tokens that were Classified As:
correctly labeled as “1” 0 1
• FN (False Negatives) = tokens that
Correct Label:
should have been “1”, but were
mislabeled as “0” 0 TN FP
• FP (False Positives) = tokens that
should have been “0”, but were
mislabeled as “1” 1 FN TP
• TN (True Negative) = tokens that were
correctly labeled as “0”
Summaries of a Binary Confusion Matrix
The binary confusion matrix is standard in Classified As:
many fields, but different fields 0 1
summarize its content in different ways.
Correct Label:
• In medicine, it is summarized using 0 TN FP
Sensitivity and Specificity.
• In information retrieval (IR) and AI, we 1 FN TP
usually summarize it using Recall and
Precision.
Specificity and Sensitivity
Classified As:
Specificity = True Negative Rate (TNR):
𝑇𝑁 0 1
𝑇𝑁𝑅 = 𝑃(𝑓 𝑋 = 0|𝑌 = 0) =
𝑇𝑁 + 𝐹𝑃
Correct Label:
0 TN FP
1 FN TP
Sensitivity = True Positive Rate (TPR):
𝑇𝑃
𝑇𝑃𝑅 = 𝑃(𝑓 𝑋 = 1|𝑌 = 1) =
𝑇𝑃 + 𝐹𝑁
How IR and AI summarize confusions
Precision: Classified As:
𝑇𝑃 0 1
𝑃 = 𝑃(𝑌 = 1|𝑓 𝑋 = 1) =
𝑇𝑃 + 𝐹𝑃
Correct Label:
0 TN FP
1 FN TP
Recall = Sensitivity = TPR:
𝑇𝑃
𝑅 = 𝑃(𝑓 𝑋 = 1|𝑌 = 1) =
𝑇𝑃 + 𝐹𝑁
The Misdiagnosis Problem: Example
1% of women at age forty who participate in routine screening have
breast cancer. The test has sensitivity of 80%, and specificity of 90.4%.
Classified As:
𝑃 𝑓 𝑋 = 0, 𝑌 = 0 = 𝑃 𝑓 𝑋 = 0 𝑌 = 0 𝑃 𝑌 = 0 0 1
= 0.904 0.99 ≈ 0.895
Correct Label:
𝑃 𝑓 𝑋 = 1, 𝑌 = 0 = 𝑃 𝑓 𝑋 = 1 𝑌 = 0 𝑃 𝑌 = 0
0 0.895 0.095
= (0.096)(0.99) ≈ 0.095
𝑃 𝑓 𝑋 = 0, 𝑌 = 1 = 𝑃 𝑓 𝑋 = 0 𝑌 = 1 𝑃 𝑌 = 1
= (0.2)(0.01) = 0.002
1 0.002 0.008
𝑃 𝑓 𝑋 = 1, 𝑌 = 1 = 𝑃 𝑓 𝑋 = 1 𝑌 = 1 𝑃 𝑌 = 1
= (0.8)(0.01) = 0.008
The Misdiagnosis Problem
1% of women at age forty who participate in routine screening have breast cancer. 80%
of women with breast cancer will get positive mammographies. 9.6% of women
without breast cancer will also get positive mammographies.
A woman in this age group had a positive mammography in a routine screening. What
Recall:
is the probability that she actually has breast cancer? 𝑅 = 𝑃(𝑓 𝑋 = 1|𝑌 = 1)
= 0.8
𝑃 Test = 𝑇 Cancer = 𝑇 𝑃(Cancer = 𝑇)
𝑃 Cancer = 𝑇 Test = 𝑇 =
𝑃(Test = 𝑇) Precision:
𝑃 = 𝑃(𝑌 = 1|𝑓 𝑋 = 1)
𝑃 Test = 𝑇 Cancer = 𝑇 𝑃(Cancer = 𝑇) = 0.0776
=
𝑃 Test = 𝑇 Cancer = 𝑇 𝑃 Cancer = 𝑇 + 𝑃 Test = 𝑇 Cancer = 𝐹 𝑃(Cancer = 𝐹)
(0.8)(0.01)
= = 0.0776
(0.8)(0.01) + (0.096)(0.99)
Outline
• Decision Theory
• Bayes’ Rule
Accuracy on which corpus?
Consider the following experiment: among all of your friends’ pets,
there are 4 dogs and 4 cats.
1. Measure several attributes of each animal: weight, height, color,
number of letters in its name…
2. You discover that, among your friends’ pets, all dogs have 1-syllable
names, while the names of all cats have 2+ syllables.
3. Your classifier: an animal is a cat if its name has 2+ syllables.
4. Your accuracy: 100%
Is it correct to say that this classifier has 100%? Is it useful to say so?
Training vs. Test Corpora
Training Corpus = a set of data that you use in order to optimize the
parameters of your classifier (for example, optimize which features you
measure, how you use those features to make decisions, and so on).
Test Corpus = a set of data that is non-overlapping with the training set
(none of the test tokens are also in the training dataset) that you can use
to measure the accuracy.
• Measuring the training corpus accuracy is useful for debugging: if your
training algorithm is working, then training corpus accuracy should
always go up.
• Measuring the test corpus accuracy is the only way to estimate how
your classifier will work on new data (data that you’ve never yet seen).
This happened:
• Large Scale Visual Recognition Challenge 2015:
Each competing institution was allowed to test
up to 2 different fully-trained classifiers per
week.
• One institution used 30 different e-mail
addresses so that they could test a lot more
classifiers (200, total). One of their systems
achieved <46% error rate – the competition’s
best, at that time.
• Is it correct to say that that institution’s
algorithm was the best?
Training vs. development test vs. evaluation test corpora
Training Corpus = a set of data that you use in order to optimize the parameters of
your classifier (for example, optimize which features you measure, what are the
weights of those features, what are the thresholds, and so on).
Development Test (DevTest or Validation) Corpus = a dataset, separate from the

training dataset, on which you test 200 different fully-trained classifiers (trained,
e.g., using different training algorithms, or different features) to find the best.
Evaluation Test Corpus = a dataset that is used only to test the ONE classifier that
does best on DevTest. From this corpus, you learn how well your classifier will
perform in the real world.
Summary
• Bayes Error Rate:
Bayes Error Rate = - 𝑃 𝑋 = 𝑥 min 𝑃(𝑌 ≠ 𝑦|𝑋 = 𝑥)
!
"
• Confusion Matrix, Precision & Recall
𝑇𝑃 𝑇𝑃
Precision = , Recall =
𝑇𝑃 + 𝐹𝑃 𝑇𝑃 + 𝐹𝑁
Naïve Bayes
Naïve Bayes
• minimum probability of error using Bayes’ rule
• naïve Bayes
• unigrams and bigrams
• estimating the likelihood: maximum likelihood parameter estimation
• Laplace smoothing
MPE = MAP using Bayes’ rule
!
𝑃(𝑌 = 𝑦)𝑃 𝑋 = 𝑥 𝑌 = 𝑦
= argmax
! 𝑃(𝑋 = 𝑥)
= argmax 𝑃(𝑌 = 𝑦)𝑃 𝑋 = 𝑥 𝑌 = 𝑦

!
Naïve Bayes
• naïve Bayes
The problem with likelihood: Too many words
What does it mean to say that the words, x, have a particular probability?
Suppose our training corpus contains two sample emails:
Email1: 𝑌 = spam, 𝑋 =“Hi there man – feel the vitality! Nice meeting you…”
Email2: 𝑌 = ham, 𝑋 =“This needs to be in production by early afternoon…”
Our test corpus is just one email:

Email1: X=“Hi! You can receive within days an approved prescription for
increased vitality and stamina”
How can we estimate 𝑃(𝑋 = “Hi! You can receive within days an approved
prescripKon for increased vitality and stamina”|𝑌 = spam)?
Naïve Bayes: the “Bag-of-words” model
We can estimate the likelihood of an e-mail by pretending that the e-mail
is just a bag of words (order doesn’t matter).
With only a few thousand spam e-mails, we can get a pretty good estimate
of these things:
• 𝑃(𝑊 = “hi”|𝑌 = spam), 𝑃(𝑊 = “hi”|𝑌 = ham)

• 𝑃(𝑊 = “vitality”|𝑌 = spam), 𝑃(𝑊 = “vitality”|𝑌 = ham)
• 𝑃(𝑊 = “production”|𝑌 = spam), 𝑃(𝑊 = “production”|𝑌 = ham)
hi
Then we can approximate 𝑃(𝑋|𝑌) by assuming that the words, 𝑊, are vitality
conditionally independent of one another given the category label:
prescription
$
you for
𝑃(𝑋 = 𝑥|𝑌 = 𝑦) ≈ > 𝑃(𝑊 = 𝑤! |𝑌 = 𝑦) approved
!"#
Naïve Bayes Representation
• Goal: estimate likelihoods P(Document | Class)
and priors P(Class)
• Likelihood: bag of words representation
• The document is a sequence of words [𝑤@ , 𝑤A , … , 𝑤B ]
• The order of the words in the document is not important
• Each word is conditionally independent of the others given document
class
Why naïve Bayes is “naïve”
We call this model “naïve Bayes” because the words aren’t really conditionally independent
given the label. For example, the sequence “for you” is more common in spam emails than
it would be if the words “for” and “you” were conditionally independent.
True Statement:
𝑃 𝑋 = for you 𝑌 = Spam > 𝑃 𝑊 = for 𝑌 = Spam 𝑃 𝑊 = you 𝑌 = Spam
The naïve Bayes approximation simply says: estimating the likelihood of every word
sequence is too hard, so for computational reasons, we’ll pretend that sequence probability
doesn’t matter.
Naïve Bayes Approximation:
𝑃 𝑋 = for you 𝑌 = Spam ≈ 𝑃 𝑊 = for 𝑌 = Spam 𝑃 𝑊 = you 𝑌 = Spam
We use naïve Bayes a lot because, even though we know it’s wrong, it gives us
computationally efficient algorithms that work remarkably well in practice.
MPE = MAP using naïve Bayes
Using naïve Bayes, the MPE decision rule is:
𝑓(𝑥) = argmax 𝑃(𝑌 = 𝑦) 0 𝑃(𝑊 = 𝑤" |𝑌 = 𝑦)

!
"#$
Floating-point underflow
%
𝑓(𝑥) = argmax 𝑃(𝑌 = 𝑦) 6 𝑃(𝑊 = 𝑤" |𝑌 = 𝑦)

!
"#$
• That equation has a computational issue. Suppose that the probability of
any given word is roughly 𝑃(𝑊 = 𝑤" |𝑌 = 𝑦) ≈ 10&' , and suppose that
there are 103 words in an email. Then ∏%"#$ 𝑃(𝑊 = 𝑤" |𝑌 = 𝑦) = 10&'() ,
which gets rounded off to zero. This phenomenon is called “floating-point
underflow.”
• In order to avoid floating-point underflow, we can take the logarithm of the
equation above:
%
𝑓(𝑥) = argmax ln 𝑃(𝑌 = 𝑦) + @ ln 𝑃(𝑊 = 𝑤" |𝑌 = 𝑦)

!
"#$
Naïve Bayes
• naïve Bayes
Reducing the naivety of naïve Bayes
Remember that the bag-of-words model is unable to represent this fact:
True Statement:
𝑃 𝑋 = for you 𝑌 = Spam > 𝑃 𝑊 = for 𝑌 = Spam 𝑃 𝑊 = you 𝑌 = Spam
Though the bag-of-words model can’t represent that fact, we can

represent it using a slightly more sophisticated naïve Bayes model, called
a “bigram” model.
N-Grams
Claude Shannon, in his 1948 book A Mathematical Theory of Communication,
proposed that the probability of a sequence of words could be modeled using N-
grams: sequences of N consecutive words.
• Unigram: a unigram (1-gram) is an isolated word, e.g., “you”

• Bigram: a bigram (2-gram) is a pair of words, e.g., “for you”
• Trigram: a trigram (3-gram) is a triplet of words, e.g., “prescription for you”
• 4-gram: a 4-gram is a 4-tuple of words, e.g., “approved prescription for you”
Bigram naïve Bayes
A bigram naïve Bayes model approximates the bigrams as conditionally
independent, instead of the unigrams. For example,
𝑃 𝑋 = “approved prescription for you” 𝑌 = Spam ≈
𝑃 𝐵 = “approved prescription” 𝑌 = Spam ×

𝑃 𝐵 = “prescription for” 𝑌 = Spam ×
𝑃 𝐵 = “for you” 𝑌 = Spam
Advantages and disadvantages of bigram
models relative to unigram models
• Advantage: the bigram model can tell you if a particular bigram is
much more frequent in spam than in ham emails.
• Disadvantage: over-training. Even if probabilities of individual words
in the training and test corpora are similar, probabilities of bigrams
might be different.
Naïve Bayes
• naïve Bayes
What are “parameters”?
• Oxford English dictionary: parameter (noun): a numerical or other
measurable factor forming one of a set that defines a system or sets
the conditions of its operation.
• The naïve Bayes model has two types of parameters:
• The a priori parameters: 𝑃(𝑌 = 𝑦)
• The likelihood parameters: 𝑃(𝑊 = 𝑤C |𝑌 = 𝑦)
• In order to create a naïve Bayes classifiers, we must somehow

estimate the numerical values of those parameters.
Parameter estimation
Model parameters: feature likelihoods P(Word | Class) and priors P(Class)
• How do we obtain the values of these parameters?
prior P(word | spam) P(word | ham)
spam: 0.33
¬spam: 0.67
Parameter estimation: Prior
The prior, 𝑃(𝑌), is usually estimated in one of two ways.
• If we believe that the test corpus is like the training corpus, then we
just use frequencies in the training corpus:
Docs(𝑌 = Spam)
𝑃(𝑌 = Spam) =
Docs 𝑌 = Spam + Docs(𝑌 ≠ Spam)
where “Docs(Y=Spam)” means the number of documents in the
training corpus that have the label Y=Spam.
• If we believe that the test corpus is different from the training corpus,
then we set 𝑃(𝑌 = Spam) = the frequency with which we believe
spam will occur in the test corpus.
Parameter estimation: Likelihood
The likelihood, 𝑃(𝑊 = 𝑤" |𝑌 = 𝑦), is also estimated by counting.
The “maximum likelihood estimate of the likelihood parameter” is the
most intuitively obvious estimate:
Count(𝑊 = 𝑤" , 𝑌 = Spam)

𝑃(𝑊 = 𝑤" |𝑌 = Spam) =
Count 𝑌 = Spam
where “Count(𝑊 = 𝑤" , 𝑌 = Spam)” means the number of times that

the word 𝑤" occurs in the Spam portion of the training corpus, and
“Count 𝑌 = Spam ” is the total number of words in the Spam portion.
Naïve Bayes
• naïve Bayes
What is the probability that the sun will fail to
rise tomorrow?
• # times we have observed the sun to rise = 1,825,000
• # times we have observed the sun not to rise = 0
&
• Estimated probability the sun will not rise = =0
&'$,)*+,&&&
Oops….
Laplace Smoothing
• The basic idea: add 𝑘 “unobserved observations” to every possible
event
• # times the sun has risen or might have ever risen = 1,825,000+k
• # times the sun has failed to rise or might have ever failed to rise =
0+k
$,)*+,&&&',
• Estimated probability the sun will rise tomorrow =
$,)*+,&&&'*,
,
• Estimated probability the sun will not rise =
$,)*+,&&&'*,
• Notice that, if you add these two probabilities together, you get 1.0.
Laplace Smoothing for Naïve Bayes
• The basic idea: add 𝑘 “unobserved observations” to the count of every unigram
• If a word occurs 2000 times in the training data, Count = 2000+k
• If a word occur once in training data, Count = 1+k
• If a word never occurs in the training data, then it gets a pseudo-Count of k
• Estimated probability of a word that occurred Count(w) times in the training data: =
𝑘 + Count(𝑊 = 𝑤)
𝑃 𝑊=𝑤 =
𝑘 + ∑% 𝑘 + Count(𝑊 = 𝑣)
• Estimated probability of a word that never occurred in the training data (an “out of vocabulary” or OOV
word):
𝑘
𝑃 𝑊 = 𝑂𝑂𝑉 =
𝑘 + ∑% 𝑘 + Count(𝑊 = 𝑣)
• Notice that
𝑃 𝑊 = 𝑂𝑂𝑉 + K P(𝑊 = 𝑤) = 1
&
Conclusions
• MPE = MAP with Bayes’ rule:
𝑓(𝑥) = argmax log 𝑃(𝑌 = 𝑦) + log 𝑃(𝑋 = 𝑥|𝑌 = 𝑦)
• naïve Bayes:
$
log 𝑃(𝑋 = 𝑥|𝑌 = 𝑦) ≈ ; log 𝑃(𝑊 = 𝑤! |𝑌 = 𝑦)
!"#
• maximum likelihood parameter estimation:
Count(𝑊 = 𝑤! )
𝑃(𝑊 = 𝑤! ) =
∑% Count(𝑊 = 𝑣)
• Laplace Smoothing:
𝑘 + Count(𝑊 = 𝑤! )
𝑃(𝑊 = 𝑤! ) =
𝑘 + ∑% 𝑘 + Count(𝑊 = 𝑣)
Bayesian
Networks
Outline
• Review: Bayesian classifier
• The Los Angeles burglar alarm example
• Bayesian network: A better way to represent knowledge
• Inference using a Bayesian network
• Key ideas: Independence and Conditional independence
Review: Bayesian Classifier
• Class label 𝑌 = 𝑦, drawn from some set of labels
• Observation 𝑋 = 𝑥, drawn from some set of features
• Bayesian classifier: choose the class label, 𝑦, that minimizes your
probability of making a mistake:

!
Today: What if P(X,Y) is complicated, and the
naïve Bayes assumption is unreasonable?
• Example: 𝑌 is a scalar, but 𝑋 = 𝑋" , … , 𝑋"## $ is a vector
• Then, even if every variable is binary, 𝑃(𝑌 = 𝑦|𝑋 = 𝒙) is a table with
2"#" numbers. Hard to learn from data; hard to use.
• The naïve Bayes assumption simplified the problem as
"##
𝑃(𝑋" , … , 𝑋"## |𝑌) ≈ 5 𝑃(𝑋% |𝑌)

%&"
• … but what if that assumption is unreasonable? Do we then have no
alternative besides learning all 2"#" probabilities?
• Today: an alternative called a Bayesian network
Outline
The Los Angeles burglar alarm example
• Suppose I have a house in LA. I’m
in Champaign.
• My phone beeps in class: I have
messages from both of my LA
neighbors, John and Mary.
• Does getting messages from both
John and Mary mean that my
burglar alarm is going off?
• If my burglar alarm is going off,
does that mean my house is being CC BY-SA 3.0,
robbed, or is it just an earthquake? https://commons.wikimedia.org/w/index.php?curid=129209878
Variables
• 𝐵 = ⊤ if my house is being burglarized, else 𝐵 = ⊥
• 𝐸 = ⊤ if there’s an earthquake in LA right now, else 𝐸 = ⊥
• 𝐴 = ⊤ if my alarm is going off right now, else 𝐴 = ⊥
• 𝐽 = ⊤ if John is texting me, else 𝐽 = ⊥
• 𝑀 = ⊤ if Mary is texting me, else 𝑀 = ⊥
Inference Problem
• Given that 𝐽 = ⊤ and 𝑀 = ⊤, I want to know what is the probability
that I’m being burglarized
• In other words, what is 𝑃 𝐵 = ⊤|𝑀 = ⊤, 𝐽 = ⊤
• How on Earth would I estimate that probability? I don’t know how to
estimate that.
Available Knowledge
!"
• LA has 1 million houses & 41 burglaries/day: Pr(𝐵 = ⊤) = "######
$#
• There are ~20 earthquakes/year: 𝑃(𝐸 = ⊤) =
%&'
• My burglar alarm is pretty good:
𝐵 = ⊥, 𝐸 = ⊥ 𝐵 = ⊥, 𝐸 = ⊤ 𝐵 = ⊤, 𝐸 = ⊥ 𝐵 = ⊤, 𝐸 = ⊤
𝑃 𝐴 = ⊤ 𝐵, 𝐸 1 99 99 99
100 100 100 100
(
• Mary would text if there was an alarm: 𝑃 𝑀 = ⊤ 𝐴 = ⊤ = "#
"
• …but she also texts most days anyway: 𝑃 𝑀 = ⊤ 𝐴 = ⊥ = $
Combining the Available Knowledge
Putting it all together, we have … well, we have a big mess. And that’s not
including the variable J:
𝐵=⊥ 𝐵=⊤
𝑃(𝐵, 𝐸 = ⊥, 𝐴 = ⊥, 𝑀 = ⊥) 999959 345 99 1 41 345 99 1
1000000 365 100 2 1000000 365 100 2
𝑃(𝐵, 𝐸 = ⊥, 𝐴 = ⊥, 𝑀 = ⊤) 999959 345 99 1 41 345 99 1
1000000 365 100 2 1000000 365 100 2
𝑃(𝐵, 𝐸 = ⊥, 𝐴 = ⊤, 𝑀 = ⊥) 999959 345 1 1 41 345 1 1
1000000 365 100 10 1000000 365 100 10
𝑃(𝐵, 𝐸 = ⊥, 𝐴 = ⊤, 𝑀 = 𝑇) 999959 345 1 9 41 345 99 9
1000000 365 100 10 1000000 365 100 10
⋮ ⋮ ⋮
Outline
Bayesian network: A better way to represent
knowledge
A Bayesian network is a graph in which: B E
• Each variable is a node.
A
• An arrow between two nodes means
that the child depends on the parent.
J M
• If the child has no direct dependence
on the parent, then there is no arrow.
Bayesian network: A better way to represent
knowledge
For example, this graph shows my B E
knowledge that:
• My alarm rings if there is a burglary or A
an earthquake.
• John is more likely to call if my alarm J M
is going off.
• Mary is more likely to call if my alarm
is going off.
B E
Space complexity A
• Without the Bayes network: I have 5 variables, each is binary, so the
probability distribution 𝑃(𝐵, 𝐸, 𝐴, 𝑀, 𝐽) is a table with 2! = 32 entries. J M
• With the Bayes network:
• Two of the variables, B and E, depend on nothing else, so I just
need to know 𝑃(𝐵 = ⊤) and 𝑃(𝐸 = ⊤) --- 1 number for each of
them.
• M and J depend on A, so I need to know 𝑃(𝑀 = ⊤|𝐴 = ⊤) and
𝑃(𝑀 = ⊤|𝐴 = ⊥) – 2 numbers for each of them.
• A depends on both B and E, so I need to know 𝑃(𝐴 = ⊤|𝐵 = 𝑏, 𝐸 =
𝑒) for all 4 combinations of (𝑏, 𝑒)
• Total: 1+1+2+2+4 = 10 numbers to represent the whole
distribution!
B E
Complete description of my knowledge
about the burglar alarm A
J M
𝑃(𝐵 = ⊤) 41 𝑃(𝐸 = ⊤) 20
1000000 365
𝐵 = ⊥, 𝐸 = ⊥ 𝐵 = ⊥, 𝐸 = ⊤ 𝐵 = ⊤, 𝐸 = ⊥ 𝐵 = ⊤, 𝐸 = ⊤
𝑃 𝐴 = ⊤ 𝐵, 𝐸 1 99 99 99
100 100 100 100
𝐴=⊥ 𝑨=⊤ 𝐴=⊥ 𝑨=⊤

𝑃 𝐽=⊤𝐴 1 3 𝑃 𝑀=⊤𝐴 1 9
8 4 2 10
Space complexity
• This is a Bayes network to help
diagnose problems with your car’s
audio system.
• Naïve method: 41 binary variables, so
the distribution is a table with 2'" ≈
2×10"( entries.
• Bayes network: each variable has at
most four parents, so the whole
distribution can be described by less Huang, McMurran, Dhadyalla & Jones, ”Probability-based
than 41×2' = 656 numbers. vehicle fault diagnosis: Bayesian network method,” 2008
Outline
B E
Inference
A
Both John and Mary texted me. Am I being burglarized?
J M
𝑃(𝐵 = ⊤, 𝐽 = ⊤, 𝑀 = ⊤)
𝑃 𝐵 = ⊤|𝐽 = ⊤, 𝑀 = ⊤ =
𝑃 𝐵 = ⊤, 𝐽 = ⊤, 𝑀 = ⊤ + 𝑃(𝐵 = ⊥, 𝐽 = ⊤, 𝑀 = ⊤)
⊥ ⊥
𝑃 𝐵 = ⊤, 𝐽 = ⊤, 𝑀 = ⊤ = = = 𝑃( 𝐵 = ⊤, 𝐸 = 𝑒, 𝐴 = 𝑎, 𝐽 = ⊤, 𝑀 = ⊤)
"#⊤ $#⊤
⊥ ⊥
= = = 𝑃( 𝐵 = ⊤)𝑃 𝐸 = 𝑒 𝑃 𝐴 = 𝑎 𝐵 = ⊤, 𝐸 = 𝑒 𝑃 𝐽 = ⊤ 𝐴 = 𝑎 𝑃(𝑀 = ⊤|𝐴 = 𝑎)
"#⊤ $#⊤
B E
Time Complexity
A
J M
• Using a Bayes network doesn’t usually change the time complexity of a

problem.
• If computing 𝑃 𝐵 = ⊤|𝐽 = ⊤, 𝑀 = ⊤ required considering 32 possibilities
without a Bayes network, it still requires considering 32 possibilities
Some unexpected conclusions
• Burglary is so unlikely that, even if both Mary and John call, it is still
more probable that a burglary didn’t happen
𝑃 𝐵 = ⊤|𝐽 = ⊤, 𝑀 = ⊤ < 𝑃 𝐵 = ⊥|𝐽 = ⊤, 𝑀 = ⊤
• The probability of an earthquake is higher!

𝑃 𝐵 = ⊤|𝐽 = ⊤, 𝑀 = ⊤ < 𝑃 𝐸 = ⊤|𝐽 = ⊤, 𝑀 = ⊤
B E
Independence
A
• The variables B and E are independent

• Days with earthquakes and days w/o earthquakes have the same number of
burglaries: 𝑃 𝐵 = ⊤|𝐸 = ⊤ = 𝑃 𝐵 = ⊤|𝐸 = ⊥ = 𝑃 𝐵 = ⊤ .
Conditional Independence A
J M
• The variables J and M are conditionally independent of one another given

knowledge of A
• If you know that there was an alarm, then knowing that John texted gives no
extra knowledge about whether Mary will text:
𝑃 𝑀 = ⊤|𝐽 = ⊤, 𝐴 = ⊤ = 𝑃 𝑀 = ⊤|𝐽 = ⊥, 𝐴 = ⊤ = 𝑃 𝑀 = ⊤|𝐴 = ⊤
Conditionally Independent variables A
may not be independent J M
• The variables J and M are not independent!

• If you know that John texted, that tells you that there was probably an alarm.
Knowing that there was an alarm tells you that Mary will probably text you
too:
𝑃 𝑀 = ⊤|𝐽 = ⊤ ≠ 𝑃 𝑀 = ⊤|𝐽 = ⊥
Independent variables may not be B E
conditionally independent A
• The variables B and E are not conditionally independent of one another given
knowledge of A
• If your alarm is ringing, then you probably have an earthquake OR a burglary.
If there is an earthquake, then the conditional probability of a burglary goes
down:
𝑃 𝐵 = ⊤|𝐸 = ⊤, 𝐴 = ⊤ ≠ 𝑃 𝐵 = ⊤|𝐸 = ⊥, 𝐴 = ⊤
• This is called the “explaining away” effect. The earthquake “explains away”
the alarm, so you become less worried about a burglary.
B E
How to tell at a glance if
variables are independent A
and/or conditionally
J M
independent
• Variables are independent if they have no common ancestors
𝑃(𝐵 = ⊤, 𝐸 = ⊤) = 𝑃 𝐵 = ⊤ 𝑃(𝐸 = ⊤)
• Variables are conditionally independent of one another, given their
common ancestors, if (1) they have no common descendants, and
(2) none of the descendants of one are ancestors of the other
𝑃(𝐽 = ⊤, 𝑀 = ⊤|𝐴 = ⊤) = 𝑃 𝐽 = ⊤|𝐴 = ⊤ 𝑃(𝑀 = ⊤|𝐴 = ⊤)
Summary
• Review: Bayesian classifier: 𝑓(𝑥) = argmax 𝑃(𝑌 = 𝑦|𝑋 = 𝑥)
!

• Each variable is a node.
• An arrow between two nodes means that the child depends on the parent.
⊥ ⊥
𝑃 𝐵 = ⊤, 𝐽 = ⊤ = < < 𝑃( 𝐵 = ⊤)𝑃 𝐸 = 𝑒 𝑃 𝐴 = 𝑎 𝐵 = ⊤, 𝐸 = 𝑒 𝑃 𝐽 = ⊤ 𝐴 = 𝑎
"#⊤ $#⊤

• Independent = no common ancestors
• Conditionally independent = (1) no common descendants, and (2) none of the descendants
of one are ancestors of the other
Learning
Public domain image: Classes at the University of Bologna. From Liber ethicorum des Henricus
de Alemannia, Laurentius a Voltolina, 14th century, scanned by The Yorck Project , 2002
Outline
• Biological inspiration
• Parametric learning example: Decision tree
• A mathematical definition of learning
• Overtraining
• Early stopping
Biological inspiration: Hebbian learning
“Neurons that fire together, wire together.
…
The general idea is an old one, that any two cells or systems of cells
that are repeatedly active at the same time will tend to become
`associated’ so that activity in one facilitates activity in the other.”
- D.O. Hebb, 1949

Biological inspiration: Long-term potentiation
Figures this page are public domain, by Thomas W. Sulcer, 2011
1. A synapse is repeatedly stimulated 3. More neurotransmitters
2. More dendritic receptors 4. A stronger link between neurons

Mathematical model: Learning
Parameters of the learning machine: how many dendritic
receptors exist? What types of neurotransmitter do they
respond to?
𝑋 =input signal
𝑓(𝑋) =output signal
Learning = adjust the parameters of the learning

machine so that 𝑓(𝑋) becomes the function we want
Mathematical model: Supervised Learning
Supervision: 𝒟 = 𝑥! , 𝑦! , … , 𝑥" , 𝑦" = training dataset containing pairs of
(example signal 𝑥# , desired system output 𝑦# )
𝑦#
( , )
( , ) 𝑥# ℓ
( , ) 𝑓(𝑥# )
( , )
( , ) ℓ 𝑦# , 𝑓(𝑥# )
( , )
( , )
Supervised Learning = adjust parameters of the
learner to minimize E ℓ 𝑌, 𝑓(𝑋)
Outline
• Overtraining
• Early stopping
Decision tree learning: An example
• The Titanic sank.
• You were rescued.
• You want to know if your friend
was also rescued.
• You can’t find them.
• Can you use machine learning
methods to estimate the
probability that your friend
survived?
Survival of the Titanic: A machine learning
approach
1. Gather data about as many of
the passengers as you can.
• X = variables that describe the
passenger, e.g., age, gender,
number of siblings on board.
• Y = 1 if the person is known to
have survived
2. Learn a function, f(X), that
matches the known data as
well as possible
3. Apply f(x) to your friend’s
facts, to estimate their
probability of survival
A mathematical definition of learning
• Environment: there are two random variables, 𝑋 and 𝑌, that are
jointly distributed according to
𝑃(𝑋, 𝑌)
• Data: 𝑃(𝑋, 𝑌) is unknown, but we have a sample of training data
𝒟 = {(𝑥! , 𝑦! ), … , (𝑥" , 𝑦" )}
• Objective: We would like a function 𝑓 that minimizes the expected
value of some loss function, ℓ 𝑌 , 𝑓(𝑋) :
ℛ = E ℓ 𝑌, 𝑓(𝑋)
• Definition of learning: Learning is the task of estimating the function
𝑓, given knowledge of 𝒟.
Outline
• Overtraining
• Early stopping
Overtraining
Consider the following experiment:
among all of your friends’ pets, there Syllables ≥ 2?
are 4 dogs and 4 cats.
1. Measure several attributes of each No Yes
animal: weight, height, color,
number of letters in its name…
Dog Cat
2. You discover that, among your
friends’ pets, all dogs have 1- 0.0, 50% 1.0, 50%
syllable names, while the names of
all cats have 2+ syllables.
Is it correct to say that this classifier has
100%? Is it useful to say so?
Training vs.
Test Corpora
• Suppose you need 100
branch-nodes to reach
zero training error
• … Then what is the
training error after you
find the best 100
questions?
• … and what is the error
on a different “test” set
then?
Training vs. Test Corpora
Training Corpus = a set of data that you use in order to optimize the
parameters of your classifier (for example, optimize which features you
measure, how you use those features to make decisions, and so on).
• Measuring the training corpus accuracy is important for debugging: if
your training algorithm is working, then training corpus error rate
should always go down.
Test Corpus = a set of data that is non-overlapping with the training set
(none of the test tokens are also in the training dataset) that you can use
to measure the error rate.
• Measuring the test corpus error rate is the only way to estimate how
your classifier will work on new data (data that you’ve never yet seen).
Training vs.
Test Corpora
• Training error is
sometimes called
“optimization error.” It
happens because you
haven’t finished
optimizing your
parameters.
• Test error =
optimization error +
generalization error
Training corpus error vs. Test corpus error
• Learning: Given 𝒟 = {(𝑥% , 𝑦% ), … , (𝑥& , 𝑦& )}, find the
function 𝑓(𝑋) that minimizes some measure of risk.
• Empirical risk, a.k.a. training corpus error:
&
1
ℛ'() = 0 ℓ 𝑦* , 𝑓(𝑥* )
𝑛
*+%
• True risk, a.k.a. expected test corpus error:
ℛ = E ℓ 𝑌, 𝑓(𝑋) = ℛ'() + ℛ,'-'./012/314-
Outline
• Overtraining
• Early stopping
Training vs.
Test Corpora
• As you iterate training,
error on the training set
should go to 0
• When should you stop
training?
Cheaters
always win
Why not just stop training

when test set error
reaches a minimum?
This happened:
• Large Scale Visual Recognition Challenge 2015:
Each competing institution was allowed to test
up to 2 different fully-trained classifiers per
week.
• One institution used 30 different e-mail
addresses so that they could test a lot more
classifiers (200, total). One of their systems
achieved <46% error rate – the competition’s
best, at that time.
• Is it correct to say that that institution’s
algorithm was the best?
Training vs. development test vs. evaluation test corpora
Training Corpus = a set of data that you use in order to optimize the parameters of
your classifier (for example, optimize which features you measure, what are the
weights of those features, what are the thresholds, and so on).
Development Test (DevTest or Validation) Corpus = a dataset, separate from the

training dataset, on which you test 200 different fully-trained classifiers (trained,
e.g., using different training algorithms, or different features) to find the best.
Evaluation Test Corpus = a dataset that is used only to test the ONE classifier that
does best on DevTest. From this corpus, you learn how well your classifier will
perform in the real world.
Train, Dev, Test
• Usually, minimum test error

and minimum dev error
don’t occur at the same time
• … but early stopping based
on the test set is cheating,
• … so early stopping based on
the dev set is the best we
can do w/o cheating.
Summary
• Biological inspiration: Neurons that fire together wire together.
Given enough training examples (𝑥# , 𝑦# ), can we learn a desired
function so that 𝑓(𝑥) ≈ 𝑦?
• Classification tree: Learn a sequence of if-then statements that
computes 𝑓(𝑥) ≈ 𝑦
• Mathematical definition of supervised learning: Given a training
dataset, 𝒟 = 𝑥! , 𝑦! , … , 𝑥" , 𝑦" , find a function 𝑓 that minimizes
the risk, ℛ = E ℓ 𝑌, 𝑓(𝑋) .
! "
• Overtraining: ℛ$%& = ∑ ℓ 𝑦# , 𝑓(𝑥# ) reaches zero if you train
" #'!
long enough.
• Early Stopping: Stop when error rate on the dev set reaches a
minimum

Lec 02 Bayesian Decision Theoryv 2024

Uploaded by

Copyright:

Available Formats

Lec 02 Bayesian Decision Theoryv 2024

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Lec 02 Bayesian Decision Theoryv 2024

Uploaded by

Copyright:

Available Formats

Random Variables

Axiom 1: every event 𝐴 has a non-negative probability.

Area of their intersection is 𝑃 𝐴⋀𝐵 .

We use a small letter to denote a particular outcome of the

• 𝑃 𝑋 is the complete distribution, specifying 𝑃(𝑋 = 𝑥) for all

Similarly, for a random matrix, we could write:

• If the unwanted variable is continuous, we marginalize by integrating:

𝑃(𝑌) 1/6 1/6 1/2 1/6

• 𝑃(𝑌|𝑋) is the probability (or pdf) that 𝑌 = 𝑦 happens, given that 𝑋 =

Area of the whole rectangle is Pr Ω = 1.

Suppose Pr 𝐴⋀𝐵 = 0.1

snow rain cloud clear P(X|Y=cloud)

Two zero-mean random variables, with variances of

Var 𝑋! ⋯ Covar 𝑋! , 𝑋"

𝐸 𝑋! − 𝐸[𝑋! ] & ⋯ 𝐸 𝑋! − 𝐸[𝑋! ] 𝑋" − 𝐸[𝑋" ]

𝑅(𝑓) = 𝐸 𝑙(𝑓 𝑋 , 𝑌) = - - 𝑙 𝑓 𝑥 , 𝑦 𝑃(𝑋 = 𝑥, 𝑌 = 𝑦)

1 𝑃 𝑌 = 1 𝑋 = 𝑥 > 𝑃(𝑌 = 0|𝑋 = 𝑥)

• The “maximum a posteriori” (MAP) decision rule is the rule

• Those two decision rules are the same. MPE = MAP.

• The probability of X given Y is also easy. Suppose we have a database full of

• Now we have 𝑃 𝑋 = 𝑥 𝑌 = 𝑦 and 𝑃 𝑌 = 𝑦 . How do we get 𝑃(𝑌 = 𝑦|𝑋 = 𝑥)?

= argmax 𝑃(𝑌 = 𝑦)𝑃 𝑋 = 𝑥 𝑌 = 𝑦

# tokens correctly classiKied

# tokens incorrectly classiKied

Error Rate = - 𝑃 𝑋 = 𝑥 min 𝑃(𝑌 ≠ 𝑦|𝑋 = 𝑥)

For example, in this box, you would 1

Development Test (DevTest or Validation) Corpus = a dataset, separate from the

= argmax 𝑃(𝑌 = 𝑦)𝑃 𝑋 = 𝑥 𝑌 = 𝑦

Our test corpus is just one email:

• 𝑃(𝑊 = “hi”|𝑌 = spam), 𝑃(𝑊 = “hi”|𝑌 = ham)

𝑓(𝑥) = argmax 𝑃(𝑌 = 𝑦) 0 𝑃(𝑊 = 𝑤" |𝑌 = 𝑦)

𝑓(𝑥) = argmax 𝑃(𝑌 = 𝑦) 6 𝑃(𝑊 = 𝑤" |𝑌 = 𝑦)

𝑓(𝑥) = argmax ln 𝑃(𝑌 = 𝑦) + @ ln 𝑃(𝑊 = 𝑤" |𝑌 = 𝑦)

Though the bag-of-words model can’t represent that fact, we can

• Unigram: a unigram (1-gram) is an isolated word, e.g., “you”

𝑃 𝑋 = “approved prescription for you” 𝑌 = Spam ≈

𝑃 𝐵 = “approved prescription” 𝑌 = Spam ×

• In order to create a naïve Bayes classifiers, we must somehow

prior P(word | spam) P(word | ham)

Count(𝑊 = 𝑤" , 𝑌 = Spam)

where “Count(𝑊 = 𝑤" , 𝑌 = Spam)” means the number of times that

𝑓(𝑥) = argmax 𝑃(𝑌 = 𝑦|𝑋 = 𝑥)

𝑃(𝑋" , … , 𝑋"## |𝑌) ≈ 5 𝑃(𝑋% |𝑌)

𝐴=⊥ 𝑨=⊤ 𝐴=⊥ 𝑨=⊤

• Using a Bayes network doesn’t usually change the time complexity of a

• The probability of an earthquake is higher!

• The variables B and E are independent

• The variables J and M are conditionally independent of one another given

• The variables J and M are not independent!

• Bayesian network: A better way to represent knowledge

• Inference using a Bayesian network

• Key ideas: Independence and Conditional independence

- D.O. Hebb, 1949

1. A synapse is repeatedly stimulated 3. More neurotransmitters

2. More dendritic receptors 4. A stronger link between neurons

𝑓(𝑋) =output signal

Learning = adjust the parameters of the learning

Why not just stop training

Development Test (DevTest or Validation) Corpus = a dataset, separate from the