Homework 1 - Theoretical Part: IFT 6390 Fundamentals of Machine Learning
Ioannis Mitliagkas
\[
P(X \mid Y) = \frac{P(Y \mid X)\, P(X)}{P(Y)}.
\]
ii. Now let's say that this student is bilingual, and you know that 80% of UdeM students are bilingual while 50% of McGill students are. Given this information, what is the probability that this student is affiliated with McGill?
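For intuition, here is a minimal sketch of the Bayes' rule computation this question calls for. The prior P(McGill) below is an assumption made purely for illustration; the actual prior comes from the earlier part of the question.

```python
# Hedged sketch: Bayes' rule for the bilingual-student question.
p_mcgill = 0.4                      # ASSUMED prior P(McGill); not given here
p_udem = 1.0 - p_mcgill             # prior P(UdeM)
p_b_given_mcgill = 0.50             # given: 50% of McGill students are bilingual
p_b_given_udem = 0.80               # given: 80% of UdeM students are bilingual

# Bayes' rule, with the law of total probability in the denominator
p_b = p_b_given_mcgill * p_mcgill + p_b_given_udem * p_udem
posterior = p_b_given_mcgill * p_mcgill / p_b
print(f"P(McGill | bilingual) = {posterior:.3f}")  # ≈ 0.294 under the assumed prior
```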
Table 1
This table tells us, for example, that the probability that a word chosen at random in a document is "vote" is only 2/500 if the topic of the document is sports, but 4/100 if the topic is politics.
(a) What is the probability that a random word in a document is
“goal” given that the topic is politics?
(b) In expectation, how many times will the word “goal” appear in a
document containing 200 words whose topic is sports?
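As a parallel worked example using the "vote" entries quoted above (the "goal" probabilities needed for (a) and (b) come from Table 1): the expected count of a word in an n-word document is n times its per-word probability, so
\[
\mathbb{E}\big[\#\,\text{"vote"}\big] = 200 \cdot P(\text{"vote"} \mid \text{politics}) = 200 \cdot \tfrac{4}{100} = 8.
\]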
(b) We define the maximum likelihood estimate as the value of θ which maximizes the likelihood of having generated the dataset D from the distribution f_θ(x). Formally,
\[
\hat{\theta}_{\mathrm{MLE}} \in \operatorname*{arg\,max}_{\theta} \; \prod_{i=1}^{n} f_{\theta}(X_i).
\]
We define µ_j for j ∈ {1, 2, ..., N} as
\[
\mu_j := \sum_{i=1}^{n} \mathbf{1}(X_i \in B_j).
\]
(a) Using the fact that the total area underneath a probability density function is 1, express θ_N in terms of the other constants.
(b) Write down the log-likelihood of the data in terms of θ_1, θ_2, ..., θ_{N−1} and µ_1, µ_2, ..., µ_{N−1}.
(c) Find the maximum likelihood estimate of θ_j for j ∈ {1, 2, ..., N}.
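As a numerical companion (not the derivation asked for above), the sketch below grid-searches the bin heights that maximize the log-likelihood for a toy two-bin histogram on [0, 1]. Equal-width bins and the Beta-distributed toy data are assumptions made purely for illustration; comparing the numerical argmax with your closed-form answer from (c) is a useful sanity check.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.beta(2, 5, size=500)           # toy dataset on [0, 1]
mu1 = np.sum(x < 0.5)                  # count in bin B_1 = [0, 0.5)
n = len(x)

# Piecewise-constant density with heights theta_1, theta_2 on two
# equal-width bins; total-area constraint: 0.5*(theta_1 + theta_2) = 1.
best, best_ll = None, -np.inf
for theta1 in np.linspace(0.01, 1.99, 1000):
    theta2 = 2.0 - theta1
    ll = mu1 * np.log(theta1) + (n - mu1) * np.log(theta2)
    if ll > best_ll:
        best, best_ll = theta1, ll
print(f"numerical argmax theta_1 ≈ {best:.3f} (n = {n}, mu_1 = {mu1})")
```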
Consider a dataset {x_j}_{j=1}^{n} where each point x ∈ [0, 1]^d. Let f(x) be the true unknown data distribution. You decide to use a histogram method to estimate the density f(x) and divide each dimension into m bins, so that the partition of [0, 1]^d consists of m^d bins in total.
(c) Consider the MNIST dataset with 784 dimensions. We divide each dimension into 2 bins. How many digits (base 10) does the total number of bins have? (A numerical sanity check is sketched after part (e) below.)
(d) Assume the existence of an idealized MNIST classifier based on a histogram of m = 2 bins per dimension. The accuracy of the classifier increases by 5% (starting from 10% and up to a maximum of 100%) each time k = 4 new data points are added to every bin. What is the smallest number of samples the classifier requires to reach an accuracy of 90%?
(e) Assuming a uniform distribution over all bins, what is the probability that a particular bin is empty, as a function of d, m and n?
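For part (c), exact integer arithmetic makes the digit count mechanical; the two-line check below only mechanizes that arithmetic, assuming the total bin count is m^d as noted in the setup.

```python
# Digit count of the total number of bins for m bins per dimension in d dims.
m, d = 2, 784
total_bins = m ** d                  # Python integers are exact at this size
print(len(str(total_bins)))          # number of base-10 digits of 2**784
```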
Note the contrast between (b) and (e): even though with infinitely many datapoints the histogram becomes arbitrarily accurate at estimating the true distribution (b), in practice the number of samples required to even get a single datapoint in every region grows exponentially with d.
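To make this remark concrete, here is a small Monte Carlo sketch. Uniform data, m = 2, and n = 10,000 are assumptions made for illustration; the fraction of occupied bins collapses as d grows.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 2, 10_000
for d in (4, 8, 16, 32):
    x = rng.random((n, d))                       # n uniform points in [0,1]^d
    idx = (x * m).astype(int)                    # per-dimension bin index
    occupied = len({tuple(row) for row in idx})  # distinct occupied bins
    print(f"d={d:2d}: {occupied}/{m**d} bins occupied")
```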
\[
f_{\mu_0, \Sigma_0}(x) = \frac{1}{(2\pi)^{d/2} \sqrt{\det(\Sigma_0)}} \, e^{-\frac{1}{2} (x - \mu_0)^{\top} \Sigma_0^{-1} (x - \mu_0)}
\]
\[
f_{\mu_1, \Sigma_1}(x) = \frac{1}{(2\pi)^{d/2} \sqrt{\det(\Sigma_1)}} \, e^{-\frac{1}{2} (x - \mu_1)^{\top} \Sigma_1^{-1} (x - \mu_1)}
\]
If the coin lands on tails (Y = 0) we draw X from N_d(µ_0, Σ_0), and if the coin lands on heads (Y = 1) we draw X from N_d(µ_1, Σ_1).
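Here is a minimal sketch of this generative process. The concrete values (d = 2, a fair coin, unit means and covariances) are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 2
mu0, mu1 = np.zeros(d), np.ones(d)      # ASSUMED means
Sigma0 = Sigma1 = np.eye(d)             # ASSUMED covariances

y = int(rng.integers(0, 2))             # coin flip, assumed fair here
mu, Sigma = (mu1, Sigma1) if y == 1 else (mu0, Sigma0)
x = rng.multivariate_normal(mu, Sigma)  # draw X | Y from the matching Gaussian
print(f"Y = {y}, X = {x}")
```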
(a) Calculate P(Y = 0 | X = x), the probability that the coin landed on tails given X = x ∈ R^d, as a function of µ_0, µ_1, Σ_0, Σ_1, and x. Show all the steps of the derivation.
(b) Recall that the Bayes optimal classifier is
\[
h_{\mathrm{Bayes}}(x) = \operatorname*{arg\,max}_{y \in \{0, 1\}} P(Y = y \mid X = x).
\]
Show that in this setting, if Σ_0 = Σ_1, the Bayes optimal classifier is linear in x.
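A plug-in implementation of h_Bayes for concrete parameters can serve as a sanity check against your derivation; it is not the derivation itself. All parameter values below are assumptions for illustration, with a shared covariance matching the case of part (b).

```python
import numpy as np
from scipy.stats import multivariate_normal

d = 2
mu0, mu1 = np.zeros(d), np.ones(d)     # ASSUMED means
Sigma = np.eye(d)                      # shared covariance (Sigma0 = Sigma1)
p1 = 0.5                               # ASSUMED P(Y = 1), i.e., a fair coin

def h_bayes(x):
    # argmax over y of P(Y = y) * f_{mu_y, Sigma}(x)
    s0 = (1 - p1) * multivariate_normal.pdf(x, mean=mu0, cov=Sigma)
    s1 = p1 * multivariate_normal.pdf(x, mean=mu1, cov=Sigma)
    return int(s1 > s0)

print(h_bayes([0.2, 0.1]), h_bayes([0.9, 0.8]))  # expect 0 then 1
```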