

IFT 6390 Fundamentals of Machine Learning

Ioannis Mitliagkas

Homework 1 - Theoretical part


• This homework must be done and submitted to Gradescope individually. You are welcome to discuss with other students but the solution you submit must be your own. Note that we will use Gradescope’s plagiarism detection feature. All suspected cases of plagiarism will be recorded and shared with university officials for further handling.

• You need to submit your solution as a pdf file on Gradescope using the homework titled Homework 1 - Theory.

1. Probability warm-up: conditional probabilities and Bayes’ rule [5 points]

(a) Give the definition of the conditional probability of a discrete random variable X given a discrete random variable Y.
(b) Consider a biased coin with probability 3/4 of landing on heads
and 1/4 on tails. This coin is tossed three times. What is the
probability that exactly two heads occur (out of the three tosses)
given that the first outcome was a head?
(c) Give two equivalent expressions of P(X, Y):
(i) as a function of P(X) and P(Y | X)
(ii) as a function of P(Y) and P(X | Y)
(d) Prove Bayes’ theorem:

$$P(X \mid Y) = \frac{P(Y \mid X)\, P(X)}{P(Y)}.$$

(e) A survey of certain Montreal students is conducted, in which 55% of the surveyed students are affiliated with UdeM while the others are affiliated with McGill. A student is drawn randomly from this surveyed group.
i. What is the probability that the student is affiliated with
McGill?

ii. Now let’s say that this student is bilingual, and you know
that 80% of UdeM students are bilingual while 50% of McGill
students are. Given this information, what is the probability
that this student is affiliated with McGill?

2. Bag of words and single topic model [12 points]

We consider a classification problem where we want to predict the topic of a document from a given corpus (collection of documents). The topic of each document can either be sports or politics. 2/3 of the documents in the corpus are about sports and 1/3 are about politics.
We will use a very simple model where we ignore the order of the words appearing in a document and we assume that words in a document are independent of one another given the topic of the document.
In addition, we will use very simple statistics of each document as features: the probabilities that a word chosen randomly in the document is either “goal”, “kick”, “congress”, “vote”, or any other word (denoted by other). We will call these five categories the vocabulary or dictionary for the documents: V = {“goal”, “kick”, “congress”, “vote”, other}.
Consider the following distributions over words in the vocabulary given
a particular topic:

word                P(word | topic = sports)    P(word | topic = politics)
“goal”              2/100                       0
“kick”              1/200                       5/1000
“congress”          1/1000                      1/50
“vote”              2/500                       4/100
other               970/1000                    935/1000

Table 1

This table tells us, for example, that the probability that a word chosen at random in a document is “vote” is only 2/500 if the topic of the document is sports, but it is 4/100 if the topic is politics.
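As a concrete illustration of this single-topic bag-of-words model, here is a minimal Python sketch (assuming the topic prior of 2/3 vs. 1/3 and the word probabilities of Table 1; the function name and document length are illustrative only) that first draws a topic and then draws words i.i.d. given that topic:

import numpy as np

# Single-topic bag-of-words model: pick a topic, then draw words i.i.d. given it.
vocab = ["goal", "kick", "congress", "vote", "other"]
p_word_given_topic = {
    "sports":   [2/100, 1/200, 1/1000, 2/500, 970/1000],   # Table 1, sports column
    "politics": [0.0, 5/1000, 1/50, 4/100, 935/1000],       # Table 1, politics column
}
p_topic = {"sports": 2/3, "politics": 1/3}
rng = np.random.default_rng(0)

def sample_document(n_words=200):
    topic = rng.choice(list(p_topic), p=list(p_topic.values()))
    words = rng.choice(vocab, size=n_words, p=p_word_given_topic[topic])
    return topic, words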

(a) What is the probability that a random word in a document is
“goal” given that the topic is politics?
(b) In expectation, how many times will the word “goal” appear in a
document containing 200 words whose topic is sports?

(c) We randomly draw a document from the corpus. What is the probability that a random word of this document is “goal”?
(d) Suppose that we draw a random word from a document and this
word is “kick”. What is the probability that the topic of the
document is sports?
(e) Suppose that we randomly draw two words from a document and
the first one is “kick”. What is the probability that the second
word is “goal”?
(f) Going back to learning, suppose that you do not know the conditional probabilities given a topic or the probability of each topic (i.e. you don’t have access to the information in Table 1 or the topic distribution), but you have a dataset of N documents where each document is labeled with one of the topics sports and politics. How would you estimate the conditional probabilities (e.g., P(word = “goal” | topic = politics)) and topic probabilities (e.g., P(topic = politics)) from this dataset?

3. Maximum likelihood estimation [5 points]

Let x ∈ R be uniformly distributed in the interval [0, θ], where θ is a parameter. That is, the pdf of x is given by

$$
f_\theta(x) =
\begin{cases}
1/\theta & \text{if } 0 \le x \le \theta \\
0 & \text{otherwise.}
\end{cases}
$$

Suppose that n samples D = {x1, . . . , xn} are drawn independently according to fθ(x).

(a) Let fθ(x1, x2, . . . , xn) denote the joint pdf of n independent and identically distributed (i.i.d.) samples drawn according to fθ(x). Express fθ(x1, x2, . . . , xn) as a function of fθ(x1), fθ(x2), . . . , fθ(xn).

(b) We define the maximum likelihood estimate as the value of θ which maximizes the likelihood of having generated the dataset D from the distribution fθ(x). Formally,

$$\theta_{MLE} = \underset{\theta \in \mathbb{R}}{\arg\max}\; f_\theta(x_1, x_2, \ldots, x_n).$$

Show that the maximum likelihood estimate of θ is max(x1, . . . , xn).
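As a quick numerical sanity check of this claim (an illustrative sketch only, not the requested derivation; the sample size and the true value of θ are arbitrary), one can evaluate the joint likelihood on a grid of θ values and observe that it peaks at the sample maximum:

import numpy as np

# Evaluate the likelihood of an i.i.d. uniform sample on a grid of theta values.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 2.0, size=20)          # data generated with true theta = 2

def likelihood(theta, x):
    # Product of the individual densities f_theta(x_i).
    return np.prod(np.where((x >= 0) & (x <= theta), 1.0 / theta, 0.0))

thetas = np.linspace(0.5, 4.0, 2000)
values = [likelihood(t, x) for t in thetas]
print("sample maximum:   ", x.max())
print("numerical argmax: ", thetas[int(np.argmax(values))])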

4. Maximum likelihood meets histograms [10 points]

Let X1, X2, . . . , Xn be n i.i.d. data points drawn from a piecewise constant probability density function over N equal-size bins between 0 and 1 (B1, B2, . . . , BN), where the constants are θ1, θ2, . . . , θN:

$$
p(x; \theta_1, \ldots, \theta_N) =
\begin{cases}
\theta_j & \text{if } \frac{j-1}{N} \le x < \frac{j}{N} \text{ for } j \in \{1, 2, \ldots, N\} \\
0 & \text{otherwise.}
\end{cases}
$$

We define µj for j ∈ {1, 2, . . . , N} as

$$\mu_j := \sum_{i=1}^{n} \mathbb{1}(X_i \in B_j).$$
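For concreteness, here is a minimal sketch of how the counts µj could be computed (the data and the value of N are illustrative only):

import numpy as np

# Bin counts mu_j for data in [0, 1] split into N equal-size bins.
rng = np.random.default_rng(0)
N = 10
X = rng.uniform(0.0, 1.0, size=1000)        # illustrative data

# Bin index j - 1 for each point; a point at x = 1 is clipped into the last bin.
bin_index = np.minimum((X * N).astype(int), N - 1)
mu = np.bincount(bin_index, minlength=N)    # mu[j - 1] = number of X_i in bin B_j
print(mu, mu.sum())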

(a) Using the fact that the total area underneath a probability density function is 1, express θN in terms of the other constants.
(b) Write down the log-likelihood of the data in terms of θ1, θ2, . . . , θN−1 and µ1, µ2, . . . , µN−1.
(c) Find the maximum likelihood estimate of θj for j ∈ {1, 2, · · · , N }.

5. Histogram methods [10 points]

Consider a dataset {x1, . . . , xn} where each point x ∈ [0, 1]^d. Let f(x) be the true unknown data distribution. You decide to use a histogram method to estimate the density f(x) and divide each dimension into m bins.

(a) Show that for a measurable set S, E[1{x ∈ S}] = P(x ∈ S).


(b) Combining the result of the previous question with the Law of Large Numbers, show that the estimated probability of falling in bin i, as given by the histogram method, tends to ∫_{Vi} f(x) dx, the true probability of falling in bin i, as n → ∞. Vi denotes the volume occupied by bin i. (A numerical illustration of this convergence is sketched after this list of questions.)

(c) Consider the MNIST dataset with 784 dimensions. We divide
each dimension into 2 bins. How many digits (base 10) does the
total number of bins have?
(d) Assume the existence of an idealized MNIST classifier based on
a histogram of m = 2 bins per dimension. The accuracy of the
classifier increases by ε = 5% (starting from 10% and up to a
maximum of 100%) each time k = 4 new data points are added
to every bin. What is the smallest number of samples the classifier
requires to reach an accuracy of 90%?
(e) Assuming a uniform distribution over all bins, what is the probability that a particular bin is empty, as a function of d, m and n?
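The following sketch illustrates the convergence claimed in (b) numerically, with d = 1, m = 4 bins, and a triangular distribution on [0, 1] chosen purely for illustration:

import numpy as np

# Empirical frequency of one bin vs. its true probability, for growing n.
rng = np.random.default_rng(0)
lo, hi = 0.25, 0.50                          # bin [1/4, 1/2) out of m = 4 bins
true_p = hi**2 - lo**2                       # CDF of Triangular(0, 1, 1) is x^2
for n in [100, 10_000, 1_000_000]:
    x = rng.triangular(0.0, 1.0, 1.0, size=n)
    est_p = np.mean((x >= lo) & (x < hi))    # histogram estimate of the bin probability
    print(f"n = {n:>9,}: estimate = {est_p:.4f}, true = {true_p:.4f}")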

Note the contrast between (b) and (e): even though with infinitely many datapoints the histogram becomes arbitrarily accurate at estimating the true distribution (b), in practice the number of samples required to even get a single datapoint in every region grows exponentially with d.
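To make this contrast concrete, here is a small simulation sketch (the values of m, n, d and the number of trials are arbitrary) estimating how often a particular bin remains empty when n points are spread uniformly over the m^d bins:

import numpy as np

# Estimate by simulation the chance that one particular bin receives no points.
rng = np.random.default_rng(0)
m, n, trials = 2, 10_000, 1_000
for d in [5, 10, 20]:
    total_bins = m ** d
    # Draw each point's bin index directly and check whether bin 0 was ever hit.
    empty = sum(
        not np.any(rng.integers(0, total_bins, size=n) == 0)
        for _ in range(trials)
    )
    print(f"d = {d:>2}: {total_bins:>9,} bins, P(bin 0 empty) ≈ {empty / trials:.3f}")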

6. Gaussian Mixture [10 points]

Let µ0, µ1 ∈ R^d, and let Σ0, Σ1 be two d × d positive definite matrices (i.e. symmetric with positive eigenvalues).
We now introduce the following two pdfs over R^d:

$$
f_{\mu_0, \Sigma_0}(x) = \frac{1}{(2\pi)^{d/2} \sqrt{\det(\Sigma_0)}} \, e^{-\frac{1}{2}(x - \mu_0)^T \Sigma_0^{-1} (x - \mu_0)}
$$

$$
f_{\mu_1, \Sigma_1}(x) = \frac{1}{(2\pi)^{d/2} \sqrt{\det(\Sigma_1)}} \, e^{-\frac{1}{2}(x - \mu_1)^T \Sigma_1^{-1} (x - \mu_1)}
$$

These pdfs correspond to the multivariate Gaussian distribution of mean µ0 and covariance Σ0, denoted Nd(µ0, Σ0), and the multivariate Gaussian distribution of mean µ1 and covariance Σ1, denoted Nd(µ1, Σ1).
We now toss a balanced coin Y and draw a random variable X in R^d following this process: if the coin lands on tails (Y = 0) we draw X

from Nd(µ0, Σ0), and if the coin lands on heads (Y = 1) we draw X from Nd(µ1, Σ1).
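A minimal sketch of this generative process (the dimension and the parameter values below are purely illustrative):

import numpy as np

# Mixture generative process: flip a fair coin Y, then draw X from N(mu_Y, Sigma_Y).
rng = np.random.default_rng(0)
d = 2
mu0, Sigma0 = np.zeros(d), np.eye(d)                   # parameters of N_d(mu0, Sigma0)
mu1, Sigma1 = np.array([3.0, 3.0]), 0.5 * np.eye(d)    # parameters of N_d(mu1, Sigma1)

def sample():
    y = rng.integers(0, 2)                   # balanced coin: 0 = tails, 1 = heads
    if y == 0:
        x = rng.multivariate_normal(mu0, Sigma0)
    else:
        x = rng.multivariate_normal(mu1, Sigma1)
    return y, x

samples = [sample() for _ in range(5)]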

(a) Calculate P(Y = 0 | X = x), the probability that the coin landed on tails given X = x ∈ R^d, as a function of µ0, µ1, Σ0, Σ1, and x. Show all the steps of the derivation.
(b) Recall that the Bayes optimal classifier is hBayes(x) = argmax_{y ∈ {0,1}} P(Y = y | X = x). Show that in this setting, if Σ0 = Σ1, the Bayes optimal classifier is linear in x.
