Final F02soln
Final F02soln
Final F02soln
Final exam
December 5, 2002
Problem 1
We wish to estimate a mixture of two experts model for the data displayed in Figure 1.
The experts we can use here are linear regression models of the form
p(y|x, w) = N ( y; w1 x + w0 , σ 2 )
where N (y; µ, σ 2 ) denotes a Gaussian distribution over y with mean µ and variance σ 2 .
Each expert i can choose its parameters wi = [wi0 , wi1 ]T and σi2 independently from other
experts. Note that the first subindex i in wij refers to the expert.
The gating network in the case of two experts is given by a logistic regression model
P (expert = 1|x, v) = g( v1 x + v0 )
Cite as: Tommi Jaakkola, course materials for 6.867 Machine Learning, Fall 2006. MIT OpenCourseWare
(http://ocw.mit.edu/), Massachusetts Institute of Technology. Downloaded on [DD Month YYYY].
3.5 3.5 gate
3 3 (assigns 0.5 to each expert almost anywhere)
2.5 2.5
gate (1)
2 2
y
1 1
0.5 0.5
(2)
0 0
−0.5 −0.5
−1 −1
−1.5 −1.5
0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7
x x
1. (4 points) Suppose we estimate a mixture of two experts model based on the data in
Figure 1. You can assume that the estimation is successful in the sense that we will
find a setting of the parameters that maximizes the log-likelihood of the data. Please
indicate (approximately) in Figure 1a) the mean predictions from the two experts as
well as the decision boundary for the gating network. Label the mean predictions
– functions of x – with “(1)” and “(2)” corresponding to the two experts, and the
decision boundary with “gate”.
3. (3 points) Are the variances in the predictive Gaussian distributions of the experts
( ) larger,
( ) smaller,
( x ) about the same
Cite as: Tommi Jaakkola, course materials for 6.867 Machine Learning, Fall 2006. MIT OpenCourseWare
(http://ocw.mit.edu/), Massachusetts Institute of Technology. Downloaded on [DD Month YYYY].
s 1 2
P(x=1) 0 0.1 .01
1 2
P(x=2) 0.199 0
P(x=3) 0.8 0.7
.99 1
P(x=4) 0.001 0.2
.5 .5 .5 .5
Problem 2
Figure 2 shows a two-state HMM. The transition probabilities of the Markov chain are
given in the transition diagram. The output distribution corresponding to each state is
defined over {1, 2, 3, 4} and is given in the table next to the diagram. The HMM is equally
likely to start from either of the two states.
4. (2 points) Now, consider an output sequence 3 3 4. What are the first 2,2
two states of the most likely hidden state sequence?
Cite as: Tommi Jaakkola, course materials for 6.867 Machine Learning, Fall 2006. MIT OpenCourseWare
(http://ocw.mit.edu/), Massachusetts Institute of Technology. Downloaded on [DD Month YYYY].
5. (4 points) We can try to increase the modeling capacity of the HMM a bit by
breaking each state into two states. Following this idea, we created the diagram in
Figure 3. Can we set the initial state distribution and the output distributions so
that this 4-state model, with the transition probabilities indicated in the diagram,
would be equivalent to the original 2-state model? If yes, how? If no, why not?
No we cannot. First note that we have to associate the first two states in the 4-state
model with state 1 of the 2-state model. The probability of leaving the first two states
in the 4-state model, however, depends on time (whether the chain happens to be in
state 1 or 2). In contrast, in the 2-state model the probability of transitioning to 2
is always 0.01.
Cite as: Tommi Jaakkola, course materials for 6.867 Machine Learning, Fall 2006. MIT OpenCourseWare
(http://ocw.mit.edu/), Massachusetts Institute of Technology. Downloaded on [DD Month YYYY].
Problem 3
Figure 4 shows a graphical model over four binary valued variables, x1 , . . . , x4 . We do not
know the parameters of the probability distribution associated with the graph.
x1 x2
x3 x4
4. (2 points) The following table gives a possible partial specification of the conditional
probability P (x4 |x1 , x2 ) associated with the graph. Fill in the missing values so that
we could omit the arrow x1 → x4 in the graph and the graph would still adequately
represent the probability distribution.
P (x4 = 1|x1 = 0, x2 = 0) 0.8
P (x4 = 1|x1 = 0, x2 = 1) 0.4
P (x4 = 1|x1 = 1, x2 = 0) 0.8
P (x4 = 1|x1 = 1, x2 = 1) 0.4
Cite as: Tommi Jaakkola, course materials for 6.867 Machine Learning, Fall 2006. MIT OpenCourseWare
(http://ocw.mit.edu/), Massachusetts Institute of Technology. Downloaded on [DD Month YYYY].
5. (4 points) Let’s again focus on the original graph in Figure 4. Since we don’t know
the underlying probability distribution, we need to estimate it from observed data.
Unfortunately, the dataset we have is incomplete and contains only observations for
x2 and x3 . In other words, the dataset is D = {(xt2 , xt3 ), t = 1, . . . , n}. In the joint
distribution
( x ) P (x1 )
( x ) P (x2 )
( x ) P (x3 |x1 )
( ) P (x4 |x1 , x2 )
6. (4 points) If we use the EM algorithm to carry out the estimation task, what
posterior probabilities do we have to evaluate in the E-step? Please provide the
necessary posterior probabilities in the form P (· · · |xt2 , xt3 ).
We only need to evaluate P (x1 |xt3 ).
Since x4 is never observed, x1 and x2 are always independent. We thus have to
estimate only two independent parts x1 → x3 and x2 , where x1 is unobserved. In .
the EM-algorithm we need to fill-in the missing values for x1 , and thus evaluate
P (x1 |xt3 )
Cite as: Tommi Jaakkola, course materials for 6.867 Machine Learning, Fall 2006. MIT OpenCourseWare
(http://ocw.mit.edu/), Massachusetts Institute of Technology. Downloaded on [DD Month YYYY].
Problem 4
We try to select here between two models. Both models are logistic regression models
but differ in terms of the type of features that are used in making the predictions. More
specifically, the models have the common squashed additive form
where the input is a real number x ∈ R. The models differ in terms of the number and the
type of basis functions used:
model 1 : m = 1, φ1 (x) = x
model 2 : m = 2, φ1 (x) = x, φ2 (x) = sin(x)
Cite as: Tommi Jaakkola, course materials for 6.867 Machine Learning, Fall 2006. MIT OpenCourseWare
(http://ocw.mit.edu/), Massachusetts Institute of Technology. Downloaded on [DD Month YYYY].
2. (4 points) We will now switch to the Bayesian information criterion (BIC) for select
ing among the two models. Let L1 (n) be the log-probability of the labels that model
1 assigns to n training labels, where the probabilities are evaluated at the maximum
likelihood setting of the parameters. Let L2 (n) be the corresponding log-probability
for model 2. We imagine here that L1 (n) and L2 (n) are evaluated on the basis of the
first n training examples from a much larger set.
Now, in our empirical studies, we found that these log-probabilities are related in a
simple way:
How will we end up selecting between the two models as a function of the number of
training examples? Please choose one of the following cases.
( ) Always select 1
( ) Always select 2
( x ) First select 1, then 2 for larger n
( ) First select 2, then 1 for larger n
3. (4 points) Provide a brief justification for your answer to the previous question.
For large n we would select model 2 since it has a consistent (albeit small) advantage.
Initially, however, we would choose model 1 due to smaller complexity penalty.
To see this a bit more precisely, let’s recall the form of the BIC score: for model 1
it is defined as BIC1 = L1 (n) − d21 log(n), where d1 is the number of parameters in
model 1. The difference between the BIC scores for the two models is therefore
3−2
0.01·n � �� �
� �� � d2 − d1
BIC2 − BIC1 = L2 (n) − L1 (n) − log(n)
2
1
= 0.01 · n − log(n)
2
When n is small the complexity term dominates and BIC2 < BIC1 (the difference
is negative). For large n the linear increase of the log-likelihood difference overcomes
the logarithmic penalty and BIC2 > BIC1 .
Cite as: Tommi Jaakkola, course materials for 6.867 Machine Learning, Fall 2006. MIT OpenCourseWare
(http://ocw.mit.edu/), Massachusetts Institute of Technology. Downloaded on [DD Month YYYY].
Problem 5
Consider a simple two-class document (text) classification problem. Each document is
represented by a binary feature vector [φi1 , . . . , φiN ], where φki = 1 if keyword k is present in
the document, and zero otherwise. N is the number of keywords we have chosen to include.
We use a Naive Bayes model for this classification task. The joint distribution of the
features and the binary labels y ∈ {0, 1} is in this case given by
N
�
P (φ1 , . . . , φN , y) = P (y) P (φk |y)
k=1
1. (2 points) In the space below, draw the graphical model corresponding to Naive
Bayes generative model described above. Assume that N = 3 (three keywords).
y
φ1 φ2 φ3
2. (4 points) To be able to make use of training examples with possibly missing labels,
we will have to resort to the EM algorithm. In the EM algorithm we need to evaluate
the posterior probability of the label y given the document. We will use a message
passing algorithm (belief propagation) to get this posterior probability. The problem
here is that we relied on a rather careless friend to evaluate whether a document
contains any of the keywords. In other words, we do not fully trust the “observed”
values of the features. Let φ̂k be the “observed” value for the k th feature in a given
document. The evidence we now have about the actual value of φk is given by
P (φ̂k |φk ), which is a table that models how we expect the friend to respond.
Given that we observe φ̂k = 1, what is the message that φk needs to send to y?
Cite as: Tommi Jaakkola, course materials for 6.867 Machine Learning, Fall 2006. MIT OpenCourseWare
(http://ocw.mit.edu/), Massachusetts Institute of Technology. Downloaded on [DD Month YYYY].
There are a couple of options here depending on how you choose to incorporate
the evidence φ̂k = 1. Perhaps the simplest way is to put it directly in the single
node potential ψk (φk ). In other words, we define ψk (φk ) = P (φ̂k = 1|φk ), and
ψky (φk , y) = P (φk |y). In this notation, the message is
�
mk→y (y) = ψk (φk )ψky (φk , y)
φk =0,1
(t)
3. (3 points) We know that the EM algorithm is in some sense monotonic. Let θ̂k,y be
the estimate of the parameters θk,y in the beginning of iteration t of the EM algorithm,
(t) (t) (t)
and θ(t) be the vector of all parameter estimates in that iteration, [θ̂1,0 , θˆ1,1 , . . . , θˆN,y=1 ].
Which of the following quantities increases monotonically with t?
(t)
( ) P (φk = 1|y, θk,y ) for all k
�N
(x) P (φ1i , . . . , φNi
|θ(t) )
�i=1
N i i (t)
( ) i=1 P (yi = 1|φ1 , . . . , φN , θ )
4. (4 points) The class labels actually correspond to “relevant” and “irrelevant” doc
uments. In classifying any document as relevant or irrelevant, we have to take into
account that we might prefer to miss a few relevant documents if we can avoid mis
classifying a large number of irrelevant documents as relevant. To express such a
preference we define a utility U (y, ŷ), where y is the correct label and ŷ is how we
classify the document. Draw an influence diagram that incorporates the Naive Bayes
model, our decisions, and the utility. Mark each node in the graph according to the
variables (or utility) that they represent.
φ1
φ2 ŷ
y
φ3
5. (2 points) Let’s modify the Naive Bayes model a bit, to account for some of the
possible dependencies between the keywords. For example, suppose we order the
10
Cite as: Tommi Jaakkola, course materials for 6.867 Machine Learning, Fall 2006. MIT OpenCourseWare
(http://ocw.mit.edu/), Massachusetts Institute of Technology. Downloaded on [DD Month YYYY].
keywords so that it would be useful to model the dependency of φk on φk−1 , k =
2, . . . , N (the keywords may, for example, represent nested categories). We expect
these dependencies to be the same for each class, but the parameters can be affected
by the class label. Draw the graphical model for this – call it Sophisticated Bayes –
model. Please assume again that N = 3.
y
φ1 φ2 φ3
11
Cite as: Tommi Jaakkola, course materials for 6.867 Machine Learning, Fall 2006. MIT OpenCourseWare
(http://ocw.mit.edu/), Massachusetts Institute of Technology. Downloaded on [DD Month YYYY].