Machine Learning 10-701 Final Exam May 5, 2015
• No electronic devices are allowed during the exam. Obvious exceptions for pacemakers and hearing aids.
• If we catch you cheating or using any such device, the exam is over
for you. Your results will not count and you will have failed this
exam automatically. There might be more serious consequences, too.
No exceptions. Switch off your phones before the exam.
• There are more questions in the exam than you are likely to be able to solve in 90 minutes. Choose wisely and answer first those that you believe you can solve easily. This is an opportunity.
• None of the questions should require more than one page to answer. It's OK to be concise if you address the key points of the derivation (it should be human-readable, though).
Point Distribution
Problem Points
1 /10
2 /10
3 /15
4 /10
5 /15
6 /10
7 /15
Total /85
Solution:
False
The gradient of the L2 loss can grow without bound, whereas the L1 loss gradient is bounded, hence the influence of an outlier is limited.
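As a quick check of this claim, compare the derivatives of the two losses with respect to the prediction $\hat{y}$ (a one-dimensional sketch; the exam's exact notation may differ):
\[
\frac{\partial}{\partial \hat{y}} \tfrac{1}{2}(\hat{y} - y)^2 = \hat{y} - y \quad \text{(unbounded as the residual grows)},
\qquad
\frac{\partial}{\partial \hat{y}} |\hat{y} - y| = \operatorname{sign}(\hat{y} - y) \quad \text{(bounded by 1)}.
\]
A single outlier can therefore pull an L2 fit arbitrarily far, while its pull on an L1 fit is capped.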
2. Logistic loss is better than L2 loss in classification tasks.
Solution:
True
With logistic loss, correctly classified points that are far away from the decision boundary have much less impact on the decision boundary.
3. In terms of feature selection, L2 regularization is preferred since it comes up with sparse solutions.
Solution:
False
L1 regularization (LASSO) yields sparse solutions because its penalty has a non-vanishing (sub)gradient at 0, which drives small weights exactly to zero.
2. Write out a stochastic gradient descent learning algorithm for minimizing R(w)
Solution:
Randomly select a training instance $x_i$ and update $w$ as follows:
\[ w \leftarrow (1 - \lambda \eta_t)\, w - \eta_t g_i, \]
where $g_i$ denotes the gradient of the loss on $x_i$.
3. Name two common choices for choosing the learning rate and briefly explain their properties.
Solution:
$O(1/\sqrt{t})$; $O(1/t)$ for strong convexity; polynomial decay; AdaGrad; etc.
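A minimal SGD sketch combining the update from the previous part with two of these schedules; the regularized objective $R(w) = \frac{\lambda}{2}\|w\|^2 + \frac{1}{N}\sum_i \ell_i(w)$ and the squared loss are illustrative assumptions, not taken from the exam statement:

```python
# SGD for R(w) = (lambda/2)||w||^2 + (1/N) sum_i l_i(w), using the update
# w <- (1 - lambda * eta_t) w - eta_t * g_i with g_i the gradient of l_i at w.
import numpy as np

def sgd(X, y, lam, T=10_000, schedule="inv_sqrt", seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for t in range(1, T + 1):
        # O(1/sqrt(t)) for general convex objectives, O(1/(lam*t)) under strong convexity.
        eta = 1.0 / np.sqrt(t) if schedule == "inv_sqrt" else 1.0 / (lam * t)
        i = rng.integers(len(y))              # draw a random training instance
        g = (X[i] @ w - y[i]) * X[i]          # gradient of the squared loss on (x_i, y_i)
        w = (1.0 - lam * eta) * w - eta * g   # the update from the solution above
    return w
```

AdaGrad would instead set a per-coordinate rate from the accumulated squared gradients.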
2 Neural Networks
2.1 Quick Questions [3 points]
Provide a brief answer to the following questions (1-2 sentences).
1. State one advantage of linear rectified activation compared to logistic sigmoid activation.
Solution:
No saturation for positive inputs (non-vanishing gradient), sparse activations, cheap computation, etc.
[Figure: a small feed-forward network with inputs x1, x2, x3 and hidden units h1, h2.]
We use linear rectified units σ(z) = max(0, z) as activation function for the hidden and the output layer.
Moreover, denote by $l(y, t) = \frac{1}{2}(y - t)^2$ the loss function. Here $t$ is the target value for the output unit $y$.
Denote by W and V weight matrices connecting input and hidden layer, and hidden layer and output
respectively. They are initialized as follows:
\[ W = \begin{pmatrix} 1 & 0 & 1 \\ 1 & -1 & 0 \end{pmatrix}, \qquad V = \begin{pmatrix} 0 & 1 \end{pmatrix}, \qquad x = [1, 2, 1]^\top, \qquad t = 1. \]
Also assume that we have at least one sample (x, t) given by the values above.
1. Write out symbolically (no need to plug in the specific values of W and V just yet) the mapping x → y
using σ, W, V .
Solution:
$y = \sigma(V \sigma(W x))$, where $\sigma$ is applied element-wise.
2. Assume that the current input is x = (1, 2, 1). The target value is t = 1. Compute the numerical output value y, clearly show all intermediate steps. You can reuse the results of the previous question.
Solution:
Using the result of the previous question, $Wx = (2, -1)^\top$, hence $h = \sigma(Wx) = (2, 0)^\top$, $Vh = 0$, and $y = \sigma(0) = 0$.
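A small numerical sketch verifying this forward pass (not part of the exam; just a check of the arithmetic):

```python
# Forward pass y = sigma(V sigma(W x)) with ReLU activations and the values above.
import numpy as np

relu = lambda z: np.maximum(0, z)

W = np.array([[1, 0, 1],
              [1, -1, 0]])
V = np.array([[0, 1]])
x = np.array([1, 2, 1])
t = 1.0

h = relu(W @ x)            # W x = (2, -1)  ->  h = (2, 0)
y = relu(V @ h)            # V h = 0        ->  y = 0
loss = 0.5 * (y - t) ** 2  # l(y, t) = 1/2 (y - t)^2 = 0.5
print(h, y, loss)
```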
3. Compute the gradient of the loss function with respect to the weights. In particular, compute the
following terms symbolically:
• The gradient relative to $V$, i.e. $\frac{\partial l}{\partial V}$
• The gradient relative to $W$, i.e. $\frac{\partial l}{\partial W}$
• Compute the values numerically for the choices of W, V, x, y given above.
Solution:
The gradients are computed with the chain rule as follows:
\[
\frac{\partial l}{\partial V} = \left( \frac{\partial y}{\partial V} \right)^{\!\top} \frac{\partial l}{\partial y},
\qquad
\frac{\partial l}{\partial W} = \left( \frac{\partial h}{\partial W} \right)^{\!\top} \left( \frac{\partial y}{\partial h} \right)^{\!\top} \frac{\partial l}{\partial y}.
\]
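A numerical sketch of these gradients for the values above. Note that the output pre-activation $Vh$ equals exactly 0 here, so the result depends on which subgradient of the ReLU is used at 0; the choice $\sigma'(0) = 1$ below is an assumption, not something fixed by the exam:

```python
# Backpropagation through y = sigma(V sigma(W x)) with l(y, t) = 1/2 (y - t)^2.
import numpy as np

relu = lambda z: np.maximum(0.0, z)
drelu = lambda z: (z >= 0).astype(float)   # subgradient convention: sigma'(0) = 1

W = np.array([[1.0, 0.0, 1.0],
              [1.0, -1.0, 0.0]])
V = np.array([[0.0, 1.0]])
x = np.array([1.0, 2.0, 1.0])
t = 1.0

a = W @ x; h = relu(a)            # a = (2, -1), h = (2, 0)
b = V @ h; y = relu(b)            # b = 0,       y = 0

dl_dy = y - t                     # dl/dy = -1
delta_out = dl_dy * drelu(b)      # error at the output pre-activation
dl_dV = np.outer(delta_out, h)    # -> [[-2, 0]]
delta_h = (V.T @ delta_out) * drelu(a)
dl_dW = np.outer(delta_h, x)      # -> all zeros: h2's pre-activation is negative,
print(dl_dV)                      #    and h1 feeds the output with weight 0
print(dl_dW)
```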
[Plate diagram: nodes $\alpha$, $\pi$, $z_i$, $x_i$, $\rho_{kj}$, $\beta, \gamma$, with plates over all $i$ and over all $k, j$.]
Solution:
\begin{align*}
p(\pi, z, x, \rho \mid \alpha, \beta, \gamma)
&= p(\pi \mid \alpha) \prod_{i=1}^{N} \Big[ p(z_i \mid \pi) \prod_{j=1}^{d} p(x_{ij} \mid \rho_{z_i j}) \Big] \prod_{k=1}^{K} \prod_{j=1}^{d} p(\rho_{kj} \mid \beta, \gamma) \\
&= \frac{\Gamma(\beta+\gamma)}{\Gamma(\beta)\Gamma(\gamma)} \, \frac{\Gamma(K\alpha)}{\Gamma^{K}(\alpha)}
   \prod_{k=1}^{K} \pi_k^{\alpha-1}
   \prod_{i=1}^{N} \pi_{z_i} \prod_{j=1}^{d} \rho_{z_i j}^{x_{ij}} (1-\rho_{z_i j})^{1-x_{ij}}
   \prod_{k=1}^{K} \prod_{j=1}^{d} \rho_{kj}^{\beta-1} (1-\rho_{kj})^{\gamma-1} \\
&= \frac{\Gamma(\beta+\gamma)}{\Gamma(\beta)\Gamma(\gamma)} \, \frac{\Gamma(K\alpha)}{\Gamma^{K}(\alpha)}
   \prod_{k=1}^{K} \pi_k^{n_k+\alpha-1}
   \prod_{j=1}^{d} \rho_{kj}^{\beta-1+m_{kj}} (1-\rho_{kj})^{\gamma-1+n_k-m_{kj}}
\end{align*}
The last step follows if we define $n_k = \sum_{i=1}^{N} \delta(z_i = k)$ and $m_{kj} = \sum_{i=1}^{N} \delta(z_i = k)\, x_{ij}$.
3.2 Inference
1. Explain why it is nontrivial to infer the posterior π, z, ρ|α, β, γ, x directly.
2. Suggest at least one alternative approach to Expectation Maximization for solving this problem.
Solution:
3.2.1: There is no closed-form expression for $p(x \mid \alpha, \beta, \gamma)$, the normalizer of the posterior.
3.2.2: Gibbs sampling or any other MCMC method.
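To make the Gibbs sampling suggestion concrete, here is a minimal sketch of one (uncollapsed) sweep for the Bernoulli mixture above; function and variable names are illustrative assumptions, not from the exam:

```python
# One Gibbs sweep for the mixture: z_i | pi, rho, x_i, then pi | z, then rho | z, x.
import numpy as np

def gibbs_sweep(X, z, pi, rho, alpha, beta, gamma, rng):
    N, d = X.shape
    K = len(pi)
    # Resample each assignment: p(z_i = k | .) propto pi_k prod_j rho_kj^x_ij (1 - rho_kj)^(1 - x_ij)
    for i in range(N):
        logp = np.log(pi) + X[i] @ np.log(rho).T + (1 - X[i]) @ np.log(1 - rho).T
        p = np.exp(logp - logp.max())
        z[i] = rng.choice(K, p=p / p.sum())
    # Resample pi | z ~ Dirichlet(alpha + n_1, ..., alpha + n_K).
    n = np.bincount(z, minlength=K)
    pi[:] = rng.dirichlet(alpha + n)
    # Resample rho_kj | z, x ~ Beta(beta + m_kj, gamma + n_k - m_kj).
    for k in range(K):
        m = X[z == k].sum(axis=0)
        rho[k] = rng.beta(beta + m, gamma + n[k] - m)
    return z, pi, rho
```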
to derive a lower bound on p(x, π, ρ|α, β, γ). Hint — you need to make an approximation for the distribution
over z when obtaining the variational distribution.
Solution:
Set $a = (\rho, \pi)$ and $b = z$. Then
\begin{align*}
\log p(x, \pi, \rho \mid \alpha, \beta, \gamma)
&= \log \sum_{z} p(x, \pi, z, \rho \mid \alpha, \beta, \gamma) \\
&\geq \mathbb{E}_{z \sim q(z)}\big[\log p(x, \pi, z, \rho \mid \alpha, \beta, \gamma)\big] + H[q] \\
&= \mathbb{E}_{z \sim q(z)}\Bigg[ \sum_{k=1}^{K} \bigg( (n_k + \alpha - 1)\log \pi_k + \sum_{j=1}^{d} (\beta - 1 + m_{kj}) \log \rho_{kj} + (\gamma - 1 + n_k - m_{kj}) \log(1 - \rho_{kj}) \bigg) \Bigg] \\
&\quad + H[q] + \text{constant}
\end{align*}
Solution:
\[
p(z_i = k \mid x_i, \pi, \rho)
= \frac{p(z_i = k \mid \pi) \prod_{j=1}^{d} p(x_{ij} \mid \rho_{kj})}
       {\sum_{k'=1}^{K} p(z_i = k' \mid \pi) \prod_{j=1}^{d} p(x_{ij} \mid \rho_{k'j})}
\;\propto\; \pi_k \prod_{j=1}^{d} \rho_{kj}^{x_{ij}} (1-\rho_{kj})^{1-x_{ij}}
\]
Solution:
• $\pi$:
\[
\hat{\pi}_k = \frac{\sum_i q_i(k) + \alpha - 1}{\sum_{i,k'} q_i(k') + K(\alpha - 1)}.
\]
[Figure: a Bayesian network over the nodes A, B, C, D, E, F, G, H, I, J, K.]
Check whether each of the following independences holds or not. Provide a brief reason for your answer. Alternatively, give a path for the Bayes ball algorithm as an illustration.
• J ⊥ E|G
• J ⊥ E|K, G
• J ⊥ E|B
• J ⊥ E|A
• J ⊥ E|A, H
Solution:
[Plate diagram: nodes $\alpha$, $\theta_d$, $z_{di}$, $w_{di}$, $\psi_k$, $H$, with plates over all $i$, over all $k$, and over all $d$.]
1. For the given graphical model as it is, what is the minimal set of random variables that affects $z_{di} = k$ for a given $(d, i)$ pair?
2. Suppose we integrate out the random variables $\theta_d$ and $\psi_k$. Which nodes affect $z_{di} = k$ for a given $(d, i)$ pair now?
Solution:
5.2 Kernels
What is the difference between a kernel obtained from the feature map $\phi(x)$ described above and a kernel such as $k(x, x') = \exp\!\big(-\tfrac{1}{2}\|x - x'\|^2\big)$? Can you characterize the type of functions that can be approximated? Hint: what does the rank of the kernel matrix tell you?
Solution:
In the case of $k(x, x') = \exp\!\big(-\tfrac{1}{2}\|x - x'\|^2\big)$ the data point is mapped to an infinite-dimensional feature space, so the kernel matrix can have full rank. With the explicit feature map $\phi(x)$ the rank of the kernel matrix is at most the dimension of $\phi$, and only polynomials of order $d$ can be approximated.
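A quick numerical illustration of the rank argument, under the assumption that $\phi(x)$ collects monomials up to some degree $d$ (the exam's exact feature map may differ):

```python
# Rank of a Gram matrix from an explicit finite feature map vs. an RBF kernel matrix.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(50)              # 50 scalar data points
d = 3                                    # assumed polynomial degree for phi
Phi = np.vander(x, d + 1)                # phi(x) = (x^d, ..., x, 1)

K_poly = Phi @ Phi.T                     # rank <= d + 1 regardless of the number of points
K_rbf = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)

print(np.linalg.matrix_rank(K_poly))     # -> d + 1 = 4
print(np.linalg.matrix_rank(K_rbf))      # -> much larger (full rank in exact arithmetic)
```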
$p\big(w \mid (x_1, y_1), \ldots, (x_m, y_m), \tau^2, \sigma^2\big)$
Solution:
\[
p\big(w \mid (x_1, y_1), \ldots, (x_m, y_m), \tau^2, \sigma^2\big) \;\propto\; p(w \mid \sigma^2) \prod_i p(y_i \mid x_i, w, \tau^2) \tag{1}
\]
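For completeness, a sketch of the closed form of (1) under the standard assumptions that the prior is $w \sim \mathcal{N}(0, \sigma^2 I)$ and the likelihood is $y_i \sim \mathcal{N}(w^\top x_i, \tau^2)$ (these Gaussian choices are assumed here, not restated from the exam):
\[
p(w \mid \text{data}) = \mathcal{N}(w \mid \mu, \Sigma), \qquad
\Sigma = \Big( \tfrac{1}{\tau^2} X^\top X + \tfrac{1}{\sigma^2} I \Big)^{-1}, \qquad
\mu = \tfrac{1}{\tau^2}\, \Sigma X^\top y,
\]
where $X$ stacks the $x_i^\top$ as rows and $y = (y_1, \ldots, y_m)^\top$.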
5.5 Prediction
Assume that we want to estimate $y' \mid x'$ using the previous observations, i.e. we compute the posterior predictive $p(y' \mid x', \text{rest})$.
Solution:
The mode of the posterior predictive $p(y' \mid x', \text{rest})$ is simply
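A sketch of that mode, assuming the Gaussian prior and likelihood from the previous part: the predictive is
\[ p(y' \mid x', \text{rest}) = \mathcal{N}\!\big(y' \mid x'^\top \mu,\; \tau^2 + x'^\top \Sigma x'\big), \]
so its mode coincides with its mean, $x'^\top \mu$, with $\mu$ and $\Sigma$ the posterior mean and covariance of $w$.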
\[ f_{um} = \langle v_u, w_m \rangle + b_u + b_m. \]
Here $v_u$ and $w_m$ are vectors in $\mathbb{R}^d$ and $b_u$, $b_m$ are scalars, indicating the bias.
Here $U$ denotes the set of all users, $M$ the set of all movies, and $u \sim m$ represents the sum over all $(u, m)$ pairs for which a rating exists. Write out the optimal value of $b_u$, provided that all other values are fixed. That is, compute the optimal value of $b_u \mid v, w, b_m, r$.
Solution:
Set the derivative with respect to the bias of a particular user $u_0$ to 0:
\[
\sum_{u_0 \sim m} (f_{u_0 m} - r_{u_0 m}) + \lambda b_{u_0} = 0
\quad\Longrightarrow\quad
b_{u_0} = \frac{\sum_{u_0 \sim m} (r_{u_0 m} - f_{u_0 m})}{\lambda},
\]
where $u_0 \sim m$ ranges over the movies rated by $u_0$.
6.2 Optimization
Is the problem jointly convex in $v$ and $w$? You need to prove this rather than just giving a binary answer. It suffices to prove this for a very simplistic case, say for only one user and one movie.
Solution:
Since we assume only one user and one movie, and ignoring the bias terms, the objective reduces to $O(v, w) = \frac{1}{2}(vw - r_{um})^2$, whose Hessian is
\[
\begin{pmatrix} w^2 & 2vw - r_{um} \\ 2vw - r_{um} & v^2 \end{pmatrix}.
\]
This is not positive semi-definite in general: for example, at $v = w = 0$ and $r_{um} = 1$ it has eigenvalues $\pm 1$. Therefore the problem is convex in $v$ and in $w$ separately (element-wise) but not jointly convex in $v$ and $w$.
Solution:
When a new user comes in, we have little information about them, so matrix factorization cannot learn many associations between the new user and the existing users. We should use the user's demographic information to bridge their associations with existing users. Many ideas can be applied here; perhaps the most intuitive is to do regression based on the demographic features, compute a similarity between the new user and existing users, and then approximate $v_u$ with a linear combination of existing users' vectors.
Active learning is another option.
2. How could you accomplish this for a new movie?
Solution:
Use the metadata of the movie as additional information to encode similarity, perhaps approximating the corresponding $w_m$ as a linear combination of existing movies' vectors based on their similarity in terms of metadata. This can be encoded in the objective function.
Active learning is another option.
Solution:
r s
r 0.4 0.6
s 0.3 0.7
Solution:
With a large finite or countably infinite state space, the standard way of computing the stationary distribution is not possible. One good example is computing the PageRank scores of the web.
We can use power iteration, or MCMC methods that simulate random walks, to find approximations to the stationary distribution.
In terms of MCMC, with fixed parameters the Monte Carlo standard error (MCSE) shrinks as the number of samples grows, so we can compute a confidence interval for how far our estimate is from the truth.
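A minimal power-iteration sketch for the two-state rain/sun chain above (state ordering and row/column convention are assumptions):

```python
# Power iteration for the stationary distribution of the rain/sun chain.
import numpy as np

P = np.array([[0.4, 0.6],     # row i = current state, column j = next state
              [0.3, 0.7]])    # states: 0 = rainy, 1 = sunny (assumed ordering)

pi = np.array([0.5, 0.5])     # any initial distribution works
for _ in range(100):
    pi = pi @ P               # one step: pi_{t+1} = pi_t P

print(pi)                     # -> approximately [1/3, 2/3]
```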
Solution:
Given that days 5 and 7 are rainy,
Solution:
Given that day 7 is rainy and day 10 is sunny, enumerate all possible sequences for the intermediate days and evaluate their likelihoods.
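A sketch of that enumeration under the transition matrix above (the precise quantity asked for is not shown here, so the code simply scores every assignment of days 8 and 9 and normalizes):

```python
# Enumerate weather for days 8-9 given day 7 = rainy and day 10 = sunny.
from itertools import product
import numpy as np

P = np.array([[0.4, 0.6],
              [0.3, 0.7]])            # 0 = rainy, 1 = sunny (assumed ordering)
RAINY, SUNNY = 0, 1

likelihoods = {}
for d8, d9 in product([RAINY, SUNNY], repeat=2):
    path = [RAINY, d8, d9, SUNNY]     # days 7, 8, 9, 10
    likelihoods[(d8, d9)] = np.prod([P[a, b] for a, b in zip(path, path[1:])])

total = sum(likelihoods.values())
for seq, p in likelihoods.items():
    print(seq, p, p / total)          # likelihood and posterior over (day 8, day 9)
```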