
Machine Learning 10-701 Final Exam May 5, 2015

Name: Andrew ID:


Instructions
• Anything on paper is OK, in arbitrary shape, size, and quantity.
• Electronic devices are not acceptable. This includes iPods, iPads, Android tablets, Blackberries, Nokias,
Windows phones, Microsoft Surface, laptops, Chromebooks, MP3 players, digital cameras, Google
Glass, Android Wear, or anything else that requires electricity.1

• If we catch you cheating or using any such device, the exam is over
for you. Your results will not count and you will have failed this
exam automatically. There might be more serious consequences, too.
No exceptions. Switch off your phones before the exam.
• There are more questions in the exam than you are likely to be able to solve in 90 minutes. Choose
wisely and answer first those that you believe you can solve easily. This is an opportunity.
• None of the questions should require more than 1 page to answer. It's OK to be concise as long as you address
the key points of the derivation (it should be human-readable, though).

Point Distribution
Problem Points
1 /10
2 /10
3 /15
4 /10
5 /15
6 /10
7 /15
Total /85

1 Obvious exceptions for pacemakers and hearing aids.


1 Loss, Regularization and Optimization [10 points]


1.1 Quick Questions [4 points]
Explain in one or two sentences why the statements are true (or false).
1. L2 loss is more robust to outliers than L1 loss.

Solution:
False
The gradient of the L2 loss can grow without bound, whereas the L1 loss gradient is bounded; hence the
influence of an outlier is limited under L1 loss but not under L2 loss.
2. Logistic loss is better than L2 loss in classification tasks.

Solution:
True
With logistic loss, correctly classified points that are far away from the decision boundary have much less
impact on the decision boundary than with L2 loss.

3. In terms of feature selection, L2 regularization is preferred since it comes up with sparse solutions.

Solution:
False
L1 regularization (LASSO) comes up with sparse solutions due to its nonvanishing (sub)gradient at 0; L2 regularization does not.

1.2 Gradient Descent [6 points]


Denote by $x \in \mathbb{R}^d$ the data, by $w \in \mathbb{R}^d$ the weight vector, by $y \in \mathbb{R}$ the labels, by $\lambda > 0$ a regularization constant,
and by $m$ the total number of data points. Let $R(w)$ be the regularized risk function:

$$R(w) := \frac{1}{m} \sum_{i=1}^{m} l\big(y_i - \langle w, x_i \rangle\big) + \frac{\lambda}{2} \|w\|^2
\qquad \text{where} \qquad
l(\xi) = \begin{cases} \tfrac{1}{2}\xi^2 & \text{if } |\xi| < 1 \\ |\xi| - \tfrac{1}{2} & \text{otherwise.} \end{cases}$$

1. Calculate the batch gradient of R(w) with respect to w.


Solution:
Define $g_i$ as the gradient contribution of the $i$-th data point:

$$g_i = \begin{cases} (\langle w, x_i \rangle - y_i)\,x_i & \text{if } |\langle w, x_i \rangle - y_i| < 1 \\ \mathrm{sgn}(\langle w, x_i \rangle - y_i)\,x_i & \text{otherwise} \end{cases}
\qquad\qquad
\frac{\partial R}{\partial w} = \frac{1}{m} \sum_{i=1}^{m} g_i + \lambda w$$

2. Write out a stochastic gradient descent learning algorithm for minimizing R(w)
Solution:
Repeat until convergence: randomly select a training instance $x_i$ and update $w$ as follows:

$$w \leftarrow (1 - \lambda \eta_t)\, w - \eta_t\, g_i$$

where $\eta_t$ is the learning rate at step $t$.

3. Name two common choices for choosing the learning rate and briefly explain their properties.
Solution:
$O(1/\sqrt{t})$, which guarantees convergence for general convex problems; $O(1/t)$, which gives faster rates under strong convexity; polynomial decay; AdaGrad, which adapts the rate per coordinate based on accumulated gradients; etc.
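For concreteness, here is a minimal NumPy sketch of the stochastic update from part 2 with an $O(1/\sqrt{t})$ learning rate; the function and parameter names are illustrative assumptions, not part of the exam.

```python
import numpy as np

def huber_grad(residual, x):
    """Gradient of the piecewise loss l(xi) above w.r.t. w, for one data point."""
    if abs(residual) < 1.0:
        return residual * x
    return np.sign(residual) * x

def sgd(X, y, lam=0.1, n_steps=1000, seed=0):
    """Minimal SGD sketch for the regularized risk R(w) defined above."""
    rng = np.random.default_rng(seed)
    m, d = X.shape
    w = np.zeros(d)
    for t in range(1, n_steps + 1):
        i = rng.integers(m)                    # pick a random training instance
        eta = 1.0 / np.sqrt(t)                 # O(1/sqrt(t)) learning rate
        residual = X[i] @ w - y[i]             # <w, x_i> - y_i
        g = huber_grad(residual, X[i])
        w = (1.0 - lam * eta) * w - eta * g    # w <- (1 - lambda*eta) w - eta*g_i
    return w
```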


2 Neural Networks [10 points]
2.1 Quick Questions [3 points]
Provide a brief answer to the following questions (1-2 sentences).
1. State one advantage of linear rectified activation compared to logistic sigmoid activation.
Solution:
The gradient does not saturate for positive inputs, activations are sparse, computation is cheap, etc.

2. Does it make sense to initialize all weights in a deep network to 0?


Solution:
No.
Symmetry breaking: if all weights are equal, their gradients will be the same as well (and may vanish), so all
neurons will learn the same feature.

2.2 Forward and Backward Propagation [7 points]


The following graph shows the structure of a simple neural network with a single hidden layer. The input
layer consists of three dimensions x = (x1 , x2 , x3 ). The hidden layer includes two units h = (h1 , h2 ). The
output layer includes one unit y. We ignore bias terms for simplicity.

[Figure: a fully connected network with input units x1, x2, x3, hidden units h1, h2, and output unit y.]

We use linear rectified units $\sigma(z) = \max(0, z)$ as the activation function for the hidden and the output layer.
Moreover, denote by $l(y, t) = \frac{1}{2}(y - t)^2$ the loss function. Here $t$ is the target value for the output unit $y$.
Denote by W and V weight matrices connecting input and hidden layer, and hidden layer and output
respectively. They are initialized as follows:

$$W = \begin{pmatrix} 1 & 0 & 1 \\ 1 & -1 & 0 \end{pmatrix} \quad\text{and}\quad V = \begin{pmatrix} 0 & 1 \end{pmatrix} \quad\text{and}\quad x = [1, 2, 1] \quad\text{and}\quad t = 1.$$

Also assume that we have at least one sample (x, t) given by the values above.

1. Write out symbolically (no need to plug in the specific values of W and V just yet) the mapping x → y
using σ, W, V .
Solution:

y = σ(V σ(W x))

2. Assume that the current input is x = (1, 2, 1). The target value is t = 1. Compute the numerical output
value y, clearly showing all intermediate steps. You can reuse the results of the previous question. Using
matrix form is recommended but not mandatory.


Solution:

$$h = \sigma(W x) = (2, 0)^\top$$

$$y = \sigma(V h) = \sigma(0) = 0$$
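As a sanity check, the forward pass can be reproduced in a few lines of NumPy (an illustrative sketch, not part of the exam solution):

```python
import numpy as np

relu = lambda z: np.maximum(0, z)

W = np.array([[1., 0., 1.],
              [1., -1., 0.]])
V = np.array([[0., 1.]])
x = np.array([1., 2., 1.])
t = 1.0

h = relu(W @ x)            # -> [2., 0.]
y = relu(V @ h)[0]         # -> 0.0
loss = 0.5 * (y - t) ** 2  # -> 0.5
print(h, y, loss)
```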

3. Compute the gradient of the loss function with respect to the weights. In particular, compute the
following terms symbolically:
• The gradient relative to $V$, i.e. $\partial l / \partial V$
• The gradient relative to $W$, i.e. $\partial l / \partial W$
• Compute the values numerically for the choices of $W, V, x, t$ given above.

Solution:
The gradients are computed with the chain rule as follows:

$$\frac{\partial l}{\partial V} = \left(\frac{\partial y}{\partial V}\right)^{\!\top} \frac{\partial l}{\partial y}
\qquad\qquad
\frac{\partial l}{\partial W} = \left(\frac{\partial h}{\partial W}\right)^{\!\top} \left(\frac{\partial y}{\partial h}\right)^{\!\top} \frac{\partial l}{\partial y}$$

Plugging in numerical values yields

$$\frac{\partial l}{\partial V} = \left(\frac{\partial V h}{\partial V}\right)^{\!\top} \left(\frac{\partial y}{\partial V h}\right)^{\!\top} \frac{\partial l}{\partial y} = h^\top g\, (y - t) = [-2g,\ 0]$$

where $0 \le g \le 1$ is the subgradient of the ReLU at its kink (the output pre-activation $V h$ equals 0), and

$$\frac{\partial l}{\partial W} = \left(\frac{\partial W x}{\partial W}\right)^{\!\top} \left(\frac{\partial h}{\partial W x}\right)^{\!\top} \left(\frac{\partial y}{\partial h}\right)^{\!\top} \frac{\partial l}{\partial y} = g\,(y - t)\, M V^\top x^\top = \begin{pmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix}
\qquad\text{where } M = \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}.$$
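Likewise, the numerical gradients can be checked with a short NumPy sketch; it repeats the forward pass for self-containment and sets the ReLU subgradient at 0 to $g = 0.5$ purely for illustration (any value in $[0, 1]$ is valid):

```python
import numpy as np

# Forward pass, then backpropagation. g is the ReLU subgradient at the output
# pre-activation V @ h = 0; the value 0.5 is an arbitrary choice in [0, 1].
W = np.array([[1., 0., 1.], [1., -1., 0.]])
V = np.array([[0., 1.]])
x = np.array([1., 2., 1.])
t, g = 1.0, 0.5

h = np.maximum(0, W @ x)                   # [2., 0.]
y = np.maximum(0, V @ h)[0]                # 0.0

dl_dy = y - t                              # -1.0
M = np.diag((W @ x > 0).astype(float))     # hidden-layer ReLU subgradients: diag(1, 0)
dl_dV = g * dl_dy * h.reshape(1, -1)       # [-2g, 0] -> [[-1., 0.]]
dl_dh = g * dl_dy * V.flatten()            # backpropagated error at the hidden layer
dl_dW = np.outer(M @ dl_dh, x)             # 2x3 zero matrix
print(dl_dV, dl_dW, sep="\n")
```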


3 Expectation Maximization for Clustering Bit Vectors [15 points]


Assume that we observe binary vectors $x_i \in \{0,1\}^d$, e.g. $x_i = (0\ 0\ 1\ 1\ 1\ 0\ 0\ 0\ 1)$. Our goal is to find a
generative model that represents these vectors. A mixture of binary distributions might be a good idea. In
particular, we pick the following graphical model:

[Plate diagram: $\alpha \to \pi \to z_i \to x_i \leftarrow \rho_{kj} \leftarrow (\beta, \gamma)$, with one plate over all $i$ and one over all $k, j$.]

The associated generative model is given as follows:


1. For each topic $k \in \{1, \ldots, K\}$ and for each coordinate $j \in \{1, \ldots, d\}$:
(a) Draw $\rho_{kj} \sim \mathrm{Beta}(\beta, \gamma)$
2. Draw mixture weights $\pi \sim \mathrm{Dirichlet}(\alpha)$
3. For each observation $i \in \{1, \ldots, N\}$:
(a) Draw a component index $z_i \sim \mathrm{Categorical}(\pi)$
(b) Draw each coordinate $x_{ij} \sim \mathrm{Binomial}(\rho_{z_i j})$

3.1 Generative Model


Write the joint probability $p(\pi, z, x, \rho\,|\,\alpha, \beta, \gamma)$ for a set of $N$ vectors $\{x_1, \ldots, x_N\}$ drawn from this distribution.
Hint: the Beta and Dirichlet distributions are given by

$$\mathrm{Dirichlet}(\pi|\alpha) = \frac{\Gamma(K\alpha)}{\Gamma^K(\alpha)} \prod_{k=1}^{K} \pi_k^{\alpha-1}
\qquad\text{and}\qquad
\mathrm{Beta}(\rho|\beta, \gamma) = \frac{\Gamma(\beta+\gamma)}{\Gamma(\beta)\Gamma(\gamma)}\, \rho^{\beta-1} (1-\rho)^{\gamma-1}$$

Solution:

$$p(\pi, z, x, \rho\,|\,\alpha, \beta, \gamma) = p(\pi|\alpha)\, \prod_{i=1}^{N} p(z_i|\pi) \prod_{j=1}^{d} p(x_{ij}|\rho_{z_i j})\; \prod_{k=1}^{K}\prod_{j=1}^{d} p(\rho_{kj}|\beta, \gamma)$$

$$= \left(\frac{\Gamma(\beta+\gamma)}{\Gamma(\beta)\Gamma(\gamma)}\right)^{\!Kd} \frac{\Gamma(K\alpha)}{\Gamma^K(\alpha)} \prod_{k=1}^{K} \pi_k^{\alpha-1}\, \prod_{i=1}^{N} \pi_{z_i} \prod_{j=1}^{d} \rho_{z_i j}^{x_{ij}} (1-\rho_{z_i j})^{1-x_{ij}}\; \prod_{k=1}^{K}\prod_{j=1}^{d} \rho_{kj}^{\beta-1} (1-\rho_{kj})^{\gamma-1}$$

$$= \left(\frac{\Gamma(\beta+\gamma)}{\Gamma(\beta)\Gamma(\gamma)}\right)^{\!Kd} \frac{\Gamma(K\alpha)}{\Gamma^K(\alpha)} \prod_{k=1}^{K} \pi_k^{n_k+\alpha-1} \prod_{j=1}^{d} \rho_{kj}^{\beta-1+m_{kj}} (1-\rho_{kj})^{\gamma-1+n_k-m_{kj}}$$

The last step follows if we define $n_k = \sum_{i=1}^{N} \delta(z_i = k)$ and $m_{kj} = \sum_{i=1}^{N} \delta(z_i = k)\, x_{ij}$.

3.2 Inference
1. Explain why it is nontrivial to infer the posterior π, z, ρ|α, β, γ, x directly.
2. Suggest at least one alternative approach to Expectation Maximization for solving this problem.

Solution:
3.2.1: There is no closed-form expression for the normalizer $p(x|\alpha, \beta, \gamma)$; computing it requires summing over all $K^N$ assignments of $z$ (and integrating over $\pi$ and $\rho$).
3.2.2: Gibbs sampling or any other MCMC method.

5
Machine Learning 10-701 Final Exam May 5, 2015

3.3 Expectation Maximization Algorithm


Use the variational inequality

$$\log p(a) \ge \mathbb{E}_{b \sim q(b)}\big[\log p(a, b)\big] + H[q]$$

to derive a lower bound on p(x, π, ρ|α, β, γ). Hint — you need to make an approximation for the distribution
over z when obtaining the variational distribution.

Solution:
Set $a = (x, \pi, \rho)$ and $b = z$. Then

$$\log p(x, \pi, \rho\,|\,\alpha, \beta, \gamma) = \log \sum_{z} p(x, \pi, z, \rho\,|\,\alpha, \beta, \gamma) \ge \mathbb{E}_{z \sim q(z)}\big[\log p(x, \pi, z, \rho\,|\,\alpha, \beta, \gamma)\big] + H[q]$$

$$= \mathbb{E}_{z \sim q(z)}\!\left[\sum_{k=1}^{K} \Big( (n_k + \alpha - 1)\log \pi_k + \sum_{j=1}^{d} \big( (\beta - 1 + m_{kj})\log \rho_{kj} + (\gamma - 1 + n_k - m_{kj})\log(1 - \rho_{kj}) \big) \Big) \right] + H[q] + \text{constant}$$

3.4 Expectation Step


Compute $p(z_i = k\,|\,x_i, \pi, \rho)$.

Solution:

$$p(z_i = k\,|\,x_i, \pi, \rho) = \frac{p(z_i = k|\pi) \prod_{j=1}^{d} p(x_{ij}|\rho_{kj})}{\sum_{k'=1}^{K} p(z_i = k'|\pi) \prod_{j=1}^{d} p(x_{ij}|\rho_{k'j})}
\;\propto\; \pi_k \prod_{j=1}^{d} \rho_{kj}^{x_{ij}} (1 - \rho_{kj})^{1 - x_{ij}}$$
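A minimal NumPy sketch of this E-step, computed in log space for numerical stability (the function and variable names are illustrative assumptions):

```python
import numpy as np

def e_step(X, pi, rho):
    """Responsibilities q_i(k) = p(z_i = k | x_i, pi, rho) for binary data X (N x d).
    pi has shape (K,), rho has shape (K, d)."""
    # sum_j x_ij*log(rho_kj) + (1 - x_ij)*log(1 - rho_kj) for every (i, k)
    log_lik = X @ np.log(rho).T + (1 - X) @ np.log(1 - rho).T   # (N, K)
    log_post = np.log(pi) + log_lik
    log_post -= log_post.max(axis=1, keepdims=True)             # stabilize before exponentiating
    q = np.exp(log_post)
    return q / q.sum(axis=1, keepdims=True)
```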

3.5 Maximization step


Using $q_i(k) = p(z_i = k\,|\,x_i, \pi, \rho)$, compute the estimates for $\pi$ and $\rho$. Hint — if you didn't manage to derive
the explicit expression for $q_i(z_i)$, you may plug in the term symbolically and give the update in terms of the
variational distribution $q$ only.

6
Machine Learning 10-701 Final Exam May 5, 2015

Solution:

• $\pi$:

$$\hat\pi_k = \frac{\sum_i q_i(k) + \alpha - 1}{\sum_{i,k'} q_i(k') + K(\alpha - 1)}$$

• $\rho$: Recall from 3.3 that

$$\log p(x, \pi, \rho\,|\,\alpha, \beta, \gamma) \ge \mathbb{E}_{z \sim q(z)}\!\left[\sum_{k=1}^{K} \Big( (n_k + \alpha - 1)\log \pi_k + \sum_{j=1}^{d} \big( (\beta - 1 + m_{kj})\log \rho_{kj} + (\gamma - 1 + n_k - m_{kj})\log(1 - \rho_{kj}) \big) \Big) \right] + H[q] + \text{constant}$$

Maximizing this lower bound leads to

$$\hat\rho_{kj} = \frac{\beta + \sum_i q_i(k)\, x_{ij} - 1}{\beta + \gamma + \sum_i q_i(k) - 2}$$
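A matching NumPy sketch of these M-step updates (again, the names are illustrative assumptions):

```python
import numpy as np

def m_step(X, q, alpha, beta, gamma):
    """MAP updates for pi and rho given responsibilities q (N x K) and binary data X (N x d)."""
    N, d = X.shape
    K = q.shape[1]
    Nk = q.sum(axis=0)                                   # sum_i q_i(k); note Nk.sum() == N
    pi = (Nk + alpha - 1) / (N + K * (alpha - 1))
    rho = (beta + q.T @ X - 1) / (beta + gamma + Nk[:, None] - 2)
    return pi, rho
```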


4 Graphical Models [10 points]


4.1 Independence [5 points]
Consider the following graphical model:

[Figure: a Bayesian network over nodes C, D, E, F (top row), A, B, G (middle row), and H, I, J, K (bottom row); the edge structure is shown in the original figure.]

Check whether each of the following independence statements holds. Provide a brief reason for your answer. Alterna-
tively, give a path for the Bayes ball algorithm as illustration.

• J ⊥ E|G

• J ⊥ E|K, G
• J ⊥ E|B
• J ⊥ E|A
• J ⊥ E|A, H

Solution:

• J ⊥ E|G: No, explaining away at I


• J ⊥ E|K, G: Yes, Interaction block
• J ⊥ E|B: No, explaining away

• J ⊥ E|A: No, flow of influence


• J ⊥ E|H, A: Yes, observation block


4.2 Relevant Random Variables [5 points]


Consider the following graphical model:

[Plate diagram (LDA-style): $\alpha \to \theta_d \to z_{di} \to w_{di} \leftarrow \psi_k \leftarrow H$, with a plate over all $i$ nested inside a plate over all $d$, and a separate plate over all $k$.]

1. For the given graphical model as it is, what is the minimal set of random variables that affects $z_{di} = k$
for a given $(d, i)$ pair?
2. Suppose we integrate out the random variables $\theta_d$ and $\psi_k$. Which nodes affect $z_{di} = k$ for a given $(d, i)$
pair now?

Solution:

• Depends on $w_{di}$, $\theta_{dk}$, and $\psi_k$.

• Depends on $\{w_{d'i'} : z_{d'i'} = k\}$.
Markov blanket!


5 Regression with Gaussian Prior [15 points]


Assume that we have some parameter $w \sim \mathcal{N}(w_0, \sigma^2 \mathbf{1})$ drawn from a $d$-dimensional Normal distribution.
Moreover, assume that we perform regression with additive Normal noise $\xi$. That is, assume that we have

$$y = \langle w, \phi(x) \rangle + \xi \quad\text{where}\quad \xi \sim \mathcal{N}(0, \tau^2),\; w \sim \mathcal{N}(w_0, \sigma^2 \mathbf{1}),\; \text{and } \phi(x) \in \mathbb{R}^d.$$

5.1 Gaussian Process


Prove that $y|x$ is drawn from a Gaussian Process. That is, show that $y|x$ is normal with a suitable mean
function $\mu(x)$ and covariance kernel $k(x, x')$.
Solution:
$y$ is a linear combination of $w$ and $\xi$, each of which is Normal, so $y|x$ is jointly Normal for any finite set of inputs.

$$\mu(x) = \mathbb{E}[y] = \mathbb{E}[\langle w, \phi(x) \rangle + \xi] = \langle w_0, \phi(x) \rangle + 0 = \langle w_0, \phi(x) \rangle$$

$$\begin{aligned}
k(x, x') &= \mathbb{E}\big[(y - \langle w_0, \phi(x) \rangle)(y' - \langle w_0, \phi(x') \rangle)\big] \\
&= \mathbb{E}\big[(\langle w, \phi(x) \rangle + \xi - \langle w_0, \phi(x) \rangle)(\langle w, \phi(x') \rangle + \xi - \langle w_0, \phi(x') \rangle)\big] \\
&= \mathbb{E}\big[(\langle w - w_0, \phi(x) \rangle + \xi)(\langle w - w_0, \phi(x') \rangle + \xi)\big] \\
&= \mathbb{E}\big[\langle w - w_0, \phi(x) \rangle \langle w - w_0, \phi(x') \rangle\big] + \mathbb{E}[\xi]\,\mathbb{E}[\langle w - w_0, \phi(x) \rangle] + \mathbb{E}[\xi]\,\mathbb{E}[\langle w - w_0, \phi(x') \rangle] + \mathbb{E}[\xi^2] \\
&= \phi(x)^\top \mathbb{E}[(w - w_0)(w - w_0)^\top]\, \phi(x') + \tau^2 \\
&= \sigma^2 \phi(x)^\top \phi(x') + \tau^2 \\
&= \sigma^2 \langle \phi(x), \phi(x') \rangle + \tau^2
\end{aligned}$$

5.2 Kernels
What is the difference between a kernel obtained from the feature map $\phi(x)$ described above and a kernel
such as $k(x, x') = \exp(-\frac{1}{2}\|x - x'\|^2)$? Can you characterize the type of functions that can be approximated?
Hint: what does the rank of the kernel matrix tell you?
Solution:
With $k(x, x') = \exp(-\frac{1}{2}\|x - x'\|^2)$ the data points are mapped into an infinite-dimensional feature space, so the kernel
matrix can have full rank. The kernel obtained from the $d$-dimensional feature map $\phi(x)$ has rank at most $d$, so it can only
represent functions in the span of the features, e.g. polynomials of order $d$ if $\phi$ consists of polynomial features.

5.3 Posterior Distribution


Assume that we observe $m$ pairs $(x_i, y_i)$. Show that the posterior distribution is normal:

$$p\big(w \,|\, (x_1, y_1), \ldots, (x_m, y_m), \tau^2, \sigma^2\big)$$


Solution:

$$p\big(w \,|\, (x_1, y_1), \ldots, (x_m, y_m), \tau^2, \sigma^2\big) \;\propto\; p\big(w, (x_1, y_1), \ldots, (x_m, y_m) \,|\, \tau^2, \sigma^2\big) = p(w|\sigma^2) \prod_{i} p(y_i \,|\, x_i, w, \tau^2)$$

Each factor is Gaussian in $w$ (an exponential of a quadratic in $w$), hence the product, and therefore the posterior, is Gaussian.


5.4 Mean and Covariance



Compute the mean and covariance of the posterior distribution $p\big(w \,|\, (x_1, y_1), \ldots, (x_m, y_m), \tau^2, \sigma^2\big)$.
Solution:

$$\mathbb{E}\big[w \,|\, (x_1, y_1), \ldots, (x_m, y_m), \tau^2, \sigma^2\big] = w_0 + \sigma^2 \Phi^\top (\sigma^2 \Phi\Phi^\top + \tau^2 \mathbf{1})^{-1} (y - \Phi w_0)$$

$$\mathrm{Cov}\big[w \,|\, (x_1, y_1), \ldots, (x_m, y_m), \tau^2, \sigma^2\big] = \sigma^2 \mathbf{1} - \sigma^4 \Phi^\top (\sigma^2 \Phi\Phi^\top + \tau^2 \mathbf{1})^{-1} \Phi$$

where $\Phi = [\phi(x_1)^\top; \phi(x_2)^\top; \ldots; \phi(x_m)^\top]$ stacks the feature vectors as rows and $y = [y_1, \ldots, y_m]^\top$.

5.5 Prediction
Assume that we want to estimate $y'|x'$ using the previous observations, i.e. we regress on

$$p\big(y' \,|\, x', (x_1, y_1), \ldots, (x_m, y_m), \tau^2, \sigma^2\big)$$

Derive the mode of $p(y' \,|\, x', \text{rest})$.

Solution:
The mode of the posterior predictive $p(y' \,|\, x', \text{rest})$ is simply

$$\hat y' = \phi(x')^\top \big( w_0 + \sigma^2 \Phi^\top (\sigma^2 \Phi\Phi^\top + \tau^2 \mathbf{1})^{-1} (y - \Phi w_0) \big)$$
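A NumPy sketch of the closed-form posterior and predictive mode above; the function and argument names are illustrative assumptions:

```python
import numpy as np

def posterior_and_predict(Phi, y, w0, sigma2, tau2, phi_new):
    """Posterior mean/covariance of w and the predictive mode for a new feature vector phi_new.
    Phi is (m, d) with rows phi(x_i); y is (m,)."""
    m, d = Phi.shape
    K = sigma2 * Phi @ Phi.T + tau2 * np.eye(m)           # sigma^2 Phi Phi^T + tau^2 I
    A = np.linalg.solve(K, y - Phi @ w0)                  # K^{-1} (y - Phi w0)
    w_mean = w0 + sigma2 * Phi.T @ A
    w_cov = sigma2 * np.eye(d) - sigma2**2 * Phi.T @ np.linalg.solve(K, Phi)
    y_hat = phi_new @ w_mean                              # predictive mode phi(x')^T E[w | data]
    return w_mean, w_cov, y_hat
```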


6 Matrix Factorization [10 points]


Recommendations can be generated by a wide range of algorithms. While user-based or item-based simi-
larity methods are simple and intuitive, matrix factorization techniques are usually more effective because
they allow us to discover the latent features underlying the interactions between users and items.
That is, for observed ratings $r_{um}$ for a given $(u, m)$ pair of a user $u$ and a movie $m$, one typically tries to
estimate the score by

$$f_{um} = \langle v_u, w_m \rangle + b_u + b_m.$$

Here $v_u$ and $w_m$ are vectors in $\mathbb{R}^d$ and $b_u$, $b_m$ are scalars, indicating the bias.

6.1 Bias Updates


Assume that our objective is given by

$$\frac{1}{2} \sum_{u \sim m} (f_{um} - r_{um})^2 + \frac{\lambda}{2} \left[ \sum_{u \in U} \big( b_u^2 + \|v_u\|^2 \big) + \sum_{m \in M} \big( b_m^2 + \|w_m\|^2 \big) \right] \quad\text{where } \lambda > 0.$$

Here U denotes the set of all users, M the set of all movies, and u ∼ m represents the sum over all (u, m)
pairs for which a rating exists. Write the optimal values of bu , provided that all other values are fixed. That
is, compute the optimal value of bu |v, w, bm , r.

Solution:
Set the derivative with respect to a particular user $u_0$ to 0:

$$\sum_{u_0 \sim m} (f_{u_0 m} - r_{u_0 m}) + \lambda b_{u_0} = 0
\qquad\Longrightarrow\qquad
b_{u_0} = \frac{\sum_{u_0 \sim m} (r_{u_0 m} - f_{u_0 m})}{\lambda}$$

where the sum $u_0 \sim m$ runs over the movies rated by $u_0$.
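A small NumPy sketch of this coordinate update; here the stationarity condition is solved with $f_{u_0 m}$ expanded in $b_{u_0}$, so the number of rated movies enters the denominator alongside $\lambda$ (the function and argument names are illustrative assumptions):

```python
import numpy as np

def update_user_bias(ratings_u, v_u, W_rated, b_m_rated, lam):
    """Closed-form update for one user's bias b_u with all other parameters held fixed.

    ratings_u:  ratings r_{um} of the movies this user rated, shape (n,)
    v_u:        latent vector of the user, shape (d,)
    W_rated:    latent vectors of the rated movies, shape (n, d)
    b_m_rated:  biases of the rated movies, shape (n,)
    Solves sum_m (f_um - r_um) + lam * b_u = 0 with f_um written out in b_u.
    """
    residual = ratings_u - W_rated @ v_u - b_m_rated     # r_um - <v_u, w_m> - b_m
    return residual.sum() / (lam + len(ratings_u))
```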

6.2 Optimization
Is the problem jointly convex in $v$ and $w$? You need to prove this rather than just giving a binary answer. It
suffices to prove this for a very simplistic case, say for only 1 user and 1 movie.

Solution:
Since we assume only one user and one movie and ignore the bias terms, the objective reduces to
$O(v, w) = \frac{1}{2}(vw - r_{um})^2$ with scalars $v, w$. Its Hessian is

$$\nabla^2 O = \begin{pmatrix} w^2 & 2vw - r_{um} \\ 2vw - r_{um} & v^2 \end{pmatrix}$$

which is not positive semi-definite in general (for example, at $v = w = 1$ and $r_{um} = 0$ its eigenvalues are $3$ and $-1$).
Therefore the problem is only element-wise convex but not jointly convex in $v$ and $w$.
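A quick numerical check of the eigenvalue claim above (a NumPy sketch under the same simplification):

```python
import numpy as np

def hessian(v, w, r):
    """Hessian of O(v, w) = 0.5 * (v*w - r)**2 for scalar v, w."""
    return np.array([[w**2,      2*v*w - r],
                     [2*v*w - r, v**2     ]])

print(np.linalg.eigvalsh(hessian(1.0, 1.0, 0.0)))   # -> [-1.  3.]: a negative eigenvalue, so not PSD
```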

6.3 Cold Start


1. How could you address the problem of recommending movies to a new user?


Solution:
When a new user comes in, we have little information about them, so matrix factorization cannot learn
many associations between the new user and the existing users. We can use the demographic information of
the user to bridge these associations. Many ideas can be applied here; the most intuitive is perhaps to compute
a similarity between the new user and existing users from the demographic features (e.g. via regression) and
then approximate $v_u$ with a linear combination of existing users' vectors.
Active learning is another option.
2. How could you accomplish this for a new movie?
Solution:
Use metadata of the movie as additional information to encode similarity, perhaps approximating the
corresponding $w_m$ as a linear combination of existing movies' vectors based on their similarity in terms of
metadata. This can also be encoded directly in the objective function.
Active learning is again an option.


7 Weather Prediction [15 points]


The weather in Pittsburgh is notoriously fickle. For simplicity we only consider sun and rain and we assume
that the weather changes once per day. The weather satisfies the following transition probabilities:
• When it rains, the probability of sun the following day is 0.6.
• When the sun shines, the probability of rain the following day is 0.3.

7.1 Transition Matrix [3 points]


Write down the state transition matrix T for Pittsburgh’s weather.

Solution:

With the states ordered as (rain, sun) and $T_{ij} = P(\text{tomorrow} = j \mid \text{today} = i)$:

$$T = \begin{pmatrix} 0.4 & 0.6 \\ 0.3 & 0.7 \end{pmatrix}$$

7.2 Stationary Distribution [5 points]


1. What are the eigenvalues of $T$? Bonus points if you can find the stationary distribution.

Solution:

The first eigenvalue is $1$ with eigenvector $(1, 1)^\top$.
The second eigenvalue is $0.1$ with eigenvector $(1, -0.5)^\top$.
For the stationary distribution, solve the linear system $\bar\pi T = \bar\pi$ subject to $\bar\pi_1 + \bar\pi_2 = 1$.
This gives $P(r) = \frac{1}{3}$ and $P(s) = \frac{2}{3}$.
2. What does this mean for numerically computing the stationary distribution to at least 1% accuracy?

Solution:
With a large finite or countably infinite state space, the standard way of computing the stationary distribution
(solving the linear system) is not possible; computing the PageRank scores of the web is a good example.
We can use power iteration, whose error shrinks by a factor equal to the second eigenvalue (here 0.1) per step,
so very few iterations already give 1% accuracy. Alternatively, MCMC simulates random walks to approximate
the stationary distribution; with fixed parameters the Monte Carlo standard error (MCSE) shrinks with the
sample size, and we can compute a confidence interval for how far our estimate is from the truth.
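A minimal sketch of the power-iteration approach for this chain (NumPy; the number of iterations is an arbitrary illustrative choice):

```python
import numpy as np

T = np.array([[0.4, 0.6],
              [0.3, 0.7]])   # rows: today (rain, sun); columns: tomorrow

pi = np.array([0.5, 0.5])    # any initial distribution works
for _ in range(20):          # power iteration: pi <- pi T
    pi = pi @ T
print(pi)                    # -> approximately [1/3, 2/3]
```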

7.3 Unobserved Days [7 points]


Assume that we observe the weather over a ten day period. In particular, we observe the following:
• The sun shines on the first day.
• It rains on day 5.
• It rains on day 7.
• The sun shines on day 10.

1. What is the probability of sun on day 6? Derive the equation.


Solution:
Given that days 5 and 7 are rainy,

$$P(r_6 \,|\, r_5, r_7) \propto P(r_6|r_5)\, P(r_7|r_6) = 0.4 \cdot 0.4 = 0.16$$

$$P(s_6 \,|\, r_5, r_7) \propto P(s_6|r_5)\, P(r_7|s_6) = 0.6 \cdot 0.3 = 0.18$$

$$P(s_6 \,|\, r_5, r_7) = \frac{0.18}{0.16 + 0.18} = \frac{9}{17}$$
2. What is the most likely weather sequence on days 8 and 9? Derive the equation rather than just stating
the result.

Solution:
Given that day 7 is rainy and day 10 is sunny, enumerate all possible sequences and evaluate their likelihoods:

$$P(r_8, r_9 \,|\, r_7, s_{10}) \propto 0.4 \cdot 0.4 \cdot 0.6 = 0.096$$
$$P(r_8, s_9 \,|\, r_7, s_{10}) \propto 0.4 \cdot 0.6 \cdot 0.7 = 0.168$$
$$P(s_8, r_9 \,|\, r_7, s_{10}) \propto 0.6 \cdot 0.3 \cdot 0.6 = 0.108$$
$$P(s_8, s_9 \,|\, r_7, s_{10}) \propto 0.6 \cdot 0.7 \cdot 0.7 = 0.294$$

Therefore the most likely sequence is $s_8 \to s_9$.
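The enumeration above can be reproduced with a few lines of Python (a sketch; the dictionary-based encoding of the transition probabilities is illustrative):

```python
from itertools import product

# Transition probabilities P(next | today) for rain ('r') and sun ('s')
P = {('r', 'r'): 0.4, ('r', 's'): 0.6,
     ('s', 'r'): 0.3, ('s', 's'): 0.7}

scores = {}
for d8, d9 in product('rs', repeat=2):
    # unnormalized probability of (day 8, day 9) given rain on day 7 and sun on day 10
    scores[(d8, d9)] = P[('r', d8)] * P[(d8, d9)] * P[(d9, 's')]

print(scores)                          # {..., ('s', 's'): 0.294}
print(max(scores, key=scores.get))     # -> ('s', 's')
```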
