Computational Biology Project Report


Computational Biology Report 1

Nathan Raynal-Castang
June 2019

Contents
1 EM algorithm for detecting changepoints

2 EM algorithm for Gaussian mixtures

3 ABO groups and genetics

1 EM algorithm for detecting changepoints


We consider the following conditional probabilities describing our problem:

$$p(y_i \mid z, \theta) = \theta_1^{y_i} (1-\theta_1)^{1-y_i} \qquad \text{for } i = 1, \dots, z-1$$

$$p(y_i \mid z, \theta) = \theta_2^{y_i} (1-\theta_2)^{1-y_i} \qquad \text{for } i = z, \dots, n$$

We see that $p(y_i = 1 \mid z, \theta) = \theta_1$ and $p(y_i = 0 \mid z, \theta) = 1 - \theta_1$ for $i = 1, \dots, z-1$, as well as $p(y_i = 1 \mid z, \theta) = \theta_2$ and $p(y_i = 0 \mid z, \theta) = 1 - \theta_2$ for $i = z, \dots, n$. This matches the description of our problem.

Question 1 For $z = 1$:

Since the $y_i$ are independent of each other,

$$p(y \mid z, \theta) = \prod_{i=1}^{n} p(y_i \mid z, \theta)$$

Since $z = 1$, every $i$ is greater than or equal to $z$, implying

$$p(y \mid z, \theta) = \prod_{i=1}^{n} \theta_2^{y_i} (1-\theta_2)^{1-y_i}$$

For $z > 1$:

Since the $y_i$ are independent of each other,

$$p(y \mid z, \theta) = \prod_{i=1}^{n} p(y_i \mid z, \theta)$$

Following the definition of $p(y_i \mid z, \theta)$ we get

$$p(y \mid z, \theta) = \prod_{i=1}^{z-1} \theta_1^{y_i} (1-\theta_1)^{1-y_i} \prod_{i=z}^{n} \theta_2^{y_i} (1-\theta_2)^{1-y_i}$$

Question 2 We will now compute $\log(p(y \mid z, \theta))$ for $z > 1$.

Using the result of question 1 and the properties of logarithms and sums, we get the following result:

\begin{align*}
\log(p(y \mid z, \theta)) &= \log\!\left( \prod_{i=1}^{z-1} \theta_1^{y_i} (1-\theta_1)^{1-y_i} \prod_{i=z}^{n} \theta_2^{y_i} (1-\theta_2)^{1-y_i} \right) \\
&= \sum_{i=1}^{z-1} \log\!\left( \theta_1^{y_i} (1-\theta_1)^{1-y_i} \right) + \sum_{i=z}^{n} \log\!\left( \theta_2^{y_i} (1-\theta_2)^{1-y_i} \right) \\
&= \log(\theta_1) \sum_{i=1}^{z-1} y_i + \log(1-\theta_1) \sum_{i=1}^{z-1} (1-y_i) + \log(\theta_2) \sum_{i=z}^{n} y_i + \log(1-\theta_2) \sum_{i=z}^{n} (1-y_i) \\
&= \log(\theta_1) \sum_{i=1}^{z-1} y_i + \log(1-\theta_1)\!\left( z-1 - \sum_{i=1}^{z-1} y_i \right) + \log(\theta_2) \sum_{i=z}^{n} y_i + \log(1-\theta_2)\!\left( n-z+1 - \sum_{i=z}^{n} y_i \right)
\end{align*}
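As a quick sanity check of this expression, here is a minimal Python sketch that evaluates $\log p(y \mid z, \theta)$ directly from the data; the function name and interface are my own choices for illustration, not part of the assignment.

```python
import numpy as np

def log_likelihood(y, z, theta1, theta2):
    """log p(y | z, theta) for a given changepoint z (1-based), as derived above."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    s1 = y[: z - 1].sum()           # sum_{i=1}^{z-1} y_i
    s2 = y[z - 1 :].sum()           # sum_{i=z}^{n} y_i
    return (np.log(theta1) * s1 + np.log(1 - theta1) * ((z - 1) - s1)
            + np.log(theta2) * s2 + np.log(1 - theta2) * ((n - z + 1) - s2))
```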

Question 3 We suppose $\theta_1$ and $\theta_2$ known and we now define the following quantity:

$$R(z) = \frac{p(z \mid y, \theta)}{p(z = 1 \mid y, \theta)}$$

We show here that

$$R(z) = \prod_{j=1}^{z-1} \frac{p(y_j \mid \theta_1, z)}{p(y_j \mid \theta_2, z = 1)}$$

A first useful remark is that $y$ and $z$ are not independent: indeed, $p(y \mid z = 1) \neq p(y \mid z = 2)$ if $\theta_1 \neq \theta_2$.

Using the chain rule we compute

$$p(z, y \mid \theta) = \frac{p(z, y, \theta)}{p(\theta)} = \frac{p(z \mid y, \theta)\, p(y \mid \theta)\, p(\theta)}{p(\theta)} = p(z \mid y, \theta)\, p(y \mid \theta)$$

We deduce from it that

\begin{align*}
R(z) &= \frac{p(z \mid y, \theta)}{p(z = 1 \mid y, \theta)} \\
&= \frac{p(z, y \mid \theta)}{p(z = 1 \mid y, \theta)\, p(y \mid \theta)} \\
&= \frac{p(z, y \mid \theta)}{\frac{p(z = 1, y \mid \theta)}{p(y \mid \theta)}\, p(y \mid \theta)} \\
&= \frac{p(z, y \mid \theta)}{p(z = 1, y \mid \theta)}
\end{align*}

Using the chain rule again we get

\begin{align*}
R(z) &= \frac{p(z, y \mid \theta)}{p(z = 1, y \mid \theta)} \\
&= \frac{p(y \mid z, \theta)\, p(z \mid \theta)}{p(y \mid z = 1, \theta)\, p(z = 1 \mid \theta)}
\end{align*}

Using the independence of $z$ and $\theta$ we get

$$p(z = 1 \mid \theta) = p(z = 1) \qquad \text{and} \qquad p(z \mid \theta) = p(z)$$

Since we suppose the prior distribution on $z$ uniform, we have

$$p(z = 1) = p(z) = \frac{1}{n}$$

We deduce that

$$R(z) = \frac{p(y \mid z, \theta)}{p(y \mid z = 1, \theta)}$$

We can now use the expressions of those two quantities computed in question 1:

\begin{align*}
R(z) &= \frac{p(y \mid z, \theta)}{p(y \mid z = 1, \theta)} \\
&= \frac{\prod_{i=1}^{z-1} \theta_1^{y_i} (1-\theta_1)^{1-y_i} \prod_{i=z}^{n} \theta_2^{y_i} (1-\theta_2)^{1-y_i}}{\prod_{i=1}^{n} \theta_2^{y_i} (1-\theta_2)^{1-y_i}} \\
&= \frac{\prod_{i=1}^{z-1} \theta_1^{y_i} (1-\theta_1)^{1-y_i}}{\prod_{i=1}^{z-1} \theta_2^{y_i} (1-\theta_2)^{1-y_i}} \\
&= \prod_{i=1}^{z-1} \frac{p(y_i \mid \theta_1, z)}{p(y_i \mid \theta_2, z = 1)}
\end{align*}

This is the desired result.

Question 4 In this question we aim to establish a relationship between $R(z)$ and $R(z-1)$ for all $z = 2, \dots, n$.

Let's first write down $R(z)$ and $R(z-1)$ using the result of the previous question:

$$R(z) = \prod_{i=1}^{z-1} \frac{p(y_i \mid \theta_1, z)}{p(y_i \mid \theta_2, z = 1)}$$

$$R(z-1) = \prod_{i=1}^{z-2} \frac{p(y_i \mid \theta_1, z-1)}{p(y_i \mid \theta_2, z = 1)}$$

We can now write down the relations between the quantities appearing in these equations:

$$p(y_i \mid \theta_1, z-1) = p(y_i \mid \theta_1, z) \quad \text{for } i \neq z-1, \qquad p(y_i \mid \theta_1, z) = \theta_1^{y_i} (1-\theta_1)^{1-y_i} \quad \text{for } i = z-1$$

We can then compute:

\begin{align*}
R(z) &= \frac{p(y_{z-1} \mid \theta_1, z)}{p(y_{z-1} \mid \theta_2, z = 1)} \prod_{i=1}^{z-2} \frac{p(y_i \mid \theta_1, z)}{p(y_i \mid \theta_2, z = 1)} \\
&= \frac{p(y_{z-1} \mid \theta_1, z)}{p(y_{z-1} \mid \theta_2, z = 1)} \prod_{i=1}^{z-2} \frac{p(y_i \mid \theta_1, z-1)}{p(y_i \mid \theta_2, z = 1)} \\
&= \frac{p(y_{z-1} \mid \theta_1, z)}{p(y_{z-1} \mid \theta_2, z = 1)} \, R(z-1)
\end{align*}

Using the definition of those conditional probabilities we get:

$$R(z) = R(z-1) \, \frac{\theta_1^{y_{z-1}} (1-\theta_1)^{1-y_{z-1}}}{\theta_2^{y_{z-1}} (1-\theta_2)^{1-y_{z-1}}}$$

We now have a recursion formula to compute $R(z)$, and from the definition of $R(z)$ we have $R(1) = 1$. We can then deduce the following formula to compute $R(z)$.

If $z > 1$:

$$R(z) = \prod_{i=2}^{z} \frac{\theta_1^{y_{i-1}} (1-\theta_1)^{1-y_{i-1}}}{\theta_2^{y_{i-1}} (1-\theta_2)^{1-y_{i-1}}}$$

If $z = 1$: $R(1) = 1$.
Using the definition of $R(z)$ we know that

$$p(z \mid y, \theta) = R(z)\, p(z = 1 \mid y, \theta)$$

We also know that

$$\sum_{i=1}^{n} p(z = i \mid y, \theta) = 1$$

which implies that

$$\sum_{i=1}^{n} R(i)\, p(z = 1 \mid y, \theta) = 1$$

We can finally deduce that

$$p(z = 1 \mid y, \theta) = \frac{1}{\sum_{i=1}^{n} R(i)}$$

We now have all the pieces to compute $p(z \mid y, \theta)$:

$$p(z \mid y, \theta) = \frac{R(z)}{\sum_{i=1}^{n} R(i)}$$
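The following Python sketch puts the recursion and the normalisation together. The function name and the use of NumPy are my own choices; for long sequences one would work in log space to avoid underflow.

```python
import numpy as np

def changepoint_posterior(y, theta1, theta2):
    """Posterior p(z | y, theta) over changepoints z = 1, ..., n.

    Assumes a uniform prior on z and uses the recursion
    R(z) = R(z-1) * p(y_{z-1} | theta1) / p(y_{z-1} | theta2), with R(1) = 1,
    then normalises: p(z | y, theta) = R(z) / sum_i R(i).
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    R = np.empty(n)
    R[0] = 1.0                          # R(1) = 1
    for z in range(1, n):               # 0-based index z corresponds to changepoint z + 1
        yi = y[z - 1]                   # observation y_{z-1} in the 1-based notation
        num = theta1 ** yi * (1 - theta1) ** (1 - yi)
        den = theta2 ** yi * (1 - theta2) ** (1 - yi)
        R[z] = R[z - 1] * num / den
    return R / R.sum()
```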

5
Question 5 In this question, we will compute

$$E_1 = E\!\left[ \sum_{i=1}^{z-1} y_i \,\Big|\, y, \theta^0 \right]$$

In $E_1$ the only random variable is $z$. We can then define $\phi_1(z)$ as a function of this random variable:

$$\phi_1(z) = \sum_{i=1}^{z-1} y_i$$

$$E_1 = E[\phi_1(z) \mid y, \theta^0] = \sum_{z=1}^{n} \phi_1(z)\, p(z \mid y, \theta^0)$$

From this result we can easily compute $E_2$:

$$\phi_2(z) = z - 1 - \sum_{i=1}^{z-1} y_i$$

$$E_2 = E[\phi_2(z) \mid y, \theta^0] = \sum_{z=1}^{n} \phi_2(z)\, p(z \mid y, \theta^0)$$

In the same fashion as for $E_1$, we can compute $E_3$. Let's define

$$\phi_3(z) = \sum_{i=z}^{n} y_i$$

$$E_3 = E[\phi_3(z) \mid y, \theta^0] = \sum_{z=1}^{n} \phi_3(z)\, p(z \mid y, \theta^0)$$

We finally compute $E_4$:

$$\phi_4(z) = n - z + 1 - \sum_{i=z}^{n} y_i$$

$$E_4 = E[\phi_4(z) \mid y, \theta^0] = \sum_{z=1}^{n} \phi_4(z)\, p(z \mid y, \theta^0)$$

Let's now compute the complexity of computing $E_1$. For a fixed $z$,

$$\phi_1(z) = \sum_{i=1}^{z-1} y_i$$

can be computed in $O(n)$. Using the recursion formula for $R(z)$ we can compute $R(i+1)$ from $R(i)$ with one product, implying that

$$p(z \mid y, \theta^0) = \frac{R(z)}{\sum_{i=1}^{n} R(i)}$$

can also be computed in $O(n)$. It follows that one term $\phi_1(z)\, p(z \mid y, \theta^0)$ can be computed in $O(n)$. We must sum $n$ such terms, which naively gives a complexity of $O(n^2)$. However, we can precompute $p(z \mid y, \theta^0)$ and $\phi_1(z)$ for every $z$ before computing $E_1$; those precomputations are in $O(n)$ (using prefix sums for $\phi_1$). Once we have these values, the sum of all $\phi_1(z)\, p(z \mid y, \theta^0)$ can be computed in $O(n)$. We then have $O(n) + O(n) = O(n)$. Finally, $E_1$ can be computed in $O(n)$.
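A possible vectorised implementation of this $O(n)$ scheme follows. It is a sketch with assumed names, using prefix sums to obtain all $\phi_k(z)$ at once.

```python
import numpy as np

def expected_sufficient_stats(y, posterior):
    """Compute E1, E2, E3, E4 in O(n).

    posterior[z-1] holds p(z | y, theta0).  Prefix sums give phi_1(z), ...,
    phi_4(z) for every z at once, so each expectation is a single dot product.
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    z = np.arange(1, n + 1)
    csum = np.cumsum(y)
    phi1 = np.concatenate(([0.0], csum[:-1]))   # sum_{i=1}^{z-1} y_i
    phi2 = (z - 1) - phi1                       # sum_{i=1}^{z-1} (1 - y_i)
    phi3 = y.sum() - phi1                       # sum_{i=z}^{n} y_i
    phi4 = (n - z + 1) - phi3                   # sum_{i=z}^{n} (1 - y_i)
    return tuple(np.dot(phi, posterior) for phi in (phi1, phi2, phi3, phi4))
```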

Question 6 Using Bayes' formula we compute:

\begin{align*}
p(y, z \mid \theta) &= p(y \mid z, \theta)\, p(z \mid \theta) \\
&= p(y \mid z, \theta)\, p(z) \\
&= \frac{1}{n}\, p(y \mid z, \theta)
\end{align*}

We then have:

\begin{align*}
Q(\theta, \theta^0) &= E\!\left( \log\!\left( \tfrac{1}{n}\, p(y \mid z, \theta) \right) \,\Big|\, y, \theta^0 \right) \\
&= E\!\left( \log\!\left( \tfrac{1}{n} \right) + \log(p(y \mid z, \theta)) \,\Big|\, y, \theta^0 \right) \\
&= -\log(n) + E\!\left( \log(p(y \mid z, \theta)) \mid y, \theta^0 \right)
\end{align*}

We can now compute $E(\log(p(y \mid z, \theta)) \mid y, \theta^0)$ using the results of questions 2 and 5:

\begin{align*}
E(\log(p(y \mid z, \theta)) \mid y, \theta^0) &= \log(\theta_1) E_1 + \log(1-\theta_1) E_2 + \log(\theta_2) E_3 + \log(1-\theta_2) E_4 \\
&= (\log(\theta_1) - \log(1-\theta_1)) E_1 + \log(1-\theta_1)\, E[z-1 \mid y, \theta^0] \\
&\quad + (\log(\theta_2) - \log(1-\theta_2)) E_3 + \log(1-\theta_2)\, E[n-z+1 \mid y, \theta^0]
\end{align*}

A local maximum is reached where

$$\nabla_\theta\, E(\log(p(y \mid z, \theta)) \mid y, \theta^0) = (0, 0)$$

that is,

\begin{align*}
\frac{E_1}{\theta_1} - \frac{E_2}{1-\theta_1} &= 0 \\
\frac{E_3}{\theta_2} - \frac{E_4}{1-\theta_2} &= 0
\end{align*}

The solutions to these equations are

$$\theta_1 = \frac{E_1}{E_1 + E_2}, \qquad \theta_2 = \frac{E_3}{E_3 + E_4}$$
We can now describe the full EM algorithm. First, choose an initial $\theta^0$ at random; the vector $y$ is known. We then compute $E_1$, $E_2$, $E_3$ and $E_4$ using this $\theta^0$. We use those values to compute the argument $\theta$ maximising $E(\log(p(y \mid z, \theta)) \mid y, \theta^0)$. We take this value as the new parameter, call it $\theta^1$, and we iterate the process until some terminating condition (to be chosen) is reached.
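A compact sketch of the whole loop follows, reusing the hypothetical helper functions from the earlier sketches (changepoint_posterior and expected_sufficient_stats); the starting point, the tolerance and the stopping rule are arbitrary choices, not prescribed by the assignment.

```python
def em_changepoint(y, theta0=(0.3, 0.7), max_iter=100, tol=1e-8):
    """EM for the changepoint model, combining the two sketches above."""
    theta1, theta2 = theta0
    for _ in range(max_iter):
        post = changepoint_posterior(y, theta1, theta2)       # E-step
        E1, E2, E3, E4 = expected_sufficient_stats(y, post)
        new1, new2 = E1 / (E1 + E2), E3 / (E3 + E4)           # M-step
        converged = abs(new1 - theta1) + abs(new2 - theta2) < tol
        theta1, theta2 = new1, new2
        if converged:
            break
    return theta1, theta2, changepoint_posterior(y, theta1, theta2)
```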

Question 7 Once we are confident in the result of the EM algorithm, we can use the computed parameters to estimate $p(z \mid y, \theta)$ for every $z$. We then obtain an empirical probability distribution for $Z \mid Y$. From this we can compute an estimator of the expectation of $Z \mid Y$ and a confidence interval (https://en.wikipedia.org/wiki/Confidence_interval).

2 EM algorithm for Gaussian mixtures


Question 1 Knowing that $p(z_i = 1) = 1/2$:

\begin{align*}
p(y_i \mid \theta) &= \frac{1}{2}\, p_0(y_i \mid \theta) + \frac{1}{2}\, p_1(y_i \mid \theta) \\
&= p(z_i = 0)\, p(y_i \mid z_i = 0, \theta) + p(z_i = 1)\, p(y_i \mid z_i = 1, \theta)
\end{align*}

Question 3 Using Bayes' rule we have

$$p(y, z \mid \theta) = p(y \mid z, \theta)\, p(z \mid \theta)$$

As the $z_i$ are independent variables, we have

$$p(y, z \mid \theta) = p(y \mid z, \theta) \prod_{i=1}^{n} p(z_i \mid \theta) = \frac{p(y \mid z, \theta)}{2^n}$$

Considering the independence of the data, and taking the logarithm, we get

\begin{align*}
\log(p(y \mid z, \theta)) &= \sum_{i=1}^{n} \left[ z_i \log\!\left( \tfrac{1}{\sqrt{2\pi}} \right) + (1-z_i) \log\!\left( \tfrac{1}{\sqrt{2\pi}} \right) - z_i \frac{(y_i - m_1)^2}{2} - (1-z_i) \frac{(y_i - m_0)^2}{2} \right] \\
&= n \log\!\left( \tfrac{1}{\sqrt{2\pi}} \right) - \frac{1}{2} \sum_{i=1}^{n} \left[ z_i (y_i - m_1)^2 + (1-z_i)(y_i - m_0)^2 \right]
\end{align*}

We conclude that

$$\log(p(y, z \mid \theta)) = \log(p(y \mid z, \theta)) - \log(2^n) = C_n - \frac{1}{2} \sum_{i=1}^{n} \left[ z_i (y_i - m_1)^2 + (1-z_i)(y_i - m_0)^2 \right]$$

with $C_n = n \log\!\left( \tfrac{1}{\sqrt{2\pi}} \right) - \log(2^n)$.

Question 4 We can see that the sum in the above expression is a bivariate quadratic function. Let's consider a general bivariate quadratic function of the form

$$A m_0^2 + B m_1^2 + C m_0 + D m_1 + E m_0 m_1 + F$$

We can expand our expression to make the coefficients $A, B, C, D, E, F$ explicit:

$$\sum_{i=1}^{n} \left[ (1-z_i)(y_i - m_0)^2 + z_i (y_i - m_1)^2 \right] = \sum_{i=1}^{n} \left[ y_i^2 + 2(z_i - 1) y_i m_0 + (1-z_i) m_0^2 + z_i m_1^2 - 2 y_i z_i m_1 \right]$$
We now have our coefficients:

\begin{align*}
A &= \sum_{i=1}^{n} (1 - z_i) \\
B &= \sum_{i=1}^{n} z_i \\
C &= \sum_{i=1}^{n} 2(z_i - 1) y_i \\
D &= \sum_{i=1}^{n} -2 y_i z_i \\
E &= 0 \\
F &= \sum_{i=1}^{n} y_i^2
\end{align*}

Since $A, B > 0$ and $E = 0$, this quadratic has a unique stationary point, its minimum, which corresponds to the maximum of the log-likelihood. It is reached for:

$$m_0 = \frac{-2BC}{4AB}, \qquad m_1 = \frac{-2AD}{4AB}$$

Let's first compute $m_0^*$:

\begin{align*}
m_0^* &= \frac{-2 \sum_{i=1}^{n} z_i \sum_{i=1}^{n} 2(z_i - 1) y_i}{4 \sum_{i=1}^{n} (1 - z_i) \sum_{i=1}^{n} z_i} \\
&= \frac{\sum_{i=1}^{n} (1 - z_i) y_i}{\sum_{i=1}^{n} (1 - z_i)} \\
&= \frac{\sum_{i=1}^{n} (1 - z_i) y_i}{n - \sum_{i=1}^{n} z_i}
\end{align*}
We now compute $m_1^*$:

\begin{align*}
m_1^* &= \frac{-2 \sum_{i=1}^{n} (1 - z_i) \sum_{i=1}^{n} (-2 y_i z_i)}{4 \sum_{i=1}^{n} (1 - z_i) \sum_{i=1}^{n} z_i} \\
&= \frac{\sum_{i=1}^{n} y_i z_i}{\sum_{i=1}^{n} z_i}
\end{align*}

$(m_0^*, m_1^*)$ is then the argument of the minimum of our bivariate quadratic function, and hence the argument of the maximum of $\log(p(y, z \mid \theta))$.

Question 5 Let's compute $p(z_i = 1 \mid y_i, \theta^0)$. Using Bayes' rule,

$$p(z_i = 1 \mid y_i, \theta^0) = \frac{p(y_i \mid z_i = 1)\, p(z_i = 1)}{p(y_i)}$$

We supposed that $p(z_i = 1) = \tfrac{1}{2}$. Furthermore, we know that

$$p(y_i \mid z_i = 1) = \mathcal{N}(y_i \mid m_1, \sigma_1)$$

and that

$$p(y_i) = p(z_i = 1)\, p(y_i \mid z_i = 1) + p(z_i = 0)\, p(y_i \mid z_i = 0) = \frac{1}{2} \left( \mathcal{N}(y_i \mid m_1, \sigma_1) + \mathcal{N}(y_i \mid m_0, \sigma_0) \right)$$

Using the pdf of those normal laws we find:

$$p(z_i = 1 \mid y_i, \theta^0) = \frac{e^{-\frac{(y_i - m_1)^2}{2}}}{e^{-\frac{(y_i - m_0)^2}{2}} + e^{-\frac{(y_i - m_1)^2}{2}}}$$
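Putting questions 4 and 5 together gives the usual EM iteration for this mixture. A minimal Python sketch follows; the function name, initial means and stopping rule are my own choices, and the M-step plugs the responsibilities in place of the unobserved $z_i$, as is standard for EM.

```python
import numpy as np

def em_two_gaussians(y, m0=0.0, m1=1.0, max_iter=100, tol=1e-8):
    """EM for the two-component mixture with unit variances and p(z_i = 1) = 1/2."""
    y = np.asarray(y, dtype=float)
    for _ in range(max_iter):
        w1 = np.exp(-0.5 * (y - m1) ** 2)        # proportional to N(y_i | m1, 1)
        w0 = np.exp(-0.5 * (y - m0) ** 2)        # proportional to N(y_i | m0, 1)
        gamma = w1 / (w0 + w1)                   # E-step: p(z_i = 1 | y_i, theta0)
        new_m0 = np.sum((1 - gamma) * y) / np.sum(1 - gamma)   # M-step (question 4)
        new_m1 = np.sum(gamma * y) / np.sum(gamma)
        converged = abs(new_m0 - m0) + abs(new_m1 - m1) < tol
        m0, m1 = new_m0, new_m1
        if converged:
            break
    return m0, m1
```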

3 ABO groups and genetics


Question 1 The Hardy-Weinberg principle states that, under some assumptions, the expected genotype frequencies in the diploid case are given by the multinomial expansion over all alleles; in our case it is:

$$(p_A + p_B + p_O)^2 = 1$$

We then have the following expected numbers for each genotype:

\begin{align*}
n_{A/A} &= n p_A^2 & n_{A/B} &= 2 n p_A p_B \\
n_{B/B} &= n p_B^2 & n_{A/O} &= 2 n p_A p_O \\
n_{O/O} &= n p_O^2 & n_{B/O} &= 2 n p_B p_O
\end{align*}

We can deduce from it that

$$n_A = n p_A^2 + 2 n p_A p_O = n (2 p_A p_O + p_A^2)$$

which implies that

$$\frac{n_A}{n_{A/A}} = \frac{n (2 p_A p_O + p_A^2)}{n p_A^2} \;\Leftrightarrow\; n_{A/A} = \frac{p_A^2}{2 p_A p_O + p_A^2}\, n_A$$

Using exactly the same method we can derive equations for $n_{A/O}$ and $n_{B/B}$:

$$n_{A/O} = \frac{2 p_A p_O}{p_A^2 + 2 p_A p_O}\, n_A, \qquad n_{B/B} = \frac{p_B^2}{2 p_B p_O + p_B^2}\, n_B$$

Question 3 Estimator for $p_A$:

\begin{align*}
n_{A/A} &= \frac{p_A^2}{2 p_A p_O + p_A^2}\, n_A \\
\Leftrightarrow\; n_{A/A} &= \frac{p_A^2}{2 p_A p_O + p_A^2}\, n (2 p_A p_O + p_A^2) \\
\Leftrightarrow\; n_{A/A} &= n p_A^2 \\
\Leftrightarrow\; p_A &= \sqrt{\frac{n_{A/A}}{n}}
\end{align*}

Using the same method, we give estimators for $p_O$ and $p_B$ as well:

$$p_O = \frac{n_{A/O}}{2 n p_A}, \qquad p_B = \sqrt{\frac{n_{B/B}}{n}}$$
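For illustration, these estimators are trivial to evaluate when the genotype counts are observed; the short sketch below uses hypothetical argument names. In practice the genotype counts behind phenotype A and B are hidden, which is what motivates the EM algorithm of question 5.

```python
import math

def allele_freq_estimators(n_AA, n_BB, n_AO, n):
    """Closed-form estimators derived above, assuming the genotype counts
    n_AA, n_BB and n_AO were directly observed."""
    p_A = math.sqrt(n_AA / n)
    p_B = math.sqrt(n_BB / n)
    p_O = n_AO / (2 * n * p_A)
    return p_A, p_B, p_O
```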

Question 5 We are now going to write an EM algorithm to compute the allele proportions knowing the phenotypes and assuming the Hardy-Weinberg principle holds. First we clearly define our variables and parameters. Our unknown parameter $\theta$ is composed of the three allele frequencies in the population, namely:

$$\theta = (p_A, p_O, p_B)$$

Our data consists of the phenotypes of the individuals, so our incomplete data is

$$Y = (n_A, n_B, n_{AB}, n_O)$$

Finally, the hidden data consists of the genotype counts underlying the phenotypes; every individual carries a combination of two of the alleles (A, B, O):

$$z = (n_{A/A}, n_{A/O}, n_{B/B}, n_{B/O})$$

The other combinations are known quantities.

To use an EM algorithm we need to be able to compute the following quantity:

\begin{align*}
Q(\theta, \theta^0) &= E[\log P(Y, Z \mid \theta) \mid Y, \theta^0] \\
&= E[\log(C_n) + 2 n_{A/A} \log(p_A) + n_{A/O} \log(2 p_A p_O) + 2 n_{B/B} \log(p_B) + n_{B/O} \log(2 p_B p_O) \\
&\qquad + n_{A/B} \log(2 p_A p_B) + 2 n_O \log(p_O) \mid Y, \theta^0]
\end{align*}

with $C_n$ a constant depending only on $n$. $n_{A/B}$ and $n_O$ are known constants, so

\begin{align*}
E[\log P(Y, Z \mid \theta) \mid Y, \theta^0] &= \log(C_n) + n_{A/B} \log(2 p_A p_B) + 2 n_O \log(p_O) + 2 \log(p_A)\, E[n_{A/A} \mid Y, \theta^0] \\
&\qquad + \log(2 p_A p_O)\, E[n_{A/O} \mid Y, \theta^0] + 2 \log(p_B)\, E[n_{B/B} \mid Y, \theta^0] + \log(2 p_B p_O)\, E[n_{B/O} \mid Y, \theta^0] \\
&= \log(C_n) + n_{A/B} \log(2 p_A p_B) + 2 n_O \log(p_O) + 2 \log(p_A)\, E[n_{A/A} \mid Y = n_A, \theta^0] \\
&\qquad + \log(2 p_A p_O)\, E[n_{A/O} \mid Y = n_A, \theta^0] + 2 \log(p_B)\, E[n_{B/B} \mid Y = n_B, \theta^0] + \log(2 p_B p_O)\, E[n_{B/O} \mid Y = n_B, \theta^0]
\end{align*}

We can compute the conditional distributions appearing in the expectations. They follow binomial laws:

\begin{align*}
(n_{A/A} \mid Y = n_A, \theta^0) &\sim \mathrm{Bin}\!\left( n_A,\; \frac{(p_A^0)^2}{(p_A^0)^2 + 2 p_A^0 p_O^0} \right) \\
(n_{A/O} \mid Y = n_A, \theta^0) &\sim \mathrm{Bin}\!\left( n_A,\; \frac{2 p_A^0 p_O^0}{(p_A^0)^2 + 2 p_A^0 p_O^0} \right) \\
(n_{B/B} \mid Y = n_B, \theta^0) &\sim \mathrm{Bin}\!\left( n_B,\; \frac{(p_B^0)^2}{(p_B^0)^2 + 2 p_B^0 p_O^0} \right) \\
(n_{B/O} \mid Y = n_B, \theta^0) &\sim \mathrm{Bin}\!\left( n_B,\; \frac{2 p_B^0 p_O^0}{(p_B^0)^2 + 2 p_B^0 p_O^0} \right)
\end{align*}

The expectation of a binomial distribution is $np$, so each conditional expectation above is the corresponding phenotype count $n_A$ or $n_B$ times the binomial probability. Writing $n^0_{A/A} = E[n_{A/A} \mid Y = n_A, \theta^0]$, and similarly for the other hidden counts, we can then rewrite:

\begin{align*}
E[\log P(Y, Z \mid \theta) \mid Y, \theta^0] &= \log(C_n) + n_{A/B} \log(2 p_A p_B) + 2 n_O \log(p_O) + 2 \log(p_A)\, n^0_{A/A} \\
&\qquad + \log(2 p_A p_O)\, n^0_{A/O} + 2 \log(p_B)\, n^0_{B/B} + \log(2 p_B p_O)\, n^0_{B/O}
\end{align*}

We now want to maximise $E[\log P(Y, Z \mid \theta) \mid Y, \theta^0] - \log(C_n)$ with respect to $(p_A, p_B, p_O)$:

$$\operatorname*{argmax}_{(p_A, p_B, p_O)} \; E[\log P(Y, Z \mid \theta) \mid Y, \theta^0] - \log(C_n)$$

subject to the constraint $p_A + p_B + p_O = 1$. We use a Lagrange multiplier to find a maximum under this constraint:

$$L(\theta, \lambda) = Q(\theta, \theta^0) - \log(C_n) + \lambda (1 - p_A - p_B - p_O)$$

\begin{align*}
\frac{\partial}{\partial p_A} L(\theta, \lambda) &= \frac{2 n^0_{A/A}}{p_A} + \frac{n^0_{A/O}}{p_A} + \frac{n_{A/B}}{p_A} - \lambda = 0 \\
\frac{\partial}{\partial p_B} L(\theta, \lambda) &= \frac{2 n^0_{B/B}}{p_B} + \frac{n^0_{B/O}}{p_B} + \frac{n_{A/B}}{p_B} - \lambda = 0 \\
\frac{\partial}{\partial p_O} L(\theta, \lambda) &= \frac{n^0_{A/O}}{p_O} + \frac{n^0_{B/O}}{p_O} + \frac{2 n_O}{p_O} - \lambda = 0 \\
\frac{\partial}{\partial \lambda} L(\theta, \lambda) &= 1 - p_A - p_B - p_O = 0
\end{align*}
Solving this system we obtain:

\begin{align*}
p_A &= \frac{2 n^0_{A/A} + n^0_{A/O} + n_{A/B}}{\lambda} \\
p_B &= \frac{2 n^0_{B/B} + n^0_{B/O} + n_{A/B}}{\lambda} \\
p_O &= \frac{2 n_O + n^0_{B/O} + n^0_{A/O}}{\lambda} \\
1 &= p_A + p_B + p_O
\end{align*}

If we sum the first three equations, knowing that $p_A + p_B + p_O = 1$, we end up with

$$\lambda = 2 \left( n^0_{A/A} + n^0_{A/O} + n_{A/B} + n^0_{B/B} + n^0_{B/O} + n_O \right) = 2n$$


Our maximum is then reached for

\begin{align*}
p_A^1 &= \frac{2 n^0_{A/A} + n^0_{A/O} + n_{A/B}}{2n} \\
p_B^1 &= \frac{2 n^0_{B/B} + n^0_{B/O} + n_{A/B}}{2n} \\
p_O^1 &= \frac{2 n_O + n^0_{B/O} + n^0_{A/O}}{2n}
\end{align*}

We now have every step of our EM algorithm; a short code sketch follows the list below.

• Using $p_A^0$, $p_B^0$, $p_O^0$ we compute $n^0_{A/A}$, $n^0_{A/O}$, $n^0_{B/B}$, $n^0_{B/O}$.

• We then use them to compute $p_A^1$, $p_B^1$, $p_O^1$.

• We repeat the process until we reach convergence.
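A minimal Python sketch of these three steps; the function and variable names are my own, and the initialisation and stopping rule are arbitrary choices.

```python
def em_abo(n_A, n_B, n_AB, n_O, p_init=(0.3, 0.3, 0.4), max_iter=200, tol=1e-10):
    """EM for the ABO allele frequencies under Hardy-Weinberg.

    n_A, n_B, n_AB, n_O are the observed phenotype counts; returns (p_A, p_B, p_O).
    """
    n = n_A + n_B + n_AB + n_O
    p_A, p_B, p_O = p_init
    for _ in range(max_iter):
        # E-step: expected genotype counts given the current frequencies
        n_AA = n_A * p_A ** 2 / (p_A ** 2 + 2 * p_A * p_O)
        n_AO = n_A - n_AA
        n_BB = n_B * p_B ** 2 / (p_B ** 2 + 2 * p_B * p_O)
        n_BO = n_B - n_BB
        # M-step: allele counting (each of the n individuals carries two alleles)
        new_A = (2 * n_AA + n_AO + n_AB) / (2 * n)
        new_B = (2 * n_BB + n_BO + n_AB) / (2 * n)
        new_O = (2 * n_O + n_AO + n_BO) / (2 * n)
        converged = abs(new_A - p_A) + abs(new_B - p_B) + abs(new_O - p_O) < tol
        p_A, p_B, p_O = new_A, new_B, new_O
        if converged:
            break
    return p_A, p_B, p_O
```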

