Computational Biology Project Report
Computational Biology Project Report
Computational Biology Project Report
Nathan Raynal-Castang
June 2019
Contents
1 Em algorithm for detecting changepoints 1
question 1 for z = 1 :
1
since z=1, every i is superior or equal to z implying
n
Y
p(y|z, θ) = θ2yi (1 − θ2 )1−yi
i=1
for z > 1 :
z−1
Y n
Y
p(y|z, θ) = θ1yi (1 − θ1 )1−yi θ2yi (1 − θ2 )1−yi
i=1 i=z
Using the result of question 1, properties of logarithms and sums we get the
following result
z−1
Y n
Y
log(p(y|z, θ)) = log( θ1yi (1 − θ1 )1−yi θ2yi (1 − θ2 )1−yi )
i=1 i=z
z−1
X n
X
= log(θ1yi (1 − θ1 )1−yi ) + log(θ2yi (1 − θ2 )1−yi )
i=1 i=z
z−1
X z−1
X n
X n
X
= log(θ1 ) yi + log(1 − θ1 ) 1 − yi + log(θ2 ) yi + log(1 − θ2 ) 1 − yi
i=1 i=1 i=z i=z
z−1
X z−1
X n
X n
X
= log(θ1 ) yi + log(1 − θ1 )(z − 1 + yi ) + log(θ2 ) yi + log(1 − θ2 )(n − z + 1 + yi )
i=1 i=1 i=z i=z
p(z|y, θ)
R(z) =
p(z = 1|y, θ)
2
we show here that
z−1
Y p(yj |θ1 , z)
R(z) =
j=1
p(yj |θ2 , z = 1)
A first useful remark is that y and z are not independent. we indeed have
p(y|z = 1) 6= p(y|z = 2) if θ1 6= θ2
p(z|y, θ)
R(z) =
p(z = 1|y, θ)
p(z, y|θ)
=
p(z = 1|y, θ)p(y|θ)
p(z, y|θ)
= p(z=1,y|θ)p(y|θ)
p(y|θ)
p(z, y|θ)
=
(z = 1, y|θ)
p(z, y|θ)
R(z) =
(z = 1, y|θ)
p(y|z, θ)p(z|θ)
=
p(y|z = 1, θ)p(z = 1|θ)
1
p(z = 1) = p(z) =
n
3
we deduce that
p(y|z, θ)
R(z) =
p(y|z = 1, θ)
p(y|z, θ)
R(z) =
p(y|z = 1, θ)
Qz−1 Qn
i=1 θ1yi (1 − θ1 )1−yi i=z θ2yi (1 − θ2 )1−yi
= Qn yi 1−yi
i=1 θ2 (1 − θ2 )
Qz−1 yi 1−yi
i=1 θ1 (1 − θ1 )
= Qz−1 yi 1−yi
i=1 θ2 (1 − θ2 )
z−1
Y p(yi |θ1 , z)
=
i=1
p(yi |θ2 , z = 1)
let’s first write what is R(z) and R(z-1) using the result of the previous
question
z−1
Y p(yi |θ1 , z)
R(z) =
i=1
p(yi |θ2 , z = 1)
z−2
Y p(yi |θ1 , z − 1)
R(z − 1) =
i=1
p(yi |θ2 , z = 1)
we can now write down the relation between the different quantities in the
equations :
4
z−2
p(yz−1 |θ1 , z) Y p(yi |θ1 , z)
R(z) =
p(yz−1 |θ2 , z = 1) i=1 p(yi |θ2 , z = 1)
z−2
p(yz−1 |θ1 , z) Y p(yi |θ1 , z − 1)
R(z) =
p(yz−1 |θ2 , z = 1) i=2 p(yi |θ2 , z = 2)
p(yz−1 |θ1 , z)
= R(z − 1)
p(yz−1 |θ2 , z = 1)
we now have a recursion formula to compute R(z) and from the definition
of R(z) we have R(1)=1, we can then deduce the following formula to compute
R(z).
if z > 1 :
z y
Y θ i−1 (1 − θ1 )1−yi−1
1
R(z) = y
θ i−1 (1 − θ2 )1−yi−1
i=2 2
if z = 1 :
R(1) = 1
using the definition of R(z) we know that
R(z)
p(z|y, θ) = Pn
i=1 R(i)
5
question 5 in this question, we will compute
z−1
X
E1 = E[ yi |y, θ0 ]
i=1
we finally compute E4 :
n
X
φ4 (z) = n − z + 1 − yi
i=z
n
X
E4 = E[φ4 (z)|y, θ0 ] = φ4 (z)p(z|y, θ0 )
z=1
6
so it can be computed in O(n). Using the recursion formula for R(z) we can
compute R(i+1) using R(i) and one product, implying that
R(z)
p(z|y, θ0 ) = Pn
i=1 R(i)
φ1 (z)p(z|y, θ0 )
we then have :
1
Q(θ, θ0 ) =E(log( p(y|z, θ))|y, θ0 )
n
1
=E(log( ) + log(p(y|z, θ))|y, θ0 )
n
= − log(n) + E(log(p(y|z, θ))|y, θ0 )
we can now compute E(log(p(y|z, θ))|y, θ0 ) using previous results from ques-
tion 2 and 5 :
E(log(p(y|z, θ))|y, θ0 ) =
=log(θ1 )E1 + log(1 − θ1 )E2 + log(θ2 )E3 + log(1 − θ2 )E4
n−1
=(log(θ1 ) − log(1 − θ1 ))E1 + log(1 − θ1 )( )
2
n+1
+ (log(θ2 ) − log(1 − θ2 ))E3 + log(1 − θ2 )( )
2
7
a local maximum is reached for
1 1
E1 − E2 = 0
θ1 1 − θ1
1 1
E3 − E4 = 0
θ2 1 − θ2
the solution to the equations are
E1
θ1 =
E1 + E2
E3
θ2 =
E3 + E4
we can now describe the full EM algorithm . First let’s choose randomly
θ0 . our vector y is known. We then compute E1 ,E2 ,E3 and E4 depending on
this θ0 . We use those values to compute the argument θ of the maximum of
E(log(p(y|z, θ))|y, θ0 ) . We use this value as the new θ0 , we call it θ1 and we
reiterate the process until reaching a terminating condition to determine.
1 1
p(yi |θ) = p0 (yi |θ) + p1 (yi |θ)
2 2
= p(zi = 0)p(yi |zi = 0, θ) + p(zi = 1)p(yi |zi = 1, θ)
8
we know that
considering independence of the data we get
n
X 1 1 −(yi − m0 )2 −(yi − m1 )2
p(y|z, θ) = log( √ )zi + log( √ )(1 − zi ) − zi ( ) − (1 − zi )( )
i=1
2π 2π 2 2
n
1 1X
= n log( √ ) − zi (yi − m0 )2 ) + (1 − zi )((yi − m1 )2 )
2π 2 i=1
We conclude that
n
1 1X
log(p(y, z|θ)) = log(p(y|z, θ))−log(2n ) = n log( √ )−log(2n )− zi (yi −m0 )2 )+(1−zi )((yi −m1 )2 )
2π 2 i=1
question 4 we can see that the sum in the above expression is a bivariate
quadratic function. Let’s consider a general bivariate quadratic function of the
form :
n
X
(1 − zi )(yi − m0 )2 + zi (yi − m1 )2
i=1
n
X
= yi2 + 2(zi − 1)yi m0 + (1 − zi )m20 + zi m21 − 2yi zi m1
i=1
9
n
X
A= 1 − zi
i=1
Xn
B= zi
i=1
Xn
C= 2(zi − 1)yi
i=1
Xn
D= −2yi zi
i=1
E=0
n
X
F = yi2
i=1
−2BC
m0 =
4AC
−2AD
m1 =
4AB
Pn Pn
−2 i=1 zi i=1 2(zi − 1)yi
m∗0 = Pn Pn
4 i=1 1 − zi i=1 zi
Pn
i=1 (1 − zi )yi
= P n
1 − zi
Pn i=1
(1 − zi )yi
= i=1 Pn
n − i=1 zi
We now compute m∗1 :
Pn Pn
−2 1 − zi i=1 −2yi zi
m∗1
= i=1
Pn Pn
4 i=1 1 − zi i=1 zi
Pn
yi zi
= Pi=1
n
i=1 zi
(m∗0 , m∗1 ) is then the argument for the maximum of our bivariate quadratic
function.
10
question 5 let’s compute p(zi = 1|yi , θ0 ) :
Using Bayes,
and that
1
p(yi ) = p(zi = 1)p(yi |zi = 1)+p(zi = 2)p(yi |zi = 2) = (N (yi |m1 , σ1 )+N (yi |m0 , σ0 ))
2
using the pdf of those normal law we find :
−(yi −m1 )2
0 e 2
p(zi = 1|yi , θ ) = −(yi −m0 )2 −(yi −m1 )2
e 2 +e 2
(pA + pB + pO )2 = 1
We then have the following expected numbers for each genotype
11
Using exactly the same method we can compute an equation for nA/O and
nB/B .
2pA pO
nA/O = nA
p2A + 2pA pO
p2B
nB/B = nB
2pB pO + p2B
p2A
nA/A = nA
2pA pO + p2A
p2A
⇔ nA/A = n(2pA pO + p2A )
2pA pO + p2A
⇔ nA/A = np2A
r
nA/A
⇔ = pA
n
12
To use an EM algorithm we need to be able to compute the following quan-
tity:
E[log P (Y, Z|θ)|Y, θ0 ] = log(Cn ) + nA/B log(2pA pB ) + 2nO log(pO ) + 2 log(pA )E[nA/A |Y, θ0 ]
+ log(2pA pO )E[nA/O |Y, θ0 ] + 2 log(pB )E[nB/B |Y, θ0 ] + log(2pB pO )E[nB/O |Y, θ0 ]
= log(Cn ) + nA/B log(2pA pB ) + 2nO log(pO ) + 2 log(pA )E[nA/A |Y = nA , θ0 ]
+ log(2pA pO )E[nA/O |Y = nA , θ0 ] + 2 log(pB )E[nB/B |Y = nB , θ0 ]
nA
(nA/A |Y = nA , θ0 ) ∼ (p0A )2
(p0A )2 +2p0A p0O
nA
(nA/O |Y = nA , θ0 ) ∼ 2(p0A )2 (p0O )2
(p0A )2 +2p0A p0O
nB
(nB/B |Y = nB , θ0 ) ∼ (p0B )2
(p0B )2 +2p0B p0O
nB
(nB/O |Y = nB , θ0 ) ∼ 2p0B p0O
(p0B )2 +2p0B p0O
E[log P (Y, Z|θ)|Y, θ0 ] = log(Cn ) + nA/B log(2pA pB ) + 2nO log(pO ) + 2 log(pA )E[nA/A |Y = nA , θ0 ]
+ log(2pA pO )E[nA/O |Y = nA , θ0 ] + 2 log(pB )E[nB/B |Y = nB , θ0 ]
= log(Cn ) + nA/B log(2pA pB ) + 2nO log(pO ) + log(pA )n0A/A
+ log(2pA pO )n0A/O + log(2pB pO )n0B/O + log(pB )n0B/B
13
We now want to find a maximum for E[log P (Y, Z|θ)|Y, θ0 ]−Cn with respect
to (pA , pB , pO ) :
14
We now have every step of our EM algorithm.
15