Computational Biology Project Report


Computational Biology Report 1

Nathan Raynal-Castang
June 2019

Contents
1 EM algorithm for detecting changepoints

2 EM algorithm for Gaussian mixtures

3 ABO groups and genetics

1 EM algorithm for detecting changepoints


We consider the following conditional probabilities describing our problem:

$$p(y_i \mid z, \theta) = \theta_1^{y_i} (1-\theta_1)^{1-y_i} \qquad \text{for } i = 1, \dots, z-1$$

$$p(y_i \mid z, \theta) = \theta_2^{y_i} (1-\theta_2)^{1-y_i} \qquad \text{for } i = z, \dots, n$$

We see that $p(y_i = 1 \mid z, \theta) = \theta_1$ and $p(y_i = 0 \mid z, \theta) = 1 - \theta_1$ for $i = 1, \dots, z-1$, as well as $p(y_i = 1 \mid z, \theta) = \theta_2$ and $p(y_i = 0 \mid z, \theta) = 1 - \theta_2$ for $i = z, \dots, n$. This matches the description of our problem.

Question 1 For $z = 1$:

Since the $y_i$ are independent of each other,

$$p(y \mid z, \theta) = \prod_{i=1}^{n} p(y_i \mid z, \theta)$$

Since $z = 1$, every $i$ is greater than or equal to $z$, implying

$$p(y \mid z, \theta) = \prod_{i=1}^{n} \theta_2^{y_i} (1-\theta_2)^{1-y_i}$$

For $z > 1$:

Since the $y_i$ are independent of each other,

$$p(y \mid z, \theta) = \prod_{i=1}^{n} p(y_i \mid z, \theta)$$

Following the definition of $p(y_i \mid z, \theta)$ we get

$$p(y \mid z, \theta) = \prod_{i=1}^{z-1} \theta_1^{y_i} (1-\theta_1)^{1-y_i} \prod_{i=z}^{n} \theta_2^{y_i} (1-\theta_2)^{1-y_i}$$

Question 2 We will now compute $\log(p(y \mid z, \theta))$ for $z > 1$.

Using the result of question 1 and the properties of logarithms and sums, we get the following result:

\begin{align*}
\log(p(y \mid z, \theta)) &= \log\!\left( \prod_{i=1}^{z-1} \theta_1^{y_i} (1-\theta_1)^{1-y_i} \prod_{i=z}^{n} \theta_2^{y_i} (1-\theta_2)^{1-y_i} \right) \\
&= \sum_{i=1}^{z-1} \log\!\left( \theta_1^{y_i} (1-\theta_1)^{1-y_i} \right) + \sum_{i=z}^{n} \log\!\left( \theta_2^{y_i} (1-\theta_2)^{1-y_i} \right) \\
&= \log(\theta_1) \sum_{i=1}^{z-1} y_i + \log(1-\theta_1) \sum_{i=1}^{z-1} (1-y_i) + \log(\theta_2) \sum_{i=z}^{n} y_i + \log(1-\theta_2) \sum_{i=z}^{n} (1-y_i) \\
&= \log(\theta_1) \sum_{i=1}^{z-1} y_i + \log(1-\theta_1)\!\left( z-1 - \sum_{i=1}^{z-1} y_i \right) + \log(\theta_2) \sum_{i=z}^{n} y_i + \log(1-\theta_2)\!\left( n-z+1 - \sum_{i=z}^{n} y_i \right)
\end{align*}
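As a quick sanity check of this expression, here is a minimal Python sketch that evaluates $\log p(y \mid z, \theta)$ directly from the data; the function name and interface are my own choices for illustration, not part of the assignment.

```python
import numpy as np

def log_likelihood(y, z, theta1, theta2):
    """log p(y | z, theta) for a given changepoint z (1-based), as derived above."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    s1 = y[: z - 1].sum()           # sum_{i=1}^{z-1} y_i
    s2 = y[z - 1 :].sum()           # sum_{i=z}^{n} y_i
    return (np.log(theta1) * s1 + np.log(1 - theta1) * ((z - 1) - s1)
            + np.log(theta2) * s2 + np.log(1 - theta2) * ((n - z + 1) - s2))
```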

Question 3 We suppose $\theta_1$ and $\theta_2$ known and we now define the following quantity:

$$R(z) = \frac{p(z \mid y, \theta)}{p(z = 1 \mid y, \theta)}$$

We show here that

$$R(z) = \prod_{j=1}^{z-1} \frac{p(y_j \mid \theta_1, z)}{p(y_j \mid \theta_2, z = 1)}$$

A first useful remark is that $y$ and $z$ are not independent: indeed, $p(y \mid z = 1) \neq p(y \mid z = 2)$ if $\theta_1 \neq \theta_2$.

Using the chain rule we compute

$$p(z, y \mid \theta) = \frac{p(z, y, \theta)}{p(\theta)} = \frac{p(z \mid y, \theta)\, p(y \mid \theta)\, p(\theta)}{p(\theta)} = p(z \mid y, \theta)\, p(y \mid \theta)$$

We deduce from it that

\begin{align*}
R(z) &= \frac{p(z \mid y, \theta)}{p(z = 1 \mid y, \theta)} \\
&= \frac{p(z, y \mid \theta)}{p(z = 1 \mid y, \theta)\, p(y \mid \theta)} \\
&= \frac{p(z, y \mid \theta)}{\frac{p(z = 1, y \mid \theta)}{p(y \mid \theta)}\, p(y \mid \theta)} \\
&= \frac{p(z, y \mid \theta)}{p(z = 1, y \mid \theta)}
\end{align*}

Using the chain rule again we get

\begin{align*}
R(z) &= \frac{p(z, y \mid \theta)}{p(z = 1, y \mid \theta)} \\
&= \frac{p(y \mid z, \theta)\, p(z \mid \theta)}{p(y \mid z = 1, \theta)\, p(z = 1 \mid \theta)}
\end{align*}

Using the independence of $z$ and $\theta$ we get

$$p(z = 1 \mid \theta) = p(z = 1) \qquad \text{and} \qquad p(z \mid \theta) = p(z)$$

Since we suppose the prior distribution on $z$ uniform, we have

$$p(z = 1) = p(z) = \frac{1}{n}$$

We deduce that

$$R(z) = \frac{p(y \mid z, \theta)}{p(y \mid z = 1, \theta)}$$

We can now use the expressions of those two quantities computed in question 1:

\begin{align*}
R(z) &= \frac{p(y \mid z, \theta)}{p(y \mid z = 1, \theta)} \\
&= \frac{\prod_{i=1}^{z-1} \theta_1^{y_i} (1-\theta_1)^{1-y_i} \prod_{i=z}^{n} \theta_2^{y_i} (1-\theta_2)^{1-y_i}}{\prod_{i=1}^{n} \theta_2^{y_i} (1-\theta_2)^{1-y_i}} \\
&= \frac{\prod_{i=1}^{z-1} \theta_1^{y_i} (1-\theta_1)^{1-y_i}}{\prod_{i=1}^{z-1} \theta_2^{y_i} (1-\theta_2)^{1-y_i}} \\
&= \prod_{i=1}^{z-1} \frac{p(y_i \mid \theta_1, z)}{p(y_i \mid \theta_2, z = 1)}
\end{align*}

This is the desired result.

Question 4 In this question we aim to establish a relationship between $R(z)$ and $R(z-1)$ for all $z = 2, \dots, n$.

Let's first write down $R(z)$ and $R(z-1)$ using the result of the previous question:

$$R(z) = \prod_{i=1}^{z-1} \frac{p(y_i \mid \theta_1, z)}{p(y_i \mid \theta_2, z = 1)}$$

$$R(z-1) = \prod_{i=1}^{z-2} \frac{p(y_i \mid \theta_1, z-1)}{p(y_i \mid \theta_2, z = 1)}$$

We can now write down the relations between the quantities appearing in these equations:

$$p(y_i \mid \theta_1, z-1) = p(y_i \mid \theta_1, z) \quad \text{for } i \neq z-1, \qquad p(y_i \mid \theta_1, z) = \theta_1^{y_i} (1-\theta_1)^{1-y_i} \quad \text{for } i = z-1$$

We can then compute:

\begin{align*}
R(z) &= \frac{p(y_{z-1} \mid \theta_1, z)}{p(y_{z-1} \mid \theta_2, z = 1)} \prod_{i=1}^{z-2} \frac{p(y_i \mid \theta_1, z)}{p(y_i \mid \theta_2, z = 1)} \\
&= \frac{p(y_{z-1} \mid \theta_1, z)}{p(y_{z-1} \mid \theta_2, z = 1)} \prod_{i=1}^{z-2} \frac{p(y_i \mid \theta_1, z-1)}{p(y_i \mid \theta_2, z = 1)} \\
&= \frac{p(y_{z-1} \mid \theta_1, z)}{p(y_{z-1} \mid \theta_2, z = 1)} \, R(z-1)
\end{align*}

Using the definition of those conditional probabilities we get:

$$R(z) = R(z-1) \, \frac{\theta_1^{y_{z-1}} (1-\theta_1)^{1-y_{z-1}}}{\theta_2^{y_{z-1}} (1-\theta_2)^{1-y_{z-1}}}$$

We now have a recursion formula to compute $R(z)$, and from the definition of $R(z)$ we have $R(1) = 1$. We can then deduce the following formula to compute $R(z)$.

If $z > 1$:

$$R(z) = \prod_{i=2}^{z} \frac{\theta_1^{y_{i-1}} (1-\theta_1)^{1-y_{i-1}}}{\theta_2^{y_{i-1}} (1-\theta_2)^{1-y_{i-1}}}$$

If $z = 1$: $R(1) = 1$.
Using the definition of $R(z)$ we know that

$$p(z \mid y, \theta) = R(z)\, p(z = 1 \mid y, \theta)$$

We also know that

$$\sum_{i=1}^{n} p(z = i \mid y, \theta) = 1$$

which implies that

$$\sum_{i=1}^{n} R(i)\, p(z = 1 \mid y, \theta) = 1$$

We can finally deduce that

$$p(z = 1 \mid y, \theta) = \frac{1}{\sum_{i=1}^{n} R(i)}$$

We now have all the pieces to compute $p(z \mid y, \theta)$:

$$p(z \mid y, \theta) = \frac{R(z)}{\sum_{i=1}^{n} R(i)}$$
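The following Python sketch puts the recursion and the normalisation together. The function name and the use of NumPy are my own choices; for long sequences one would work in log space to avoid underflow.

```python
import numpy as np

def changepoint_posterior(y, theta1, theta2):
    """Posterior p(z | y, theta) over changepoints z = 1, ..., n.

    Assumes a uniform prior on z and uses the recursion
    R(z) = R(z-1) * p(y_{z-1} | theta1) / p(y_{z-1} | theta2), with R(1) = 1,
    then normalises: p(z | y, theta) = R(z) / sum_i R(i).
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    R = np.empty(n)
    R[0] = 1.0                          # R(1) = 1
    for z in range(1, n):               # 0-based index z corresponds to changepoint z + 1
        yi = y[z - 1]                   # observation y_{z-1} in the 1-based notation
        num = theta1 ** yi * (1 - theta1) ** (1 - yi)
        den = theta2 ** yi * (1 - theta2) ** (1 - yi)
        R[z] = R[z - 1] * num / den
    return R / R.sum()
```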

5
Question 5 In this question, we will compute

$$E_1 = E\!\left[ \sum_{i=1}^{z-1} y_i \,\Big|\, y, \theta^0 \right]$$

In $E_1$ the only random variable is $z$. We can then define $\phi_1(z)$ as a function of this random variable:

$$\phi_1(z) = \sum_{i=1}^{z-1} y_i$$

$$E_1 = E[\phi_1(z) \mid y, \theta^0] = \sum_{z=1}^{n} \phi_1(z)\, p(z \mid y, \theta^0)$$

From this result we can easily compute $E_2$:

$$\phi_2(z) = z - 1 - \sum_{i=1}^{z-1} y_i$$

$$E_2 = E[\phi_2(z) \mid y, \theta^0] = \sum_{z=1}^{n} \phi_2(z)\, p(z \mid y, \theta^0)$$

In the same fashion as for $E_1$, we can compute $E_3$. Let's define

$$\phi_3(z) = \sum_{i=z}^{n} y_i$$

$$E_3 = E[\phi_3(z) \mid y, \theta^0] = \sum_{z=1}^{n} \phi_3(z)\, p(z \mid y, \theta^0)$$

We finally compute $E_4$:

$$\phi_4(z) = n - z + 1 - \sum_{i=z}^{n} y_i$$

$$E_4 = E[\phi_4(z) \mid y, \theta^0] = \sum_{z=1}^{n} \phi_4(z)\, p(z \mid y, \theta^0)$$

Let's now compute the complexity of computing $E_1$. For a fixed $z$,

$$\phi_1(z) = \sum_{i=1}^{z-1} y_i$$

can be computed in $O(n)$. Using the recursion formula for $R(z)$ we can compute $R(i+1)$ from $R(i)$ with one product, implying that

$$p(z \mid y, \theta^0) = \frac{R(z)}{\sum_{i=1}^{n} R(i)}$$

can also be computed in $O(n)$. It follows that one term $\phi_1(z)\, p(z \mid y, \theta^0)$ can be computed in $O(n)$. We must sum $n$ such terms, which naively gives a complexity of $O(n^2)$. However, we can precompute $p(z \mid y, \theta^0)$ and $\phi_1(z)$ for every $z$ before computing $E_1$; those precomputations are in $O(n)$ (using prefix sums for $\phi_1$). Once we have these values, the sum of all $\phi_1(z)\, p(z \mid y, \theta^0)$ can be computed in $O(n)$. We then have $O(n) + O(n) = O(n)$. Finally, $E_1$ can be computed in $O(n)$.
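A possible vectorised implementation of this $O(n)$ scheme follows. It is a sketch with assumed names, using prefix sums to obtain all $\phi_k(z)$ at once.

```python
import numpy as np

def expected_sufficient_stats(y, posterior):
    """Compute E1, E2, E3, E4 in O(n).

    posterior[z-1] holds p(z | y, theta0).  Prefix sums give phi_1(z), ...,
    phi_4(z) for every z at once, so each expectation is a single dot product.
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    z = np.arange(1, n + 1)
    csum = np.cumsum(y)
    phi1 = np.concatenate(([0.0], csum[:-1]))   # sum_{i=1}^{z-1} y_i
    phi2 = (z - 1) - phi1                       # sum_{i=1}^{z-1} (1 - y_i)
    phi3 = y.sum() - phi1                       # sum_{i=z}^{n} y_i
    phi4 = (n - z + 1) - phi3                   # sum_{i=z}^{n} (1 - y_i)
    return tuple(np.dot(phi, posterior) for phi in (phi1, phi2, phi3, phi4))
```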

Question 6 Using Bayes' formula we compute:

\begin{align*}
p(y, z \mid \theta) &= p(y \mid z, \theta)\, p(z \mid \theta) \\
&= p(y \mid z, \theta)\, p(z) \\
&= \frac{1}{n}\, p(y \mid z, \theta)
\end{align*}

We then have:

\begin{align*}
Q(\theta, \theta^0) &= E\!\left( \log\!\left( \tfrac{1}{n}\, p(y \mid z, \theta) \right) \,\Big|\, y, \theta^0 \right) \\
&= E\!\left( \log\!\left( \tfrac{1}{n} \right) + \log(p(y \mid z, \theta)) \,\Big|\, y, \theta^0 \right) \\
&= -\log(n) + E\!\left( \log(p(y \mid z, \theta)) \mid y, \theta^0 \right)
\end{align*}

We can now compute $E(\log(p(y \mid z, \theta)) \mid y, \theta^0)$ using the results of questions 2 and 5:

\begin{align*}
E(\log(p(y \mid z, \theta)) \mid y, \theta^0) &= \log(\theta_1) E_1 + \log(1-\theta_1) E_2 + \log(\theta_2) E_3 + \log(1-\theta_2) E_4 \\
&= (\log(\theta_1) - \log(1-\theta_1)) E_1 + \log(1-\theta_1)\, E[z-1 \mid y, \theta^0] \\
&\quad + (\log(\theta_2) - \log(1-\theta_2)) E_3 + \log(1-\theta_2)\, E[n-z+1 \mid y, \theta^0]
\end{align*}

A local maximum is reached where

$$\nabla_\theta\, E(\log(p(y \mid z, \theta)) \mid y, \theta^0) = (0, 0)$$

that is,

\begin{align*}
\frac{E_1}{\theta_1} - \frac{E_2}{1-\theta_1} &= 0 \\
\frac{E_3}{\theta_2} - \frac{E_4}{1-\theta_2} &= 0
\end{align*}

The solutions to these equations are

$$\theta_1 = \frac{E_1}{E_1 + E_2}, \qquad \theta_2 = \frac{E_3}{E_3 + E_4}$$
We can now describe the full EM algorithm. First, choose an initial $\theta^0$ at random; the vector $y$ is known. We then compute $E_1$, $E_2$, $E_3$ and $E_4$ using this $\theta^0$. We use those values to compute the argument $\theta$ maximising $E(\log(p(y \mid z, \theta)) \mid y, \theta^0)$. We take this value as the new parameter, call it $\theta^1$, and we iterate the process until some terminating condition (to be chosen) is reached.
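A compact sketch of the whole loop follows, reusing the hypothetical helper functions from the earlier sketches (changepoint_posterior and expected_sufficient_stats); the starting point, the tolerance and the stopping rule are arbitrary choices, not prescribed by the assignment.

```python
def em_changepoint(y, theta0=(0.3, 0.7), max_iter=100, tol=1e-8):
    """EM for the changepoint model, combining the two sketches above."""
    theta1, theta2 = theta0
    for _ in range(max_iter):
        post = changepoint_posterior(y, theta1, theta2)       # E-step
        E1, E2, E3, E4 = expected_sufficient_stats(y, post)
        new1, new2 = E1 / (E1 + E2), E3 / (E3 + E4)           # M-step
        converged = abs(new1 - theta1) + abs(new2 - theta2) < tol
        theta1, theta2 = new1, new2
        if converged:
            break
    return theta1, theta2, changepoint_posterior(y, theta1, theta2)
```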

Question 7 Once we are confident in the result of the EM algorithm, we can use the computed parameters to estimate $p(z \mid y, \theta)$ for every $z$. We then obtain an empirical probability distribution for $Z \mid Y$. From this we can compute an estimator of the expectation of $Z \mid Y$ and a confidence interval (https://en.wikipedia.org/wiki/Confidence_interval).

2 EM algorithm for Gaussian mixtures


Question 1 Knowing that $p(z_i = 1) = 1/2$:

\begin{align*}
p(y_i \mid \theta) &= \frac{1}{2}\, p_0(y_i \mid \theta) + \frac{1}{2}\, p_1(y_i \mid \theta) \\
&= p(z_i = 0)\, p(y_i \mid z_i = 0, \theta) + p(z_i = 1)\, p(y_i \mid z_i = 1, \theta)
\end{align*}

Question 3 Using Bayes' rule we have

$$p(y, z \mid \theta) = p(y \mid z, \theta)\, p(z \mid \theta)$$

As the $z_i$ are independent variables, we have

$$p(y, z \mid \theta) = p(y \mid z, \theta) \prod_{i=1}^{n} p(z_i \mid \theta) = \frac{p(y \mid z, \theta)}{2^n}$$

Considering the independence of the data, and taking the logarithm, we get

\begin{align*}
\log(p(y \mid z, \theta)) &= \sum_{i=1}^{n} \left[ z_i \log\!\left( \tfrac{1}{\sqrt{2\pi}} \right) + (1-z_i) \log\!\left( \tfrac{1}{\sqrt{2\pi}} \right) - z_i \frac{(y_i - m_1)^2}{2} - (1-z_i) \frac{(y_i - m_0)^2}{2} \right] \\
&= n \log\!\left( \tfrac{1}{\sqrt{2\pi}} \right) - \frac{1}{2} \sum_{i=1}^{n} \left[ z_i (y_i - m_1)^2 + (1-z_i)(y_i - m_0)^2 \right]
\end{align*}

We conclude that

$$\log(p(y, z \mid \theta)) = \log(p(y \mid z, \theta)) - \log(2^n) = C_n - \frac{1}{2} \sum_{i=1}^{n} \left[ z_i (y_i - m_1)^2 + (1-z_i)(y_i - m_0)^2 \right]$$

with $C_n = n \log\!\left( \tfrac{1}{\sqrt{2\pi}} \right) - \log(2^n)$.

Question 4 We can see that the sum in the above expression is a bivariate quadratic function. Let's consider a general bivariate quadratic function of the form

$$A m_0^2 + B m_1^2 + C m_0 + D m_1 + E m_0 m_1 + F$$

We can expand our expression to make the coefficients $A, B, C, D, E, F$ explicit:

$$\sum_{i=1}^{n} \left[ (1-z_i)(y_i - m_0)^2 + z_i (y_i - m_1)^2 \right] = \sum_{i=1}^{n} \left[ y_i^2 + 2(z_i - 1) y_i m_0 + (1-z_i) m_0^2 + z_i m_1^2 - 2 y_i z_i m_1 \right]$$
We now have our coefficients:

\begin{align*}
A &= \sum_{i=1}^{n} (1 - z_i) \\
B &= \sum_{i=1}^{n} z_i \\
C &= \sum_{i=1}^{n} 2(z_i - 1) y_i \\
D &= \sum_{i=1}^{n} -2 y_i z_i \\
E &= 0 \\
F &= \sum_{i=1}^{n} y_i^2
\end{align*}

Since $A, B > 0$ and $E = 0$, this quadratic has a unique stationary point, its minimum, which corresponds to the maximum of the log-likelihood. It is reached for:

$$m_0 = \frac{-2BC}{4AB}, \qquad m_1 = \frac{-2AD}{4AB}$$

Let's first compute $m_0^*$:

\begin{align*}
m_0^* &= \frac{-2 \sum_{i=1}^{n} z_i \sum_{i=1}^{n} 2(z_i - 1) y_i}{4 \sum_{i=1}^{n} (1 - z_i) \sum_{i=1}^{n} z_i} \\
&= \frac{\sum_{i=1}^{n} (1 - z_i) y_i}{\sum_{i=1}^{n} (1 - z_i)} \\
&= \frac{\sum_{i=1}^{n} (1 - z_i) y_i}{n - \sum_{i=1}^{n} z_i}
\end{align*}
We now compute $m_1^*$:

\begin{align*}
m_1^* &= \frac{-2 \sum_{i=1}^{n} (1 - z_i) \sum_{i=1}^{n} (-2 y_i z_i)}{4 \sum_{i=1}^{n} (1 - z_i) \sum_{i=1}^{n} z_i} \\
&= \frac{\sum_{i=1}^{n} y_i z_i}{\sum_{i=1}^{n} z_i}
\end{align*}

$(m_0^*, m_1^*)$ is then the argument of the minimum of our bivariate quadratic function, and hence the argument of the maximum of $\log(p(y, z \mid \theta))$.

Question 5 Let's compute $p(z_i = 1 \mid y_i, \theta^0)$. Using Bayes' rule,

$$p(z_i = 1 \mid y_i, \theta^0) = \frac{p(y_i \mid z_i = 1)\, p(z_i = 1)}{p(y_i)}$$

We supposed that $p(z_i = 1) = \tfrac{1}{2}$. Furthermore, we know that

$$p(y_i \mid z_i = 1) = \mathcal{N}(y_i \mid m_1, \sigma_1)$$

and that

$$p(y_i) = p(z_i = 1)\, p(y_i \mid z_i = 1) + p(z_i = 0)\, p(y_i \mid z_i = 0) = \frac{1}{2} \left( \mathcal{N}(y_i \mid m_1, \sigma_1) + \mathcal{N}(y_i \mid m_0, \sigma_0) \right)$$

Using the pdf of those normal laws we find:

$$p(z_i = 1 \mid y_i, \theta^0) = \frac{e^{-\frac{(y_i - m_1)^2}{2}}}{e^{-\frac{(y_i - m_0)^2}{2}} + e^{-\frac{(y_i - m_1)^2}{2}}}$$
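Putting questions 4 and 5 together gives the usual EM iteration for this mixture. A minimal Python sketch follows; the function name, initial means and stopping rule are my own choices, and the M-step plugs the responsibilities in place of the unobserved $z_i$, as is standard for EM.

```python
import numpy as np

def em_two_gaussians(y, m0=0.0, m1=1.0, max_iter=100, tol=1e-8):
    """EM for the two-component mixture with unit variances and p(z_i = 1) = 1/2."""
    y = np.asarray(y, dtype=float)
    for _ in range(max_iter):
        w1 = np.exp(-0.5 * (y - m1) ** 2)        # proportional to N(y_i | m1, 1)
        w0 = np.exp(-0.5 * (y - m0) ** 2)        # proportional to N(y_i | m0, 1)
        gamma = w1 / (w0 + w1)                   # E-step: p(z_i = 1 | y_i, theta0)
        new_m0 = np.sum((1 - gamma) * y) / np.sum(1 - gamma)   # M-step (question 4)
        new_m1 = np.sum(gamma * y) / np.sum(gamma)
        converged = abs(new_m0 - m0) + abs(new_m1 - m1) < tol
        m0, m1 = new_m0, new_m1
        if converged:
            break
    return m0, m1
```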

3 ABO groups and genetics


Question 1 The Hardy-Weinberg principle states that, under some assumptions, the expected genotype frequencies in the diploid case are given by the multinomial expansion over all alleles; in our case it is:

$$(p_A + p_B + p_O)^2 = 1$$

We then have the following expected numbers for each genotype:

\begin{align*}
n_{A/A} &= n p_A^2 & n_{A/B} &= 2 n p_A p_B \\
n_{B/B} &= n p_B^2 & n_{A/O} &= 2 n p_A p_O \\
n_{O/O} &= n p_O^2 & n_{B/O} &= 2 n p_B p_O
\end{align*}

We can deduce from it that

$$n_A = n p_A^2 + 2 n p_A p_O = n (2 p_A p_O + p_A^2)$$

which implies that

$$\frac{n_A}{n_{A/A}} = \frac{n (2 p_A p_O + p_A^2)}{n p_A^2} \;\Leftrightarrow\; n_{A/A} = \frac{p_A^2}{2 p_A p_O + p_A^2}\, n_A$$

Using exactly the same method we can derive equations for $n_{A/O}$ and $n_{B/B}$:

$$n_{A/O} = \frac{2 p_A p_O}{p_A^2 + 2 p_A p_O}\, n_A, \qquad n_{B/B} = \frac{p_B^2}{2 p_B p_O + p_B^2}\, n_B$$

Question 3 Estimator for $p_A$:

\begin{align*}
n_{A/A} &= \frac{p_A^2}{2 p_A p_O + p_A^2}\, n_A \\
\Leftrightarrow\; n_{A/A} &= \frac{p_A^2}{2 p_A p_O + p_A^2}\, n (2 p_A p_O + p_A^2) \\
\Leftrightarrow\; n_{A/A} &= n p_A^2 \\
\Leftrightarrow\; p_A &= \sqrt{\frac{n_{A/A}}{n}}
\end{align*}

Using the same method, we give estimators for $p_O$ and $p_B$ as well:

$$p_O = \frac{n_{A/O}}{2 n p_A}, \qquad p_B = \sqrt{\frac{n_{B/B}}{n}}$$
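For illustration, these estimators are trivial to evaluate when the genotype counts are observed; the short sketch below uses hypothetical argument names. In practice the genotype counts behind phenotype A and B are hidden, which is what motivates the EM algorithm of question 5.

```python
import math

def allele_freq_estimators(n_AA, n_BB, n_AO, n):
    """Closed-form estimators derived above, assuming the genotype counts
    n_AA, n_BB and n_AO were directly observed."""
    p_A = math.sqrt(n_AA / n)
    p_B = math.sqrt(n_BB / n)
    p_O = n_AO / (2 * n * p_A)
    return p_A, p_B, p_O
```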

Question 5 We are now going to write an EM algorithm to compute the allele proportions knowing the phenotypes and assuming the Hardy-Weinberg principle holds. First we clearly define our variables and parameters. Our unknown parameter $\theta$ is composed of the three allele frequencies in the population, namely:

$$\theta = (p_A, p_O, p_B)$$

Our data consists of the phenotypes of the individuals, so our incomplete data is

$$Y = (n_A, n_B, n_{AB}, n_O)$$

Finally, the hidden data consists of the genotype counts underlying the phenotypes; every individual carries a combination of two of the alleles (A, B, O):

$$z = (n_{A/A}, n_{A/O}, n_{B/B}, n_{B/O})$$

The other combinations are known quantities.

To use an EM algorithm we need to be able to compute the following quantity:

\begin{align*}
Q(\theta, \theta^0) &= E[\log P(Y, Z \mid \theta) \mid Y, \theta^0] \\
&= E[\log(C_n) + 2 n_{A/A} \log(p_A) + n_{A/O} \log(2 p_A p_O) + 2 n_{B/B} \log(p_B) + n_{B/O} \log(2 p_B p_O) \\
&\qquad + n_{A/B} \log(2 p_A p_B) + 2 n_O \log(p_O) \mid Y, \theta^0]
\end{align*}

with $C_n$ a constant depending only on $n$. $n_{A/B}$ and $n_O$ are known constants, so

\begin{align*}
E[\log P(Y, Z \mid \theta) \mid Y, \theta^0] &= \log(C_n) + n_{A/B} \log(2 p_A p_B) + 2 n_O \log(p_O) + 2 \log(p_A)\, E[n_{A/A} \mid Y, \theta^0] \\
&\qquad + \log(2 p_A p_O)\, E[n_{A/O} \mid Y, \theta^0] + 2 \log(p_B)\, E[n_{B/B} \mid Y, \theta^0] + \log(2 p_B p_O)\, E[n_{B/O} \mid Y, \theta^0] \\
&= \log(C_n) + n_{A/B} \log(2 p_A p_B) + 2 n_O \log(p_O) + 2 \log(p_A)\, E[n_{A/A} \mid Y = n_A, \theta^0] \\
&\qquad + \log(2 p_A p_O)\, E[n_{A/O} \mid Y = n_A, \theta^0] + 2 \log(p_B)\, E[n_{B/B} \mid Y = n_B, \theta^0] + \log(2 p_B p_O)\, E[n_{B/O} \mid Y = n_B, \theta^0]
\end{align*}

We can compute the conditional distributions appearing in the expectations. They follow binomial laws:

\begin{align*}
(n_{A/A} \mid Y = n_A, \theta^0) &\sim \mathrm{Bin}\!\left( n_A,\; \frac{(p_A^0)^2}{(p_A^0)^2 + 2 p_A^0 p_O^0} \right) \\
(n_{A/O} \mid Y = n_A, \theta^0) &\sim \mathrm{Bin}\!\left( n_A,\; \frac{2 p_A^0 p_O^0}{(p_A^0)^2 + 2 p_A^0 p_O^0} \right) \\
(n_{B/B} \mid Y = n_B, \theta^0) &\sim \mathrm{Bin}\!\left( n_B,\; \frac{(p_B^0)^2}{(p_B^0)^2 + 2 p_B^0 p_O^0} \right) \\
(n_{B/O} \mid Y = n_B, \theta^0) &\sim \mathrm{Bin}\!\left( n_B,\; \frac{2 p_B^0 p_O^0}{(p_B^0)^2 + 2 p_B^0 p_O^0} \right)
\end{align*}

The expectation of a binomial distribution is $np$, so each conditional expectation above is the corresponding phenotype count $n_A$ or $n_B$ times the binomial probability. Writing $n^0_{A/A} = E[n_{A/A} \mid Y = n_A, \theta^0]$, and similarly for the other hidden counts, we can then rewrite:

\begin{align*}
E[\log P(Y, Z \mid \theta) \mid Y, \theta^0] &= \log(C_n) + n_{A/B} \log(2 p_A p_B) + 2 n_O \log(p_O) + 2 \log(p_A)\, n^0_{A/A} \\
&\qquad + \log(2 p_A p_O)\, n^0_{A/O} + 2 \log(p_B)\, n^0_{B/B} + \log(2 p_B p_O)\, n^0_{B/O}
\end{align*}

We now want to maximise $E[\log P(Y, Z \mid \theta) \mid Y, \theta^0] - \log(C_n)$ with respect to $(p_A, p_B, p_O)$:

$$\operatorname*{argmax}_{(p_A, p_B, p_O)} \; E[\log P(Y, Z \mid \theta) \mid Y, \theta^0] - \log(C_n)$$

subject to the constraint $p_A + p_B + p_O = 1$. We use a Lagrange multiplier to find a maximum under this constraint:

$$L(\theta, \lambda) = Q(\theta, \theta^0) - \log(C_n) + \lambda (1 - p_A - p_B - p_O)$$

\begin{align*}
\frac{\partial}{\partial p_A} L(\theta, \lambda) &= \frac{2 n^0_{A/A}}{p_A} + \frac{n^0_{A/O}}{p_A} + \frac{n_{A/B}}{p_A} - \lambda = 0 \\
\frac{\partial}{\partial p_B} L(\theta, \lambda) &= \frac{2 n^0_{B/B}}{p_B} + \frac{n^0_{B/O}}{p_B} + \frac{n_{A/B}}{p_B} - \lambda = 0 \\
\frac{\partial}{\partial p_O} L(\theta, \lambda) &= \frac{n^0_{A/O}}{p_O} + \frac{n^0_{B/O}}{p_O} + \frac{2 n_O}{p_O} - \lambda = 0 \\
\frac{\partial}{\partial \lambda} L(\theta, \lambda) &= 1 - p_A - p_B - p_O = 0
\end{align*}
Solving this system we obtain:

\begin{align*}
p_A &= \frac{2 n^0_{A/A} + n^0_{A/O} + n_{A/B}}{\lambda} \\
p_B &= \frac{2 n^0_{B/B} + n^0_{B/O} + n_{A/B}}{\lambda} \\
p_O &= \frac{2 n_O + n^0_{B/O} + n^0_{A/O}}{\lambda} \\
1 &= p_A + p_B + p_O
\end{align*}

If we sum the first three equations, knowing that $p_A + p_B + p_O = 1$, we end up with

$$\lambda = 2 \left( n^0_{A/A} + n^0_{A/O} + n_{A/B} + n^0_{B/B} + n^0_{B/O} + n_O \right) = 2n$$


Our maximum is then reached for

\begin{align*}
p_A^1 &= \frac{2 n^0_{A/A} + n^0_{A/O} + n_{A/B}}{2n} \\
p_B^1 &= \frac{2 n^0_{B/B} + n^0_{B/O} + n_{A/B}}{2n} \\
p_O^1 &= \frac{2 n_O + n^0_{B/O} + n^0_{A/O}}{2n}
\end{align*}

We now have every step of our EM algorithm; a short code sketch follows the list below.

• Using $p_A^0$, $p_B^0$, $p_O^0$ we compute $n^0_{A/A}$, $n^0_{A/O}$, $n^0_{B/B}$, $n^0_{B/O}$.

• We then use them to compute $p_A^1$, $p_B^1$, $p_O^1$.

• We repeat the process until we reach convergence.
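A minimal Python sketch of these three steps; the function and variable names are my own, and the initialisation and stopping rule are arbitrary choices.

```python
def em_abo(n_A, n_B, n_AB, n_O, p_init=(0.3, 0.3, 0.4), max_iter=200, tol=1e-10):
    """EM for the ABO allele frequencies under Hardy-Weinberg.

    n_A, n_B, n_AB, n_O are the observed phenotype counts; returns (p_A, p_B, p_O).
    """
    n = n_A + n_B + n_AB + n_O
    p_A, p_B, p_O = p_init
    for _ in range(max_iter):
        # E-step: expected genotype counts given the current frequencies
        n_AA = n_A * p_A ** 2 / (p_A ** 2 + 2 * p_A * p_O)
        n_AO = n_A - n_AA
        n_BB = n_B * p_B ** 2 / (p_B ** 2 + 2 * p_B * p_O)
        n_BO = n_B - n_BB
        # M-step: allele counting (each of the n individuals carries two alleles)
        new_A = (2 * n_AA + n_AO + n_AB) / (2 * n)
        new_B = (2 * n_BB + n_BO + n_AB) / (2 * n)
        new_O = (2 * n_O + n_AO + n_BO) / (2 * n)
        converged = abs(new_A - p_A) + abs(new_B - p_B) + abs(new_O - p_O) < tol
        p_A, p_B, p_O = new_A, new_B, new_O
        if converged:
            break
    return p_A, p_B, p_O
```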

