C(2n, n) = (2n)! / (n!)^2 ≈ 2^(2n) / sqrt(π n)    (1)

For small values of n such as 100, the number of alignments is already too big (10^60) for this enumerating strategy to be feasible.

¹ Even different substitution events have different likelihoods, since specific polymerases are more or less likely to make specific mistakes.

² There was a note about evolutionary pressure to keep the genome compact. For example, the genome of viruses is very small since they have the evolutionary pressure of fitting it inside the small virus particle. Similarly, bacteria are pressured to keep their genome small to save energy in replication and to allow easy exchange of genomic material between cells. On the other hand, there is little pressure for human cells to maintain a small genome, so it is no surprise that the human genome is so large.

³ Non-Boring Alignments: an alignment that doesn't match gaps on both sequences.

⁴ We use the Stirling approximation: n! ≈ sqrt(2πn) · (n/e)^n.
2.3 Formulation 1: Longest Common Substring
As a first attempt, suppose we treat the nucleotide sequences as strings over the alphabet {A, C, G, T}. Given two such strings, R and S, we might try to align them by finding the longest common substring between them. In particular, these substrings cannot have gaps in them.

As an example, if R = ACGTCATCA and S = TAGTGTCA, the longest common substring between them is GTCA. So in this formulation, we could align R and S along their longest common substring, GTCA, to get the most matches.
A simple algorithm would be to just try aligning S at different offsets with respect to R, keeping track of the longest substring match found thus far. But this algorithm is quadratic (too slow) in the lengths of R and S. Assuming |R| = |S| = n, we loop through each character in R, trying to align S at that offset. During each iteration, we then loop over the characters of S and compare them to the corresponding characters in R, looking for matching substrings. Thus the total running time would be O(n²).
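A minimal sketch of this quadratic strategy, in the spirit of the Python samples later in these notes (the function and variable names are ours, not from the lecture):

def longest_common_substring(R, S):
    # Slide S along R; at each offset, scan for the longest run of
    # characters that match between the two strings.
    best = ""
    for offset in range(-len(S) + 1, len(R)):   # position of S[0] within R
        run = ""
        for j in range(len(S)):
            i = offset + j
            if 0 <= i < len(R) and R[i] == S[j]:
                run += S[j]
                if len(run) > len(best):
                    best = run
            else:
                run = ""   # a mismatch (or running off R) breaks the run
    return best

print(longest_common_substring("ACGTCATCA", "TAGTGTCA"))   # GTCA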
2.4 Formulation 2: Longest Common Subsequence
Another formulation is to allow gaps in our subsequences and not just limit ourselves to substrings with no gaps. Given a sequence X = x_1 ... x_m, we formally define Z = z_1 ... z_k to be a subsequence of X if there exists a strictly increasing sequence i_1, ..., i_k of indices of X such that for all j = 1, ..., k, we have x_{i_j} = z_j (CLRS 350-1). In the longest common subsequence (LCS) problem, we're given two sequences X and Y and we want to find the maximum-length common subsequence Z.

Consider the example from the previous section with sequences R and S. The longest common subsequence is AGTTCA, a longer match than just the longest common substring.
2.5 Formulation 3: Sequence Alignment as Edit Distance
The previous LCS formulation is almost what we want. But so far we haven't specified any cost functions that differentiate between the different types of edit operations: insertions, deletions, and substitutions. Implicitly, our cost function has been uniform, implying that all operations are equally likely. But since substitutions are much more likely, we want to bias our LCS solution with a cost function that prefers substitutions over insertions and deletions.

So we recast sequence alignment as a special case of the classic Edit-Distance problem in computer science (CLRS 366). We add varying penalties for different edit operations to reflect biological occurrences. Of the four nucleotide bases, A and G are purines (larger, two fused rings), while C and T are pyrimidines (smaller, one ring). Thus RNA polymerase, the enzyme that transcribes the nucleotide sequence, is much more likely to confuse two purines or two pyrimidines, since they are similar in structure. Accordingly, we define the following scoring matrix:
Figure 2: Cost matrix for matches and mismatches.
Now the problem is to find the least expensive (as per the cost matrix) operation sequence transforming the initial nucleotide sequence into the final nucleotide sequence.
3 Dynamic Programming
Before we can tackle the problem algorithmically, we need to introduce a few computational tools.

Dynamic programming is a technique used to solve optimization problems by expressing the problem in terms of its subproblems. Its hallmarks are optimal substructure and overlapping subproblems. A problem exhibits optimal substructure if an optimal solution to the problem contains within it optimal solutions to subproblems (CLRS 339). Unlike in greedy algorithms, there is no single locally optimal choice we can make to reach a globally optimal solution. In dynamic programming, we need to trace back through several choices, and we keep track of these choices by storing results to subproblems in a table. By filling in this table, we build up a solution to the entire problem from the bottom up, by first solving increasingly larger subproblems.
3.1 Fibonacci Example
The Fibonacci numbers provide an instructive example of the benefits of dynamic programming. The Fibonacci sequence is recursively defined as F_0 = F_1 = 1, F_n = F_{n-1} + F_{n-2} for n ≥ 2. We wish to develop an algorithm to compute the n-th Fibonacci number.
3.2 Recursive (Top-Down) Solution
The simple top-down approach is to just apply the recursive definition. Here's a simple way to do this in Python:
# Assume n is non-negative
def fib(n):
if n == 0 or n == 1: return 1
else: return fib(n-1) + fib(n-2)
But this top-down algorithm runs in exponential time. If T(n) is how long it takes to compute the n-th Fibonacci number, we have that T(n) = T(n-1) + T(n-2), so T(n) = O(φ^n).⁵ The problem is that we are repeating work by solving the same subproblem many times.
Figure 3: The recursion tree for the fib procedure showing repeated subproblems.
⁵ Where φ is the golden ratio. In the slides it was claimed that T(n) = O(2^n); that bound is valid but loose, and nonetheless still exponential.
3.3 Dynamic Programming (Bottom-Up) Solution
For calculating the n-th Fibonacci number, instead of beginning with F(n) and using recursion, we can start computation from the bottom, since we know we are going to need all of the subproblems anyway. This way, we will save a lot of the repeated work that would be done by the top-down approach, and we will be able to compute the n-th Fibonacci number in O(n) time. Here's a sample implementation of this approach in Python:
def fib(n):
x = y = 1
for i in range(1, n):
z = x+y
x = y
y = z
return y
This method is optimized to use only constant space instead of an entire table, since we only need the answer to each subproblem once. But in general dynamic programming solutions, we want to store the solutions to subproblems in a table, since we may need to use them multiple times, and we should be able to look them up quickly. Such solutions look more like the sample implementation from lecture (tweaked to compile properly):
def fib(n):
fib_table = [1, 1] + (n-1)*[0] # initialize table with zeros
for i in range(2,n+1):
fib_table[i] = fib_table[i-1] + fib_table[i-2]
return fib_table[n]
3.4 Memoization Solution
Yet another solution would be to use memoization, which combines the top-down approach with the dynamic programming approach. In this approach, we still use the recursive formula to compute the sequence, but we also store the results of subproblems in a table so that we don't have to re-solve them. Here's a sample memoization implementation in Python:
def fib(n):
fib_table = [1, 1] + (n-1)*[0]
def get(i):
if fib_table[i] != 0:
return fib_table[i]
else:
fib_table[i] = get(i-1) + get(i-2)
return fib_table[i]
if n == 0 or n == 1: return 1
else: return get(n-1) + get(n-2)
4 Needleman-Wunsch: Dynamic Programming Meets Sequence Alignment

We will now use dynamic programming to tackle the harder problem of general sequence alignment. Given two strings S and T, we want to find the longest common subsequence, which may or may not contain gaps. Further, we want the optimal such longest common subsequence as defined by our scoring function in Figure 2. We will use the notation S = S_1 ... S_n and T = T_1 ... T_m when talking about individual bases in the sequences. Furthermore, we'll let d represent the gap penalty cost and s(x, y) represent the score of aligning an x and a y. The algorithm we will develop in the following sections to solve sequence alignment is known as the Needleman-Wunsch algorithm.
4.1 Step 1: Find a Subproblem
Suppose we have an optimal alignment for two sequences in which S_i matches T_j. The key insight is that this optimal alignment is composed of an optimal alignment between S_1 ... S_{i-1} and T_1 ... T_{j-1} and an optimal alignment between S_{i+1} ... S_n and T_{j+1} ... T_m. If not, we could cut and paste a better alignment in place of the suboptimal one to achieve a higher-scoring alignment, which contradicts the supposed optimality of our given alignment. Also note that the optimality score is additive, so we can obtain the score of the whole alignment by adding the scores of the alignments of the subsequences.
4.2 Step 2: Index Space of Subproblems
We need to parameterize the space of subproblems. We identify by F_{ij} the score of the optimal alignment of S_1 ... S_i and T_1 ... T_j. Having this parameterization allows us to maintain an m × n matrix F with the solutions (i.e., optimal scores) for all the subproblems.
4.3 Step 3: Make Locally Optimal Choices
We can compute the optimal solution for a subproblem by making a locally optimal choice based on the results from the smaller subproblems. Thus, we need to establish a recursive function that shows how the solution to a given problem depends on the solutions to its subproblems, and we use this recursive definition to fill up the table F in a bottom-up fashion.

We can consider the 4 possibilities (insert, delete, substitute, match) and evaluate each of them based on the results we have computed for smaller subproblems. To initialize the table, we set F_{0j} = d · j and F_{i0} = d · i, since those are the scores of aligning against j and i gaps, respectively.
Then we traverse the matrix row by row, computing the optimal score for each alignment subproblem by considering four possibilities, which are summarized pictorially in Figure 4:

• Sequence S has a gap at the current alignment position.
• Sequence T has a gap at the current alignment position.
• There is a mutation (nucleotide substitution) at the current position.
• There is a match at the current position.

Figure 4: The different cases to consider in making a locally optimal choice.
We then use the possibility that produces the maximum score. We express this mathematically by the recursive formula for F_{ij}:

F(i, j) = max {
    F(i-1, j) + d                 (insert gap in T)
    F(i, j-1) + d                 (insert gap in S)        (2)
    F(i-1, j-1) + s(S_i, T_j)     (match or mutation)
}
After traversing the matrix, the optimal score for the global alignment is given by F_{mn}. The traversal order needs to be such that we have solutions to given subproblems when we need them. Namely, to compute F_{ij}, we need to know the values to the left of, above, and diagonally above F_{ij} in the table. Thus we can traverse the table in row- or column-major order, or even diagonally from the top-left cell to the bottom-right cell. Now, to obtain the actual alignment we just have to remember the choices that we made at each step.
4.4 Step 4: Constructing an Optimal Solution
Note the duality between paths through the matrix F and sequence alignments. In evaluating each cell F_{ij} we make a choice in taking the maximum of the three possibilities. Thus the value of each (uninitialized) cell in the matrix is determined either by the cell to its left, the cell above it, or the cell diagonally above it. A match and a substitution are both represented as traveling in the diagonal direction; however, a different cost can be applied for each, depending on whether the two base pairs we are aligning match or not.

To construct the actual optimal alignment, we need to trace back through our choices in the matrix. It is helpful to maintain a pointer for each cell, while filling up the table, that shows which choice we made to get the score for that cell. Then we can just follow our pointers backwards to reconstruct the optimal alignment.
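Putting steps 1 through 4 together, here is a sketch of the full algorithm in the style of the earlier Python samples. The scoring in the usage line (+1 match, -1 mismatch, d = -2) is an illustrative assumption standing in for the matrix of Figure 2, not the lecture's exact values; note that d is negative, matching the "+d" convention of Equation (2):

def needleman_wunsch(S, T, s, d):
    # Global alignment: s(x, y) scores aligning bases x and y;
    # d (negative) is the gap score. Returns (score, aligned_S, aligned_T).
    n, m = len(S), len(T)
    F = [[0] * (m + 1) for _ in range(n + 1)]      # F[i][j]: score of S[:i] vs T[:j]
    ptr = [[None] * (m + 1) for _ in range(n + 1)] # traceback pointers
    for i in range(1, n + 1):
        F[i][0], ptr[i][0] = i * d, 'up'
    for j in range(1, m + 1):
        F[0][j], ptr[0][j] = j * d, 'left'
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            choices = [(F[i-1][j-1] + s(S[i-1], T[j-1]), 'diag'),  # match/mutation
                       (F[i-1][j] + d, 'up'),                      # gap in T
                       (F[i][j-1] + d, 'left')]                    # gap in S
            F[i][j], ptr[i][j] = max(choices)
    # Follow the pointers back from the bottom-right cell
    a, b, i, j = [], [], n, m
    while i > 0 or j > 0:
        if ptr[i][j] == 'diag':
            a.append(S[i-1]); b.append(T[j-1]); i -= 1; j -= 1
        elif ptr[i][j] == 'up':
            a.append(S[i-1]); b.append('-'); i -= 1
        else:
            a.append('-'); b.append(T[j-1]); j -= 1
    return F[n][m], ''.join(reversed(a)), ''.join(reversed(b))

# Illustrative scoring only: +1 match, -1 mismatch, gap d = -2
score, A, B = needleman_wunsch("ACGTCATCA", "TAGTGTCA",
                               lambda x, y: 1 if x == y else -1, -2)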
4.5 Analysis
The runtime analysis of this algorithm is very simple. Each update takes O(1) time, and since there are mn
elements in the matrix F , the total running time is O(mn). Similarly, the total storage space is O(mn).
For the more general case where the update rule is perhaps more complicated, the running time may be
more expensive. For instance, if the update rule requires testing all sizes of gaps (e.g. the cost of a gap is
not linear), then the running time would be O(mn(m + n)).
4.6 Optimizations
The algorithm we showed is much faster than the originally proposed strategy of enumerating alignments, and it performs well for sequences up to about 10 kilobases long. However, for whole-genome alignments the algorithm shown is not feasible, but we can make modifications to it to further improve its performance.
4.6.1 Diagonal
One possible optimization is to ignore Mildly Boring Alignments (MBA), that is, alignments that have too
many gaps. In other words, we can limit ourselves to stay within some distance W from the diagonal in the
matrix of subproblems. This will lead to a time/space cost of O((m + n)W ).
Note, however, that this strategy no longer guarantees that we will obtain the optimal alignment. If one
wants to maintain this guarantee, this strategy can still be used to obtain a lower bound on the optimal
score, and then use this bound to limit the space of the matrix that we have to cover.
4.6.2 Using stretches of high conservation
Suppose now that we can quickly find common stretches of very high conservation between the two sequences. We can then use these stretches to anchor the optimal alignment. We are then left with finding optimal alignments for the subsequences between these stretches, greatly reducing the size of the problem and thus the running time.

The question left, though, is how we can find those conserved stretches very fast. This question of finding local alignments will be the topic of the next lecture.
    F(i-1, j-1) + s(x_i, y_j)

Termination: Bottom right
To find local alignments, we should be able to skip nucleotides just to get to the segment of interest. Thus, we allow the algorithm to start and stop anywhere with no penalty:

Initialization: F(i, 0) = 0, F(0, j) = 0

Iteration: F(i, j) = max { 0, F(i-1, j) - d, F(i, j-1) - d, F(i-1, j-1) + s(x_i, y_j) }

Termination: Anywhere
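As a sketch (ours, with d a positive penalty as in the recursion above; this is the idea behind the Smith-Waterman algorithm), the local-alignment score needs only small changes to the global dynamic programming loop: every cell is floored at 0, the boundary is initialized to 0, and the answer is the maximum over all cells rather than the bottom-right corner:

def local_alignment_score(x, y, s, d):
    # Best local alignment score: F(i, j) floored at 0, answer = max cell.
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]   # F(i,0) = F(0,j) = 0
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(0,                                 # start afresh here
                          F[i-1][j] - d,                     # gap in y
                          F[i][j-1] - d,                     # gap in x
                          F[i-1][j-1] + s(x[i-1], y[j-1]))   # match/mismatch
            best = max(best, F[i][j])                        # terminate anywhere
    return best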
Another variation of alignments is semi-global alignment. The algorithm is as follows:

Initialization: F(i, 0) = 0

Iteration: F(i, j) = max { F(i-1, j) - d, F(i, j-1) - d, F(i-1, j-1) + s(x_i, y_j) }

Termination: Right column
Sometimes it can be costly in both time and space to run these alignment algorithms.
Therefore, there exist some algorithmic variations to save time/space that work well in
practice. However, the details are outside of the scope of this lecture.
1.2 Generalized gap penalties

Gap penalties determine the score calculated for a subsequence and thus which match is selected. Depending on the model, it could be a good idea to penalize differently for, say, gaps of different lengths. However, there is also a cost associated with using more complex gap penalty functions, such as a substantially slower running time. There exist nice approximations to such gap penalty functions.
2 Linear-time exact string matching

When looking for exact matches of a pattern, there is a set of algorithms and data structures that work well in some applications and another set that work well in others. Some that are good to know are the Z-algorithm, suffix trees, suffix arrays, etc. In this lecture we will talk about the Karp-Rabin algorithm.
2.1 Karp-Rabin Algorithm

The problem is as follows: in a text T of length n we are looking for a pattern P of length m. The key idea behind this algorithm is to interpret strings as numbers that can be compared in constant time. Translate the string P and the substrings of T of length m into numbers (x and y, respectively), and slide x along T at every offset until there is a match. This is O(n).

One might raise the objection that converting a substring of T of length m into a number y would take O(m), and therefore the total running time would be O(nm). However, we only need to do this once, since the number at the subsequent offset can be computed from the number at the previous offset using a few arithmetic operations: a subtraction to remove the high-order digit, a multiplication to shift the characters left, and an addition to append the low-order digit (see Figure 2). Therefore, it is only O(n + m).

Figure 2: The Karp-Rabin algorithm.
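Here is a minimal sketch of the rolling computation (names and the base-4 encoding are ours; the mod-q hashing discussed below is omitted for clarity). Treating each character as a digit in base b, the number for the next window is obtained with one subtraction, one multiplication, and one addition:

def karp_rabin_positions(T, P, b=4):
    # Offsets where P occurs in T, comparing base-b numeric encodings of P
    # and each length-m window of T. Assumes at most b distinct characters.
    m, n = len(P), len(T)
    digit = {c: i for i, c in enumerate(sorted(set(T + P)))}  # A,C,G,T -> 0..3
    x = 0
    for c in P:                      # encode P once: O(m)
        x = x * b + digit[c]
    y = 0
    for c in T[:m]:                  # encode the first window: O(m)
        y = y * b + digit[c]
    hits = []
    for i in range(n - m + 1):
        if x == y:
            hits.append(i)
        if i + m < n:                # roll the window in O(1)
            y = (y - digit[T[i]] * b**(m - 1)) * b + digit[T[i + m]]
    return hits

print(karp_rabin_positions("TAGTGTCA", "GT"))   # [2, 4]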
These ideas can be used to rapidly index consecutive W-mers in a lookup table or hash table, using the index of the previous string to compute the index of the next string in constant time, without having to read the entire string.
Also, there is always the risk that the numerical representations of x and y become large enough to exceed the processor word size, in which case there is no guarantee that the comparisons are constant time. To eliminate this, we use a hash (e.g., mod q) to map x and y to a smaller space. As a tradeoff, this is susceptible to collisions, which translate into false positive matches. However, there is a fix for that too.
3 The BLAST algorithm (Basic Local Alignment Search Tool)

Figure 3: The BLAST algorithm.
While Dynamic Programming (DP) is a nice way to find alignments, it will often be too slow. Since the DP is O(n²), matching two 3,000,000,000-length sequences would take about 9 × 10^18 operations. What we'll do is present BLAST, an alignment algorithm which runs in O(n) time. For sequences of length 3,000,000,000, this will be around 3,000,000,000 times faster. The key to BLAST is that we only actually care about alignments that are very close to perfect. A match of 70% is worthless; we want something that matches 95% or 99% or more. What this means is that correct (near-perfect) alignments will have long substrings of nucleotides that match perfectly. As a simple example, if we have a length-1000 sequence A and a second length-1000 sequence B which is identical to A except in 10 point mutations, then by the Pigeonhole Principle¹ there is a perfectly matching substring of length at least 90 (the 10 mutations split the remaining 990 bases into at most 11 runs, so some run has length at least 990/11 = 90). If we suppose that the mutations happen at random, the expected length of the longest perfectly matching substring will actually be considerably longer. In biology, the mutations that we find will not actually be distributed randomly, but will be clustered in nonfunctional regions of DNA while leaving untouched long stretches of functional DNA.

The other aspect of BLAST which allows us to speed up repeated queries is the ability to preprocess a large database of DNA off-line. After preprocessing, searching for a sequence of length m in a database of length n will take only O(m) time.

¹ The Pigeonhole Principle states that if we have n pigeons stuffed into k holes, then some hole must have at least n/k pigeons.
3.1 The BLAST algorithm
The steps are as follows:

1. Split the query into overlapping words of length W (the W-mers).
2. Find a neighborhood of similar words for each word (see below).
3. Look up each word in the neighborhood in a hash table to find where in the database each word occurs. Call these the seeds, and let S be the collection of seeds.
4. Extend the seeds in S until the score of the alignment drops below some threshold X.
5. Report matches with overall highest scores.
The pre-processing step of BLAST is to make sure that all substrings of W consecutive nucleotides are included in our database (or in a hash table²). These are called the W-mers of the database.
As in step 1, we first split the query by looking at all substrings of W consecutive nucleotides in the query. For protein BLAST, we then modify these to generate a neighborhood of similar words based on amino-acid similarity scores from the similarity matrix. We take progressively more dissimilar words in our neighborhood until our similarity measure drops below some threshold T. This allows us to still find matches that don't actually have W matching characters in a row, but which do have W very similar characters in a row.
We then look up all of these words in our hash table to find seeds of W consecutive matching nucleotides. We then extend these seeds to find our alignment. A very simple method for doing this, which works well in practice, is to simply greedily scan forward and stop as soon as the first discrepancy is found. Alternatively, one could use the DP alignment algorithm on just the DNA around the seed to find an optimal alignment in the region of the seed. Since this is a much shorter segment, this will not be as slow as running the DP algorithm on the entire DNA database.
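A toy sketch (ours) of the pre-processing step and the seed-and-extend loop with the greedy stop-at-first-discrepancy extension; it uses exact W-mers only (no neighborhood expansion) and makes no attempt to deduplicate overlapping hits:

def build_wmer_index(database, W):
    # Pre-processing: map every W-mer in the database to its offsets.
    index = {}
    for i in range(len(database) - W + 1):
        index.setdefault(database[i:i + W], []).append(i)
    return index

def blast_like_hits(query, database, index, W):
    # Seed with exact W-mer lookups, then greedily extend each seed in
    # both directions until the first mismatch.
    hits = []
    for q in range(len(query) - W + 1):
        for d in index.get(query[q:q + W], []):    # the seeds
            lo_q, lo_d = q, d
            while lo_q > 0 and lo_d > 0 and query[lo_q - 1] == database[lo_d - 1]:
                lo_q -= 1; lo_d -= 1               # extend left
            hi_q, hi_d = q + W, d + W
            while (hi_q < len(query) and hi_d < len(database)
                   and query[hi_q] == database[hi_d]):
                hi_q += 1; hi_d += 1               # extend right
            hits.append((lo_d, database[lo_d:hi_d]))
    return hits

db = "ACGTCATCAGGTACGTCA"
idx = build_wmer_index(db, W=4)
print(blast_like_hits("TACGTC", db, idx, W=4))  # overlapping seeds repeat regions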
Note that if the W-mers are small (6-10 nucleotides or 4-6 amino acids), or if all W-mers are well-populated (that is, 4^W < database size for nucleotides, or 20^W < database size for amino acids), then a simple table lookup can be more efficient than using a hash table (i.e., directly look up the address by interpreting the string as a number). A hash table becomes necessary when 4^W (or 20^W) is much greater than the genome/proteome, in which case the table is typically too large to fit in memory, and even if it did, most of the entries in a direct lookup table would be empty, wasting resources.
² Hash tables allow one to index data in (expected) constant time even when the space of possible data is too large to explicitly store in an array. The key to a hash table is a good hash function, which maps one's data to a smaller range. One can then explicitly index this smaller range with an array. For a good hash table, one needs a fast, deterministic hash function which has a reasonably uniformly distributed output regardless of the distribution of the input. One also needs policies for handling hash collisions. For more information on hashing, refer to the notes from 6.046.
3.2 Extensions to BLAST
Filtering: Low-complexity regions can cause spurious hits. For instance, if our query has a string of copies of the same nucleotide and the database has a long stretch of the same nucleotide, then there will be many useless hits. To prevent this, we can either try to filter out low-complexity portions of the query or we can ignore unreasonably over-represented portions of the database.

Two-hit BLAST: If there's a mutation in the middle of a long sequence, then we may not have a match with a long W-mer even when there is still a lot of matching DNA. Thus, we might look for a pair of nearby matching, but smaller, W-mers. Getting two smaller W-mers to match is more likely than one longer W-mer. This allows us to get a higher sensitivity with a smaller W, while still pruning out spurious hits. This means that we'll spend less time trying to extend matches that don't actually match. Thus, this allows us to improve speed while maintaining sensitivity.

Combs: Recall from your biology classes that the third nucleotide in a triplet usually doesn't actually have an effect on which amino acid is represented. This means that each third nucleotide in a sequence is less likely to be preserved by evolution, since it often doesn't matter. Thus, we might want to look for W-mers that look similar except in every third nucleotide. This is a particular example of a comb. A comb is simply a bit mask which represents which nucleotides we care about when trying to find matches. We explained above why 110110110... might be a good comb, and it turns out to be. However, other combs are also useful. One way to choose a comb is to just pick some nucleotides at random. This is called random projection.
4 The Statistics of Alignments
As described above, the BLAST algorithm uses a scoring (substitution) matrix to expand the list of W-mers to look for and to determine an approximately matching sequence during seed extension. But how do we construct this matrix in the first place? How do we determine the value of s(x_i, y_j)?

The most basic way to do this is to first do some simple(r) alignments by hand (alternately, we can ask an algorithm to do this with some randomized parameters and then refine those parameters iteratively). Once we have some sequences aligned, we can then look at the regions that are the most conserved. From these regions, we can derive some probabilities of a residue's replacement with another, i.e., a substitution matrix. Let's develop a more formal probabilistic model. (See slides on page 5.)
First, we have to step back and look at a probabilistic model of an alignment of two sequences. For this discussion, we're going to focus on protein sequences (although the reasoning is similar for DNA sequences).

Suppose we have aligned two sequences x and y. How likely is it that x and y are related through evolution? We have to look at the likelihood ratio of the alignment score:

P(score | related) / P(score | unrelated)
More formally, for unrelated sequences, the probability of having an n-residue alignment between x and y is a simple product of the probabilities of the individual sequences, since the residue pairings are independent. That is, with

x = {x_1 ... x_n},  y = {y_1 ... y_n},  q_a = P(amino acid a),

P(x, y | U) = ∏_{i=1}^{n} q_{x_i} · ∏_{i=1}^{n} q_{y_i}.
For related sequences, the residue pairings are no longer independent, so we must use a different joint probability:

p_ab = P(evolution gave rise to a in x and b in y),

P(x, y | R) = ∏_{i=1}^{n} p_{x_i y_i}.
Do some math and we get our desired expression for the likelihood ratio:

P(x, y | R) / P(x, y | U) = ∏_{i=1}^{n} p_{x_i y_i} / ( ∏_{i=1}^{n} q_{x_i} · ∏_{i=1}^{n} q_{y_i} ) = ∏_{i=1}^{n} p_{x_i y_i} / (q_{x_i} q_{y_i}).
Since we eventually want to compute a sum of scores, and probabilities require products, we take the log of the product to get a handy summation:

S ≡ log [ P(x, y | R) / P(x, y | U) ] = Σ_i log [ p_{x_i y_i} / (q_{x_i} q_{y_i}) ] = Σ_i s(x_i, y_i),

where

s(x_i, y_i) = log [ p_{x_i y_i} / (q_{x_i} q_{y_i}) ].
The above expression for s(x_i, y_i) is then used to crank out a substitution matrix like BLOSUM62 (page 6, slide 2). Incidentally, the BLOSUM62 matrix captures both chemical similarity and evolutionary similarity, which gives more information than a single expert system can. For example (in the slide), we see a positive score for matches between aspartate (D) and glutamate (E), which are chemically similar. Additionally, the matrix shows tryptophan (W) matching with itself has a higher score than isoleucine (I) matching with itself, i.e., s(I, I) < s(W, W). This is primarily because tryptophan occurs in protein sequences less frequently than isoleucine, meaning s(W, W) has a smaller denominator, but it may also mean that tryptophan is more highly conserved in general.
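To make the recipe concrete, here is a small sketch (ours) that turns residue pairs observed in trusted alignments into log-odds scores s(a, b) = log(p_ab / (q_a q_b)); the toy counts are invented purely for illustration:

import math
from collections import Counter

def substitution_scores(aligned_pairs):
    # s(a, b) = log( p_ab / (q_a * q_b) ), with p and q estimated by counting
    pairs = Counter()
    singles = Counter()
    for a, b in aligned_pairs:
        pairs[(a, b)] += 1
        pairs[(b, a)] += 1          # count both orders so p_ab is symmetric
        singles[a] += 1
        singles[b] += 1
    n_pairs = sum(pairs.values())
    n_single = sum(singles.values())
    p = {ab: c / n_pairs for ab, c in pairs.items()}
    q = {a: c / n_single for a, c in singles.items()}
    return {(a, b): math.log(p[(a, b)] / (q[a] * q[b])) for (a, b) in p}

# Invented counts: D-E pairs deliberately enriched, W rare but conserved
toy = ([('A', 'A')] * 50 + [('D', 'D')] * 30 + [('E', 'E')] * 30 +
       [('D', 'E')] * 40 + [('W', 'W')] * 5 + [('A', 'D')] * 2 +
       [('D', 'W')] * 1)
s = substitution_scores(toy)
print(s[('D', 'E')] > 0, s[('W', 'W')] > s[('D', 'D')])   # True True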
So now that we're given two sequences and an alignment score, how do we determine whether the score is statistically significant? That is, how can we determine whether the two sequences are related by evolution and not simply by chance? In particular, a large database size increases the probability of finding spurious matches. In 1990, Karlin and Altschul determined that for two random sequences, the expected number of alignments with a score of at least S is E(S) = K m n e^(-λS), which we can use to correct for false positives.
6.047/6.878 Fall 2008 Lecture #5
1. Overview
In the previous lecture we looked at examples of unsupervised learning techniques, such as clustering. In this lecture we focus on the use of supervised learning techniques, which use the inherent structure of the data to predict something about the state or class we should be in. There are two different approaches to classification: (1) generative models, which leverage probabilistic models that generate the data, and (2) discriminative models, which use an appropriate function to tease apart the data. Naïve Bayes classifiers are an example of generative models and Support Vector Machines (SVMs) are an example of discriminative models. We will discuss actual biological applications of each of these models, specifically the use of Naïve Bayes classifiers to predict mitochondrial proteins across the genome and the use of SVMs for the classification of cancer based on gene expression monitoring by DNA microarrays. The salient features of both techniques and caveats of using each technique will also be discussed.
We will discuss classification in the context of the problem of identifying mitochondrial proteins. If we look across the genome, how do we determine which proteins are involved in mitochondrial processes, or more generally, which proteins are targeted to the mitochondria?¹ This is particularly useful because if we know the mitochondrial proteins, we can start asking interesting questions about how these proteins mediate disease processes and metabolic functions. The classification of these mitochondrial proteins involves the use of 7 features for all human proteins: (1) targeting signal, (2) protein domains, (3) co-expression, (4) mass spectrometry, (5) sequence homology, (6) induction, and (7) motifs.

¹ Mitochondria play an important role in metabolic processes and are known as the powerhouse of the cell. They represent an ancient bacterial invader that has been successfully subsumed by eukaryotic cells. Mitochondria have their own genetic material, commonly referred to as mtDNA. Interestingly, a large fraction of inherited mitochondrial disorders are due to nuclear genes encoding specific proteins targeted to the mitochondria and not due to mutations in mtDNA itself. Hence it is imperative to identify these mitochondrial proteins to shed light on the molecular basis for these disorders.
In general, our approach will be to determine how these features are distributed for objects of different
classes. The classification decisions will then be made using probability calculus. For our case with 7
features for each protein (or more generally, each object), it is likely that each object is associated with
more than one feature. To simplify things let us consider just one feature and introduce the two key
concepts.
(1) We want the model for the probability of a feature given each class. So the features from each
class are drawn from what are known as class-conditional probability distributions (CCPD).
(2) We model prior probabilities to quantify the expected a priori chance of seeing a class. For
the case of identifying mitochondrial proteins, we would like to have a low prior probability
of seeing the mitochondrial class. In this way we can bias ourselves against classifying
mitochondrial proteins. (But how do we actually determine these prior probabilities? We will
cover this in section 2.2).
A good example of how our second concept affects classification was discussed in lecture. We discussed
the example of Eugene. Eugene likes to read books in philosophy and has a penchant for tweed jackets.
The question posed was whether Eugene is more likely (in probabilistic terms) to be an ivy-league dean or
a truck driver. The answer was that he was more likely to be a truck driver, due to the fact that there are
many more truck drivers in the world than there are ivy-league deans. This overwhelming probability
outweighs the fact that ivy-league deans are more likely to enjoy books on philosophy and tweed jackets
than truck drivers. This prior knowledge influences our decision in classification, and we can use this prior information, as the probability of belonging to a certain class P(Class), to make our classifications.
Now that we have the prior P(Class) and the likelihood of a given feature, P(feature | Class), we can simply apply Bayes' rule to determine the posterior P(Class | feature), since we are interested in classification based on features.
P(Class | feature) = P(feature | Class) · P(Class) / P(feature)

This is the application of Bayes' rule to our problem, with the probabilities defined as shown above. More generally, we can think of the probabilities as

posterior = (likelihood × prior) / evidence,

where the posterior is our belief after the evidence, the likelihood evaluates the evidence, and the prior is our belief before the evidence.
We can take log probabilities to build a discriminant function, classifying into Class 1 when

log [ P(feature | Class 1) · P(Class 1) / ( P(feature | Class 2) · P(Class 2) ) ] > 0.
Classifiers require a set of labeled data points. The number of data points within this data set directly impacts the quality of the classifier; however, the number of data points necessary to build a training set is largely dependent on the problem to be solved. The importance of the training set is to correctly estimate the feature-given-class likelihoods. This can be done using a multinomial distribution.
We also need P(Class) to complete the implementation of a Bayes classifier. There are three general approaches:

(1) Estimate the priors by counting and dividing by the total, to obtain the fraction of classes in a training set (again, this emphasizes the quality of the training set data).
(2) Estimate from expert knowledge: we might have information from actual experimental work, or reasoning about the distribution from higher-level data.
(3) Assume no prior knowledge, so all classes are equally likely.
Going back to our original example of mitochondrial proteins, we were looking at 7 features and not just a single feature. This makes things much more complicated and exemplifies the curse of dimensionality. We would require huge amounts of data to be able to implement the classifier. Most likely we would end up with zero probability for a certain combination of features for a particular class, which might not be an accurate representation of the real probabilities. We can get around this by assuming the features are independent (this forms the crux of a Naïve Bayes approach). Once we assume independence, we can now take the product of all the conditional probabilities and we only have to estimate the individual distributions for each feature. So we now have

P(f1, f2, ..., fN | Class) = P(f1 | Class) · P(f2 | Class) · · · P(fN | Class),

where f1 represents feature 1. Similarly, the discriminant function becomes

log [ P(f1, ..., fN | Class 1) · P(Class 1) / ( P(f1, ..., fN | Class 2) · P(Class 2) ) ] > 0.
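A minimal sketch (ours) of a two-class Naïve Bayes classifier implementing the discriminant above over discrete features; the toy feature encoding is invented, and we use add-one smoothing, a standard guard against the zero-probability problem just mentioned:

import math
from collections import Counter, defaultdict

class NaiveBayes:
    # Two-class Naive Bayes over discrete (here binary) feature values.
    # Assumes feature independence given the class.
    def fit(self, X, y):
        self.priors = {c: n / len(y) for c, n in Counter(y).items()}
        self.cond = defaultdict(Counter)   # (class, feature index) -> value counts
        self.class_n = Counter(y)
        for features, c in zip(X, y):
            for i, v in enumerate(features):
                self.cond[(c, i)][v] += 1
        return self

    def discriminant(self, features):
        # log[P(f|C1)P(C1)] - log[P(f|C2)P(C2)]; > 0 means class 1
        c1, c2 = sorted(self.priors)
        g = math.log(self.priors[c1]) - math.log(self.priors[c2])
        for i, v in enumerate(features):
            # add-one smoothing keeps unseen values from giving log(0)
            p1 = (self.cond[(c1, i)][v] + 1) / (self.class_n[c1] + 2)
            p2 = (self.cond[(c2, i)][v] + 1) / (self.class_n[c2] + 2)
            g += math.log(p1) - math.log(p2)
        return g

# Toy data: each row is (targeting_signal, coexpression), class 'mito' or 'other'
X = [(1, 1), (1, 0), (0, 1), (0, 0), (0, 0), (0, 0)]
y = ['mito', 'mito', 'mito', 'other', 'other', 'other']
model = NaiveBayes().fit(X, y)
print(model.discriminant((1, 1)) > 0)   # True: classified as 'mito'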
The classifier must be tested on a different set (termed the labeled test set) than the training set. Apparently, a large number of publications continue to test their classifiers solely with their training data sets (resulting in skewed performance metrics). To perform a robust classification we classify each object in the test set and count the number of each type of error. In the case of a binary classification (i.e., an object either belongs to Class 1 or Class 2), we have the following four types of errors:

(1) True positive (TP)
(2) True negative (TN)
(3) False positive (FP)
(4) False negative (FN)

The frequency of these errors can be encapsulated in performance metrics of a classifier, which are defined as follows.

(1) Sensitivity determines how good we are at classifying: what fraction of the objects actually in Class 1 did we classify as Class 1?

Sensitivity = TP / (TP + FN)

(2) Specificity determines how tolerant we are to misclassifications: what fraction of the objects not in Class 1 did we correctly leave out of Class 1?

Specificity = TN / (TN + FP)
The higher the sensitivity (think of it as minimizing FN) and the higher the specificity (think of it as minimizing FP), the better the classifier. However, achieving a high score in one of these areas (and not the other) doesn't amount to much. Consider the classification of mitochondrial proteins: we can classify all data points as mitochondrial proteins and get 100% sensitivity but 0% specificity. The MAESTRO algorithm achieves a 99% specificity and a 71% sensitivity for the classification of mitochondrial proteins.
In all, we have defined Bayesian classifiers, shown how to determine all the probabilities required to implement a Bayesian classifier, and discussed a particular implementation that assumes independence of features (Naïve Bayes classifiers). The final section describes the specifics of applying classification to determine mitochondrial proteins.
2.6 MAESTRO Mitochondrial Protein Classification
Calvo et al. (2006) sought to construct high-quality predictions of human proteins localized to the mitochondrion by generating and integrating data sets that provide complementary clues about mitochondrial localization. Specifically, for each human gene product p, they assign a score s_i(p), using each of the following seven genome-scale data sets: targeting signal score, protein domain score, cis-motif score, yeast homology score, ancestry score, coexpression score, and induction score (details of the meaning and content of each of these data sets can be found in the manuscript). Each of these scores (s_1 to s_7) can be used individually as a weak genome-wide predictor of mitochondrial localization. Each method's performance was assessed using large gold-standard curated training sets: 654 mitochondrial proteins (T_mito) maintained by the MitoP2 database and 2,847 nonmitochondrial proteins (T_~mito) annotated to localize to other cellular compartments. To improve prediction accuracy, the authors integrated these seven approaches using a naïve Bayes classifier that was implemented as a program called MAESTRO.
When MAESTRO was applied across the human proteome, 1451 proteins were predicted as mitochondrial proteins and 450 novel protein predictions were made. As mentioned in the previous section, the MAESTRO algorithm achieves a 99% specificity and a 71% sensitivity for the classification of mitochondrial proteins, suggesting that even with the assumption of feature independence, Naïve Bayes classification techniques can prove extremely powerful for large-scale (i.e., genome-wide) classification.
The previous section looked at using probabilistic (or generative) models for classification; this section looks at using discriminative techniques. In essence, can we run our data through a function to determine its structure? Such discriminative techniques avoid the inherent cost involved in generative models, which might require more information than is actually necessary.

Support vector machine techniques essentially involve drawing a vector that is perpendicular to the line optimally separating the data. This vector is determined from a set of support vectors among the individual data points. Interestingly, the vector can be determined from the scalar dot-products of these support vectors, so even if we go to higher dimensions, we move from points in multi-dimensional space to just a simple dot-product (allowing a simpler implementation for higher-order data).

When the data are not linearly separable, we can use a kernel function (essentially a transform) to map to a higher-dimensional space where the data are linearly separable. The beauty of this technique is that, since the SVM technique only requires the dot-product of the support vectors, we can work directly with the kernel mapping matrix and not with the actual data points. There are a number of transformations (such as polynomial and sigmoidal) that serve as good kernel mapping functions and are further explored in the next section.
3.1 Non-Linear Classification
What if the data are not linearly separable? One way to deal with this is just to try to find the best linear classifier for the data, and accept mistakes. Another, completely different, way to deal with this comes from the fact that the more dimensions there are, the more likely it is that the data set is linearly separable. In a certain sense, for any finite data set, as the number of dimensions goes to infinity, the probability that the data is separable goes to 1.
How can we take advantage of this? Consider the two classes of one-dimensional data {-5, -4, -3, 3, 4, 5} and {-2, -1, 0, 1, 2}. This data is clearly not linearly separable, and the best separation boundary we can find might be x > -2.5.

Now, consider applying the transformation x → (x, y = x²). The data can now be written as new pairs: {-5, -4, -3, 3, 4, 5} → (-5, 25), (-4, 16), (-3, 9), (3, 9), (4, 16), (5, 25), and {-2, -1, 0, 1, 2} → (-2, 4), (-1, 1), (0, 0), (1, 1), (2, 4). This data is separable by the rule y > 6.5, and in general the more dimensions we transform to, the more separable the data become.

An alternate way of thinking of this problem is to transform the classifier back into the original low-dimensional space. In this particular example, we would get the rule x² > 6.5, which would bisect the number line at two points. In general, the higher the dimensionality of the space we transform to, the more complicated a classifier we get when we transform back to the original space.
3.2 Kernels
We see that SVMs are dependent only on the dot products of the vectors. So, if we call our transformation Φ, for two vectors we only care about the value of Φ(u_1) · Φ(u_2). The trick with the use of kernels consists of realizing that for certain transformations Φ, there exists a function K(u_1, u_2) such that

K(u_1, u_2) = Φ(u_1) · Φ(u_2).

In the above relation, the right-hand side is the dot product of vectors with very high dimension, but the left-hand side is a function of two vectors with lower dimension. In our previous example of mapping x → (x, y = x²), we get

K(x_1, x_2) = (x_1, x_1²) · (x_2, x_2²) = x_1 x_2 + (x_1 x_2)².

Now we need not actually apply the transformation: we can do all our calculations in the lower-dimensional space, but get all the power of using a higher dimension.

Example kernels are the following:

(1) Linear kernel: K(u_1, u_2) = u_1 · u_2, which represents the trivial mapping Φ(x) = x.

(2) Polynomial kernel: K(u_1, u_2) = (1 + u_1 · u_2)^n, which was used in the previous example; if we had instead used the transformation Φ(x) = (1, sqrt(2)·x, x²), we would get the more polynomial-looking kernel Φ(u_1) · Φ(u_2) = (1 + u_1 u_2)².

(3) Radial basis kernel: K(u_1, u_2) = exp(-γ |u_1 - u_2|²). This transformation is actually from a point u to a function (which can be thought of as a point in Hilbert space) in an infinite-dimensional space. So what we're actually doing is transforming our training set into functions, and combining them to get a decision boundary. The functions are Gaussians centered at the input points.

(4) Sigmoid kernel: K(u_1, u_2) = tanh[γ(u_1 · u_2) + r]. Sigmoid kernels have been popular for use in SVMs due to their origins in neural networks (e.g., sigmoid kernel functions are equivalent to two-level perceptron neural networks). It has been pointed out in previous work (Vapnik 1995) that the kernel matrix may not be positive semi-definite for certain values of the parameters γ and r. The sigmoid kernel has nevertheless been used in practical applications (see Schölkopf 1997).
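A quick numerical check (ours) that the polynomial kernel of item (2) really equals a dot product in the transformed space, using the mapping Φ(x) = (1, sqrt(2)·x, x²):

import math

def phi(x):
    return (1.0, math.sqrt(2) * x, x * x)   # explicit feature map

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def K(x1, x2):
    return (1 + x1 * x2) ** 2               # kernel: no mapping needed

for x1, x2 in [(-3, 2), (0.5, 4), (1, 1)]:
    assert abs(K(x1, x2) - dot(phi(x1), phi(x2))) < 1e-9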
One of the caveats of transforming the input data using a kernel is the risk of over-fitting (or over-classifying) the data. More generally, the SVM may generate so many feature vector dimensions that it does not generalize well to other data. To avoid over-fitting, cross-validation is typically used to evaluate the fitting provided by each parameter set tried during the grid or pattern search process. In the radial-basis kernel, you can essentially increase the value of γ until each point is within its own classification region (thereby defeating the classification process altogether). SVMs generally avoid this problem of over-fitting due to the fact that they maximize margins between data points.

When using difficult-to-separate training sets, SVMs can incorporate a cost parameter C to allow some flexibility in separating the categories. This parameter controls the trade-off between allowing training errors and forcing rigid margins. It can thereby create a soft margin that permits some misclassifications. Increasing the value of C increases the cost of misclassifying points and forces the creation of a more accurate model that may not generalize well.

Can we use just any function as our kernel? The answer to this is provided by Mercer's Condition, which provides us an analytical criterion for choosing an acceptable kernel. Mercer's Condition states that a kernel K(x, y) is a valid kernel if and only if the following holds (Burges 1998): for any g(x) such that ∫ g(x)² dx is finite,

∫∫ K(x, y) g(x) g(y) dx dy ≥ 0.

In all, we have defined SVM discriminators and shown how to perform classification with appropriate kernel mapping functions that allow performing computations in a lower dimension while capturing all the information available at higher dimensions. The next section describes the application of SVMs to the classification of tumors for cancer diagnostics.

3.3 Tumor Classifications with SVMs

A generic approach for classifying two types of acute leukemias, acute myeloid leukemia (AML) and acute lymphoid leukemia (ALL), was presented by Golub et al. (1999). This approach centered on effectively addressing three main issues:

(1) whether there were genes whose expression pattern to be predicted was strongly correlated with the class distinction (i.e., can ALL and AML be distinguished);
(2) how to use a collection of known samples to create a class predictor capable of assigning a new sample to one of two classes;
(3) how to test the validity of their class predictors.
They addressed (1) by using a neighbourhood analysis technique to establish whether the observed correlations were stronger than would be expected by chance. This analysis showed that roughly 1100 genes were more highly correlated with the AML-ALL class distinction than would be expected by chance. To address (2), they developed a procedure that uses a fixed subset of informative genes (chosen based on their correlation with the class distinction of AML and ALL) and makes a prediction based on the expression level of these genes in a new sample. Each informative gene casts a weighted vote for one of the classes, with the weight of each vote dependent on the expression level in the new sample and the degree of that gene's correlation with the class distinction. The votes are summed to determine the winning class. To address (3), they effectively tested their predictor by first testing by cross-validation on the initial data set and then assessing its accuracy on an independent set of samples. Based on their tests, they were able to identify 36 of the 38 samples (which were part of their training set!) and all 36 predictions were clinically correct. On the independent test set, 29 of 34 samples were strongly predicted with 100% accuracy and 5 were not predicted.

An SVM approach to this same classification problem was implemented by Mukherjee et al. (2001). The output of a classical SVM is a binary class designation. In this particular application it is particularly important to be able to reject points for which the classifier is not confident enough. Therefore, the authors introduced a confidence interval on the output of the SVM that allows for rejection of points with low confidence values. As in the case of Golub et al., it was important for the authors to infer which genes are important for the classification. The SVM was trained on the 38 samples in the training set and tested on the 34 samples in the independent test set (exactly as in the case of Golub et al.). The authors' results are summarized in the following table (where |d| corresponds to the cutoff for rejection):

Genes   Rejects   Errors   Confidence Level   |d|
7129    3         0        ~93%               0.1
40      0         0        ~93%               0.1
5       3         0        ~92%               0.1

These results are a significant improvement over previously reported techniques, suggesting that SVMs play an important role in the classification of large data sets (such as those generated by DNA microarray experiments).

4. Semi-supervised learning

Semi-supervised learning deals with the situation in which one has only a few labeled data points and a large number of unlabeled data points; one approach is clustering of the data first, followed by the classification of the generated clusters.
Viterbi decoding calculates

π̂_i = i-th position of argmax_π P(X, π),

whereas posterior decoding calculates

π̂_i = argmax_k P(π_i = k | X),

where

P(π_i = k | X) = (1 / P(X)) · Σ_{π : π_i = k} P(X, π).

The term Σ_{π : π_i = k} P(X, π) expresses the sum of the probabilities of all paths producing X that pass through state k in position i. Since P(X) is independent of k, dividing by it does not affect the argmax, so we can write

π̂_i = argmax_k Σ_{π : π_i = k} P(X, π).
Recall that P(X) was calculated by the forward algorithm using the recursion for

f_k(i) = P(x_1 ... x_i, π_i = k)

given on the right-middle slide of page 3. A simple way to calculate Σ_{π : π_i = k} P(X, π) would be to use this same recursion, except at the i+1 step, instead of summing over all states, just use state k. The problem with this is that if we wanted to calculate π̂ we would need to do this calculation for all i and all k.
The key observation is that

Σ_{π : π_i = k} P(X, π) = P(π_i = k, X) = P(x_1 ... x_i, π_i = k) · P(x_{i+1} ... x_N | π_i = k) = f_k(i) · b_k(i),

so we can obtain all of these sums at once by running the forward and backward algorithms and multiplying the values in the corresponding cells! Thus, computing the optimal π̂_i amounts only to testing each of the K possible states and choosing the one with maximum probability.
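A sketch (ours) of posterior decoding built from the forward and backward recursions; the toy promoter/background parameters are made up, and a real implementation would rescale or work in log space to avoid numerical underflow:

def posterior_decode(X, states, init, a, e):
    # pi_hat[i] = argmax_k f_k(i) * b_k(i), via forward f and backward b.
    # init[k], a[k][l], e[k][x] are the HMM parameters.
    N = len(X)
    # forward: f[i][k] = P(x_1..x_i, pi_i = k)
    f = [{k: init[k] * e[k][X[0]] for k in states}]
    for i in range(1, N):
        f.append({l: e[l][X[i]] * sum(f[i-1][k] * a[k][l] for k in states)
                  for l in states})
    # backward: b[i][k] = P(x_{i+1}..x_N | pi_i = k)
    b = [None] * N
    b[N-1] = {k: 1.0 for k in states}
    for i in range(N - 2, -1, -1):
        b[i] = {k: sum(a[k][l] * e[l][X[i+1]] * b[i+1][l] for l in states)
                for k in states}
    # posterior choice at each position
    return [max(states, key=lambda k: f[i][k] * b[i][k]) for i in range(N)]

# Toy 2-state promoter/background model with made-up parameters
states = ['P', 'B']
init = {'P': 0.5, 'B': 0.5}
a = {'P': {'P': 0.9, 'B': 0.1}, 'B': {'P': 0.1, 'B': 0.9}}
e = {'P': {'A': 0.1, 'C': 0.4, 'G': 0.4, 'T': 0.1},
     'B': {'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25}}
print(''.join(posterior_decode("ACGCGGTATA", states, init, a, e)))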
Which of Viterbi or posterior decoding is better for real applications? The answer depends partly on how likely the Viterbi optimal path is (i.e., on how much of the probability mass is concentrated on it compared to other paths): if the optimal path is only marginally better than alternatives, posterior decoding is probably more useful. Additionally, the application itself plays a crucial role in determining the algorithm to apply. For example, if a biologist had identified a possible promoter region but wanted to be more sure before running experiments, he or she might wish to apply posterior decoding to estimate the likelihood (taking into account all possible paths) that the region of interest was indeed a promoter region. There are other examples, however, in which Viterbi decoding is more appropriate. Some applications even combine the two algorithms.
3 Encoding memory in an HMM
Before moving on to discussing learning, we pause to describe an application of HMMs
requiring a new trick: incorporating memory into a Markov model by increasing the
number of states. First we present a motivating biological example.
Recall that in the previous lecture, we built an HMM with two states: a high-CG
promoter state with elevated probabilities of emitting Cs and Gs, and a background state
with uniform probabilities of emitting all nucleotides. Thus, our previous model, which we
will call HMM1, characterized promoters simply by abundance of Cs and Gs. However, in
reality the situation is more complicated. As we shall see, a more accurate model, which we
call HMM2, would instead monitor the abundance of CpG pairs, i.e., C and G nucleotides on
the same strand (separated by a phosphate on the DNA backbone, hence the name CpG).
Note that we call these CpGs simply to avoid confusion with bonded complementary CG
bases (on opposing strands).
Some biological background will shed light on the reason for introducing HMM2. In the genome, not only the sequence of bases but also their methylation states determine the biological processes that act on DNA. That is, a methyl group can be added to a nucleotide, thus marking it for future reference. Proteins that later bind to the DNA (e.g., for transcription or replication purposes) notice such modifications that have been made to particular bases. To elaborate on one interesting example, methylation allows error-checking in the DNA replication process. Recall that replication is performed by unzipping the DNA polymer and then rebinding complementary bases to each strand, resulting in a pair of new double-stranded DNA helices. If, during this process, a discrepancy is found between an old strand of DNA and a new strand (i.e., a pair of bases is not complementary), then the old (correct) base is identified by its being methylated. Thus, the new, unmethylated base can be removed and replaced with a base actually complementary to the original one.

Now, a side effect of methylation is that in a CpG pair, a methylated C has a high chance of mutating to a T. Hence, dinucleotide CpGs are rare throughout the genome. However, in active promoter regions, methylation is suppressed, resulting in a greater abundance of CpG dinucleotides. Such regions are called CpG islands. These few-hundred to few-thousand base-long islands thus allow us to identify promoters with greater accuracy than classifying by abundance of Cs and Gs alone.
All of this raises the question of how to encode dinucleotides in a Markov model, which by definition is memoryless: the next state depends only on the current state via predefined transition probabilities. Fortunately, we can overcome this barrier simply by increasing the number of states in our Markov model (thus increasing the information stored in a single state). That is, we encode in a state all the information we need to remember in order to know how to move to the next state. In general, if we have an "almost" HMM that determines transition probabilities based not only on the single previous state but also on a finite history (possibly of both states and emissions), we can convert this to a true memoryless HMM by increasing the number of states.

In our particular example, we wish to create HMM2 such that dinucleotide CpGs have high frequency in promoter states and low frequency in the background. To do this, we use eight states rather than just two, with each state encoding both the promoter/background state as well as the current emission. We thus need to specify an 8 × 8 table of transition probabilities. (Note, however, that each emission probability from a given state is now either 0 or 1, since the state already contains the information of which nucleotide to emit.) Details of this construction are included in the lecture slides. (But note that on the middle-right slide on page 4 it incorrectly states that the emission probabilities are distinct for the + and - states. It is actually the transition probabilities that are different; the emission probabilities are just 0 and 1 in both cases.)
Although an 8-state model would normally have 64 transition probabilities, the slides show only 32; transition probabilities between + and - states are not included. Since transitions to and from CpG islands are relatively rare, it would require a very large amount of training data to infer what all of these probabilities are. A simplification is to assume that all transition probabilities from a + state to a - state are the same, regardless of the nucleotides between which we are transitioning, and to similarly assume a single probability for transitions from a - to a + state. In that case we would need these two transition parameters in addition to the 16 shown on the slides. If our assumptions turn out to be biologically incorrect, then more parameters would be needed to have an accurate model.
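A sketch (ours) of the state-expansion trick under the simplification just described: each HMM2 state is a (region, nucleotide) pair, emissions are deterministic 0/1, within-region moves follow two 4 × 4 dinucleotide tables, and the rare region switches are governed by just two shared parameters:

def build_hmm2(trans_plus, trans_minus, p_out, p_in):
    # States are pairs like ('+', 'C'). trans_plus[x][y] is the CpG-island
    # dinucleotide table, trans_minus the background one. Emission
    # probabilities are 0/1: the state records which nucleotide it emits.
    nts = 'ACGT'
    states = [(r, x) for r in '+-' for x in nts]
    trans = {}
    for (r1, x) in states:
        for (r2, y) in states:
            if r1 == '+' and r2 == '+':
                trans[((r1, x), (r2, y))] = (1 - p_out) * trans_plus[x][y]
            elif r1 == '-' and r2 == '-':
                trans[((r1, x), (r2, y))] = (1 - p_in) * trans_minus[x][y]
            elif r1 == '+':
                trans[((r1, x), (r2, y))] = p_out / 4   # one shared leave-island parameter
            else:
                trans[((r1, x), (r2, y))] = p_in / 4    # one shared enter-island parameter
    emit = {(r, x): {y: 1.0 if y == x else 0.0 for y in nts} for (r, x) in states}
    return states, trans, emit

# Toy usage with uniform within-region tables; rows must sum to 1
uniform = {x: {y: 0.25 for y in 'ACGT'} for x in 'ACGT'}
states, trans, emit = build_hmm2(uniform, uniform, p_out=0.01, p_in=0.005)
assert abs(sum(trans[(('+', 'C'), s2)] for s2 in states) - 1.0) < 1e-9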
During lecture someone pointed out that the transition probabilities given in the table at the middle-left slide of page 4 are different depending on which strand you are moving along. For example, when one strand transitions from T to C, the other transitions from G to A, yet the numbers in the table are quite different (.355 and .161). This seemed to imply that a different HMM is needed to predict CpG islands depending on which strand you are moving along. However, this reasoning was not correct. For a single HMM to work on either strand, it must produce the same frequency of TpCs on one strand as GpAs on the other. However, the frequency of TpCs is not simply the probability of transitioning from T to C. Rather, this transition probability must be multiplied by the frequency of T. We calculate, based on the transition probabilities in the + table, that this HMM would produce As and Ts with frequency about 15% each and Cs and Gs with frequency about 35% each. This calculation is in Appendix 1. When this is taken into account, the probabilities on the two strands are within around 7% of each other. For example, the frequency of TpCs is 0.15 × 0.355 ≈ 0.053, while the frequency of GpAs is 0.35 × 0.161 ≈ 0.056. We will chalk this difference up to sampling error.
4 Learning
The final topic of this lecture is learning, i.e., having a machine figure out how to model data. The word "learning" must be taken with a grain of salt. We do not give a machine raw data and nothing else and expect it to figure out how to model the data from scratch. Rather, we must use our intuition and knowledge of biology to come up with the general framework of the model. Then we ask the machine to choose parameters for our model. More specifically, given a training sequence and an HMM with known layout but unspecified parameters, how does one infer the best choice of parameters? This general class of problems splits naturally into two subclasses based on whether or not we know the hidden states in our training data.
For example, if we have a hidden Markov model to distinguish coding regions of genes from other stretches of DNA, we might train our model with a sequence in which coding genes have already been identified using other means. We would supply this information for supervised learning. On the other hand, we might be working with a DNA sequence for an organism that has not had any genes classified and that we expect would have different parameter values from any that have already been classified. In that case, we could use unsupervised learning to infer the parameters without our having to specify the genes in our training data.
4.1 Supervised learning
In this case we assume we are given the layout of the HMM, an emitted sequence X, and additionally the actual sequence π of Markov states that produced X. In other words, our training data is labeled. We wish to find the transition and emission probabilities a and e maximizing the probability P(X, π) of generating the sequence X and labels π.
This version of learning is easy: since we already know the path of states the Markov chain took, our intuition suggests that we should choose parameters proportional to the observed frequencies of the corresponding events occurring. With π_i denoting the state at position i, k and l denoting states, and x denoting a base, we optimally set
    e(k, x) = (# of times π_i = k and x emitted) / (# of times π_i = k)

and

    a(k, l) = (# of times π_{i-1} = k and π_i = l) / (# of times π_{i-1} = k).
An example is given in the slide at the top-right of page 7. The counts are slightly different depending on whether you consider the start and end to be separate states. If we ignore the start and end then, for example, a(B, P) = 0.25, since of the 4 B states (we are excluding the final one) there is exactly 1 that transitions to a P state. If we considered the end to be a separate state, then a(B, P) would be 0.2, since we consider the final B state as transitioning to the end state. Similarly, we calculate e(B, C) = 0.4 because 2 of the 5 B states emit a C.
We can use Lagrange multipliers to verify our intuition that the transition and emission probabilities a and e obtained by counting maximize the probability P(X, π) of generating the sequence X and labels π. This calculation is done in Appendix 2.
In addition, the initial state probabilities of the HMM need to be specified. We choose these as the total frequency of each state in the training data.
A problem with the method of setting probabilities using our training data is that counts of some transitions and emissions might be 0 even though the actual probability is small but non-zero. If we do not correct this, our HMM will consider these transitions and emissions to be impossible, which could significantly distort the results. To fix this we add pseudocounts, which are small non-zero counts. For example, we could set these counts to be 1, or we could make an estimate based on the a priori probabilities.
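As an illustration, here is a minimal Python sketch (not from the lecture) of supervised parameter estimation by counting, with a uniform pseudocount of 1 added to every transition and emission:

    from collections import Counter

    def train_supervised(xs, pis, states, alphabet, pseudo=1.0):
        # Count observed transitions and emissions along the labeled path.
        trans = Counter(zip(pis, pis[1:]))
        emit = Counter(zip(pis, xs))
        a, e = {}, {}
        for k in states:
            # Pseudocounts keep unseen events at small non-zero probability.
            t_tot = sum(trans[(k, l)] + pseudo for l in states)
            e_tot = sum(emit[(k, x)] + pseudo for x in alphabet)
            for l in states:
                a[(k, l)] = (trans[(k, l)] + pseudo) / t_tot
            for x in alphabet:
                e[(k, x)] = (emit[(k, x)] + pseudo) / e_tot
        return a, e

    # Example with a B/P labeling (hypothetical five-base sequence):
    a, e = train_supervised("CACGT", "BBPBB", states="BP", alphabet="ACGT")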
4.2 Unsupervised learning
Now we assume we are given the HMM layout and sequence X and wish to find parameters a, e maximizing P(X) = Σ_π P(X, π). Note that X is now unlabeled: that is, we have no knowledge of π to begin with.
The idea is that we start with some initial guess for the parameter values, use those parameter values to obtain values (or probabilities) of the hidden states, use those to calculate new parameter values, and iterate until convergence.
Our initial guess for the parameters might be guided by other knowledge. For example, when working with a DNA sequence we might start with parameter values from a related species for which we have an annotated genome (for which parameters could be obtained using supervised learning). Or, we might just use arbitrary initial parameters. (The slide on Viterbi training on page 8 describes initialization as "Same". This is a reference to initialization in the Baum-Welch algorithm slide on page 10.)
The lecture presented two methods of implementing unsupervised learning, Viterbi training and the Baum-Welch algorithm. Although Baum-Welch is theoretically more sound, Viterbi training is simpler, so we discuss it first.
Rather than maximizing P(X) = Σ_π P(X, π), at each step Viterbi training uses π*, the most-likely path. Since there are other paths, and π* usually only captures some of the probability, this is not the same as maximizing P(X), which would require looking at all paths.
In Viterbi training we use the Viterbi algorithm to find π* given the current parameter values. We then use this path to estimate the coefficients a, e in the same way we used the known path in supervised learning, by counting emissions and transitions from each state. These counts, adjusted by pseudocounts, are then used to calculate new parameter values, and we repeat until convergence.
Side note: posterior decoding can be used instead of the Viterbi algorithm, provided you use non-zero parameters so that all transitions are legal. This still uses only one path for counting, but it is π̂ instead of π*.
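Schematically, one round of Viterbi training then looks like the following Python sketch, where viterbi() is a hypothetical helper returning the most likely path under the current parameters, and the re-estimation reuses the counting routine sketched above:

    def viterbi_training(xs, states, alphabet, a, e, n_iters=20):
        # Iterate: decode the best path, then re-estimate by counting.
        for _ in range(n_iters):
            pi_star = viterbi(xs, states, a, e)   # hypothetical decoder
            a, e = train_supervised(xs, pi_star, states, alphabet)
        return a, e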
In supervised learning and in Viterbi training we count frequencies in a single path to
estimate our parameters a and e. But for particular parameter values the model determines
probabilities for every state at each position, not just a single state. How can we take all of
these probabilities into account when estimating our parameters, rather than just a single
value at each position? That is what the Baum-Welch algorithm does.
The basic framework of the Baum-Welch algorithm is the same as for Viterbi training. In both cases we start with an initial guess for parameter values, use those to gather information about the hidden states, and then use that information to calculate new parameter values. The difference is that rather than determining the counts of transitions and emissions, A_kl and E_k(b), by counting actual transitions and emissions in a particular path, we do so by adding up probabilities over all possible paths.
For A_kl, instead of counting the number of places where state k is immediately followed by state l, we add up, for all positions i, the probability that position i is in state k and position i + 1 is in state l. To do this we must add up the probabilities of all paths that pass through states k and l at positions i and i + 1, respectively.
This is very similar to the sum we calculated in posterior decoding, except we are specifying the states at two consecutive positions instead of one. As in posterior decoding it uses the f and b calculated by the forward and backward algorithms, but instead of just taking f_k(i) · b_k(i), it must take f_k(i) · b_l(i + 1) times an intermediate term that accounts for the transition from position i to position i + 1. This calculation is shown on the lower-left slide of page 9. These terms must be summed over all positions i to calculate A_kl. A similar method is used to calculate E_k(b), again using f and b.
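Written out, with a_kl and e_l the current transition and emission parameters, these are the standard Baum-Welch expected counts (the slides' exact notation may differ slightly):

    A_kl = Σ_i f_k(i) · a_kl · e_l(x_{i+1}) · b_l(i+1) / P(X)
    E_k(b) = Σ_{i : x_i = b} f_k(i) · b_k(i) / P(X)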
A result of Baum and Welch guarantees convergence. A formal proof can be obtained using the principle of expectation maximization.
Although this method is guaranteed to converge, it could converge to a local rather than a global maximum. To work around this, it is recommended that the whole procedure be repeated many times with different initial guesses for the unknown parameters, keeping the run that converges to the best result.
5 Appendix 1
In the discussion of CpG islands above, there was a need to calculate the frequencies of states given the transition probabilities. We can do that as follows. Suppose the probability of transitioning from state k to state l is given by a_kl and the probability of being in state k is p_k. We get to be in a particular state k at some position in the sequence by being at some state l at the previous position in the sequence and then transitioning to state k. In terms of probabilities, that means that p_k = Σ_l a_lk p_l. In other words, p is an eigenvector, with eigenvalue 1, of the transpose of the transition matrix. Calculating this eigenvector for the + matrix of transitions for CpG islands (slides page 4) yields an eigenvector of approximately (1, 2.35, 2.35, 1) times an arbitrary constant. Adjusting to make the probabilities add up to 1 gives p = (0.15, 0.35, 0.35, 0.15).
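This stationary distribution is easy to check numerically. A small sketch with NumPy, using the CpG-island + table as printed in Durbin et al., which appears to match the slides (swap in the slides' exact values if they differ):

    import numpy as np

    # + transition table over (A, C, G, T); rows sum to 1.
    A = np.array([[0.180, 0.274, 0.426, 0.120],
                  [0.171, 0.368, 0.274, 0.188],
                  [0.161, 0.339, 0.375, 0.125],
                  [0.079, 0.355, 0.384, 0.182]])

    vals, vecs = np.linalg.eig(A.T)            # p solves A^T p = p
    p = np.real(vecs[:, np.argmax(np.real(vals))])
    p /= p.sum()                               # normalize to probabilities
    print(p)                                   # approx (0.15, 0.35, 0.35, 0.15)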
6 Appendix 2
We now use Lagrange multipliers to verify that during supervised learning the transition and emission probabilities a and e obtained by counting maximize the probability P(X, π) of generating the sequence X and labels π.
We are looking for

    argmax_{a,e} P(X, π) = argmax_{a,e} Π_i e_{π_i}(x_i) · a_{π_{i-1} π_i}
                         = argmax_{a,e} [ Σ_i log(e_{π_i}(x_i)) + Σ_i log(a_{π_{i-1} π_i}) ].

Since the first term in the sum only depends on e, and the second only depends on a, we are looking for

    argmax_e Σ_i log(e_{π_i}(x_i))

and

    argmax_a Σ_i log(a_{π_{i-1} π_i}).

We will verify our intuitive answer only for the second sum, but the same method works for the first one as well.
Let

    C_kl = # of times π_{i-1} = k and π_i = l.

Then we can rewrite our expression for a as

    argmax_a Σ_{k,l} C_kl log(a_kl).

Fix a particular value of k. Since state k always transitions to exactly one next state, we know that Σ_l a_kl = 1. We want to find the values of a_{k1}, a_{k2}, ..., a_{kK} that maximize Σ_l C_kl log(a_kl) given this constraint. From the method of Lagrange multipliers, we know this maximum occurs at a local extremum of Σ_l C_kl log(a_kl) - λ Σ_l a_kl for some value of λ. Taking derivatives with respect to a_kl for each l and setting them all to 0, we get

    C_kl / a_kl = λ, so a_kl = C_kl / λ.

Since Σ_l a_kl = 1, we have λ = Σ_l C_kl, and therefore

    a_kl = C_kl / Σ_{l'} C_{kl'} = a(k, l),

as was to be demonstrated.
The substitution matrices for the Jukes-Cantor and Kimura models, respectively, take the forms

    r s s s        r s u u
    s r s s        s r u u
    s s r s        u u r s
    s s s r        u u s r

where s = (1/4)(1 - e^{-4αt}), u = (1/4)(1 + e^{-4αt} - 2e^{-2(α+β)t}), and r is chosen so that each row sums to 1.
From distances to trees
Given our generative Markov models and corresponding (time-dependent) substitution matrices, we are now ready to determine a distance matrix. The elements d_ij of this distance matrix are a measure of the distance between two well-aligned sequences. One possibility is to define the distance matrix (d_ij) in terms of the fraction of sites f where two sequences x_i and x_j are mismatched:

    d_ij = -(3/4) log(1 - 4f/3).

This model blows up when f = 0.75, which places a threshold on the amount of mismatch between two sequences.
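A direct Python sketch of this distance (jc_distance is a hypothetical helper written from the formula above):

    import math

    def jc_distance(seq_a, seq_b):
        # Jukes-Cantor distance from the fraction of mismatched sites f.
        assert len(seq_a) == len(seq_b)
        f = sum(x != y for x, y in zip(seq_a, seq_b)) / len(seq_a)
        if f >= 0.75:
            raise ValueError("distance undefined for f >= 0.75")
        return -0.75 * math.log(1 - 4 * f / 3)

    print(jc_distance("ACGTACGT", "ACGTACGA"))  # one mismatch in eight sites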
To then use this distance matrix to measure actual distances between any pair of sequences (i.e., to construct a tree), there are two standard approaches: 1) ultrametric distances infer equidistant paths from any leaf to the root, while 2) additive distances dictate that all pairwise distances are obtained by traversing a tree. Ultrametric trees are rarely valid, since they assume a uniform rate of evolution, while additive distances are a less restrictive model. In practice, distance matrices are neither ultrametric nor additive.
The distance matrix and tree duality also means that distances can be inferred by traversing the tree. If ultrametric distances are used, we have a guarantee of finding the correct tree by minimizing the discrepancy between observed distances and tree-based distances. On the other hand, if additive distances are used, neighbor-joining is guaranteed to find the correct tree. The tree-building algorithm can then be thought of as fitting data to constraints.
4 Tree building algorithms
4.1 UPGMA
UPGMA (Unweighted Pair Group Method with Arithmetic Mean) is the simplest example of a
tree-building algorithm. It involves a hierarchical clustering algorithm that begins at the leaves
of the tree and works it way up to the root. As input, it takes a distance matrix and produces
an ultrametric tree (ie. in accordance with the molecular clock hypothesis of equal evolutionary
rates among species). Only when the input distance matrix is ultrametic is the UPGMA algorithm
guaranteed to form the correct tree. If, on the other hand, the distance matrix is additive, there is
no guarantee that the pairwise branch distances equal those specied in the distance matrix.
ALGORITHM:
1. Begin by initializing all sequences each to its own cluster. These will form the leaves of the
tree.
2. Find the pair of sequences with minimum distance from the distance matrix D. This pair
forms the rst cluster, and we can draw the rst part of the tree by linking the pair. (example:
from D, we nd seqA and seqB have the minimum distance of 10. We draw the tree by linking
seqA and seqB, with branch length of 5 (hence a total distance of 10 between them)).
3. Update matrix D: a new row and column representing seq_AB is added to D. The distance between seq_AB and another seq_C would be (1/2)(d_AC + d_BC). Remove the rows and columns associated with seqA and seqB. Note, overall, the matrix has shrunk by 1 row and 1 column. From this point on, we can completely forget about seqA and seqB, and assume we just have seq_AB.
4. Repeat steps 2 and 3 until matrix D is empty. (A sketch of the whole procedure follows below.)
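A compact Python sketch of UPGMA under these conventions, using the simple (1/2)(d_AC + d_BC) update from step 3 and returning only the topology (branch lengths omitted for brevity):

    def upgma(D, leaves):
        # D: dict mapping frozenset of two cluster labels -> distance.
        clusters = list(leaves)
        while len(clusters) > 1:
            # Step 2: find the closest pair of clusters.
            pairs = [(i, j) for i in range(len(clusters))
                     for j in range(i + 1, len(clusters))]
            i, j = min(pairs, key=lambda p: D[frozenset((clusters[p[0]],
                                                         clusters[p[1]]))])
            a, b = clusters[i], clusters[j]
            merged = (a, b)                  # new internal node
            # Step 3: mean distance from the merged cluster to the rest.
            for c in clusters:
                if c is not a and c is not b:
                    D[frozenset((merged, c))] = 0.5 * (D[frozenset((a, c))]
                                                       + D[frozenset((b, c))])
            clusters = [c for c in clusters if c is not a and c is not b]
            clusters.append(merged)
        return clusters[0]                   # nested-tuple tree topology

    # Usage: D = {frozenset(('A','B')): 10., frozenset(('A','C')): 18.,
    #             frozenset(('B','C')): 18.}
    # upgma(D, ['A', 'B', 'C'])  ->  ('C', ('A', 'B'))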
4.2 Neighbor-Joining
The more sophisticated N-J algorithm is used to generate phylogram trees in which branch lengths
are proportional to evolutionary rates. The N-J algorithm is guaranteed to produce the correct tree
if the input distance matrix is additive and may still produce a good tree even when the distance
matrix is not additive.
ALGORITHM:
1. Make a new matrix M from the distance matrix D with the same dimensions:

    M_ij = D_ij - (Σ_k (D_ik + D_jk)) / (N - 2),

where N is the number of sequences. This is the adjusted distance metric, which ensures M_ij is minimal iff i and j are neighbors (see proof in Durbin, p. 189).
2. (Similar to UPGMA step 2.) Find the pair of sequences with minimum distance in the new matrix M. This pair forms the first cluster, and we can draw the first part of the tree by linking the pair. (Example: from M, we find seqA and seqB have the minimum distance. We draw the tree by linking seqA and seqB via a new node U. The branch length from A to U is calculated as

    D_AU = (1/2) D_AB + (Σ_k (D_Ak - D_Bk)) / (2(N - 2)).

Also, D_BU = D_AB - D_AU.)
3. (Similar to UPGMA step 3.) Update matrix D. A new row and column representing node U is added to D. The distance between U and another seqC would be (1/2)(d_AC + d_BC - d_AB). Remove the rows and columns associated with seqA and seqB. Note, overall, the matrix has shrunk by 1 row and 1 column. From this point on, we can completely forget about seqA and seqB, and assume we just have node U.
4. Repeat steps 1 through 3 until matrix D is empty. (A sketch of step 1 follows below.)
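For step 1, the adjusted matrix is a one-liner per entry; a small sketch with D as a NumPy array (the example values are illustrative, not from the lecture):

    import numpy as np

    def nj_adjusted_matrix(D):
        # M_ij = D_ij - (r_i + r_j) / (N - 2), where r_i = sum_k D_ik.
        N = D.shape[0]
        r = D.sum(axis=1)
        M = D - (r[:, None] + r[None, :]) / (N - 2)
        np.fill_diagonal(M, np.inf)   # never pair a cluster with itself
        return M

    D = np.array([[0., 5., 9., 9.], [5., 0., 10., 10.],
                  [9., 10., 0., 8.], [9., 10., 8., 0.]])
    i, j = np.unravel_index(np.argmin(nj_adjusted_matrix(D)), D.shape)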
4.3 Parsimony
Another approach to building trees is parsimony, a method that doesn't rely on distance matrices but instead on sequence alignments. Parsimony finds the tree that explains the observed sequences with the minimal number of substitutions. The algorithm entails two computational subproblems: 1) finding the parsimony cost of a given tree, and 2) searching through all tree topologies. The first subproblem is straightforward, while the second is computationally very challenging and best approached with Monte Carlo methods. As there is no closed-form solution, and hence no optimality criterion while searching through all topologies, a heuristic search is appropriate. Parsimony uses dynamic programming in scoring and traceback to determine ancestral nucleotides.
4.3.1 Scoring
The parsimony scoring algorithm is akin to dynamic programming as it makes local assignments at each step: penalizing sequence mismatches while assigning no score for sequence matches.
Initialization: Set cost C = 0; k = 2N - 1.
Iteration:
If k is a leaf, set R_k = x_k[u].
If k is not a leaf:
Let i, j be the daughter nodes;
Set R_k = R_i ∩ R_j, if the intersection is nonempty;
Set R_k = R_i ∪ R_j, and C += 1, if the intersection is empty.
Termination: Minimal cost of the tree for column u = C.
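A runnable sketch of this scoring pass (Fitch's algorithm) for one alignment column, with trees as nested tuples and leaves as single characters:

    def fitch_score(tree):
        # Returns (candidate set R_k, cost C) for one alignment column.
        if isinstance(tree, str):          # leaf: R_k = observed nucleotide
            return {tree}, 0
        left, right = tree
        R_i, C_i = fitch_score(left)
        R_j, C_j = fitch_score(right)
        if R_i & R_j:                      # nonempty intersection: no charge
            return R_i & R_j, C_i + C_j
        return R_i | R_j, C_i + C_j + 1    # empty: take union, cost += 1

    tree = (('A', 'C'), ('A', 'G'))
    R, C = fitch_score(tree)   # R = {'A'}, C = 2 substitutions in this column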
4.3.2 Traceback
The traceback routine to find ancestral nucleotides involves finding a path through the tree from the leaf nodes to an ancestral nucleotide (which may or may not lead to the root of the tree). It can be summarized as follows:
If the intersection of two sets (A and B) is empty, the ancestor is A or B, with equal cost.
If the intersection of two sets is non-empty, the ancestor is the intersection, with minimal cost.
This method determines a non-unique path with a minimum assignment of substitutions for a given tree topology, given hidden internal/intermediate nodes (which may correspond to extinct species).
4.3.3 Bootstrapping
A shortcut to building trees based on parsimony is to build trees using only one column of the multiple sequence alignment matrix. If this is repeated many times and a histogram is tabulated of the particular trees constructed, the most likely tree can easily be found. The advantage of such an approach is obvious: reduced complexity, at the expense of accuracy.
4.4 Maximum Likelihood
Maximum likelihood methods combine statistical models with observed evolutionary data. They can be used to predict a host of interesting and realistic features - character and rate analysis, sequences of (hypothetical) extinct species - but do so at the expense of increased complexity.
    P(a_1, ..., a_n) = (n! / θ^(n)) Π_{j=1}^n (θ/j)^{a_j} / a_j!,

where θ^(n) = θ(θ + 1) ... (θ + n - 1).
The main purpose of the Ewens sampling formula is to get an expected site frequency spectrum (Slide 6, Page 2). This formula allows us to consider questions like how many singletons we expect given that we sample 100 individuals from an infinite population.
4. Retrospective approach
4.1. Coalescent Theory. Coalescent theory was first introduced by Kingman in the early 1980s. Coalescent theory is a sample-based approach to population genetics and is specifically concerned with making inferences under a rigorous statistical framework. The word "coalescent" refers to the occurrence, backward in time, of a common ancestor between a pair of lineages that are both ancestral to a present-day sample. The coalescent is a backward-time stochastic process that models the ancestry of a sample of size n, back to the most recent common ancestor (MRCA) of the entire sample. The coalescent makes detailed predictions about patterns of genetic variation in the sample. All polymorphisms in the sample can then be demonstrated by putting mutation events on different branches of the coalescent tree (Slide 1, Page 4). The expected time to the MRCA of a population is approximately T = 4N generations in the past, but in practice the MRCA of a 20-individual sample can be very close to that of the total population. The ancestry of a sample of finite size n is composed of n - 1 independent, exponentially-distributed coalescence times, each ending with a coalescent event between a random pair of lineages. When there are i ancestral lineages, the rate of coalescence is i(i - 1)/2 and the expected time to a coalescent event is 2/(i(i - 1)).
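These expectations compose additively into the expected time to the MRCA; a quick sketch (time in the conventional units of 2N generations, so the large-n limit of 2 units corresponds to the 4N generations quoted above):

    def expected_tmrca(n):
        # Sum of expected coalescence times while i lineages remain,
        # in units of 2N generations: E[T_i] = 2 / (i * (i - 1)).
        return sum(2.0 / (i * (i - 1)) for i in range(2, n + 1))

    print(expected_tmrca(20))   # 1.9 units, i.e., already close to 4N generations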
4.2. Tests of selection within species. We construct three polymorphism summary statistics:
(1) S, the number of segregating sites in the sample;
(2) π, the average number of pairwise differences;
(3) ξ_i, the number of sites that divide the sample into i and n - i sequences.
These statistics can be used to estimate the important statistic θ (Slide 6, Page 4). Under neutrality, all of the estimates of θ should be equal. However, selection, population structure, and other evolutionary forces could change the allelic frequency spectrum in a population and lead to differences in these estimates. Based on this inference, three selection tests were proposed by Tajima and by Fu and Li, respectively (Slide 1, Page 5). In these tests, the neutral expectation is used as the null hypothesis (Slide 3, Page 5), and an alternative hypothesis (positive selection, balancing selection, population structure/subdivision, or population expansion) is tested (Slides 4, 5, 6, Page 5 and Slide 1, Page 6). For example, positive selection would lead to an excess of singletons, while balancing selection would lead to a depletion of singletons, doublets, and triplets. However, population subdivision would also yield a profile similar to balancing selection, and population expansion will cause a signal similar to positive selection. Therefore, these tests do not provide direct evidence of selection.
Several other tests have been developed that try to improve on the ideas of Tajima's D and Fu and Li's statistics. The HKA test, for example, measures the polymorphism and divergence in two loci, and tests whether there is an excess in one of the two classes (Slides 2, 3, Page 6). The MK test, on the other hand, measures the synonymous and nonsynonymous polymorphism and divergence in one locus, and tests whether there is an excess in one class (Slides 4, 5, Page 6). Hence, the MK test requires information from only a single locus, but polymorphism data from 2 species.
4.3. Rate-based selection metric. To determine if a certain gene is under selection, we may use the rate-based selection metric. Recall that an amino acid is determined by a codon of three adjacent nucleotides, and that different codons, such as CGC and CGA, can represent the same amino acid, arginine in this case. A synonymous mutation leaves the amino acid sequence intact, while a non-synonymous mutation changes it.
We use various methods, notably PAML, to compute d_N, the rate at which non-synonymous mutations occur, and d_S, the rate at which synonymous mutations occur. We define the rate-based selection metric to be d_N/d_S. There are three cases of interest:
(1) d_N/d_S < 1, purifying selection
(2) d_N/d_S = 1, neutral expectation
(3) d_N/d_S > 1, positive selection
In general, genes performing essential roles, such as those encoding hemoglobin, undergo purifying selection, because changes tend to disrupt the gene's function and cause harm. Genes that perform peripheral functions can undergo positive selection if some change is useful, but not necessary for survival. For instance, changes to a mammal's hair coloring could provide useful camouflage.
Rate-based tests could be affected if synonymous sites are not completely neutral. There is growing evidence for the use of preferred codons for certain amino acids instead of the other equivalent codons. This phenomenon is known as codon bias. There is evidence for a positive correlation between d_N/d_S and dispensability and gene length, and a negative correlation with expression level, protein abundance, codon bias, number of protein-protein interactions, and centrality in interaction networks.
1. There are efficient learning algorithms to compute the parameters of the model and efficient inference algorithms to predict hidden states;
2. It is easy to incorporate diverse evidence;
3. We can build on the best existing HMMs for gene calling.
A linear chain CRF is shown in the last slide on page 4 of [2]. This is a very simple CRF model which enables us to do efficient training and inference. In this model, the input data (DNA sequence, BLAST hits, ESTs, etc.) is denoted collectively by X. The hidden state labels (exon, intron, etc.) are denoted by Y. We arrange all the labels in a linear chain and assign a set of feature functions to every clique {y_{i-1}, y_i, X}, denoted f_j(y_{i-1}, y_i, X), where j = 1, ..., J. We also assign a weight λ_j to each feature function and finally define the conditional probability to be

    P(Y|X) = (1/Z(X)) exp( Σ_{j=1}^J Σ_{i=1}^N λ_j f_j(y_i, y_{i-1}, X) ),

where the normalization factor Z(X) is given by

    Z(X) = Σ_Y exp( Σ_{j=1}^J Σ_{i=1}^N λ_j f_j(y_i, y_{i-1}, X) ).

Doing so guarantees that

    Σ_Y P(Y|X) = 1,

as required by the semantics of probabilities.
The basic idea underlying this formalism is that the feature functions {f_j} return real values on pairs of labels and input data that we think are important for determining P(Y|X). We may not know how the conditional probability is changed by a feature function, or what its dependence on other evidence is. However, we model this dependency with the weights {λ_j} and learn these weights to maximize the likelihood of the training data. Finally, we normalize the probability to ensure that P(Y|X) sums to one over all possible Y's. The advantage of this formalism is that it captures the conjunction between labels and input data directly, without resort to P(X).
There are three fundamental steps in applying CRF models to gene prediction: design, inference, and learning. In the following sections we will look at these three steps in detail.
CRF Design
Selecting feature functions is the core issue in CRF applications. Technically speaking, feature functions can be arbitrary functions that return real values for some pair of labels (y_i, y_{i-1}) and the input X. However, we want to select feature functions that capture constraints or conjunctions of label pairs (y_i, y_{i-1}) and the input X that we think are important for P(Y|X).
Commonly-used feature functions include:
- an indicator function, where f_j(y_i, y_{i-1}, X) = 1 for a certain clique {y_i, y_{i-1}, X} and 0 otherwise;
- a sum, product, etc. over labels and input data;
- some probability distribution over cliques {y_i, y_{i-1}, X}; and so on.
As an interesting example, we can see two feature functions in slides 4 and 5 on page 5 of [2]. We are in the situation that both BLAST and HMMer evidence suggest labeling y_i as exon. This would be hard to handle in HMMs but is very easy in a CRF model: we simply include a BLAST feature function and a HMMer feature function and assign two weights λ_BLAST and λ_HMMer to these two feature functions. Thus, we do not need to worry about the dependence between the BLAST and HMMer evidence. There is no requirement that evidence represented by feature functions be independent. The underlying reason for this is that we do not model P(X) in CRF models. All we care about is which evidence contributes to the conditional probability P(Y|X). The weights will determine the (relative) extent to which each set of evidence contributes and interacts.
In practice, there may be thousands or even millions of arbitrary indicator feature functions to incorporate into our CRF model. A naive approach is to try all of them via brute-force search, which is obviously impractical. A much better strategy is to take advantage of all the research on HMMs during the past decade by using the best HMM as our starting point. That is, we start with feature functions derived from the best HMM-based gene prediction algorithms. To see why this strategy works, one may look at slides 3, 4, 5, and 6 on page 6 of [2]. As shown there, an HMM is simply a special case of a CRF with all the feature function weights λ = 1 and each feature function taking the special form

    f_HMM(y_i, y_{i-1}, x_i) = log( P(y_i | y_{i-1}) · P(x_i | y_i) )
                             = log( P(y_i | y_{i-1}) ) + log( P(x_i | y_i) )
                             = f_HMM-Transition + f_HMM-Emission,

where we define f_HMM-Transition as

    f_HMM-Transition(y_i, y_{i-1}, x_i) = log( P(y_i | y_{i-1}) )

and define f_HMM-Emission by

    f_HMM-Emission(y_i, y_{i-1}, x_i) = log( P(x_i | y_i) ).
Thus, we see that HMMs are just a very special form of the CRF model. As a result, adding new evidence to an existing HMM becomes simple, because we can add arbitrary feature functions to the CRF version of our HMM and then learn the weights of these new feature functions empirically. These weights will capture the impact of incorporating the new features into our original HMM model.
HMMs and linear chain CRFs explore the same family of conditional distributions P(Y|X), and we can convert between HMMs and linear chain CRFs. In fact, HMMs and CRFs form a generative-discriminative pair, as demonstrated in slide 3 on page 7 of [2]. An important reason for adopting linear chain CRFs is that the underlying probability distribution is a factoring distribution, reducing the computational complexity of probability distributions from exponential to linear time. Another reason is efficient inference, as we see in the next section.
CRF Inference
Recall that our ultimate goal in gene prediction is, when given a sequence of DNA and some evidence (denoted collectively by X), to select the best labeling Y. As with HMMs, the most natural choice for the best labeling is simply the most probable labeling, given by

    argmax_Y P(Y|X) = argmax_Y (1/Z(X)) exp( Σ_{j=1}^J Σ_{i=1}^N λ_j f_j(y_i, y_{i-1}, X) ).

But of course we cannot afford to score every possible Y. The chain structure of the linear chain CRF plays a crucial role in this respect, since it guarantees the sub-optimality structure of P(Y|X), just the same as in the HMMs we have seen before. Therefore, following the notation we used in the Viterbi algorithm, let v_k(i) be the probability of the most likely path passing through position i and ending at state k; then we have the recurrence

    v_k(i) = max_ℓ [ v_ℓ(i - 1) · exp( Σ_{j=1}^J λ_j f_j(k, ℓ, X) ) ].

This recurrence enables us to compute the most probable labeling by dynamic programming, as in the Viterbi algorithm for HMMs.
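A minimal Python sketch of this recurrence, in log space so the products become sums (labels, feature_fns, and weights are assumed inputs; each feature function is called as f(y_i, y_prev, X, i), with the position made explicit):

    def crf_viterbi(X, labels, feature_fns, weights):
        # v[k] = best log-score of any labeling of x_1..x_i ending in label k.
        N = len(X)
        v = {k: 0.0 for k in labels}
        backptrs = []
        for i in range(1, N):
            new_v, ptr = {}, {}
            for k in labels:
                scores = {l: v[l] + sum(w * f(k, l, X, i)
                                        for f, w in zip(feature_fns, weights))
                          for l in labels}
                ptr[k] = max(scores, key=scores.get)
                new_v[k] = scores[ptr[k]]
            v = new_v
            backptrs.append(ptr)
        # Trace back the most probable labeling.
        y = [max(v, key=v.get)]
        for ptr in reversed(backptrs):
            y.append(ptr[y[-1]])
        return list(reversed(y))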
In fact, to see the relation between CRFs and HMMs more explicitly, as shown in slides 2, 3, and 4 on page 8 of [2], we can define the quantity ψ_HMM for HMMs by

    ψ_HMM(y_i, y_{i-1}, X) = P(y_i | y_{i-1}) P(x_i | y_i),

and define an analogous quantity for generic CRFs by

    ψ_CRF(y_i, y_{i-1}, X) = exp( Σ_j λ_j f_j(y_i, y_{i-1}, x_i) ).

Thus, we can rewrite all the HMM equations (Viterbi, forward, backward, etc.) in terms of ψ_HMM, and then replace ψ_HMM with ψ_CRF, to get a set of analogous equations for CRF models.
CRF Learning
Suppose we are given an independent and identically distributed (i.i.d.) training set {(X^(k), Y^(k))}, which is usually a set of manually curated gene sequences for which all nucleotides are labeled. Define

    P_λ(Y|X) = (1/Z_λ(X)) exp( Σ_{j=1}^J Σ_{i=1}^N λ_j f_j(y_i, y_{i-1}, X) ),

where

    Z_λ(X) = Σ_Y exp( Σ_{j=1}^J Σ_{i=1}^N λ_j f_j(y_i, y_{i-1}, X) ).
We would like to choose λ = (λ_1, ..., λ_J) to maximize L(λ) = Π_{k=1}^K P_λ(Y^(k) | X^(k)), that is, the product of the conditional probabilities of all the training data. For numerical stability (to avoid computational underflow), we instead maximize ℓ(λ) = log L(λ), the log-likelihood of the data; we have

    ℓ(λ) = Σ_{k=1}^K Σ_{j=1}^J Σ_{i=1}^N λ_j f_j(y_i^(k), y_{i-1}^(k), X^(k)) - Σ_{k=1}^K log Z_λ(X^(k)).

As one can show, ℓ(λ) is concave and therefore is guaranteed to have a global maximum. The global maximum is obtained where ∂ℓ(λ)/∂λ = 0. As we see from a homework problem, at the maximum
    Σ_{k} Σ_{i} f_j(y_i^(k), y_{i-1}^(k), X^(k)) = Σ_{k} Σ_{i} Σ_Y f_j(y_i, y_{i-1}, X^(k)) P_model(Y | X^(k)).

Note that the left-hand side of the above equation is the actual count in the training data, while the right-hand side corresponds to the expected count under the CRF-modeled distribution. This nice feature guarantees that, using maximum likelihood training, the characteristics of the training data must also hold in our CRF model.
However, the feature functions are usually given in black-box form and in general are not invertible. Therefore we have no way to find the global maximum assignment of weights in analytical closed form. We can instead rely on a gradient search algorithm, outlined below.
Gradient Search Algorithm for Computing λ
1. Define forward/backward variables similar to HMMs
2. Set weights λ arbitrarily
3. Compute Z(X) using the forward/backward algorithms
4. Compute ∂ℓ(λ)/∂λ_i using Z(X) and the forward/backward algorithms
5. Update each parameter by gradient search using ∂ℓ(λ)/∂λ_i (quasi-Newton method)
6. Go to step 3 until convergence to the global maximum
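The per-weight gradient follows directly from the moment-matching equation above: observed count minus expected count. A schematic Python update step (observed_counts and expected_counts are hypothetical helpers wrapping the counting and forward/backward computations):

    def gradient_step(weights, train_set, feature_fns, lr=0.1):
        # d l / d lambda_j = (count of f_j in the training data)
        #                  - (expected count of f_j under the current model)
        for j in range(len(weights)):
            obs = sum(observed_counts(feature_fns[j], X, Y)
                      for X, Y in train_set)
            exp = sum(expected_counts(feature_fns[j], X, weights, feature_fns)
                      for X, _ in train_set)
            weights[j] += lr * (obs - exp)   # plain ascent; quasi-Newton
        return weights                       # methods converge faster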
CRF Applications to Gene Prediction
To summarize, in order to apply CRFs to gene prediction, we first design the feature functions on label pairs {y_i, y_{i-1}} and X. Then we use a training data set to select the weights λ. Finally, using

    P(Y|X) = (1/Z(X)) exp( Σ_{j=1}^J Σ_{i=1}^N λ_j f_j(y_i, y_{i-1}, X) ),

we can use the selected feature functions and computed weights to find the most probable labeling Y for a given input sequence X.
On page 9 of [2], we find a brief introduction to Conrad, a gene predictor based on CRFs. One can see that, when predicting on C. neoformans chromosome 9, Conrad outperforms the previous best predictor, Twinscan.
References
[1] Scribe notes, Lecture 6: CRFs for Computational Gene Prediction. MIT 6.047/6.878 Computational Biology, Sept. 25, 2007.
[2] James Galagan. Conditional Random Fields for Computational Gene Prediction. Lecture notes 6 of MIT 6.047/6.878 Computational Biology, Sept. 2007.
[3] Aron Culotta, David Kulp, and Andrew McCallum. Gene prediction with conditional random fields. Technical report UM-CS-2005-028, University of Massachusetts, Amherst. http://www.cs.umass.edu/~culotta/pubs/culotta05gene.pdf
Example 2: Rain/Sprinkler
Nodes: R (Rain), S (Sprinkler), W (Watson's grass wet), H (Holmes' grass wet).
Holmes discovers his house's grass is wet. It could be rain or his sprinkler.
Now imagine Holmes sees Watson's grass is wet. Now we conclude it was probably rain, and probably not his sprinkler.
Explaining Away
Initially we had two explanations for Holmes' wet grass. But once we had more evidence for R, this explained away H, and thus there is no reason for an increase in S.
Conditional Dependence
If we don't know H, then R and S are independent. But if we know H, then R and S are conditionally dependent.
Graph Semantics
There are three basic building blocks, each implying a particular independence relationship:
Serial (chain/linear), X -> Y -> Z: X and Z are dependent, but X ⊥ Z | Y (conditional independence).
Diverging, X <- Y -> Z: X and Z are dependent, but X ⊥ Z | Y (conditional independence).
Converging, X -> Y <- Z: X and Z are independent, but become dependent given Y (conditional dependence: explaining away).
D-Separation
The three semantics combine in the concept of d-separation. Definition: two nodes A and B are d-separated (or blocked) if for every path p between A and B there is a node V such that either
1. the connection is serial or diverging and V is known, or
2. the connection is converging and V and all of its descendants are unknown.
If A and B are not d-separated, they are d-connected.
Equivalence of Networks
Two structures are equivalent if they represent the same independence relationships - they encode the same space of probability distributions. For example, the serial blocks X -> Y -> Z and X <- Y <- Z and the diverging block X <- Y -> Z are all equivalent. We will return to this when we consider causal vs. probabilistic networks.
Bayesian Networks
A Bayesian network (BN) for X = {X_1, X_2, X_3, ..., X_n} consists of:
A network structure S: a directed acyclic graph (DAG) whose nodes correspond to the random variables X and which encodes the graph independence semantics.
A set of probability distributions P: a conditional probability distribution (CPD), i.e., a local distribution, for each X.
Example Bayesian Network
For the rain/sprinkler network (R, S, W, H), the chain rule gives

    P(X) = P(R) P(S|R) P(W|R,S) P(H|R,S,W) = P(R) P(S) P(W|R) P(H|S,R),

where the second equality uses the independencies encoded by the graph.
BN Probability Distribution
We only need distributions over nodes and their parents:

    P(X) = P(X_1, ..., X_n) = Π_{i=1}^n P(X_i | pa(X_i)).

For the example network this comes to 2 + 4 + 4 + 8 = 18 entries.
Recall from HMM/CRF Lecture
For the directed graph (HMM) over hidden states Y_1, ..., Y_N and emissions X_1, ..., X_N, the factorization follows the directed graph semantics:

    P(X,Y) = Π_{all nodes v} P(v | parents(v)) = Π_i P(Y_i | Y_{i-1}) P(X_i | Y_i).

For the Markov random field (CRF), the factorization is over potential functions on cliques, conditioned on X:

    P(Y|X) = Π_{all nodes v} P(v | clique(v), X) = Π_i P(Y_i | Y_{i-1}, X).
CPDs
For the rain/sprinkler network, the CPD for H is a table such as:

    R S H P(H|R,S)
    0 0 1 0.1
    0 0 0 0.9
    0 1 1 0.8
    0 1 0 0.2
    1 0 1 0.9
    1 0 0 0.1
    1 1 1 0.1
    1 1 0 0.9

CPDs may be discrete or continuous.
Bayesian Networks for Inference
We observe values of (evidence on) a set of nodes and want to predict the state of the other nodes. This is observational inference.
Exact inference: the Junction Tree Algorithm.
Approximate inference: variational approaches, Monte Carlo sampling.
Walking Through an Example
We start with the initial CPDs for the rain/sprinkler network:

    P0(R): R=y 0.2, R=n 0.8
    P0(S): S=y 0.1, S=n 0.9
    P0(W|R): P(W=y|R=y) = 1, P(W=y|R=n) = 0.2
    P0(H|R,S), as pairs (H=y, H=n):
        R=y, S=y: (1, 0)      R=n, S=y: (0.9, 0.1)
        R=y, S=n: (1, 0)      R=n, S=n: (0, 1)

We define two clusters, WR and RHS, joined through the separator R. The key idea: the clusters only communicate through R. If they agree on R, all is good. We will find it easier to work on this representation. We then need P(WR) and P(RHS):

    P(WR) = P(R) P(W|R)
    P(RHS) = P(R) P(S) P(H|R,S),

which gives

    P0(W,R): (W=y,R=y) 0.2, (W=y,R=n) 0.16, (W=n,R=y) 0, (W=n,R=n) 0.64
    P0(R,H,S), as pairs (H=y, H=n):
        R=y, S=y: (0.02, 0)      R=n, S=y: (0.072, 0.008)
        R=y, S=n: (0.18, 0)      R=n, S=n: (0, 0.72)

Note that by marginalizing out W from P0(W,R) we get P0(W) = (0.36, 0.64). This is our initial belief in Watson's grass being (wet, not wet).
Now we observe H=y. We need to do three things:
1. Update RHS with this information.
2. Calculate a new P1(R).
3. Transmit P1(R) to update WR.
Updating RHS with H=y: we simply zero out all entries in RHS where H=n and re-normalize:

    P1(R,H,S): R=y, S=y: 0.074; R=y, S=n: 0.662; R=n, S=y: 0.264; R=n, S=n: 0.

But you can see that this changes P(R) from the perspective of RHS. Marginalizing out H and S from RHS gives the new P1(R) = (0.736, 0.264). Note also that P1(S) = (0.339, 0.661), compared with P0(S) = (0.1, 0.9).
Transmitting P1(R) to update WR: we rescale the WR cluster by the ratio P1(R)/P0(R),

    P1(W,R) = P0(W,R) · P1(R)/P0(R),

which gives

    P1(W,R): (W=y,R=y) 0.736, (W=y,R=n) 0.052, (W=n,R=y) 0, (W=n,R=n) 0.211,

so P1(W=y) = 0.788, up from P0(W=y) = 0.36.
Now we observe W=y and repeat the same three steps in the other direction: update WR (zero out the entries with W=n and re-normalize), calculate a new P2(R), and transmit P2(R) to update RHS with the factor P2(R)/P1(R). This gives

    P2(R) = (0.93, 0.07)
    P2(W,R): (W=y,R=y) 0.93, (W=y,R=n) 0.07, W=n entries 0
    P2(R,H,S): R=y, S=y: 0.094; R=y, S=n: 0.839; R=n, S=y: 0.067; R=n, S=n: 0.

Tracking the sprinkler across the updates: P0(S=y) = 0.1, P1(S=y) = 0.339, P2(S=y) = 0.161. R is now almost certain; we have explained away H=y, and S goes low again.
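The whole walkthrough fits in a few lines of NumPy; this sketch (not from the slides) represents the two clusters as arrays indexed by (W,R) and (R,H,S) and reproduces the numbers above:

    import numpy as np

    # Axes: W, R, H, S, each indexed as 0 = yes, 1 = no.
    pR = np.array([0.2, 0.8])
    pS = np.array([0.1, 0.9])
    pW_R = np.array([[1.0, 0.2], [0.0, 0.8]])            # [W, R]
    pH_RS = np.zeros((2, 2, 2))                          # [R, H, S]
    pH_RS[0, :, 0] = [1.0, 0.0]; pH_RS[0, :, 1] = [1.0, 0.0]
    pH_RS[1, :, 0] = [0.9, 0.1]; pH_RS[1, :, 1] = [0.0, 1.0]

    WR = pW_R * pR[None, :]                              # P(W,R)
    RHS = pR[:, None, None] * pS[None, None, :] * pH_RS  # P(R,H,S)

    # Observe H = y: zero out H = n, renormalize, then message on R.
    RHS[:, 1, :] = 0
    RHS /= RHS.sum()
    p1R = RHS.sum(axis=(1, 2))
    WR = WR * (p1R / WR.sum(axis=0))[None, :]            # P1(R)/P0(R)

    # Observe W = y: same three steps in the other direction.
    WR[1, :] = 0
    WR /= WR.sum()
    p2R = WR.sum(axis=0)
    RHS = RHS * (p2R / p1R)[:, None, None]               # P2(R)/P1(R)

    print(RHS.sum(axis=(0, 1)))   # P2(S), approx (0.161, 0.839)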
Message Passing in Junction Trees
The state of a separator S is the information shared between the adjacent clusters X and Y. When Y is updated, it sends a message to X. The message has the information needed to update X so that it agrees with Y on the state of S.
A node can send one message to a neighbor only after receiving all messages from each other neighbor. When messages have been passed both ways along a link, the link is consistent. Passing continues until all links are consistent.
HUGIN Algorithm
A simple algorithm for coordinated message passing in junction trees:
Select one node, V, as root.
Call CollectEvidence(V): ask all neighbors of V to send a message to V; if not possible, recursively pass the message to all neighbors but V.
Call DistributeEvidence(V): send a message to all neighbors of V, and recursively send messages to all neighbors but V.
BN to Junction Tree
This is a topic by itself. In summary:
Moral graph: an undirected graph with links between all parents of each node.
Triangulate: add links so that all cycles of length > 3 have a chord.
The cliques then become the nodes of the junction tree (for the rain/sprinkler network: clusters WR and RHS with separator R).
Applications
Consider a network over genes G1, ..., G10: measure the expression of some genes and predict the rest.
Latent Variables
Unobserved nodes can be added to the same network: drugs (D1-D4), environment (pH, O2), and metabolites (M1-M4).
Observation vs. Intervention
The arrows are not necessarily causal: a BN models the probability of, and correlation between, variables. For the applications so far, we observe evidence and want to know the states of the other nodes most likely to go with the observation. What about interventions?
Example - Sprinkler: if we observe the sprinkler on, Holmes' grass is more likely wet, and it is more likely summer (via the season node). But what if we force the sprinkler on? Intervention: cut the arrows from the parents.
Causal vs. Probabilistic
This depends on getting the arrows correct! Flipping all arrows does not change the independence relationships, but it changes causality for interventions.
Learning Bayesian Networks
A Bayesian network consists of 1. a network structure S, and 2. parameters θ for the probability distributions on each node, given S. Given a set of observations D (e.g., an expression data set) on X, we want to find both.
Learning θ given S is relatively easy. We can choose the maximum likelihood parameter

    θ* = argmax_θ P(D | θ, S), with P(D | θ, S) = Π_{i=1}^n P(X_i | pa(X_i), θ).

We can also choose to include prior information P(θ) in a Bayesian approach:

    θ_bayes = ∫ θ P(θ | S, D) dθ, where P(θ | S, D) ∝ P(D | θ, S) P(θ).

Sampling Models - MCMC
How do we sample structures? We would like to sample from

    P(S | D) = P(D | S) P(S) / Σ_k P(D | S_k) P(S_k),

but the direct approach is intractable due to the partition function. Instead we use the Markov chain Monte Carlo method:
Propose: given S_old, propose a new S_new with probability Q(S_new | S_old).
Accept/Reject: accept S_new as a sample with probability

    p = min( 1, [P(D | S_new) P(S_new) Q(S_old | S_new)] / [P(D | S_old) P(S_old) Q(S_new | S_old)] ).
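A generic Metropolis-Hastings loop matching this acceptance rule (log_score and propose are hypothetical problem-specific helpers; log_score(S) should return log P(D|S) + log P(S)):

    import math, random

    def mcmc_structures(S, log_score, propose, n_samples=1000):
        # propose(S) must return (S_new, log Q(S_new|S), log Q(S|S_new)).
        samples = []
        for _ in range(n_samples):
            S_new, log_q_fwd, log_q_rev = propose(S)
            log_p = (log_score(S_new) - log_score(S)) + (log_q_rev - log_q_fwd)
            if math.log(random.random()) < min(0.0, log_p):
                S = S_new                  # accept the proposed structure
            samples.append(S)              # a rejected move repeats the old one
        return samples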
a unique genetic code (like other mycoplasmas)
As a first step in the process of engineering a simplified cell, Dr. Knight's lab has determined the essential genes in the genome of M. florum by creating and sequencing a library of knockout mutants. The next steps in the process include adding a plasmid system, adding a recET recombination system, and developing a kit of recoded parts (e.g., GFP, antibiotic resistance genes).
Biosafety
Synthetic biological systems pose at least three potential dangers: uncontrolled growth, horizontal gene transfer from natural organisms, and horizontal gene transfer to natural organisms. Uncontrolled growth could be checked by designing synthetic biological systems to require pairs of nutrients: though evolution might circumvent a single nutrient requirement, it is unlikely to circumvent two simultaneously.
Engineered organisms based on M. florum would be naturally resistant to horizontal gene transfer (in either direction). UGA, normally a stop codon, is the preferred tryptophan codon in M. florum; thus, gene transfer to natural organisms is unlikely. Similarly, gene transfer from natural organisms is prevented by the fact that CGG is a nonsense codon in M. florum. Further altering the genetic code in synthetic biological systems could increase this natural barrier.