C(2n, n) = (2n)! / (n!)^2 ≈ 2^(2n) / sqrt(π n)    (1)

For small values of n such as 100, the number of alignments is already too big (10^60) for this enumerating strategy to be feasible.

¹ Even different substitution events have different likelihoods, since specific polymerases are more or less likely to make specific mistakes.

² There was a note about evolutionary pressure to keep the genome compact. For example, the genome of viruses is very small since they have the evolutionary pressure of fitting it inside the small virus particle. Similarly, bacteria are pressured to keep their genome small to save energy in replication and to allow easy exchange of genomic material between cells. On the other hand, there is little pressure for human cells to maintain a small genome, so it is no surprise that the human genome is so large.

³ Non-Boring Alignments: an alignment that doesn't match gaps on both sequences.

⁴ We use the Stirling approximation: n! ≈ sqrt(2πn) · (n/e)^n.
2.3 Formulation 1: Longest Common Substring
As a first attempt, suppose we treat the nucleotide sequences as strings over the alphabet {A, C, G, T}. Given two such strings, R and S, we might try to align them by finding the longest common substring between them. In particular, these substrings cannot have gaps in them.

As an example, if R = ACGTCATCA and S = TAGTGTCA, the longest common substring between them is GTCA. So in this formulation, we could align R and S along their longest common substring, GTCA, to get the most matches.
A simple algorithm would be to just try aligning S at different offsets with respect to R, keeping track of the longest substring match found thus far. But this algorithm is quadratic (too slow) in the lengths of R and S. Assuming |R| = |S| = n, we loop through each character in R, trying to align S at that offset. During each iteration, we then loop over the characters of S and compare them to the corresponding characters in R, looking for matching substrings. Thus the total running time would be O(n²).
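A minimal sketch of this quadratic strategy, in the spirit of the Python samples later in these notes (the function and variable names are ours, not from the lecture):

def longest_common_substring(R, S):
    # Slide S along R; at each offset, scan for the longest run of
    # characters that match between the two strings.
    best = ""
    for offset in range(-len(S) + 1, len(R)):   # position of S[0] within R
        run = ""
        for j in range(len(S)):
            i = offset + j
            if 0 <= i < len(R) and R[i] == S[j]:
                run += S[j]
                if len(run) > len(best):
                    best = run
            else:
                run = ""   # a mismatch (or running off R) breaks the run
    return best

print(longest_common_substring("ACGTCATCA", "TAGTGTCA"))   # GTCA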
2.4 Formulation 2: Longest Common Subsequence
Another formulation is to allow gaps in our subsequences and not just limit ourselves to substrings with no gaps. Given a sequence X = x_1 ... x_m, we formally define Z = z_1 ... z_k to be a subsequence of X if there exists a strictly increasing sequence i_1, ..., i_k of indices of X such that for all j = 1, ..., k, we have x_{i_j} = z_j (CLRS 350-1). In the longest common subsequence (LCS) problem, we're given two sequences X and Y and we want to find the maximum-length common subsequence Z.

Consider the example from the previous section with sequences R and S. The longest common subsequence is AGTTCA, a longer match than just the longest common substring.
2.5 Formulation 3: Sequence Alignment as Edit Distance
The previous LCS formulation is almost what we want. But so far we haven't specified any cost functions that differentiate between the different types of edit operations: insertions, deletions, and substitutions. Implicitly, our cost function has been uniform, implying that all operations are equally likely. But since substitutions are much more likely, we want to bias our LCS solution with a cost function that prefers substitutions over insertions and deletions.

So we recast sequence alignment as a special case of the classic Edit-Distance problem in computer science (CLRS 366). We add varying penalties for different edit operations to reflect biological occurrences. Of the four nucleotide bases, A and G are purines (larger, two fused rings), while C and T are pyrimidines (smaller, one ring). Thus RNA polymerase, the enzyme that transcribes the nucleotide sequence, is much more likely to confuse two purines or two pyrimidines, since they are similar in structure. Accordingly, we define the following scoring matrix:
Figure 2: Cost matrix for matches and mismatches.
Now the problem is to find the least expensive (as per the cost matrix) operation sequence transforming the initial nucleotide sequence into the final nucleotide sequence.
3 Dynamic Programming
Before we can tackle the problem algorithmically, we need to introduce a few computational tools.

Dynamic programming is a technique used to solve optimization problems by expressing the problem in terms of its subproblems. Its hallmarks are optimal substructure and overlapping subproblems. A problem exhibits optimal substructure if an optimal solution to the problem contains within it optimal solutions to subproblems (CLRS 339). Unlike in greedy algorithms, there is no single locally optimal choice we can make to reach a globally optimal solution. In dynamic programming, we need to trace back through several choices, and we keep track of these choices by storing results to subproblems in a table. By filling in this table, we build up a solution to the entire problem from the bottom up, by first solving increasingly larger subproblems.
3.1 Fibonacci Example
The Fibonacci numbers provide an instructive example of the benefits of dynamic programming. The Fibonacci sequence is recursively defined as F_0 = F_1 = 1, F_n = F_{n-1} + F_{n-2} for n ≥ 2. We wish to develop an algorithm to compute the n-th Fibonacci number.
3.2 Recursive (Top-Down) Solution
The simple top-down approach is to just apply the recursive definition. Here's a simple way to do this in Python:
# Assume n is non-negative
def fib(n):
if n == 0 or n == 1: return 1
else: return fib(n-1) + fib(n-2)
But this top-down algorithm runs in exponential time. If T(n) is how long it takes to compute the n-th Fibonacci number, we have that T(n) = T(n-1) + T(n-2), so T(n) = O(φ^n).⁵ The problem is that we are repeating work by solving the same subproblem many times.
Figure 3: The recursion tree for the fib procedure showing repeated subproblems.
⁵ Where φ is the golden ratio. In the slides it was claimed that T(n) = O(2^n); that bound is valid but loose, and nonetheless still exponential.
3.3 Dynamic Programming (Bottom-Up) Solution
For calculating the n-th Fibonacci number, instead of beginning with F(n) and using recursion, we can start computation from the bottom, since we know we are going to need all of the subproblems anyway. This way, we will save a lot of the repeated work that would be done by the top-down approach, and we will be able to compute the n-th Fibonacci number in O(n) time. Here's a sample implementation of this approach in Python:
def fib(n):
x = y = 1
for i in range(1, n):
z = x+y
x = y
y = z
return y
This method is optimized to use only constant space instead of an entire table, since we only need the answer to each subproblem once. But in general dynamic programming solutions, we want to store the solutions to subproblems in a table, since we may need to use them multiple times, and we should be able to look them up quickly. Such solutions look more like the sample implementation from lecture (tweaked to compile properly):
def fib(n):
fib_table = [1, 1] + (n-1)*[0] # initialize table with zeros
for i in range(2,n+1):
fib_table[i] = fib_table[i-1] + fib_table[i-2]
return fib_table[n]
3.4 Memoization Solution
Yet another solution would be to use memoization, which combines the top-down approach with the dynamic programming approach. In this approach, we still use the recursive formula to compute the sequence, but we also store the results of subproblems in a table so that we don't have to re-solve them. Here's a sample memoization implementation in Python:
def fib(n):
fib_table = [1, 1] + (n-1)*[0]
def get(i):
if fib_table[i] != 0:
return fib_table[i]
else:
fib_table[i] = get(i-1) + get(i-2)
return fib_table[i]
if n == 0 or n == 1: return 1
else: return get(n-1) + get(n-2)
4 Needleman-Wunsch: Dynamic Programming Meets Sequence Alignment

We will now use dynamic programming to tackle the harder problem of general sequence alignment. Given two strings S and T, we want to find the longest common subsequence, which may or may not contain gaps. Further, we want the optimal such longest common subsequence as defined by our scoring function in Figure 2. We will use the notation S = S_1 ... S_n and T = T_1 ... T_m when talking about individual bases in the sequences. Furthermore, we'll let d represent the gap penalty cost and s(x, y) represent the score of aligning an x and a y. The algorithm we will develop in the following sections to solve sequence alignment is known as the Needleman-Wunsch algorithm.
4.1 Step 1: Find a Subproblem
Suppose we have an optimal alignment for two sequences in which S_i matches T_j. The key insight is that this optimal alignment is composed of an optimal alignment between S_1 ... S_{i-1} and T_1 ... T_{j-1} and an optimal alignment between S_{i+1} ... S_n and T_{j+1} ... T_m. If not, we could cut and paste a better alignment in place of the suboptimal one to achieve a higher-scoring alignment, which contradicts the supposed optimality of our given alignment. Also note that the optimality score is additive, so we can obtain the score of the whole alignment by adding the scores of the alignments of the subsequences.
4.2 Step 2: Index Space of Subproblems
We need to parameterize the space of subproblems. We identify by F_{ij} the score of the optimal alignment of S_1 ... S_i and T_1 ... T_j. Having this parameterization allows us to maintain an m × n matrix F with the solutions (i.e., optimal scores) for all the subproblems.
4.3 Step 3: Make Locally Optimal Choices
We can compute the optimal solution for a subproblem by making a locally optimal choice based on the results from the smaller subproblems. Thus, we need to establish a recursive function that shows how the solution to a given problem depends on the solutions to its subproblems, and we use this recursive definition to fill up the table F in a bottom-up fashion.

We can consider the 4 possibilities (insert, delete, substitute, match) and evaluate each of them based on the results we have computed for smaller subproblems. To initialize the table, we set F_{0j} = d · j and F_{i0} = d · i, since those are the scores of aligning against j and i gaps, respectively.
Then we traverse the matrix row by row, computing the optimal score for each alignment subproblem by considering four possibilities, which are summarized pictorially in Figure 4:

• Sequence S has a gap at the current alignment position.
• Sequence T has a gap at the current alignment position.
• There is a mutation (nucleotide substitution) at the current position.
• There is a match at the current position.

Figure 4: The different cases to consider in making a locally optimal choice.
We then use the possibility that produces the maximum score. We express this mathematically by the recursive formula for F_{ij}:

F(i, j) = max {
    F(i-1, j) + d                 (insert gap in T)
    F(i, j-1) + d                 (insert gap in S)        (2)
    F(i-1, j-1) + s(S_i, T_j)     (match or mutation)
}
After traversing the matrix, the optimal score for the global alignment is given by F_{mn}. The traversal order needs to be such that we have solutions to given subproblems when we need them. Namely, to compute F_{ij}, we need to know the values to the left of, above, and diagonally above F_{ij} in the table. Thus we can traverse the table in row- or column-major order, or even diagonally from the top-left cell to the bottom-right cell. Now, to obtain the actual alignment we just have to remember the choices that we made at each step.
4.4 Step 4: Constructing an Optimal Solution
Note the duality between paths through the matrix F and sequence alignments. In evaluating each cell F_{ij} we make a choice in taking the maximum of the three possibilities. Thus the value of each (uninitialized) cell in the matrix is determined either by the cell to its left, the cell above it, or the cell diagonally above it. A match and a substitution are both represented as traveling in the diagonal direction; however, a different cost can be applied for each, depending on whether the two base pairs we are aligning match or not.

To construct the actual optimal alignment, we need to trace back through our choices in the matrix. It is helpful to maintain a pointer for each cell, while filling up the table, that shows which choice we made to get the score for that cell. Then we can just follow our pointers backwards to reconstruct the optimal alignment.
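Putting steps 1 through 4 together, here is a sketch of the full algorithm in the style of the earlier Python samples. The scoring in the usage line (+1 match, -1 mismatch, d = -2) is an illustrative assumption standing in for the matrix of Figure 2, not the lecture's exact values; note that d is negative, matching the "+d" convention of Equation (2):

def needleman_wunsch(S, T, s, d):
    # Global alignment: s(x, y) scores aligning bases x and y;
    # d (negative) is the gap score. Returns (score, aligned_S, aligned_T).
    n, m = len(S), len(T)
    F = [[0] * (m + 1) for _ in range(n + 1)]      # F[i][j]: score of S[:i] vs T[:j]
    ptr = [[None] * (m + 1) for _ in range(n + 1)] # traceback pointers
    for i in range(1, n + 1):
        F[i][0], ptr[i][0] = i * d, 'up'
    for j in range(1, m + 1):
        F[0][j], ptr[0][j] = j * d, 'left'
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            choices = [(F[i-1][j-1] + s(S[i-1], T[j-1]), 'diag'),  # match/mutation
                       (F[i-1][j] + d, 'up'),                      # gap in T
                       (F[i][j-1] + d, 'left')]                    # gap in S
            F[i][j], ptr[i][j] = max(choices)
    # Follow the pointers back from the bottom-right cell
    a, b, i, j = [], [], n, m
    while i > 0 or j > 0:
        if ptr[i][j] == 'diag':
            a.append(S[i-1]); b.append(T[j-1]); i -= 1; j -= 1
        elif ptr[i][j] == 'up':
            a.append(S[i-1]); b.append('-'); i -= 1
        else:
            a.append('-'); b.append(T[j-1]); j -= 1
    return F[n][m], ''.join(reversed(a)), ''.join(reversed(b))

# Illustrative scoring only: +1 match, -1 mismatch, gap d = -2
score, A, B = needleman_wunsch("ACGTCATCA", "TAGTGTCA",
                               lambda x, y: 1 if x == y else -1, -2)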
4.5 Analysis
The runtime analysis of this algorithm is very simple. Each update takes O(1) time, and since there are mn
elements in the matrix F , the total running time is O(mn). Similarly, the total storage space is O(mn).
For the more general case where the update rule is perhaps more complicated, the running time may be
more expensive. For instance, if the update rule requires testing all sizes of gaps (e.g. the cost of a gap is
not linear), then the running time would be O(mn(m + n)).
4.6 Optimizations
The algorithm we showed is much faster than the originally proposed strategy of enumerating alignments, and it performs well for sequences up to about 10 kilobases long. However, for whole-genome alignments the algorithm shown is not feasible, but we can make modifications to it to further improve its performance.
4.6.1 Diagonal
One possible optimization is to ignore Mildly Boring Alignments (MBA), that is, alignments that have too
many gaps. In other words, we can limit ourselves to stay within some distance W from the diagonal in the
matrix of subproblems. This will lead to a time/space cost of O((m + n)W ).
Note, however, that this strategy no longer guarantees that we will obtain the optimal alignment. If one
wants to maintain this guarantee, this strategy can still be used to obtain a lower bound on the optimal
score, and then use this bound to limit the space of the matrix that we have to cover.
4.6.2 Using stretches of high conservation
Suppose now that we can quickly find common stretches of very high conservation between the two sequences. We can then use these stretches to anchor the optimal alignment. We are then left with finding optimal alignments for the subsequences between these stretches, greatly reducing the size of the problem and thus the running time.

The question left, though, is how we can find those conserved stretches very fast. This question of finding local alignments will be the topic of the next lecture.
    F(i-1, j-1) + s(x_i, y_j)

Termination: Bottom right
To find local alignments, we should be able to skip nucleotides just to get to the segment of interest. Thus, we allow the algorithm to start and stop anywhere with no penalty:

Initialization: F(i, 0) = 0, F(0, j) = 0

Iteration: F(i, j) = max { 0, F(i-1, j) - d, F(i, j-1) - d, F(i-1, j-1) + s(x_i, y_j) }

Termination: Anywhere
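As a sketch (ours, with d a positive penalty as in the recursion above; this is the idea behind the Smith-Waterman algorithm), the local-alignment score needs only small changes to the global dynamic programming loop: every cell is floored at 0, the boundary is initialized to 0, and the answer is the maximum over all cells rather than the bottom-right corner:

def local_alignment_score(x, y, s, d):
    # Best local alignment score: F(i, j) floored at 0, answer = max cell.
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]   # F(i,0) = F(0,j) = 0
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(0,                                 # start afresh here
                          F[i-1][j] - d,                     # gap in y
                          F[i][j-1] - d,                     # gap in x
                          F[i-1][j-1] + s(x[i-1], y[j-1]))   # match/mismatch
            best = max(best, F[i][j])                        # terminate anywhere
    return best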
Another variation of alignments is semi-global alignment. The algorithm is as follows:

Initialization: F(i, 0) = 0

Iteration: F(i, j) = max { F(i-1, j) - d, F(i, j-1) - d, F(i-1, j-1) + s(x_i, y_j) }

Termination: Right column
Sometimes it can be costly in both time and space to run these alignment algorithms.
Therefore, there exist some algorithmic variations to save time/space that work well in
practice. However, the details are outside of the scope of this lecture.
1.2 Generalized gap penalties

Gap penalties determine the score calculated for a subsequence and thus which match is selected. Depending on the model, it could be a good idea to penalize differently for, say, gaps of different lengths. However, there is also a cost associated with using more complex gap penalty functions, such as a substantially slower running time. There exist nice approximations to such gap penalty functions.
2 Linear-time exact string matching

When looking for exact matches of a pattern, there is a set of algorithms and data structures that work well in some applications and another set that work well in others. Some that are good to know are the Z-algorithm, suffix trees, suffix arrays, etc. In this lecture we will talk about the Karp-Rabin algorithm.
2.1 Karp-Rabin Algorithm

The problem is as follows: in a text T of length n we are looking for a pattern P of length m. The key idea behind this algorithm is to interpret strings as numbers that can be compared in constant time. Translate the string P and the substrings of T of length m into numbers (x and y, respectively), and slide x along T at every offset until there is a match. This is O(n).

One might raise the objection that converting a substring of T of length m into a number y would take O(m), and therefore the total running time would be O(nm). However, we only need to do this once, since the number at the subsequent offset can be computed from the number at the previous offset using a few arithmetic operations: a subtraction to remove the high-order digit, a multiplication to shift the characters left, and an addition to append the low-order digit (see Figure 2). Therefore, it is only O(n + m).

Figure 2: The Karp-Rabin algorithm.
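Here is a minimal sketch of the rolling computation (names and the base-4 encoding are ours; the mod-q hashing discussed below is omitted for clarity). Treating each character as a digit in base b, the number for the next window is obtained with one subtraction, one multiplication, and one addition:

def karp_rabin_positions(T, P, b=4):
    # Offsets where P occurs in T, comparing base-b numeric encodings of P
    # and each length-m window of T. Assumes at most b distinct characters.
    m, n = len(P), len(T)
    digit = {c: i for i, c in enumerate(sorted(set(T + P)))}  # A,C,G,T -> 0..3
    x = 0
    for c in P:                      # encode P once: O(m)
        x = x * b + digit[c]
    y = 0
    for c in T[:m]:                  # encode the first window: O(m)
        y = y * b + digit[c]
    hits = []
    for i in range(n - m + 1):
        if x == y:
            hits.append(i)
        if i + m < n:                # roll the window in O(1)
            y = (y - digit[T[i]] * b**(m - 1)) * b + digit[T[i + m]]
    return hits

print(karp_rabin_positions("TAGTGTCA", "GT"))   # [2, 4]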
These ideas can be used to rapidly index consecutive W-mers in a lookup table or hash table, using the index of the previous string to compute the index of the next string in constant time, without having to read the entire string.
Also, there is always the risk that the numerical representations of x and y become large enough to exceed the processor word size, in which case there is no guarantee that the comparisons are constant time. To eliminate this, we use a hash (e.g., mod q) to map x and y to a smaller space. As a tradeoff, this is susceptible to collisions, which translate into false positive matches. However, there is a fix for that too.
3 The BLAST algorithm (Basic Local Alignment Search Tool)

Figure 3: The BLAST algorithm.
While Dynamic Programming (DP) is a nice way to find alignments, it will often be too slow. Since the DP is O(n²), matching two 3,000,000,000-length sequences would take about 9 × 10^18 operations. What we'll do is present BLAST, an alignment algorithm which runs in O(n) time. For sequences of length 3,000,000,000, this will be around 3,000,000,000 times faster. The key to BLAST is that we only actually care about alignments that are very close to perfect. A match of 70% is worthless; we want something that matches 95% or 99% or more. What this means is that correct (near-perfect) alignments will have long substrings of nucleotides that match perfectly. As a simple example, if we have a length-1000 sequence A and a second length-1000 sequence B which is identical to A except in 10 point mutations, then by the Pigeonhole Principle¹ there is a perfectly matching substring of length at least 90 (the 10 mutations split the remaining 990 bases into at most 11 runs, so some run has length at least 990/11 = 90). If we suppose that the mutations happen at random, the expected length of the longest perfectly matching substring will actually be considerably longer. In biology, the mutations that we find will not actually be distributed randomly, but will be clustered in nonfunctional regions of DNA while leaving untouched long stretches of functional DNA.

The other aspect of BLAST which allows us to speed up repeated queries is the ability to preprocess a large database of DNA off-line. After preprocessing, searching for a sequence of length m in a database of length n will take only O(m) time.

¹ The Pigeonhole Principle states that if we have n pigeons stuffed into k holes, then some hole must have at least n/k pigeons.
3.1 The BLAST algorithm
The steps are as follows:

1. Split the query into overlapping words of length W (the W-mers).
2. Find a neighborhood of similar words for each word (see below).
3. Look up each word in the neighborhood in a hash table to find where in the database each word occurs. Call these the seeds, and let S be the collection of seeds.
4. Extend the seeds in S until the score of the alignment drops below some threshold X.
5. Report matches with overall highest scores.
The pre-processing step of BLAST is to make sure that all substrings of W consecutive nucleotides are included in our database (or in a hash table²). These are called the W-mers of the database.
As in step 1, we first split the query by looking at all substrings of W consecutive nucleotides in the query. For protein BLAST, we then modify these to generate a neighborhood of similar words based on amino-acid similarity scores from the similarity matrix. We take progressively more dissimilar words in our neighborhood until our similarity measure drops below some threshold T. This allows us to still find matches that don't actually have W matching characters in a row, but which do have W very similar characters in a row.
We then look up all of these words in our hash table to find seeds of W consecutive matching nucleotides. We then extend these seeds to find our alignment. A very simple method for doing this, which works well in practice, is to simply greedily scan forward and stop as soon as the first discrepancy is found. Alternatively, one could use the DP alignment algorithm on just the DNA around the seed to find an optimal alignment in the region of the seed. Since this is a much shorter segment, this will not be as slow as running the DP algorithm on the entire DNA database.
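A toy sketch (ours) of the pre-processing step and the seed-and-extend loop with the greedy stop-at-first-discrepancy extension; it uses exact W-mers only (no neighborhood expansion) and makes no attempt to deduplicate overlapping hits:

def build_wmer_index(database, W):
    # Pre-processing: map every W-mer in the database to its offsets.
    index = {}
    for i in range(len(database) - W + 1):
        index.setdefault(database[i:i + W], []).append(i)
    return index

def blast_like_hits(query, database, index, W):
    # Seed with exact W-mer lookups, then greedily extend each seed in
    # both directions until the first mismatch.
    hits = []
    for q in range(len(query) - W + 1):
        for d in index.get(query[q:q + W], []):    # the seeds
            lo_q, lo_d = q, d
            while lo_q > 0 and lo_d > 0 and query[lo_q - 1] == database[lo_d - 1]:
                lo_q -= 1; lo_d -= 1               # extend left
            hi_q, hi_d = q + W, d + W
            while (hi_q < len(query) and hi_d < len(database)
                   and query[hi_q] == database[hi_d]):
                hi_q += 1; hi_d += 1               # extend right
            hits.append((lo_d, database[lo_d:hi_d]))
    return hits

db = "ACGTCATCAGGTACGTCA"
idx = build_wmer_index(db, W=4)
print(blast_like_hits("TACGTC", db, idx, W=4))  # overlapping seeds repeat regions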
Note that if the W-mers are small (6-10 nucleotides or 4-6 amino acids), or if all W-mers are well-populated (that is, 4^W < database size for nucleotides, or 20^W < database size for amino acids), then a simple table lookup can be more efficient than using a hash table (i.e., directly look up the address by interpreting the string as a number). A hash table becomes necessary when 4^W (or 20^W) is much greater than the genome/proteome, in which case the table is typically too large to fit in memory, and even if it did, most of the entries in a direct lookup table would be empty, wasting resources.
² Hash tables allow one to index data in (expected) constant time even when the space of possible data is too large to explicitly store in an array. The key to a hash table is a good hash function, which maps one's data to a smaller range. One can then explicitly index this smaller range with an array. For a good hash table, one needs a fast, deterministic hash function which has a reasonably uniformly distributed output regardless of the distribution of the input. One also needs policies for handling hash collisions. For more information on hashing, refer to the notes from 6.046.
3.2 Extensions to BLAST
Filtering: Low-complexity regions can cause spurious hits. For instance, if our query has a string of copies of the same nucleotide and the database has a long stretch of the same nucleotide, then there will be many useless hits. To prevent this, we can either try to filter out low-complexity portions of the query or we can ignore unreasonably over-represented portions of the database.

Two-hit BLAST: If there's a mutation in the middle of a long sequence, then we may not have a match with a long W-mer even when there is still a lot of matching DNA. Thus, we might look for a pair of nearby matching, but smaller, W-mers. Getting two smaller W-mers to match is more likely than one longer W-mer. This allows us to get a higher sensitivity with a smaller W, while still pruning out spurious hits. This means that we'll spend less time trying to extend matches that don't actually match. Thus, this allows us to improve speed while maintaining sensitivity.

Combs: Recall from your biology classes that the third nucleotide in a triplet usually doesn't actually have an effect on which amino acid is represented. This means that each third nucleotide in a sequence is less likely to be preserved by evolution, since it often doesn't matter. Thus, we might want to look for W-mers that look similar except in every third nucleotide. This is a particular example of a comb. A comb is simply a bit mask which represents which nucleotides we care about when trying to find matches. We explained above why 110110110... might be a good comb, and it turns out to be. However, other combs are also useful. One way to choose a comb is to just pick some nucleotides at random. This is called random projection.
4 The Statistics of Alignments
As described above, the BLAST algorithm uses a scoring (substitution) matrix to expand the list of W-mers to look for and to determine an approximately matching sequence during seed extension. But how do we construct this matrix in the first place? How do we determine the value of s(x_i, y_j)?

The most basic way to do this is to first do some simple(r) alignments by hand (alternately, we can ask an algorithm to do this with some randomized parameters and then refine those parameters iteratively). Once we have some sequences aligned, we can then look at the regions that are the most conserved. From these regions, we can derive some probabilities of a residue's replacement with another, i.e., a substitution matrix. Let's develop a more formal probabilistic model. (See slides on page 5.)
First, we have to step back and look at a probabilistic model of an alignment of two sequences. For this discussion, we're going to focus on protein sequences (although the reasoning is similar for DNA sequences).

Suppose we have aligned two sequences x and y. How likely is it that x and y are related through evolution? We have to look at the likelihood ratio of the alignment score:

P(score | related) / P(score | unrelated)
More formally, for unrelated sequences, the probability of having an n-residue alignment between x and y is a simple product of the probabilities of the individual sequences, since the residue pairings are independent. That is, with

x = {x_1 ... x_n},  y = {y_1 ... y_n},  q_a = P(amino acid a),

P(x, y | U) = ∏_{i=1}^{n} q_{x_i} · ∏_{i=1}^{n} q_{y_i}.
For related sequences, the residue pairings are no longer independent, so we must use a different joint probability:

p_ab = P(evolution gave rise to a in x and b in y),

P(x, y | R) = ∏_{i=1}^{n} p_{x_i y_i}.
Do some math and we get our desired expression for the likelihood ratio:

P(x, y | R) / P(x, y | U) = ∏_{i=1}^{n} p_{x_i y_i} / ( ∏_{i=1}^{n} q_{x_i} · ∏_{i=1}^{n} q_{y_i} ) = ∏_{i=1}^{n} p_{x_i y_i} / (q_{x_i} q_{y_i}).
Since we eventually want to compute a sum of scores, and probabilities require products, we take the log of the product to get a handy summation:

S ≡ log [ P(x, y | R) / P(x, y | U) ] = Σ_i log [ p_{x_i y_i} / (q_{x_i} q_{y_i}) ] = Σ_i s(x_i, y_i),

where

s(x_i, y_i) = log [ p_{x_i y_i} / (q_{x_i} q_{y_i}) ].
The above expression for s(x_i, y_i) is then used to crank out a substitution matrix like BLOSUM62 (page 6, slide 2). Incidentally, the BLOSUM62 matrix captures both chemical similarity and evolutionary similarity, which gives more information than a single expert system can. For example (in the slide), we see a positive score for matches between aspartate (D) and glutamate (E), which are chemically similar. Additionally, the matrix shows tryptophan (W) matching with itself has a higher score than isoleucine (I) matching with itself, i.e., s(I, I) < s(W, W). This is primarily because tryptophan occurs in protein sequences less frequently than isoleucine, meaning s(W, W) has a smaller denominator, but it may also mean that tryptophan is more highly conserved in general.
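To make the recipe concrete, here is a small sketch (ours) that turns residue pairs observed in trusted alignments into log-odds scores s(a, b) = log(p_ab / (q_a q_b)); the toy counts are invented purely for illustration:

import math
from collections import Counter

def substitution_scores(aligned_pairs):
    # s(a, b) = log( p_ab / (q_a * q_b) ), with p and q estimated by counting
    pairs = Counter()
    singles = Counter()
    for a, b in aligned_pairs:
        pairs[(a, b)] += 1
        pairs[(b, a)] += 1          # count both orders so p_ab is symmetric
        singles[a] += 1
        singles[b] += 1
    n_pairs = sum(pairs.values())
    n_single = sum(singles.values())
    p = {ab: c / n_pairs for ab, c in pairs.items()}
    q = {a: c / n_single for a, c in singles.items()}
    return {(a, b): math.log(p[(a, b)] / (q[a] * q[b])) for (a, b) in p}

# Invented counts: D-E pairs deliberately enriched, W rare but conserved
toy = ([('A', 'A')] * 50 + [('D', 'D')] * 30 + [('E', 'E')] * 30 +
       [('D', 'E')] * 40 + [('W', 'W')] * 5 + [('A', 'D')] * 2 +
       [('D', 'W')] * 1)
s = substitution_scores(toy)
print(s[('D', 'E')] > 0, s[('W', 'W')] > s[('D', 'D')])   # True True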
So now that we're given two sequences and an alignment score, how do we determine whether the score is statistically significant? That is, how can we determine whether the two sequences are related by evolution and not simply by chance? In particular, a large database size increases the probability of finding spurious matches. In 1990, Karlin and Altschul determined that for two random sequences, the expected number of alignments with a score of at least S is E(S) = K m n e^(-λS), which we can use to correct for false positives.
6.047/6.878 Fall 2008 Lecture #5
1. Overview
In the previous lecture we looked at examples of unsupervised learning techniques, such as clustering. In this lecture we focus on the use of supervised learning techniques, which use the inherent structure of the data to predict something about the state or class we should be in. There are two different approaches to classification: (1) generative models, which leverage probabilistic models that generate the data, and (2) discriminative models, which use an appropriate function to tease apart the data. Naïve Bayes classifiers are an example of generative models and Support Vector Machines (SVMs) are an example of discriminative models. We will discuss actual biological applications of each of these models, specifically the use of Naïve Bayes classifiers to predict mitochondrial proteins across the genome and the use of SVMs for the classification of cancer based on gene expression monitoring by DNA microarrays. The salient features of both techniques and caveats of using each technique will also be discussed.
We will discuss classification in the context of the problem of identifying mitochondrial proteins. If we look across the genome, how do we determine which proteins are involved in mitochondrial processes, or more generally, which proteins are targeted to the mitochondria?¹ This is particularly useful because if we know the mitochondrial proteins, we can start asking interesting questions about how these proteins mediate disease processes and metabolic functions. The classification of these mitochondrial proteins involves the use of 7 features for all human proteins: (1) targeting signal, (2) protein domains, (3) co-expression, (4) mass spectrometry, (5) sequence homology, (6) induction, and (7) motifs.

¹ Mitochondria play an important role in metabolic processes and are known as the powerhouse of the cell. They represent an ancient bacterial invader that has been successfully subsumed by eukaryotic cells. Mitochondria have their own genetic material, commonly referred to as mtDNA. Interestingly, a large fraction of inherited mitochondrial disorders are due to nuclear genes encoding specific proteins targeted to the mitochondria and not due to mutations in mtDNA itself. Hence it is imperative to identify these mitochondrial proteins to shed light on the molecular basis for these disorders.
In general, our approach will be to determine how these features are distributed for objects of different
classes. The classification decisions will then be made using probability calculus. For our case with 7
features for each protein (or more generally, each object), it is likely that each object is associated with
more than one feature. To simplify things let us consider just one feature and introduce the two key
concepts.
(1) We want the model for the probability of a feature given each class. So the features from each
class are drawn from what are known as class-conditional probability distributions (CCPD).
(2) We model prior probabilities to quantify the expected a priori chance of seeing a class. For
the case of identifying mitochondrial proteins, we would like to have a low prior probability
of seeing the mitochondrial class. In this way we can bias ourselves against classifying
mitochondrial proteins. (But how do we actually determine these prior probabilities? We will
cover this in section 2.2).
A good example of how our second concept affects classification was discussed in lecture. We discussed
the example of Eugene. Eugene likes to read books in philosophy and has a penchant for tweed jackets.
The question posed was whether Eugene is more likely (in probabilistic terms) to be an ivy-league dean or
a truck driver. The answer was that he was more likely to be a truck driver, due to the fact that there are
many more truck drivers in the world than there are ivy-league deans. This overwhelming probability
outweighs the fact that ivy-league deans are more likely to enjoy books on philosophy and tweed jackets
than truck drivers. This prior knowledge influences our decision in classification, and we can use this prior information, as the probability of belonging to a certain class P(Class), to make our classifications.
Now that we have the prior P(Class) and the likelihood of a given feature, P(feature | Class), we can simply apply Bayes' rule to determine the posterior P(Class | feature), since we are interested in classification based on features.
P(Class | feature) = P(feature | Class) · P(Class) / P(feature)

This is the application of Bayes' rule to our problem, with the probabilities defined as shown above. More generally, we can think of the probabilities as

posterior = (likelihood × prior) / evidence,

where the posterior is our belief after the evidence, the likelihood evaluates the evidence, and the prior is our belief before the evidence.
We can take log probabilities to build a discriminant function, classifying into Class 1 when

log [ P(feature | Class 1) · P(Class 1) / ( P(feature | Class 2) · P(Class 2) ) ] > 0.
Classifiers require a set of labeled data points. The number of data points within this data set directly impacts the quality of the classifier; however, the number of data points necessary to build a training set is largely dependent on the problem to be solved. The importance of the training set is to correctly estimate the feature-given-class likelihoods. This can be done using a multinomial distribution.
We also need P(Class) to complete the implementation of a Bayes classifier. There are three general approaches:

(1) Estimate the priors by counting and dividing by the total, to obtain the fraction of classes in a training set (again, this emphasizes the quality of the training set data).
(2) Estimate from expert knowledge: we might have information from actual experimental work, or reasoning about the distribution from higher-level data.
(3) Assume no prior knowledge, so all classes are equally likely.
Going back to our original example of mitochondrial proteins, we were looking at 7 features and not just a single feature. This makes things much more complicated and exemplifies the curse of dimensionality. We would require huge amounts of data to be able to implement the classifier. Most likely we would end up with zero probability for a certain combination of features for a particular class, which might not be an accurate representation of the real probabilities. We can get around this by assuming the features are independent (this forms the crux of a Naïve Bayes approach). Once we assume independence, we can now take the product of all the conditional probabilities and we only have to estimate the individual distributions for each feature. So we now have

P(f1, f2, ..., fN | Class) = P(f1 | Class) · P(f2 | Class) · · · P(fN | Class),

where f1 represents feature 1. Similarly, the discriminant function becomes

log [ P(f1, ..., fN | Class 1) · P(Class 1) / ( P(f1, ..., fN | Class 2) · P(Class 2) ) ] > 0.
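A minimal sketch (ours) of a two-class Naïve Bayes classifier implementing the discriminant above over discrete features; the toy feature encoding is invented, and we use add-one smoothing, a standard guard against the zero-probability problem just mentioned:

import math
from collections import Counter, defaultdict

class NaiveBayes:
    # Two-class Naive Bayes over discrete (here binary) feature values.
    # Assumes feature independence given the class.
    def fit(self, X, y):
        self.priors = {c: n / len(y) for c, n in Counter(y).items()}
        self.cond = defaultdict(Counter)   # (class, feature index) -> value counts
        self.class_n = Counter(y)
        for features, c in zip(X, y):
            for i, v in enumerate(features):
                self.cond[(c, i)][v] += 1
        return self

    def discriminant(self, features):
        # log[P(f|C1)P(C1)] - log[P(f|C2)P(C2)]; > 0 means class 1
        c1, c2 = sorted(self.priors)
        g = math.log(self.priors[c1]) - math.log(self.priors[c2])
        for i, v in enumerate(features):
            # add-one smoothing keeps unseen values from giving log(0)
            p1 = (self.cond[(c1, i)][v] + 1) / (self.class_n[c1] + 2)
            p2 = (self.cond[(c2, i)][v] + 1) / (self.class_n[c2] + 2)
            g += math.log(p1) - math.log(p2)
        return g

# Toy data: each row is (targeting_signal, coexpression), class 'mito' or 'other'
X = [(1, 1), (1, 0), (0, 1), (0, 0), (0, 0), (0, 0)]
y = ['mito', 'mito', 'mito', 'other', 'other', 'other']
model = NaiveBayes().fit(X, y)
print(model.discriminant((1, 1)) > 0)   # True: classified as 'mito'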
The classifier must be tested on a different set (termed the labeled test set) than the training set. Apparently, a large number of publications continue to test their classifiers solely with their training data sets (resulting in skewed performance metrics). To perform a robust classification we classify each object in the test set and count the number of each type of error. In the case of a binary classification (i.e., an object either belongs to Class 1 or Class 2), we have the following four types of errors:

(1) True positive (TP)
(2) True negative (TN)
(3) False positive (FP)
(4) False negative (FN)

The frequency of these errors can be encapsulated in performance metrics of a classifier, which are defined as follows.

(1) Sensitivity determines how good we are at classifying: what fraction of the objects actually in Class 1 did we classify as Class 1?

Sensitivity = TP / (TP + FN)

(2) Specificity determines how tolerant we are to misclassifications: what fraction of the objects not in Class 1 did we correctly leave out of Class 1?

Specificity = TN / (TN + FP)
The higher the sensitivity (think of it as minimizing FN) and the higher the specificity (think of it as minimizing FP), the better the classifier. However, achieving a high score in one of these areas (and not the other) doesn't amount to much. Consider the classification of mitochondrial proteins: we can classify all data points as mitochondrial proteins and get 100% sensitivity but 0% specificity. The MAESTRO algorithm achieves a 99% specificity and a 71% sensitivity for the classification of mitochondrial proteins.
In all, we have defined Bayesian classifiers, shown how to determine all the probabilities required to implement a Bayesian classifier, and discussed a particular implementation that assumes independence of features (Naïve Bayes classifiers). The final section describes the specifics of applying classification to determine mitochondrial proteins.
2.6 MAESTRO Mitochondrial Protein Classification
Calvo et al. (2006) sought to construct high-quality predictions of human proteins localized to the mitochondrion by generating and integrating data sets that provide complementary clues about mitochondrial localization. Specifically, for each human gene product p, they assign a score s_i(p), using each of the following seven genome-scale data sets: targeting signal score, protein domain score, cis-motif score, yeast homology score, ancestry score, coexpression score, and induction score (details of the meaning and content of each of these data sets can be found in the manuscript). Each of these scores (s_1 to s_7) can be used individually as a weak genome-wide predictor of mitochondrial localization. Each method's performance was assessed using large gold-standard curated training sets: 654 mitochondrial proteins (T_mito) maintained by the MitoP2 database and 2,847 nonmitochondrial proteins (T_~mito) annotated to localize to other cellular compartments. To improve prediction accuracy, the authors integrated these seven approaches using a naïve Bayes classifier that was implemented as a program called MAESTRO.
When MAESTRO was applied across the human proteome, 1451 proteins were predicted as mitochondrial proteins and 450 novel protein predictions were made. As mentioned in the previous section, the MAESTRO algorithm achieves a 99% specificity and a 71% sensitivity for the classification of mitochondrial proteins, suggesting that even with the assumption of feature independence, Naïve Bayes classification techniques can prove extremely powerful for large-scale (i.e., genome-wide) classification.
The previous section looked at using probabilistic (or generative) models for classification; this section looks at using discriminative techniques. In essence, can we run our data through a function to determine its structure? Such discriminative techniques avoid the inherent cost involved in generative models, which might require more information than is actually necessary.

Support vector machine techniques essentially involve drawing a vector that is perpendicular to the line optimally separating the data. This vector is determined from a set of support vectors among the individual data points. Interestingly, the vector can be determined from the scalar dot-products of these support vectors, so even if we go to higher dimensions, we move from points in multi-dimensional space to just a simple dot-product (allowing a simpler implementation for higher-order data).

When the data are not linearly separable, we can use a kernel function (essentially a transform) to map to a higher-dimensional space where the data are linearly separable. The beauty of this technique is that, since the SVM technique only requires the dot-product of the support vectors, we can work directly with the kernel mapping matrix and not with the actual data points. There are a number of transformations (such as polynomial and sigmoidal) that serve as good kernel mapping functions and are further explored in the next section.
3.1 Non-Linear Classification
What if the data are not linearly separable? One way to deal with this is just to try to find the best linear classifier for the data, and accept mistakes. Another, completely different, way to deal with this comes from the fact that the more dimensions there are, the more likely it is that the data set is linearly separable. In a certain sense, for any finite data set, as the number of dimensions goes to infinity, the probability that the data is separable goes to 1.
How can we take advantage of this? Consider the two classes of one-dimensional data {-5, -4, -3, 3, 4, 5} and {-2, -1, 0, 1, 2}. This data is clearly not linearly separable, and the best separation boundary we can find might be x > -2.5.

Now, consider applying the transformation x → (x, y = x²). The data can now be written as new pairs: {-5, -4, -3, 3, 4, 5} → (-5, 25), (-4, 16), (-3, 9), (3, 9), (4, 16), (5, 25), and {-2, -1, 0, 1, 2} → (-2, 4), (-1, 1), (0, 0), (1, 1), (2, 4). This data is separable by the rule y > 6.5, and in general the more dimensions we transform to, the more separable the data become.

An alternate way of thinking of this problem is to transform the classifier back into the original low-dimensional space. In this particular example, we would get the rule x² > 6.5, which would bisect the number line at two points. In general, the higher the dimensionality of the space we transform to, the more complicated a classifier we get when we transform back to the original space.
3.2 Kernels
We see that SVMs are dependent only on the dot products of the vectors. So, if we call our transformation Φ, for two vectors we only care about the value of Φ(u_1) · Φ(u_2). The trick with the use of kernels consists of realizing that for certain transformations Φ, there exists a function K(u_1, u_2) such that

K(u_1, u_2) = Φ(u_1) · Φ(u_2).

In the above relation, the right-hand side is the dot product of vectors with very high dimension, but the left-hand side is a function of two vectors with lower dimension. In our previous example of mapping x → (x, y = x²), we get

K(x_1, x_2) = (x_1, x_1²) · (x_2, x_2²) = x_1 x_2 + (x_1 x_2)².

Now we need not actually apply the transformation: we can do all our calculations in the lower-dimensional space, but get all the power of using a higher dimension.

Example kernels are the following:

(1) Linear kernel: K(u_1, u_2) = u_1 · u_2, which represents the trivial mapping Φ(x) = x.

(2) Polynomial kernel: K(u_1, u_2) = (1 + u_1 · u_2)^n, which was used in the previous example; if we had instead used the transformation Φ(x) = (1, sqrt(2)·x, x²), we would get the more polynomial-looking kernel Φ(u_1) · Φ(u_2) = (1 + u_1 u_2)².

(3) Radial basis kernel: K(u_1, u_2) = exp(-γ |u_1 - u_2|²). This transformation is actually from a point u to a function (which can be thought of as a point in Hilbert space) in an infinite-dimensional space. So what we're actually doing is transforming our training set into functions, and combining them to get a decision boundary. The functions are Gaussians centered at the input points.

(4) Sigmoid kernel: K(u_1, u_2) = tanh[γ(u_1 · u_2) + r]. Sigmoid kernels have been popular for use in SVMs due to their origins in neural networks (e.g., sigmoid kernel functions are equivalent to two-level perceptron neural networks). It has been pointed out in previous work (Vapnik 1995) that the kernel matrix may not be positive semi-definite for certain values of the parameters γ and r. The sigmoid kernel has nevertheless been used in practical applications (see Schölkopf 1997).
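A quick numerical check (ours) that the polynomial kernel of item (2) really equals a dot product in the transformed space, using the mapping Φ(x) = (1, sqrt(2)·x, x²):

import math

def phi(x):
    return (1.0, math.sqrt(2) * x, x * x)   # explicit feature map

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def K(x1, x2):
    return (1 + x1 * x2) ** 2               # kernel: no mapping needed

for x1, x2 in [(-3, 2), (0.5, 4), (1, 1)]:
    assert abs(K(x1, x2) - dot(phi(x1), phi(x2))) < 1e-9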
One of the caveats of transforming the input data using a kernel is the risk of over-fitting (or over-classifying) the data. More generally, the SVM may generate so many feature vector dimensions that it does not generalize well to other data. To avoid over-fitting, cross-validation is typically used to evaluate the fitting provided by each parameter set tried during the grid or pattern search process. In the radial-basis kernel, you can essentially increase the value of γ until each point is within its own classification region (thereby defeating the classification process altogether). SVMs generally avoid this problem of over-fitting due to the fact that they maximize margins between data points.

When using difficult-to-separate training sets, SVMs can incorporate a cost parameter C to allow some flexibility in separating the categories. This parameter controls the trade-off between allowing training errors and forcing rigid margins. It can thereby create a soft margin that permits some misclassifications. Increasing the value of C increases the cost of misclassifying points and forces the creation of a more accurate model that may not generalize well.

Can we use just any function as our kernel? The answer to this is provided by Mercer's Condition, which provides us an analytical criterion for choosing an acceptable kernel. Mercer's Condition states that a kernel K(x, y) is a valid kernel if and only if the following holds (Burges 1998): for any g(x) such that ∫ g(x)² dx is finite,

∫∫ K(x, y) g(x) g(y) dx dy ≥ 0.

In all, we have defined SVM discriminators and shown how to perform classification with appropriate kernel mapping functions that allow performing computations in a lower dimension while capturing all the information available at higher dimensions. The next section describes the application of SVMs to the classification of tumors for cancer diagnostics.

3.3 Tumor Classifications with SVMs

A generic approach for classifying two types of acute leukemias, acute myeloid leukemia (AML) and acute lymphoid leukemia (ALL), was presented by Golub et al. (1999). This approach centered on effectively addressing three main issues:

(1) whether there were genes whose expression pattern to be predicted was strongly correlated with the class distinction (i.e., can ALL and AML be distinguished);
(2) how to use a collection of known samples to create a class predictor capable of assigning a new sample to one of two classes;
(3) how to test the validity of their class predictors.
They addressed (1) by using a neighbourhood analysis technique to establish whether the observed correlations were stronger than would be expected by chance. This analysis showed that roughly 1100 genes were more highly correlated with the AML-ALL class distinction than would be expected by chance. To address (2), they developed a procedure that uses a fixed subset of informative genes (chosen based on their correlation with the class distinction of AML and ALL) and makes a prediction based on the expression level of these genes in a new sample. Each informative gene casts a weighted vote for one of the classes, with the weight of each vote dependent on the expression level in the new sample and the degree of that gene's correlation with the class distinction. The votes are summed to determine the winning class. To address (3), they effectively tested their predictor by first testing by cross-validation on the initial data set and then assessing its accuracy on an independent set of samples. Based on their tests, they were able to identify 36 of the 38 samples (which were part of their training set!) and all 36 predictions were clinically correct. On the independent test set, 29 of 34 samples were strongly predicted with 100% accuracy and 5 were not predicted.

An SVM approach to this same classification problem was implemented by Mukherjee et al. (2001). The output of a classical SVM is a binary class designation. In this particular application it is particularly important to be able to reject points for which the classifier is not confident enough. Therefore, the authors introduced a confidence interval on the output of the SVM that allows for rejection of points with low confidence values. As in the case of Golub et al., it was important for the authors to infer which genes are important for the classification. The SVM was trained on the 38 samples in the training set and tested on the 34 samples in the independent test set (exactly as in the case of Golub et al.). The authors' results are summarized in the following table (where |d| corresponds to the cutoff for rejection):

Genes   Rejects   Errors   Confidence Level   |d|
7129    3         0        ~93%               0.1
40      0         0        ~93%               0.1
5       3         0        ~92%               0.1

These results are a significant improvement over previously reported techniques, suggesting that SVMs play an important role in the classification of large data sets (such as those generated by DNA microarray experiments).

4. Semi-supervised learning

Semi-supervised learning deals with the situation in which one has only a few labeled data points and a large number of unlabeled data points; one approach is clustering of the data first, followed by the classification of the generated clusters.
Viterbi decoding calculates

π̂_i = i-th position of argmax_π P(X, π),

whereas posterior decoding calculates

π̂_i = argmax_k P(π_i = k | X),

where

P(π_i = k | X) = (1 / P(X)) · Σ_{π : π_i = k} P(X, π).

The term Σ_{π : π_i = k} P(X, π) expresses the sum of the probabilities of all paths producing X that pass through state k in position i. Since P(X) is independent of k, dividing by it does not affect the argmax, so we can write

π̂_i = argmax_k Σ_{π : π_i = k} P(X, π).
Recall that P(X) was calculated by the forward algorithm using the recursion for

f_k(i) = P(x_1 ... x_i, π_i = k)

given on the right-middle slide of page 3. A simple way to calculate Σ_{π : π_i = k} P(X, π) would be to use this same recursion, except at the i+1 step, instead of summing over all states, just use state k. The problem with this is that if we wanted to calculate π̂ we would need to do this calculation for all i and all k.
The key observation is that

Σ_{π : π_i = k} P(X, π) = P(π_i = k, X) = P(x_1 ... x_i, π_i = k) · P(x_{i+1} ... x_N | π_i = k) = f_k(i) · b_k(i),

so we can obtain all of these sums at once by running the forward and backward algorithms and multiplying the values in the corresponding cells! Thus, computing the optimal π̂_i amounts only to testing each of the K possible states and choosing the one with maximum probability.
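A sketch (ours) of posterior decoding built from the forward and backward recursions; the toy promoter/background parameters are made up, and a real implementation would rescale or work in log space to avoid numerical underflow:

def posterior_decode(X, states, init, a, e):
    # pi_hat[i] = argmax_k f_k(i) * b_k(i), via forward f and backward b.
    # init[k], a[k][l], e[k][x] are the HMM parameters.
    N = len(X)
    # forward: f[i][k] = P(x_1..x_i, pi_i = k)
    f = [{k: init[k] * e[k][X[0]] for k in states}]
    for i in range(1, N):
        f.append({l: e[l][X[i]] * sum(f[i-1][k] * a[k][l] for k in states)
                  for l in states})
    # backward: b[i][k] = P(x_{i+1}..x_N | pi_i = k)
    b = [None] * N
    b[N-1] = {k: 1.0 for k in states}
    for i in range(N - 2, -1, -1):
        b[i] = {k: sum(a[k][l] * e[l][X[i+1]] * b[i+1][l] for l in states)
                for k in states}
    # posterior choice at each position
    return [max(states, key=lambda k: f[i][k] * b[i][k]) for i in range(N)]

# Toy 2-state promoter/background model with made-up parameters
states = ['P', 'B']
init = {'P': 0.5, 'B': 0.5}
a = {'P': {'P': 0.9, 'B': 0.1}, 'B': {'P': 0.1, 'B': 0.9}}
e = {'P': {'A': 0.1, 'C': 0.4, 'G': 0.4, 'T': 0.1},
     'B': {'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25}}
print(''.join(posterior_decode("ACGCGGTATA", states, init, a, e)))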
Which of Viterbi or posterior decoding is better for real applications? The answer depends partly on how likely the Viterbi optimal path is (i.e., on how much of the probability mass is concentrated on it compared to other paths): if the optimal path is only marginally better than alternatives, posterior decoding is probably more useful. Additionally, the application itself plays a crucial role in determining the algorithm to apply. For example, if a biologist had identified a possible promoter region but wanted to be more sure before running experiments, he or she might wish to apply posterior decoding to estimate the likelihood (taking into account all possible paths) that the region of interest was indeed a promoter region. There are other examples, however, in which Viterbi decoding is more appropriate. Some applications even combine the two algorithms.
3 Encoding memory in an HMM
Before moving on to discussing learning, we pause to describe an application of HMMs
requiring a new trick: incorporating memory into a Markov model by increasing the
number of states. First we present a motivating biological example.
Recall that in the previous lecture, we built an HMM with two states: a high-CG
promoter state with elevated probabilities of emitting Cs and Gs, and a background state
with uniform probabilities of emitting all nucleotides. Thus, our previous model, which we
will call HMM1, characterized promoters simply by abundance of Cs and Gs. However, in
reality the situation is more complicated. As we shall see, a more accurate model, which we
call HMM2, would instead monitor the abundance of CpG pairs, i.e., C and G nucleotides on
the same strand (separated by a phosphate on the DNA backbone, hence the name CpG).
Note that we call these CpGs simply to avoid confusion with bonded complementary CG
bases (on opposing strands).
Some biological background will shed light on the reason for introducing HMM2. In the genome, not only the sequence of bases but also their methylation states determine the biological processes that act on DNA. That is, a methyl group can be added to a nucleotide, thus marking it for future reference. Proteins that later bind to the DNA (e.g., for transcription or replication purposes) notice such modifications that have been made to particular bases. To elaborate on one interesting example, methylation allows error-checking in the DNA replication process. Recall that replication is performed by unzipping the DNA polymer and then rebinding complementary bases to each strand, resulting in a pair of new double-stranded DNA helices. If, during this process, a discrepancy is found between an old strand of DNA and a new strand (i.e., a pair of bases is not complementary), then the old (correct) base is identified by its being methylated. Thus, the new, unmethylated base can be removed and replaced with a base actually complementary to the original one.

Now, a side effect of methylation is that in a CpG pair, a methylated C has a high chance of mutating to a T. Hence, dinucleotide CpGs are rare throughout the genome. However, in active promoter regions, methylation is suppressed, resulting in a greater abundance of CpG dinucleotides. Such regions are called CpG islands. These few-hundred to few-thousand base-long islands thus allow us to identify promoters with greater accuracy than classifying by abundance of Cs and Gs alone.
All of this raises the question of how to encode dinucleotides in a Markov model, which by definition is memoryless: the next state depends only on the current state via predefined transition probabilities. Fortunately, we can overcome this barrier simply by increasing the number of states in our Markov model (thus increasing the information stored in a single state). That is, we encode in a state all the information we need to remember in order to know how to move to the next state. In general, if we have an "almost" HMM that determines transition probabilities based not only on the single previous state but also on a finite history (possibly of both states and emissions), we can convert this to a true memoryless HMM by increasing the number of states.

In our particular example, we wish to create HMM2 such that dinucleotide CpGs have high frequency in promoter states and low frequency in the background. To do this, we use eight states rather than just two, with each state encoding both the promoter/background state as well as the current emission. We thus need to specify an 8 × 8 table of transition probabilities. (Note, however, that each emission probability from a given state is now either 0 or 1, since the state already contains the information of which nucleotide to emit.) Details of this construction are included in the lecture slides. (But note that on the middle-right slide on page 4 it incorrectly states that the emission probabilities are distinct for the + and - states. It is actually the transition probabilities that are different; the emission probabilities are just 0 and 1 in both cases.)
Although an 8-state model would normally have 64 transition probabilities, the slides show only 32; transition probabilities between + and - states are not included. Since transitions to and from CpG islands are relatively rare, it would require a very large amount of training data to infer what all of these probabilities are. A simplification is to assume that all transition probabilities from a + state to a - state are the same, regardless of the nucleotides between which we are transitioning, and to similarly assume a single probability for transitions from a - to a + state. In that case we would need these two transition parameters in addition to the 16 shown on the slides. If our assumptions turn out to be biologically incorrect, then more parameters would be needed to have an accurate model.
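A sketch (ours) of the state-expansion trick under the simplification just described: each HMM2 state is a (region, nucleotide) pair, emissions are deterministic 0/1, within-region moves follow two 4 × 4 dinucleotide tables, and the rare region switches are governed by just two shared parameters:

def build_hmm2(trans_plus, trans_minus, p_out, p_in):
    # States are pairs like ('+', 'C'). trans_plus[x][y] is the CpG-island
    # dinucleotide table, trans_minus the background one. Emission
    # probabilities are 0/1: the state records which nucleotide it emits.
    nts = 'ACGT'
    states = [(r, x) for r in '+-' for x in nts]
    trans = {}
    for (r1, x) in states:
        for (r2, y) in states:
            if r1 == '+' and r2 == '+':
                trans[((r1, x), (r2, y))] = (1 - p_out) * trans_plus[x][y]
            elif r1 == '-' and r2 == '-':
                trans[((r1, x), (r2, y))] = (1 - p_in) * trans_minus[x][y]
            elif r1 == '+':
                trans[((r1, x), (r2, y))] = p_out / 4   # one shared leave-island parameter
            else:
                trans[((r1, x), (r2, y))] = p_in / 4    # one shared enter-island parameter
    emit = {(r, x): {y: 1.0 if y == x else 0.0 for y in nts} for (r, x) in states}
    return states, trans, emit

# Toy usage with uniform within-region tables; rows must sum to 1
uniform = {x: {y: 0.25 for y in 'ACGT'} for x in 'ACGT'}
states, trans, emit = build_hmm2(uniform, uniform, p_out=0.01, p_in=0.005)
assert abs(sum(trans[(('+', 'C'), s2)] for s2 in states) - 1.0) < 1e-9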
During lecture someone pointed out that the transition probabilities given in the table at the middle-left slide of page 4 are different depending on which strand you are moving along. For example, when one strand transitions from T to C, the other transitions from G to A, yet the numbers in the table are quite different (.355 and .161). This seemed to imply that a different HMM is needed to predict CpG islands depending on which strand you are moving along. However, this reasoning was not correct. For a single HMM to work on either strand, it must produce the same frequency of TpCs on one strand as GpAs on the other. However, the frequency of TpCs is not simply the probability of transitioning from T to C. Rather, this transition probability must be multiplied by the frequency of T. We calculate, based on the transition probabilities in the + table, that this HMM would produce As and Ts with frequency about 15% each and Cs and Gs with frequency about 35% each. This calculation is in Appendix 1. When this is taken into account, the probabilities on the two strands are within around 7% of each other. For example, the frequency of TpCs is 0.15 × 0.355 ≈ 0.053, while the frequency of GpAs is 0.35 × 0.161 ≈ 0.056. We will chalk this difference up to sampling error.
4 Learning
The final topic of this lecture is learning, i.e., having a machine figure out how to model data. The word "learning" must be taken with a grain of salt. We do not give a machine raw data and nothing else and expect it to figure out how to model the data from scratch. Rather, we must use our intuition and knowledge of biology to come up with the general framework of the model. Then we ask the machine to choose parameters for our model. More specifically, given a training sequence and an HMM with known layout but unspecified parameters, how does one infer the best choice of parameters? This general class of problems splits naturally into two subclasses based on whether or not we know the hidden states in our training data.
For example, if we have a hidden Markov model to distinguish coding regions of genes from other stretches of DNA, we might train our model with a sequence in which coding genes have already been identified using other means. We would supply this information for supervised learning. On the other hand, we might be working with a DNA sequence for an organism that has not had any genes classified and that we expect would have different parameter values from any that have already been classified. In that case, we could use unsupervised learning to infer the parameters without our having to specify the genes in our training data.
4.1 Supervised learning
In this case we assume we are given the layout of the HMM, an emitted sequence X, and additionally the actual sequence π of Markov states that produced X. In other words, our training data is labeled. We wish to find the transition and emission probabilities a and e maximizing the probability P(X, π) of generating the sequence X and labels π.
This version of learning is easy: since we already know the path of states the Markov chain took, our intuition suggests that we should choose parameters proportional to the observed frequencies of the corresponding events occurring. With π_i denoting the state at position i, k and l denoting states, and x denoting a base, we optimally set
    e(k, x) = (# of times π_i = k and x emitted) / (# of times π_i = k)

and

    a(k, l) = (# of times π_{i-1} = k and π_i = l) / (# of times π_{i-1} = k).
An example is given in the slide at the top-right of page 7. The counts are slightly different depending on whether you consider the start and end to be separate states. If we ignore the start and end then, for example, a(B, P) = 0.25, since of the 4 B states (we are excluding the final one) there is exactly 1 that transitions to a P state. If we considered the end to be a separate state, then a(B, P) would be 0.2, since we consider the final B state as transitioning to the end state. Similarly, we calculate e(B, C) = 0.4 because 2 of the 5 B states emit a C.
We can use Lagrange multipliers to verify our intuition that the transition and emission probabilities a and e obtained by counting maximize the probability P(X, π) of generating the sequence X and labels π. This calculation is done in Appendix 2.
In addition, the initial state probabilities of the HMM need to be specified. We choose these as the total frequency of each state in the training data.
A problem with the method of setting probabilities using our training data is that counts of some transitions and emissions might be 0 even though the actual probability is small but non-zero. If we do not correct this, our HMM will consider these transitions and emissions to be impossible, which could significantly distort the results. To fix this we add pseudocounts, which are small non-zero counts. For example, we could set these counts to be 1, or we could make an estimate based on the a priori probabilities.
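As an illustration, here is a minimal Python sketch (not from the lecture) of supervised parameter estimation by counting, with a uniform pseudocount of 1 added to every transition and emission:

    from collections import Counter

    def train_supervised(xs, pis, states, alphabet, pseudo=1.0):
        # Count observed transitions and emissions along the labeled path.
        trans = Counter(zip(pis, pis[1:]))
        emit = Counter(zip(pis, xs))
        a, e = {}, {}
        for k in states:
            # Pseudocounts keep unseen events at small non-zero probability.
            t_tot = sum(trans[(k, l)] + pseudo for l in states)
            e_tot = sum(emit[(k, x)] + pseudo for x in alphabet)
            for l in states:
                a[(k, l)] = (trans[(k, l)] + pseudo) / t_tot
            for x in alphabet:
                e[(k, x)] = (emit[(k, x)] + pseudo) / e_tot
        return a, e

    # Example with a B/P labeling (hypothetical five-base sequence):
    a, e = train_supervised("CACGT", "BBPBB", states="BP", alphabet="ACGT")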
4.2 Unsupervised learning
Now we assume we are given the HMM layout and sequence X and wish to find parameters a, e maximizing P(X) = Σ_π P(X, π). Note that X is now unlabeled: that is, we have no knowledge of π to begin with.
The idea is that we start with some initial guess for the parameter values, use those parameter values to obtain values (or probabilities) of the hidden states, use those to calculate new parameter values, and iterate until convergence.
Our initial guess for the parameters might be guided by other knowledge. For example, when working with a DNA sequence we might start with parameter values from a related species for which we have an annotated genome (for which parameters could be obtained using supervised learning). Or, we might just use arbitrary initial parameters. (The slide on Viterbi training on page 8 describes initialization as "Same". This is a reference to initialization in the Baum-Welch algorithm slide on page 10.)
The lecture presented two methods of implementing unsupervised learning, Viterbi training and the Baum-Welch algorithm. Although Baum-Welch is theoretically more sound, Viterbi training is simpler, so we discuss it first.
Rather than maximizing P(X) = Σ_π P(X, π), at each step Viterbi training uses π*, the most-likely path. Since there are other paths, and π* usually only captures some of the probability, this is not the same as maximizing P(X), which would require looking at all paths.
In Viterbi training we use the Viterbi algorithm to find π* given the current parameter values. We then use this path to estimate the coefficients a, e in the same way we used the known path in supervised learning, by counting emissions and transitions from each state. These counts, adjusted by pseudocounts, are then used to calculate new parameter values, and we repeat until convergence.
Side note: posterior decoding can be used instead of the Viterbi algorithm, provided you use non-zero parameters so that all transitions are legal. This still uses only one path for counting, but it is π̂ instead of π*.
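Schematically, one round of Viterbi training then looks like the following Python sketch, where viterbi() is a hypothetical helper returning the most likely path under the current parameters, and the re-estimation reuses the counting routine sketched above:

    def viterbi_training(xs, states, alphabet, a, e, n_iters=20):
        # Iterate: decode the best path, then re-estimate by counting.
        for _ in range(n_iters):
            pi_star = viterbi(xs, states, a, e)   # hypothetical decoder
            a, e = train_supervised(xs, pi_star, states, alphabet)
        return a, e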
In supervised learning and in Viterbi training we count frequencies in a single path to
estimate our parameters a and e. But for particular parameter values the model determines
probabilities for every state at each position, not just a single state. How can we take all of
these probabilities into account when estimating our parameters, rather than just a single
value at each position? That is what the Baum-Welch algorithm does.
The basic framework of the Baum-Welch algorithm is the same as for Viterbi training. In both cases we start with an initial guess for parameter values, use those to gather information about the hidden states, and then use that information to calculate new parameter values. The difference is that rather than determining the counts of transitions and emissions, A_kl and E_k(b), by counting actual transitions and emissions in a particular path, we do so by adding up probabilities over all possible paths.
For A_kl, instead of counting the number of places where state k is immediately followed by state l, we add up, for all positions i, the probability that position i is in state k and position i + 1 is in state l. To do this we must add up the probabilities of all paths that pass through states k and l at positions i and i + 1, respectively.
This is very similar to the sum we calculated in posterior decoding, except we are specifying the states at two consecutive positions instead of one. As in posterior decoding it uses the f and b calculated by the forward and backward algorithms, but instead of just taking f_k(i) · b_k(i), it must take f_k(i) · b_l(i + 1) times an intermediate term that accounts for the transition from position i to position i + 1. This calculation is shown on the lower-left slide of page 9. These terms must be summed over all positions i to calculate A_kl. A similar method is used to calculate E_k(b), again using f and b.
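Written out, with a_kl and e_l the current transition and emission parameters, these are the standard Baum-Welch expected counts (the slides' exact notation may differ slightly):

    A_kl = Σ_i f_k(i) · a_kl · e_l(x_{i+1}) · b_l(i+1) / P(X)
    E_k(b) = Σ_{i : x_i = b} f_k(i) · b_k(i) / P(X)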
A result of Baum and Welch guarantees convergence. A formal proof can be obtained using the principle of expectation maximization.
Although this method is guaranteed to converge, it could converge to a local rather than a global maximum. To work around this, it is recommended that the whole procedure be repeated many times with different initial guesses for the unknown parameters, keeping the run that converges to the best result.
5 Appendix 1
In the discussion of CpG islands above, there was a need to calculate the frequencies of states given the transition probabilities. We can do that as follows. Suppose the probability of transitioning from state k to state l is given by a_kl and the probability of being in state k is p_k. We get to be in a particular state k at some position in the sequence by being at some state l at the previous position in the sequence and then transitioning to state k. In terms of probabilities, that means that p_k = Σ_l a_lk p_l. In other words, p is an eigenvector, with eigenvalue 1, of the transpose of the transition matrix. Calculating this eigenvector for the + matrix of transitions for CpG islands (slides page 4) yields an eigenvector of approximately (1, 2.35, 2.35, 1) times an arbitrary constant. Adjusting to make the probabilities add up to 1 gives p = (0.15, 0.35, 0.35, 0.15).
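This stationary distribution is easy to check numerically. A small sketch with NumPy, using the CpG-island + table as printed in Durbin et al., which appears to match the slides (swap in the slides' exact values if they differ):

    import numpy as np

    # + transition table over (A, C, G, T); rows sum to 1.
    A = np.array([[0.180, 0.274, 0.426, 0.120],
                  [0.171, 0.368, 0.274, 0.188],
                  [0.161, 0.339, 0.375, 0.125],
                  [0.079, 0.355, 0.384, 0.182]])

    vals, vecs = np.linalg.eig(A.T)            # p solves A^T p = p
    p = np.real(vecs[:, np.argmax(np.real(vals))])
    p /= p.sum()                               # normalize to probabilities
    print(p)                                   # approx (0.15, 0.35, 0.35, 0.15)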
6 Appendix 2
We now use Lagrange multipliers to verify that during supervised learning the transition and emission probabilities a and e obtained by counting maximize the probability P(X, π) of generating the sequence X and labels π.
We are looking for

    argmax_{a,e} P(X, π) = argmax_{a,e} Π_i e_{π_i}(x_i) · a_{π_{i-1} π_i}
                         = argmax_{a,e} [ Σ_i log(e_{π_i}(x_i)) + Σ_i log(a_{π_{i-1} π_i}) ].

Since the first term in the sum only depends on e, and the second only depends on a, we are looking for

    argmax_e Σ_i log(e_{π_i}(x_i))

and

    argmax_a Σ_i log(a_{π_{i-1} π_i}).

We will verify our intuitive answer only for the second sum, but the same method works for the first one as well.
Let

    C_kl = # of times π_{i-1} = k and π_i = l.

Then we can rewrite our expression for a as

    argmax_a Σ_{k,l} C_kl log(a_kl).

Fix a particular value of k. Since state k always transitions to exactly one next state, we know that Σ_l a_kl = 1. We want to find the values of a_{k1}, a_{k2}, ..., a_{kK} that maximize Σ_l C_kl log(a_kl) given this constraint. From the method of Lagrange multipliers, we know this maximum occurs at a local extremum of Σ_l C_kl log(a_kl) - λ Σ_l a_kl for some value of λ. Taking derivatives with respect to a_kl for each l and setting them all to 0, we get

    C_kl / a_kl = λ, so a_kl = C_kl / λ.

Since Σ_l a_kl = 1, we have λ = Σ_l C_kl, and therefore

    a_kl = C_kl / Σ_{l'} C_{kl'} = a(k, l),

as was to be demonstrated.
The substitution matrices for the Jukes-Cantor and Kimura models, respectively, take the forms

    r s s s        r s u u
    s r s s        s r u u
    s s r s        u u r s
    s s s r        u u s r

where s = (1/4)(1 - e^{-4αt}), u = (1/4)(1 + e^{-4αt} - 2e^{-2(α+β)t}), and r is chosen so that each row sums to 1.
From distances to trees
Given our generative Markov models and corresponding (time-dependent) substitution matrices, we are now ready to determine a distance matrix. The elements d_ij of this distance matrix are a measure of the distance between two well-aligned sequences. One possibility is to define the distance matrix (d_ij) in terms of the fraction of sites f where two sequences x_i and x_j are mismatched:

    d_ij = -(3/4) log(1 - 4f/3).

This model blows up when f = 0.75, which places a threshold on the amount of mismatch between two sequences.
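A direct Python sketch of this distance (jc_distance is a hypothetical helper written from the formula above):

    import math

    def jc_distance(seq_a, seq_b):
        # Jukes-Cantor distance from the fraction of mismatched sites f.
        assert len(seq_a) == len(seq_b)
        f = sum(x != y for x, y in zip(seq_a, seq_b)) / len(seq_a)
        if f >= 0.75:
            raise ValueError("distance undefined for f >= 0.75")
        return -0.75 * math.log(1 - 4 * f / 3)

    print(jc_distance("ACGTACGT", "ACGTACGA"))  # one mismatch in eight sites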
To then use this distance matrix to measure actual distances between any pair of sequences (i.e., to construct a tree), there are two standard approaches: 1) ultrametric distances infer equidistant paths from any leaf to the root, while 2) additive distances dictate that all pairwise distances are obtained by traversing a tree. Ultrametric trees are rarely valid, since they assume a uniform rate of evolution, while additive distances are a less restrictive model. In practice, distance matrices are neither ultrametric nor additive.
The distance matrix and tree duality also means that distances can be inferred by traversing the tree. If ultrametric distances are used, we have a guarantee of finding the correct tree by minimizing the discrepancy between observed distances and tree-based distances. On the other hand, if additive distances are used, neighbor-joining is guaranteed to find the correct tree. The tree-building algorithm can then be thought of as fitting data to constraints.
4 Tree building algorithms
4.1 UPGMA
UPGMA (Unweighted Pair Group Method with Arithmetic Mean) is the simplest example of a
tree-building algorithm. It involves a hierarchical clustering algorithm that begins at the leaves
of the tree and works it way up to the root. As input, it takes a distance matrix and produces
an ultrametric tree (ie. in accordance with the molecular clock hypothesis of equal evolutionary
rates among species). Only when the input distance matrix is ultrametic is the UPGMA algorithm
guaranteed to form the correct tree. If, on the other hand, the distance matrix is additive, there is
no guarantee that the pairwise branch distances equal those specied in the distance matrix.
ALGORITHM:
1. Begin by initializing all sequences each to its own cluster. These will form the leaves of the
tree.
2. Find the pair of sequences with minimum distance from the distance matrix D. This pair
forms the rst cluster, and we can draw the rst part of the tree by linking the pair. (example:
from D, we nd seqA and seqB have the minimum distance of 10. We draw the tree by linking
seqA and seqB, with branch length of 5 (hence a total distance of 10 between them)).
3. Update matrix D: a new row and column representing seq_AB is added to D. The distance between seq_AB and another seq_C would be (1/2)(d_AC + d_BC). Remove the rows and columns associated with seqA and seqB. Note, overall, the matrix has shrunk by 1 row and 1 column. From this point on, we can completely forget about seqA and seqB, and assume we just have seq_AB.
4. Repeat steps 2 and 3 until matrix D is empty. (A sketch of the whole procedure follows below.)
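A compact Python sketch of UPGMA under these conventions, using the simple (1/2)(d_AC + d_BC) update from step 3 and returning only the topology (branch lengths omitted for brevity):

    def upgma(D, leaves):
        # D: dict mapping frozenset of two cluster labels -> distance.
        clusters = list(leaves)
        while len(clusters) > 1:
            # Step 2: find the closest pair of clusters.
            pairs = [(i, j) for i in range(len(clusters))
                     for j in range(i + 1, len(clusters))]
            i, j = min(pairs, key=lambda p: D[frozenset((clusters[p[0]],
                                                         clusters[p[1]]))])
            a, b = clusters[i], clusters[j]
            merged = (a, b)                  # new internal node
            # Step 3: mean distance from the merged cluster to the rest.
            for c in clusters:
                if c is not a and c is not b:
                    D[frozenset((merged, c))] = 0.5 * (D[frozenset((a, c))]
                                                       + D[frozenset((b, c))])
            clusters = [c for c in clusters if c is not a and c is not b]
            clusters.append(merged)
        return clusters[0]                   # nested-tuple tree topology

    # Usage: D = {frozenset(('A','B')): 10., frozenset(('A','C')): 18.,
    #             frozenset(('B','C')): 18.}
    # upgma(D, ['A', 'B', 'C'])  ->  ('C', ('A', 'B'))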
4.2 Neighbor-Joining
The more sophisticated N-J algorithm is used to generate phylogram trees in which branch lengths
are proportional to evolutionary rates. The N-J algorithm is guaranteed to produce the correct tree
if the input distance matrix is additive and may still produce a good tree even when the distance
matrix is not additive.
ALGORITHM:
1. Make a new matrix M from the distance matrix D with the same dimensions:

    M_ij = D_ij - (Σ_k (D_ik + D_jk)) / (N - 2),

where N is the number of sequences. This is the adjusted distance metric, which ensures M_ij is minimal iff i and j are neighbors (see proof in Durbin, p. 189).
2. (Similar to UPGMA step 2.) Find the pair of sequences with minimum distance in the new matrix M. This pair forms the first cluster, and we can draw the first part of the tree by linking the pair. (Example: from M, we find seqA and seqB have the minimum distance. We draw the tree by linking seqA and seqB via a new node U. The branch length from A to U is calculated as

    D_AU = (1/2) D_AB + (Σ_k (D_Ak - D_Bk)) / (2(N - 2)).

Also, D_BU = D_AB - D_AU.)
3. (Similar to UPGMA step 3.) Update matrix D. A new row and column representing node U is added to D. The distance between U and another seqC would be (1/2)(d_AC + d_BC - d_AB). Remove the rows and columns associated with seqA and seqB. Note, overall, the matrix has shrunk by 1 row and 1 column. From this point on, we can completely forget about seqA and seqB, and assume we just have node U.
4. Repeat steps 1 through 3 until matrix D is empty. (A sketch of step 1 follows below.)
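For step 1, the adjusted matrix is a one-liner per entry; a small sketch with D as a NumPy array (the example values are illustrative, not from the lecture):

    import numpy as np

    def nj_adjusted_matrix(D):
        # M_ij = D_ij - (r_i + r_j) / (N - 2), where r_i = sum_k D_ik.
        N = D.shape[0]
        r = D.sum(axis=1)
        M = D - (r[:, None] + r[None, :]) / (N - 2)
        np.fill_diagonal(M, np.inf)   # never pair a cluster with itself
        return M

    D = np.array([[0., 5., 9., 9.], [5., 0., 10., 10.],
                  [9., 10., 0., 8.], [9., 10., 8., 0.]])
    i, j = np.unravel_index(np.argmin(nj_adjusted_matrix(D)), D.shape)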
4.3 Parsimony
Another approach to building trees is parsimony, a method that doesn't rely on distance matrices but instead on sequence alignments. Parsimony finds the tree that explains the observed sequences with the minimal number of substitutions. The algorithm entails two computational subproblems: 1) finding the parsimony cost of a given tree, and 2) searching through all tree topologies. The first subproblem is straightforward, while the second is computationally very challenging and best approached with Monte Carlo methods. As there is no closed-form solution, and hence no optimality criterion while searching through all topologies, a heuristic search is appropriate. Parsimony uses dynamic programming in scoring and traceback to determine ancestral nucleotides.
4.3.1 Scoring
The parsimony scoring algorithm is akin to dynamic programming as it makes local assignments at each step: penalizing sequence mismatches while assigning no score for sequence matches.
Initialization: Set cost C = 0; k = 2N - 1.
Iteration:
If k is a leaf, set R_k = x_k[u].
If k is not a leaf:
Let i, j be the daughter nodes;
Set R_k = R_i ∩ R_j, if the intersection is nonempty;
Set R_k = R_i ∪ R_j, and C += 1, if the intersection is empty.
Termination: Minimal cost of the tree for column u = C.
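A runnable sketch of this scoring pass (Fitch's algorithm) for one alignment column, with trees as nested tuples and leaves as single characters:

    def fitch_score(tree):
        # Returns (candidate set R_k, cost C) for one alignment column.
        if isinstance(tree, str):          # leaf: R_k = observed nucleotide
            return {tree}, 0
        left, right = tree
        R_i, C_i = fitch_score(left)
        R_j, C_j = fitch_score(right)
        if R_i & R_j:                      # nonempty intersection: no charge
            return R_i & R_j, C_i + C_j
        return R_i | R_j, C_i + C_j + 1    # empty: take union, cost += 1

    tree = (('A', 'C'), ('A', 'G'))
    R, C = fitch_score(tree)   # R = {'A'}, C = 2 substitutions in this column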
4.3.2 Traceback
The traceback routine to find ancestral nucleotides involves finding a path through the tree from the leaf nodes to an ancestral nucleotide (which may or may not lead to the root of the tree). It can be summarized as follows:
If the intersection of two sets (A and B) is empty, the ancestor is A or B, with equal cost.
If the intersection of two sets is non-empty, the ancestor is the intersection, with minimal cost.
This method determines a non-unique path with a minimum assignment of substitutions for a given tree topology, given hidden internal/intermediate nodes (which may correspond to extinct species).
4.3.3 Bootstrapping
A shortcut to building trees based on parsimony is to build trees using only one column of the multiple sequence alignment matrix. If this is repeated many times and a histogram is tabulated of the particular trees constructed, the most likely tree can easily be found. The advantage of such an approach is obvious: reduced complexity, at the expense of accuracy.
4.4 Maximum Likelihood
Maximum likelihood methods combine statistical models with observed evolutionary data. They can be used to predict a host of interesting and realistic features - character and rate analysis, sequences of (hypothetical) extinct species - but do so at the expense of increased complexity.
    P(a_1, ..., a_n) = (n! / θ^(n)) Π_{j=1}^n (θ/j)^{a_j} / a_j!,

where θ^(n) = θ(θ + 1) ... (θ + n - 1).
The main purpose of the Ewens sampling formula is to get an expected site frequency spectrum (Slide 6, Page 2). This formula allows us to consider questions like how many singletons we expect given that we sample 100 individuals from an infinite population.
4. Retrospective approach
4.1. Coalescent Theory. Coalescent theory was first introduced by Kingman in the early 1980s. Coalescent theory is a sample-based approach to population genetics and is specifically concerned with making inferences under a rigorous statistical framework. The word "coalescent" refers to the occurrence, backward in time, of a common ancestor between a pair of lineages that are both ancestral to a present-day sample. The coalescent is a backward-time stochastic process that models the ancestry of a sample of size n, back to the most recent common ancestor (MRCA) of the entire sample. The coalescent makes detailed predictions about patterns of genetic variation in the sample. All polymorphisms in the sample can then be demonstrated by putting mutation events on different branches of the coalescent tree (Slide 1, Page 4). The expected time to the MRCA of a population is approximately T = 4N generations in the past, but in practice the MRCA of a 20-individual sample can be very close to that of the total population. The ancestry of a sample of finite size n is composed of n - 1 independent, exponentially-distributed coalescence times, each ending with a coalescent event between a random pair of lineages. When there are i ancestral lineages, the rate of coalescence is i(i - 1)/2 and the expected time to a coalescent event is 2/(i(i - 1)).
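These expectations compose additively into the expected time to the MRCA; a quick sketch (time in the conventional units of 2N generations, so the large-n limit of 2 units corresponds to the 4N generations quoted above):

    def expected_tmrca(n):
        # Sum of expected coalescence times while i lineages remain,
        # in units of 2N generations: E[T_i] = 2 / (i * (i - 1)).
        return sum(2.0 / (i * (i - 1)) for i in range(2, n + 1))

    print(expected_tmrca(20))   # 1.9 units, i.e., already close to 4N generations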
4.2. Tests of selection within species. We construct three polymorphism summary statistics:
(1) S, the number of segregating sites in the sample;
(2) π, the average number of pairwise differences;
(3) ξ_i, the number of sites that divide the sample into i and n - i sequences.
These statistics can be used to estimate the important statistic θ (Slide 6, Page 4). Under neutrality, all of the estimates of θ should be equal. However, selection, population structure, and other evolutionary forces could change the allelic frequency spectrum in a population and lead to differences in these estimates. Based on this inference, three selection tests were proposed by Tajima and by Fu and Li, respectively (Slide 1, Page 5). In these tests, the neutral expectation is used as the null hypothesis (Slide 3, Page 5), and an alternative hypothesis (positive selection, balancing selection, population structure/subdivision, or population expansion) is tested (Slides 4, 5, 6, Page 5 and Slide 1, Page 6). For example, positive selection would lead to an excess of singletons, while balancing selection would lead to a depletion of singletons, doublets, and triplets. However, population subdivision would also yield a profile similar to balancing selection, and population expansion will cause a signal similar to positive selection. Therefore, these tests do not provide direct evidence of selection.
Several other tests have been developed that try to improve on the ideas of Tajima's D and Fu and Li's statistics. The HKA test, for example, measures the polymorphism and divergence in two loci, and tests whether there is an excess in one of the two classes (Slides 2, 3, Page 6). The MK test, on the other hand, measures the synonymous and nonsynonymous polymorphism and divergence in one locus, and tests whether there is an excess in one class (Slides 4, 5, Page 6). Hence, the MK test requires information from only a single locus, but polymorphism data from 2 species.
4.3. Rate-based selection metric. To determine if a certain gene is under selection, we may use the rate-based selection metric. Recall that an amino acid is determined by a codon of three adjacent nucleotides, and that different codons, such as CGC and CGA, can represent the same amino acid, arginine in this case. A synonymous mutation leaves the amino acid sequence intact, while a non-synonymous mutation changes it.
We use various methods, notably PAML, to compute d_N, the rate at which non-synonymous mutations occur, and d_S, the rate at which synonymous mutations occur. We define the rate-based selection metric to be d_N/d_S. There are three cases of interest:
(1) d_N/d_S < 1, purifying selection
(2) d_N/d_S = 1, neutral expectation
(3) d_N/d_S > 1, positive selection
In general, genes performing essential roles, such as those encoding hemoglobin, undergo purifying selection, because changes tend to disrupt the gene's function and cause harm. Genes that perform peripheral functions can undergo positive selection if some change is useful, but not necessary for survival. For instance, changes to a mammal's hair coloring could provide useful camouflage.
Rate-based tests could be affected if synonymous sites are not completely neutral. There is growing evidence for the use of preferred codons for certain amino acids instead of the other equivalent codons. This phenomenon is known as codon bias. There is evidence for a positive correlation between d_N/d_S and dispensability and gene length, and a negative correlation with expression level, protein abundance, codon bias, number of protein-protein interactions, and centrality in interaction networks.
1. There are efficient learning algorithms to compute the parameters of the model and efficient inference algorithms to predict hidden states;
2. It is easy to incorporate diverse evidence;
3. We can build on the best existing HMMs for gene calling.
A linear chain CRF is shown in the last slide on page 4 of [2]. This is a very simple CRF model which enables us to do efficient training and inference. In this model, the input data (DNA sequence, BLAST hits, ESTs, etc.) is denoted collectively by X. The hidden state labels (exon, intron, etc.) are denoted by Y. We arrange all the labels in a linear chain and assign a set of feature functions to every clique {y_{i-1}, y_i, X}, denoted f_j(y_{i-1}, y_i, X), where j = 1, ..., J. We also assign a weight λ_j to each feature function and finally define the conditional probability to be

    P(Y|X) = (1/Z(X)) exp( Σ_{j=1}^J Σ_{i=1}^N λ_j f_j(y_i, y_{i-1}, X) ),

where the normalization factor Z(X) is given by

    Z(X) = Σ_Y exp( Σ_{j=1}^J Σ_{i=1}^N λ_j f_j(y_i, y_{i-1}, X) ).

Doing so guarantees that

    Σ_Y P(Y|X) = 1,

as required by the semantics of probabilities.
The basic idea underlying this formalism is that the feature functions {f_j} return real values on pairs of labels and input data that we think are important for determining P(Y|X). We may not know how the conditional probability is changed by a feature function, or what its dependence on other evidence is. However, we model this dependency with the weights {λ_j} and learn these weights to maximize the likelihood of the training data. Finally, we normalize the probability to ensure that P(Y|X) sums to one over all possible Y's. The advantage of this formalism is that it captures the conjunction between labels and input data directly, without resort to P(X).
There are three fundamental steps in applying CRF models to gene prediction: design, inference, and learning. In the following sections we will look at these three steps in detail.
CRF Design
Selecting feature functions is the core issue in CRF applications. Technically speaking, feature functions can be arbitrary functions that return real values for some pair of labels (y_i, y_{i-1}) and the input X. However, we want to select feature functions that capture constraints or conjunctions of label pairs (y_i, y_{i-1}) and the input X that we think are important for P(Y|X).
Commonly-used feature functions include:
- an indicator function, where f_j(y_i, y_{i-1}, X) = 1 for a certain clique {y_i, y_{i-1}, X} and 0 otherwise;
- a sum, product, etc. over labels and input data;
- some probability distribution over cliques {y_i, y_{i-1}, X}; and so on.
As an interesting example, we can see two feature functions in slides 4 and 5 on page 5 of [2]. We are in the situation that both BLAST and HMMer evidence suggest labeling y_i as exon. This would be hard to handle in HMMs but is very easy in a CRF model: we simply include a BLAST feature function and a HMMer feature function and assign two weights λ_BLAST and λ_HMMer to these two feature functions. Thus, we do not need to worry about the dependence between the BLAST and HMMer evidence. There is no requirement that evidence represented by feature functions be independent. The underlying reason for this is that we do not model P(X) in CRF models. All we care about is which evidence contributes to the conditional probability P(Y|X). The weights will determine the (relative) extent to which each set of evidence contributes and interacts.
In practice, there may be thousands or even millions of arbitrary indicator feature functions to incorporate into our CRF model. A naive approach is to try all of them via brute-force search, which is obviously impractical. A much better strategy is to take advantage of all the research on HMMs during the past decade by using the best HMM as our starting point. That is, we start with feature functions derived from the best HMM-based gene prediction algorithms. To see why this strategy works, one may look at slides 3, 4, 5, and 6 on page 6 of [2]. As shown there, an HMM is simply a special case of a CRF with all the feature function weights λ = 1 and each feature function taking the special form

    f_HMM(y_i, y_{i-1}, x_i) = log( P(y_i | y_{i-1}) · P(x_i | y_i) )
                             = log( P(y_i | y_{i-1}) ) + log( P(x_i | y_i) )
                             = f_HMM-Transition + f_HMM-Emission,

where we define f_HMM-Transition as

    f_HMM-Transition(y_i, y_{i-1}, x_i) = log( P(y_i | y_{i-1}) )

and define f_HMM-Emission by

    f_HMM-Emission(y_i, y_{i-1}, x_i) = log( P(x_i | y_i) ).
Thus, we see that HMMs are just a very special form of the CRF model. As a result, adding new evidence to an existing HMM becomes simple, because we can add arbitrary feature functions to the CRF version of our HMM and then learn the weights of these new feature functions empirically. These weights will capture the impact of incorporating the new features into our original HMM model.
HMMs and linear chain CRFs explore the same family of conditional distributions P(Y|X), and we can convert between HMMs and linear chain CRFs. In fact, HMMs and CRFs form a generative-discriminative pair, as demonstrated in slide 3 on page 7 of [2]. An important reason for adopting linear chain CRFs is that the underlying probability distribution is a factoring distribution, reducing the computational complexity of probability distributions from exponential to linear time. Another reason is efficient inference, as we see in the next section.
CRF Inference
Recall that our ultimate goal in gene prediction is, when given a sequence of DNA and some evidence (denoted collectively by X), to select the best labeling Y. As with HMMs, the most natural choice for the best labeling is simply the most probable labeling, given by

    argmax_Y P(Y|X) = argmax_Y (1/Z(X)) exp( Σ_{j=1}^J Σ_{i=1}^N λ_j f_j(y_i, y_{i-1}, X) ).

But of course we cannot afford to score every possible Y. The chain structure of the linear chain CRF plays a crucial role in this respect, since it guarantees the sub-optimality structure of P(Y|X), just the same as in the HMMs we have seen before. Therefore, following the notation we used in the Viterbi algorithm, let v_k(i) be the probability of the most likely path passing through position i and ending at state k; then we have the recurrence

    v_k(i) = max_ℓ [ v_ℓ(i - 1) · exp( Σ_{j=1}^J λ_j f_j(k, ℓ, X) ) ].

This recurrence enables us to compute the most probable labeling by dynamic programming, as in the Viterbi algorithm for HMMs.
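A minimal Python sketch of this recurrence, in log space so the products become sums (labels, feature_fns, and weights are assumed inputs; each feature function is called as f(y_i, y_prev, X, i), with the position made explicit):

    def crf_viterbi(X, labels, feature_fns, weights):
        # v[k] = best log-score of any labeling of x_1..x_i ending in label k.
        N = len(X)
        v = {k: 0.0 for k in labels}
        backptrs = []
        for i in range(1, N):
            new_v, ptr = {}, {}
            for k in labels:
                scores = {l: v[l] + sum(w * f(k, l, X, i)
                                        for f, w in zip(feature_fns, weights))
                          for l in labels}
                ptr[k] = max(scores, key=scores.get)
                new_v[k] = scores[ptr[k]]
            v = new_v
            backptrs.append(ptr)
        # Trace back the most probable labeling.
        y = [max(v, key=v.get)]
        for ptr in reversed(backptrs):
            y.append(ptr[y[-1]])
        return list(reversed(y))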
In fact, to see the relation between CRFs and HMMs more explicitly, as shown in slides 2, 3, and 4 on page 8 of [2], we can define the quantity ψ_HMM for HMMs by

    ψ_HMM(y_i, y_{i-1}, X) = P(y_i | y_{i-1}) P(x_i | y_i),

and define an analogous quantity for generic CRFs by

    ψ_CRF(y_i, y_{i-1}, X) = exp( Σ_j λ_j f_j(y_i, y_{i-1}, x_i) ).

Thus, we can rewrite all the HMM equations (Viterbi, forward, backward, etc.) in terms of ψ_HMM, and then replace ψ_HMM with ψ_CRF, to get a set of analogous equations for CRF models.
CRF Learning
Suppose we are given an independent and identically distributed (i.i.d.) training set {(X^(k), Y^(k))}, which is usually a set of manually curated gene sequences for which all nucleotides are labeled. Define

    P_λ(Y|X) = (1/Z_λ(X)) exp( Σ_{j=1}^J Σ_{i=1}^N λ_j f_j(y_i, y_{i-1}, X) ),

where

    Z_λ(X) = Σ_Y exp( Σ_{j=1}^J Σ_{i=1}^N λ_j f_j(y_i, y_{i-1}, X) ).
We would like to choose λ = (λ_1, ..., λ_J) to maximize L(λ) = Π_{k=1}^K P_λ(Y^(k) | X^(k)), that is, the product of the conditional probabilities of all the training data. For numerical stability (to avoid computational underflow), we instead maximize ℓ(λ) = log L(λ), the log-likelihood of the data; we have

    ℓ(λ) = Σ_{k=1}^K Σ_{j=1}^J Σ_{i=1}^N λ_j f_j(y_i^(k), y_{i-1}^(k), X^(k)) - Σ_{k=1}^K log Z_λ(X^(k)).

As one can show, ℓ(λ) is concave and therefore is guaranteed to have a global maximum. The global maximum is obtained where ∂ℓ(λ)/∂λ = 0. As we see from a homework problem, at the maximum
    Σ_{k} Σ_{i} f_j(y_i^(k), y_{i-1}^(k), X^(k)) = Σ_{k} Σ_{i} Σ_Y f_j(y_i, y_{i-1}, X^(k)) P_model(Y | X^(k)).

Note that the left-hand side of the above equation is the actual count in the training data, while the right-hand side corresponds to the expected count under the CRF-modeled distribution. This nice feature guarantees that, using maximum likelihood training, the characteristics of the training data must also hold in our CRF model.
However, the feature functions are usually given in black-box form and in general are not invertible. Therefore we have no way to find the global maximum assignment of weights in analytical closed form. We can instead rely on a gradient search algorithm, outlined below.
Gradient Search Algorithm for Computing λ
1. Define forward/backward variables similar to HMMs
2. Set weights λ arbitrarily
3. Compute Z(X) using the forward/backward algorithms
4. Compute ∂ℓ(λ)/∂λ_i using Z(X) and the forward/backward algorithms
5. Update each parameter by gradient search using ∂ℓ(λ)/∂λ_i (quasi-Newton method)
6. Go to step 3 until convergence to the global maximum
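The per-weight gradient follows directly from the moment-matching equation above: observed count minus expected count. A schematic Python update step (observed_counts and expected_counts are hypothetical helpers wrapping the counting and forward/backward computations):

    def gradient_step(weights, train_set, feature_fns, lr=0.1):
        # d l / d lambda_j = (count of f_j in the training data)
        #                  - (expected count of f_j under the current model)
        for j in range(len(weights)):
            obs = sum(observed_counts(feature_fns[j], X, Y)
                      for X, Y in train_set)
            exp = sum(expected_counts(feature_fns[j], X, weights, feature_fns)
                      for X, _ in train_set)
            weights[j] += lr * (obs - exp)   # plain ascent; quasi-Newton
        return weights                       # methods converge faster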
CRF Applications to Gene Prediction
To summarize, in order to apply CRFs to gene prediction, we first design the feature functions on label pairs {y_i, y_{i-1}} and X. Then we use a training data set to select the weights λ. Finally, using

    P(Y|X) = (1/Z(X)) exp( Σ_{j=1}^J Σ_{i=1}^N λ_j f_j(y_i, y_{i-1}, X) ),

we can use the selected feature functions and computed weights to find the most probable labeling Y for a given input sequence X.
On page 9 of [2], we find a brief introduction to Conrad, a gene predictor based on CRFs. One can see that, when predicting on C. neoformans chromosome 9, Conrad outperforms the previous best predictor, Twinscan.
References
[1] Scribe notes, Lecture 6: CRFs for Computational Gene Prediction. MIT 6.047/6.878 Computational Biology, Sept. 25, 2007.
[2] James Galagan. Conditional Random Fields for Computational Gene Prediction. Lecture notes 6 of MIT 6.047/6.878 Computational Biology, Sept. 2007.
[3] Aron Culotta, David Kulp, and Andrew McCallum. Gene prediction with conditional random fields. Technical report UM-CS-2005-028, University of Massachusetts, Amherst. http://www.cs.umass.edu/~culotta/pubs/culotta05gene.pdf
Example 2: Rain/Sprinkler
Nodes: R (Rain), S (Sprinkler), W (Watson's grass wet), H (Holmes' grass wet).
Holmes discovers his house's grass is wet. It could be rain or his sprinkler.
Now imagine Holmes sees Watson's grass is wet. Now we conclude it was probably rain, and probably not his sprinkler.
Explaining Away
Initially we had two explanations for Holmes' wet grass. But once we had more evidence for R, this explained away H, and thus there is no reason for an increase in S.
Conditional Dependence
If we don't know H, then R and S are independent. But if we know H, then R and S are conditionally dependent.
Graph Semantics
There are three basic building blocks, each implying a particular independence relationship:
Serial (chain/linear), X -> Y -> Z: X and Z are dependent, but X ⊥ Z | Y (conditional independence).
Diverging, X <- Y -> Z: X and Z are dependent, but X ⊥ Z | Y (conditional independence).
Converging, X -> Y <- Z: X and Z are independent, but become dependent given Y (conditional dependence: explaining away).
D-Separation
The three semantics combine in the concept of d-separation. Definition: two nodes A and B are d-separated (or blocked) if for every path p between A and B there is a node V such that either
1. the connection is serial or diverging and V is known, or
2. the connection is converging and V and all of its descendants are unknown.
If A and B are not d-separated, they are d-connected.
Equivalence of Networks
Two structures are equivalent if they represent the same independence relationships - they encode the same space of probability distributions. For example, the serial blocks X -> Y -> Z and X <- Y <- Z and the diverging block X <- Y -> Z are all equivalent. We will return to this when we consider causal vs. probabilistic networks.
Bayesian Networks
A Bayesian network (BN) for X = {X_1, X_2, X_3, ..., X_n} consists of:
A network structure S: a directed acyclic graph (DAG) whose nodes correspond to the random variables X and which encodes the graph independence semantics.
A set of probability distributions P: a conditional probability distribution (CPD), i.e., a local distribution, for each X.
Example Bayesian Network
For the rain/sprinkler network (R, S, W, H), the chain rule gives

    P(X) = P(R) P(S|R) P(W|R,S) P(H|R,S,W) = P(R) P(S) P(W|R) P(H|S,R),

where the second equality uses the independencies encoded by the graph.
BN Probability Distribution
We only need distributions over nodes and their parents:

    P(X) = P(X_1, ..., X_n) = Π_{i=1}^n P(X_i | pa(X_i)).

For the example network this comes to 2 + 4 + 4 + 8 = 18 entries.
Recall from HMM/CRF Lecture
For the directed graph (HMM) over hidden states Y_1, ..., Y_N and emissions X_1, ..., X_N, the factorization follows the directed graph semantics:

    P(X,Y) = Π_{all nodes v} P(v | parents(v)) = Π_i P(Y_i | Y_{i-1}) P(X_i | Y_i).

For the Markov random field (CRF), the factorization is over potential functions on cliques, conditioned on X:

    P(Y|X) = Π_{all nodes v} P(v | clique(v), X) = Π_i P(Y_i | Y_{i-1}, X).
CPDs
For the rain/sprinkler network, the CPD for H is a table such as:

    R S H P(H|R,S)
    0 0 1 0.1
    0 0 0 0.9
    0 1 1 0.8
    0 1 0 0.2
    1 0 1 0.9
    1 0 0 0.1
    1 1 1 0.1
    1 1 0 0.9

CPDs may be discrete or continuous.
Bayesian Networks for Inference
We observe values of (evidence on) a set of nodes and want to predict the state of the other nodes. This is observational inference.
Exact inference: the Junction Tree Algorithm.
Approximate inference: variational approaches, Monte Carlo sampling.
Walking Through an Example
We start with the initial CPDs for the rain/sprinkler network:

    P0(R): R=y 0.2, R=n 0.8
    P0(S): S=y 0.1, S=n 0.9
    P0(W|R): P(W=y|R=y) = 1, P(W=y|R=n) = 0.2
    P0(H|R,S), as pairs (H=y, H=n):
        R=y, S=y: (1, 0)      R=n, S=y: (0.9, 0.1)
        R=y, S=n: (1, 0)      R=n, S=n: (0, 1)

We define two clusters, WR and RHS, joined through the separator R. The key idea: the clusters only communicate through R. If they agree on R, all is good. We will find it easier to work on this representation. We then need P(WR) and P(RHS):

    P(WR) = P(R) P(W|R)
    P(RHS) = P(R) P(S) P(H|R,S),

which gives

    P0(W,R): (W=y,R=y) 0.2, (W=y,R=n) 0.16, (W=n,R=y) 0, (W=n,R=n) 0.64
    P0(R,H,S), as pairs (H=y, H=n):
        R=y, S=y: (0.02, 0)      R=n, S=y: (0.072, 0.008)
        R=y, S=n: (0.18, 0)      R=n, S=n: (0, 0.72)

Note that by marginalizing out W from P0(W,R) we get P0(W) = (0.36, 0.64). This is our initial belief in Watson's grass being (wet, not wet).
Now we observe H=y. We need to do three things:
1. Update RHS with this information.
2. Calculate a new P1(R).
3. Transmit P1(R) to update WR.
Updating RHS with H=y: we simply zero out all entries in RHS where H=n and re-normalize:

    P1(R,H,S): R=y, S=y: 0.074; R=y, S=n: 0.662; R=n, S=y: 0.264; R=n, S=n: 0.

But you can see that this changes P(R) from the perspective of RHS. Marginalizing out H and S from RHS gives the new P1(R) = (0.736, 0.264). Note also that P1(S) = (0.339, 0.661), compared with P0(S) = (0.1, 0.9).
Transmitting P1(R) to update WR: we rescale the WR cluster by the ratio P1(R)/P0(R),

    P1(W,R) = P0(W,R) · P1(R)/P0(R),

which gives

    P1(W,R): (W=y,R=y) 0.736, (W=y,R=n) 0.052, (W=n,R=y) 0, (W=n,R=n) 0.211,

so P1(W=y) = 0.788, up from P0(W=y) = 0.36.
Now we observe W=y and repeat the same three steps in the other direction: update WR (zero out the entries with W=n and re-normalize), calculate a new P2(R), and transmit P2(R) to update RHS with the factor P2(R)/P1(R). This gives

    P2(R) = (0.93, 0.07)
    P2(W,R): (W=y,R=y) 0.93, (W=y,R=n) 0.07, W=n entries 0
    P2(R,H,S): R=y, S=y: 0.094; R=y, S=n: 0.839; R=n, S=y: 0.067; R=n, S=n: 0.

Tracking the sprinkler across the updates: P0(S=y) = 0.1, P1(S=y) = 0.339, P2(S=y) = 0.161. R is now almost certain; we have explained away H=y, and S goes low again.
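The whole walkthrough fits in a few lines of NumPy; this sketch (not from the slides) represents the two clusters as arrays indexed by (W,R) and (R,H,S) and reproduces the numbers above:

    import numpy as np

    # Axes: W, R, H, S, each indexed as 0 = yes, 1 = no.
    pR = np.array([0.2, 0.8])
    pS = np.array([0.1, 0.9])
    pW_R = np.array([[1.0, 0.2], [0.0, 0.8]])            # [W, R]
    pH_RS = np.zeros((2, 2, 2))                          # [R, H, S]
    pH_RS[0, :, 0] = [1.0, 0.0]; pH_RS[0, :, 1] = [1.0, 0.0]
    pH_RS[1, :, 0] = [0.9, 0.1]; pH_RS[1, :, 1] = [0.0, 1.0]

    WR = pW_R * pR[None, :]                              # P(W,R)
    RHS = pR[:, None, None] * pS[None, None, :] * pH_RS  # P(R,H,S)

    # Observe H = y: zero out H = n, renormalize, then message on R.
    RHS[:, 1, :] = 0
    RHS /= RHS.sum()
    p1R = RHS.sum(axis=(1, 2))
    WR = WR * (p1R / WR.sum(axis=0))[None, :]            # P1(R)/P0(R)

    # Observe W = y: same three steps in the other direction.
    WR[1, :] = 0
    WR /= WR.sum()
    p2R = WR.sum(axis=0)
    RHS = RHS * (p2R / p1R)[:, None, None]               # P2(R)/P1(R)

    print(RHS.sum(axis=(0, 1)))   # P2(S), approx (0.161, 0.839)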
Message Passing in Junction Trees
The state of a separator S is the information shared between the adjacent clusters X and Y. When Y is updated, it sends a message to X. The message has the information needed to update X so that it agrees with Y on the state of S.
A node can send one message to a neighbor only after receiving all messages from each other neighbor. When messages have been passed both ways along a link, the link is consistent. Passing continues until all links are consistent.
HUGIN Algorithm
A simple algorithm for coordinated message passing in junction trees:
Select one node, V, as root.
Call CollectEvidence(V): ask all neighbors of V to send a message to V; if not possible, recursively pass the message to all neighbors but V.
Call DistributeEvidence(V): send a message to all neighbors of V, and recursively send messages to all neighbors but V.
BN to Junction Tree
This is a topic by itself. In summary:
Moral graph: an undirected graph with links between all parents of each node.
Triangulate: add links so that all cycles of length > 3 have a chord.
The cliques then become the nodes of the junction tree (for the rain/sprinkler network: clusters WR and RHS with separator R).
Applications
Consider a network over genes G1, ..., G10: measure the expression of some genes and predict the rest.
Latent Variables
Unobserved nodes can be added to the same network: drugs (D1-D4), environment (pH, O2), and metabolites (M1-M4).
Observation vs. Intervention
The arrows are not necessarily causal: a BN models the probability of, and correlation between, variables. For the applications so far, we observe evidence and want to know the states of the other nodes most likely to go with the observation. What about interventions?
Example - Sprinkler: if we observe the sprinkler on, Holmes' grass is more likely wet, and it is more likely summer (via the season node). But what if we force the sprinkler on? Intervention: cut the arrows from the parents.
Causal vs. Probabilistic
This depends on getting the arrows correct! Flipping all arrows does not change the independence relationships, but it changes causality for interventions.
Learning Bayesian Networks
A Bayesian network consists of 1. a network structure S, and 2. parameters θ for the probability distributions on each node, given S. Given a set of observations D (e.g., an expression data set) on X, we want to find both.
Learning θ given S is relatively easy. We can choose the maximum likelihood parameter

    θ* = argmax_θ P(D | θ, S), with P(D | θ, S) = Π_{i=1}^n P(X_i | pa(X_i), θ).

We can also choose to include prior information P(θ) in a Bayesian approach:

    θ_bayes = ∫ θ P(θ | S, D) dθ, where P(θ | S, D) ∝ P(D | θ, S) P(θ).

Sampling Models - MCMC
How do we sample structures? We would like to sample from

    P(S | D) = P(D | S) P(S) / Σ_k P(D | S_k) P(S_k),

but the direct approach is intractable due to the partition function. Instead we use the Markov chain Monte Carlo method:
Propose: given S_old, propose a new S_new with probability Q(S_new | S_old).
Accept/Reject: accept S_new as a sample with probability

    p = min( 1, [P(D | S_new) P(S_new) Q(S_old | S_new)] / [P(D | S_old) P(S_old) Q(S_new | S_old)] ).
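A generic Metropolis-Hastings loop matching this acceptance rule (log_score and propose are hypothetical problem-specific helpers; log_score(S) should return log P(D|S) + log P(S)):

    import math, random

    def mcmc_structures(S, log_score, propose, n_samples=1000):
        # propose(S) must return (S_new, log Q(S_new|S), log Q(S|S_new)).
        samples = []
        for _ in range(n_samples):
            S_new, log_q_fwd, log_q_rev = propose(S)
            log_p = (log_score(S_new) - log_score(S)) + (log_q_rev - log_q_fwd)
            if math.log(random.random()) < min(0.0, log_p):
                S = S_new                  # accept the proposed structure
            samples.append(S)              # a rejected move repeats the old one
        return samples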
a unique genetic code (like other mycoplasmas)
As a first step in the process of engineering a simplified cell, Dr. Knight's lab has determined the essential genes in the genome of M. florum by creating and sequencing a library of knockout mutants. The next steps in the process include adding a plasmid system, adding a recET recombination system, and developing a kit of recoded parts (e.g., GFP, antibiotic resistance genes).
Biosafety
Synthetic biological systems pose at least three potential dangers: uncontrolled growth, horizontal gene transfer from natural organisms, and horizontal gene transfer to natural organisms. Uncontrolled growth could be checked by designing synthetic biological systems to require pairs of nutrients: though evolution might circumvent a single nutrient requirement, it is unlikely to circumvent two simultaneously.
Engineered organisms based on M. florum would be naturally resistant to horizontal gene transfer (in either direction). UGA, normally a stop codon, is the preferred tryptophan codon in M. florum; thus, gene transfer to natural organisms is unlikely. Similarly, gene transfer from natural organisms is prevented by the fact that CGG is a nonsense codon in M. florum. Further altering the genetic code in synthetic biological systems could increase this natural barrier.