Computational Biology Compressed
Computational Biology Compressed
Computational Biology Compressed
BIOLOGY
A HYPERTEXTBOOK
SCOTT T. KELLEY
Department of Biology
San Diego State University
San Diego, California
AND
DENNIS DIDULO
Becton, Dickinson and Company
San Diego, California
COMPUTATIONAL
BIOLOGY
A HYPERTEXTBOOK
Washington, DC
Copyright © 2018 American Society for Microbiology. All rights reserved.
No part of this publication may be reproduced or transmitted in
whole or in part or reused in any form or by any means, electronic
or mechanical, including photocopying and recording, or by any information
storage and retrieval system, without permission in writing from the publisher.
10 9 8 7 6 5 4 3 2 1
Send orders to ASM Press, P.O. Box 605, Herndon, VA 20172, USA
Phone: 800-546-2416; 703-661-1593
Fax: 703-661-1501
E-mail: [email protected]
Online: http://www.asmscience.org
To Kina and Aidan, my wonderful and supportive family.
Preface ix
For the Instructor xi
For the Student xiii
Acknowledgments xiv
About the Authors xv
CHAPTER 00 Introduction 5
Activity 0.1: Biological Databases and Data Storage 20
CHAPTER 01 BLAST 31
Activity 1.1: BLAST Algorithm 36
CHAPTER 07 Probability: All Mutations are not Equal (-ly Probable) 157
Activity 7.1: Generating PAM and BLOSUM Substitution
Matrices 163
Index 191
PREFACE
T
his textbook is a hypertextbook. Half of the textbook material lies between
the pages of this book and the other half on the Internet. It seems natural
that a hypertextbook, which combines print and online apps for mobile tech
nology, would be a great way to learn the basics of bioinformatics, which
uses informatics (computational) theory to study biological data.
This book was born outof a mix of necessity and inspiration.1 The necessity
came from the dearth of bioinformatics instructional materials appropriate for my
combination of biology students, with little or no computer background,2 and
computer science students, who were interested in the field but had little under
standing of biology. The need became acute when I learned that my favorite bio-
informatics lab manual, Bioinformatics for Dummies (BFD3), would no longer be
updated. BFD was a great lab manual for learning how to perform basic bioinfor-
matics data analysis. This book did not explain the principles behind the algo
rithms, but I could cover those during lectures. BFD was clear and fun to read and
provided practical skills for biologists and others looking to analyze data. Unfortu-
nately, the most recent version was printed in 2007!
I kept using the old edition of BFD for some time, but eventually the tutorials
became obsolete and the students took longer and longer to complete the exer
cises. In fact, several passages of BFD were obsolete a few months after the
book was printed. Bioinformatics websites are constantly changing, including
their designs and the URL links, and sometimes the pages themselves disappear
altogether. Since I began writing this book, two of the websites I teach in the book
and online materials changed significantly, and one disappeared altogether.
This led to my original inspiration for the hyper- part of this hypertextbook.
What if I made my own bioinformatics tutorials and sample test data for com
monly used analysis tools online in easily updated files? That way, when a link
changed or the programmers moved a radio button around, I could easily alter the
tutorial to reflect these changes in real time. Students would not have to wait for
a new version of a book to have an accurate tutorial.
The next inspiration arose from my use of paper-based puzzles and problems to
teach the bioinformatics algorithms. The problems I taught in class, combined with
ix
x Prefa ce
the anticipatory conceptual exercises and lecture material, were very successful
for teaching how the methods worked. Unfortunately, paper-based problems also
had significant drawbacks: the students were given only one practice problem per
algorithm, and they received very little feedback as a result. Typically, I would (1)
teach the method, (2) do an example with the students in class, (3) assign it for
homework, (4) get it back a week later, and (5) return it with feedback a week after
that. And that was it.
Fortunately, I realized that the structure of the algorithm puzzles I taught would
be perfect for touchscreen devices and laptops. Most of them involved either
sliding letters around or filling in boxes with numbers, both easily done with a fin
ger or a mouse. With the collaboration of my bioinformatics website designer and
coauthor Dennis Didulo, I created interactive learning tools that provide limitless
practice and instant feedback for students. When we combined the bioinformat-
ics software tutorials and test data into one site, we had a comprehensive learn
ing paradigm for introductory bioinformatics. (See “For the Student” section
below for an outline of the website features.)
In my class, I noticed an immediate increase in algorithm comprehension and
problem-solving ability. Students gained much more practice, received more feed
back, and performed much better on tests. Because new problems were easily
randomly generated, each student had their own personal data set. Best yet, I could
now quickly generate new exam questions and answers with the click of a button!
And what did my students think? These quotes speak for themselves.
“It is a wonderful learning tool. The online programs made learning the
algorithms almost easy.”—Ruby, undergraduate student
“I didn’t want to tell you how much I liked the website because I didn’t
want your ego to get too big.”—Emily, undergraduate student
“It was much better than that bioinformatics cat video.”—Pedro, graduate
student
“You can learn bioinformatics while waiting in line at the DMV or sitting
on your couch eating cheese puffs!”—Anonymous
Notes
1. Much like the invention of the salad spinner.
2. Many biology students tell me flatly that they are “bad with computers” or even state that
“computers hate [them].” For the record, computers really don’t care about you at all. Which
is why we should never give them weapons (see the film “Terminator”).
3. BFD, the budding bioinformatician’s BFF.
FOR THE INSTRUCTOR
T
his hypertextbook can be used in a number of ways: in a lecture or online
course, using the book as an outline for a course, or using just the sections
of interest. It is important to note that, being a hypertextbook, the web
components are not supplemental, but instead are crucial for being able to
understand the content presented in the physical book. In my classes, I use
the interactives inside of class, and the students also use them outside of class
to help them solve algorithm problems or prepare for exams. Generally, I use the
following approach:
2. Have students solve the conceptual (anticipatory) exercise in class, sharing
answers with one another.
4. Have students bring outtheir mobile devices (laptops, smartphones, and
tablets) and solve the interactive problems.
5. Have students share their answers with neighbors in class and with the
instructor.
Then, to make sure the students practice at home, I assign the paper-based prac
tice problems. Finally, in the computer lab, or for homework, I assign the lab exer
cises with the software based on the algorithms.
So far, the approach has been a great success in my classes. The online tools
increase comprehension and improve exam results, and the easily updated tuto
rials for bioinformatics analysis software and biological databases reduce a lot of
student frustration. I hope it proves as successful in your class as it has in mine.
xi
FOR THE STUDENT
T
his textbook is really a hypertextbook, meaning that much of the most excit
ing learning happens online. Close to half of the book materials are online,
and in each chapter you will be directed to the online tools associated with
the text. The idea is to leverage the uniquely powerful aspects of the Internet
to help you learn about bioinformatics. The puzzle-like nature of bioinformat-
ics algorithms makes them especially suited to interactivity and “gamification”
(making difficult things into games with points and scores). The interactive nature
of mobile devices and their connection to online bioinformatics software make
them useful learning tools for understanding the theory behind bioinformatics
methods (the algorithms) and for gaining practical experience with their imple
mentation (software analysis and databases). In order to enhance learning of
the principles behind bioinformatics algorithms and make them more engaging,
the online resources have been designed to
•
Be interactive, with touchscreen puzzle-like problem sets that provide instant
feedback
•
Be multiplatform, usable on computers, tablets, and smartphones
•
Be highly practical, with direct links to data analysis websites and including
test data sets and step-through tutorials
•
Be easily updated, because bioinformatics websites change constantly and
tutorials often need adjustment
•
Allow plenty of practice through instant “random” problem generation and
quizzes
xiii
ACKNOWLEDGMENTS
I
wish to thank the leadership of the California State University Program in Edu-
cation, Research, and Biotechnology (CSUPERB) and the grant reviewers who
approved my proposal on mobile app education technology that provided the
seed money for developing the interactive technology and web resources.
I thank Greg Payne at ASM Press for listening to my ideas and taking them
seriously and for his support during the writing and publishing process, and I thank
my colleague at SDSU, David Lipson, for telling Greg about my project. I thank the
hundreds of bioinformatics students who took my course at SDSU, who helped
me refine my algorithm teaching methods from their sub-alpha development
pencil-and-paper stages allthe way through to the interactive app stage. You are
the reason I do allthis in the first place. I give special thanks to my spouse Kina
Thackray for her advice during the long process of developing the bioinformat-
ics learning algorithms, for encouraging me to submit grant proposals, and for
her very helpful comments on multiple drafts of the book. Finally, I need to thank
my former biometry professor Dr. Michael Grant, who taught me statistics
and introduced me to programming (SAS) and Dr. Gary Stormo, who graciously
allowed me to pursue bioinformatics as a postdoctoral researcher in his lab.
xiv
ABOUT THE AUTHORS
1
2 CO MPUTATIONAL B IOL OGY
General features
While most of the pages look like the Alignment page, the Basics page is or
ganized differently and mostly contains information and tutorials.
T
he word bioinformatics re fers to the com pu tational anal
y
sis of com plex
iological data. The “bio-” prefix indicates biology, of course, while “infor
b
matics” is the science of data processing, storage, and retrieval (a.k.a. informa
tion science) that first developed in the 1960s. Bioinformatics itself dates
back to the early 1970s, when computers were first used to analyze molec
ular sequences. While our knowledge of biological processes, the amount of mo
lecular data, and the speed and throughput of computation have allexpanded
dramatic ally, the field of bioinformatics still primarily focuses on the analysis of
three critical biological molecules: DNA, RNA, and protein.
These molecules are critical to the cellular processes of all living organisms,
and the analysis of the composition and patterns of these molecules should in
theory reveal allthe secrets to life. (Or, as Dr. Frankenstein would say, “It’s alive!
Bwahaha!”) In fact, because DNA encodes the information for allthe RNA and
protein in every cell, analysis of DNA sequence patterns comprises the majority of
bioinformatics. RNA and protein sequences are also analyzed using specific bioin-
formatics algorithms, but the sequences of these molecules are often computa
tionally determined from the DNA sequence in one way or another (see below).
The purpose of this chapter is to explain the general properties of these bio
logical molecules and how they are represented and stored in the computer. It is
critical to understand the connection between the data you observe in computer
files and the biological molecules. Otherwise the data analysis and databases
that store these data will make little sense. We also briefly discuss what is known
as the central dogma of molecular biology, how the DNA inside cells is “read” by
the cellular machinery, and the general structure of the gene. The introductions to
each chapter provide additional background information about the structure and
function of DNA, RNA, and proteins and how bioinformatics can be used to ana
lyze different aspects of these molecules.
5
6 CO MPUTATIONAL B IOL OGY
Why Bioinformatics?
When nonscientists ask me what I do for a living, I tell them I’m a computational
biologist. This oxymoron usually elicits a confused expression (“You study the
biology of computers? Say what?”). I quickly follow this by asking them if they
have heard of DNA and the human genome, which most people have by now.
Then I tell them that the DNA that makes up the human genome is really just 3
BILLION LETTERS in a computer. Here is a little snippet of DNA information from
the human genome:
AGAAAATCACCCTTCCCAGGGGGAAGGTGCTGGGCAGTGGCACTGCCTCTTGGGGGAAGAGGTTGGGCAG
GGGCTGACGGGCAATGGCAGATGACAGCATCCAAACTTCCACACACAGAGTCTGTTCCTTCCTCTTCCCC
GTGCCATCCCAACTCCCTTCTGCCTTGTCATCTACGTCATGGGAAGCAGGTGACATATCTGGCAAGTTAT
TTTGGGGGCCTGGCTTCTCCCAGGTGAAGAGGGAGCAGCAGCTGGAGGGGCAGAAAGAGGGGACAGGGAG
GGGCTGGAGGGCACAGCTGAAGACAGCCTGGGAGGTGACTGTCATCCCCTCCAGTCTCTGCACACTCCCG
GCTGCAGCAGAGCAGGAGGAGAGAGCACGGCCTGGAATGCTAATTTGCCAGGAGCTCACCTGCCTGCGTC
ACTGGGCACAGACGCCAGTGAGGCCAGAGGCCGGGCTGTGCTGGGGCCTGAGATGGGGTGGTGGGGAGAG
AGTCTCTCCCCTGCCCCTGTCTCTTCCGTGCAGGAGGAGCATGTTTAAGGGGAAGGGTTCAAAGCTGGTC
ACATCCCCAACAAAAAAGCCCACGGACAACGAAAAGCCCACTCGCTTGTCCAGTGCCACAGGAGGGGGCA
AGTGGAGGAGGAGAGGTGGCGGTGCTCCCCACTCCACTGCCAGTCGTCACTGGCTCTCCCTTCCCTTCAT
C C T C G TT C C C TAT C T G T C AC C ATTT C C T G T C G T C G TTT C C T C T G A AT G T C T C AC C C T G C C C T C C C T G C TT
GCAAGTCCCCTGTCTGTAGCCTCACCCCTGTCGCATCCTGACTACAATAACAGCTTCTGGGTGTCCCCGG
CATCCACTCTCTCTCCCTTCTTATCCCTTCCGTGACGGATGCCTGAGGAACCTTCCCCAAACTCTTCTGT
CCCATCCCTGCCCTGCTCAAAATCCAATCACAGCTCCCTAACGCTCCTGAATCAACGTGAAGTCCTGTCT
TGAGTAATCCGTGGGCCCTAACTCACTCATCCCAACTCTTCACTCACTGCCTTGCCCCACACCCTGCCAG
After examining this DNA sequence, try answering the following questions:
My guess is that withouta computer and bioinformatics, you don’t stand a chance
of answering these questions correctly. The above DNA sequence information
codes for a small fragment of human DNA on chromosome 12. (Or it could be from
a vampire bat. Keep reading to find out!) In fact, the entire human genome contains
1,000 times this much information. (BTW, this information is called sequence infor
mation because it is linked together as a sequence of letters.) And look how boring
it is! The same 4 letters—A, G, C, and T—over and over again in different combina
tions. RNA and protein information looks pretty similar in the computer, except that
one can tell RNA sequence data apart because it contains U instead of T. Protein
sequence data are also easy to differentiate because up to 21 different letters rep
resenting the various amino acids are used in the sequences.
Here is some RNA sequence information:
GUUUAAGGGACACCGCAGAAAUGGUGAAUACAAUGAAGACAAAGCUGUUGUGUGUACUGCUGCUUUGTGG
RCDRGLAQCHTVPVKSCSELRCFNGGTCWQAASFSDFVCQCPKGYTGKQCEVDTHATCYKDQGVTYRGTW
I N TR O D U C TI O N 7
Granted, the protein sequence is a little more interesting, but it is still pretty
mind-numbing to stare at allday. However, mind-numbing tasks are exactly what
computers were built for: determining the positions of allthe known stars in our
galaxy, calculating compound interest for billions of bank accounts, and searching
through allthe house cat video URLs on the Internet, among other things.
In fact, the amount of molecular sequence data has grown so vast, and the
technologies for generating DNA sequence information from organisms have
become so efficient, that computer processor and hard drive technologies are
starting to fall behind the rate of biological information generation. The graph in
Fig. 0.1 shows the exponential growth in the databanks from 1982 to 2008.
In 2008, the amount of information depicted in Fig. 0.1 was considered an ex
treme amount of information. Now a single researcher can, in a single day, gener
ate DNA sequence information equivalent to all the total sequence information
available in 2008.
So, what can one do with allthese data? That question is the principal subject
of this book, namely, how bioinformatics algorithms and banks of fancy comput
ers can make sense of this growing mountain of molecular sequence data. In the
next few sections you will learn a little about these critical biological molecules
and how letters of the alphabet can be used to represent and store them in
FIGURE 0.1. Exponential growth of GenBank database from 1982 to 2008. Courtesy
of National Library of Medicine.
8 CO MPUTATIONAL B IOL OGY
FIGURE 0.2. Chemical structure of DNA at the atomic level. A, adenine; T, thymine;
C, cytosine; G, guanine. Courtesy of Zephyris (Richard Wheeler), under license
CC BY-SA 3.0.
I N TR O D U C TI O N 9
The chapters of this book show ways in which the combination of experimenta
tion and DNA sequence analysis3 can reveal powerful new insights into molecular
processes, cellular mechanisms, disease, and biodiversity. However, before we
can analyze DNA sequences in the computer, we must first store the DNA infor
mation in a computer. How do we represent the complex biochemical structure
of DNA in a computer?
Figure 0.2 shows the chemical structure of DNA at the atomic level, showing
allthe individual atoms and bonds between them for just a small fraction of a typ
ical DNA molecule. This figure of a DNA double helix shows the arrangement of
individual atoms in a fragment of DNA. This DNA segment has a total of 14 nucle
otide base pairs. Two of the base pairings are shown on the right side of the figure.
A thymine (T) nucleotide binds to adenine (A), and cytosine (C) binds to guanine
(G). If we were to store this structure in its full 3-dimensional (3D) glory, we would
have to keep track of the position of every atom, bond, and bond angle for more
than 60 atoms per DNA base pair. Remember, too, that the DNA in just one of
your cells is approximately 300,000,000 times longer than the DNA in the figure.
Saving allthese data, even in a big computer, is not very practical. Clearly, we
need an alternative to storing every atom of every DNA molecule. Fortunately,
the chemical structure of DNA is highly redundant and easy to simplify. Figure 0.3
shows that if we zoom in and flatten a piece of the structure, we can see the four
nucleotides that make up DNA.
FIGURE 0.3. Close-up view of a sequence of DNA showing chemical structures of the four nucleotide
bases and their pairings. Solid boxes correspond to those in Fig. 0.2. Courtesy of Zephyris (Richard Wheeler),
under license CC BY-SA 3.0.
10 CO MPUTATIONAL B IOL OGY
Each of the four nucleotides is composed of three parts: the phosphate that
binds the nucleotides together, the deoxyribose sugar molecule, and the nucleoside
base itself. The nucleoside base is really the only interesting bit. The DNA mole
cule is a long string of these four nucleotides base-paired to complementary nu
cleotides on the opposite strand. The box in Fig. 0.4 shows the A and T binding to
one another, making a base pair. To make it simpler and easier to store in a com
puter, we can use single letters in place of the nucleotides: A, T, C, and G.
Single letters are perfect for data storage. The whole human genome is “only”
3 billion letters—that’s just 6 MB of data, the size of a family photo, and you can
store a lot of photos on a typical laptop. DNA data storage is even easier when
you realize that one must store only half the data. DNA has two parallel strands
running in opposite directions. In Fig. 0.4, the strand on the left runs from 5' (top)
to 3' (bottom), while the rightmost strand runs in the opposite direction.
Since adenine ALWAYS binds thymine, and guanine ALWAYS binds cytosine, if
you know one strand, you automatically know the other. If one stores the left
most strand as ACTG, it is trivial to reconstruct the opposite strand as CAGT. The
DNA nucleotides are always written left to right, from the 5' to 3' direction.4 This
is the sequence (the order) in which the nucleotides are written. When biologists
talk about a DNA sequence this is what they are talking about. When you have
one strand of a sequence, you can determine the other by finding its reverse
complement. To do so, move in the reverse direction, from the end back to the
beginning, and determine the matching base for each nucleotide.
FIGURE 0.4. Close-up view of base-paired nucleotides (boxed). The arrows show how
DNA strands run in opposite directions. A 5'-to-3' strand is paired with a complementary
strand running in the 3' to 5' direction.
I N TR O D U C TI O N 11
For instance, say you have the DNA sequence GACCTTA. To reverse comple
ment this sequence, go to the last letter, in this case A, and write the comple
mentary base, T. Move backwards until you reach the beginning. An example is
shown in Fig. 0.5.
The final sequence is the reverse complement, which in this case is TAAGGTC.
Ta-da!
In the first step of this process, a protein enzyme complex that includes the RNA
polymerase unwinds the DNA double helix and transcribes one of the strands into
an RNA molecule called mRNA. The m in mRNA stands for messenger, because
the mRNA is a copy of the message contained in the DNA nucleotides.
RNA molec ules are very similar to DNA but have a few key differences.
Figure 0.6 illustrates the basic process of transcription, in which the RNA poly
merase protein complex makes a copy of one strand of the DNA double helix (the
coding strand) by synthesizing a complementary single-stranded RNA molecule.
Along the template strand, the RNA polymerase moves in a 3' to 5' direction, rec
ognizing each DNA nucleotide, finding a complementary RNA nucleotide, and adding
this nucleotide to the 3' end of the growing RNA molecule. Once the synthesis is
12 COMPU TATIO NA L B IOL OGY
complete, the RNA is uncoupled from the DNA, the RNA polymerase falls off,
and the DNA double helix reforms.
A static image does not do justice to the beauty of transcription. Fortunately,
the folks who run the DNA Learning Center (DNALC; https://www.dnalc.org/)
have created stunning animations of many molecular processes. Here is the link
to a 3D animated video of transcription:
https://www.dnalc.org/resources/3d/12-transcription-basic.html
I N TR O D U C TI O N 13
https://www.dnalc.org/resources/3d/13-transcription-advanced.html
Protein Translation
The final destination of mRNA is the ribosome, the so-called protein factory of
the cell. The ribosome is a macromolecular complex composed of two very large
structural RNA molecules called ribosomal RNA (rRNA) and a lot of proteins. It is
at the ribosome that the data encoded in the DNA is translated5 into protein in a
factory-like manner. Figure 0.7 shows a basic illustration of translation.
Again, the DNALC has created excellent translation videos that are very worth
watching.
Advanced translation:
https://www.dnalc.org/resources/3d/16-translation-advanced.html
FIGURE 0.7. Simplified illustration of translation. Steps 1–7 describe the process of
protein translation in which a molecular “machine” called the ribosome translates mRNA
into a protein. The steps shown in the figure are shared by all cellular life, though bacteria
do not contain a nucleus. Courtesy of Kelvin Ma, under license CC BY 3.0.
new organisms. A new bacterial genome, for instance, can be sequenced and
assembled in a day and yield 3,000+ protein sequences.6
For example, here is a portion of the DNA sequence that codes for the hu
man hemoglobin beta protein, part of the complex that binds oxygen in red
blood cells:
I N TR O D U C TI O N 15
FIGURE 0.8. Aspects of protein structure. The structure of a protein can be described
at four different levels. The primary structure is the sequence of bonded amino acids,
unfolded. The amino acids fold into two basic types of structures known as secondary
structures: alpha helices and beta sheets. The tertiary structure is the 3D structure of the
protein and is the most important for understanding the protein's function. Finally, quaternary
structure is an arrangement of multiple proteins bound together to make a single functional
macromolecule. Image courtesy of Thomas Shafee, under license CC BY 4.0.
16 COMPU TATIO NA L B IOL OGY
FIGURE 0.9. Chemical structures of the 20 common amino acids that are used to
make proteins. The letters below the names of the amino acids can be used instead of the
full names to represent the structure. For example, V represents valine, a hydrophobic
amino acid.
ATGGTGCACCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAG
TTGGTGGTGAGGCCCTGGGCAGGCTGCTGGTGGTCTACCCTTGGACCCAGAGGTTCTTTGAGTCCTTTGG
GGATCTGTCCACTCCTGATGCTGTTATGGGCAACCCTAAGGTGAAGGCTCATGGCAAGAAAGTGCTCGGT
GCCTTTAGTGATGGCCTGGCTCACCTGGACAACCTCAAGGGCACCTTTGCCACACTGAGTGAGCTGCACT
GTGACAAGCTGCACGTGGATCCTGAGAACTTCAGGCTCCTGGGCAACGTGCTGGTCTGTGTGCTGGCCCA
TCACTTTGGCAAAGAATTCACCCCACCAGTGCAGGCTGCCTATCAGAAAGTGGTGGCTGGTGTGGCTAAT
GCCCTGGCCCACAAGTATCAC
Using the genetic code (Fig. 0.10) to computationally translate this sequence
starting at the first base, here is the protein translation of this DNA sequence:
MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLG
AFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVAN
ALAHKYH
I N TR O D U C TI O N 17
FIGURE 0.10. Genetic code for eukaryotes, like us. The possible triplet codons of mRNA
are listed with the amino acids they encode in eukaryotes (bacteria have a slightly different
genetic code). In this table, the AUG start codon is shaded in green. The stop codons,
which cause termination of translation, are shaded in pink.
Each group of three nucleotides in the DNA sequence codes for a different amino
acid in the protein sequence. For example, the first three nucleotides, ATG, code
for M (methionine). In Fig. 0.10, the table shows the RNA sequences which cor
respond to particular amino acids, but one can also use the DNA nucleotides. To
do this: (i) transcribe the DNA into RNA (easy), (ii) break up the RNA sequence
into groups of three nucleotides (i.e., codons), and (iii) match the codons to the
amino acids using the table. For example, the RNA codon CUU codes for leucine
(Leu), which is stored as an L in the computer.
tion is much simpler. However, as single cells, bacteria must adapt quickly to en
vironmental conditions, producing metabolic enzymes, proteins involved in
motility, or cell surface proteins very rapidly. Thus, unlike eukaryotes, bacteria
cluster their genes into operons, contiguous regions of DNA that each code for a
different polypeptide (protein) involved in the same functional process. For exam
ple, allthe proteins involved in metabolizing glucose are next to each other in an
operon. In bacteria, the process of transcription is directly linked to translation,
and they often happen more or less simultaneously. In eukaryotes, mRNA is ex
ported outside the nucleus of the cell for translation.
Notes
1. Some viruses, like HIV and influenza virus, use RNA (the chemical cousin of DNA) as their
genetic material.
2. It would be rather thin, though, just 20 nanometers in diameter, or 5,000 times thinner than
a human hair.
3. Also the analysis of RNA and protein sequences, which can be determined from DNA
sequences.
4. The protein enzymes that bind to DNA to (biochemically) read it, or copy it, recognize direc
tionality and always move in the 3' to 5' direction on the DNA.
5. A note on transcription and translation. Transcription is the process of copying the same text
from one place to another. The mRNA is pretty much the same language (nucleot ides). Trans-
lation is the process of expressing words in another language. DNA and RNA sequences are
like one chemic al language, and protein sequences are like another.
6. Hence another need for bioinformatics. Imagine the number of experiments that would be
needed to test allthe functions of these proteins at the bench!
20 COMPU TATIO NA L B IOL OGY
Motivation
The goals of this section are to help you (i) understand why computers are increasingly vital in
the biological sciences, (ii) learn the basics of how biological information, namely, DNA, RNA,
and protein sequence information, is stored in the computer, and (iii) be able to interpret com
mon data file types used for storing biological information. DNA is life’s principal information
storage device and is present in every living cell. Some viruses, like Ebola, use RNA, but are they
really “alive”? (Statement intended to annoy virologists.) Complete understanding of an organ
ism’s DNA tells you everything about the organism: what proteins it makes, when and how it
activates genes, how different cell types (e.g., heart, liver, and brain cells) develop, and more.
However, because researchers generate so much DNA sequence from so many organisms,
and it is so very boring to look at, we need computers to make any sense of these data. Below
is a sequence of human DNA, with each letter representing a nucleotide in the DNA. This se
quence is a tiny fraction of the 3 BILLION nucleotides in the human genome:
CTGATGGGAATGCAAGCAGCCATTGAGCAGGCTATGAAGAGTCGTGAGATTCTGGGCATCTCAGACCCTC
AGACGCTGGCCCATGTGCTGACAGCCGGAGTGCAGAGTTCCTTGAATGACCCACGCCTCTTCATCTCCTA
TGAGCCCAGTACCCTCGAGGCTCCCCAGNCAGCACCAACACTCACCAACCTCACCCGAGAAGAACTACTG
GCCCAGCTACAGAGGAGCATCCACCATGAGGTCCTTGAGGGCAACGTGGGTTACCTACGAATAGATGATT
TCCCCGGCCAGGAGGTACTGAGTGAGCTGGGGGGATTCTTGGTGACCCATATGTGGAGGCAGCTCATGGA
CACCTCCTCCTTGGTGCTCGATCTCCGGTACTGTGCTGGTGGTCACATCTCTGGGATCCCTTATTTCATC
Note how this sequence is just the same four letters over and over in different combinations—
terribly dull to read and nearly impossible to interpret. However, with fancy computational algo
rithms and banks of computers, we could use these letters to determine in seconds the species
that the sequence came from, whether the sequence codes for a protein, structural RNA, or a
piece of ancient “junk” DNA, and the cellular role of the gene coded for by this DNA. (This one
is a piece of Florida muskrat DNA that codes for an eye gene.) Far out, right?
Before discussing how one computationally analyzes DNA, RNA, or protein sequences, one
first needs to understand the ways in which these sequences are stored in the computer. After
covering the basics of biological sequences in the introduction, the exercises will cover the
National Center for Biotechnology Information’s (NCBI) scientific journal article search engine
and storage database (PubMed) and the underlying text format (MEDLINE). The exercises and
tutorials will then cover the complex, information-rich GenBank data file format and the much
simpler FASTA file format. Finally, we will briefly review a series of very useful biological informa
tion databases for exploring protein sequences and microbial and eukaryotic genomes.
Learning Objectives
. Understand how and why biological information is stored electronically (Motivation).
1
2. Gain familiarity with some commonly used data storage formats and how to interpret them
(Concepts and Exercises).
I N TR O D U C TI O N 21
3. Learn the basics of some powerful and highly useful biological sequence databases
(Lab Exercises).
Concepts
To develop an appreciation of the types of data that can be stored in biological databases, use
your knowledge of biology and cleverness to determine what the elements shown in boldface
below indicate about a particular DNA sequence stored in a database at NCBI. The following text
file is an example of a GenBank formatted file. Write your guess as to what the items indicated
in boldface mean in the nearest box.
The answers are shown below. This file format can be quite rich in information. Indeed, one can
learn a lot about the gene function, splicing, regulation, and other things by looking at such files.
CDS join(AF018429.1:282..561,AF018429.1:1034..1172,560..651,
AF018431.1:1..45,AF018432.1:658..732,AF018432.1:884..954,
AF018432.1:1391..1447)
/gene=“DUT”
/note=“DUT-M; alternatively spliced; mitochondrial form of
the protein; sim
i
lar to H. sa piens dUTPase en coded by
GenBank Accession Number U90224”
/codon_start=1
/product=“dUTPase”
/protein_id=“AAB71393.1”
/db_xref=“GI:2443580”
/translation=“MTPLCPRPALCYHFLTSLLRSAMQNARGTAEGRSRGTLRARPAP
RPPAAQHGIPRPLSSAGRLSQGCRGASTVGAAGWKGELPKAGGSPAPGPETPAISPSK
RARPAEVGGMQLRFARLSEHATAPTRGSARAAGYDLYSAYDYTIPPMEKAVVKTDIQI
ALPSGCYGRVAPRSGLAAKHFIDVGAGVIDEDYRGNVGVVLFNFGKEKFEVKKGDRIA
QLICERIFYPEIEEVQALDDTERGSGGFGSTGKN”
mRNA join(AF018429.1:<1018..1172,560..651,AF018431.1:1..45,
AF018432.1:658..732,AF018432.1:884..954,
AF018432.1:1391..>1447)
The protein sequence translation
/gene=“DUT”
for a splice mRNA.
/product=“dUTPase”
/note=“alter
na
tively spliced; en codes nu clear form of the
protein”
CDS join(AF018429.1:1018..1172,560..651,AF018431.1:1..45,
AF018432.1:658..732,AF018432.1:884..954,
AF018432.1:1391..1447)
/gene=“DUT”
/note=“DUT-N; alternatively spliced; nu clear form of the
pro
tein; simi
lar to H. sa pi
ens dUTPase en coded by GenBank
Accession Number U90224”
/codon_start=1
/product=“dUTPase”
/protein_id=“AAB71394.1”
/db_xref=“GI:2443581”
/translation=“MPCSEETPAISPSKRARPAEVGGMQLRFARLSEHATAPTRGSAR
AAGYDLYSAYDYTIPPMEKAVVKTDIQIALPSGCYGRVAPRSGLAAKHFIDVGAGVID
EDYRGNVGVVLFNFGKEKFEVKKGDRIAQLICERIFYPEIEEVQALDDTERGSGGFGS
TGKN”
ex
on 560..651
/gene=“DUT”
/number=3
ORIGIN
1 tccctaaatc aacacagatc atgtggagga ataaaatggg gttaatatat gtaaaaccaa
61 ttaggaaact gtttctgggg caacacagta aagggcttat tcaatggata ggctagtatt
121 attagttagt aattgggccc tttttttctt tgtttctttt cttcattttt ttccttttca
181 aactatgggt tgtaaagcat ccaccttttg aaagtttgcc tttctgccct ttcacgctga
241 taagtacctc agtttccaat aaacttttgt tcaggggcaa acatttacaa tgttgacatc
301 tcttcacacc accaaaaata ttcatggaga attattttat ctaaagctgt ctttttaata
361 ataaaatagc cacctctacc ttcttcataa acttttaaga tgaattggta attcatcata
421 gcaaggttga ttttagaaac taaagttgca ttaattcatt aaatacactg aaagtaattt
481 tgtatgcttg gtcacaaaga aaatataaaa acaattttat aaatagattt gcagttattt
541 tctttcaata ttttcttagt gcctatgatt acacaatacc acctatggag aaagctgttg
601 tgaaaacgga cattcagata gcgctccctt ctgggtgtta tggaagagtg ggtaagtcat
661 ttaagaaaca ggtaactatt tgtcaagttc tcctttgtga tagattcttc atgtttcatt
I N TR O D U C TI O N 25
Exercises
Lab exercises (practice)
In this part of the exercise, you will learn the basics of some very useful biological
databases. Follow the link or QR code below to background information on data
formats and tutorials on databases you should use to answer the lab exercise
questions. The Basics link includes tutorials for the UniProt (protein) and Ensembl
(genome) databases, and the large collection of databases available at NCBI. The
NCBI drop-down menu includes links to at least 45 different databases, allowing
analysis of everything from gene expression to biochemistry to taxonomy.
Lab Exercise
Part 1: using NCBI PubMed
1. How would you search in PubMed for allpapers about bacteria written by
Pace AND Lane? Use the MEDLINE author field to be more specific. Write the
PubMed search terms you used here.
2. Use the MEDLINE entry of the Pace and Lane article titled “Evolutionary
relationships among sulfur- and iron-oxidizing eubacteria” to get the GenBank
accession numbers in the article. Write the first five accession numbers
below.
3. How would you find allthe experts on progesterone in San Diego? Do a spe
cific search for the keyword and address. Write the search terms you would
use below.
4. Find the PubMed PMID for any two papers from authors at San Diego State
University that have worked with 16S. Write the search term used and the
PMIDs below.
Search Terms:
PMID 1:
PMID 2:
5. Use a combination of the search field and the left sidebar to search for review
papers about Mycobacterium published in the last 5 years. Write the title of
the first search result below:
Title:
I N TR O D U C TI O N 27
1. Write the title line for the FASTA entry below for the DNA sequence of the nu
clear splice mRNA variant description from GenBank entry AH005568.2.
What is a nuclear splice mRNA variant anyway? Explain briefly below. (Hint:
Ask Dr. Wikipedia or Professor Google for things you don’t know or recognize.
The Internet: not just for cat videos anymore!)
2. Find the following information for the human estrogen receptor (accession:
NM_000125):
a. What is the sequence of the polyadenylation [poly(A)] signal? Hint: look for
“polyA_signal_sequence” in the file.
b. Write the title line of the coding region protein sequence in FASTA format
of estrogen receptor alpha isoform 1 (Homo sapiens estrogen receptor 1):
c. At what position of the DNA record does the protein coding region start for
the alpha isoform?
f. Find the FASTA protein sequence of 30S ribosomal protein S3. Hint: click
on the protein link (part e) and search the protein table for “30S ribosomal
protein S3.” Then select the link under “Protein product.” Write the title line
of the FASTA entry below.
2. Using UniProt, find the following information about the gene involved in cystic
fibrosis transmembrane conductance regulation in Homo sapiens.
b. Describe briefly the function of this gene (see database entry under Function).
T
he inventors of the BLAST (Basic Local Alignment Search Tool) computer
program should win the Nobel Prize in Physiology or Medicine. I’m seri
ous! Actually, they should probably share it with the inventors of the
FASTA algorithm, which preceded the more efficient and flexible BLAST
algorithm.
Why should the makers of a computer program that finds the best database
match to a biological sequence win the most prestigious award in biomedic al sci
ence? The Nobel Prize is usually reserved for discoveries made in the laboratory,
such as the discovery of novel disease organisms, how to genetic ally engineer
mice, proteins that glow in the dark, the mechanism of cell death, that sort of thing.
A computer program seems a paltry gimmick in comparison, and biologists often
chuckle or even scoff when I suggest that it deserves the Nobel. Determining the
best match of a biological sequence appears rather uninspiring compared with
the discovery of antibiotics or proteins that act like viruses. Yet in an era of rapidly
expanding DNA sequencing capability and big biological data, BLAST’s simplicity,
elegance, and efficiency make this algorithm the most powerful biomedical dis
covery tool in the history of science.
BLAST It
Researchers use BLAST so often that it has been turned into a verb. (Like “to
google” or “to chill”.) “Hey, did you BLAST that sequence I gave you yet? No?
Well, what are you waiting for? BLAST it!” To understand why BLAST is so use
ful and popular, imagine that you work in a molecular biology lab studying a
novel infectious disease and you just received the results of your very first se
quencing run, perhaps the first genetic information in the history of this unknown
organism. Unfortunately, the sequence looks like this1:
31
32 COMPU TATIO NA L B IOL OGY
>FX093345
TGCAGTCGATCATCAGCATACCTAGGTTTCGTCCGGGTGTGACCGAAAGGTAAGATGGAGAGCCTTGTTC
TTGGTGTCAACGAGAAAACACACGTCCAACTCAGTTTGCCTGTCCTTCAGGTTAGAGACGTGCTAGTGCG
TGGCTTCGGGGACTCTGTGGAAGAGGCCCTATCGGAGGCACGTGAACACCTCAAAAATGGCACTTGTGGT
CTAGTAGAGCTGGAAAAAGGCGTACTGCCCCAGCTTGAACAGCCCTATGTGTTCATTAAACGTTCTGATG
CCTTAAGCACCAATCACGGCCACAAGGTCGTTGAGCTGGTTGCAGAAATGGACGGCATTCAGTACGGTCG
TAGCGGTATAACACTGGGAGTACTCGTGCCACATGTGGGCGAAACCCCAATTGCATACCGCAATGTTCTT
CTTCGTAAGAACGGTAATAAGGGAGCCGGTGGTCATAGCTATGGCATCGATCTAAAGTCTTATGACTTAG
That’s some exciting research you have there! Well, maybe, but what exactly
is it? Clearly, it is a DNA sequence of some sort, but is it the secret to uncovering
a deadly new pathogen, or is it human DNA contamination from the technician at
the sequencing facility? Is it part of a protein toxin that allows a deadly microbe to
destroy epithelial cells in the lungs, or is it a regulatory gene sequence from a
housefly that landed on your pipette tip when you weren’t looking?
Eyeballing a bunch of letters, the same four letters (A, G, T, and C) repeated in
different orders, will get you nowhere. Wouldn’t it be great if, withoutpilfering a
pipette or pouring a petri plate, you could answer the following questions?
FIGURE 1.1. Results of a BLAST nucleotide search with the FX093345 nucleotide
sequence. The results, returned in under a second, reveal that the sequence is an exact
match to the genome of a severe acute respiratory syndrome (SARS) coronavirus protein
first isolated in Shanghai (apparently misspelled “Shanhgai” in the database), China. The
first 50 matches were to SARS coronaviruses isolated at different times, which provides
confidence that we have indeed isolated a SARS virus. At the time this BLAST analysis was
performed, the sequence was a perfect 100% match (bottom right under the “Ident” column)
to more than one sequence, so we cannot say with certainty that it is the Shanghai strain.
The fact that so many were identical also means that the results appear in a random order.
That, and the fact that databases are constantly updated with new sequences, means that
another search with the sequence may return different results from those seen in the figure.
One of the most important considerations for interpreting BLAST results is the E-value.
The E-value is the Expectation value, which tells the user how likely the search similarity
result is due to random chance. An E-value of 1e-4 is the same as 0.0001, which indicates
this similarity would occur 1 in 10,000 times by chance. Note, an E-value of 0.0 signifies a
likelihood below 1e-250.
useful. As an example, Fig. 1.3 shows some of the standard experimental ap
proaches needed to determine the function of a protein-coding gene found in a
bacterial genome.
While the cost of characterizing even a single protein-coding gene can be sig
nificant in terms of both time and money, once we have this characterization and
this sequence in a database, algorithms like BLAST become especially powerful.
Figure 1.3B illustrates how BLAST and DNA sequencing technology can amplify
biological knowledge.
In the example, BLAST is really amplifying the experimental knowledge by
allowing us to infer the function of 20 different previously unknown DNA se
quences through a rapid sequence alignment. The process can also go in the
other direction. For instance, we could use BLAST to discover if a DNA sequence
that codes for a gene of unknown function is present in every living organism
ever sequenced. However, should an experimentalist decide to target the pro
tein product of this gene of unknown function for analysis and figure outits mo
lecular function, we would suddenly know the function of this gene in millions of
species.
Given the rate of sequence generation, it is safe to say that BLAST and other
bioinformatics methods are the primary source of information for 99.9% of all
newly sequenced DNA. This has made BLAST instrumental for discovering, among
other things, the following:
B LA S T 35
FIGURE 1.3. Characterizing the function of a novel bacterial protein encoded in the
Escherichia coli genome and then using BLAST to find similar functions in newly
sequenced genomes. (Left) Flowchart of approaches necessary to characterize a novel
protein in the E. coli genome. Most bacteria have one copy of each gene on a single circular
chromosome.2 Each of the flowchart steps represents weeks or months of painstaking
work. Typically, this process may take a year and cost on the order of $50,000. (Right)
Once a gene has been characterized, sequences from other bacterial genomes (circles) can
be quickly matched to this sequence using BLAST. The figure shows BLAST identifying
porin proteins in 20 different bacterial genomes. When the matches are sufficiently strong,
we can infer that these proteins have functions highly similar to that of the original,
which saves the need to experimentally characterize this protein from all20 genomes.
If this doesn’t warrant a Nobel Prize in Medicine, frankly, I don’t know what
does.
Notes
1. Note the FASTA format of the sequence, the program’s enduring legacy to bioinformatics.
2. In the cell, the genomic DNA is usually tightly wound in a supercoiled state.
36 CO MPUTATIONAL B IOL OGY
Motivation
The purpose of this activity is to teach the basic concepts behind the BLAST algorithm and how
to use a web-based implementation of this algorithm to analyze DNA and protein sequence
data. BLAST (Basic Local Alignment Search Tool) is a fast computational method for making se
quence alignments. Sequence alignments are a critical part of bioinformatics. Computational
methods for making pairwise alignments of biological molecules (DNA, RNA, or protein) were
some of the very first bioinformatics algorithms developed. Among other things, sequence align
ments allow researchers to determine the organisms from which the molecule came (human,
oyster, pine tree, bacterium, etc.) and predict the cellular function of biological molecules based
only on their sequence. For example, BLAST can report with high confidence that the protein se
quence YNFGSGSAYGGSFGGVDGLLAGGEKATMQNL is keratin from the domestic dog hair found
on your sofa.
BLAST was created to speed up the process of making sequence alignments. Full pairwise
sequence alignment methods (see Chapter 03) are too computationally intensive to handle the
alignment of thousands or millions of sequences. BLAST speeds up this process by “chopping
up” an input sequence into smaller bits and matching these smaller bits to millions of different
sequences. The algorithm then attempts to extend the sequence alignment to make a full align
ment. Then the algorithm ranks the sequence alignments, and the longest alignment with the
fewest mismatches wins! In bioinformatics parlance we call the good matches “hits,” and the
best ones are “best hits” or “top hits.” BLAST is a heuristic method, meaning that it is not guaran
teed to find the optimal alignment, but it is much faster than more stringent approaches.
Learning Objectives
. Know the basic purpose and utility of the BLAST computational method (Motivation).
1
2. Understand the concepts behind the BLAST algorithm (Concepts and Exercises).
3. Correctly solve sequence-matching problems based on the BLAST algorithm (Concepts and
Exercises).
4. Learn how to use NCBI’s BLAST web-based sequence analysis website and be able to cor
rectly interpret its output (Concepts and Exercises).
Concepts
As mentioned above, the purpose of the BLAST algorithm is to find the best hit (highest-scoring
match) of an unknown DNA or protein sequence in a database. The method has been so suc
cessful because of its clever simplicity. In order to better grasp the method behind the algorithm,
try the preparatory exercise. Using your brain and a pencil, try to find regions (local alignments)
that best match the following DNA sequence and protein sequences. First, try to match the
Query DNA sequence to the Sbjct1 (Subject) DNA sequence. Then do the same for the protein
Query and Sbjct sequences. Find the regions of best alignment between the two, and keep in
B LA S T 37
mind that the match doesn’t have to be perfect and may even need spaces to help it line up.
Circle or draw lines between the matching letters.
DNA MATCH
Query: AGCGAATATTATGTTGAAGTAGCAAAGTCCTGGAGCCT
Sbjct: ACTACAGGGGAGTTTTGTTGAAGTTGCAAAGTCCTGGAGCCTCCAGAGGGC
PROTEIN MATCH
Query: MEMKATTALLNDRVLRAMLYFWCKAEETCALEVCEE
Sbjct: ETIRRAYPDANLLNDRVLRAMLYFWRKAEETCAPSVSMRKIVATWMLEVCEE
Reflection
• How much of the query did you try to match at one time?
• How did you find a match? Can you describe it in words?
• Were there any mismatches for the best sequence?
• Were there ever multiple matches? Would breaking up the Query (introducing a gap)
help?
Below is the answer. The vertical lines indicate a perfect match between the letters of the two
sequences. Note that there are some mismatches and that a big gap must be inserted for the
end of the protein Query sequence to match the Sbjct sequence (LEVCEE).
DNA MATCH
Query: AGCGAATATTATGTTGAAGTAGCAAAGTCCTGGAGCCT
|| ||||||||| |||||||||||||||||
ACTACAGGGGAGTTTTGTTGAAGTTGCAAAGTCCTGGAGCCTCCAGAGGGC
Sbjct:
PROTEIN MATCH
Most students solve this problem by (i) sliding the Query sequence along the Sbjct sequence, (ii)
finding a short region that matches well, and then (iii) extending the match as far as possible.
This is essentially how the BLAST algorithm works. Figure 1.1.1 details the basic steps of the
algorithm.
38 CO MPUTATIONAL B IOL OGY
FIGURE 1.1.1. Principles behind the BLAST algorithm. (1) The first step of the algorithm is
to break the Query sequence into smaller pieces called “words.” For DNA, the word size is
usually 10 or 11 letters long, but the example uses four-letter words for simplicity. (Four-letter
words. *snicker*) Protein sequence matches start with fewer letters. (2) Then the algorithm
slides these smaller words across possible target sequences until it finds a perfect match
with one of the small words. (3) Starting with this small alignment, BLAST then extends the
alignment until it runs outof letters or the alignment becomes poor (lots of mismatches). The
score of the alignment is determined by summing up the scores for matches and mismatches.
For DNA, BLAST uses +5 for a match and −4 for a mismatch. The scores of protein sequence
alignments are determined by using a special table of match/mismatch scores, such as the
BLOSUM62 matrix (see Chapter 07).
Exercises
Interactive exercise (theory)
Use the online BLAST Interactive Link below to learn how the algorithm makes
and scores BLAST sequence alignments. Click on the dark circle with the yellow
letter I at the top of the page to learn how to use the BLAST Exercise teaching
interactive. Once you learn how it works, solve the activity problem.
Problem
Practice with the BLAST Exercise Link, then solve the problem below.
1. Write the best alignment of the Query to each DNA sequence in the boxes
and circle the first matching word from the Query.
Lab Exercise
Click on the sample and lab exercise data link for the sequence data used in this
exercise.
Part 1
Use NCBI BLAST tools to analyze the following DNA that you just sequenced
from a plasmid and answer the following questions:
>Part1_Plasmid_Derived_Sequence
C G T T TA C G G C G T G G A C TA C C A G G G TAT C TA AT C C T G T T C G C T C C C C A A C G C T T T C G C T C C T C A G C G T
C A G T TA C T G C C C A G A G A C C C G C C T T C G C C A C C G G T G T T C C T C C T G ATAT C T G C G C AT T C C A C C G C TA
CACCAGGAATTCCAGTCTCCCCTGC
1. Use the NCBI BLAST tool to perform a sequence search with the above
sequence.
a. The highest-scoring BLAST hit is to what named organism? (Ignore the
unknown/uncultured organism hits.)
c. What is the function of the gene, if known? (Don’t know? Try asking “Pro
fessor” Google or “Dr.” Wikipedia!)
i. E-value:
ii. Identities:
42 COMPU TATIO NA L B IOL OGY
Part 2
>Part2_Protein_Sequence
M N G T E G P N F Y V P F S N K T G V V R S P F E Y P Q Y Y L A E P W Q F S M L A AY M F L L I V L G F P I N F LT LY V T V Q H K K
L R T P L N Y I L L N L AVA N H F M V F G G F T T T LY T S L H G Y F V F G S T G C N L E G F FAT L G G E I A LW S LV V L A I E
RY V V VC K P M S N F R F G E N H A I M G VA F TW V M A L AC A A P P LVG W S RY I P E G M Q C S C G I DY Y T L K P E V N N E
S F V I Y M F V V H F T I P M T I I F F C Y G Q LV F T V K E A A A Q Q Q E S AT T Q K A E K E V T R M V I I M V I A F L I C W V P Y
A S VA F Y I F T H Q G S D F G P I L M T L PA F FA K S S A I Y N P V I Y I M M N K Q F R N C M LT T I C C G K N P F G E E E G S T
TASKTETSQVAPA
1. Use the NCBI BLAST tool to perform a sequence search with the protein se
quence above.
i. E-value:
ii. Identities:
B LA S T 43
c. How many different reading frames did the program reveal that were lon
ger than 100 amino acids (aa)?
>Part2_Mystery_DNA
TCCCTCCACAAAGAATGGAGCTGTGAACTACTAGCACGCAATGTGATTCCTGCAATTGAAAATGAACAAT
ATAT G C TAC TTATAG ATA AC G G TATT C C G AT C G C TTATT G TAG TT G G G C AG ATTTA A AC C TT G AG AC T G A
GGTGAAATATATTAAGGATATTAATTCGTTAACACCAGAAGAATGGCAGTCTGGTGACAGACGCTGGATT
ATT G ATT G G G TAGCACCATT CGGACATT CT CAATTACTTTATA A A A A A ATGTGTC AGA A ATACC C TGATA
TGATCGTCAGATCTATACGCTTTTATCCAAAGCAGAAAGAATTAGGCAAAATTGCCTACTTTAAAGGAGG
TAAATTAGATAAAAAAACAGCAAAAAAACGTTTTGATACATATCAAGAAGAGCTGGCAACAGCACTTAAA
A AT G A ATTTA ATTTTATTAAAAAATAGAAGGAGACATC C C TTATGGGA ACTAGACTTAC A ACC C TATC A A
ATGGGCTAAAAAACACTTTAACGGCAACCAAAAGTGGCTTACATAAAGCCGGTCAATCATTAACCCAAGC
C G G C AG TT C TTTA A A A AC T G G G G C A A A A A A A ATTAT C C T C TATATT C C C C A A A ATTAC C A ATAT G ATAC T
GAACAAGGTAATGGTTTACAGGATTTAGTCAAAGCGGCCGAAGAGTTGGGGATTGAGGTACAAAGAGAAG
AACGCAATAATATTGCAACAGCTCAAACCAGTTTAGGCACGATTCAAACCGCTATTGGCTTAACTGAGCG
TGGCATTGTGTTATCCGCTCCACAAATTGATAAATTGCTACAGAAAACTAAAGCAGGCCAAGCATTAGGT
TCTGCCGAAAGCATTGTACAAAATGCAAATAAAGCCAAAACTGTATTATCTGGCATTCAATCTATTTTAG
GCTCAGTATTGGCTGGAATGGATTTAGATGAGGCCTTACAGAATAACAGCAACCAACATGCTCTTGCTAA
AGCTGGCTTGGAGCTAACAAATTCATTAATTGAAAATATTGCTAATTCAGTAAAAACACTTGACGAATTT
GGTGAGCAAATTAGTCAATTTGGTTCAAAACTACAAAATATCAAAGGCTTAGGGACTTTAGGAGACAAAC
TCAAAAATATCGGTGGACTTGATAAAGCTGGCCTTGGTTTAGATGTTATCTCAGGGCTATTATCGGGCGC
44 COMPU TATIO NA L B IOL OGY
AACAGCTGCACTTGTACTTGCAGATAAAAATGCTTCAACAGCTAAAAAAGTGGGTGCGGGTTTTGAATTG
GCAAACCAAGTTGTTGGTAATATTACCAAAGCCGTTTCTTCTTACATTTTAGCCCAACGTGTTGCAGCAG
G TTTAT C TT C A AC T G G G C C T G T G G C T G C TTTA ATT G C TT C TAC T G TTT C T C TT G C G ATTAG C C C ATTAG C
ATTTGCCGGTATTGCCGATAAATTTAATCATGCAAAAAGTTTAGAGAGTTATGCCGAACGCTTTAAAAAA
TTAGGCTATGACGGAGATAATTTATTAGCAGAATATCAGCGGGGAACAGGGACTATTGATGCATCGGTTA
CTGCAATTAATACCGCATTGGCCGCTATTGCTGGTGGTGTGTCTGCTGCTGCAGCCGGCTCGGTTATTGC
TT C AC C G ATT G C C TTATTAG TAT C T G G G ATTAC C G G T G TA ATTT C TAC G ATT C T G C A ATATT C TA A AC A A
GCAATGTTTGAGCACGTTGCAAATAAAATTCATAACAAAATTGTAGAATGGGAAAAAAATAATCACGGTA
AG A AC TAC TTT G A A A AT G G TTAC G AT G C C C G TTAT C TT G C G A ATTTAC A AG ATA ATAT G A A ATT C TTAC T
GAACTTAAACAAAGAGTTACAGGCAGAACGTGTCATCGCTATTACTCAGCAGCAATGGGATAACAACATT
GGTGATTTAGCTGGTATTAGCCGTTTAGGTGAAAAAGTCCTTAGTGGTAAAGCCTATGTGGATGCGTTTG
AAGAAGGCAAACACATTAAAGCCGATAAATTAGTACAGTTGGATTCGGCAAACGGTATTATTGATGTGAG
TAATTCGGGTAAAGCGAAAACTCAGCATATCTTATTCAGAACGCCATTATTGACGCCGGGAACAGAGCAT
CGTGAACGCGTACAAACAGGTAAATATGAATATATTACCAAGCTCAATATTAACCGTGTAGATAGCTGGA
AAATTACAGATGGTGCAGCAAGTTCTACCTTTGATTTAACTAACGTTGTTCAGCGTATTGGTATTGAATT
AGACAATGCTGGAAATGTAACTAAAACCAAAGAAACAAAAATTATTGCCAAACTTGGTGAAGGTGATGAC
AACGTATTTGTTGGTTCTGGTACGACGGAAATTGATGGCGGTGAAGGTTACGACCGAGTTCACTATAGCC
GTGGAAACTATGGTGCTTTAACTATTGATGCAACCAAAGAGACCGAGCAAGGTAGTTATACCGTAAATCG
TTTCGTAGAAACCGGTAAAGCACTACACGAAGTGACTTCAACCCATACCGCATTAGTGGGCAACCGTGAAG
A A A A A ATAG A ATAT C G T C ATAG C A ATA AC C AG C AC C AT G C C G G TTATTAC AC C A A AG ATAC C TT G A A AG
C T G TT G A AG A A ATTAT C G G TAC AT C AC ATA AC G ATAT C TTTA A AG G TAG TA AG TT C A AT G AT G C C TTTA A
CGGTGGTGATGGTGTCGATACTATTGACGGTAACGACGGCAATGACCGCTTATTTGGTGGTAAAGGCGAT
GATATTCTCGATGGTGGAAATGGTGATGATTTTATCGATGGCGGTAAAGGCAACGACCTATTACACGGTG
GCAAGGGCGATGATATTTTCGTTCACCGTAAAGGCGATGGTAATGATATTATTACCGATTCTGACGGCAA
T G ATA A ATTAT C ATT C T C T G ATT C G A AC TTA A A AG ATTTA AC ATTT G A A A A AG TTA A AC ATA AT C TT G T C
ATCACGAATAGCAAAAAAGAGAAAGTGACCATTCAAAACTGGTTCCGAGAGGCTGATTTTGCTAAAGAAG
TGCCTAATTATAAAGCAACTAAAGATGAGAAAATCGAAGAAATCATCGGTCAAAATGGCGAGCGGATCAC
CTCAAAGCAAGTTGATGATCTTATCGCAAAAGGTAACGGCAAAATTACCCAAGATGAGCTATCAAAAGTT
GTTGATAACTATGAATTGCTCAAACATAGCAAAAATGTGACAAACAGCTTAGATAAGTTAATCTCATCTG
TAAGTGCATTTACCTCGTCTAATGATTCGAGAAATGTATTAGTGGCTCCAACTTCAATGTTGGATCAAAG
TTTAT C TT C T C TT C A ATTT G C TAG AG C AG C TTA ATTTTTA AT G ATT G G C A AC T C TATATT G TTT C AC AC A
TTATAGAGTTGCCGTTTTATTTTATAAAAGGAGACAATATGGAAGCTAACCATCAAAGGAATGATCTTGG
TTTAG TT G C C C T C AC TAT G TT G G C AC A ATAC C ATA ATATTT C G C TTA AT C C G G A AG A A ATA A A AC ATA A A
CHAPTER
02
PROTEIN ANALYSIS
P
roteins are the workhorses of biology. There is a reason the central dogma of
molecular biology ends in the formation of proteins: every cellular and physi
ologic al process within an organism involves proteins. Whether it is for syn
thesizing RNA or DNA, propagating signals along nerve fibers, or destroying
disease-causing pathogens, proteins (mainly enzymes) do most or allof the
work. Proteins also determine the shape and mobility of cells (structural proteins),
act as gatekeepers for molecules entering and exiting cell membranes (mem
brane proteins), respond to external signals such as hormones or glucose (recep
tor proteins), and transmit signals inside of cells (cell signaling proteins).
Figure 2.1 presents a few examples of proteins that perform common cellular
processes.
Protein Bioinformatics
In order to understand the biological function of a protein, one must first under
stand its physic al properties. Ideally, one would know the entire three-dimen
sional (3D) structure of a protein and the location and angle of every amino acid in
the protein at the atomic level. Unfortunately, experim ental determination of pro
tein 3D structure, primarily performed by first crystallizing the protein, is often a
monum ental undertaking that is not possible for alltypes of structures (e.g., pro
teins inside cell membranes known as transmembrane proteins). Protein crystal
structures can take months or years to determine, and many labs devote their en
tire research program to the crystallization of important proteins.
Naturally, slow and methodical experimental methods cannot possibly keep
up with the rate at which scientists are generating protein sequence data. Sequenc
ing technologies can generate the entire sequence of multiple bacterial genomes
in a day. Given that an average sized bacterial genome may encode >3,000 pro
tein sequences, you can clearly see the need for computational methods to speed
up this process of determining protein functions. To fill this need,1 bioinformati
cians have designed algorithms to leverage our vast existing knowledge of amino
acids and protein structure to develop algorithms that can determine structural or
functional properties of proteins using only the primary protein sequence. This is
47
48 COMPU TATIO NA L B IOL OGY
Bioinformatics Methods
The algorithms we cover in this chapter attempt to predict physic al and structural
properties of a protein using only its primary sequence, literally a text string of amino
acid letters. (Like this protein sequence of unknown function: DRKELLEYISRAD.)
Of course, the fastest methods of determining the structure of a novel unknown
protein sequence is to find a highly similar match in a database to a protein with
known structure. For example, it would be easy to determine most of the struc
ture and function of a newly sequenced bacterial outer membrane protein (OMP)
because there are many well-characterized OMPs with 3D structures already in
GenBank, UniProt, and other databases. Matches to more distantly related
sequences with similar structures can also reveal important structural fea
tures. The Pfam (protein family) database, COG (clusters of orthologous groups)
database, and other databases use various algorithms, such as multiple-sequence
alignments and hidden Markov models, to group proteins into various functional
classes. A significant match to a family or cluster could provide insight into the
protein’s structure or function.
PR O TEI N A N A LY S I S 49
The problem becomes more difficult as the protein becomes more dissimilar
to known proteins. In this chapter, we cover methods that predict the properties
of protein sequences based only on the amino acid composition of the se
quence. The first method we discuss predicts the relative hydrophobicity of
protein sequences by scanning the sequence and looking for regions of the
protein with lots of hydrophobic amino acids. Hydrophobic means water (hydro-)
fearing (-phobic). Protein regions that are hydrophobic are usually found inter
acting with hydrophobic molecules, such as the lipids found in the cell mem
brane. Figure 2.3 illustrates some examples of proteins with critical hydrophobic
regions.
The second method we cover is a probability approach to determine the second
ary structure of a novel protein using the primary sequence. Protein 3D structures
50 COMPU TATIO NA L B IOL OGY
are really combinations of two main secondary structural elements, alpha helices
and beta sheets, with loop regions connecting the various structural elements.
Figure 0.8 illustrates how the combinations of protein secondary structure ele
ments can combine to form the final 3D structure of a protein. The Chou-Fasman
algorithm that we look at in this chapter uses knowledge gleaned from many pro
tein sequences to determine how often particular amino acids are found in alpha
helices or beta sheets.
PR O TEI N A N A LY S I S 51
Notes
1. “I feel the need, the need for speed.”—Tom Cruise, pilot and bioinformatician.
2. The ribosome matches the codon with the tRNA anti-codon to covalently bind together the
amino acids during protein synthesis.
3. An example of this is the androgen receptor, which binds the steroid hormone testosterone
and acts as a transcription factor to alter the expression of muscle-building genes, especially
in “roid ragers.” Or professional athletes.
52 COMPU TATIO NA L B IOL OGY
Motivation
The purpose of this activity is to teach the concepts behind hydrophobicity plotting. Hydropho
bicity plots use a simple sequence scanning approach and experim ental values of amino acid
hydrophobicity to determine which parts of a protein are hydrophobic. Generally speaking, hy
drophobicity is one of the most important properties of biological molecules. Proteins tend to be
a mix of both hydrophobic and hydrophilic amino acids. Hydrophobic (“water-fearing”) proteins
tend to be found in parts of the cell that are rich in lipids, such as the cell membrane, the nuclear
membrane, and vesicles. If we were to determine that a protein had many long sections of
hydrophobic amino acids, this could mean that the protein was part of a membrane or vesicle.
Similarly, if we were to determine that a part of the protein was hydrophilic (“water loving”), we
could predict that this part of the protein was not likely to be found in a membrane or binding
lipid molec ules. Instead, we might predict that it was intra- or extracellular or bound to hydro
philic charged molecules like DNA or RNA. In this section, you will learn a basic method of hydro
phobicity plotting for predicting hydrophobic (or hydrophilic) regions of any given protein sequence.
Afterwards, you will gain practice with online software that determines the most hydrophobic
regions of proteins using the same algorithm.
Learning Objectives
1. Understand the basic concept of protein hydrophobicity and why it is important for under
standing protein function (Motivation).
2. Learn how to use a sliding window algorithm and an amino acid hydrophobicity scale to de
termine the most hydrophobic region of a protein (Concepts and Exercises).
3. Use online hydrophobicity software to plot hydrophobic regions of a protein sequence and
interpret the output (Concepts and Exercises).
Concepts
To better understand the method behind hydrophobicity plotting, try the preparatory exercise on
the next page, which asks you to circle or underline the most hydrophobic region of the protein
sequence “Protein Seq 1” using information in the diagram. The diagram indicates the amount
of free energy needed to dissolve each of the amino acids found in biological protein sequences
into a hydrophobic solvent. The more negative the change in free energy value (∆G), the more
readily the amino acid dissolves into the solvent (in this case, octanol) and the more hydrophobic
the amino acid. For example, leucine (L) has a ∆G of −1.2. The more hydrophilic amino acids have
high positive ∆G values (e.g., histidine [H] has a ∆G of +2.4).
PR O TEI N A N A LY S I S 53
Protein Seq 1:
E R K PY LV AW M K R
FIGURE 2.1.1. Free energies for transfer of amino acids from water to octanol. Green bars, charged
residues; orange bars, polar residues; purple bars, hydrophobic residues. Data from Bowie JU. 2005.
Nature 438:581–589.
Reflection
• How did you find the beginning of the hydrophobic region?
• Was searching for the hydrophobic region similar in any way to the BLAST algorithm? Why
or why not?
• Could you use the free energy diagram (Fig. 2.1.1) to score this hydrophobic region?
• Plot the average hydrophobicity scores of the first 4 amino acids, the middle 4 amino acids,
and the final 4 amino acids in the box on the next page.
54 COMPU TATIO NA L B IOL OGY
The most hydrophobic region of the sequence is underlined below. To determine a score for
this region, one might sum up the total free energy score of this 6-amino-acid-long region. Better
yet, the total could be divided by the length of the hydrophobic region to determine the average
hydrophobicity of this region. In this case, the most hydrophobic region would be negative.
Protein Seq 1:
E R K PY LV AW M K R
The hydrophobicity plotting method you will learn using the interactive module also scans
protein sequences a few amino acids at a time and calculates the average hydrophobicity. How
ever, instead of free energy values, we will use the hydropathy scores developed by Kyte and
Doolittle (a Kyte-Doolittle hydropathy plot), in which the most hydrophobic amino acids have
large positive values. For example, the Kyte-Doolittle hydropat hy value for the hydrophobic amino
acid leucine is +3.8, while the hydropathy of the polar amino acid histidine is −3.2.
Exercises
Interactive exercise (theory)
Use the online hydrophobicity exercise link below to learn how to find the
hydrophobic regions of a protein sequence. The Interactive Link explains how to
use the teaching interactive. Once you learn how it works, solve the activity
problem.
Problem
1. Fill in the two empty boxes using the amino acid hydrophobicity scores below.
3. Write the score and average for the last 5-amino-acid window.
56 COMPU TATIO NA L B IOL OGY
Lab Exercise
>ProteinSequence21A
M G P T S V P LV K A H R S S V S D Y V N Y D I I V R H Y N Y T G K L N I S A D K E N S I K LT S V V F I L I C C F I I L E N I F V L
LT I W K T K K F H R P M Y Y F I G N L A L S D L L A G VAY TA N L L L S G AT T Y K LT PA Q W F L R E G S M F VA L S A S V F S
L L A I A I E R Y I T M L K M K L H N G S N N F R L F L L I S A C W V I S L I L G G L P I M G W N C I S A L S S C S T V L P LY H K H
Y I L F C T T V F T L L L L S I V I LY C R I Y S LV R T R S R R L T F R K N I S K A S R S S E K S L A L L K T V I I V L S V F I A C
WA P L F I L L L L D V G C K V K T C D I L F R A E Y F LV L AV L N S G T N P I I Y T LT N K E M R R A F I R I M S C C K C P S G D
SAGKFKRPIIAGMEFSRSKSDNSSHPQKDEGDNPETIMSSGNVNSSS
a. What is this protein? (Hint: you may need to use another algorithm that we
have already covered to search for a close match to this sequence.)
b. Create a hydrophobicity plot (window size = 19) with this sequence data
using the Kyte-Doolittle hydrophobicity scale. Draw/show the plot below.
c. Repeat the plotting using the Eisenberg hydrophobicity scale, a scale based
on different free energy values. Describe briefly how they compare in
terms of transmembrane predictions?
2. Use ProteinSequence21A with TMHMM. Draw/show the TMHMM plot below.
a. How do the plot results compare to the Kyte-Doolittle and Eisenberg graph
predictions? Explain.
58 CO MPUTATIONAL B IOL OGY
Motivation
As described in the Chapter 02 introduction, the structure of a protein can be described at three
different levels: primary (also written as 1°), secondary (2°), and tertiary (3°). The primary struc
ture is the linear order of amino acids, and the tertiary structure is the full 3D structure. In be
tween primary and tertiary structures stands the secondary structure, which also tells a great
deal about protein structure and is easier to predict than the tertiary structure. The two basic
types of secondary structures are alpha helices and beta sheets. There are also loop regions,
which mainly serve to connect the helical and sheet structural elements. Proteins are essentially
alpha helices, beta sheets, and loops arranged in different combinations. Some tertiary struc
tures are formed exclusively from one type of secondary structure (e.g., the structure known as
a beta barrel), but more often they have a mix of alpha helices and beta sheets.
The purpose of this activity is to teach the Chou-Fasman algorithm for predicting protein sec
ondary structure based on the primary protein sequence. The Chou-Fasman algorithm was first
designed in the early 1970s by, you guessed it, Chou and Fasman. By looking at amino acids in
proteins with known structure, Chou and Fasman determined how often each amino acid ap
peared in an alpha helix, beta sheet, or loop region. They then assigned a likelihood score, called
a propensity, for each of the 20 most common amino acids. Some amino acids are more com
mon in alpha helices than in beta sheets and vice versa. In this chapter, you will learn how to use
the propensities to score a protein sequence as being an alpha-helical region or a beta sheet and
then use online prediction software on actual protein sequences.
Learning Objectives
1. Understand the basic concept of protein secondary structure, the two main forms (alpha he
lix and beta sheet), and why it is important for understanding protein function (Motivation).
2. Learn the principles behind the Chou-Fasman algorithm and be able to calculate the relative
scores for the likelihood of a protein sequence forming an alpha helix or a beta sheet (Con
cepts and Exercises).
3. Learn how to use a sliding-window algorithm and secondary structure propensity scores to pre
dict whether a section of a protein is an alpha helix or a beta sheet (Concepts and Exercises).
4. Use online secondary structure prediction (Chou-Fasman) software to predict protein second
ary sequence and interpret the output (Concepts and Exercises).
Concepts
To understand how to use the Chou-Fasman algorithm, one must first understand the principle
of propensity and how propensity values are calculated. The word propensity means an inclina
tion or natural tendency to behave in a particular way. For example, I have a propensity for nerdi
ness. (Set phasers to stun!) The secondary structure propensity of an amino acid is equivalent to
the probability (or the likelihood) of an amino acid being part of a particular type of secondary
structure. Three propensities are calculated for each amino acid:
PR O TEI N A N A LY S I S 59
Propensities greater than 100 mean that an amino acid is more likely than by random chance
of being in that particular structure, while propensities less than 100 mean that the amino acid is
less likely to be found in that type of secondary structure.
Here are three examples:
These propensity values indicate that alanines are more likely to be found in alpha helices than
chance would dictate and less likely to be in beta sheets or turns. Threonines have a higher pro
pensity to be in beta sheets than in alpha helices and turns, and asparagines are more likely to
be in turns than in alpha helices or beta sheets. To be clear, just because alanines have a higher
propensity to be in alpha helices does not mean that they cannot also be in beta sheets or turns,
but it does mean that if you find a region of a protein sequence with more alanines, this region
is more likely to be an alpha helix.
Calculating Propensities
How are propensities calculated? Well, like most great bioinformatics values, propensities are
based on real (experimental) data. In this case, amino acid secondary structure propensities are
based on how many times each amino acid is found in alpha helices, beta sheets, and turns in
known protein structures. Figure 2.2.1 shows an example of how to calculate the alpha helix pro
pensity [P(a)] of lysines.
Reflection 1
• Based on the data in Figure 2.2.1, what is the propensity for lysines being in a beta sheet? A
turn?
• If the expected value for lysine being in a turn was 0.1 because only 10% (0.1) of the amino
acids in the database were in turns, how would this change the P(turn) for lysine?
• Could you use these numbers to determine whether a region of a protein was likely to be
an alpha helix?
Where's Alphie?
Like "Where's Waldo" but not as frustrating!
The table on page 61 shows the propensities of each amino acid for being in an alpha helix,
beta sheet, or turn. Use the values to determine which region of the Protein Seq 2 might be
an alpha helix.
Protein Seq 2:
A E E M H L R N G I Q C QWY F
60 COMPU TATIO NA L B IOL OGY
FIGURE 2.2.1. Calculation of P(a) for lysines. (A) A protein database with 10 proteins,
each with 300 amino acids evenly split among alpha helices, beta sheets, and turns. (B) To
calculate the P(a), first determine the underlying probability of any given amino acid being in
an alpha helix. From the database we know that it is 1/3 (0.33), since 1/3 of allthe amino
acids in the database have been experimentally determined to be in alpha helices. Focusing
just on lysines, of the 100 lysines found in the database, most of them are in alpha helices
(0.8 or 80%), which suggests a propensity of lysines to be in alpha helices. The observed
P(a) for lysines is 0.80 (80/100), while the expected value is 0.33 for any amino acid to be in
an alpha helix. The propensity is pretty easy to calculate: divide the observed value by the
expected value and multiply by 100.
PR O TEI N A N A LY S I S 61
Reflection 2
• How does searching for alpha helices compare to hydrophobicity plotting or BLAST?
• How many amino acids have a high propensity for being in both alpha helices and beta
sheets?
• Are there any regions in the sequence that might be more likely to be beta sheets?
Below is the answer. The underlined region has a large proportion of amino acids that tend to be
found in alpha helices. The middle region is more likely to be a turn (italics), and the right side is
more likely to be a beta sheet (blue).
A E E M H L R N G I Q C QWY F
Exercises
Interactive exercise (theory)
Use the online exercise link below to learn how
to predict the secondary structure of a protein
sequence. The Interactive Link explains how to use
the teaching interactive. Once you learn how it
works, solve the activity problem.
Problem
1. Circle the “start” and “stop” regions of the protein sequence using the Chou-
Fasman algorithm for an alpha helix.
3. Is it an alpha helix or not? Answer yes or no below and indicate why you
reached this conclusion. (Hint: calculate the score for the exact same region of
the sequence being a beta sheet. Which is greater, the alpha helix or the beta
sheet score?)
Lab Exercise
>ProteinSequence22A
MWVLINLLILMIMVLISVAFLTLLERKILGYIQDRKGPNKIMLFGMFQPFSDALKLLSKEWFFFNYSNLFIYSPMLMFFLS
LVMWILYPWFGFMYYIEFSILFMLLVLGLSVYPVLFVGWISNCNYAILGSMRLVSTMISFEINLFFLVFSLMMMVESFSFN
EFFFFQNNIKFAILLYPLYLMMFTSMLIELNRTPFDLIEGESELVSGFNIEYHSSMFVLIFLSEYMNIMFMSVILSLMFYG
FKYWSIKFILIYLFHICLIIWIRGILPRIRYDKLMNMCWTEMLMLVMIYLMYLYFMKEFLCI
a. First of all, what is this protein? (BLAST on UniProt website http://www
.uniprot.org/blast/)
>ProteinSequence22B
MGPTSVPLVKAHRSSVSDYVNYDIIVRHYNYTGKLNISADKENSIKLTSVVFILICCFIILENIFVLLTIWKTKKFHRPMY
YFIGNLALSDLLAGVAYTANLLLSGATTYKLTPAQWFLREGSMFVALSASVFSLLAIAIERYITMLKMKLHNGSNNFRLFL
LISACWVISLILGGLPIMGWNCISALSSCSTVLPLYHKHYILFCTTVFTLLLLSIVILYCRIYSLVRTRSRRLTFRKNISK
ASRSSEKSLALLKTVIIVLSVFIACWAPLFILLLLDVGCKVKTCDILFRAEYFLVLAVLNSGTNPIIYTLTNKEMRRAFIR
IMSCCKCPSGDSAGKFKRPIIAGMEFSRSKSDNSSHPQKDEGDNPETIMSSGNVNSSS
A
lignments of biological sequences (DNA, RNA, and protein) have been a
centerpiece of bioinformatics from its inception. Chapter 01 covered the
rapid, high-performance pairwise BLAST aligner, and in this section we ad
dress a more sophisticated, albeit slower, method for generating pairwise
sequence alignments that is guaranteed to produce a mathematically opti
mal alignment. We also discuss how pairwise alignments can be turned into
multiple sequence alignments (MSAs).
FIGURE 3.1. Structures of the mouse and human globin proteins. The bumpy blobs
represent the three-dimensional structures of the mouse and human globin proteins.
Notice how the structures are superimposable. The goal is to create a sequence alignment
of the protein sequence, or underlying DNA sequence, in which the homologous
functional regions of the proteins are aligned. In the process of making a sequence
alignment between the mouse and human globin protein primary amino acid sequences,
the sequence corresponding to mouse region 1 (M REGION 1) should be aligned to
human region 1 (H REGION 1), and mouse region 2 should be aligned to human region 2.
Clearly, it would make no sense to align the amino acids (or the underlying DNA that
codes for these amino acids) of mouse region 1 to human region 2. This is an easy
problem when the structures of the homologous molecules are available. The problem
becomes harder when one has only the two sequences and no structure, and harder still
the more distantly related (and different) the sequences are from one another.
which nucleotides (or amino acid positions) have experienced mutations and which
have not.
To find the really important nucleotides or amino acids, any change of which
would destroy the function of the macromolecule, here’s a hint: look for the con
served alignment positions, the positions that have not changed in any of the
sequences. The more constrained or conserved a nucleotide or amino acid se
quence is, the more important it likely is for the functioning of the macromole
cule. Mutations at these positions would almost certainly lead to the death of the
organism, and natural selection would remove it from the gene pool. Figure 3.2
illustrates examples of sequence alignments showing both highly conserved po
sitions and variable positions.
While the conserved positions in sequence alignments indicate functionally
critical nucleotides or amino acids, the so-called varia ble positions can also be
useful. The types and patterns of mutations allowed at these variable positions
S eq u e nc e A lignm ent 69
can be used in a large number of bioinformatics approaches that we will deal with
in other chapters, including
FIGURE 3.2. MSAs of DNA and protein sequences indicating specific conserved and
variable positions. (A) MSA of a transcription factor binding site for the estrogen receptor
protein. The underlined regions indicate the nucleotides that directly bind the protein
transcription factor. Any mutation to the conserved nucleotides indicated by an asterisk would
prevent the binding of the estrogen receptor transcription factor and prevent the transcription
(and therefore translation) of a protein involved in fertility. (B) MSA of 5 related bacterial outer
membrane protein sequences. The proteins are channels that allow ions to move in or outof
bacterial cells. Asterisks indicate amino acids conserved in allthe outer membrane proteins.
The periods and colons indicate amino acid positions with less conservation.
70 COMPU TATIO NA L B IOL OGY
It turns outthat the protein sequences themselves are much easier to align
than the DNA sequences that code for the proteins. Not only is there less redun
dancy, but one can also use sophisticated scoring schemes for matches and mis
matches. These scoring schemes are known as the PAM and BLOSUM matrices
(see Chapter 07 for details on matrix calculation) and provide a score when an
amino acid in one sequence matches the amino acid in the compared sequence
and a mismatch score for every other amino acid. In protein sequence alignments
mismatches can sometimes be useful, as we will discuss in Chapter 07. Activity 3.1
teaches how to use these scoring schemes to align protein sequences.
The most difficult types of sequences to align are noncoding RNA or DNA se
quences. Unlike protein-coding genes, noncoding DNA (e.g., promoter regions
upstream of the coding region) and DNA that encodes structural RNA can accu
mulate lots of insertions or deletions and still function. Insertions or deletions in
protein-coding genes most often lead to frameshift mutations, which result in
completely new and nonfunctioning protein sequences which are usually elimi
nated by natural selection. For RNA sequences, alignments are often performed
by combining information on the RNA’s structure with a dynamic programming
method. Other methods, such as sliding window algorithms, can be used to align
noncoding DNA.
72 CO MPUTATIONAL B IOL OGY
Multiple-Sequence Alignment
In this chapter, we mainly focus on how to generate a sequence alignment for two
sequences. In practice, one usually wants to create a MSA. It turns outthat MSAs
can be created by clever extension of pairwise sequence alignments. The pro
gressive alignment method described by Paulien Hogeweg and Ben Hesper in
1984 was effectively integrated by the creators of the ClustalW program a decade
later. The latest version of this software, one of the most-cited bioinformatics pro
grams, is called Clustal Omega, which you will learn to use in Activity 3.1.
The basic approach of allthe progressive alignment methods is as follows.
FIGURE 3.4. Progressive sequence alignment of four DNA sequences. At the top (left
to right), beginning with the unaligned DNA sequences, allpossible pairs are aligned using
a dynamic programming approach. Distances are created from the sequence alignments
and used to build a guide tree. At the bottom (left to right), using the guide tree, the most
similar sequences are aligned in pairs. Consensus sequences are made from the pairs,
and then the pair of consensus sequences is aligned. Note how the final MSA retains the
gaps from the previous pairwise alignments as well as from the consensus pair alignment.
S eq u enc e A lignm ent 73
The key to this approach is that at each stage of the progressive alignment
the program always aligns only two sequences. Figure 3.4 shows an example of
a progressive MSA with four DNA sequences.
Notes
1. Retroviruses (such as HIV) have high rates of mutations that occur during genome replication
and result in many mutant viral variants. Most result in nonfunctional viruses, but others re
sult in new functioning varia nts of the virus that allow the virus to escape the immune sys
tem. Cool but evil, sort of like Darth Vader.
2. Good sequence alignments are vital for proper data interpretation. For years, poor sequence
alignments that did not properly address large insertion or deletion mutations (gaps) led re
searchers to conclude that birds were related to mammals. This changed when a research
group in China took a closer look and noticed that the sequence alignment was wrong.
74 CO MPUTATIONAL B IOL OGY
Motivation
Alignments of different DNA, RNA, or protein sequences are a fundamental aspect of bioinfor-
matics. Sequence alignments allow similarity comparisons (e.g., BLAST) for function prediction
and organism identification. They also identify when mutations have occurred, such as the occur
rence of single nucleotide polymorphisms in the human genome or the evolution of viruses
during a pandemic. Multiple-sequence alignments are also critic al for a number of other bioinfor-
matics approaches, including many of the other bioinformatics tools covered in this book.
The purpose of this activity is to teach a dynamic programming algorithm for the global (end-
to-end) alignment of any two DNA or protein sequences.1 This algorithm, called the Needleman-
Wunsch method, uses a scoring system and a grid matrix to efficiently determine the best
alignment between a pair of DNA or protein sequences. Dynamic programming algorithms are
common in mathematics, computer science, and bioinformatics. An algorithm is considered
“dynamic programming” or “dynamic optimization” if it breaks down a problem into a set of eas
ily solvable subproblems, each of which is stored and used for the eventual solution. (If only this
would work with personal relationships.) The cool thing about these solutions is that, given a set
of simple assumptions, they are guaranteed to produce the optimal sequence alignment.2 In this
case, the assumptions are a set of scores for the matching of two nucleotides, for the mismatch
ing of two nucleot ides, and for gaps when there has been an insertion or a deletion. This chapter will
teach you how to solve dynamic programming problems for pairwise DNA and protein sequences
and then teach you how to use an online program for constructing MSAs. The Needleman-Wunsch
algorithm and dynamic programming can be a bit confusing at first glance, so make sure to read
through the tutorial on the website carefully and practice using the interactives.
Learning Objectives
1. Learn the importance of sequence alignment and the challenges imposed by different types
of mutations in generating an accurate sequence alignment (Motivation).
2. Use match, mismatch, and gap penalties to create a dynamic programming matrix for a pair-
wise DNA sequence alignment (Concepts and Exercises).
3. Use the matrix to perform the traceback step and determine the final, best global (end-to-
end) alignment of two sequences (Concepts and Exercises).
4. Do the same for pairwise protein sequence alignment using the specialized protein scoring
tables PAM and BLOSUM (Concepts and Exercises).
5. Learn how to use the Clustal Omega program to perform pairwise and multiple-sequence
alignments (Concepts and Exercises).
Concepts
To prepare you to understand the principles and goals of dynamic programming, try the follow
ing anticipatory exercise. The Needleman-Wunsch method, and other pairwise sequence align
S eq u enc e A lignm ent 75
ment methods, uses two-dimensional graphs to solve the alignment problem. Below you will
find graphs that indicate two possible alignments of the same pair of DNA sequences, Seq1 and
Seq2. Follow the paths to determine how to align the nucleotides (letters) of the two different
sequences. Note that one sequence is longer than the other but the alignment is “global,” which
means that the sequences must be aligned from end to end. This also means that, if the se
quences are of different lengths or very divergent from one another, you will need to adjust with
spaces or other characters to create gaps in the sequence so that the correct bases line up prop
erly. Most often hyphens are used to indicate these gaps (indels). Write the alignments below
the graphs.
To begin the problem, in each graph start at the top left square and follow the X’s. The X’s in
dicate the path for each alignment. Notice how Seq1 is shorter than Seq2 but in the graph they
are aligned end to end. This means that you must have gaps to get them to align properly. Both
paths indicate that the G in Seq1 should be aligned to the G in Seq2. From there, both paths
“move” diagonally and have the A in Seq1 aligned to the A in Seq2. However, the next X in the
third square is diagonal in path 1 but vertical in path 2. This means that there are no bases in
Seq1 to align with the G in Seq2, which creates a gap (indicated by a hyphen).
The first three positions of the alignments are given below. Try finishing the rest for both
paths and answer the reflection questions.
Seq1: GATTTA
Seq2: GAGTTCA
PATH 1: PATH 2:
G A T T T A G A T T T A
G X G X
A X A X
G X G X
T X T X
T X T X
C X C X
A X A X
Path 1:
Seq1: GAT _________________
Seq2: GAG _________________
Path 2:
Seq1: GA— _________________
Seq2: GAG _________________
76 COMPU TATIO NA L B IOL OGY
Reflection
• How do you align the letters of the DNA when the path moves diagonally? Do the letters
always match?
• What happens when the paths move vertically? Would a horizontal move be different and, if
so, how?
• Could this type of graphing method be used for protein sequences?
• Which of the alignments seems better? Could some kind of scoring scheme help?
Below is the answer. Each path showed a different alignment for the same pair of sequences
(Seq1 and Seq2). Because Seq1 is shorter, in an end-to-end alignment, allthe letters have a
match in Seq 1, but not allthe Seq2 letters have a match. Thus, in order to get the sequences to
align properly, it is necessary to put in some gaps (spaces) in the sequence alignment, indicated
by a hyphen. The graph indicates what letters you should match up (a diagonal move is a match
or a mismatch). However, when the path moves vertically, it stays on one sequence but moves
along the other. This indicates that you must put a gap in the sequence alignment in the sequence
that does not move. (At this point, I'm feeling quite moved.) Since in this case it is always Seq1,
that is where allthe gaps are placed. (A horizontal move would have been a gap in Seq2.)
Path 1:
Seq1: GAT—TTA
Seq2: GAGTTCA
Path 2:
Seq1: GA—TTTA
Seq2: GAGTTCA
So which of the two alignments for these same sequences is better? The second alignment
path seems better, but we cannot know withouta scoring scheme of some sort. Computers
need numbers, and that is where the Needleman-Wunsch algorithm comes in.
(continued )
78 COMPU TATIO NA L B IOL OGY
FIGURE 3.1.1. Solving a dynamic programming alignment problem. The value for
each cell is based on the values from the surrounding cells. (1) Create a grid with your two
sequences, adding an extra row and column (so, in this example, a 3-by-3 sequence
becomes a 4-by-4 grid). Start with a 0 score in the top left corner. (Gotta start somewhere!)
(2) Moving horizontally or vertically, use the gap penalty score. To fill in the top outside row,
move horizontally and add the gap penalty to the score from the previous cell. For instance,
moving right, add a −1 to the score from the original top left cell. Do the same to the leftmost
column going down from the initial starting cell. (3) To determine the score for an empty cell,
calculate the following three scores: the diagonal (match or mismatch), horizontal (gap
penalty), and vertical (gap penalty), adding each score to that of the starting cell. In this case,
the diagonal is a match (A to A), so we add +1 to the 0 that was in the top left cell, getting a
diagonal score of +1. Next, determine the gap scores for the cell coming from the vertical and
horizontal.3 The gap penalty in this example is −1. The horizontal and vertical starting cells each
have a score of −1, so the horizontal and vertical scores for the new cell are both −2. (4) Pick
the highest (maximal) of the three numbers, and that is the final score for the cell, in this case
+1. (5) The process repeats itself as you move to the next empty cell. (6) In this case, the
diagonal is a mismatch (C to A), so we add 0 to the starting cell’s score, giving us a −1
diagonal score. The vertical −1 gap penalty adds to the vertical starting score of 1, giving us a
final vertical score of 0. The horizontal −1 gap penalty adds to the horizontal starting score of
−2, giving us a final horizontal score of −3. So, the final score for the new cell is 0. (7) The
process is repeated as you continue to fill outthe rest of the empty cells until the entire
graph is completed (8). (9) Solving the graph, traceback, and alignment. The Needleman-
Wunsch algorithm creates a global (end-to-end) sequence alignment.4 Once the graph
is solved, start the traceback at the bottom right corner (the highest final score) for the
global alignment. The arrows show the path of the traceback, which goes back to the
cell from which the highest number was derived. In this case, the “2” is the highest,
and it came from the diagonal cell. (10) Finish the traceback to the top left, and then
(11) follow the path to make the alignment (12).
S eq u enc e A lignm ent 79
Exercises
Interactive exercise (theory)
Use the online Needleman-Wunsch alignment interactive link below to learn how
to create scoring matrices, solve tracebacks, and produce pairwise sequence align
ments. The Interactive Link explains how to use the teaching interactives. Once you
learn how it works, solve the activity problem.
Problems
1. Fill in the blank cells using the Needleman-Wunsch algorithm and then com
plete the traceback. Using the traceback, write the sequence alignment in the
spaces below.
S eq u e nc e A lignm ent 81
2. Fill in the blanks in this protein alignment using the Needleman-Wunsch algo
rithm and the PAM250 scoring matrix, shown in Fig. 3.1.2 (PAM250 is dis
cussed further in Chapter 07). For this problem allthe gap scores are −5, but
the match and mismatch scores come from the PAM250 matrix. For example,
an R-to-W mismatch according to the PAM250 matrix is +2, while an M-to-M
match is +6. (Make sure that you can find these values in the PAM250 matrix.)
To practice this type of problem, click on the Align Proteins button in the inter
active module.
Lab Exercise
DNA multiple-sequence alignment
Unaligned DNA sequences for questions 1 to 3 below (and at the Sample and Lab
Exercise data link):
>BR110MP90
TAATATCAATAGAAGAATTAGCCAAAATTACGTCCTGTCAAACCCCCTATGGTAAATAGAAAAATAAATC
CGATAGCTCATAGGGATGAAGGAGTTAAAGTAATTTGGGAGCCATGGTATGTTGCGAGTCATCTAAAAATTT
TGATTCCAGTAGGAACTGCAATAATTATTGTGGCTGATGTGAAGTAAGCTCGAGTATCAACATCTATCCCTACTGT
AAATATATGATGGGCTCACACTACAAATCCTAGCAGACCAATTGCTATTATAGCATAAATTATTCCTAATAAAC
CGAAAGCTTCCTTTTTGCCACTTTCTTGTCTAATAATATGAGAAATTATTCCGAAACCAGGTAAAATTAGAATAT
A A AC TT C AG G AT G T C C G A A A A AT C A A A ATA A AT G C T G ATA A AG A ATAG G AT C C C C T C C AC C T G AT G G G
TCAAAGAAGGTAGTATTAATATTTCGATCTGTCAATAGTATAGTAATAGCTCCGGCTAATAC
>JE05
AATAATATCAATGGAAGAATTGGCTAAAACTACACCAGTTAATCCCCCTAAAGTAAAGAGGAAAATAAAT
CCAATAGCTCAAAGGGAGGAGGGGTTTAGGGTAATTTGAGACCCATGGTATGTAGCTAACCATCTAAAAATTT
TGATTCCAGTCGGAACTGCAATAATTATTGTGGCAGACGTGAAATAGGCGCGAGTATCTACATCTATTCCTACTG
TGAACATATGATGGGCTCATACTACAAAACCTAATAGTCCAATTGCTATTATAGCATAAATTATTCCCAATAAT
CCAAAAGCTTCCTTTTTTCCTCTTTTCTTGCCTAATAATATGAGAAATTATACCAAATCCTGGTAAAATTAAAATAT
AAACTTCAGGGTGCCCAAAAAATCAGAATAAGTGCTGATAGAGGATAGGGTCTCCTCCACCGGAGGGA
TCAAAAAAAGTAGTATTAATATTTCGGTCTGTTAAAAGTATAGTGATAGCCCCAGCTAAC
>JE06
GAATAATATCAATGGAAGAATTGGCTAAAACTACACCAGTTAATCCCCCTAAAGTAAAGAGGAAAATAAATC
CAATAGCTCAAAGGGAGGAGGGATTTAGGGTAATTTGAGACCCATGGTATGTAGCTAACCATCTAAAAATTT
TGATTCCAGTCGGAACTGCAATAATTATTGTGGCAGACGTGAAATAGGCGCGAGTATCTACATCTATTCCTACTGT
GAACATATGATGGGCCCATACTACAAAACCTAATAGTCCAATTGCTATTATAGCATAAATTATCCCCAATAATC
CAAAAGCTTCCTTTTTACCTCTTTCTTGCCTAATAATATGAGAAATTATACCAAATCCTGGTAAAATTAAAAATATA
AACTTCAGGGTGCCCAAAAAATCAGAATAAGTGTTGATAGAGGATAGGGTCTCCTCCACCGGAGGGAT
CAAAAAAAGTAGTATTAATATTTCGGTCTGTTAAAAGTATAGTGATAGCTCCAGCTAA
>JE11L95
TATCAATGGAAGAATTGGCTAAAATTACACCAGTTAATCCCCCTAAAGTAAAGAGGAAAATAAATCCAATAGCT
CAAAGGGAGGAGGGAGTCAGGGTAATTTGAGATCCATGGTATGTAGCCAATCATCTAAAAATTTTAATTCCAGTC
CGAACTGCAATAATTATTGTGGCAGATGTAAAATAGGCGCGAGTATCTACATCTATTCCTACTGTGAAGTATATG
G AT G A G C T C ATA C TA C A A A A C C TA ATA AT C C A AT T G C TAT TATA G C ATA A AT TAT T C C TA ATA AT C
CAAAAGCTTCCTTTTTTCCTCTTTCTTGCCTAATAATATGGGAAATTATACCAAATCCTGGTAAAATTAAAATATA
AACTTCAGGGTGCCCAAAAAATCAAAATAAGTGTTGATAGAGGATAGGGTCTCCTCCACCGGAGGGGT
CAAAAAAAGTAGTATTAATATTTCGGTCTGTTAAAAGTATAGTGATAGCCCCAGCTAATACCG
S eq u enc e A lignm ent 85
>JE17NB95
AATATCAATGGAAGAATTGGCTAAAACTACACCAGTTAATCCCCCTAAAGTAAAGAGGAAAATAAATCCAATAGCT
CAAAGGGAGGAGGGATTTAGGGTAATTTGAGACCCATGGTATGTAGCTAACCATCTAAAAATTTTGATTCCAGTC
GGAACTGCAATAATTATTGTGGCAGACGTGAAATAGGCGCGAGTATCTACATCTATCCCTACTGTGAACATNAT
G AT G G G C T C ATA C TA C A A A A C C TA ATA G T C C A AT T G C TAT TATA G C ATA A AT TAT T C C C A ATA AT C
CAAAAGCTTCCTTTTTTCCTCTTTCTTGCCTAATAATATGAGAAATTATACCAAATCCTGGTAAAATTAAAATATA
AACTTCAGGGTGCCCAAAAAATCAGAATAAGTGTTGATAGAGGATAGGGTCTCCTCCACCGGAGGGAT
CAAAAAAAGTAGTATTAATATTTCGGTCTGTTAAAAGTATAGTGATAGCCCCAGCTAACACC
>JE19NB95
ATATCAATGGAAGAATTGGCTAAAACTACACCAGTTAATCCCCCTAAAGTAAAGAGGAAAATAAATCCAATAGCT
CAAAGGGAGGAGGGATTTAGGGTAATTTGAGACCCATGGTATGTAGCTAATCATCTAAAAATTTTGATTCCAGTCG
GAACTGCAATAATTATTGTGGCAGATGTAAAATAGGCGCGAGTATCTACATCTATTCCTACTGTGAACATATGAT
GAGCTCATACTACAAAACCTAATAATCCAATTGCTATTATAGCATAAATTATTCCCAATAATCCAAAAGCTTCCTTT
TTAAACCTCTTTCTTGCCTAATAATATGAGAAATTATACCAAATCCTGGTAAAATTAAAATATAAACTTCAGGGT
GCCCAAAAAATCAGAATAAGTGTTGATAGAGGATAGGGTCTCCTCCACCGGAGGGATCAAAAAAAGTAGTATTA
ATATTTCGGTCTGTTAAAAGTATAGTGATAGCCCCAGCTAACACT
>JE39M95
TA ATAT C A AT G G AAGAATT GGCTAAAACTACACCAGTTA ATCC C C C TA A AGTA A AGAGGA A A ATA A ATC
CAATAGCTCAAAGGGAGGAGGGATTTAGGGTAATTTGAGACCCATGGTATGTAGCTAACCATCTAAAAATTTT
GATTCCAGTCGGAACTGCAATAATTATTGTGGCAGACGTGAAATAGGCGCGAGTATCTACATCTATCCCTACTGT
GAATATATGATGGGCTCATACTACAAAACCTAATAGTCCAATTGCTATTATAGCATAAATTATTCCCAATAATC
CAAAAGCTTCCTTTTTTCCTCTTTCTTGCCTAATAATATGAGAAATTATACCAAATCCTGGTAAAATTAAAATATA
AACTTCAGGGTGCCCAAAAAATCAGAATAAGTGTTGATAGAGGATAGGGTCTCCTCCACCGGAGGGAT
CAAAAAAAGTAGTATTAATATTTCGGTCTGTTAAAAGTATAGTGATAGCCCCAGCTA
1. Use the Clustal Omega alignment program with the default matrix to align
the DNA sequence data. Write the first 10 positions of the alignment for
only the first 4 aligned sequences below (or the first 4 rows of the alignment
if you are using copy/paste).
86 COMPU TATIO NA L B IOL OGY
2. This next section is designed to give you practice converting file formats.
Many programs accept only a few selected file formats. Many allow you to
use FASTA formats, but some need specialized formats, so practice converting
file types is helpful. The conversion website used here functions similarly to
the Clustal Omega site. Use https://www.ebi.ac.uk/Tools/sfc/emboss_
seqret/to convert the Clustal alignment you created in question 1 into a FASTA
format. (Note: use the ENTIRE file, including the header that begins with
“CLUSTAL” and the asterisks at the bottom that are not actually part of the
alignment.)
Write the title lines and the first 10 positions of the alignment for only the first
3 sequences in the FASTA file (or the first 3 sequences in the alignment if you
are using copy/paste).
3. Convert the Clustal alignment to the Nexus/paup interleaved format, which is
used often in phylogenetic analyses.
a. What kind of information does the header of the Nexus/paup file contain?
Unaligned protein sequences for question 4 below (and at the Sample and lab
exercise data link):
>LCseedSfl
MKKLTVAISAVAASVLMAMSAQAAEIYNKDSNKLDLYGKVNAKHYFSSNDADDGDTTYVRLGFKGETQINDQLTG
FGQWEYEFKGNRAESQGSSKDKTRLAFAGLKFGDYGSIDYGRNYGVAYDIGAWTDVLPEFGGDTWTQTDVFM
TGRTTGVATYRNNDFFGLVDGLNFAAQYQGKNDRTDVTEANGDGFGFSTTYEYEGFGVGATYAKSDRTNDQVIY
GNNSLNASGQNAEVWAAGLKYDANNIYLATTYSETQNMTVFGNNHIANKAQNFEVVAQYQFDFGLRPSVAYLQSK
GKDLGAWGDQDLIEYIDVGATYYFNKNMSTFVDYKINLIDKSDFTKASGVATDDIVAVGLVYQF
>PhoEseedEco1
MKMKKSTLALVVMGIVASASVQAAEIYNKDGNKLDVYGKVKAMHYMSDNDSKDGDQSYIRFGFKGETQINDQL
TGYGRWEAEFAGNKAESDTAQQKTRLAFAGLKYKDLGSFDYGRNLGALYDVEAWTDMFPEFGGDSSAQTDNFM
TKRASGLATYRNTDFFGVIDGLNLTLQYQGKNENRDVKKQNGDGFGTSLTYDFGGSDFAISGAYTNSDRTNEQNLQ
SRGTGKRAEAWATGLKYDANNIYLATFYSETRKMTPITGGFANKTQNFEAVAQYQFDFGLRPSLGYVLSKGKDIEGI
GDEDLVNYIDVGATYYFNKNMSAFVDYKINQLDSDNKLNINNDDIVAVGMTYQF
>PhoEseedEco2
MKMKKSTLALVVMGIVASVSVQAAEIYNKDGNKLDVYGKVKAMHYMSDNDSKDGDQSYIRFGFKGETQINDQL
TGYGRWEAEFAGNKAESDTAQQKTRLAFAGLKYKDLGSFDYGRNLGALYDVEAWTDMFPEFGGDSSAQTDNFM
TKRASGLATYRNTDFFGVIDGLNLTLQYQGKNENRDVKKQNGDGFGTSLTYDFGGSDFAISGAYTNSDRTNEQNLQ
SRGTGKRAEAWATGLKYDANNIYLATFYSETRKMTPISGGFANKTQNFEAVAQYQFDFGLRPSLGYVLSKGKDIEGI
GDEDLVNYVDVGATYYFNKNMSAFVDYKINQLDSDNKLNINNDDIVAVGMTYQF
>PhoEseedEco4
MKKSTLALVVMGIVASASVQAAEIYNKDGNKLDVYGKVKAMHYMSDNDSKDGDQSYIRFGFKGETQINDQLT
GYGRWEAEFAGNKAESDTAQQKTRLAFAGLKYKDLGSFDYGRNLGALYDVEAWTDMFPEFGGDSSAQTDNFM
TKRASGLATYRNTDFFGVIDGLNLTLQYQGKNENRDVKKQNGDGFGTSLTYDFGGSDFAISGAYTNSDRTNEQNLQ
SRGTGKRAEAWATGLKYDANNIYLATFYSETRKMTPITGGFANKTQNFEAVAQYQFDFGLRPSLGYVLSKGKDIEGI
GDEDLVNYIDVGATYYFNKNMSAFVDYKINQLDSDNKLNINNDDIVAVGMTYQF
>PhoEseedSen1
MNKSTLAIVVSIIASASVHAAEVYNKNGNKLDVYGKVKAMHYMSDYDSKDGDQSYVRFGFKGETQINDQLT
GYGRWEAEFAGNKAESDSSQQKNRLAFAGLKLKDIGSFDYGRNLGALYDVEAWTDMFPEFGGDSSAQTDNF
MTKRASGLATYRNTDFFGIVDGLDLTLQYQGKNEDRDVKKQNGNGFGTSVSYDFGGSDFAVSGAYTLSDRTREQNLQ
RRGTGDKAEAWATGVKYDANDIYIATFYSETRNMTPVSGGFANKTQNFEAVIQYQFDFGLRPSLGYVLSKGKD
IEGVGSEDLVNYIDVGATYYFNKNMSAFVDYKINQLDSDNTLGINDDDIVAIGLTYQF
>PhoEseedSen2
MNKSTLAIVVSIIASASVHAAEVYNKNGNKLDVYGKVKAMHYMSDYDSKDGDQSYVRFGFKGETQINDQLT
GYGRWEAEFAGNKAESDSSQQKTRLAFAGLKLKDIGSFDYGRNLGALYDVEAWTDMFPEFGGDSSAQTDNFM
TKRASGLATYRNTDFFGIVDGLDLTLQYQGKNEDRDVKKQNGDGFGTSVSYDFGGSDFAVSGAYTLSDRTREQNLQ
RRGTGDKAEAWATGVKYDANDIYIATFYSETRNMTPVSGGFANKTQNFEAVIQYQFDFGLRPSLGYVLSKGKD
IEGVGSEDLVNYIDVGAIYYFNKNMSAFVDYKINQLDSDNTLGINDDDIVAIGLTYQF
88 COMPU TATIO NA L B IOL OGY
>PhoEseedSfl
MKKSTLALVVMGIVASASVQAAEIYNKDGNKLDVYGKVKAMHYMSDNASKDGDQSYIRFGFKGETQINDQLT
GYGRWEAEFAGNKAESDTAQQKTRLAFAGLKYKDLGSFDYGRNLGALYDVEAWTDMFPEFGGDSSAQTDNFM
TKRASGLATYRNTDFFGVIDGLNLTLQYQGKNENRDVKKQNGDGFGTSLTYDFGGSDFAISGAYTNSDRTNEQNLQ
SRGTGKRAEAWATGLKYDANNIYLATFYSETRKMTPITGGFANKTQNFEAVAQYQFDFGLRPSLGYVLSKGKDIEGI
GDEDLVNYIDVGATYYFNKNMSAFVDYKINQLDSDNKLNINNDDTVAVGMTYQF
>PhoEseedSty
MNKSTLAIVVSIIASASVHAAEVYNKNGNKLDVYGKVKAMHYMSDYDSKDGDQSYVRFGFKGETQINDQLT
GYGRWEAEFASNKAESDSSQQKTRLAFAGLKLKDIGSFDYGRNLGALYDVEAWTDMFPEFGGDSSAQTDNFM
TKRASGLATYRNTDFFGIVDGLDLTLQYQGKNEDRDVKKQNGDGFGTSVSYDFGGSDFAVSGAYTLSDRTREQNLQ
RRGTGDKAEAWATGVKYDANDIYIATFYSETRNMTPVSGGFANKTQNFEAVIQYQFDFGLRPSLGYVLSKGKD
IEGVGSEDLVNYIDVGATYYFNKNMSAFVDYKINQLDSDNTLGINDDDIVAIGLTYQF
>nmpCseedEco1
MNIYRAVTSFFNNSSKKGLTMKKLTVAISAVAASVLMAMSAQAAEIYNKDSNKLDLYGKVNAKHYFSSNDADD
GDTTYARLGFKGETQINDQLTGFGQWEYEFKGNRAESQGSSKDKTRLAFAGLKFGDYGSIDYGRNYGVAYDIGAWT
DVLPEFGGDTWTQTDVFMTQRATGVATYRNNDFFGLVDGLNFAAQYQGKNDRSDFDNYTEGNGDGFGFSATYEYE
GFGIGATYAKSDRTDTQVNAGKVLPEVFASGKNAEVWAAGLKYDANNIYLATTYSETQNMTVFADHFVANKAQN
FEAVAQYQFDFGLRPSVAYLQSKGKDLGVWGDQDLVKYVDVGATYYFNKNMSTFVDYKINLLDKNDFTKEGANKSLI
>nmpCseedSty
MKLKLVAVAVTSLLAAGVVNAAEVYNKDGNKLDLYGKVHAQHYFSDDNGSDGDKTYARLGFKGETQINDQLTG
FGQWEYEFKGNRTESQGADKDKTRLAFAGLKFADYGSFDYGRNYGVAYDIGAWTDVLPEFGGDTWTQTDVFMT
GRTTGVATYRNTDFFGLVEGLNFAAQYQGKNDRDGAYESNGDGFGLSATYEYEGFGVGAAYAKSDRTNNQVKAA
SNLNAAGKNAEVWAAGLKYDANNIYLATTYSETLNMTTFGEDAAGDAFIANKTQNFEAVAQYQFDFGLRPSIAYLKS
KGKNLGTYGDQDLVEYIDVGATYYFNKNMSTFVDYKINLLDDSDFTKAAKVSTDNIVAVGLNYQF
a. Find the longest stretch of complete conserved positions. Write the amino
acids for this stretch of sequence below (which is the same in allthe
sequences).
b. Find the region or regions of the alignment in which most or allof the se
quences have a 3-character gap (“---”). Write the amino acid letters for the
sequences that do NOT have gaps.
S eq u enc e A lignm ent 89
Notes
1. This is also called a pairwise sequence alignment.
2. BLAST is faster and highly accurate, but it is not guaranteed to produce an optimal sequence
alignment.
3. It is necessary to have the scores from the three surrounding cells, which is why we added
an extra row and column when we started the graph.
4. The Smith-Waterman variant of this algorithm produces a local alignment, which aligns two
sequences but does not constrain the alignment to be from end to end. This is important
when aligning, say, a short sequence to an entire genome. In the Smith-Waterman variant, all
the negative cell values are changed to 0 when the graph is calculated, which makes the pos
itive local alignments visib
le. Then the traceback starts at the highest scoring cell regardless
of where that value is in the graph, and proceeds until it reaches a 0 value.
CHAPTER
04
PATTERNS IN THE DATA
E
arly in the primordial days of molecular sequencing, analysis of DNA and
protein sequence alignments uncovered interesting nonrandom patterns of
nucleotide variation. One of the earliest discoveries came from the compari
son of the upstream (5′) promoter regions of protein-coding genes, right in
the midst of the binding site of the RNA polymerase. This region contained
what is known as a TATA box, which is vital for the correct binding of the RNA
polymerase and transcription of the messenger RNA. Analysis of the sequence
alignment containing the TATA box region, even visually, shows a significant bias
towards T and A nucleotides (Fig. 4.1). The TATA box turns outto be crucial for
binding the protein known, shockingly enough, as the TATA binding protein, which
is a critical part of the RNA polymerase complex. Experimental mutations of
these conserved nucleotides reduced or completely eliminated the process of
transcription. If such a mutation happened in a gene that was critic al for organism
development or survival, the organism with the mutation would be eliminated via
natural selection and the mutation would not pass on to future generations.
Harsh, but true.
Such patterns proved to be extremely common across a vast variety of ge
nomes. Figure 4.2 illustrates a number of other examples using sequence logos.
Sequence logos were created by Tom Schneider and Mike Stephens back in the
dark ages (1990, when cell phones were the size of bricks and email was only for
nerds) and were designed as a simple visual means of illustrating the highly con
served positions in a sequence alignment (DNA, RNA, or protein). These posi
tions were likely to be extremely important and, therefore, conserved by natural
selection. The calculation of sequence logos is quite simple and is based on infor
mation theory. The level of conservation at a position of a sequence alignment,
Rseq , is calculated as follows:
⎛ N ⎞
Rseq = log2 N − ⎜ − ∑
⎝ n=1
pn log2 pn ⎟
⎠
(U for RNA)—and the maximum sequence logo score for a position is 2 if allthe
nucleotides at a position are the same1 (e.g., all A nucleotides). Protein sequences
have many more symbols, 20 total representing each of the most frequently occur
ring amino acids, so the maximum bit score (if 100% of the amino acids at a posi
tion are of one type) is 4.32.
Figure 4.2 illustrates sequence logos made for sequence alignments of non
coding DNA sequences.
Similar conserved patterns can be detected in protein sequence alignments.
These regions indicate amino acids that are critic al in the functioning of the pro
tein. For instance, highly conserved amino acids in a transcription factor (TF) could
be critic al in allowing the TF DNA-binding domain to bind DNA.
Sequence Motifs
The types of short sequence patterns shown in the figures here can also be
found in RNA and protein sequences, and are generally referred to as motifs.
One goal of bioinformatics has been to develop algorithms to automatic ally iden
tify functional motifs in the vast sea of DNA, RNA, and protein databanks. For in
stance, we might want to search for allthe alternate splice sites in the genome of
the fruit fly or find allthe binding sites of a particular TF in the genome of the Black
Death bacterium.
The algorithms covered in this chapter are designed to take into account the
facts that motifs tend to be short and that, while certain positions tend to be con
served, there is also a lot of positional variability among related motifs. The two
methods we discuss in the chapter include (i) a basic protein sequence motif al
gorithm and (ii) a DNA motif search tool called a weight matrix.2 In general, these
and other motif-searching algorithms have the following in common.
FIGURE 4.2. Logos for sequences involved in critical cell functions. The bigger the letter
is shown at a position, the more conserved (and important) that nucleotide is in the process.
(Top) Conserved sequences within E. coli promoter regions. (Middle) Exon and intron
splice sites. (Bottom) A site in a bacterial genome that binds a TF. The sequence logos were
generated using the online WebLogo software (http://weblogo.berkeley.edu/).
That these motifs are based on experimental data cannot be overstated. They
must be built on alignments of experimentally tested sequences of known func
tion, and the more the better. Since the motifs themselves tend to be short and
have highly variable positions as well as conserved positions, one generally cannot
use BLAST to find matching motifs in new sequences. However, because they
are short and fuzzy, matches using motif-searching algorithms need to be taken as
hypothetical because many are likely to be false positives.
Notes
1. This is also known as the number of “bits.”
2. Chapter 07 discusses the general method called hidden Markov models, which can also be
used to find motifs.
94 COMPU TATIO NA L B IOL OGY
Motivation
Protein sequence motifs are short stretches of amino acids with specific functional roles that are
found in many different types of proteins. For instance, the DNA-binding motif known as a zinc
finger is found in numerous different types of transcription factors (TFs), proteins that regulate
the transcription of DNA to messenger RNA. Although the amino acids in the binding motif are
similar across many different TFs, the other amino acids in these proteins can be very different,
yet they allneed to bind DNA; hence, they have a DNA-binding motif. Being able to identify mo
tifs, given the primary protein sequence, can provide important insight into a functional aspect of
unknown proteins.
In this activity, you will learn how to turn a multiple-sequence alignment of known protein
motifs into a search pattern for scanning new proteins for the same motif. In the first step, you
will learn how to build a position-specific search pattern using the sequence alignment. In the
second step, you will learn how to scan new protein sequences for positive matches. You will
also learn how to use online motif searching software to search for matching motifs in protein
sequences.
Learning Objectives
1. Understand the basics of sequence motifs and how detecting motifs in proteins helps to
identify important functional aspects (Motivation).
2. Learn how to build a sequence motif from alignments of short protein sequences and use
them to search for matches within novel protein sequences (Concepts and Exercises).
3. Use sequence motif matching software to build and search for protein motifs (Concepts and
Exercises).
Concepts
This preliminary exercise will help you understand the principles behind protein sequence motif
searching. The alignment on the next page contains a series of short sequences of a protein
motif in the DNA-binding domain of the androgen receptor (AR) from various different mammal
species. The AR binds the hormone testosterone, and once activated by the hormone binding,
the AR moves from the cell cytoplasm to the nucleus, where it binds the regulatory DNA
sequences of many genes. This binding helps to activate the transcription of genes which are
vital for male reproduction and development.
PATTER N S I N TH E D ATA 95
Positions
1 2 3 4 5 6
Human AR Motif: R V L E G Q
Dog AR Motif: R A M E G K
Camel AR Motif: R A M E G Q
Horse AR Motif: R V M E G K
Mouse AR Motif: R V V E G Q
Bear AR Motif: R A L E G K
Look at each position of the alignment. Can you make a pattern (known as a profile) that matches
allthe sequences? Hint: the first letter in the pattern would be an R because they allhave an ar
ginine (R) at the first position.
PROFILE:
M FWVY RV M E G K S K
Reflection
• Which positions are the most conserved? Most variable?
• How might you search for your pattern in a database of protein sequences? (How did the
BLAST algorithm find the first matching “word”?)
• What if there were 10 possible choices of amino acid at position 2? How might you indicate
a hypervariable position?
Below is the profile for this sequence alignment and a match to the profile in the test protein
sequence. The answer uses hyphens to indicate separate positions of the motif.
PROFILE: R - [V or A] - [L or M or V] - E - G - [Q or K]
M FWVY RV M E G K S K
In three of the positions of the sequence alignment, allthe motifs have an identical amino acid.
This suggests that mutating any of these three amino acids would inhibit the binding of the AR
to its target DNA sequence. For instance, if one were to mutate the DNA sequence that coded
for this motif so that there was a D at position 4 instead of an E, the motif would probably no
longer bind the correct DNA sequence.
One especially bad consequence of such a mutation would be infertility. Since the AR is
vital in the process of spermatogenesis (development of sperm), individuals with such a muta
tion could not reproduce and this mutation would not pass to the next generation.1 This pro
cess of natur al selection is the likely reason why one only observes an E at the 4th position of
this motif in allthese species (or an R at the 1st, or a G at the 5th). Once a profile is generated
96 COMPU TATIO NA L B IOL OGY
for a particular sequence motif, this profile can then be scanned across millions
of proteins in databases for matches.
Exercises
Interactive exercises (theory)
Use the online exercise link below to learn how to make protein sequence
motifs and use them to search for matches. The Interactive Link explains how to
use the teaching interactive. Once you learn how it works, solve the activity
problem.
Problem
Build the sequence motif and indicate whether the profile you built matches the
protein sequences below.
98 COMPU TATIO NA L B IOL OGY
Lab Exercise
1. Make a profile manually from the following data using the pattern syntax used
by PROSITE. See tutorial and link for ScanProsite.
P2_DROME/200-226 CVVCGDKSSGKHY
DR_CANFA/547-573 CLICADEASGCHY
1H16_CAEEL/14-40 CAICQESAEGFHF
2H18_CAEEL/11-37 CEVCPDKTSYRHF
3H18_CAEEL/11-37 CPVCGDRTSLRHF
4H18_CAEEL/11-37 CPPCGDLTSPSHF
5H18_CAEEL/11-37 CDLCGDPSRGWHF
F6H18_CAEEL/11-37 CFQCLDWTAGANF
7H18_CAEEL/11-37 CTWCPDQTGWFHF
8H18_CAEEL/11-37 CWPCWDPTVGGHY
9H18_CAEEL/11-37 CEACGDKTLGYHF
0H18_CAEEL/11-37 CSFCGDKTIPRNF
AH18_CAEEL/11-37 CPRCQEDTQRYHY
BH18_CAEEL/11-37 CQECGDKTWWRNF
b. Use the Option 2 “Submit MOTIFS” to search for hits with your profile in
the PROSITE database (use the default parameters). Fill in the table below
with the Swiss-Prot/UniProtKB accession numbers of hits to three different
organisms, as well as the motif sequence from the organism that matches
the profile you created, and the scientific and, if available, common
name of the organism. (The “Pattern” should be the sequence from
the organism that matches your search pattern.)
i.
ii.
iii.
100 CO MPUTATION AL B IOL OGY
2. Use ScanProsite Option 1 to scan the protein P04150 (UniProtKB Identifier)
and answer the following questions.
b. What motif(s) did you find (i.e., what amino acid sequence or sequences)?
4. Use the SMART (Simple Modular Architecture Research Tool) to study your
cool new sequence from a feral Chernobyl chicken2 (use the normal, not the
genomic, version).
SMART home page: http://smart.embl-heidelberg.de/.
a. Search for outlier homologues, PFAM, signal peptides, and internal re
peats. What high-probability motifs did you discover? Can you describe them
in words?
b. What is the name/function of the protein? (Hint: BLAST the protein sequence.)
PATTER N S I N TH E D ATA 101
c. What is the length of the longest mRNA transcript in nucleotides? Ensembl
uses bp (base pairs) because it is based on the DNA, but RNA does not
have pairs since it is single stranded, so it should be nucleotides (nt).
Notes
1. Natural selection requires both survival AND reproduction, not to mention variation and heri
tability. Sequence alignments are a powerful way to determine the results of nature’s exper
imentation: what is/is not allowed in nature.
2. These feral chickens probably fed on the gamma radiation-eating fungi inside the reactor core:
https://www.sciencedaily.com/releases/2007/05/070522210932.htm
102 CO MPUTATION AL B IOL OGY
Motivation
Transcription factors (TFs) help to determine whether a particular gene (or set of genes) is tran
scribed into messenger RNA. So-called activator TFs help activate (increase) the transcription
of a gene. Most activator TFs bind DNA upstream (5′) of the protein-coding region and make it
more likely that RNA polymerase will bind and transcribe the DNA template strand. RNA poly
merase might bind anyway, but activators make it more likely to happen by direct or indirect in
teractions with the RNA polymerase. Repressor TFs also bind DNA but do the opposite, making
it less likely that RNA polymerase will transcribe the DNA. TFs typically bind to specific se
quences of DNA, although how stringent these sequences are can vary. In fact, they usually
bind to different, though highly similar, DNA sequences. In order to predict these DNA binding
sites, we need a method that takes this variation into account. This was the idea behind the cre
ation of the position-specific weight matrix (PSWM).
This activity teaches how to create PSWMs using sequence alignments of experimentally
determined TF binding sites (TFBSs) and how to use a PSWM to search and find new sequences
that match the PSWM. These matches could be binding sites for the TF. This is similar in principle
to the creation of protein sequence motifs, but with DNA sequences. It’s the same idea, but
more math-y. After calculating the PSWM, you will learn how to scan DNA sequences for high-
scoring matches to the PSWM and then use online software for this purpose.
Learning Objectives
1. Understand the principles of PSWMs and how they are used for detecting protein-DNA bind
ing sites (Motivation).
2. Be able to construct and calculate a PSWM and use it to scan a DNA sequence for significant
matches (Concepts and Exercises).
3. Learn how to use PSWM software to detect DNA binding sites of TFs (Concepts and
Exercises).
Concepts
This preliminary exercise should help you understand the principles behind PSWMs. The se
quence alignment on the next page contains a series of short DNA sequences that have been
experimentally determined to bind the TF known as HotStuff (Donna Summer’s BFF TF).
PATTER N S I N TH E D ATA 103
1 2 3 4 5 6
———————————
A G C T A A
T G C T G A
A G C T C G
T G C T T G
A G T T A A
T G C T G G
A G C T T A
T G C T C A
———————————
How many nucleotides are at each position? (Fill in the table below.)
Position
1 2 3 4 5 6
A 4
G
C
T
ACAATGCTCAAGGG
Reflection
• Which positions are the most conserved? The most variable?
• Is position 3 more weighted towards a C or a T? How would you describe the weight of
position 1?
• How might you score a match to this PSWM using the frequency of the nucleotides at each
position?
• How many of each nucleotide should we see at each position if allbases were equally
likely? Hint: there are 4 nucleotides in DNA, and assume each is equally common (1/4 A,
1/4 G, 1/4 C, and 1/4 T).
On the next page are the answers showing the number of each nucleotide at every position and
the best match of the matrix in the sequence. Notice how the positions are weighted towards
particular nucleotides at certain positions. Position 1 is weighted equally between A and T but
away from G and C, while position 3 is heavily weighted towards C, though there is a slight pos
sibility of a T. There is no weight towards any of the nucleot ides at position 5—allare equally likely.
104 CO MPUTATION AL B IOL OGY
1 2 3 4 5 6
A 4 0 0 0 2 5
G 0 8 0 0 2 3
C 0 0 7 0 2 0
T 4 0 1 8 2 0
To get the best match in a sequence, we scan the sequence from left to right, looking at six
nucleotides at a time. Why six? Because the alignment is six nucleotides long. The winner is the
one with the best overall match to the matrix.
ACAATGCTCAAGGG
The question then arises: how do we get a score for the best fit to the matrix? In order to answer
this question, we need one more transformation of the matrix. In this transformation, the fre
quency of the nucleotides is used to determine a score for each nucleotide at each position. The
matrix transformation is essentially the natural log of the frequency of the base at each position
divided by the expected frequency of that base at that position.
For example, the observed frequency of A at position 1 is 0.5 (50%, or 4 outof 8 possible
nucleotides). At position 3, C has an observed frequency of 0.875 (87.5%, or 7 outof 8 possible
nucleotides). Clearly, these are more common than expected by chance. If there were no biases,
we would expect a frequency closer to 0.25 (25%, or 2 outof 8) of each nucleotide at each po
sition. Calculating the PSWM score is easy. Simply calculate the natural log of the likelihood of
the base at that position,1 which is the observed frequency (f ) of each nucleotide (i) at each po
sition (j) divided by the expected frequency of each nucleotide (pi):
fi ,j
ln
pi
We will assume that the bases are allpresent at equal frequencies in the organisms. Since DNA
has four nucleotides, each base has a pi of 0.25. pa = pg = pt = pc = 0.25. (This is not always the
case, since some organisms are G/C rich while others are A/T rich, but close enough!) This equation
works until you realize that many of the frequencies in the table are 0, and you cannot take the
natur al log of 0. However, with a few adjustments to the equation, we can calculate a very close
approximation:
ln
(n i ,j )
+ pi / (N + 1)
pi
where ni,j is the number of bases i at position j and N is the total number of sequences in the
alignment. The numerator is very close to the frequency, but adding the extra 0.25 ensures that
the number is never 0.
Using our equation, we can then transform the following matrix of position-specific nucleot ide
counts to a PSWM. Let’s start with position 1.
PATTER N S I N TH E D ATA 105
1 2 3 4 5 6
A 4 0 0 0 2 5
G 0 8 0 0 2 3
C 0 0 7 0 2 0
T 4 0 1 8 2 0
Calculating the PSWM score for the A at position 1 is as follows. Since nA,1 = 4 (counts of A’s
at position 1), pi = 0.25, and N = 8 (number of sequences), we get the following:
( 4 + 0.25 ) / (8 + 1) = + 0.64
ln
0.25
This is the same value for T at position 1, and the 0s in the matrix allhave the same value:
(0 + 0.25 ) / (8 + 1)
ln = − 2.19
0.25
Using the PSWM equation (and a calculator, unless you’re a math savant), we can readily fill out
the PSWM values for position 1 and the rest of the positions (the highest PSWM score(s) at
each position are indicated in bold):
1 2 3 4 5 6
A +0.64 −2.19 −2.19 −2.19 0.0 +0.85
G −2.19 +1.29 −2.19 −2.19 0.0 +0.37
C −2.19 −2.19 +1.17 −2.19 0.0 −2.19
T +0.64 −2.19 −0.59 +1.29 0.0 −2.19
Exercises
Interactive exercises (theory)
Now that you know the basics of how to calculate a PSWM using a sequence alignment, the
online exercises will give you practice calculating matrices and teach you how to scan sequences
to find a region with the highest score given the matrix. The Interactive Link has a more detailed
explanation of how to calculate a PSWM and how to use it to scan a sequence looking for the
best match. Once you learn how it works, solve the activity problem.
Problem
Enter the correct values in the boxes below (except the ones in the equation).
First, fill in the number of each of the four nucleotides at position 3. Second, use
these numbers to calculate the PSWM values for position 3. Finally, fill in the
position-specific scores for the nucleotides in the highlighted 5-base sequence
window and calculate the total score.
PATTER N S I N TH E D ATA 107
Lab Exercise
>EP_1
GAGAGCGGGCAGGAGGCGGGTTGGGAGGGCGCGGAGCCCCGGGTTCGGGGGAGACTGGAG
GGGCGCACGTGCGGCCGGGTGCGAGCGCGCGGCGGGGGAGGCTGCGGGGCGGCGCGGGGG
CGCGCGCGGAGCCCGAGCGGCGGCGCCAGGTCACACAACCTGTTTTGGCGCCTGCGGGCG
CCTGGGCCCAAGGGTGCGACGCGGGGGCGCCTGAGCCGGGACACAGGGGGTGCGGTGAGC
GCCAGGCGCCGCGGGGAGTTAAAAAGTTCGGGACCTGAGCGGTGCGTGGTTCCGCGGTGG
CCGCCTCTTCCTGCCGCGCAGGCCGAGGGTCCCGACGGCGCCGCTCACCGCTCCGGGACT
CAGCCTTTCTGGGCCCGGCCTGCGGTTCCCTCGGGGCCGGGGAGAGGGTGGAGCGCGGGA
GGAGGGGCGCCGGGTGGGGACGCCCAGGCCCTTCGTCGGGGGAGGGCGCTCCACCCGGGC
TGGAGTTGCAGAGCCCAGCAGATCCCTGCGGCGTTCGCGAGGGTGGGACGGGAAGCGGGC
TGGGAAGTCGGGCCGAGGTGGGTGTGGGGTTCGGGGTGTATTTCGTCCACGAGCCGGGGA
Use the sequence above with the LASAGNA TFBS search program to answer the
following questions. (Under “Matrix-Derived Models”, choose “Use TRANSFAC
Matrices.”)
1. What are the names and scores of the TFs with the two highest-scoring matri
ces (the two largest numbers in the Score column)?
2. What is the sequence snippet within EP_1 sequence (Sequence column) that
matches the second-highest-scoring TF and therefore could be its binding
site?
3. In the results, select the name of the TF in the left column. Using the Refer
ences link, can you describe a known biologic al role of the highest-scoring TF?
Notes
1. Why calculate the log likelihood? Using the natural log (or logarithms) of a number makes
calculations more convenient, especially with large numbers. For instance, instead of multi
plying frequencies, you can add the logs of these numbers.
CHAPTER
05
RNA STRUCTURE PREDICTION
D
NA is fundamentally boring, biochemically speaking. Sure, DNA contains
the blueprint for the entire structure and function of allliving cells (big deal),
but the double helix is as stable and inactive a molecule as they get. The
two antiparallel strands combine via their complementary Watson-Crick
base pairings, with multiple hydrogen bonds per pairing. Without lowering
the energy of activation using protein helicase enzymes, one must bring DNA to
around 95°C (nearly the boiling point of water) in order to separ ate the strands of
the double helix. Indeed, this stability is part of what makes DNA such a robust
transmitter of biological information. DNA can be unwrapped, unzipped, copied, and
transcribed, and it then re-forms quickly into a perfect double helix.
DNA’s cousin RNA, on the other hand, is an entirely different story. The chem
ical composition of RNA is very similar to that of DNA, but with a few changes
that make allthe difference. Like DNA, RNA is a long polymer chain composed of
four repeating nucleotides in which the sugar components of the nucleotides are
joined together by phosphodiester bonds. RNA has a pentose sugar slightly dif
ferent from that of DNA (ribose versus deoxyribose), with an additional hydroxyl
group. RNA also features the nitrogenous base uracil in place of thymine. Most
import antly, as shown in Fig. 5.1, cellular RNA molecules do not form double
helices.
The single-stranded nature of RNA makes it a particularly dynamic and inter
esting molecule and gives RNA the potential for both structural flexib ility and
biochemical activity. Because only one strand of RNA is synthesized, the nucleo
tides are not bound to their complements on another strand, leaving the hydro
gen bonds open. This allows them to form interactions with proteins and other
RNA molecules and, most importantly, with themselves. Given the right se
quence of nucleotides, and sometimes a little help from proteins, RNA molecules
can form a variety of complex structures that perform critical cellular functions, as
depicted in Fig. 5.2.
111
112 CO MPUTATION AL B IOL OGY
FIGURE 5.1. RNA versus DNA. The left side of the image shows a single strand of RNA,
as well as the chemical structures of the four nitrogenous bases that it comprises. The
right side shows a double-stranded DNA helix and its four nitrogenous bases. Credit:
NHGRI/Darryl Leja.
FIGURE 5.2. Some sweet, sweet RNA molecules and a few selected molecular inter
actions. (A) A tRNA molecule. tRNAs are made from a single strand of RNA which
base-pairs with itself to form a structure capable of carr ying amino acids to the ribosome.
(B) Ribosomal subunits are composed of RNA (yellow and orange) and proteins (blue) that
come together to form the ribosome, the cell’s protein-synthesizing machinery. (C) The
spliceosome, made of small nuclear RNAs and proteins, is part of the RNA editing machinery
in eukaryotic cells. (D) Secondary structure of the mir210 microRNA. MicroRNAs regulate the
expression of other genes. (E) The CRISPR-Cas9 system is a genome editing system found in
prokaryotic cells to protect against bacteriophage invaders, which uses RNA at multiple steps
to guide the process. Panels A and B courtesy of Yikrazuul, under license CC BY-3.0. Panel C
reprinted from Will CL, Luhrmann R. 2011. Cold Spring Harb Perspect Biol 3(7), with permission.
Panel E courtesy of Mirus Bio LLC. http://www.mirusbio.com
FIGURE 5.3. Secondary-structure diag ram of the RNase P structural RNA showing
examples of RNA structural elements known to occur in this and other molecules.
Adapted from Maeda T, Furushita M, Hamamura K, Shiba T. 2001. FEMS Microbiol
Lett 198:141–146, with permission.
and just like with proteins, there are a mess of them. While many of the large
structural RNAs have been well characterized, there are thousands of smaller
RNAs predicted to be in genomes that remain to be studied. With the rapid and
accelerating accumulation of DNA sequences, the question is: can we use DNA
sequences that code for potential structural RNA molecules to predict their sec
ondary or even tertiary structures?
The two methods we will cover for predicting RNA structure based on primary
sequence information are (i) thermodynamics-based prediction and (ii) mutual in
formation (MI). Thermodynamic methods use experimentally determined RNA
base-pairing and base-stacking values to determine what the most stable poten
tial RNA structure is for an RNA sequence. Given a DNA sequence encoding a
structural RNA, these methods first determine a series of potential ways in which
the sequence could fold upon itself. Then, using the experimentally predetermined
RNA stacking energies, the algorithm determines the free energy of allthe struc
tures, and the one with the lowest free energy wins (like golf, but more exciting).
Figure 5.4 illustrates potential folds and energies for the same RNA sequence.
Since folding algorithms maximize the number of base-pairings, and assume
that the molecule folds only upon itself with lots of stems, thermodynamics-
based predictions tend to perform more poorly with larger RNA structures. Also,
the larger the structure, the more possible folds the RNA could make, which
FIGURE 5.4. Three RNA folds for the same sequence. Fold 1 has the lowest free
energy and would be preferred by the thermodynamic prediction method. Reprinted
from Maeda T, Furushita M, Hamamura K, Shiba T. 2001. FEMS Microbiol Lett 198:141–146,
with permission.
116 CO MPUTATION AL B IOL OGY
exponentially increases the computational time. These methods also do not pre
dict higher-level interactions such as pseudoknots and base triples. However,
thermodynamic methods can be extremely fast, require only a single sequence
to make predictions, and tend to work well with small RNA molecules.
The second method, MI, addresses the problem of RNA secondary-structure
prediction very differently. Instead of folding a single RNA sequence by itself, MI
uses evolutionary information from many closely related sequences. Specifically,
MI analyzes the mutational variation in multiple-sequence alignments generated
from collections of the same RNA structure from many different organisms. In
these alignments, MI identifies varia ble positions in the sequence alignment in
which changes (mutations) at one position of the sequence alignment correlate
with changes at another position. These correlated positions mutually inform
one another, hence the name. The concept is that if two positions (or more in
some cases) always change at the same time, this provides evid ence that the
positions are interacting within a sequence. Figure 5.5 provides an example of
how and why correlated mutations like this occur.
The example in Fig. 5.5 is actually quite a common pattern. Paired regions of
ten show strong patterns of correlated mutations because stem regions and
pseudoknot regions are likely to be critically important to the stability and func
tion of the molecule. Too many uncompensated mutations will lead to a poorly
functioning molecule and could lead to death of the organism. In other words,
natural selection would take its course. For instance, mutations that destabilize
the stems of a tRNA molecule could lead to a molecule that does not function in
protein synthesis. No protein synthesis means no proteins, which means no cell.
Since MI cares only about correlated mutations in a sequence alignment,
more than just base pairs in stem regions can be detected. This means that pseu-
doknots and other strange interactions, such as base triples, can be determined
given enough sequence data and variation. Clearly, the downside of MI is the
need for lots of sequences and also a significant amount of variation among the
sequences (no variation means no correlation or prediction). However, with the
growing sequence databases from thousands of new genomes, sequence data
are not really a limiting factor anymore.
Notes
1. The ribosome is a complex RNA-protein macromolecule.
118 CO MPUTATION AL B IOL OGY
Motivation
Most of the RNA diversity in cells comes in the form of messenger RNA (mRNA) destined
for the ribosome, where the mRNA is used in protein synthesis and later recycled. However,
there are also many critical structural RNA molecules that have a variety of basic cellular func
tions and are not transcribed into proteins. In fact, the ribosome itself is largely made of two
structural RNAs, one in the small subunit and one in the large subunit. Structural RNA mole
cules called transfer RNAs (tRNAs) shepherd the amino acids to the ribosome during protein
synthesis. Other structural RNAs are involved in gene regulation and intron splicing, and
some even act as enzymes.
As with proteins, predicting the structure of these RNAs helps us understand how they func
tion. This activity covers two algorithms for predicting RNA structure. The first uses thermo
dynamic folding rules to find the best two-dimensional (secondary structural) fold of a given RNA
sequence. RNA strands easily fold on themselves and form hydrogen bonds, making helical-like
regions similar to DNA. The second method uses the principles behind mutual information (MI).
MI requires alignment of different RNA sequences and looks for instances when mutations are
correlated to predict likely-interacting RNA nucleotides. After mastering the principles of these
algorithms, you will learn how to use online RNA thermodynamic prediction and MI prediction
software and how to interpret their output.
Learning Objectives
. Learn about the biological function of structural RNA molecules (Motivation).
1
2. Understand both principles of RNA folding and secondary and tertiary structural elements
(Motivation).
3. Use free-energy thermodynamic rules to choose the most stable RNA fold among a series of
folds for the same RNA sequence (Concepts and Exercises).
4. Learn the principle of MI and how it can be used to predict both secondary and tertiary RNA
structural elements (Concepts and Exercises).
5. Learn how to use RNA prediction software and interpret the output (Concepts and Exercises).
Concepts
Algorithm 1: thermodynamic secondary-structure prediction
Problem: Given a set of possible RNA secondary structural folds for a particular RNA sequence,
which is the best?
Solution: To answer this question, we will use the thermodynamic principle of free energy. The
RNA structure with the lowest total free energy, i.e., the most stable predicted structure, will be
chosen as the best prediction.
R N A S tructur e Pr ed iction 119
To better understand the principles behind the thermodynamic structure prediction method,
try the preparatory exercise below. Using your brain and a pencil, try folding the RNA sequence
upon itself by bringing complementary nucleotides (A and U, G and C) together. Start by drawing
lines connecting paired nucleotides, and then draw the structure similar to the ones shown in
Fig. 5.5. (Hint: start by connecting the second nucleotide and the last nucleotide.)
Sequence: 5’ – G A G G U C G G A A G A C C U – 3’
Structure:
Reflection
• What types of RNA elements does your fold have?
• How many Watson-Crick pairings did you find? How many unpaired nucleotides?
• If you were to mutate the fifth nucleotide from a U to a G, how would this change the
structure? What would this do to the stability of the molecule?
• Since A-U or U-A pairs have 2 hydrogen bonds, and G-C or C-G pairings have three, how
many total hydrogen bonds are there in your structure?
Below is the folded structure. Since the nucleotides are allconnected from 5′ to 3′ via phospho-
diester (covalent) bonds, the molecule must twist around to be able to fold on itself. This is what
creates the hairpin loop structure.
Sequence: 5’ – G A G G U C G G A A G A C C U – 3’
Structure:
G A G A
G A G A
C G C G
U A G A
G C G C
G C G C
A U A U
G G
The best fold has a stem and a hairpin loop structure. The total number of hydrogen bonds for
this fold is 13. It is the combination of these hydrogen bonds plus the stacking energy that deter
mines the free energy of the stem regions. The other main determinants of the molec ule’s free
energy are the number and size of the loop regions, which raise the free energy of the molec ule
and lower its stability. For instance, the mutation from a U to a G in the 5th position of the se
quence results in the creat ion of an internal 2-base loop because the nucleot ides no longer pair. It
is easy to see how this addition would significantly raise the free energy of the RNA structure.
120 COMPU TATIO NAL B IOL OGY
FIGURE 5.1.1. Nearest-neighbor free-energy rules for RNA structures, first developed
by Turner and colleagues in 1999 and then updated in 2004.2 The first (uppermost) table
shows the free energies for base pairs stacked over other base pairs. For instance, an A-U
base pair stacked over a C-G base pair (A-U and C-G are “neighbors” in which the A and C
and the U and G are covalently bound) has a total free energy of −2.1. This free energy
includes both the energy of the pairing (A to U) and the fact that it is next to a C-G pair.
The lower table shows the free energies of hairpin loops, internal loops, and bulge loops.
Notice how the pairings allhave negative free-energy values (more stability) and the loops
have positive free energies (less stability) that vary depending on the size of the loop.
R N A S tructur e Pr ed iction 121
FIGURE 5.1.2. Calculating the total free energies of two possible secondary structures for the same
RNA sequence: UCGCUGUUCCACAGGA. Structure 1 features a 5-nucleotide hairpin loop and a bulge.
Structure 2 features a 3-nucleotide hairpin loop and a bulge. Structure 1 has the lowest free energy and,
therefore, is the most stable of the two structures.
Sequence 1: GAUCCUGCCUUCACGAUC
Sequence 1: GAUCCUGCC--UUCACGAUC
Sequence 2: GACCCUGCC--UUCAGGGUC
Sequence 3: CAACCUGCCAGUUCACGUUG
The figure below shows the known RNA structure of sequence 1 on the left. The other se
quences (2 and 3) have point mutations at positions 1, 3, 18, and 20. Fill in the different nucleo
tides for positions 3 and 18 in RNA sequence 2, and positions 1, 3, 18, and 20 in RNA sequence 3
in the spaces provided in the figure. Also, fill in the extra nucleotides for the 2-base AG nucleo
tide indel in sequence 3 at the top of the RNA structure. (Notice that the other sequences do
not have that extra A and G, so the alignment is filled in with gaps.)
R N A S tructur e Pr ed iction 123
Reflection
• How did the mutations affect the stem structure of the sequence 2 RNA? The sequence 3 RNA?
• How do positions 3 and 18 covary (i.e., how are they mutually informative)? How many
times do they change together? How about positions 1 and 20?
• We know from the structure of sequence 1 that the position 2 nucleotide (A) is interacting
with the second-to-last nucleotide (U) in the same sequence. Would MI help us predict this
interaction? Why or why not?
• Notice that there seems to be an insertion of two extra nucleotides in the hairpin loop of
sequence 3. The hyphens are put in the alignment to account for this indel mutation
(insertion in sequence 3 or a deletion in the ancestor of sequences 1 and 2). How might
this change the free energy of the structure?
The figure below shows the answer. Notice how the change in a nucleotide at one part of the
stem in sequence 2 and sequence 3 (compared with sequence 1) correlates with a change at
another part of the same sequence that maintains the base pair and the stem structure. This is a
very common pattern in RNA sequence alignments—the most common, in fact. It makes sense
because the stem structures are critical for the stability of the molecule, as you know from the
previous section. Since the principle of MI relies on correlated changes (covariation) between
nucleotide positions in a sequence alignment as evid
ence of interaction, this means two things:
(i) if there are no changes, MI cannot predict interactions, and (ii) other correlated changes, such
as pseudoknots or base triples, can also be detected.
124 COMPU TATIO NAL B IOL OGY
Figure 5.1.3 shows many instances of covariation in a larger alignment of 10 related RNA se
quences and how to identify mutually informative nucleotide positions.
FIGURE 5.1.3. Principle of MI. This comparative approach requires a multiple-sequence alignment (A). The
method searches for correlated positions in the alignment. For instance, in this alignment, positions 2 and
14 appear to be correlated (B) because as position 2 changes, from an A to a U, for example, there is a
corresponding change in position 14 from a U to an A. All instances of corresponding changes are identified
(C), and this can be used to determine the secondary structure of the RNA sequences. Finally, the
predicted structure of the first sequence in the alignment is shown (D). All the correlated changes in this
example are base pairing in stem regions, but other changes are also possible to predict.
R N A S tructur e Pr ed iction 125
Exercises
Interactive exercises (theory)
Use the RNA free-energy and MI links below to learn how to determine the best
RNA structure for a single sequence using thermodynamic calculations and to
predict interacting positions using a multiple-sequence alignment and MI.
Problems
1. Determine the RNA free energy of the following sequence. Show your work.
2. a. Draw lines below the table connecting predicted interacting positions based
on the principle of MI.
b. Draw the predicted structure of the first sequence in the alignment (top
row) below.
128 COMPU TATIO NAL B IOL OGY
Lab Exercise
Part 1. RNA folding
Use Mfold to predict the structure of a tRNA sequence and answer the following
questions.
>Haemophilus_influenzae_tRNA
GGGGAUAUAGCUCAGUUGGGAGAGCGCUUGAAUGGCAUUCAAGAGGUCGUCGGUUC
GAUCCCGAUUAUCUCCACCA
3. Draw/show your tRNA sequence in the analysis window and answer the fol
lowing questions or follow the directions below.
a. Click on an energy dot plot file link. What is this energy plot telling you?
Explain.
b. Draw or paste the top left corner of the energy dot plot from positions 1 to
30. Your answer should include nine 10 × 10 position squares.
c. Draw the RNA structural element predicted in this region of the dot plot.
You should be able to find this element in the structure 1 image file.
d. Compare the predicted RNA structure 1 to structure 2. How do they differ,
and how might this change the free energy of the molecule? Explain.
130 COMPU TATIO NAL B IOL OGY
Part 2
http://kelleybioinfo.org/algorithms/data/DRna1.txt
1. Write the matching nucleotide positions of the longest predicted stem-loop
region (the longest diagonal) in the graph. Approximate positions will suffice
given the difficulty of reading the graph. (As the old saying goes, you get what
you pay for, and these bioinformatics websites are free.)
2. Write the numbers of the predicted interacting positions indicated by the diag
onal in the top left corner of the MatrixPlot graph. Note: there are 345 aligned
sequence positions (x and y axes include positions 1 to 345).
3. Write the interacting RNA nucleotides at these positions using the RNA1
sequence.
Notes
1. Hydrogen bonding of guanine with uracil (considered noncanonical base pairs) is very com
mon in RNA molec ules.
2. Turner DH, Mathews DH. 2009. NNDB: the nearest neighbor parameter database for pre
dicting stability of nucleic acid secondary structure. Nucleic Acids Res 38:D280–D282.
CHAPTER
06
PHYLOGENETICS
O
ne of the most remarkable, and arguably the most important, bioinformat
ics discoveries revolutionized our entire understanding of life on Earth. In
1977, Carl Woese and colleagues at the University of Illinois manually
aligned pieces of ribosomal RNA1 sequences isolated from common bacte
ria (Escherichia coli and cyanobacteria), a few eukaryotes (yeast and an
aquatic plant called duckweed), and some methane-generating “bacteria” previ
ously isolated from dairy cows. Using some simple calculations to estimate how
different the sequences were from one another and some parsimony-type rea
soning (see Activity 6.1), the authors generated a small phylogenetic tree that
completely upended our understanding of life on Earth (Fig. 6.1).
While the tree doesn’t look like much, it strongly suggested that the methano
genic bacteria (e.g., Methanobacterium) and their relatives weren’t bacteria at all.
Rather, they comprised a major new branch in the tree of life. As you can imagine,
this result caused quite a stir and was greeted with much skepticism. This new
hypothesis of life overturned the basic 5-kingdom description of life that had been
accepted since the mid-19th century. However, the more sequences scientists
generated (including whole-genome sequences and alignments), and the better
the alignment and phylogenetic algorithms, the clearer and more solid the pattern
became (Fig. 6.2).
133
134 COMPU TATIO NAL B IOL OGY
FIGURE 6.1. The “universal” phylogenetic tree circa 1987. The tree of life naturally
separated into three domains: the Eubacteria (now called Bacteria), the Eukaryota (now
Eukarya), and the then-newly identified group called the Archaebacteria because they
were bacteria-like. The Archaebacteria turned outto be fundamentally different from the
Bacteria and share many molecular and cellular aspects with the Eukarya, and they are
now called the Archaea.
• In clouds, industrial waste pits, shower curtains, and scalding steam vents on
Hawaiian volcanoes2
• The mouth, gut, and surface of every animal on Earth
The number of discoveries proceeding from this research has been mind-
blowing. Since 1977 and the development of culture-independent molecular meth
ods, we have learned that
FIGURE 6.2.The expanding tree of life. New molecular methods, including whole
genome sequencing and direct sequencing of DNA from complex environmental
samples have greatly expanded our understanding of microbial diversity. The Bacteria
are indicated in blue, the Eukarya in red, and the Archaea in green. This tree is biased
towards bacterial lineages and only includes a few select representative sequences.
However, it does illustrate how our understanding of microbial diversity has grown.
A complete depiction of microbial diversity would include millions of branches and
would be impossible to display in a figure. Still doubted by some in the early-2000s, the
pattern of this phylogenetic tree continues to strengthen with the addition of genomic
sequences.6
Uses of Phylogenetics
The use of phylogenetics long predates the work of Woese and colleagues, al
though the comput ational methods trace back only to the 1960s. Most phy
logenetic trees were built to describe the relationships among macrofauna and
-flora like plants, insects, birds, marmots, and whales. Prior to DNA sequenc
ing, phylogenetic analysis relied on morphological data (presence or absence of
scales, bones, and hair) and other visible characteristics. However, with the de
velopment of DNA sequencing methods, nucleotides and amino acids became
predominant.
Beyond the big tree, phylogenetic theory has had an enormous impact on our
understanding not only of the relationships among organisms but also of the rela
tionships among genes within genomes, the process of molecular evolution, the
evolution of gene families, patterns of recombination, and horizontal gene trans
fer. Phylogenies have also been used to discover new forms of life and the origins
of deadly viruses and bacteria. Figure 6.3 shows some examples of the many
possible uses of phylogenetics.
136 COMPU TATIO NAL B IOL OGY
FIGURE 6.3. Some uses of phylogenetic trees. (A) Pathogen identification. Sequence
alignment and phylogenetic analysis were used to show that the deadly severe acute
respiratory syndrome (SARS) virus was a type of coronavirus (a common cold virus). The
numbers indicated the maximum possible bootstrap values (see “The Bootstrap” later in
this chapter). (B) Relationships among steroid receptors, transcription factors that bind
steroid hormones like estrogen (ER, estrogen receptor) or testosterone (AR, androgen
receptor). This phylogeny shows that all steroid receptors were derived from an ancestral
protein that likely bound an estrogen-like hormone. PR, progesterone receptor; GR,
glucocorticoid receptor; MR, mineralocorticoid receptor. The lamprey sequence outgroups
are highlighted in red. Reprinted from Thornton JW. 2001. Proc Natl Acad Sci U S A
98:5671–5676, with permission. (C) Phylogeny of cultured and uncultured mycobacterial
species (Mycobacterium tuberculosis causes, you guessed it, tuberculosis). The boldface
numbers indicate novel uncultured species of mycobacteria from the air of a hospital
pool where the lifeguards had been getting sick and coughing up blood. Reprinted from
Angenent LT, Kelley ST, St Amand A, Pace NR, Hernandez MT. 2005. Proc Natl Acad Sci
U S A 102:4860–4865, with permission.
FIGURE 6.4. Aspects of phylogenetic trees. (A) The topology of the tree indicates the
relationships among the taxa. The taxa at the tips (leaves, blue circles) of the tree are
connected by branches to the other taxa via internal nodes (red circles). The nodes indicate
the common ancestor, and the fewer the nodes between taxa, the closer their phylo
genetic relationship. Both trees are unrooted; however, the bottom tree shows additional
information in the form of branch lengths. Longer branches indicate more evolutionary
change in the sequence since the split from the common ancestor. (B) On the left is a
cladogram-type tree rooted with the squid (an invertebrate) outgroup. Outgroups are used
to determine the order of evolutionary events. Squid make a good outgroup in this case
because they are “outside” the group of three vertebrates. The trees to the right have the
same topology, just sideways. The arrow shows that one can rotate taxa at a node without
affecting the interpretation, since evolutionary time always extends from the root node
outward.
the word taxonomy and is a general term to encompass any level of taxonomic
organization. A taxon can refer to a species, genus, or family of organisms, but it can
also refer to genes or even unknown groups. Figure 6.4 shows how to read different
aspects of phylogenetic trees that can be used to describe the relationships among
taxa, the amount of mutational change since the taxa split from a common ancestor,
and even the statistical support of the relationships.
The Bootstrap
In Activity 6.1, we will cover two methods for creating phylogenetic trees using
multiple-sequence alignments of DNA or protein sequences: distance and parsi
138 COMPU TATIO NAL B IOL OGY
mony. These tree-building methods aim to determine the best phylogenetic tree
for a given set of taxa. However, while these methods can build a phylogeny, they
cannot by themselves determine the statistical significance of the tree.
This is where the bootstrap7 method comes in. Phylogenetic bootstrapping is
the most commonly used method for determining how well the data support the
relationships in the tree. A bootstrap is a common statistical procedure that creates
a random resampling of the data with replacement. In the case of phylogenetic
analysis, the bootstrap resamples the positions of the multiple-sequence align
ment, creating a new data set of the same size (same number of positions). In a
typical bootstrap analysis, many hundreds or thousands of bootstrap replicates are
performed, and each replicate creates a randomly sampled sequence alignment
(Fig. 6.5A). During each replicate, a phylogenetic tree is built from the randomly as
FIGURE 6.5. Example of bootstrap analys is. (A) Making two bootstrap data sets by
sampling with replacement. One hundred bootstrap replicates would make 100 data sets.
Notice how some positions have been sampled multiple times in a data set while others
not at all. (B) A phylogenetic tree is built for each data set. All the resulting trees are
combined to make a consensus tree. If allthe trees have a particular node, this node is said
to have 100% bootstrap support. In a standard approach, if a node is found in fewer than
50% of the bootstrap trees, allthe unsupported branches are collapsed into a single node.
PH Y LO G EN ETI C S 139
sembled alignment. If one performs 1,000 bootstrap replicates, one has created
1,000 phylogenetic trees of the same set of taxa. In the final step, allthe bootstrap
replicates are summarized into a single bootstrap consensus tree (Fig. 6.5B).
The idea behind the bootstrap in phylogenetics is simple: if every single tree in
allthe bootstrap replicates (100% of the trees) shows taxon A closely related to
taxon B, this is the highest support that can be achieved and indicates that the rela
tionship of taxon A to taxon B is strongly supported by the data. For example, in the
SARS phylogeny (Fig. 6.3A), there is 100% support for the relationship of the SARS
virus to the group 2 coronaviruses. In other words, no matter what positions of the
alignment are selected, SARS is always closely related to the group 2 coronavi
ruses. Bootstrap values of 95% or higher are considered significant, though in prac
tice values of 70% or higher can be trusted.
It is important to keep in mind several things about the bootstrap. First, allthe
bootstrap does is create randomized data sets. The bootstrap is a statistical
method, not a phylogenetic method. The phylogenetic analysis is performed sep
arately with each bootstrap data set. For instance, one can perform a neighbor-
joining, a maximum-parsimony, or a maximum likelihood bootstrap analysis (as
we will see in Activity 6.1). Second, the bootstrap phylogeny is a consensus phy
logeny, not the best phylogeny for the data set. For instance, the best phylogeny
may resolve allthe relationships, but the bootstrap often collapses the branches
that are not well supported (Fig. 6.5B). Third, the more bootstraps performed, the
better. However, some methods are very slow (like maximum parsimony and es
pecially maximum likelihood), and depending on the number of taxa, it may take
too much time to do thousands of bootstrap replicates.
Notes
1. The sequences were actually RNAs isolated from organisms grown in cultures—a real pain in
the neck! Now we just sequence the DNA that codes for the RNA (or anything else).
2. To read about those last two, see Kelley ST, Theisen U, Angenent LT, St Amand A, Pace
NR. 2004. Molecular analysis of shower curtain biofilm microbes. Appl Environ Microbiol
70:4187–4192 and Ellis DG, Bizzoco RW, Kelley ST. 2008. Halophilic Archaea determined
from geothermal steam vent aerosols. Environ Microbiol 10:1582–1590.
3. Locey KJ, Lennon JT. 2016. Scaling laws predict global microbial diversity. Proc Natl Acad
Sci U S A 113:5970–5975.
4. One study estimated total number of eukaryotic species at 8.7 million (Mora C, Tittensor DP,
Adl S, Simpson AG, Worm B. 2011. How many species are there on Earth and in the ocean?
PLoS Biol 9:e1001127), while microbial diversity (mainly Bacteria and Archaea) has been esti
mated at 1 trillion (see endnote 7, above).
5. See Kallmeyer J, Pockalny R, Adhikari RR, Smith DC, D’Hondt S. 2012. Global distribution
of microbial abundance and biomass in subseafloor sediment. Proc Natl Acad Sci U S A
109:16213–16216 and Whitman WB, Coleman DC, Wiebe WJ. 1998. Prokaryotes: the un
seen majority. Proc Natl Acad Sci U S A 95:6578–6583.
6. Why only 3 domains—why not a 4th, 5th, or more? Have we missed them? Are viruses the
4th domain? Stay tuned!
7. The term “bootstrap” comes from the idiom “pull yourself up by your own bootstraps,” which
means to do it withoutany outside assistance (i.e., allby yourself). Bootstraps can be found
on cowboy boots and are used to put them on. In the context of statistics or phylogenetics,
bootstrap methods resample the same data as used to generate the result, and thereby to
test the robustness of the result. “Pull the result up by its own data.”
140 COMPU TATIO NAL B IOL OGY
Motivation
A phylogeny is a diagram that indicates the evolutionary relationships among organisms or the
molecular sequences (DNA, RNA, or protein) within organisms.1 Phylogenetic trees have been
used to, among other things, determine the relationships among songbirds, study the evolution
of drug resistance in HIV, predict new gene functions, and discover new forms of microbial life.
Phylogenetic methods, algorithms for computing phylogenies, require multiple-sequence align
ments of DNA or protein sequences of the same gene from different organisms to determine
the evolutionary relationships among the organisms.2 One may also make sequence alignments
of different but related genes (e.g., allgene sequences in the globin family) to determine how
the gene family evolved. Phylogenetic methods use information from the mutations that have
accumulated among the sequences over evolutionary time to determine how the sequences are
related to one another.
In this activity, we will cover two basic methods for constructing the best phylogenetic tree,
given a set of aligned sequence data. The underlying principles behind these methods are very
different, but they have the same goal: finding the best phylogenetic tree for the data. We will
also cover a statistical approach for assessing the quality of the phylogeny and use an online re
source for building phylogenetic trees using these methods.
Learning Objectives
1. Learn the principles and goals of phylogenetics, some uses of phylogenies, how to interpret
phylogenetic trees, and how the bootstrap method can be used to determine tree accuracy
(Motivation).
2. Be able to calculate a distance matrix based on a multiple-sequence alignment and under
stand how it can be used to build a neighbor-joining tree (Concepts and Exercises).
3. Use the maximum-parsimony principle and set theory to determine the ancestral characters
in a phylogenetic tree (Concepts and Exercises).
4. Use the maximum-parsimony principle to determine the best among a set of phylogenetic
trees (Concepts and Exercises).
5. Learn how to use an online program for building phylogenetic trees using neighbor joining
and maximum parsimony and run a bootstrap analysis (Concepts and Exercises).
Concepts
This activity covers two related, but fundamentally different, approaches to estimating the phylo
genetic relationships among a set of aligned DNA or protein sequences. The first method we will
cover is referred to as a distance method because it uses the overall distance (dissimilarity)
among the sequences in a multiple-sequence alignment3 to determine the relationships. The
second method, referred to as maximum parsimony (MP), also uses a multiple-sequence align
ment but focuses on specific nucleotides or amino acid positions in the alignment to determine
PH Y LO G EN ETI C S 141
how the sequences are related to one another. In science, the principle of parsimony states that
in choosing between competing hypotheses, the one with the fewest assumptions should be
selected.4 In the case of phylogenetic trees, MP selects the phylogenetic tree that requires the
fewest number of changes (i.e., the maxim ally parsimonious tree).
Species 1 A T A T TT C G A T
Species 2 A T C G TC C G G A
Species 3 G C C G TT C G C A
Species 4 G T A G TC G G A T
Reflection
• What is the total number of identical nucleotides between species 1 and species 2? How
about species 1 and species 3?
• Could you use the number of differences and the length (number of positions) in the
sequence alignment to calculate a distance score between species 1 and 2?
• Could you use a similar approach with protein sequences? What would you compare?
• How many total pairwise comparisons are there in this alignment? If there were 100
sequences, would you offer your computer some candy if it helped you out?
The answer is as follows. The sequences for species 1 and species 4 have 4 outof 10 nucleo
tides that are different between them. The same is true of species 2 and species 3. Since the
sequences are 10 nucleotides in length, the distance is 4 outof 10, or 0.4, for these pairs and
makes them equally close matches. Species 1 and species 3, and species 3 and species 4, have
6 outof 10 nucleotides different, a distance of 0.6, which makes them equally far matches.
Species 1 A T A T TT C G A T
Species 2 A T C G TC C G G A
Species 3 G C C G TT C G C A
Species 4 G T A G TC G G A T
142 COMPU TATIO NAL B IOL OGY
FIGURE 6.1.2. Using the distance matrix from Fig. 6.1.1 to build an NJ tree. (1) The
initial tree is a so-called star phylogeny in which allthe species are attached to a central
node, and it has no structure. (2) The first step is to search the matrix for the nearest
neighbors: the sequences with the shortest distance. These are joined together on the
tree. (3) Next, the average distances are calculated for the two sequences that were
joined (S3 and S4) to allthe other sequences in the matrix. This creates a new, smaller
distance matrix. (4) Repeat step 2 with the new distance matrix. Find the shortest
distance and join the taxa. (5) Recalculate the distance matrix and join nodes until there
are no more distances.
The online distance matrix interactive link (see Exercises) has additional explanations of how
to build a matrix for a set of sequences and generate a phylogenetic tree. The distance calcula
tions performed here are the simplest possible, and there are more sophisticated metrics that
account for mutational biases and the possibility of mutation reversals. Also, the NJ tree-
building approach is more complicated than described and includes estimations of branch
lengths. However, the steps in Fig. 6.1.1 and 6.1.2 and the algorithm interactive show the basic
principles of matrix building and NJ tree construction.
while species 2 and 3 share an A at the same position. Write down which positions are shared
between the sequences below.
1 2 3 4 5 6 7 8 910
Species 1 VYLGHEFQKS
Species 2 VALRHDFQKW
Species 3 VALRHDFQFW
Species 4 VYLGHEFQFS
Reflection
• Based on your positional analysis of this multiple-sequence alignment, can you conclude
what species are most closely related?
• What do positions 1 and 3 tell you about which species are more closely related?
• Are there any conflicting data? In other words, do the shared amino acids at some positions
contradict the patterns at other positions?
• Could a similar approach be used with DNA sequences?
Here are the answers. Positions 2, 4, 6, and 10 indicate a close relationship between species 1
and 4 and species 2 and 3. Position 9 contradicts this result, and the rest of the positions do not
say anything about which sequence is related to which according to MP.
1 2 3 4 5 6 7 8 910
Species 1 VYLGHEFQKS
Species 2 VALRHDFQKW
Species 3 VALRHDFQFW
Species 4 VYLGHEFQFS
The basic principle behind MP is that only shared derived positions (called synapomorphies)6 are
useful for determining phylogenetic relationships. In the example, this includes positions 2, 4, 6,
9, and 10. Positions that are shared in allspecies are useless for MP, because they don’t give you
any information about whether species 1 is closer to species 2 than it is to species 3 or species 4.
These positions are considered to be ancestral character states (derived from a single common
ancestor species), and so the algorithm ignores them.
FIGURE 6.1.3. Two possible phylogenetic trees for one data set of 5 sequences with
7 nucleotide alignment positions. The bottom half of the figure shows the nucleotides
of the first position mapped to the tips of the phylogenetic tree. Next, we will determine
the minimum number of changes needed on the phylogeny to produce this pattern of
nucleotides at this position (see Fig. 6.1.4 and 6.1.5).
146 COMPU TATIO NAL B IOL OGY
In Fig. 6.1.4, set theory7 is used to determine the most parsimonious set of nucleotides8 at
each node of the tree. The following describes the set theory notation for solving this part of the
problem. (You will likely find the concept easier to grasp in the figure and the online tutorial/inter
active.) The set theory logic is as follows:
Translated into English, if the two higher nodes have something in common, the node contains
only what they have in common (the intersection, ∩). For example:
{node} = {G}
But what happens if they do not have anything in common? Easy, you keep everything! This is
also known as the union (∪) of the two sets.
FIGURE 6.1.4. Using principles of set theory to reduce the number of possibilities at
each node. To solve the problem, start at the tips (leaves) of the tree and move towards the
base (root). (A) In tree 1, the first node is an A because the intersection of the two tips above
is A. In tree 2, the first node is either an A or a G. The intersection of {A} and {G} is the empty
set, so we keep everything—the union {A, G}. (B) Proceed down to the base of each tree
using the above tips/nodes to determine the most parsimonious possibilities at each node.
PH Y LO G EN ETI C S 147
For example:
Figure 6.1.5 shows the second part of the process: determining a path through the tree that
requires the least number of changes (i.e., is maximally parsimonious). The online tutorial allows
you to practice with this concept of minimization.
Finally, the minimization process is repeated for each position in the sequence alignment (Fig.
6.1.6). While tedious, it becomes quicker when one realizes that certain positions can be ignored
because they are parsimoniously uninformative.
FIGURE 6.1.5. Having already minimized the possibilities at each node, choose the
characters (nucleotides) at each node that require the fewest changes. (A) Starting
at the root of the tree, the only option in both trees is G. Moving up the rightmost branch,
we see that no changes are required. The position doesn’t mutate (change) along the
branch, and 0 steps are required. Moving up the left branch, if we choose G at the node,
no changes are required. Choosing A would require 1 mutation (step) from a G to an A.
This is not the minimum, so choosing G is preferable for both trees. (B) Complete the
minimization for allnodes. In tree 1, the path chosen requires 1 total change in the tree for
this position, the G-to-A mutation indicated by the arrow. No other path will be less than 1
for this position in this tree. In tree 2, the path we choose has 2 mutations, and this is also
the minimum for this position on this tree.
148 COMPU TATIO NAL B IOL OGY
FIGURE 6.1.6. Finish the problem by calculating the number of steps for each position
on each tree. Then, add them up to determine the total number of steps across all
positions for each tree. In this case, tree 1 has a minim
um total of 8 steps across all
positions in the alignment. This is one step shorter than is needed for tree 2, so tree 1
is the most parsimonious option.
For example, if allthe nucleotides are the same at a position, there will be 0 changes (steps)
on allthe possible trees at this position (see position 5 in Fig. 6.1.6). Similarly, if there has been
only 1 change in one of the sequences, and allthe rest are the same, the number of changes
(steps) for this position will always be 1 no matter which tree is tested (position 4 in Fig. 6.1.6).
These positions are called uninformative characters because they cannot inform which of the trees
are most parsimonious. In fact, MP tree-building software completely ignores these positions.
The online MP tutorial explains how to determine the best phylogenetic tree given a multiple-
sequence alignment:
http://kelleybioinfo.org/algorithms/interactive/IMax.pdf.
PH Y LO G EN ETI C S 149
Exercises
Interactive exercises (theory)
Use the online phylogeny interactive links below to learn how to build distance
matrices and phylogenies using the distance method and learn how to select the
better phylogeny under the MP criterion. Once you learn how they work, solve the
activity problems.
Problems
1. Distance matrix problem: answer the questions below.
b. Determine the first node of the tree and write it outin set notation (see the
interactive).
c. Recalculate the distance matrix proportions after joining the first node.
2. Parsimony problem: fill outthe nodes in the tree below using parsimony.
Calculate the length of the tree for this position and write it next to each tree.
152 CO MPUTATION AL B IOL OGY
Lab Exercise
Using the DNA and protein multiple-sequence alignments (sample data 1 and
sample data 2, respectively), perform the phylogenetic analyses using the online
phylogenetic software described in the tutorial.
1. Using the sample data specified below, draw/show the following results.
NOTES:
• In the Workflow Settings, uncheck the boxes for Multiple Alignment and
Alignment curation. The sequences are already aligned and curated. Also,
leave the Visualization box checked and choose TreeDyn if not already
checked, and the workflow should be run “allat once.”
• Choose protein or DNA/RNA on the Input Data page as appropriate.
• Leave the rest of the default settings on the Input Data page.
2. Analyze sample data 2 using the NJ approach (FastDist + Neighbor). Draw the
top four branches of the tree or paste the tree below.
3. Perform a bootstrap analysis with 100 resamplings of the data using sample
data 1 with the NJ criterion selected (FastDist + Neighbor). Draw/show the
results below. Circle the branch with the highest bootstrap support.
154 CO MPUTATION AL B IOL OGY
Notes
1. DNA and other molecules are most often used to determine the relationships among organ
isms. However, sometimes researchers are interested in the evolution of the molecules
themselves (e.g., gene family expansion, the evolution of drug resistance in HIV, and RNA
structural changes).
2. Information from multiple-sequence alignments of different gene sequences within organ
isms can also be combined to increase information and phylogenetic accuracy.
3. See multiple-sequence alignment in Chapter 03, Activity 3.1.
4. Parsimony is also known as Occam’s (or Ockham’s) razor, named after William of Ockham, an
English Franciscan Friar (c. 1287–1347) from the town of Ockham. The town is also known
for opening its doors to William and Ellen Craft, slaves who escaped the United States after
the passage of the barbaric Fugitive Slave Act of 1850 and who became important figures in
the abolitionist movement. And it has a cool mill. They are very proud.
5. Distances can be calculated between pairs of aligned DNA/RNA sequences, protein se
quences, and even organism characteristics like number of toes, hair color, behaviors, and
other things.
6. Synapomorphy is derived from the Greek: “syn-” means “shared,” and “-morph” means “shape.”
7. In case you’ve forgotten (or never knew) how sets work, here is a short primer. The intersec
tion of two sets is symbolized by ∩ and is equal to the set of what is shared between the
two sets. The union of two sets is symbolized by ∪ and is equal to the set of allobjects in
both sets. For example, if A = {1, 3} (set A includes the number 1 and 3) and if B = {3, 4}, then
A ∩ B = {3} because the number 3 is allthey have in common, and A ∪ B = {1, 3, 4}, which is
allthe numbers in both sets. And finally, there is the concept of the empty set, symbolized
by {} or ∅, a set with nothing inside it. For example, if A = {1, 2} and B = {3, 4}, then A ∩ B = ∅
(the empty set) because the two sets share nothing in common.
8. This example uses DNA nucleotides, but the data could easily be amino acids or even physi
cal (morphological) characteristics of organisms. Phylogeneticists refer to these generally as
characters and the number of changes as steps.
CHAPTER
07
PROBABILITY: ALL MUTATIONS ARE
NOT EQUAL (-LY PROBABLE)
Y
ou have already encountered the use of probability in several algorithms in
this book, including the Chou-Fasman propensities in Chapter 02 as well
as the sequence logos and position-specific weight matrices in Chapter 04.
In this chapter, we discuss the concepts behind and generation of probabil
ity matrices that are used in many bioinformatics algorithms. Most of the
chapter and exercises focus on how to determine the probability (likelihood1)
of amino acids mutating to other amino acids. These amino acid substitution
matrices—Point Accepted Mutation (PAM) and BLOcks SUbstitution Matrix
(BLOSUM)—dramatically improve the performance of different protein alignment
algorithms. We also briefly discuss the advanced concept of hidden Markov
models (HMMs), a powerful means for making probability matrices “on the fly.”
HMMs are used in many bioinformatics applications, including predicting genomic
repeat regions, transmembrane proteins, and protein-coding regions in genomes
and clustering distantly related protein sequences into families.
157
158 COMPU TATIO NAL B IOL OGY
The BLOSUM matrices also contain log odds scores, but they are calculated
in a different way, as you will learn later in the chapter.
These substitution matrices have proven very helpful for improving the perfor
mance of many algorithms, particularly sequence alignment methods such as
dynamic programming and BLAST. The reason they are so useful in sequence
alignment is that the scores really help differentiate among possible alignments.
For instance, below are two small sequence alignments of the same query (QY)
sequence to two different subject sequences (S1 and S2).
QUERY SEQ : H L R W S
SUBJECT 1 : H L S E S
SUBJECT 2 : M L S W S
Alignment 1 Alignment 2
QY: H L R W S QY: H L R W S
S1: H L S E S S2: M L S W S
Using the PAM250 matrix in Fig. 7.2, one can score both by adding up the log-
odds scores for allthe alignment positions. For instance, in alignment 2 the score
for an H (histidine)-to-M (methionine) change is −2, that for an L-to-L change is +6,
and so forth.
In this case, even though both alignments have two mismatches, the PAM
scoring system tells us that alignment 2 is a better alignment because it has a
higher overall score.
Alignment 1 Alignment 2
QY: H L R W S QY: H L R W S
+6 +6 +0 −7 +3 =+8 –2 +6 +0 +17 +3 =+24
S1: H L S E S S1: M L S W S
P R OB AB IL IT Y: AL L M U TATI O N S A R E N O T EQ U A L ( - LY PR O B A B LE) 159
FIGURE 7.2. The PAM250 substitution matrix. This matrix shows the log odds scores
of every amino acid changing to every other amino acid, or not changing at all. Positive
numbers mean that the likelihood is higher than expected by chance, 0 means the same
as chance, and negative means less likely than chance. For example, alanine not mutating
(A to A on the table, top left red box) is more likely (+2) than alanine mutating to cysteine
(A to C, −2, fifth-row red box). All the values along the diagonal are for the amino acid NOT
changing. Typically, the log odds of no change are the highest values, but some substitu
tions are very common. For instance, a change from an isoleucine to a leucine (+2) is just
as likely as an alanine not mutating.
FIGURE 7.3. Venn diagram of amino acid properties (left) and the genetic code (right).
Amino acids within circles of the Venn diag
ram are considered more biochemically similar.
likelihood (−3) of isoleucine being changed to glycine (G). Why is this? There are
two fundamental reasons, which are illustrated in Fig. 7.3. First, isoleucine and
leucine are “chemical cousins.” These amino acids are similarly sized and have
similar biochemical properties; namely, they are both hydrophobic (Fig. 7.3, left
side, aliphatic grouping). A change of I to L in any given protein will usually have a
very modest effect on the protein’s function. However, isoleucine and glycine are
very different biochemically, and a substitution of one for the other could spell
disaster. See how in Fig. 7.3, left side, I and G are not in any shared groupings?
If such a substitution were to occur in a critic al cellular protein, it could eliminate
an organism’s ability to function or reproduce (Darwin says, "Goodbye!").
Second, the likelihood of a substitution is also dependent upon how many nu
cleotide changes (mutations) need to occur in the underlying protein-coding DNA.
For example, the genetic code table shows that only one nucleotide in the first
position of the codon needs to change (AUU to CUU) to cause an isoleucine-to-
leucine mutation (Fig. 7.3, right side). On the other hand, an isoleucine-to-glycine
mutation requires two independent mutations in the codon (AUU to GGU). Since
the probability that two independent events will occur is the product of each inde
pendent probability,3 the differences in codon sequences have a considerable
impact on the likelihood of amino acid mutations.
Creating a probability model that takes into account both the biochemical
properties of the amino acid AND the likelihood of mutations at the DNA level
would be difficult, to say the least. Instead of doing this, a researcher named Mar
garet Dayhoff tried another approach: counting the different amino acid substitu
tions that occur in nature.
whether this is more likely or less likely than expected by chance. Finally, the
ratio of these values is used to determine the log odds scores. Activity 7.1 ex
plains the differences between PAM and BLOSUM and the steps of calculat
ing these matrices.
Multiple different substitution tables have been created using both the PAM and
BLOSUM approaches. The choice of using a PAM or BLOSUM substitution matrix
for BLAST or other applications should ideally be based on the overall dissimilarity
of the sequences being compared or aligned. The greater the dissimilarity, the
higher the PAM substitution matrix number that should be used. A PAM1 matrix
might be more appropriate for very closely related sequences, while a PAM250
matrix may be more appropriate for highly divergent sequences. The numbers
indicate the expected rate of mutation among sequences per 100 amino acids.
Like PAM, BLOSUM numbers also indicate divergence levels of the sequences
used to make the matrices, but the values go in the opposite direction: lower num
bers are used for more divergent sequences, higher numbers for less divergent
sequences. A BLOSUM62 matrix was generated from sequences with an aver
age of 62% overall similarity, while a BLOSUM90 matrix was generated from se
quences that were 90% similar. Similarity equivalences between the two matrices
are as follows:
PAM BLOSUM
PAM100 BLOSUM90
PAM120 BLOSUM80
PAM160 BLOSUM60
PAM200 BLOSUM52
PAM250 BLOSUM45
When choosing what matrix to use, it is best to pick the matrix that fits the gen
eral divergence of the sequences being aligned. In practice, the choice of matrix
does not make a particularly big difference in alignments.
Notes
1. In common speech, probability and likelihood are used interchangeably, though there are
some differences—at least to statistics nerds. Both terms are used in this chapter, though
likelihood is the more appropriate term for the chance that one amino acid mutates into an
other amino acid.
2. The log odds score is the logarithm of the odds ratio. In the case of amino acid substitution
matrices, the odds ratio is the ratio of the observed likelihood of a substitution to the likeli
hood expected by chance.
3. It has been estimated that the chance of you getting caught in a tornado is 1 × 10 − 6, while you
have a 1 × 10−7 chance of being bitten by a shark. So, the expected probability of you being
in a tornado AND being bitten by a shark is (1 × 10 − 6) × (1 × 10 −7) = 1 × 10 −13. Unless you get
caught in a sharknado. Then the probab ility increases to 1 × 10 −1.
4. For a review, see Yoon B-J. 2009. Hidden Markov models and their applications in biological
sequence analysis. Curr Genomics 10:402–415.
5. Black EF, Marini L, Vaidya A, Berman D, Willman M, Salomon D, Bartholomew A, Ken-
yon N, McHenry K. 2014. Using hidden Markov models to determine changes in subject
data over time, studying the immunoregulatory effect of mesenchymal stem cells. Proc IEEE
Int Conf Escience 1:83–91.
P R OB AB IL IT Y: AL L M U TATI O N S A R E N O T EQ U A L ( - LY PR O B A B LE) 163
Motivation
Protein sequence matching (e.g., BLAST), multiple-sequence alignments, and phylogenetic ana
lyses have long used substitution matrices to determine the best alignments or the best trees.
Substitution matrices provide scores, called log odds scores, indicating the likelihood of each of
the 20 most commonly occurring amino acids mutating into allthe other amino acids or not mu
tating at all. The two approaches taught in this chapter for creating substitution matrices include
PAM (Point Accepted Mutation) and BLOSUM (BLOcks SUbstitution Matrix). Both approaches
use observed patterns of amino acid substitutions to generate substitution matrices. PAM deter
mines these probabilities using phylogenetic trees, while BLOSUM bases the probabilities on
conserved blocks of aligned sequences. Both methods also calculate how often the mutations
are expected to occur by chance based on the frequency of the amino acids. The scores are logs
of the ratios of the observed probabilities to the expected probabilities.
This activity will teach the principles of two different approaches for generating amino acid
substitution matrices, as well as how to calculate the log odds scores. The lab exercises will
show how these methods are used in protein BLAST analyses and introduce the Pfam (protein
family) database, which uses the more sophisticated HMM approach to cluster groups of func
tionally related proteins.
Learning Objectives
1. Learn the biological principles behind substitution matrices and how probabilistic approaches
can be used in bioinformatics (Motivation).
2. Use phylogenetic trees to estimate amino acid substitution rates and generate PAM-like sub
stitution matrices (Concepts and Exercises).
3. Use conserved blocks within protein sequence alignments to estimate amino acid substitu
tion rates and generate BLOSUM-like substitution matrices (Concepts and Exercises).
4. Learn how PAM and BLOSUM scores are used in protein BLAST analysis (Concepts and
Exercises).
5. Gain familiarity with the HMM-based Pfam database (Concepts and Exercises).
Concepts
To better understand the biological principles behind amino acid substitution matrices, try the
following exercise. On the next page is a Venn diagram illustrating the biochemical similarities
among the 20 most common amino acids. The curved lines encompass letter designations for
amino acids with similar biochemical properties, which are labeled outside the diagram. For ex
ample, isoleucine (I), valine (V), and leucine (L) are allalip
hatic amino acids that also belong to a
larger set of hydrophobic amino acids.
164 COMPU TATIO NAL B IOL OGY
The protein sequence alignments below are of the same length and have the same number
of identities (4 identical matches outof 10, i.e., 4/10). Use the information in the Venn diagram to
determine which of the two matches is biochemically more likely.
Match 1
Query: F G Q V I P A K R
Subjt: FANVM P A R E
Match 2
Query: F G Q V I P A K R
Subjt: FSCVFPAFV
Reflection
• Are some mismatches more useful than others for determining the better match? Why or
why not?
• Based on the Venn diagram and your understanding of proteins, would an E-to-M mutation
be more or less likely than an E-to-N mutation? Why?
• Can you use the PAM log odds matrix below to score the two alignments? The scores are
additive and the more positive, the more likely the substitution (or no substitution). For
example, a change from N to N is +2, while a change from N to Y or Y to N is −2.
Match 1
Query: F G Q V I P A K R
Subjt: FANVM P A R E
SCORE:
Match 2
Query: F G Q V I P A K R
Subjt: FSCVFPAFV
SCORE:
Match 1 is the better match. The matching amino acids are the same between match 1 and
match 2, but the 4 mismatches (italicized below) in both alignments are very different. Based on
the Venn diagram, the mutations leading to the differences in match 1 seem more plausible. K to
R substitutes one positive amino acid for another, while K to F substitutes a hydrophilic for a hy
drophobic amino acid. The PAM250 matrix score also supports this assessment.
Match 1
Query: F G Q V I P A K R
Subjt: FANVM P A R E
Match 2
Query: F G Q V I P A K R
Subjt: FSCVFPAFV
P R OB AB IL IT Y: AL L M U TATI O N S A R E N O T EQ U A L ( - LY PR O B A B LE) 165
Match 1
Query: F G Q V I P A K R
Subjt: F A N V M P A R E
SCORE: 9+1+1+4+2+6+2+3-1=27
Match 2
Query: F G Q V I P A K R
Subjt: F S C V F P A F V
SCORE: 9+1-5+4+1+6+2-5-2=11
Sweet Lou
To calculate the likelihood of amino acid substitutions, Dayhoff, the inventor of the PAM matrix,
came up with a clever idea. Instead of deriving a complex formula that included biochemical
properties and the genetic code, why not simply count how many times each amino acid mu
tated to every other amino acid using available protein sequence alignments? Dayhoff also cal
culated the number of times that each amino acid did not change at all.
To understand how counting can be used to calculate the future likelihood of certain events,
consider an analogy from the sport of baseball. In baseball, one of the most difficult things to
do is to hit the ball with the bat.1 The best hitters manage to hit the ball effectively only 3 out
of every 10 attempts and are even less likely to hit the best of alloutcomes, a home run. Fig-
ure 7.1.1 shows the hitting statistics of the Detroit Tigers legend Lou Whitaker (nicknamed
Sweet Lou). One way to determine how likely it would be for Sweet Lou to hit a home run
would be to count how many times he actually hit a home run and compare that number to the
number of times he did something else, including striking out, getting to first base, and walk
ing. Based on his career stats, Sweet Lou hit a home run 3 outof every 100 times he tried to
hit the ball.
To extend the baseball analogy a bit further, let’s calculate a log odds score for Sweet Lou hit
ting a home run. We have the observed likelihood (0.03), but what is the expected likelihood? In
this case, the expected likelihood might be the likelihood of a typical professional baseball player
hitting a home run. If the typical player hits a home run 3 times outof every 1,000 times trying,
then the expected likelihood is 0.003. If we calculate the log odds score using the base 10 loga
rithm,2 then Lou Whitaker’s log odds score of hitting a home run (SHR) would be
0.03 ⎞
SHR = log10 ⎛⎜ = 1.0
⎝ 0.003 ⎟⎠
meaning that Lou Whitaker is 10 times more likely to hit a home run than the typical professional
player. If the value were −1.0, he would be 10 times less likely, and 0 means he would hit home
runs at the typical rate [log10 (1) = 0].
Similar to the baseball analogy, we could use protein sequence data and counting to deter
mine the likelihood of a leucine (L) mutating to a tryptophan (W) or to a histidine (H) or not mu
tating at all. Dayhoff realized that she could look for mutations using protein multiple-sequence
alignments. Both the Dayhoff PAM method and the BLOSUM method use counting to deter
mine observed likelihoods, though the counting is done in a different manner for each method. In
addition to the observed likelihoods, both methods also calculate an expected likelihood, which
166 COMPU TATIO NAL B IOL OGY
FIGURE 7.1.1. Using existing data to determine likelihoods: a baseball analogy. The
figure shows 4 seasons of hitting statistics for Lou Whitaker (Sweet Lou), one of the
finest second basemen ever to play the game. Using the data at the bottom left, we
could calculate the likelihood that Lou would hit a home run (the HR column) the next
time he came up to bat. To do this, we could count the total number of home runs (HR)
Lou hit and divide that number by the total number of times he did anything else at bat,
including striking out(SO), walking (BB), or getting a regular hit. In his career, Sweet Lou
hit 244 home runs at 8,570 times at bat (AB), for a likelihood of ∼0.03 (hit a home run 3
times outof 100 tries). The same type of calculation could be done to see how likely he
was to get to first base, walk, or strike out. Photo courtesy of Aaron Caldwell, under
license CC BY-2.0.
is the likelihood that this would have happened by random chance just based on the frequencies of
the amino acids. The final score that is calculated for both methods is the log odds score, which
is the log of the ratio of the observed likelihood to the expected likelihood. A positive log odds
score indicates that the mutation is more likely than chance, while a negative score indicates
that it is less likely than chance.
FIGURE 7.1.2 Counting alanine substitutions using a phylogenetic tree. (A) Phylo
genetic tree reconstruction of the relationship of 4 aligned sequences. The tree has been
inverted so that the root of the tree is at the top, with the 4 sequences from the alignment
at the bottom. The tree includes 3 ancestral reconstructed sequence reconstructions, one
at the root (top) and two descendants (middle), giving rise to the 4 sequences (bottom). (B)
Counting the number of times in this data that alanine mutated to cysteine. (C) Counting the
number of times alanine remained unchanged. (D) Counting the number of times alanine
changed to another amino acid (in this case, leucine).
168 COMPU TATIO NAL B IOL OGY
FIGURE 7.1.3. Observed amino acid substitution patterns for 10,000 events. The red
boxes indicate the mutations of A to C and C to A, which were combined to determine the
mutation rate. The PAM matrix also calculated the rate of not mutating. A did not change
9,867 times, making MA,A = 9,867/10,000 = 0.9867.
Figure 7.1.3 shows a part of the original substitution table used by Dayhoff to create the PAM
scoring matrix. The data set available in the late 1970s was very limited because few sequences
were available at the time for generating the observed and expected probabilities using the
phylogenetic counting approach like the one shown in Fig. 7.1.2. To make the calculations more
meaningful and easier to calculate, Dayhoff scaled the substitutions such that each amino acid
underwent 10,000 total events. The matrix was also made symmetrical, such that Mi,j = Mj,i by
combining the counts of i to j and j to i and then dividing this over the total number of events (in
this case, 10,000). For example, in Fig. 7.1.3, there is 1 observed A-to-C mutation and 3 C-to-A
mutations, for a total of 4. Thus, MA,C = MC,A = 4/10,000 = 0.0004.
Finally, to calculate the log odds score, one takes the natural log of the observed mutation
rate divided by the expected rate as
pi × Mi ,j M observed frequency
Si ,j = log = log i ,j = log
pi × p j pj expected frequency
where Si,j is the log odds score for a substitution of amino acid i to j, pi and pj are the frequencies
of amino acid i to j respectively, and Mi,j is the mutation rate. These log odds scores were then
calculated for the transition of every amino acid to every other amino acid.
P R OB AB IL IT Y: AL L M U TATI O N S A R E N O T EQ U A L ( - LY PR O B A B LE) 169
The original PAM matrix was based on a data set that includes 71 families of closely related pro
teins. In order to account for substitution patterns among more distantly related proteins, Dayhoff
also introduced a scaling factor that projected the mutation rates for more distantly related se
quences. Different matrices were then created for higher levels of protein divergence. A PAM1 ma
trix assumed an average of 1 amino acid substitution per 100 amino acids, while a PAM250 matrix
assumed an average of 250 substitutions per amino acid (each amino acid mutated multiple times).
FIGURE 7.1.4. Counting amino acid tuples (pairs) in sequence alignment blocks.
(A) Multiple-sequence alignment of four proteins. (B) The first step of the BLOSUM
algorithm is to identify blocks of positions in the sequence alignment withoutgaps. (C)
Using these blocks, the algorithm counts allthe possible pairs within each column. These
are the observed substitutions. In the two columns analyzed in the figure, the algorithm
counted 6 NN tuples in the first column and 4 EE tuples, 1 RE tuple, and 1 ER tuple in the
second column.
170 CO MPUTATION AL B IOL OGY
These tuples are then used to calculate the observed frequencies of the pairings (Fig. 7.1.5).
Then the frequency of the individual amino acids in the tuples is used to calculate the expected
frequency of allthe tuples (Fig. 7.1.6). Finally, with the observed and expected frequencies calcu
lated, the last step is to calculate the log odds scores as follows:
⎛ P(O) ⎞
log odds ratio = 2 × log2 ⎜
⎝ P(E ) ⎟⎠
FIGURE 7.1.5. Calculation of observed frequencies [P(O)]. (A) The tuple table with the total
number of pairs. (B) Calculation of observed frequencies. If the amino acid tuple count is of the
same amino acid (i.e., NN), the tuple count is divided by the total number of observed tuples. If
the amino acid count is of two different amino acids (i.e., RE), one first combines the total of both
possible combinations (RE and ER) and then divides by the total number of observed tuples.
(C) Table of observed frequencies. The zeros indicate tuples that have not yet been observed.
P R OB AB IL IT Y: AL L M U TATI O N S A R E N O T EQ U A L ( - LY PR O B A B LE) 171
BLOSUM matrices, like the PAM matrices, have also been designed for sequences with var
ious levels of divergence. Unlike the PAM matrices, larger BLOSUM numbers should be used
with more similar sequences. The BLOSUM80 matrix was constructed using protein sequence
alignments that clustered together at the 80% similarity level, while the BLOSUM62 matrices
clustered at 62%.
172 CO MPUTATION AL B IOL OGY
Exercises
Interactive exercise (theory)
Use the online PAM and BLOSUM interactive links below to learn how these ma
trices are created using observed amino acid substitutions. The interactive links
explain how to use the teaching interactives. Once you learn how they work, use
them to solve the problems in the next section.
Problems
Solving for a BLOSUM matrix
1. Circle the sequence block in the multiple-sequence alignment below that
could be used for BLOSUM matrix calculations.
3. Fill in the blanks using the expected-probability values in the table.
P R OB AB IL IT Y: AL L M U TATI O N S A R E N O T EQ U A L ( - LY PR O B A B LE) 175
Lab Exercise
Part 1. BLOSUM and PAM: using blastp advanced paramet ers
1. Use the protein BLAST (blastp) program at NCBI to perform a BLAST search
with different PAM and BLOSUM matrices.
a. Search using the prot1 sequence from the sample data. Adjust the blastp
advanced parameters so that the search is performed with the PAM250
matrix. Find a result with an identity of less than 85%. Write the answers
to the problems below.
Protein function:
Identities:
Positives:
Gaps:
b. Repeat the search with the prot1 sequence, but this time perform the
search with the BLOSUM90 matrix. Find the same match result as in part
a and write the answers to the problems below.
Identities:
Positives:
Gaps:
Search the Pfam database with the prot2 sequence from the sample data and
write/draw the answers to the following questions.
1. What significant Pfam match or matches did you find? Write the description.
2. What are any known functions of the protein family (families)?
3. What are the expected value (E values) of this match from the sequence search
results page?
5. What does the logo say about potentially important amino acids in this protein?
Notes
1. Not surprising since a hard ball about the size of a human fist is hurled at the batter at speeds
close to 100 miles per hour (160 km/h), often with a wicked spin.
2. The PAM matrix calculates the log odds score using the natural logarithm, while BLOSUM
uses the base 2 logarithm. It turns outthat the base used is not allthat important, but I find
base 10 easiest to interpret.
CHAPTER
08
BIOINFORMATICS PROGRAMMING:
A PRIMER
T
he main goals of this hypertextbook are to teach the purpose of bioinfor
matics, the algorithms underlying some of the more commonly used bioin
formatics programs, and how to use bioinformatics software to analyze
sequence data. The final chapter focuses on the next step in the evolution of
the bioinformatician: programming. While there are many terrific programs
already available for bioinformatics analysis, there comes a time when you may
need to reformat a large data set for a particular program, or you may need to
compile a Unix program, or you may decide that cutting and pasting data sets into
websites will delay your graduation or publication by approximately a decade. The
purpose of this chapter is to familiarize you with a widely used bioinformatics pro
gramming environment, namely, the Unix operating system (OS), and two com
monly used bioinformatics programming languages: the R and Python languages.
Many excellent books, online tutorials, and classes exist for teaching Unix, R, and
Python, and after finishing the exercises in this chapter you should be ready to
tackle some of these resources and learning on your own. In the primers below
you will find links to free resources and useful textbooks for additional self-tutorials
or instructions.
179
180 COMPU TATIO NAL B IOL OGY
into a special window called a terminal to make things happen. For instance, in
stead of using the menu bar to make a new folder (in Unix systems, folders are
called directories) on your computer, you can quickly make one with a command
called “mkdir” (make directory). Let’s say you had a lot of secrets to keep: you
could make a directory called “TooManySecrets.”
$ mkdir TooManySecrets
$ cd TooManySecrets
$ ls
It may seem awkward, but once you learn the commands and understand the
OS, with one command you can locate any file or any program on the computer
no matter what directory you are in. For example, instead of clicking through four
different folders to find the files in TooManySecrets, you might do the following
instead:
$ cd ∼/Documents/ScottStuff/MISC/TooManySecrets
Ta-da! The / symbol tells the computer to look in a subdirectory. The ∼ is the home
directory. So, this command changes directory by following a path from the home
directory to Documents to ScottStuff to MISC to TooManySecrets. To find the
path to any directory, just type the command pwd (print working directory) and
then press the Enter key. In a Windows or Mac-style OS, to get to the file TooMa
nySecrets, you would need to open Documents, then open ScottStuff, then open
MISC, and then search for the file in the directory. The cd command does the
same thing withoutallthe clicking.
Even better, the Unix command line allows you to execute any software pro
gram from anywhere on the computer, as long as that program is installed in the
correct directory. For example,
$ python
This command finds and opens the Python program. If configured correctly, you
can also open browsers, spreadsheets, or any other program. These are major
advantages for bioinformatics programming, and every bioinformatics program
mer needs to be competent with the Unix language.
like systems, so it is good for bioinformaticians to be familiar with this OS. After
the tutorial in this chapter, you can complete a more extensive tutorial at
http://www.ee.surrey.ac.uk/Teaching/Unix/
You can also learn from this excellent, free online book, The Linux Command Line,
by William Shotts:
http://linuxcommand.org/tlcl.php
There are also many other available online tutorials and books. The MacOS
comes with a Unix terminal, which can be used to complete the tutorial exercise.
If you are using Windows 10, a Bash shell command line tool is included for de
velopers, but you may need to activate it.2 Alternatively, you will need to down
load and install a Unix emulator or a virtual machine on Windows, or install a flavor
of Linux on your personal computer.
Unix-like systems have a clear hierarchical directory structure. Every file and
folder (directory) in these systems is accessible from the command line, as long
as you know the file "path" to what you want to find.
To start playing around in Unix, open a terminal window that should look some
thing like the one below.
182 COMPU TATIO NAL B IOL OGY
In the terminal window, you type in commands and hit “Enter” and things hap
pen! Here, I have typed the commands pwd and then ls.
The pwd command stands for “print working directory” and reports the directory
path of the current directory. The ls command lists all the files and directories
within the current directory.
There are lots of commands you can use to navigate the operating system. Ap
proximately 20 to 30 commands are used all the time. Unix is tedious at first, but
it is much faster than clicking through a lot of folders and scrolling through win
dows. Below are some example commands.
B I O I N FO R M ATI C S PR O G R A M M I N G : A PR I M ER 183
Introduction to R
R is a programming language and environment for statistical computing and
graphics. R has the tools of a commercial statistical package (e.g., SPSS, Systat,
or SAS) but is free of charge (I wish I had discovered R years before I did.) I regu
larly use R to test for statistical correlations, make box plots, and perform analy
ses of variance, regressions, or dozens of other statistical tests. However, the
reason that R is a must for bioinformatics is because R comes with many other
packages (libraries), including many bioinformatics packages. Dozens of bioinfor
matics methods have been coded in R, and an R library is pretty much the first
accessible place for a new sequence analysis algorithm. At the time of this writ
ing, the open-source Bioconductor depository (http://www.bioconductor.org)
had more than 1,200 R packages for high-throughput data analysis. The CRAN3
depository also contains dozens of libraries4 with powerful statistics and algo
rithms for biological data analysis, such as “vegan,” “randomForest,” “ggplot,” and
many multivariate analysis packages.5
This tutorial teaches a few basics to get you going, including (i) how to install
R, (ii) how to upload a data file, and (iii) how to perform a few simple statistical
analyses.
To continue learning R, I recommend the tutorial at
http://www.cyclismo.org/tutorial/R/index.html
http://tryr.codeschool.com/levels/1/challenges/1
R Tutorial: Installation
R Tutorial: Installation
Step 2: Commands in R
This next section requires that you download a data set and put it on your com
puter desktop.6 The file used in the exercise can be found at
http://kelleybioinfo.org/algorithms/basics/programming/RTestData.txt
Notes
1. The sample names cannot start with a number. For instance, you cannot put
001 instead of S001. If you have numbers, put a letter in front of them.
2. Do not allow spaces in any of the names of variables. No funny symbols or
special characters. Letters and numbers only.
3. Empty cells in a data set, called a data frame in R, must be replaced by NA (for
“not available”).
The next part loads the data into R. This is called the dataframe. In this case, read
the data in using the read.table function and assign it to the variable d (for “data”).
Note that many textbooks and tutorials assign values to varia bles using the arrow
symbol. For instance, the code shown in step 2 could be written
d<-read.table(“Desktop/RTestData.txt”,header=TRUE)
However, most programming languages use the equal sign to assign variables. I
think it looks much better than the arrow, and it works great in R. You can use ar
row symbols if you like, but it takes twice as many keystrokes.
After you have installed R and loaded up the test data set, try outsome of the
very exciting statistical analyses!
186 COMPU TATIO NAL B IOL OGY
Step 1: You need a text-only file that R can read. I created a tab-delimited text file called RTest-
Data.txt. This data is from a gum disease study and has abundances of bacteria found in the
human mouth.
id=Code for each patient. Two rows for each subject: one before and one
after gum cleaning.
strep=Percentage of Streptococcus bacteria
lepto=Percentage of Leptotrichia bacteria
prev=Percentage of Prevotella bacteria
fuso=Percentage of Fusobacteria bacteria
veil=Percentage of Veillonella bacteria
time=Time that sample was taken: 1–before gum cleaning; 2–after gum cleaning
status=Disease status: 1 is healthy, 2 is diseased gums
pocket=Average gum pocket depth across all the teeth in the mouth (in millimeters)
deepest=Depth of the deepest gum pocket in the mouth (in millimeters)
RTestData.txt can be downloaded by saving it as a text file or by copy/pasting data into a text file:
http://kelleybioinfo.org/algorithms/basics/programming/RTestData.txt
Step 2: To read in your data set, you need to know where the data set is on your computer.
(I made it easy and put it on my desktop.) Then type the path to the folder/directory in the
console and hit return.
B I O I N FO R M ATI C S PR O G R A M M I N G : A PR I M ER 187
To read in your data set in Windows, you have to find the path to the file. To find
the path, right-click the data file and choose “Properties” at the bottom of the
menu. You will get a window that looks like this:
Note that after reading in the variable, to access the data associated with the
variable, you must use the $. Because it is annoying to keep typing the dollar
sign, I copied the data to a new variable name like this: strep=d$strep
Introduction to Python
Python has become the most widely used programming language in bioinformat
ics, and for good reason. Not only is Python a flexible, fully functional program
ming language used in applications around the world, but also it was designed to
188 COMPU TATIO NAL B IOL OGY
be relatively easy to learn. Guido van Rossum,7 Python’s creator, combined the
style of the C programming language with the ease of a simpler learning lan
guage called ABC. The result was a clean and rapid scripting language with a simple
syntax that results in fewer bugs. Python really shines in bioinformatics because
it is wonderful for opening, reading, and parsing data files, and because it is a tre
mendous so-called glue language. With Python it is easy to glue together many
different programs (R, Unix, C, and Python) into a powerful analysis workflow.
The Python exercises in this chapter assume the use of PYTHON 3.x version
of the programming language. The current version as of this writing is Python 3.6.
Python is available for download and installation at
https://www.python.org/
if it is not already on your system (check this). You can read more about why Py
thon is awesome at
https://www.python.org/about/success/
The tutorial here includes a short primer on using the Python Interpreter and
an example of how to write, save, and execute a simple programming file. If you
want to learn more after finishing this, there are many excellent free tutorials, in
cluding on the Python website.
Python tutorial:
https://docs.python.org/3/tutorial/index.html
Python for total beginners:
https://wiki.python.org/moin/BeginnersGuide/NonProgrammers
https://automatetheboringstuff.com/
There are loads of other resources and books for learning Python, including a
book called Python for Biologists, by Martin Jones.
After installing Python, you should be able to invoke Python on the command line in
a terminal window by typing the name of the program. This opens the Python Inter-
preter, where you can run Python code directly in the terminal by typing ‘python’
and hitting the Enter key.
B I O I N FO R M ATI C S PR O G R A M M I N G : A PR I M ER 189
This part assumes that you have installed a version of Python 3 and can run
the program in a terminal window. There are other ways to run Python, including
with a text editor like Emacs.
Let’s do something a little more interesting. This code snippet prints the numbers
from 5 to 9. How would you change the loop to print numbers from 3 to 101? Try
it yourself!
This part shows you how to save a file and execute it using Python on the com
mand line. Most people save their Python files by adding a “.py” file extension at
the end of the name. This allows you to remember that it is indeed a Python pro
gram file, and many programs will also recognize this as a Python file by the file
extension. You will need to use a text editor to save your file; there are many to
choose from, including editors that come with your OS, such as Notepad or nano.8
Once you quit the Python interpreter (type Control-d to quit) all your work disap
pears. The interpreter is nice for testing out simple functions or for doing calcula
tions, but your work is lost after you quit. To save your program, you need to write it
in a separate file. Below we save our work with the Emacs text editor, a commonly
used text editor for programming. (The nano program is an editor that often comes
with Linux. Type nano at the prompt.
190 COMPU TATIO NAL B IOL OGY
After you finish these primers you can work your way through the excellent web
sites, online pdfs, and books mentioned throughout the chapter. Armed with this
knowledge, you are on your way towards rapid, custom-designed high-through
put analysis of your own crazy data. Both R and Python have scores of freely
available bioinformatics-specific libraries that will allow you to implement the al
gorithms found in the book and much more—including parsing data sets and ana
lyzing new data.
Biopython: http://biopython.org
Bioconductor: https://www.bioconductor.org
Certainly, one of the best ways to truly learn a programming language is to have
a project that you really want, or really need, to do. Perhaps you need to reformat
a data set to run with specialized analysis software (Python), or you need to per
form stepwise multiple regression analysis (R), or maybe you need to create a
rapid workflow that uses a series of unrelated programs (Unix). Programming can
be a challenging, error-filled journey, but it’s worth it when you experience that
feeling of glee as your code crunches through thousands of data points to pro
duce an analysis of unsurpassed beauty. Then you’re ready to take on the world!
Notes
1. Linux was created by a bloke named Linus Torvald, ergo “Linus’s Unix,” shortened to “Linux.”
2. http://www.windowscentral.com/how-install-bash-shell-command-line-windows-10
3. From the R website: “CRAN is a network of ftp and web servers around the world that store
identical, up-to-date, versions of code and docum entation for R.”
4. Inside R packages can be readily installed by using the install.package command. For exam
ple, to install ggplot2, use install.packages(“ggplot2”). To use the new library, type library
(ggplot2).
5. https://cran.r-project.org/web/views/Multivariate.html
6. To save this file from a web browser, you can, for example, click “Save As” to save as a text
file in Firefox, or click “Save As” and select “page source” in Safari. Also, you can copy/paste
into a text-only processor like nano or Emacs. To check that it is a text file, you can use the
command head or less. If you see a lot of garbage, it is not text-only. Note that Microsoft
Word is not a text-only processor.
7. van Rossum was a massive Monty Python fan, hence the name: Python.
8. I prefer the Emacs editor that can be installed for allOSs, though it does have a steep learn
ing curve.
INDEX
Adenine, nucleotide base in DNA and pairing, 9, 10 BLAST (Basic Local Alignment Search Tool) algorithm, 1
Alanine, propensities for, 59 algorithm activity, 36–38
Alignment page, 1–3 BLAST It, 31–32
Amino acids BLAST tutorial, 40
baseball analogy for likelihood in substitution, characterizing protein in Escherichia coli
165–166 genome, 35
chemical structures of, 16 concepts of, 36–38
free energies for transfer of, 53 interactive exercise, 38, 39
PAM and BLOSUM matrices, 160–161 lab exercises, 40–44
propensities, 59, 61 learning objectives, 36
substitution bias, 159–160 massive parallelization of, 32, 34
substitution matrices, 157–159 motivation, 36
Venn diagram of properties and genetic code, 160 power of, 33–35
Androgen receptor, 136 results of search of hypothetical FX093345
Anthrax toxin, representation of, 48 nucleotide, 32, 33
Archaea, phylogenetic tree of life, 134 sequence alignment, 158
Asparagine, propensities for, 59 Blastp Tutorial Link, 175, 176
BLOSUM (Blocks Substitution Matrix), 157–159
Bacteria, phylogenetic tree of life, 134 activity generating PAM and, 163–166
Bacterial gene, general structure of, 19 calculation of, 169–171
Baseball analogy, likelihood in amino acid substitution, PAM and, 160–161
165–166 BLOSUM Interactive Link, 172
Bioinformatics, 5 Bootstrap
computer and, 6–8 analysis, 138
hidden Markov Models (HMMs), 161–162 phylogenetics, 137–139
methods, 48–50 term, 139n7
power of, 33–35
protein, 47–48 Chernobyl chicken, sequence, 100–101
Bioinformatics software, 1–3 Chou-Fasman algorithm, 50, 58, 157
Python programming language, 187–190 Chou-Fasman Interactive Link, 61
R programming language, 183–187 Clustal Omega alignment program, 72, 74, 85–86, 88
Unix operating system, 179–183 Clustal Omega Tutorial Link, 83
Biological databases and data storage ClustalW program, 72
concepts, 21–25 Computer. See also Bioinformatics software
exercises, 25–28 bioinformatics, 6–8
learning objectives, 20–21 DNA in the, 8–11
motivation, 20 protein sequences in the, 13–14, 16–17
Biological molecules, properties, 5 RNA in, 11–13
191
192 INDEX