

COMPUTATIONAL

BIOLOGY
A HYPERTEXTBOOK
SCOTT T. KELLEY
Department of Biology
San Diego State University
San Diego, California
AND
DENNIS DIDULO
Becton, Dickinson and Company
San Diego, California

Washington, DC
Copyright © 2018 American Society for Microbiology. All rights reserved.
No part of this publication may be reproduced or transmitted in ­
whole or in part or reused in any form or by any means, electronic
or mechanical, including photocopying and recording, or by any information
storage and retrieval system, without permission in writing from the publisher.

Disclaimer: To the best of the publisher’s knowledge, this publication provides
information concerning the subject matter covered that is accurate as
of the date of publication. The publisher is not providing legal, medical,
or other professional services. Any reference herein to any specific
commercial products, procedures, or services by trade name, trademark,
manufacturer, or otherwise does not constitute or imply endorsement,
recommendation, or favored status by the American Society for
Microbiology (ASM). The views and opinions of the author(s) expressed
in this publication do not necessarily state or reflect those of ASM,
and they shall not be used to advertise or endorse any product.

Library of Congress Cataloging-­in-­Publication Data


Names: Kelley, Scott T. (Scott Theodore), author. |
Didulo, Dennis, author.
Title: Computational biology : a hypertextbook / Scott T. Kelley, Department
of Biology, San Diego State University, San Diego, California, and Dennis
Didulo, Becton, Dickinson and Company, San Diego, California.
Description: Washington, DC : ASM Press, [2018] | Includes index.
Identifiers: LCCN 2017051454 (print) | LCCN 2017052307 (ebook) | ISBN
9781683670032 (ebook) | ISBN 9781683670025 (pbk.)
Subjects: LCSH: Computational biology.
Classification: LCC QH324.2 (ebook) | LCC QH324.2 .K45 2018 (print) | DDC
570.285--dc23
LC record available at https://lccn.loc.gov/2017051454

All Rights Reserved


Printed in the United States of Amer­i­ca

10 9 8 7 6 5 4 3 2 1

Address editorial correspondence to


ASM Press, 1752 N St., N.W.,
Washington, DC 20036-2904, USA

Send ­orders to ASM Press, P.O. Box 605, Herndon, VA 20172, USA
Phone: 800-546-2416; 703-661-1593
Fax: 703-661-1501
E-­mail: [email protected]
Online: http://­www​.­asmscience​.­org

To Kina and Aidan, my wonderful and supportive family.

And to my brother Brian, who selflessly donated his kidney, without
which I would not have had the energy to write this book.

CONTENTS

Preface
For the Instructor
For the Student
Acknowledgments
About the Authors

CHAPTER –1 Getting Started

CHAPTER 00 Introduction
Activity 0.1: Biological Databases and Data Storage

CHAPTER 01 BLAST
Activity 1.1: BLAST Algorithm

CHAPTER 02 Protein Analysis
Activity 2.1: Hydrophobicity Plotting
Activity 2.2: Protein Secondary Structure Prediction

CHAPTER 03 Sequence Alignment
Activity 3.1: Dynamic Programming

CHAPTER 04 Patterns in the Data
Activity 4.1: Protein Sequence Motifs
Activity 4.2: Position-Specific Weight Matrices

CHAPTER 05 RNA Structure Prediction
Activity 5.1: RNA Structure Prediction

CHAPTER 06 Phylogenetics
Activity 6.1: Phylogenetic Analysis

CHAPTER 07 Probability: All Mutations are not Equal (-ly Probable)
Activity 7.1: Generating PAM and BLOSUM Substitution Matrices

CHAPTER 08 Bioinformatics Programming: A Primer

Index

PREFACE

This textbook is a hypertextbook. Half of the textbook material lies between
the pages of this book and the other half on the In­ter­net. It seems nat­u­ral
that a hypertextbook, which com­bines print and on­line apps for mo­bile tech­
nol­ogy, would be a great way to learn the ba­sics of bioinformatics, which
uses in­for­mat­ics (com­pu­ta­tional) the­ory to study bi­o­log­i­cal da­ta.
This book was born out of a mix of necessity and inspiration.1 The necessity
came from the dearth of bioinformatics in­struc­tional ma­te­ri­als ap­pro­pri­ate for my
com­bi­na­tion of bi­­ol­ogy stu­dents, with lit­tle or no com­puter back­ground,2 and
com­puter sci­ence stu­dents, who were in­ter­ested in the field but had lit­tle un­der­
stand­ing of bi­­ol­ogy. The need be­came acute when I learned that my fa­vor­ite bio-
informatics lab man­ual, Bioinformatics for Dummies (BFD3), would no lon­ger be
up­dated. BFD was a great lab man­ual for learn­ing how to per­form ba­sic bioinfor-
matics data anal­y­sis. This book did not ex­plain the prin­ci­ples be­hind the al­go­
rithms, but I could cover those dur­ing lec­tures. BFD was clear and fun to read and
pro­vided prac­ti­cal skills for bi­­ol­o­gists and oth­ers look­ing to an­a­lyze data. Unfortu-
nately, the most re­cent ver­sion was printed in 2007!
I kept us­ing the old edi­tion of BFD for some time, but even­tu­ally the tu­to­ri­als
be­came ob­so­lete and the stu­dents took lon­ger and lon­ger to com­plete the ex­er­
cises. In fact, sev­eral pas­sages of BFD were ob­so­lete a few months af­ter the
book was printed. Bioinformatics web­sites are con­stantly chang­ing, in­clud­ing
their de­signs and the URL links, and some­times the pages them­selves dis­ap­pear
al­to­gether. Since I be­gan writ­ing this book, two of the web­sites I teach in the book
and on­line ma­te­ri­als changed sig­nif­i­cantly, and one dis­ap­peared al­to­geth­er.
This led to my orig­i­nal in­spi­ra­tion for the hy­per- part of this hypertextbook.
What if I made my own bioinformatics tu­to­ri­als and sam­ple test data for com­
monly used anal­y­sis tools on­line in eas­ily up­dated fi­les? That way, when a link
changed or the pro­gram­mers moved a ra­dio but­ton around, I could eas­ily al­ter the
tu­to­rial to re­flect these changes in real time. Students would not have to wait for
a new ver­sion of a book to have an ac­cu­rate tu­to­ri­al.
The next in­spi­ra­tion arose from my use of pa­per-based puz­zles and prob­lems to
teach the bioinformatics algorithms. The problems I taught in class, combined with
the anticipatory conceptual exercises and lecture material, were very successful
for teach­ing how the meth­ods worked. Unfortunately, pa­per-based prob­lems also
had sig­nif­i­cant draw­backs: the stu­dents were given only one prac­tice prob­lem per
al­go­rithm, and they re­ceived very lit­tle feed­back as a re­sult. Typically, I would (1)
teach the method, (2) do an ex­am­ple with the stu­dents in class, (3) as­sign it for
home­work, (4) get it back a week later, and (5) re­turn it with feed­back a week af­ter
that. And that was it.
Fortunately, I re­al­ized that the struc­ture of the al­go­rithm puz­zles I taught would
be per­fect for touchscreen de­vices and lap­tops. Most of them in­volved ei­ther
slid­ing let­ters around or fill­ing in boxes with num­bers, both eas­ily done with a fin­
ger or a mouse. With the col­lab­o­ra­tion of my bioinformatics web­site de­signer and
co­au­thor Dennis Didulo, I cre­ated in­ter­ac­tive learn­ing tools that pro­vide lim­it­less
prac­tice and in­stant feed­back for stu­dents. When we combined the bioinformat-
ics soft­ware tu­to­ri­als and test data into one site, we had a com­pre­hen­sive learn­
ing par­a­digm for in­tro­duc­tory bioinformatics. (See “For the Student” sec­tion
be­low for an out­­line of the web­site fea­tures.)
In my class, I noticed an immediate increase in algorithm comprehension and
problem-solving ability. Students gained much more practice, received more
feedback, and performed much better on tests. Because new problems could easily
be generated at random, each student had their own personal data set. Best of all,
I could now quickly generate new exam questions and answers with the click of a button!
And what did my stu­dents think? These quotes speak for them­selves.

“It is a won­der­ful learn­ing tool. The on­line pro­grams made learn­ing the
al­go­rithms al­most easy.”—Ruby, un­der­grad­u­ate stu­dent

“I didn’t want to tell you how much I liked the web­site be­cause I didn’t
want your ego to get too big.”—Emily, un­der­grad­u­ate stu­dent

“It was much bet­ter than that bioinformatics cat vid­eo.”—Pe­dro, grad­u­ate
stu­dent

“You can learn bioinformatics while waiting in line at the DMV or sitting
on your couch eating cheese puffs!”—Anonymous

Notes
1. Much like the in­ven­tion of the salad spin­ner.
2. Many biology students tell me flatly that they are “bad with computers” or even state that
“computers hate [them].” For the record, computers really don’t care about you at all, which
is why we should never give them weapons (see the film “Terminator”).
3. BFD, the bud­ding bioinformatician’s BFF.
FOR THE INSTRUCTOR

This hypertextbook can be used in a number of ways: as the text for a lecture or online
course, as an outline for a course, or as a source for just the sections
of interest. It is important to note that, being a hypertextbook, the web
components are not supplemental, but instead are crucial for being able to
understand the content presented in the physical book. In my classes, I use
the in­ter­ac­tives in­side of class, and the stu­dents also use them out­­side of class
to help them solve al­go­rithm prob­lems or pre­pare for ex­ams. Generally, I use the
fol­low­ing ap­proach:

1. Teach the bi­o­log­i­cal rel­e­vance and back­ground of the meth­od.

2. Have stu­dents solve the con­cep­tual (an­tic­i­pa­tory) ex­er­cise in class, shar­ing
an­swers with one an­oth­er.

3. Lecture on the ba­sics of the al­go­rithm.

4. Have students bring out their mobile devices (laptops, smartphones, and
tablets) and solve the interactive problems.

5. Have stu­dents share their an­swers with neigh­bors in class and with the
in­struc­tor.

Then, to make sure the stu­dents prac­tice at home, I as­sign the pa­per-based prac­
tice prob­lems. Finally, in the com­puter lab, or for home­work, I as­sign the lab ex­er­
cises with the soft­ware based on the al­go­rithms.
So far, the ap­proach has been a great suc­cess in my classes. The on­line tools
in­crease com­pre­hen­sion and im­prove exam re­sults, and the eas­ily up­dated tu­to­
ri­als for bioinformatics anal­y­sis soft­ware and bi­o­log­i­cal da­ta­bases re­duce a lot of
stu­dent frus­tra­tion. I hope it proves as suc­cess­ful in your class as it has in mine.

FOR THE STUDENT

This textbook is really a hypertextbook, meaning that much of the most exciting
learning happens online. Close to half of the book materials are online,
and in each chap­ter you will be di­rected to the on­line tools as­so­ci­ated with
the text. The idea is to le­ver­age the uniquely pow­er­ful as­pects of the In­ter­net
to help you learn about bioinformatics. The puz­zle-like na­ture of bioinformat-
ics al­go­rithms makes them es­pe­cially suited to in­ter­ac­tiv­ity and “gamification”
(mak­ing dif­fi­cult things into games with points and scores). The in­ter­ac­tive na­ture
of mo­bile de­vices and their con­nec­tion to on­line bioinformatics soft­ware make
them use­ful learn­ing tools for un­der­stand­ing the the­ory be­hind bioinformatics
meth­ods (the al­go­rithms) and for gain­ing prac­ti­cal ex­pe­ri­ence with their im­ple­
men­ta­tion (soft­ware anal­y­sis and da­ta­bases). In or­der to en­hance learn­ing of
the prin­ci­ples be­hind bioinformatics al­go­rithms and make them more en­gag­ing,
the on­line re­sources have been de­signed to

• Be interactive, with touchscreen puzzle-like problem sets that provide instant
feedback

• Be multiplatform, usable on computers, tablets, and smartphones

• Be highly practical, with direct links to data analysis websites and including
test data sets and step-through tutorials

• Be easily updated, because bioinformatics websites change constantly and
tutorials often need adjustment

• Allow plenty of practice through instant “random” problem generation and
quizzes

ACKNOWLEDGMENTS

I wish to thank the leadership of the California State University Program in Education,
Research, and Biotechnology (CSUPERB) and the grant reviewers who
ap­proved my pro­posal on mo­bile app ed­u­ca­tion tech­nol­ogy that pro­vided the
seed money for de­vel­op­ing the in­ter­ac­tive tech­nol­ogy and web re­sources.
I thank Greg Payne at ASM Press for lis­ten­ing to my ideas and tak­ing them
seriously and for his sup­port dur­ing the writ­ing and pub­lish­ing pro­cess, and I thank
my col­league at SDSU, Da­vid Lipson, for tell­ing Greg about my pro­ject. I thank the
hun­dreds of bioinformatics stu­dents who took my course at SDSU, who helped
me re­fine my al­go­rithm teach­ing meth­ods from their sub-alpha de­vel­op­ment
pencil-and-paper stages all the way through to the interactive app stage. You are
the reason I do all this in the first place. I give special thanks to my spouse Kina
Thackray for her advice during the long process of developing the bioinformat-
ics learning algorithms, for encouraging me to submit grant proposals, and for
her very helpful comments on multiple drafts of the book. Finally, I need to thank
my former biometry professor Dr. Michael Grant, who taught me statistics
and introduced me to programming (SAS), and Dr. Gary Stormo, who graciously
al­lowed me to pur­sue bioinformatics as a post­doc­toral re­searcher in his lab.

ABOUT THE AUTHORS

Scott T. Kelley is a Professor of Biology at San Diego State
University. He has a Ph.D. from the University of Colorado
and a B.A. from Cornell University. His lab at San Diego
State University combines phylogenetic methods and
culture-independent molecular tools to study environmental
microbiology. Dr. Kelley has published extensively on the
human microbiome, the built environment, and many
natural environments. He has published many papers on
bioinformatics and has helped develop some widely used
tools for analyzing next-generation sequence data sets for
microbial communities. He has received research grants from the National Institutes
of Health, the National Science Foundation, the Alexander von Humboldt
Foundation, and the Alfred P. Sloan Foundation, among others. He has served on
the scientific advisory board of the Clorox Company, and his work has been featured
by The New York Times, NPR, CBC (Canada), Time Magazine, and Der Spiegel,
among numerous others. He is a massive fan of the FC St. Pauli and Everton FC
football clubs; loves punk rock, jazz, and classical music; speaks German for fun;
and makes a mean apple pie. You can follow Scott on Twitter @kelleybioinfo.

Dennis Didulo has been a Data Analytics/Software Engineer
at CareFusion since 2014 and a Software Test Engineer
at Becton, Dickinson and Company since 2016 and also
teaches online database and programming courses for the
University of Maryland University College. He received his
master’s degree in information technology at De La Salle
University and his master’s degree in bioinformatics at San
Diego State University. Dennis has professional development
expertise in more than a dozen computer languages,
as well as expertise in database management, algorithm
design, and systems engineering. Dennis is a proud father of five grown children,
whom he surprised by flying back unannounced to the Philippines for a visit.
CHAPTER
–1
GETTING STARTED

Using the Website


Direct the browser on your phone, computer, or
tablet to the following website:
http://www.kelleybioinfo.org

There you will see the homepage, as shown at right.
Touching or clicking an icon (e.g., “Alignment”)
will take you to a new page that has tools related
to the icon topic. The Alignment, Motifs, and
Phylogeny buttons teach algorithms and tools for
many types of sequence analysis with DNA,
RNA, and proteins. The Protein and RNA buttons
focus on algorithms for predicting structural features
of the functional macromolecules, while
the Probability button teaches how to generate
substitution matrices.

Example: The Alignment Page


Clicking or touch­ing the Alignment button will take
you to the fol­low­ing page, which be­gins with the
BLAST al­go­rithm in­ter­ac­tive tool. All the pages
use this ba­sic de­sign.


Ge­ne­ral fea­tures

Information on the in­ter­ac­tive learn­ing tool



Tutorials and test data for on­line bioinformatics soft­ware

While most of the pages look like the Alignment page, the Basics page is or­
ga­nized dif­fer­ently and mostly con­tains in­for­ma­tion and tu­to­ri­als.

How To Use This Book


I will as­sume you are fa­mil­iar with how to read/use a book, but remember that
the physical book is meant to be used in conjunction with the online compo­
nent. Throughout the text you will be directed to online modules via URLs and
QR codes. The online material is not supplemental, but is a critical portion of
this hypertextbook.
CHAPTER
00
INTRODUCTION

The word bioinformatics refers to the computational analysis of complex
biological data. The “bio-” prefix indicates biology, of course, while “informatics”
is the science of data processing, storage, and retrieval (a.k.a. information
science) that first developed in the 1960s. Bioinformatics itself dates
back to the early 1970s, when computers were first used to analyze molecular
sequences. While our knowledge of biological processes, the amount of molecular
data, and the speed and throughput of computation have all expanded
dramatically, the field of bioinformatics still primarily focuses on the analysis of
three critical biological molecules: DNA, RNA, and protein.
These molecules are critical to the cellular processes of all living organisms,
and the analysis of the composition and patterns of these molecules should in
theory reveal all the secrets to life. (Or, as Dr. Frankenstein would say, “It’s alive!
Bwahaha!”) In fact, because DNA encodes the information for all the RNA and
protein in every cell, analysis of DNA sequence patterns comprises the majority of
bioinformatics. RNA and protein sequences are also analyzed using specific
bioinformatics algorithms, but the sequences of these molecules are often
computationally determined from the DNA sequence in one way or another (see below).
The pur­pose of this chap­ter is to ex­plain the gen­eral prop­er­ties of these bi­o­
log­i­cal mol­e­cules and how they are rep­re­sented and stored in the com­puter. It is
crit­i­cal to un­der­stand the con­nec­tion be­tween the data you ob­serve in com­puter
fi­les and the bi­o­log­i­cal mol­e­cules. Otherwise the data anal­y­sis and da­ta­bases
that store these data will make lit­tle sense. We also briefly dis­cuss what is known
as the cen­tral dogma of mo­lec­u­lar bi­­ol­ogy, how the DNA in­side cells is “read” by
the cel­lu­lar ma­chin­ery, and the gen­eral struc­ture of the gene. The in­tro­duc­tions to
each chap­ter pro­vide ad­di­tional back­ground in­for­ma­tion about the struc­ture and
func­tion of DNA, RNA, and pro­teins and how bioinformatics can be used to an­a­
lyze dif­fer­ent as­pects of these mol­e­cules.


Why Bioinformatics?
When non­sci­en­tists ask me what I do for a liv­ing, I tell them I’m a com­pu­ta­tional
bi­­ol­o­gist. This ox­y­mo­ron usu­ally elic­its a con­fused ex­pres­sion (“You study the
­bi­­ol­ogy of com­put­ers? Say what?”). I quickly fol­low this by ask­ing them if they
have heard of DNA and the hu­man ge­nome, which most peo­ple have by now.
Then I tell them that the DNA that makes up the hu­man ge­nome is re­ally just 3
BILLION LETTERS in a com­puter. Here is a lit­tle snippet of DNA in­for­ma­tion from
the hu­man ge­nome:

AGAAAATCACCCTTCCCAGGGGGAAGGTGCTGGGCAGTGGCACTGCCTCTTGGGGGAAGAGGTTGGGCAG
GGGCTGACGGGCAATGGCAGATGACAGCATCCAAACTTCCACACACAGAGTCTGTTCCTTCCTCTTCCCC
GTGCCATCCCAACTCCCTTCTGCCTTGTCATCTACGTCATGGGAAGCAGGTGACATATCTGGCAAGTTAT
TTTGGGGGCCTGGCTTCTCCCAGGTGAAGAGGGAGCAGCAGCTGGAGGGGCAGAAAGAGGGGACAGGGAG
GGGCTGGAGGGCACAGCTGAAGACAGCCTGGGAGGTGACTGTCATCCCCTCCAGTCTCTGCACACTCCCG
GCTGCAGCAGAGCAGGAGGAGAGAGCACGGCCTGGAATGCTAATTTGCCAGGAGCTCACCTGCCTGCGTC
ACTGGGCACAGACGCCAGTGAGGCCAGAGGCCGGGCTGTGCTGGGGCCTGAGATGGGGTGGTGGGGAGAG
AGTCTCTCCCCTGCCCCTGTCTCTTCCGTGCAGGAGGAGCATGTTTAAGGGGAAGGGTTCAAAGCTGGTC
ACATCCCCAACAAAAAAGCCCACGGACAACGAAAAGCCCACTCGCTTGTCCAGTGCCACAGGAGGGGGCA
AGTGGAGGAGGAGAGGTGGCGGTGCTCCCCACTCCACTGCCAGTCGTCACTGGCTCTCCCTTCCCTTCAT
CCTCGTTCCCTATCTGTCACCATTTCCTGTCGTCGTTTCCTCTGAATGTCTCACCCTGCCCTCCCTGCTT
GCAAGTCCCCTGTCTGTAGCCTCACCCCTGTCGCATCCTGACTACAATAACAGCTTCTGGGTGTCCCCGG
CATCCACTCTCTCTCCCTTCTTATCCCTTCCGTGACGGATGCCTGAGGAACCTTCCCCAAACTCTTCTGT
CCCATCCCTGCCCTGCTCAAAATCCAATCACAGCTCCCTAACGCTCCTGAATCAACGTGAAGTCCTGTCT
TGAGTAATCCGTGGGCCCTAACTCACTCATCCCAACTCTTCACTCACTGCCTTGCCCCACACCCTGCCAG

After examining this DNA se­quence, try an­swer­ing the fol­low­ing ques­tions:

1. Is this ac­tu­ally hu­man DNA? If not, what or­gan­ism is it from?


2. What is its bi­o­log­i­cal func­tion?
3. Is it go­ing to kill you? (Hey, it could be po­lio­vi­rus DNA. How would you know?)

My guess is that without a computer and bioinformatics, you don’t stand a chance
of answering these questions correctly. The above DNA sequence is a small
fragment of human DNA from chromosome 12. (Or it could be from
a vampire bat. Keep reading to find out!) The snippet is 1,050 letters long, so the
entire human genome contains nearly 3 million times this much information. (BTW,
this information is called sequence information because it is linked together as a
sequence of letters.) And look how boring it is! The same 4 letters (A, G, C, and T)
over and over again in different combinations. RNA and protein information looks
pretty similar in the computer, except that
one can tell RNA sequence data apart because it contains U instead of T. Protein
sequence data are also easy to differentiate because up to 21 different letters
representing the various amino acids are used in the sequences.
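
The alphabet rule described above can be sketched in a few lines of Python. This is only an illustration, not part of the book's online tools: the function name guess_sequence_type and the three alphabet constants are made up here, and the protein alphabet is assumed to be just the 20 standard one-letter amino acid codes. Note that a short sequence built only from A, C, and G fits all three alphabets, so the sketch simply reports the first alphabet that matches.

```python
# Hypothetical alphabets for illustration; real files may also use ambiguity codes.
DNA_LETTERS = set("ACGT")
RNA_LETTERS = set("ACGU")
PROTEIN_LETTERS = set("ACDEFGHIKLMNPQRSTVWY")  # 20 standard amino acids


def guess_sequence_type(sequence: str) -> str:
    """Guess whether a string is DNA, RNA, or protein from its letters alone."""
    letters = set(sequence.upper())
    if not letters:
        return "unknown"
    # Check the smallest alphabets first, because DNA letters are also
    # valid amino acid codes.
    if letters <= DNA_LETTERS:
        return "DNA"
    if letters <= RNA_LETTERS:
        return "RNA"
    if letters <= PROTEIN_LETTERS:
        return "protein"
    return "unknown"


print(guess_sequence_type("AGAAAATCACCC"))  # DNA
print(guess_sequence_type("GUUUAAGGGACA"))  # RNA
print(guess_sequence_type("RCDRGLAQCHTV"))  # protein
```
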
Here is some RNA se­quence in­for­ma­tion:

GUUUAAGGGACACCGCAGAAAUGGUGAAUACAAUGAAGACAAAGCUGUUGUGUGUACUGCUGCUUUGUGG

And here is some pro­tein se­quence in­for­ma­tion:

RCDRGLAQCHTVPVKSCSELRCFNGGTCWQAASFSDFVCQCPKGYTGKQCEVDTHATCYKDQGVTYRGTW

Granted, the protein sequence is a little more interesting, but it is still pretty
mind-numbing to stare at all day. However, mind-numbing tasks are exactly what
computers were built for: determining the positions of all the known stars in our
galaxy, calculating compound interest for billions of bank accounts, and searching
through all the house cat video URLs on the Internet, among other things.
In fact, the amount of mo­lec­u­lar se­quence data has grown so vast, and the
tech­nol­o­gies for gen­er­at­ing DNA se­quence in­for­ma­tion from or­gan­isms have
­be­come so ef­fi­cient, that com­puter pro­ces­sor and hard drive tech­nol­o­gies are
start­ing to fall be­hind the rate of bi­o­log­i­cal in­for­ma­tion gen­er­a­tion. The graph in
Fig. 0.1 shows the ex­po­nen­tial growth in the databanks from 1982 to 2008.
In 2008, the amount of information depicted in Fig. 0.1 was considered extreme.
Now a single researcher can, in a single day, generate
DNA sequence information equivalent to all the sequence information
available in 2008.
So, what can one do with all these data? That question is the principal subject
of this book, namely, how bioinformatics algorithms and banks of fancy computers
can make sense of this growing mountain of molecular sequence data. In the
next few sections you will learn a little about these critical biological molecules
and how letters of the alphabet can be used to represent and store them in
computers. Then, having provided a basic understanding of molecular biology and
its computational representation, the rest of the book will focus on teaching you
about the algorithms used to analyze these data.

FIGURE 0.1. Exponential growth of GenBank database from 1982 to 2008. Courtesy
of National Library of Medicine.


DNA in the Computer


Deoxyribonucleic acid, oth­er­wise (thank­fully) known as DNA, is life’s sin­gle most
im­por­tant mol­e­cule. DNA un­der­pins vir­tu­ally all­ of bi­­ol­ogy. With the ex­cep­tion of
a few vi­rus­es,1 life en­codes it­self us­ing the chem­i­cal nu­cle­o­tides of the DNA dou­
ble he­lix. Every liv­ing cell con­tains its mo­lec­u­lar in­for­ma­tion in the form of DNA,
including the tens of trillions of cells in the human body and every animal, plant, fungal, and
bac­te­rial cell on the planet. Placed end to end, the DNA from a sin­gle hu­man cell
could stretch an as­ton­ish­ing 3 me­ters.2 The amount of raw in­for­ma­tion con­tained
in this 3-meter length of DNA is sim­il­arly re­mark­able. The unit of mo­lec­u­lar in­for­
ma­tion in DNA is the nucleotide. Thus, the hu­man ge­nome, the com­plete set of
all­ the DNA in­for­ma­tion con­tained in each cell, con­tains ap­prox­i­ma­tely 3 bil­lion
pieces of in­for­ma­tion.
Theoretically, the abil­ity to read and in­ter­pret this chem­ic­ al code should al­
low us to learn a great deal about how cells and or­gan­isms func­tion and in­ter­act.

FIGURE 0.2. Chemical struc­ture of DNA at the atomic lev­el. A, ad­e­nine; T, thy­mine;
C, cy­to­si­ne; G, guanine. Courtesy of Zephyris (Richard Wheeler), under license ­
CC BY-SA 3.0.

The chap­ters of this book show ways in which the com­bi­na­tion of ex­per­i­men­ta­
tion and DNA se­quence anal­y­sis3 can re­veal pow­er­ful new in­sights into mo­lec­u­lar
pro­cesses, cel­lu­lar mech­a­nisms, dis­ease, and bio­di­ver­sity. However, be­fore we
can an­a­lyze DNA se­quences in the com­puter, we must first store the DNA in­for­
ma­tion in a com­puter. How do we rep­re­sent the com­plex bio­chem­i­cal struc­ture
of DNA in a com­puter?
Figure 0.2 shows the chemical structure of DNA at the atomic level, including
all the individual atoms and the bonds between them, for just a small fragment of a
typical DNA double helix. This DNA segment has a total of 14 nucleotide
base pairs. Two of the base pairings are shown on the right side of the figure.
A thymine (T) nucleotide binds to adenine (A), and cytosine (C) binds to guanine
(G). If we were to store this structure in its full 3-dimensional (3D) glory, we would
have to keep track of the position of every atom, bond, and bond angle for more
than 60 atoms per DNA base pair. Remember, too, that the DNA in just one of
your cells is approximately 300,000,000 times longer than the DNA in the figure.
Saving all these data, even in a big computer, is not very practical. Clearly, we
need an al­ter­na­tive to stor­ing ev­ery atom of ev­ery DNA mol­e­cule. Fortunately,
the chem­i­cal struc­ture of DNA is highly re­dun­dant and easy to sim­plify. Figure 0.3
shows that if we zoom in and flat­ten a piece of the struc­ture, we can see the four
nu­cle­o­tides that make up DNA.
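
To put a rough number on "not very practical," here is a back-of-the-envelope estimate in Python. It is a sketch under stated assumptions, not a measurement: it takes the figure of 60 atoms per base pair from the text as a lower bound, assumes each atom needs only three coordinates stored as 4-byte numbers, and ignores bonds and bond angles entirely. All constant names are made up for illustration.

```python
ATOMS_PER_BASE_PAIR = 60           # lower bound quoted in the text
COORDINATES_PER_ATOM = 3           # x, y, z
BYTES_PER_COORDINATE = 4           # assumption: 32-bit floating-point numbers
GENOME_BASE_PAIRS = 3_000_000_000  # approximate length of the human genome

# Storage for the atomic positions of a single base pair.
bytes_per_base_pair = (
    ATOMS_PER_BASE_PAIR * COORDINATES_PER_ATOM * BYTES_PER_COORDINATE
)

# Storage for the atomic positions of the whole genome.
atomic_model_bytes = bytes_per_base_pair * GENOME_BASE_PAIRS

print(bytes_per_base_pair)        # 720 bytes per base pair
print(atomic_model_bytes / 1e12)  # 2.16 terabytes
```

Even with these generous simplifications, the atom-by-atom model needs hundreds of bytes for every base pair, which is why a more compact representation is worth looking for.
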

FIGURE 0.3. Close-up view of a se­quence of DNA show­ing chem­i­cal struc­tures of the four nu­cle­o­tide
ba­ses and their pairings. Solid boxes cor­re­spond to those in Fig. 0.2. Courtesy of Zephyris (Richard Wheeler),
under license CC BY-SA 3.0.

Each of the four nucleotides is composed of three parts: the phosphate that
binds the nucleotides together, the deoxyribose sugar molecule, and the
nitrogenous base itself. The base is really the only interesting bit. The DNA
molecule is a long string of these four nucleotides base-paired to complementary
nucleotides on the opposite strand. The box in Fig. 0.4 shows the A and T binding to
one another, making a base pair. To make it simpler and easier to store in a
computer, we can use single letters in place of the nucleotides: A, T, C, and G.
Single letters are perfect for data storage. The whole human genome is “only”
3 billion letters. At one byte per letter that is about 3 GB of data, comparable to a
feature-length movie file, and a typical laptop has room for many of those. DNA data
storage is even easier when you realize that one must store only half the data. DNA
has two antiparallel strands, that is, strands running in opposite directions. In
Fig. 0.4, the strand on the left runs from 5' (top) to 3' (bottom), while the
rightmost strand runs in the opposite direction.
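
The storage arithmetic for single letters can be checked with a short Python sketch. The numbers assume a genome of 3 billion letters; the 2-bit encoding and the helper name pack are illustrative assumptions, not a format described in this book. Because there are only four letters, each one fits in 2 bits, so four letters can share one byte.

```python
GENOME_LETTERS = 3_000_000_000

one_byte_per_letter = GENOME_LETTERS           # bytes as plain text
two_bits_per_letter = GENOME_LETTERS * 2 // 8  # bytes when packed 4 per byte

print(one_byte_per_letter / 1e9)  # 3.0 gigabytes
print(two_bits_per_letter / 1e6)  # 750.0 megabytes

# Hypothetical 2-bit code for each nucleotide letter.
CODES = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def pack(sequence: str) -> bytes:
    """Pack a DNA string into 2 bits per base, four bases per byte."""
    packed = bytearray()
    for start in range(0, len(sequence), 4):
        chunk = sequence[start:start + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODES[base]
        # Left-align a final, shorter chunk so its bases sit in the high bits.
        byte <<= 2 * (4 - len(chunk))
        packed.append(byte)
    return bytes(packed)


print(len(pack("ACTGACTG")))  # 2 bytes for 8 letters
```

A real packed format would also have to record the sequence length, since a padded final byte cannot otherwise be told apart from trailing A's.
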
Since ad­e­nine ALWAYS binds thy­mine, and gua­nine ALWAYS binds cy­to­sine, if
you know one strand, you au­to­mat­i­cally know the other. If one stores the left­
most strand as ACTG, it is triv­ial to re­con­struct the op­po­site strand as CAGT. The
DNA nu­cle­o­tides are al­ways writ­ten left to right, from the 5' to 3' di­rec­tion.4 This
is the se­quence (the or­der) in which the nu­cle­o­tides are writ­ten. When bi­­ol­o­gists
talk about a DNA se­quence this is what they are talk­ing about. When you have
one strand of a se­quence, you can de­ter­mine the other by find­ing its re­verse
com­ple­ment. To do so, move in the re­verse di­rec­tion, from the end back to the
be­gin­ning, and de­ter­mine the match­ing base for each nu­cle­o­tide.

FIGURE 0.4. Close-up view of base-paired nu­cle­o­tides (boxed). The ar­rows show how
DNA strands run in op­po­site di­rec­tions. A 5'-to-3' strand is paired with a com­ple­men­tary
strand run­ning in the 3' to 5' di­rec­tion.

FIGURE 0.5. Reverse com­ple­men­ta­tion of a DNA se­quence.

For in­stance, say you have the DNA se­quence GACCTTA. To re­verse com­ple­
ment this se­quence, go to the last let­ter, in this case A, and write the com­ple­
men­tary base, T. Move back­wards un­til you reach the be­gin­ning. An ex­am­ple is
shown in Fig. 0.5.
The fi­nal se­quence is the re­verse com­ple­ment, which in this case is TAAGGTC.
Ta-da!
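
The same procedure takes only a few lines of Python. This is a minimal sketch: the dictionary COMPLEMENT and the function name reverse_complement are chosen here for illustration, and the code assumes the input contains only the letters A, T, C, and G.

```python
# Base-pairing rules: A pairs with T, and C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(sequence: str) -> str:
    """Walk the sequence from the last letter to the first, complementing each base."""
    return "".join(COMPLEMENT[base] for base in reversed(sequence.upper()))


print(reverse_complement("GACCTTA"))  # TAAGGTC
print(reverse_complement("ACTG"))     # CAGT
```

Applying the function twice returns the original sequence, which is a handy sanity check when working by hand or in code.
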

RNA in the Computer


DNA mol­e­cules are ex­tremely ex­cit­ing to the bioinformatician, but in the cell, DNA
is rather bor­ing. DNA by it­self does not cat­a­lyze any en­zy­matic re­ac­tion or per­
form any cel­lu­lar ac­tiv­ity. In or­der for cel­lu­lar ac­tiv­ity to oc­cur, the nu­cle­o­tide in­for­
ma­tion of DNA needs to be “read” by the cell’s com­plex cel­lu­lar ma­chin­ery
com­posed of pro­teins and RNA mol­e­cules. The way in­for­ma­tion passes from the
DNA to the cell is known as the cen­tral dogma of mo­lec­u­lar bi­­ol­ogy, and it can be
sum­ma­rized as fol­lows:

DNA ➞ [tran­scrip­tion] ➞ mRNA ➞ [trans­la­tion] ➞ pro­tein

In the first step of this pro­cess, a pro­tein en­zyme com­plex that in­cludes the RNA
po­ly­mer­ase un­winds the DNA dou­ble he­lix and tran­scribes one of the strands into
an RNA mol­e­cule called mRNA. The m in mRNA stands for mes­sen­ger, be­cause
the mRNA is a copy of the mes­sage con­tained in the DNA nu­cle­o­tides.
RNA molecules are very similar to DNA but have a few key differences.

• RNA mol­e­cules are sin­gle-stranded.


• Instead of thy­mine, RNA mol­e­cules use ura­cil.
• RNA molecules con­tain a ri­bose sugar in the nu­cle­o­tide in­stead of a de­oxy­ri­
bose sug­ar.

Figure 0.6 illustrates the basic process of transcription, in which the RNA polymerase
protein complex makes an RNA copy of one strand of the DNA double helix (the
coding strand) by synthesizing a single-stranded RNA molecule that is complementary
to the other strand (the template strand). Along the template strand, the RNA polymerase
moves in a 3' to 5' direction, recognizing each DNA nucleotide, finding a complementary
RNA nucleotide, and adding this nucleotide to the 3' end of the growing RNA molecule.
Once the synthesis is

FIGURE 0.6. Simplified illustration of transcription in eukaryotes (including humans).


Steps 1–6 describe the steps of the molecular process of transcription, in which the RNA
polymerase protein complex (not shown) makes single-stranded RNA using one strand
of the DNA double helix as a template. Step 1 (initiation and elongation) in the figure is
common among all forms of life, while step 2 (termination) differs significantly between
eukaryotes and bacteria. Steps 3–6 do not occur in nucleus-free bacteria. Courtesy of
Kelvin Ma, under license CC BY 3.0.

com­plete, the RNA is un­cou­pled from the DNA, the RNA po­ly­mer­ase falls off,
and the DNA dou­ble he­lix re­forms.
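In the computer, transcription reduces to simple string operations. The sketch below makes some assumptions: both strands are written 5' to 3', and the function names and the short example sequence are hypothetical. It shows that walking the template strand from its 3' end gives the same RNA as copying the coding strand with U in place of T.

```python
# Pair each DNA template base with its complementary RNA base
# (RNA uses uracil, U, where DNA uses thymine, T).
DNA_TO_RNA = {"A": "U", "T": "A", "G": "C", "C": "G"}


def transcribe_template(template_5to3):
    """Build the RNA (5' to 3') from a template strand written 5' to 3'."""
    rna = []
    # The polymerase reads the template 3' to 5', so walk the string
    # backwards and add each new nucleotide to the 3' end of the RNA.
    for base in reversed(template_5to3.upper()):
        rna.append(DNA_TO_RNA[base])
    return "".join(rna)


def transcribe_coding(coding_5to3):
    """Shortcut: the RNA matches the coding strand with U instead of T."""
    return coding_5to3.upper().replace("T", "U")


coding = "ATGGTGCAC"     # hypothetical coding strand
template = "GTGCACCAT"   # its reverse complement, the template strand
print(transcribe_template(template))  # AUGGUGCAC
print(transcribe_coding(coding))      # AUGGUGCAC
```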
A static im­age does not do jus­tice to the beauty of tran­scrip­tion. Fortunately,
the folks who run the DNA Learning Center (DNALC; https://​www.​dnalc.​org/​)
have cre­ated stun­ning an­i­ma­tions of many mo­lec­u­lar pro­cesses. Here is the link
to a 3D an­i­mated video of tran­scrip­tion:

https://​www.​dnalc.​org/​resources/​3d/​12-​transcription-​basic.​html

A more ad­vanced ver­sion can be found here:

https://​www.​dnalc.​org/​resources/​3d/​13-​transcription-​advanced.​html

I highly encourage you to check out DNALC’s collection of high-quality and beautiful
animations.
Much of the RNA synthesized in the cell is eventually translated into a protein
sequence, and as stated earlier, this type of RNA is called messenger RNA (mRNA)
because it carries a message of information on how to synthesize the protein.
Other types of RNA, known as structural or noncoding RNA, are also critical to cellular
function but are not translated into proteins (see Chapter 05). The mRNA can
be modified, shortened, or rearranged and eventually recycled by the cell without
affecting the underlying DNA. Moreover, many RNAs can be synthesized from
the same DNA template very quickly, one after another, and the cell controls the
types and amounts of RNA made.

Protein Translation
The fi­nal des­ti­na­tion of mRNA is the ri­bo­some, the so-called pro­tein fac­tory of
the cell. The ri­bo­some is a mac­ro­mo­lec­u­lar com­plex com­posed of two very large
struc­tural RNA mol­e­cules called ri­bo­somal RNA (rRNA) and a lot of pro­teins. It is
at the ribosome that the data encoded in the DNA is translated (see note 5) into protein in a
fac­to­ry-like man­ner. Figure 0.7 shows a ba­sic il­lus­tra­tion of trans­la­tion.
Again, the DNALC has cre­ated ex­cel­lent trans­la­tion vid­eos that are very worth
watch­ing.

Basic pro­tein trans­la­tion an­i­ma­tion:


https://​www.​dnalc.​org/​resources/​3d/​15-​translation-​basic.​html

Advanced trans­la­tion:
https://​www.​dnalc.​org/​resources/​3d/​16-​translation-​advanced.​html

Protein Sequences in the Computer


Proteins are the molecular machines that make cells work, doing everything from copying
DNA (the DNA polymerase enzyme) to gating molecules in and out of cells
(sugars, ions, etc.) to storing molecular energy (ATP). From its earliest days, much
of bioinformatics has focused on trying to predict the structure of proteins and
other important properties, given their primary sequence (Fig. 0.8).
In the computer, the primary sequence of a protein is represented as a series
of letters, each of which stands for one of the 20 common amino acids, written in
the order in which the amino acids are linked together end to end. Figure 0.9
illustrates the 20 common amino acids, whose chemical structures are complicated
and would be challenging to store in a computer.
The let­ters of the amino ac­ids, like the four ba­ses of DNA, are triv­ial to store in
the com­puter. Furthermore, it is re­ally easy to use com­put­ers to dig­i­tally trans­late
DNA se­quence to pro­tein. The ease of cre­at­ing pro­tein trans­la­tions in the com­puter
from read­ily ob­tain­able DNA se­quences (cheap and fast!) means that com­pu­ta­tional
al­go­rithms are the pri­mary way we learn in­for­ma­tion about pro­tein se­quences in

FIGURE 0.7. Simplified illustration of translation. Steps 1–7 describe the process of
protein translation in which a molecular “machine” called the ribosome translates mRNA
into a protein. The steps shown in the figure are shared by all cellular life, though bacteria
do not contain a nucleus. Courtesy of Kelvin Ma, under license CC BY 3.0.

new organisms. A new bacterial genome, for instance, can be sequenced and
assembled in a day and yield 3,000+ protein sequences (see note 6).
For ex­am­ple, here is a por­tion of the DNA se­quence that codes for the hu­
man he­mo­glo­bin beta pro­tein, part of the com­plex that binds ox­y­gen in red
blood cells:

FIGURE 0.8. Aspects of pro­tein struc­ture. The struc­ture of a pro­tein can be de­scribed
at four dif­fer­ent lev­els. The pri­mary struc­ture is the se­quence of bonded amino ac­ids,
un­folded. The amino ac­ids fold into two ba­sic types of struc­tures known as sec­ond­ary
struc­tures: al­pha he­li­ces and beta sheets. The ter­tiary struc­ture is the 3D struc­ture of the
pro­tein and is the most important for understanding the protein's function. Finally, qua­ter­nary
struc­ture is an ar­range­ment of mul­ti­ple proteins bound together to make a single functional
macromolecule. Image courtesy of Thomas Shafee, under license CC BY 4.0.

FIGURE 0.9. Chemical struc­tures of the 20 common amino ac­ids that are used to
make pro­teins. The let­ters below the names of the amino ac­ids can be used in­stead of the
full names to rep­re­sent the struc­ture. For ex­am­ple, V rep­re­sents va­line, a hy­dro­pho­bic
amino ac­id.

ATGGTGCACCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAG
TTGGTGGTGAGGCCCTGGGCAGGCTGCTGGTGGTCTACCCTTGGACCCAGAGGTTCTTTGAGTCCTTTGG
GGATCTGTCCACTCCTGATGCTGTTATGGGCAACCCTAAGGTGAAGGCTCATGGCAAGAAAGTGCTCGGT
GCCTTTAGTGATGGCCTGGCTCACCTGGACAACCTCAAGGGCACCTTTGCCACACTGAGTGAGCTGCACT
GTGACAAGCTGCACGTGGATCCTGAGAACTTCAGGCTCCTGGGCAACGTGCTGGTCTGTGTGCTGGCCCA
TCACTTTGGCAAAGAATTCACCCCACCAGTGCAGGCTGCCTATCAGAAAGTGGTGGCTGGTGTGGCTAAT
GCCCTGGCCCACAAGTATCAC

Using the ge­netic code (Fig. 0.10) to com­pu­ta­tion­ally trans­late this se­quence
start­ing at the first base, here is the pro­tein trans­la­tion of this DNA se­quence:

MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLG
AFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVAN
ALAHKYH

FIGURE 0.10. Genetic code for eu­kary­otes, like us. The pos­si­ble trip­let co­dons of mRNA
are listed with the amino ac­ids they en­code in eu­kary­otes (bac­te­ria have a slightly dif­fer­ent
ge­netic code). In this ta­ble, the AUG start co­don is shaded in green. The stop co­dons,
which cause ter­mi­na­tion of trans­la­tion, are shaded in pink.

Each group of three nucleotides in the DNA sequence codes for one amino
acid in the protein sequence. For example, the first three nucleotides, ATG, code
for M (methionine). In Fig. 0.10, the table shows the RNA sequences which correspond
to particular amino acids, but one can also use the DNA nucleotides. To
do this: (i) transcribe the DNA into RNA (easy: replace each T with a U), (ii) break up
the RNA sequence into groups of three nucleotides (i.e., codons), and (iii) match the codons to the
amino ac­ids us­ing the ta­ble. For ex­am­ple, the RNA co­don CUU codes for leu­cine
(Leu), which is stored as an L in the com­put­er.
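The three steps above can be sketched in a few lines of Python. This is an illustrative sketch rather than a reference implementation: the names BASES, AMINO_ACIDS, CODON_TABLE, and translate are hypothetical. The 64-letter string is a compact way of writing the standard genetic code, with codons ordered by the bases U, C, A, G at each position and "*" marking a stop codon.

```python
from itertools import product

BASES = "UCAG"
# One amino acid letter per codon, in UUU, UUC, UUA, UUG, UCU, ... order.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a DNA coding sequence, starting at the first base."""
    rna = dna.upper().replace("T", "U")        # (i) transcribe DNA to RNA
    protein = []
    for i in range(0, len(rna) - 2, 3):        # (ii) split into codons
        amino_acid = CODON_TABLE[rna[i:i + 3]]  # (iii) look up each codon
        if amino_acid == "*":                  # stop codon ends translation
            break
        protein.append(amino_acid)
    return "".join(protein)


# First 30 bases of the hemoglobin beta sequence shown earlier.
print(translate("ATGGTGCACCTGACTCCTGAGGAGAAGTCT"))  # MVHLTPEEKS
```

The output matches the first ten letters of the protein translation given above. Because the function converts T to U first, it accepts either DNA or RNA input.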

The Molecular Structure of a Gene


The gene is the ba­sic unit of he­red­ity that de­ter­mines some as­pect of the or­gan­
ism. Genetic in­for­ma­tion is, of course, en­coded in the or­gan­ism’s DNA, and this
in­for­ma­tion de­ter­mines how a par­tic­u­lar po­ly­pep­tide (pro­tein) or nu­cle­o­tide po­ly­
mer (RNA) is pro­duced. However, the DNA nu­cle­o­tides in the ge­nome that di­
rectly en­code a pro­tein make up only a small por­tion of the gene, par­tic­u­larly in
eukaryotes. Eukaryotes encompass all multicellular organisms, including all animals,
plants, and fungi, as well as many single-celled organisms, such as yeasts, parasites,
and algae. Genes in eukaryotes can be extremely complex, and this complexity
allows eukaryotes to (i) develop diverse body plans using nearly identical

FIGURE 0.11. General structure of a eukaryotic gene. Only approximately 5% of human
and other multicellular eukaryote DNA actually codes for proteins. This complexity allows
for highly differentiated gene regulation, unique for different cell types (e.g., brain cells
versus kidney cells), and the ability to make multiple different proteins from one
protein-coding region. Transcription is controlled by proteins called transcription factors, which
influence the binding of RNA polymerase either positively (more transcription) or
negatively (less transcription). Transcription factors bind thousands of bases upstream
in enhancer/silencer regions and also very close by in the proximal promoter regions. The
RNA polymerase complex binds at the core promoter region and acts to transcribe what is
known as the pre-mRNA, which includes regions termed exons and introns. Regulatory
regions 3' of the final exon signal the RNA polymerase to stop transcription. The pre-mRNA
is further processed by cellular proteins (posttranscriptional modification) prior to translation.
This pro­cess­ing in­cludes add­ing a 5' cap and a se­ries of ad­e­nine nu­cle­o­tides at the end of
the mRNA [called the po­ly(A) tail] followed by the pro­cess of splic­ing, in which the in­tron
re­gions are re­moved and the ex­ons spliced to­gether. The splic­ing pro­cess al­lows mul­ti­ple
dif­fer­ent pro­teins to be pro­duced from one pre-mRNA tran­script. For in­stance, in the fig­ure
the mid­dle exon could be re­moved, mak­ing a shorter pro­tein. While most genes have only
one or two splice var­i­ants, 1 to 2% of hu­man genes have nine or more al­ter­na­tive splice
variants. Courtesy of Thomas Shafee, under license CC BY 4.0.

proteins, (ii) differentially regulate protein and RNA production in thousands of
cell types in different tissues, and (iii) create many combinations of proteins from
the same gene.
In order to better understand the purpose of bioinformatics algorithms and databases,
it is important to have some comprehension of the structural and functional
elements of genes. Figure 0.11 diagrams the standard elements in a
eukaryotic gene. Eukaryotic genes can be many thousands of nucleotide bases
long when all aspects are accounted for, and only a small portion is devoted to
encoding the protein sequence. The rest is devoted to binding proteins that regulate
transcription and translation.
Figure 0.12 de­scribes the struc­ture of a bac­te­rial gene op­eron. Single-celled
bac­te­ria do not have com­plex cell types or body plans, so their ge­nome or­ga­ni­za­

FIGURE 0.12. General structure of a bacterial gene. The regulatory structure of a
bacterial gene is similar to but much simpler than that in eukaryotes. Transcription and its
regulation are controlled by far fewer transcription factors. The other main differences are the
lack of introns and the fact that mRNAs for most coding regions (ORFs, which stands for open
reading frames) are transcribed in clusters. For instance, the mRNAs for all the protein
enzymes involved in making the amino acid tryptophan are synthesized and ultimately
translated at the same time, a very efficient process in busy bacteria. Note that not all
bacterial genes have enhancers/silencers. UTR, un­trans­lated re­gion; RBS, ri­bo­some
bind­ing site. Courtesy of Thomas Shafee, under license CC BY 4.0.

tion is much sim­pler. However, as sin­gle cells, bac­te­ria must adapt quickly to en­
vi­ron­men­tal con­di­tions, pro­duc­ing met­a­bolic en­zymes, pro­teins in­volved in
mo­til­ity, or cell sur­face pro­teins very rap­idly. Thus, un­like eu­kary­otes, bac­te­ria
cluster their genes into operons, contiguous regions of DNA containing several genes that each
code for a different polypeptide (protein) involved in the same functional process. For example,
the genes for the proteins involved in metabolizing lactose are next to each other in the lac
operon. In bacteria, the process of transcription is directly linked to translation,
and they of­ten hap­pen more or less si­mul­ta­neously. In eu­kary­otes, mRNA is ex­
ported out­­side the nu­cleus of the cell for trans­la­tion.

Notes
1. Some vi­ruses, like HIV and in­flu­enza vi­rus, use RNA (the chem­i­cal cousin of DNA) as their
ge­netic ma­te­ri­al.
2. It would be rather thin, though, just 2 nanometers in diameter, or tens of thousands of times
thinner than a human hair.
3. Also the anal­y­sis of RNA and pro­tein se­quences, which can be de­ter­mined from DNA
­se­quences.
4. The pro­tein en­zymes that bind to DNA to (biochemically) read it, or copy it, rec­og­nize di­rec­
tion­al­ity and al­ways move in the 3' to 5' di­rec­tion on the DNA.
5. A note on transcription and translation. Transcription is the process of copying the same text
from one place to another. The mRNA is pretty much the same language (nucleotides).
Translation is the process of expressing words in another language. DNA and RNA sequences are
like one chemical language, and protein sequences are like another.
6. Hence another need for bioinformatics. Imagine the number of experiments that would be
needed to test all the functions of these proteins at the bench!

ACTIVITY 0.1 BIOLOGICAL DATABASES AND DATA STORAGE

Motivation
The goals of this sec­tion are to help you (i) un­der­stand why com­put­ers are in­creas­ingly vi­tal in
the bi­o­log­i­cal sci­ences, (ii) learn the ba­sics of how bi­o­log­i­cal in­for­ma­tion, namely, DNA, RNA,
and pro­tein se­quence in­for­ma­tion, is stored in the com­puter, and (iii) be ­able to in­ter­pret com­
mon data file types used for stor­ing bi­o­log­i­cal in­for­ma­tion. DNA is life’s prin­ci­pal in­for­ma­tion
stor­age de­vi­ce and is pres­ent in ev­ery liv­ing cell. Some viruses, like Ebola, use RNA, but are they
really “alive”? (Statement intended to annoy virologists.) Complete un­der­stand­ing of an or­gan­
ism’s DNA tells you ev­ery­thing about the or­gan­ism: what pro­teins it makes, when and how it
ac­ti­vates genes, how dif­fer­ent cell types (e.g., heart, liver, and brain cells) de­velop, and mo­re.
However, be­cause re­search­ers gen­er­ate so much DNA se­quence from so many or­gan­isms,
and it is so very bor­ing to look at, we need com­put­ers to make any sense of these data. Below
is a se­quence of hu­man DNA, with each let­ter rep­re­sent­ing a nu­cle­o­tide in the DNA. This se­
quence is a tiny frac­tion of the 3 BILLION nu­cle­o­tides in the hu­man ge­nome:

CTGATGGGAATGCAAGCAGCCATTGAGCAGGCTATGAAGAGTCGTGAGATTCTGGGCATCTCAGACCCTC
AGACGCTGGCCCATGTGCTGACAGCCGGAGTGCAGAGTTCCTTGAATGACCCACGCCTCTTCATCTCCTA
TGAGCCCAGTACCCTCGAGGCTCCCCAGNCAGCACCAACACTCACCAACCTCACCCGAGAAGAACTACTG
GCCCAGCTACAGAGGAGCATCCACCATGAGGTCCTTGAGGGCAACGTGGGTTACCTACGAATAGATGATT
TCCCCGGCCAGGAGGTACTGAGTGAGCTGGGGGGATTCTTGGTGACCCATATGTGGAGGCAGCTCATGGA
CACCTCCTCCTTGGTGCTCGATCTCCGGTACTGTGCTGGTGGTCACATCTCTGGGATCCCTTATTTCATC

Note how this se­quence is just the same four let­ters over and over in dif­fer­ent com­bi­na­tions—
ter­ri­bly dull to read and nearly im­pos­si­ble to in­ter­pret. However, with fancy com­pu­ta­tional al­go­
rithms and banks of com­put­ers, we could use these let­ters to de­ter­mine in sec­onds the spe­cies
that the se­quence came from, whether the se­quence codes for a protein, struc­tural RNA, or a
piece of an­cient “junk” DNA, and the cel­lu­lar role of the gene coded for by this DNA. (This one
is a piece of Florida muskrat DNA that codes for an eye gene.) Far out­, right?
Before dis­cuss­ing how one com­pu­ta­tion­ally an­al­yzes DNA, RNA, or pro­tein se­quences, one
first needs to un­der­stand the ways in which these se­quences are stored in the com­puter. After
cov­er­ing the ba­sics of bi­o­log­i­cal se­quences in the in­tro­duc­tion, the ex­er­cises will cover the
National Center for Biotechnology Information’s (NCBI) sci­en­tific jour­nal ar­ti­cle search en­gine
and stor­age da­ta­base (PubMed) and the un­der­ly­ing text for­mat (MEDLINE). The ex­er­cises and
tu­to­ri­als will then cover the com­plex, in­for­ma­tion-rich GenBank data file for­mat and the much
sim­pler FASTA file for­mat. Finally, we will briefly re­view a se­ries of very use­ful bi­o­log­i­cal in­for­ma­
tion da­ta­bases for ex­plor­ing pro­tein se­quences and mi­cro­bial and eu­kary­otic ge­nomes.
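The FASTA format mentioned above is simple enough to read with a few lines of code. The sketch below assumes the usual FASTA convention: each record begins with a title line starting with ">", followed by one or more lines of sequence. The function name parse_fasta and the two example records are hypothetical and are not taken from any database.

```python
def parse_fasta(text):
    """Return a dict mapping each title line (without '>') to its sequence."""
    records = {}
    title = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            title = line[1:]              # a new record starts here
            records[title] = []
        elif title is not None:
            records[title].append(line)   # sequence may span several lines
    # Join the pieces of each sequence into a single string.
    return {name: "".join(parts) for name, parts in records.items()}


example = """>seq1 hypothetical example record
GACCTTA
GGA
>seq2
ATGC
"""
print(parse_fasta(example))
```

Storing each sequence as one string makes the later steps (reverse complementing, translating, searching) straightforward.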

Learning Objectives
1. Understand how and why biological information is stored electronically (Motivation).
2. Gain fa­mil­iar­ity with some com­monly used data stor­age for­mats and how to in­ter­pret them
(Concepts and Exercises).

3. Learn the ba­sics of some pow­er­ful and highly use­ful bi­o­log­i­cal se­quence da­ta­bases
(Lab Exercises).

Concepts
To de­velop an ap­pre­ci­at­ion of the types of data that can be stored in bi­o­log­i­cal da­ta­bases, use
your knowl­edge of bi­­ol­ogy and clev­er­ness to de­ter­mine what the el­e­ments shown in bold­face
be­low in­di­cate about a par­tic­u­lar DNA se­quence stored in a da­ta­base at NCBI. The fol­low­ing text
file is an ex­am­ple of a GenBank for­mat­ted file. Write your guess as to what the items in­di­cated
in bold­face mean in the near­est box.

LOCUS HSDUT2 1177 bp DNA linear PRI 28-SEP-1997
DEFINITION Homo sa­pi­ens dUTPase (DUT) gene, exon 3.
ACCESSION AF018430
VERSION AF018430.1 GI:2443576
KEYWORDS .
SEGMENT 2 of 4
SOURCE Homo sa­pi­ens (hu­man)
ORGANISM Homo sa­pi­ens
Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
Catarrhini; Hominidae; Homo.
REFERENCE 1 (ba­ses 1 to 1177)
AUTHORS Pearlman,R.E.
TITLE Human ge­no­mic nu­clear and mi­to­chon­dria dUTPase gene
JOURNAL Unpublished
REFERENCE 2 (ba­ses 1 to 1177)
AUTHORS Pearlman,R.E.
TITLE Direct Submission
JOURNAL Submitted (11-AUG-1997) Biology, York University, 4700 Keele St.,
North York, ONT M3J 1P3, Canada
FEATURES Location/Qualifiers
source 1..1177
/organism=“Homo sa­pi­ens”
/mol_type=“ge­no­mic DNA”
/db_xref=“tax­on:9606”
/map=“15q15-q21.1”
gene or­der(AF018429.1:<1..1735,1..1177,AF018431.1:1..45,
AF018432.1:658..732,AF018432.1:884..954,
AF018432.1:1391..>1447)
/gene=“DUT”
mRNA join(AF018429.1:<282..561,AF018429.1:1034..1172,560..651,
AF018431.1:1..45,AF018432.1:658..732,AF018432.1:884..954,
AF018432.1:1391..>1447)
/gene=“DUT”
/product=“dUTPase”
/note=“alternatively spliced; encodes mitochondrial form
of the protein”
CDS join(AF018429.1:282..561,AF018429.1:1034..1172,560..651,
AF018431.1:1..45,AF018432.1:658..732,AF018432.1:884..954,
AF018432.1:1391..1447)
/gene=“DUT”

/note=“DUT-M; alternatively spliced; mitochondrial form of
the protein; similar to H. sapiens dUTPase encoded by
GenBank Accession Number U90224”
/codon_start=1
/product=“dUTPase”
/protein_id=“AAB71393.1”
/db_xref=“GI:2443580”
/translation=“MTPLCPRPALCYHFLTSLLRSAMQNARGTAEGRSRGTLRARPAP
RPPAAQHGIPRPLSSAGRLSQGCRGASTVGAAGWKGELPKAGGSPAPGPETPAISPSK
RARPAEVGGMQLRFARLSEHATAPTRGSARAAGYDLYSAYDYTIPPMEKAVVKTDIQI
ALPSGCYGRVAPRSGLAAKHFIDVGAGVIDEDYRGNVGVVLFNFGKEKFEVKKGDRIA
QLICERIFYPEIEEVQALDDTERGSGGFGSTGKN”
   mRNA join(AF018429.1:<1018..1172,560..651,AF018431.1:1..45,
AF018432.1:658..732,AF018432.1:884..954,
AF018432.1:1391..>1447)
/gene=“DUT”
/product=“dUTPase”
/note=“alternatively spliced; encodes nuclear form of the
protein”
   CDS join(AF018429.1:1018..1172,560..651,AF018431.1:1..45,
AF018432.1:658..732,AF018432.1:884..954,
AF018432.1:1391..1447)
/gene=“DUT”
/note=“DUT-N; alternatively spliced; nuclear form of the
protein; similar to H. sapiens dUTPase encoded by GenBank
Accession Number U90224”
/codon_start=1
/product=“dUTPase”
/protein_id=“AAB71394.1”
/db_xref=“GI:2443581”
/translation=“MPCSEETPAISPSKRARPAEVGGMQLRFARLSEHATAPTRGSAR
AAGYDLYSAYDYTIPPMEKAVVKTDIQIALPSGCYGRVAPRSGLAAKHFIDVGAGVID
EDYRGNVGVVLFNFGKEKFEVKKGDRIAQLICERIFYPEIEEVQALDDTERGSGGFGS
TGKN”
exon 560..651
/gene=“DUT”
/number=3
ORIGIN
1 tccctaaatc aacacagatc atgtggagga ataaaatggg gttaatatat gtaaaaccaa
61 ttaggaaact gtttctgggg caacacagta aagggcttat tcaatggata ggctagtatt
121 attagttagt aattgggccc tttttttctt tgtttctttt cttcattttt ttccttttca
181 aactatgggt tgtaaagcat ccaccttttg aaagtttgcc tttctgccct ttcacgctga
241 taagtacctc agtttccaat aaacttttgt tcaggggcaa acatttacaa tgttgacatc
301 tcttcacacc accaaaaata ttcatggaga attattttat ctaaagctgt ctttttaata
361 ataaaatagc cacctctacc ttcttcataa acttttaaga tgaattggta attcatcata
421 gcaaggttga ttttagaaac taaagttgca ttaattcatt aaatacactg aaagtaattt
481 tgtatgcttg gtcacaaaga aaatataaaa acaattttat aaatagattt gcagttattt
541 tctttcaata ttttcttagt gcctatgatt acacaatacc acctatggag aaagctgttg
601 tgaaaacgga cattcagata gcgctccctt ctgggtgtta tggaagagtg ggtaagtcat
661 ttaagaaaca ggtaactatt tgtcaagttc tcctttgtga tagattcttc atgtttcatt
721 tggggtaata agcaggcaat attgcttggg ctgtgtccta aaagaagcac catttgtgat
781 agcaaatgca ctctttgaaa ggctttattt acatctctgc tttgcctctt tttgaccctt
841 ttatttttct ccttcctcac tggagctttt aggctcacac tggcctagaa ggctgttctc
901 agaacatggc attttatatt atgagagtaa aacttctgac ctgttggtcc cagaatgtgt

961 aagcctactt aaccttttct tgtttggcca tggggtttag ggtaagggat actcttcagt


1021 gtttgtagag gcactgggag gaagctagga caaaatggag ttacacgtca acaggtttga
1081 tttttcctgg aagcgaattc agtgtttacc agacagttcc tttgcagagc gttagttcct
1141 ttttgactac ttccaagtta acttaaggag gcatgga
//

The an­swers are shown be­low. This file for­mat can be quite rich in in­for­ma­tion. Indeed, one can
learn a lot about the gene func­tion, splic­ing, reg­u­la­tion, and other things by look­ing at such fi­les.

LOCUS HSDUT2 1177 bp DNA lin­ear PRI 28-SEP-1997


DEFINITION Homo sa­pi­ens dUTPase (DUT) gene, exon 3.
ACCESSION AF018430
VERSION AF018430.1 GI:2443576
[A unique code specific to this particular GenBank entry.]
KEYWORDS .
SEGMENT 2 of 4
SOURCE Homo sa­pi­ens (hu­man)
ORGANISM Homo sa­pi­ens
Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
Catarrhini; Hominidae; Homo.
REFERENCE 1 (bases 1 to 1177)
AUTHORS Pearlman,R.E.
TITLE Human ge­no­mic nu­clear and mi­to­chon­dria dUTPase gene
JOURNAL Unpublished
REFERENCE 2 (bases 1 to 1177)
AUTHORS Pearlman,R.E.
TITLE Direct Submission
JOURNAL Submitted (11-AUG-1997) Biology, York University, 4700 Keele St.,
North York, ONT M3J 1P3, Canada
FEATURES Location/Qualifiers
source 1..1177
[The number of nucleotides in this file. (Or amino acids if this was a protein entry.) See below.]
/organism=“Homo sapiens”
/mol_type=“genomic DNA”
/db_xref=“tax­on:9606”
/map=“15q15-q21.1”
gene or­der(AF018429.1:<1..1735,1..1177,AF018431.1:1..45,
AF018432.1:658..732,AF018432.1:884..954,
AF018432.1:1391..>1447)
/gene=“DUT”
mRNA join(AF018429.1:<282..561,AF018429.1:1034..1172,560..651,
AF018431.1:1..45,AF018432.1:658..732,AF018432.1:884..954,
AF018432.1:1391..>1447)
/gene=“DUT”
/product=“dUTPase”
/note=“alternatively spliced; encodes mitochondrial form of the protein”
[Indicates the exon positions in a splice variant. Note: some of the exons are in other GenBank entries.]

CDS join(AF018429.1:282..561,AF018429.1:1034..1172,560..651,
AF018431.1:1..45,AF018432.1:658..732,AF018432.1:884..954,
AF018432.1:1391..1447)
/gene=“DUT”
/note=“DUT-M; al­ter­na­tively spliced; mi­to­chon­drial form of
the protein; similar to H. sapiens dUTPase encoded by
GenBank Accession Number U90224”
/codon_start=1
/product=“dUTPase”
/protein_id=“AAB71393.1”
/db_xref=“GI:2443580”
/translation=“MTPLCPRPALCYHFLTSLLRSAMQNARGTAEGRSRGTLRARPAP
RPPAAQHGIPRPLSSAGRLSQGCRGASTVGAAGWKGELPKAGGSPAPGPETPAISPSK
RARPAEVGGMQLRFARLSEHATAPTRGSARAAGYDLYSAYDYTIPPMEKAVVKTDIQI
ALPSGCYGRVAPRSGLAAKHFIDVGAGVIDEDYRGNVGVVLFNFGKEKFEVKKGDRIA
QLICERIFYPEIEEVQALDDTERGSGGFGSTGKN”
[The protein sequence translation for a splice mRNA.]
mRNA join(AF018429.1:<1018..1172,560..651,AF018431.1:1..45,
AF018432.1:658..732,AF018432.1:884..954,
AF018432.1:1391..>1447)
/gene=“DUT”
/product=“dUTPase”
/note=“alternatively spliced; encodes nuclear form of the
protein”
CDS join(AF018429.1:1018..1172,560..651,AF018431.1:1..45,
AF018432.1:658..732,AF018432.1:884..954,
AF018432.1:1391..1447)
/gene=“DUT”
/note=“DUT-N; alternatively spliced; nuclear form of the
protein; similar to H. sapiens dUTPase encoded by GenBank
Accession Number U90224”
/codon_start=1
/product=“dUTPase”
/protein_id=“AAB71394.1”
/db_xref=“GI:2443581”
/translation=“MPCSEETPAISPSKRARPAEVGGMQLRFARLSEHATAPTRGSAR
AAGYDLYSAYDYTIPPMEKAVVKTDIQIALPSGCYGRVAPRSGLAAKHFIDVGAGVID
EDYRGNVGVVLFNFGKEKFEVKKGDRIAQLICERIFYPEIEEVQALDDTERGSGGFGS
TGKN”
exon 560..651
/gene=“DUT”
/number=3
ORIGIN
1 tccctaaatc aacacagatc atgtggagga ataaaatggg gttaatatat gtaaaaccaa
61 ttaggaaact gtttctgggg caacacagta aagggcttat tcaatggata ggctagtatt
121 attagttagt aattgggccc tttttttctt tgtttctttt cttcattttt ttccttttca
181 aactatgggt tgtaaagcat ccaccttttg aaagtttgcc tttctgccct ttcacgctga
241 taagtacctc agtttccaat aaacttttgt tcaggggcaa acatttacaa tgttgacatc
301 tcttcacacc accaaaaata ttcatggaga attattttat ctaaagctgt ctttttaata
361 ataaaatagc cacctctacc ttcttcataa acttttaaga tgaattggta attcatcata
421 gcaaggttga ttttagaaac taaagttgca ttaattcatt aaatacactg aaagtaattt
481 tgtatgcttg gtcacaaaga aaatataaaa acaattttat aaatagattt gcagttattt
541 tctttcaata ttttcttagt gcctatgatt acacaatacc acctatggag aaagctgttg
601 tgaaaacgga cattcagata gcgctccctt ctgggtgtta tggaagagtg ggtaagtcat
661 ttaagaaaca ggtaactatt tgtcaagttc tcctttgtga tagattcttc atgtttcatt

721 tggggtaata agcaggcaat attgcttggg ctgtgtccta aaagaagcac catttgtgat


781 agcaaatgca ctctttgaaa ggctttattt acatctctgc tttgcctctt tttgaccctt
841 ttatttttct ccttcctcac tggagctttt aggctcacac tggcctagaa ggctgttctc
901 agaacatggc attttatatt atgagagtaa aacttctgac ctgttggtcc cagaatgtgt
961 aagcctactt aaccttttct tgtttggcca tggggtttag ggtaagggat actcttcagt
1021 gtttgtagag gcactgggag gaagctagga caaaatggag ttacacgtca acaggtttga
1081 tttttcctgg aagcgaattc agtgtttacc agacagttcc tttgcagagc gttagttcct
1141 ttttgactac ttccaagtta acttaaggag gcatgga
//

The nu­cle­o­tide se­quence of the GenBank en­try.
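Two parts of this record are easy to process by computer: the sequence after ORIGIN and the join(...) locations. The following is a rough sketch under simplifying assumptions, not a full GenBank parser. The function names are hypothetical, the sample text contains only the first 20 bases of the entry, and the location handling ignores everything except the start..end number pairs.

```python
import re


def origin_sequence(genbank_text):
    """Return the sequence stored between ORIGIN and // as one string."""
    bases = []
    in_origin = False
    for line in genbank_text.splitlines():
        if line.startswith("ORIGIN"):
            in_origin = True
        elif line.startswith("//"):
            break
        elif in_origin:
            # Drop the position counter and the spaces between blocks.
            bases.append(re.sub(r"[^a-zA-Z]", "", line))
    return "".join(bases)


def segment_lengths(location):
    """Return the length of every start..end segment in a location string."""
    # "<" and ">" mark partial ends; accession prefixes are skipped because
    # only digits followed by ".." are matched.
    pairs = re.findall(r"<?(\d+)\.\.>?(\d+)", location)
    return [int(end) - int(start) + 1 for start, end in pairs]


sample = """ORIGIN
        1 tccctaaatc aacacagatc
//
"""
print(origin_sequence(sample))  # tccctaaatcaacacagatc

cds = ("join(AF018429.1:282..561,AF018429.1:1034..1172,560..651,"
       "AF018431.1:1..45,AF018432.1:658..732,AF018432.1:884..954,"
       "AF018432.1:1391..1447)")
lengths = segment_lengths(cds)
print(lengths, sum(lengths))
```

For the mitochondrial CDS above, the segments add up to 759 bases, which is 253 codons: the 252 amino acids listed in the translation plus a stop codon.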

Exercises
Lab ex­er­cises (prac­tice)
In this part of the ex­er­cise, you will learn the ba­sics of some very use­ful bi­o­log­i­cal
da­ta­bases. Follow the link or QR code be­low to back­ground in­for­ma­tion on data
for­mats and tu­to­ri­als on da­ta­bases you should use to an­swer the lab ex­er­cise
questions. The Basics link includes tutorials for the UniProt (protein) and Ensembl
(genome) databases, and the large collection of databases available at NCBI. The
NCBI drop-down menu includes links to at least 45 different databases, allowing
analysis of everything from gene expression to biochemistry to taxonomy.

From the Basics Link, under the Databases heading, you can click to learn about the
NCBI and UniProt databases. After reviewing these, complete the Lab Exercise on the
next page.
http://​kelleybioinfo.​org/​algorithms/​basics/​

Lab Exercise
Part 1: us­ing NCBI PubMed

1. How would you search in PubMed for all papers about bacteria written by
Pace AND Lane? Use the MEDLINE au­thor field to be more spe­cific. Write the
PubMed search terms you used here.

2. Use the MEDLINE en­try of the Pace and Lane ar­ti­cle ­ti­tled “Evolutionary
re­la­tion­ships among sul­fur- and iron-oxidizing eubacteria” to get the GenBank
ac­ces­sion numbers in the article. Write the first five ac­ces­sion num­bers
­be­low.

3. How would you find all the experts on progesterone in San Diego? Do a specific
search for the keyword and address. Write the search terms you would
use be­low.

4. Find the PubMed PMID for any two pa­pers from au­thors at San Di­ego State
University that have worked with 16S. Write the search term used and the
PMIDs be­low.

Search Terms:

PMID 1:

PMID 2:

5. Use a com­bi­na­tion of the search field and the left side­bar to search for re­view
pa­pers about Mycobacterium pub­lished in the last 5 years. Write the ti­tle of
the first search re­sult be­low:

Title:

Part 2: data for­mats

1. Write the ti­tle line for the FASTA en­try be­low for the DNA se­quence of the nu­
clear splice mRNA var­i­ant de­scrip­tion from GenBank en­try AH005568.2.

What is a nu­clear splice mRNA var­i­ant any­way? Explain briefly below. (Hint:
Ask Dr. Wikipedia or Professor Google for things you don’t know or recognize.
The Internet: not just for cat videos anymore!)

2. Find the fol­low­ing in­for­ma­tion for the hu­man es­tro­gen re­cep­tor (ac­ces­sion:
NM_000125):

a. What is the se­quence of the polyadenylation [po­ly(A)] sig­nal? Hint: look for
“polyA_signal_sequence” in the fi­le.

b. Write the ti­tle line of the cod­ing re­gion pro­tein se­quence in FASTA for­mat
of es­tro­gen re­cep­tor al­pha iso­form 1 (Homo sa­pi­ens es­tro­gen re­cep­tor 1):

c. At what po­si­tion of the DNA re­cord does the pro­tein cod­ing re­gion start for
the al­pha iso­form?

d. What is a po­ly(A) sig­nal any­way (i.e., what is its func­tion)?



Part 3: mi­cro­bial ge­nomes and UniProt da­ta­bas­es

1. Search the NCBI Genomes for Sulfolobus solfataricus P2.

a. What kind of archaeon is Sulfolobus solfataricus (next term in the lineage
after Archaea)?

b. What does it me­tab­o­lize?

c. What is this mi­crobe’s op­ti­mal growth tem­per­a­ture?

d. At what pH range does it grow best?

e. What is the num­ber of cur­rently pre­dicted pro­teins in the ge­nome?

f. Find the FASTA pro­tein se­quence of 30S ri­bo­somal pro­tein S3. Hint: click
on the pro­tein link (part e) and search the pro­tein ta­ble for “30S ri­bo­somal
pro­tein S3.” Then se­lect the link un­der “Protein prod­uct.” Write the ti­tle line
of the FASTA en­try be­low.

2. Using UniProt, find the fol­low­ing in­for­ma­tion about the gene in­volved in cys­tic
fi­bro­sis trans­mem­brane con­duc­tance reg­u­la­tion in Homo sa­pi­ens.

a. What is the ac­ces­sion num­ber of the full-length en­try? (No frag­ments.)

b. Describe briefly the func­tion of this gene (see da­ta­base en­try un­der Function).

c. What is its tis­sue spec­i­fic­i­ty?


CHAPTER
01
BLAST

The inventors of the BLAST (Basic Local Alignment Search Tool) computer
pro­gram should win the No­bel Prize in Physiology or Medicine. I’m se­ri­
ous! Actually, they should prob­a­bly share it with the in­ven­tors of the
FASTA al­go­rithm, which pre­ceded the more ef­fi­cient and flex­i­ble BLAST
­al­go­rithm.
Why should the makers of a computer program that finds the best database
match to a biological sequence win the most prestigious award in biomedical
science? The Nobel Prize is usually reserved for discoveries made in the laboratory,
such as the discovery of novel disease organisms, how to genetically engineer
mice, proteins that glow in the dark, the mechanism of cell death, that sort of thing.
A com­puter pro­gram seems a pal­try gim­mick in com­par­i­son, and bi­­ol­o­gists of­ten
chuckle or even scoff when I sug­gest that it de­serves the No­bel. Determining the
best match of a bi­o­log­i­cal se­quence ap­pears rather un­in­spir­ing com­pared with
the dis­cov­ery of an­ti­bi­ot­ics or pro­teins that act like vi­ruses. Yet in an era of rap­idly
ex­pand­ing DNA se­quenc­ing ca­pa­bil­ity and big bi­o­log­i­cal data, BLAST’s sim­plic­ity,
el­e­gance, and ef­fi­ciency make this al­go­rithm the most pow­er­ful bio­med­i­cal dis­
cov­ery tool in the his­tory of sci­ence.

BLAST It
Researchers use BLAST so of­ten that it has been turned into a verb. (Like “to
google” or “to chill”.) “Hey, did you BLAST that se­quence I gave you yet? No?
Well, what are you wait­ing for? BLAST it!” To un­der­stand why BLAST is so use­
ful and pop­ul­ar, imag­ine that you work in a mo­lec­u­lar bi­­ol­ogy lab study­ing a
novel infectious disease and you just received the results of your very first sequencing
run, perhaps the first genetic information in the history of this unknown
organism. Unfortunately, the sequence looks like this (see note 1):


>FX093345
TGCAGTCGATCATCAGCATACCTAGGTTTCGTCCGGGTGTGACCGAAAGGTAAGATGGAGAGCCTTGTTC
TTGGTGTCAACGAGAAAACACACGTCCAACTCAGTTTGCCTGTCCTTCAGGTTAGAGACGTGCTAGTGCG
TGGCTTCGGGGACTCTGTGGAAGAGGCCCTATCGGAGGCACGTGAACACCTCAAAAATGGCACTTGTGGT
CTAGTAGAGCTGGAAAAAGGCGTACTGCCCCAGCTTGAACAGCCCTATGTGTTCATTAAACGTTCTGATG
CCTTAAGCACCAATCACGGCCACAAGGTCGTTGAGCTGGTTGCAGAAATGGACGGCATTCAGTACGGTCG
TAGCGGTATAACACTGGGAGTACTCGTGCCACATGTGGGCGAAACCCCAATTGCATACCGCAATGTTCTT
CTTCGTAAGAACGGTAATAAGGGAGCCGGTGGTCATAGCTATGGCATCGATCTAAAGTCTTATGACTTAG

That’s some ex­cit­ing re­search you have there! Well, maybe, but what ex­actly
is it? Clearly, it is a DNA se­quence of some sort, but is it the se­cret to un­cov­er­ing
a deadly new path­o­gen, or is it hu­man DNA con­tam­i­na­tion from the tech­ni­cian at
the se­quenc­ing fa­cil­ity? Is it part of a pro­tein toxin that al­lows a deadly mi­crobe to
de­stroy ep­i­the­lial cells in the lungs, or is it a reg­u­la­tory gene se­quence from a
house­fly that landed on your pi­pette tip when you weren’t look­ing?
Eyeballing a bunch of letters, the same four letters (A, G, T, and C) repeated in
different orders, will get you nowhere. Wouldn’t it be great if, without pilfering a
pipette or pouring a petri plate, you could answer the following questions?

• Is the genetic material from a bacterium or a virus?
• Did you accidentally sequence a contaminating organism?
• Does the se­quence code for a pro­tein?
• What is the bi­o­log­i­cal func­tion of the se­quence?
• Is the ge­netic ma­te­rial re­lated to other, pos­si­bly well-characterized, path­o­gens?
• Is the se­quence new to sci­ence?

Using BLAST, you can an­swer most of these ques­tions in mil­li­sec­onds.


That’s how long it takes for BLAST and a bunch of re­mote su­per­com­put­ers at
the National Center for Biotechnology Information (NCBI) to match your se­quence
to ev­ery DNA se­quence ever cata­logued. Figure 1.1 shows the BLAST re­sults
for your hy­po­thet­i­cal new path­o­gen. In un­der a sec­ond, you dis­cov­ered (i) the
na­ture of the or­gan­ism, namely, a known deadly vi­rus of the co­ro­na­vi­rus (a com­
mon cold vi­rus) fam­ily, (ii) that your se­quence en­codes a pro­tein, and (iii) that
the pro­tein is part of the vi­rus’s outer shell, a crit­i­cal part of the in­fec­tious pro­
cess. Not bad for a few sec­onds of effort. (Your tax dollars at work!)

Scaling Up: Massive Parallelization of BLAST


Imagine that in­stead of just one mys­tery se­quence you had 10, 100, or 1,000
­se­quences and you had to match them against da­ta­bases con­tain­ing mil­li­ons of
se­quences. Furthermore, these se­quences en­com­pass mul­ti­ple dif­fer­ent or­
gan­isms, are part of dif­fer­ent mo­lec­u­lar pro­cesses, and are in­volved in a va­ri­ety
of cel­lu­lar func­tions. Now you be­gin to see the scale of the prob­lem and why
bioinformatics can be so darn use­ful. By run­ning BLAST searches in par­al­lel on
a suite of high-­performance com­put­ers, it is pos­si­ble to an­a­lyze thou­sands of
mystery sequences against millions of known sequences generated by researchers
all over the world (Fig. 1.2).
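
As a rough illustration of this idea, the Python sketch below farms out one independent search per query to a pool of workers. Everything in it is an assumption made for illustration: `shared_words` is a toy stand-in for a real BLAST score, the function names and toy sequences are hypothetical, and threads are used only to keep the example short (a real deployment would spread the searches over separate processes or machines).

```python
from concurrent.futures import ThreadPoolExecutor


def words(seq, w):
    """Return the set of all w-letter words found in seq."""
    return {seq[i:i + w] for i in range(len(seq) - w + 1)}


def shared_words(query, subject, w=4):
    """Toy similarity score (NOT a real BLAST score): number of shared words."""
    return len(words(query, w) & words(subject, w))


def best_match(query, database):
    """Score one query against every database entry; return (name, score)."""
    scored = ((name, shared_words(query, seq)) for name, seq in database.items())
    return max(scored, key=lambda hit: hit[1])


def search_all(queries, database, workers=4):
    """Run one independent search per query across a pool of workers."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        hits = list(pool.map(lambda q: best_match(q, database), queries.values()))
    return dict(zip(queries.keys(), hits))


# Hypothetical toy data, invented for this example.
database = {"seqA": "AAAACCCCGGGG", "seqB": "GGGGTTTTAAAA"}
queries = {"query1": "AAAACCCC", "query2": "GGTTTTAA"}
print(search_all(queries, database))
```

Because each query is scored without reference to any other query, the work divides cleanly: doubling the number of workers roughly halves the waiting time, which is why this kind of search scales so well.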

FIGURE 1.1. Results of a BLAST nu­cle­o­tide search with the FX093345 nu­cle­o­tide
se­quence. The re­sults, re­turned in un­der a sec­ond, re­veal that the se­quence is an ex­act
match to the ge­nome of a se­vere acute re­spi­ra­tory syn­drome (SARS) co­ro­na­vi­rus pro­tein
first iso­lated in Shanghai (apparently misspelled “Shanhgai” in the da­ta­base), China. The
first 50 matches were to SARS co­ro­na­vi­ruses iso­lated at dif­fer­ent times, which pro­vi­des
con­fi­dence that we have in­deed iso­lated a SARS vi­rus. At the time this BLAST anal­y­sis was
per­formed, the se­quence was a per­fect 100% match (bottom right un­der the “Ident” col­umn)
to more than one se­quence, so we can­not say with cer­tainty that it is the Shanghai strain.
The fact that so many were iden­ti­cal also means that the re­sults ap­pear in a ran­dom or­der.
That, and the fact that da­ta­bases are con­stantly up­dated with new se­quences, means that
an­other search with the se­quence may re­turn dif­fer­ent re­sults from those seen in the figure.
One of the most important considerations for interpreting BLAST results is the E-value.
The E-value is the Expectation value: the number of matches with a score at least this good
that would be expected to turn up by random chance in a database of this size. An E-value of
1e-4 is the same as 0.0001, which indicates that a match this good would occur by chance in
about 1 of 10,000 such searches. Note, an E-value of 0.0 signifies a value below 1e-250.
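
To make the arithmetic concrete, here is a small Python sketch. Strictly speaking, the E-value is an expected count of chance matches rather than a probability; if chance matches are assumed to follow a Poisson distribution, the probability of seeing at least one is 1 − e^(−E), which is nearly identical to E when E is small. The function name is hypothetical.

```python
import math


def chance_hit_probability(e_value):
    """Probability of at least one chance match, assuming Poisson-distributed chance hits."""
    return -math.expm1(-e_value)  # same as 1 - exp(-E), but accurate for tiny E


for e in (10.0, 1.0, 1e-4, 1e-50):
    print(f"E = {e:g}  ->  P(at least one chance match) = {chance_hit_probability(e):.3g}")
```

The practical reading is the same either way: the smaller the E-value, the less likely it is that the match is a coincidence.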

Why a No­bel Prize?


The real power of bioinformatics, es­pe­cially BLAST, is two­fold. First, bioinformat­
ics meth­ods dra­mat­i­cally am­plify ex­per­i­men­tal re­sults. Second, bioinformatics
meth­ods gen­er­ate new test­able hy­poth­e­ses and tar­gets for fur­ther dis­cov­ery. To
un­der­stand how BLAST am­pli­fies bi­o­log­i­cal knowl­edge, one must first have an
ap­pre­ci­a­tion of just how much in terms of time, en­ergy, and re­sources it takes to
make dis­cov­er­ies us­ing the tools of mo­lec­u­lar bi­­ol­ogy. Without ba­sic in­for­ma­tion
on the ge­net­ics and bio­chem­i­cal func­tion of the genes or gene re­gions gleaned
from laboratory experimentation, BLAST matches would not be particularly
useful. As an example, Fig. 1.3 shows some of the standard experimental approaches
needed to determine the function of a protein-coding gene found in a
bacterial genome.

FIGURE 1.2. Parallel BLAST searches. Each “Query” sequence is simultaneously
matched to hundreds or thousands of sequence databases using the BLAST algorithm.
The scores for the best matches are retrieved and ranked.

While the cost of char­ac­ter­iz­ing even a sin­gle pro­tein-coding gene can be sig­
nif­i­cant in terms of both time and money, once we have this char­ac­ter­iza­tion and
this se­quence in a da­ta­base, al­go­rithms like BLAST be­come es­pe­cially pow­er­ful.
Figure 1.3 (right) illustrates how BLAST and DNA sequencing technology can amplify
bi­o­log­i­cal knowl­edge.
In the ex­am­ple, BLAST is re­ally am­pli­fy­ing the ex­per­i­men­tal knowl­edge by
al­low­ing us to in­fer the func­tion of 20 dif­fer­ent pre­vi­ously un­known DNA se­
quences through a rapid se­quence align­ment. The pro­cess can also go in the
other di­rec­tion. For in­stance, we could use BLAST to dis­cover if a DNA se­quence
that codes for a gene of un­known func­tion is pres­ent in ev­ery liv­ing or­gan­ism
ever sequenced. However, should an experimentalist decide to target the protein
product of this gene of unknown function for analysis and figure out its molecular
function, we would suddenly know the function of this gene in millions of
species.
Given the rate of sequence generation, it is safe to say that BLAST and other
bioinformatics methods are the primary source of information for 99.9% of all
newly sequenced DNA. This has made BLAST instrumental for discovering, among
other things, the following:

FIGURE 1.3. Characterizing the function of a novel bacterial protein encoded in the
Escherichia coli genome and then using BLAST to find similar functions in newly
sequenced genomes. (Left) Flowchart of approaches necessary to characterize a novel
protein in the E. coli genome. Most bacteria have one copy of each gene on a single circular
chromosome (see note 2). Each of the flowchart steps represents weeks or months of painstaking
work. Typically, this process may take a year and cost on the order of $50,000. (Right)
Once a gene has been characterized, sequences from other bacterial genomes (circles) can
be quickly matched to this sequence using BLAST. The figure shows BLAST identifying
porin proteins in 20 different bacterial genomes. When the matches are sufficiently strong,
we can infer that these proteins have functions highly similar to that of the original,
which saves the need to experimentally characterize this protein from all 20 genomes.

• Novel pathogens, both animal and human
• Life in extreme environments
• Mechanisms of ge­netic dis­eas­es
• Novel pro­tein func­tions
• Novel cel­lu­lar pro­cess­es

If this doesn’t war­rant a No­bel Prize in Medicine, frankly, I don’t know what
does.

Notes
1. Note the FASTA for­mat of the se­quence, the pro­gram’s en­dur­ing leg­acy to bioinformatics.
2. In the cell, the ge­no­mic DNA is usu­ally tightly wound in a su­per­coiled state.

ACTIVITY 1.1 BLAST ALGORITHM

Motivation
The pur­pose of this ac­tiv­ity is to teach the ba­sic con­cepts be­hind the BLAST al­go­rithm and how
to use a web-based im­ple­men­ta­tion of this al­go­rithm to an­a­lyze DNA and pro­tein se­quence
data. BLAST (Basic Local Alignment Search Tool) is a fast com­pu­ta­tional method for mak­ing se­
quence align­ments. Sequence align­ments are a crit­i­cal part of bioinformatics. Computational
meth­ods for mak­ing pairwise align­ments of bi­o­log­i­cal mol­e­cules (DNA, RNA, or pro­tein) were
some of the very first bioinformatics al­go­rithms de­vel­oped. Among other things, se­quence align­
ments al­low re­search­ers to de­ter­mine the or­gan­isms from which the mol­e­cule came (hu­man,
oys­ter, pine tree, bac­te­rium, etc.) and pre­dict the cel­lu­lar func­tion of bi­o­log­i­cal mol­e­cules based
only on their se­quence. For ex­am­ple, BLAST can re­port with high con­fi­dence that the pro­tein se­
quence YNFGSGSAYGGSFGGVDGLLAGGEKATMQNL is ker­a­tin from the do­mes­tic dog hair found
on your so­fa.
BLAST was cre­ated to speed up the pro­cess of mak­ing se­quence align­ments. Full pairwise
se­quence align­ment meth­ods (see Chap­ter 03) are too com­pu­ta­tion­ally in­ten­sive to han­dle the
align­ment of thou­sands or mil­li­ons of se­quences. BLAST speeds up this pro­cess by “chop­ping
up” an in­put se­quence into smaller bits and match­ing these smaller bits to mil­li­ons of dif­fer­ent
se­quences. The al­go­rithm then at­tempts to ex­tend the se­quence align­ment to make a full align­
ment. Then the al­go­rithm ranks the se­quence align­ments, and the lon­gest align­ment with the
few­est mis­matches wins! In bioinformatics par­lance we call the good matches “hits,” and the
best ones are “best hits” or “top hits.” BLAST is a heu­ris­tic method, mean­ing that it is not guar­an­
teed to find the op­ti­mal align­ment, but it is much faster than more strin­gent ap­proaches.

Learning Objectives
1. Know the basic purpose and utility of the BLAST computational method (Motivation).
2. Understand the con­cepts be­hind the BLAST al­go­rithm (Concepts and Exercises).
3. Correctly solve se­quence-matching prob­lems based on the BLAST al­go­rithm (Concepts and
Exercises).
4. Learn how to use NCBI’s BLAST web-based se­quence anal­y­sis web­site and be ­able to cor­
rectly in­ter­pret its out­­put (Concepts and Exercises).

Concepts
As men­tioned above, the pur­pose of the BLAST al­go­rithm is to find the best hit (high­est-scoring
match) of an un­known DNA or pro­tein se­quence in a da­ta­base. The method has been so suc­
cess­ful be­cause of its clever sim­plic­ity. In or­der to bet­ter grasp the method be­hind the al­go­rithm,
try the pre­pa­ra­tory ex­er­cise. Using your brain and a pen­cil, try to find re­gions (lo­cal align­ments)
that best match the fol­low­ing DNA se­quence and pro­tein se­quences. First, try to match the
Query DNA se­quence to the Sbjct1 (Subject) DNA se­quence. Then do the same for the pro­tein
Query and Sbjct sequences. Find the regions of best alignment between the two, and keep in
mind that the match doesn’t have to be perfect and may even need spaces to help it line up.
Circle or draw lines between the matching letters.

DNA MATCH

Query: AGCGAATATTATGTTGAAGTAGCAAAGTCCTGGAGCCT

Sbjct: ACTACAGGGGAGTTTTGTTGAAGTTGCAAAGTCCTGGAGCCTCCAGAGGGC

PROTEIN MATCH

Query: MEMKATTALLNDRVLRAMLYFWCKAEETCALEVCEE

Sbjct: ETIRRAYPDANLLNDRVLRAMLYFWRKAEETCAPSVSMRKIVATWMLEVCEE

Reflection
• How much of the query did you try to match at one time?
• How did you find a match? Can you de­scribe it in words?
• Were there any mis­matches for the best se­quence?
• Were there ever mul­ti­ple matches? Would break­ing up the Query (in­tro­duc­ing a gap)
help?

Below is the an­swer. The ver­ti­cal lines in­di­cate a per­fect match be­tween the let­ters of the two
se­quences. Note that there are some mis­matches and that a big gap must be in­serted for the
end of the pro­tein Query se­quence to match the Sbjct se­quence (LEVCEE).

DNA MATCH

Query:     AGCGAATATTATGTTGAAGTAGCAAAGTCCTGGAGCCT
                   || ||||||||| |||||||||||||||||
Sbjct: ACTACAGGGGAGTTTTGTTGAAGTTGCAAAGTCCTGGAGCCTCCAGAGGGC

PROTEIN MATCH

Query:    MEMKATTALLNDRVLRAMLYFWCKAEETCA-------------LEVCEE
                  |||||||||||||| |||||||             ||||||
Sbjct: ETIRRAYPDANLLNDRVLRAMLYFWRKAEETCAPSVSMRKIVATWMLEVCEE

Most stu­dents solve this prob­lem by (i) slid­ing the Query se­quence along the Sbjct se­quence, (ii)
find­ing a short re­gion that matches well, and then (iii) ex­tend­ing the match as far as pos­si­ble.
This is es­sen­tially how the BLAST al­go­rithm works. Figure 1.1.1 de­tails the ba­sic steps of the
al­go­rithm.
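
The sliding step can be sketched in a few lines of Python. This is only a brute-force illustration of the idea (no gaps, and not how BLAST itself is implemented); the function names are made up for this example, and the sketch assumes the Sbjct is at least as long as the Query.

```python
def count_matches(query, subject, offset):
    """Number of identical letters when query is laid over subject at offset."""
    return sum(q == s for q, s in zip(query, subject[offset:]))


def best_offset(query, subject):
    """Slide query along subject (no gaps); return (offset, matches) of the best fit."""
    offsets = range(len(subject) - len(query) + 1)
    return max(((off, count_matches(query, subject, off)) for off in offsets),
               key=lambda pair: pair[1])


def match_line(query, subject, offset):
    """Draw '|' under identical letters, as in the answer key."""
    return "".join("|" if q == s else " " for q, s in zip(query, subject[offset:]))


# DNA sequences from the preparatory exercise.
query = "AGCGAATATTATGTTGAAGTAGCAAAGTCCTGGAGCCT"
sbjct = "ACTACAGGGGAGTTTTGTTGAAGTTGCAAAGTCCTGGAGCCTCCAGAGGGC"
offset, matches = best_offset(query, sbjct)
print(" " * offset + query)
print(" " * offset + match_line(query, sbjct, offset))
print(sbjct)
```

For the DNA pair, the best fit shifts the Query four letters to the right. The printed match line also marks one isolated identical letter near the start of the Query, which the answer key leaves out because it is not part of the continuous local alignment.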

FIGURE 1.1.1. Principles be­hind the BLAST al­go­rithm. (1) The first step of the al­go­rithm is
to break the Query se­quence into smaller pieces called “words.” For DNA, the word size is
usu­ally 10 or 11 let­ters long, but the ex­am­ple uses four-letter words for simplicity. (Four-letter
words. *snicker*) Protein se­quence matches start with fewer let­ters. (2) Then the al­go­rithm
slides these smaller words across pos­si­ble tar­get se­quences un­til it finds a per­fect match
with one of the small words. (3) Starting with this small align­ment, BLAST then ex­tends the
alignment until it runs out of letters or the alignment becomes poor (lots of mismatches). The
score of the align­ment is de­ter­mined by sum­ming up the scores for matches and mis­matches.
For DNA, BLAST uses +5 for a match and −4 for a mis­match. The scores of pro­tein se­quence
align­ments are de­ter­mined by us­ing a spe­cial ta­ble of match/mismatch scores, such as the
BLOSUM62 ma­trix (see Chap­ter 07).
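
The three steps in the figure can be sketched in Python as follows. This is a simplified, ungapped illustration rather than the real BLAST code: it uses the four-letter words and the +5/−4 DNA scores quoted in the caption, and it assumes a hypothetical `x_drop` cutoff to decide when an extension has become "poor." All function names are invented for this sketch.

```python
from collections import defaultdict

MATCH, MISMATCH = 5, -4  # DNA scores quoted in the figure caption


def index_words(subject, w):
    """Step 2 helper: map every w-letter word in the subject to its start positions."""
    index = defaultdict(list)
    for j in range(len(subject) - w + 1):
        index[subject[j:j + w]].append(j)
    return index


def extend(query, subject, qi, sj, step, x_drop):
    """Step 3: walk in one direction (step = +1 or -1) from (qi, sj).
    Returns the best score gained and the number of letters that gain spans."""
    score = best = best_len = n = 0
    while 0 <= qi < len(query) and 0 <= sj < len(subject):
        score += MATCH if query[qi] == subject[sj] else MISMATCH
        n += 1
        if score > best:
            best, best_len = score, n
        elif best - score > x_drop:  # assumed cutoff: the alignment has become poor
            break
        qi += step
        sj += step
    return best, best_len


def best_hit(query, subject, w=4, x_drop=20):
    """Return (score, query start, subject start, length) of the best ungapped hit."""
    index = index_words(subject, w)
    best = (0, 0, 0, 0)
    for i in range(len(query) - w + 1):          # step 1: every word of the query
        for j in index.get(query[i:i + w], []):  # step 2: exact word matches
            right, r_len = extend(query, subject, i + w, j + w, +1, x_drop)
            left, l_len = extend(query, subject, i - 1, j - 1, -1, x_drop)
            score = w * MATCH + left + right
            if score > best[0]:
                best = (score, i - l_len, j - l_len, l_len + w + r_len)
    return best


# DNA sequences from the preparatory exercise.
query = "AGCGAATATTATGTTGAAGTAGCAAAGTCCTGGAGCCT"
sbjct = "ACTACAGGGGAGTTTTGTTGAAGTTGCAAAGTCCTGGAGCCTCCAGAGGGC"
print(best_hit(query, sbjct))
```

For the DNA pair from the preparatory exercise, the sketch reports a score of 132 over 30 aligned letters (28 matches at +5 and 2 mismatches at −4), which is the same region marked with vertical lines in the answer. Positions are counted from zero.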

Exercises
Interactive ex­er­cise (the­o­ry)
Use the on­line BLAST Interactive Link be­low to learn how the al­go­rithm makes
and scores BLAST se­quence align­ments. Click on the dark circle with the yellow
letter I at the top of the page to learn how to use the BLAST Exercise teach­ing
in­ter­ac­tive. Once you learn how it works, solve the ac­tiv­ity prob­lem.

BLAST Interactive Link

Link: http://kelleybioinfo.org/algorithms/default.php?o=1

Problem

Practice with the BLAST Exercise Link, then solve the prob­lem be­low.

1. Write the best align­ment of the Query to each DNA se­quence in the boxes
and cir­cle the first match­ing word from the Query.

2. Calculate scores and rank the three align­ments.



Lab Exercises (Practice)


In this part of the ex­er­cise, you will learn how to an­al­yze mys­tery DNA and pro­
tein se­quences us­ing the BLAST al­go­rithm on­line at NCBI. You will also learn how
to in­ter­pret the out­­put from the pro­gram in­clud­ing what the val­ues mean and
how to find in­for­ma­tion about the best match in the da­ta­base to your query
­se­quence. You will also use a pro­gram, called ORF finder, that trans­lates a DNA se­
quence into likely pro­tein se­quences.

NCBI BLAST Tutorial

Link: http://kelleybioinfo.org/algorithms/tutorial/TAli1.pdf

Sample and lab exercise data:
http://kelleybioinfo.org/algorithms/data/DAli1.txt

Lab Exercise
Click on the sam­ple and lab exercise data link for the se­quence data used in this
ex­er­cise.

Part 1
Use NCBI BLAST tools to an­a­lyze the fol­low­ing DNA that you just se­quenced
from a plas­mid and an­swer the fol­low­ing ques­tions:

>Part1_Plasmid_Derived_Sequence
CGTTTACGGCGTGGACTACCAGGGTATCTAATCCTGTTCGCTCCCCAACGCTTTCGCTCCTCAGCGT
CAGTTACTGCCCAGAGACCCGCCTTCGCCACCGGTGTTCCTCCTGATATCTGCGCATTCCACCGCTA
CACCAGGAATTCCAGTCTCCCCTGC

1. Use the NCBI BLAST tool to per­form a se­quence search with the above
­se­quence.

a. The high­est-scoring BLAST hit is to what named or­gan­ism? (Ignore the
­un­known/uncultured or­gan­ism hits.)

b. What is the gene name?

c. What is the func­tion of the gene, if known? (Don’t know? Try ask­ing “Pro­
fessor” Google or “Dr.” Wikipedia!)

d. Who sub­mit­ted the se­quence?

e. From what in­sti­tu­tion?

f. Get the fol­low­ing data for this par­tic­u­lar match.

i. E-value:

ii. Identities:

Part 2

>Part2_Protein_Sequence
MNGTEGPNFYVPFSNKTGVVRSPFEYPQYYLAEPWQFSMLAAYMFLLIVLGFPINFLTLYVTVQHKK
LRTPLNYILLNLAVANHFMVFGGFTTTLYTSLHGYFVFGSTGCNLEGFFATLGGEIALWSLVVLAIE
RYVVVCKPMSNFRFGENHAIMGVAFTWVMALACAAPPLVGWSRYIPEGMQCSCGIDYYTLKPEVNNE
SFVIYMFVVHFTIPMTIIFFCYGQLVFTVKEAAAQQQESATTQKAEKEVTRMVIIMVIAFLICWVPY
ASVAFYIFTHQGSDFGPILMTLPAFFAKSSAIYNPVIYIMMNKQFRNCMLTTICCGKNPFGEEEGST
TASKTETSQVAPA

1. Use the NCBI BLAST tool to per­form a se­quence search with the pro­tein se­
quence above.

a. The high­est-scoring BLAST hit is to what or­gan­ism?

b. What is the gene?

c. What is the func­tion of the gene, if known?

d. Who sub­mit­ted the se­quence?

e. From what in­sti­tu­tion?

f. Get the fol­low­ing data for this par­tic­u­lar match.

i. E-value:

ii. Identities:

2. Use the NCBI ORF finder (https://www.ncbi.nlm.nih.gov/orffinder/) to translate
the Mystery DNA below. This sequence came from a bacterial culture, so
use the bacterial genetic code.

a. First of all­, what ex­actly is an “ORF” any­way?

b. What are the nu­cle­o­tide po­si­tions of the lon­gest ORF?

c. How many dif­fer­ent read­ing frames did the pro­gram re­veal that were lon­
ger than 100 amino ac­ids (aa)?

d. BLAST the lon­gest pu­ta­tive ORF.

i. What is the name of the gene?

ii. What is the name of the or­gan­ism?

>Part2_Mystery_DNA
TCCCTCCACAAAGAATGGAGCTGTGAACTACTAGCACGCAATGTGATTCCTGCAATTGAAAATGAACAAT
ATATGCTACTTATAGATAACGGTATTCCGATCGCTTATTGTAGTTGGGCAGATTTAAACCTTGAGACTGA
GGTGAAATATATTAAGGATATTAATTCGTTAACACCAGAAGAATGGCAGTCTGGTGACAGACGCTGGATT
ATTGATTGGGTAGCACCATTCGGACATTCTCAATTACTTTATAAAAAAATGTGTCAGAAATACCCTGATA
TGATCGTCAGATCTATACGCTTTTATCCAAAGCAGAAAGAATTAGGCAAAATTGCCTACTTTAAAGGAGG
TAAATTAGATAAAAAAACAGCAAAAAAACGTTTTGATACATATCAAGAAGAGCTGGCAACAGCACTTAAA
AATGAATTTAATTTTATTAAAAAATAGAAGGAGACATCCCTTATGGGAACTAGACTTACAACCCTATCAA
ATGGGCTAAAAAACACTTTAACGGCAACCAAAAGTGGCTTACATAAAGCCGGTCAATCATTAACCCAAGC
CGGCAGTTCTTTAAAAACTGGGGCAAAAAAAATTATCCTCTATATTCCCCAAAATTACCAATATGATACT
GAACAAGGTAATGGTTTACAGGATTTAGTCAAAGCGGCCGAAGAGTTGGGGATTGAGGTACAAAGAGAAG
AACGCAATAATATTGCAACAGCTCAAACCAGTTTAGGCACGATTCAAACCGCTATTGGCTTAACTGAGCG
TGGCATTGTGTTATCCGCTCCACAAATTGATAAATTGCTACAGAAAACTAAAGCAGGCCAAGCATTAGGT
TCTGCCGAAAGCATTGTACAAAATGCAAATAAAGCCAAAACTGTATTATCTGGCATTCAATCTATTTTAG
GCTCAGTATTGGCTGGAATGGATTTAGATGAGGCCTTACAGAATAACAGCAACCAACATGCTCTTGCTAA
AGCTGGCTTGGAGCTAACAAATTCATTAATTGAAAATATTGCTAATTCAGTAAAAACACTTGACGAATTT
GGTGAGCAAATTAGTCAATTTGGTTCAAAACTACAAAATATCAAAGGCTTAGGGACTTTAGGAGACAAAC
TCAAAAATATCGGTGGACTTGATAAAGCTGGCCTTGGTTTAGATGTTATCTCAGGGCTATTATCGGGCGC

AACAGCTGCACTTGTACTTGCAGATAAAAATGCTTCAACAGCTAAAAAAGTGGGTGCGGGTTTTGAATTG
GCAAACCAAGTTGTTGGTAATATTACCAAAGCCGTTTCTTCTTACATTTTAGCCCAACGTGTTGCAGCAG
GTTTATCTTCAACTGGGCCTGTGGCTGCTTTAATTGCTTCTACTGTTTCTCTTGCGATTAGCCCATTAGC
ATTTGCCGGTATTGCCGATAAATTTAATCATGCAAAAAGTTTAGAGAGTTATGCCGAACGCTTTAAAAAA
TTAGGCTATGACGGAGATAATTTATTAGCAGAATATCAGCGGGGAACAGGGACTATTGATGCATCGGTTA
CTGCAATTAATACCGCATTGGCCGCTATTGCTGGTGGTGTGTCTGCTGCTGCAGCCGGCTCGGTTATTGC
TTCACCGATTGCCTTATTAGTATCTGGGATTACCGGTGTAATTTCTACGATTCTGCAATATTCTAAACAA
GCAATGTTTGAGCACGTTGCAAATAAAATTCATAACAAAATTGTAGAATGGGAAAAAAATAATCACGGTA
AGAACTACTTTGAAAATGGTTACGATGCCCGTTATCTTGCGAATTTACAAGATAATATGAAATTCTTACT
GAACTTAAACAAAGAGTTACAGGCAGAACGTGTCATCGCTATTACTCAGCAGCAATGGGATAACAACATT
GGTGATTTAGCTGGTATTAGCCGTTTAGGTGAAAAAGTCCTTAGTGGTAAAGCCTATGTGGATGCGTTTG
AAGAAGGCAAACACATTAAAGCCGATAAATTAGTACAGTTGGATTCGGCAAACGGTATTATTGATGTGAG
TAATTCGGGTAAAGCGAAAACTCAGCATATCTTATTCAGAACGCCATTATTGACGCCGGGAACAGAGCAT
CGTGAACGCGTACAAACAGGTAAATATGAATATATTACCAAGCTCAATATTAACCGTGTAGATAGCTGGA
AAATTACAGATGGTGCAGCAAGTTCTACCTTTGATTTAACTAACGTTGTTCAGCGTATTGGTATTGAATT
AGACAATGCTGGAAATGTAACTAAAACCAAAGAAACAAAAATTATTGCCAAACTTGGTGAAGGTGATGAC
AACGTATTTGTTGGTTCTGGTACGACGGAAATTGATGGCGGTGAAGGTTACGACCGAGTTCACTATAGCC
GTGGAAACTATGGTGCTTTAACTATTGATGCAACCAAAGAGACCGAGCAAGGTAGTTATACCGTAAATCG
TTTCGTAGAAACCGGTAAAGCACTACACGAAGTGACTTCAACCCATACCGCATTAGTGGGCAACCGTGAAG
AAAAAATAGAATATCGTCATAGCAATAACCAGCACCATGCCGGTTATTACACCAAAGATACCTTGAAAG
CTGTTGAAGAAATTATCGGTACATCACATAACGATATCTTTAAAGGTAGTAAGTTCAATGATGCCTTTAA
CGGTGGTGATGGTGTCGATACTATTGACGGTAACGACGGCAATGACCGCTTATTTGGTGGTAAAGGCGAT
GATATTCTCGATGGTGGAAATGGTGATGATTTTATCGATGGCGGTAAAGGCAACGACCTATTACACGGTG
GCAAGGGCGATGATATTTTCGTTCACCGTAAAGGCGATGGTAATGATATTATTACCGATTCTGACGGCAA
TGATAAATTATCATTCTCTGATTCGAACTTAAAAGATTTAACATTTGAAAAAGTTAAACATAATCTTGTC
ATCACGAATAGCAAAAAAGAGAAAGTGACCATTCAAAACTGGTTCCGAGAGGCTGATTTTGCTAAAGAAG
TGCCTAATTATAAAGCAACTAAAGATGAGAAAATCGAAGAAATCATCGGTCAAAATGGCGAGCGGATCAC
CTCAAAGCAAGTTGATGATCTTATCGCAAAAGGTAACGGCAAAATTACCCAAGATGAGCTATCAAAAGTT
GTTGATAACTATGAATTGCTCAAACATAGCAAAAATGTGACAAACAGCTTAGATAAGTTAATCTCATCTG
TAAGTGCATTTACCTCGTCTAATGATTCGAGAAATGTATTAGTGGCTCCAACTTCAATGTTGGATCAAAG
TTTATCTTCTCTTCAATTTGCTAGAGCAGCTTAATTTTTAATGATTGGCAACTCTATATTGTTTCACACA
TTATAGAGTTGCCGTTTTATTTTATAAAAGGAGACAATATGGAAGCTAACCATCAAAGGAATGATCTTGG
TTTAGTTGCCCTCACTATGTTGGCACAATACCATAATATTTCGCTTAATCCGGAAGAAATAAAACATAAA
CHAPTER
02
PROTEIN ANALYSIS

Proteins are the workhorses of biology. There is a reason the central dogma of
molecular biology ends in the formation of proteins: every cellular and physiological
process within an organism involves proteins. Whether it is for synthesizing
RNA or DNA, propagating signals along nerve fibers, or destroying
disease-causing pathogens, proteins (mainly enzymes) do most or all of the
work. Proteins also determine the shape and mobility of cells (structural proteins),
act as gatekeepers for molecules entering and exiting cell membranes (membrane
proteins), respond to external signals such as hormones or glucose (receptor
proteins), and transmit signals inside of cells (cell signaling proteins).
Figure 2.1 presents a few examples of proteins that perform common cellular
processes.

Protein Bioinformatics
In order to understand the biological function of a protein, one must first understand
its physical properties. Ideally, one would know the entire three-dimensional
(3D) structure of a protein and the location and angle of every amino acid in
the protein at the atomic level. Unfortunately, experimental determination of protein
3D structure, primarily performed by first crystallizing the protein, is often a
monumental undertaking that is not possible for all types of structures (e.g., proteins
inside cell membranes known as transmembrane proteins). Protein crystal
structures can take months or years to determine, and many labs devote their entire
research program to the crystallization of important proteins.
Naturally, slow and me­thod­i­cal ex­per­i­men­tal meth­ods can­not pos­si­bly keep
up with the rate at which sci­en­tists are gen­er­at­ing pro­tein se­quence data. Sequenc­
ing tech­nol­o­gies can gen­er­ate the en­tire se­quence of mul­ti­ple bac­te­rial ge­nomes
in a day. Given that an av­er­age sized bac­te­rial ge­nome may en­code >3,000 pro­
tein se­quences, you can clearly see the need for com­pu­ta­tional meth­ods to speed
up this process of determining protein functions. To fill this need (see note 1), bioinformaticians
have designed algorithms that leverage our vast existing knowledge of amino
acids and protein structure to determine structural or
functional properties of proteins using only the primary protein sequence. This is
great because it is really easy to determine the primary sequence of a protein
given its DNA sequence: simply run the DNA sequence through a genetic code
translator and voilà, you have the protein sequence! Figure 2.2 illustrates the
basic process of computational translation of a DNA sequence that contains a
protein-coding sequence.

FIGURE 2.1. Examples of proteins involved in important organismal or cellular
processes. (A) Myosin. Image courtesy of David S. Goodsell/RCSB PDB, under license
CC BY-4.0. (B) Complex between nucleosome core particle and DNA fragment. Image
courtesy of Emw, based on PDB ID 1aoi, under license CC BY-3.0. (C) Cartoon representation
of anthrax toxin. Image courtesy of the European Bioinformatics Institute
(http://www.ebi.ac.uk/). (D) Crystal structure of ligand binding domain of RORγ and ligand.

Bioinformatics Methods
The algorithms we cover in this chapter attempt to predict physical and structural
properties of a protein using only its primary sequence, literally a text string of amino
acid letters. (Like this protein sequence of unknown function: DRKELLEYISRAD.)
Of course, the fastest method of determining the structure of a novel unknown
protein sequence is to find a highly similar match in a database to a protein with
known struc­ture. For ex­am­ple, it would be easy to de­ter­mine most of the struc­
ture and func­tion of a newly se­quenced bac­te­rial outer mem­brane pro­tein (OMP)
be­cause there are many well-characterized OMPs with 3D struc­tures al­ready in
GenBank, UniProt, and other da­ta­bases. Matches to more dis­tantly re­lated
se­quences with sim­i­lar struc­tures can also re­veal im­por­tant struc­tural fea­
tures. The Pfam (pro­tein fam­ily) da­ta­base, COG (clus­ters of orthologous groups)
da­ta­base, and other da­ta­bases use var­i­ous al­go­rithms, such as mul­ti­ple-sequence
align­ments and hid­den Mar­kov mod­els, to group pro­teins into var­i­ous func­tional
clas­ses. A sig­nif­i­cant match to a fam­ily or clus­ter could pro­vide in­sight into the
pro­tein’s struc­ture or func­tion.

FIGURE 2.2. Computational translation. Once the DNA sequence is determined,
we can use the genetic code to do a translation with the computer. We can skip over
the transcription step here, because the RNA is essentially a copy of the DNA sequence.
Since we initially do not know which strand of the DNA sequence is read by the RNA
polymerase, it is best to translate both the DNA sequence and its reverse complement
(the other strand). We also do not necessarily know the frame (reading frame). The
ribosome reads the RNA in groups of nucleotide triplets, called codons (see note 2). By shifting the
starting point of the ribosome, you change the reading frame for the entire sequence. In
this figure, frame 1 corresponds to a ribosome starting on the first A of the original DNA
sequence, frame 2 corresponds to a ribosome starting on the second A, and frame 3
corresponds to a ribosome starting on the first T. When determining a potential protein
sequence from a given DNA sequence, unless one knows the starting codon, the best
thing is to translate all 6 possible reading frames and pick the longest one that doesn’t
have a stop codon. In practice, one can also BLAST all 6 possible protein sequences and
pick the one that matches a known protein in the database.
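
A minimal Python sketch of this six-frame translation follows. It assumes the standard genetic code (written in the compact T, C, A, G ordering), marks stop codons with "*", and writes any codon containing an unexpected letter as "X"; the function names are hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in TTT, TTC, TTA, TTG, TCT, ... order.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Return the sequence of the other strand, read 5' to 3'."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate codon by codon from the first letter; '*' marks a stop codon."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Translate three frames on each strand; keys are '+1'..'+3' and '-1'..'-3'."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


def longest_open_stretch(protein):
    """Longest run of amino acids that is not interrupted by a stop codon."""
    return max(protein.split("*"), key=len)


for name, protein in six_frames("ATGGCCTAA").items():
    print(name, protein, longest_open_stretch(protein))
```

Picking the frame whose longest open stretch is the longest of all six mirrors the rule of thumb in the caption. A dedicated ORF finder does more than this sketch, for example by looking for a start codon as well.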

The prob­lem be­comes more dif­fi­cult as the pro­tein be­comes more dis­sim­i­lar
to known pro­teins. In this chap­ter, we cover meth­ods that pre­dict the prop­er­ties
of pro­tein se­quences based only on the amino acid com­po­si­tion of the se­
quence. The first method we dis­cuss pre­dicts the rel­a­tive hy­dro­pho­bic­ity of
pro­tein se­quences by scan­ning the se­quence and look­ing for re­gions of the
pro­tein with lots of hy­dro­pho­bic amino ac­ids. Hydrophobic means wa­ter (hy­dro-)
fear­ing (-phobic). Protein re­gions that are hy­dro­pho­bic are usu­ally found in­ter­
act­ing with hy­dro­pho­bic mol­e­cules, such as the lip­ids found in the cell mem­
brane. Figure 2.3 il­lus­trates some ex­am­ples of pro­teins with crit­i­cal hy­dro­pho­bic
re­gions.
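
The scanning idea can be sketched in a few lines of Python. The sketch assumes the widely used Kyte-Doolittle hydropathy scale (the activity may use a different table of values) and a hypothetical default window of nine residues; the function name is made up for this example.

```python
# Kyte-Doolittle hydropathy values (an assumed choice of scale for this sketch).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(protein, window=9):
    """Average hydropathy of each window as it slides along the sequence."""
    scores = [KYTE_DOOLITTLE[aa] for aa in protein.upper()]
    return [sum(scores[i:i + window]) / window
            for i in range(len(scores) - window + 1)]


# The short example sequence quoted earlier in the chapter, scanned with a small window.
for start, value in enumerate(hydropathy_profile("DRKELLEYISRAD", window=5), start=1):
    print(start, round(value, 2))
```

Windows with strongly positive averages are candidates for hydrophobic regions, while strongly negative averages point to hydrophilic ones. Longer windows smooth the profile and are better suited to finding long membrane-spanning stretches.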

FIGURE 2.3. Examples of hydrophobic regions in different proteins. (A) A hypothetical
transmembrane protein (red, green, and yellow ovals) passes through a cell membrane
lipid bilayer (blue circles, hydrophilic lipid bilayer heads; green oblong shapes, hydrophobic
tails). E, extracellular (outside the cell); I, intracellular (inside the cell); P, plasma membrane
(also just the cell membrane). The hydrophobic regions of the protein (green ovals) help
the protein localize in the cell membrane. Image courtesy of Magnus Manske, under
license CC BY-3.0. (B) Close-up of a steroid hormone-binding transcription factor (human
estrogen receptor). The steroid hormone (molecule with symmetrical rings in blue, center
top) binds in the hydrophobic pocket of the protein, changing the 3D structure of the
receptor so it can activate gene expression (see note 3). Steroid hormones are lipids (fats), which are
hydrophobic molecules. This is why the binding pocket also needs to be hydrophobic or it
would repel the steroid hormone molecules.

The second method we cover is a probability approach to determine the secondary
structure of a novel protein using the primary sequence. Protein 3D structures
are really combinations of two main secondary structural elements, alpha helices
and beta sheets, with loop regions connecting the various structural elements.
Figure 0.8 illustrates how the combinations of protein secondary structure elements
can combine to form the final 3D structure of a protein. The Chou-Fasman
algorithm that we look at in this chapter uses knowledge gleaned from many protein
sequences to determine how often particular amino acids are found in alpha
helices or beta sheets.

Notes
1. “I feel the need, the need for speed.”—Tom Cruise, pi­lot and bioinformatician.
2. The ri­bo­some matches the co­don with the tRNA an­ti-codon to co­va­lently bind to­gether the
amino ac­ids dur­ing pro­tein syn­the­sis.
3. An ex­am­ple of this is the an­dro­gen re­cep­tor, which binds the ste­roid hor­mone tes­tos­ter­one
and acts as a tran­scrip­tion fac­tor to al­ter the ex­pres­sion of mus­cle-building genes, es­pe­cially
in “roid ragers.” Or pro­fes­sional ath­letes.

ACTIVITY 2.1 HYDROPHOBICITY PLOTTING

Motivation
The purpose of this activity is to teach the concepts behind hydrophobicity plotting.
Hydrophobicity plots use a simple sequence scanning approach and experimental values of amino acid
hydrophobicity to determine which parts of a protein are hydrophobic. Generally speaking,
hydrophobicity is one of the most important properties of biological molecules. Proteins tend to be
a mix of both hydrophobic and hydrophilic amino acids. Hydrophobic (“water-fearing”) proteins
tend to be found in parts of the cell that are rich in lipids, such as the cell membrane, the nuclear
membrane, and vesicles. If we were to determine that a protein had many long sections of
hydrophobic amino acids, this could mean that the protein was part of a membrane or vesicle.
Similarly, if we were to determine that a part of the protein was hydrophilic (“water loving”), we
could predict that this part of the protein was not likely to be found in a membrane or binding
lipid molecules. Instead, we might predict that it was intra- or extracellular or bound to
hydrophilic charged molecules like DNA or RNA. In this section, you will learn a basic method of
hydrophobicity plotting for predicting hydrophobic (or hydrophilic) regions of any given protein sequence.
Afterwards, you will gain practice with online software that determines the most hydrophobic
regions of proteins using the same algorithm.

Learning Objectives
1. Understand the ba­sic con­cept of pro­tein hy­dro­pho­bic­ity and why it is im­por­tant for un­der­
stand­ing pro­tein func­tion (Motivation).
2. Learn how to use a slid­ing win­dow al­go­rithm and an amino acid hy­dro­pho­bic­ity scale to de­
ter­mine the most hy­dro­pho­bic re­gion of a pro­tein (Concepts and Exercises).
3. Use on­line hy­dro­pho­bic­ity soft­ware to plot hy­dro­pho­bic re­gions of a pro­tein se­quence and
in­ter­pret the out­­put (Concepts and Exercises).

Concepts
To bet­ter un­der­stand the method be­hind hy­dro­pho­bic­ity plot­ting, try the pre­pa­ra­tory ex­er­cise on
the next page, which asks you to cir­cle or un­der­line the most hy­dro­pho­bic re­gion of the pro­tein
se­quence “Protein Seq 1” us­ing in­for­ma­tion in the di­a­gram. The di­a­gram in­di­cates the amount
of free en­ergy needed to dis­solve each of the amino ac­ids found in bi­o­log­i­cal pro­tein se­quences
into a hy­dro­pho­bic sol­vent. The more neg­a­tive the change in free en­ergy value (∆G), the more
read­ily the amino acid dis­solves into the sol­vent (in this case, octanol) and the more hy­dro­pho­bic
the amino acid. For ex­am­ple, leu­cine (L) has a ∆G of −1.2. The more hy­dro­philic amino ac­ids have
high pos­i­tive ∆G val­ues (e.g., his­ti­dine [H] has a ∆G of +2.4).

Protein Seq 1:
E R K P Y L V A W M K R

FIGURE 2.1.1. Free en­er­gies for trans­fer of amino ac­ids from wa­ter to octanol. Green bars, charged
res­i­dues; orange bars, po­lar res­i­dues; pur­ple bars, hy­dro­pho­bic res­i­dues. Data from Bowie JU. 2005.
Nature 438:581–589.

Reflection
• How did you find the be­gin­ning of the hy­dro­pho­bic re­gion?
• Was search­ing for the hy­dro­pho­bic re­gion sim­i­lar in any way to the BLAST al­go­rithm? Why
or why not?
• Could you use the free en­ergy di­a­gram (Fig. 2.1.1) to score this hy­dro­pho­bic re­gion?
• Plot the av­er­age hy­dro­pho­bic­ity scores of the first 4 amino ac­ids, the mid­dle 4 amino ac­ids,
and the fi­nal 4 amino acids in the box on the next page.

First 4     Middle 4     Final 4

The most hydrophobic region of the sequence is underlined below. To determine a score for
this region, one might sum the free energy values of the amino acids in this 6-amino-acid-long
region. Better yet, the total could be divided by the length of the hydrophobic region to give the
average hydrophobicity of this region. In this case, the average for the most hydrophobic region
would be negative, because more negative ∆G values indicate greater hydrophobicity.

Protein Seq 1:
E R K P Y L V A W M K R

The hydrophobicity plotting method you will learn using the interactive module also scans
protein sequences a few amino acids at a time and calculates the average hydrophobicity.
However, instead of free energy values, we will use the hydropathy scores developed by Kyte and
Doolittle (a Kyte-Doolittle hydropathy plot), in which the most hydrophobic amino acids have
large positive values. For example, the Kyte-Doolittle hydropathy value for the hydrophobic amino
acid leucine is +3.8, while the hydropathy of the polar amino acid histidine is −3.2.
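
The scanning idea described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the code behind the interactive module or ProtScale: the function names and the window size of 5 are assumptions chosen for the example, and the values are the published Kyte-Doolittle hydropathy scores.

```python
# Kyte-Doolittle hydropathy values (more positive = more hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=5):
    """Return (start_index, average_hydropathy) for every full window."""
    scores = [KYTE_DOOLITTLE[aa] for aa in sequence]
    profile = []
    for start in range(len(scores) - window + 1):
        average = sum(scores[start:start + window]) / window
        profile.append((start, average))
    return profile


def most_hydrophobic_window(sequence, window=5):
    """Return (start_index, subsequence, average) for the highest-scoring window."""
    start, average = max(hydropathy_profile(sequence, window), key=lambda item: item[1])
    return start, sequence[start:start + window], average


seq = "ERKPYLVAWMKR"  # Protein Seq 1 from the preparatory exercise
for start, average in hydropathy_profile(seq):
    # Print 1-based window start, the window itself, and its average score.
    print(start + 1, seq[start:start + 5], round(average, 2))
print(most_hydrophobic_window(seq))
```

Plotting the averages against the window start position gives the hydropathy plot. Larger windows (such as the 19 used in the lab exercise) smooth the curve and emphasize long hydrophobic stretches.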

Exercises
Interactive ex­er­cise (the­o­ry)
Use the on­line hy­dro­pho­bic­ity ex­er­cise link be­low to learn how to find the
hy­dro­pho­bic re­gions of a pro­tein se­quence. The Interactive Link ex­plains how to
use the teach­ing in­ter­ac­tive. Once you learn how it works, solve the ac­tiv­ity
prob­lem.

Hydrophobicity Interactive Link


Link:
http://kelleybioinfo.org/algorithms/default.php?o=2

Problem
1. Fill in the two empty boxes us­ing the amino acid hy­dro­pho­bic­ity scores be­low.

2. Draw the hy­dro­pho­bic­ity plot in the gray box be­low.

3. Write the score and av­er­age for the last 5-amino-acid win­dow.

Lab Exercises (Practice)


In this part of the ex­er­cise, you will learn how to de­ter­mine the hy­dro­pho­bic re­
gions of pro­tein se­quences us­ing the ProtScale web­site. You will also use a pro­
gram called TMHMM, which uses hid­den Mar­kov mod­els to find trans­mem­brane
pro­teins.

ProtScale Tutorial Link


Link:
http://kelleybioinfo.org/algorithms/tutorial/TPro1.pdf

Sample and lab exercise data:
http://kelleybioinfo.org/algorithms/data/DPro1.txt

Lab Exercise

>ProteinSequence21A
MGPTSVPLVKAHRSSVSDYVNYDIIVRHYNYTGKLNISADKENSIKLTSVVFILICCFIILENIFVL
LTIWKTKKFHRPMYYFIGNLALSDLLAGVAYTANLLLSGATTYKLTPAQWFLREGSMFVALSASVFS
LLAIAIERYITMLKMKLHNGSNNFRLFLLISACWVISLILGGLPIMGWNCISALSSCSTVLPLYHKH
YILFCTTVFTLLLLSIVILYCRIYSLVRTRSRRLTFRKNISKASRSSEKSLALLKTVIIVLSVFIAC
WAPLFILLLLDVGCKVKTCDILFRAEYFLVLAVLNSGTNPIIYTLTNKEMRRAFIRIMSCCKCPSGD
SAGKFKRPIIAGMEFSRSKSDNSSHPQKDEGDNPETIMSSGNVNSSS

1. Determine the fol­low­ing about ProteinSequence21A:

a. What is this pro­tein? (Hint: you may need to use an­other al­go­rithm that we
have al­ready cov­ered to search for a close match to this se­quence.)

b. Create a hy­dro­pho­bic­ity plot (win­dow size = 19) with this se­quence data
us­ing the Kyte-Doolittle hy­dro­pho­bic­ity scale. Draw/show the plot be­low.

c. Repeat the plotting using the Eisenberg hydrophobicity scale, a scale based
on different free energy values. Describe briefly how the two plots compare in
terms of transmembrane predictions.

2. Use ProteinSequence21A with TMHMM. Draw/show the TMHMM plot be­low.

a. How do the plot re­sults com­pare to the Kyte-Doolittle and Eisenberg graph
pre­dic­tions? Explain.

ACTIVITY 2.2 PROTEIN SECONDARY STRUCTURE PREDICTION

Motivation
As de­scribed in the Chap­ter 02 in­tro­duc­tion, the struc­ture of a pro­tein can be de­scribed at three
dif­fer­ent lev­els: pri­mary (also writ­ten as 1°), sec­ond­ary (2°), and ter­tiary (3°). The pri­mary struc­
ture is the lin­ear or­der of amino ac­ids, and the ter­tiary struc­ture is the full 3D struc­ture. In be­
tween pri­mary and ter­tiary struc­tures stands the sec­ond­ary struc­ture, which also tells a great
deal about pro­tein struc­ture and is eas­ier to pre­dict than the ter­tiary struc­ture. The two ba­sic
types of sec­ond­ary struc­tures are al­pha he­li­ces and beta sheets. There are also loop re­gions,
which mainly serve to con­nect the he­li­cal and sheet struc­tural el­e­ments. Proteins are es­sen­tially
al­pha he­li­ces, beta sheets, and loops ar­ranged in dif­fer­ent com­bi­na­tions. Some ter­tiary struc­
tures are formed ex­clu­sively from one type of sec­ond­ary struc­ture (e.g., the struc­ture known as
a beta bar­rel), but more of­ten they have a mix of al­pha he­li­ces and beta sheets.
The pur­pose of this ac­tiv­ity is to teach the Chou-Fasman al­go­rithm for pre­dict­ing pro­tein sec­
ond­ary struc­ture based on the pri­mary pro­tein se­quence. The Chou-Fasman al­go­rithm was first
de­signed in the early 1970s by, you guessed it, Chou and Fasman. By look­ing at amino ac­ids in
pro­teins with known struc­ture, Chou and Fasman de­ter­mined how of­ten each amino acid ap­
peared in an al­pha he­lix, beta sheet, or loop re­gion. They then as­signed a like­li­hood score, called
a pro­pen­si­ty, for each of the 20 most common amino ac­ids. Some amino ac­ids are more com­
mon in al­pha he­li­ces than in beta sheets and vice versa. In this chap­ter, you will learn how to use
the pro­pen­si­ties to score a pro­tein se­quence as be­ing an al­pha-helical re­gion or a beta sheet and
then use on­line pre­dic­tion soft­ware on ac­tual pro­tein se­quences.

Learning Objectives
1. Understand the ba­sic con­cept of pro­tein sec­ond­ary struc­ture, the two main forms (al­pha he­
lix and beta sheet), and why it is im­por­tant for un­der­stand­ing pro­tein func­tion (Motivation).
2. Learn the prin­ci­ples be­hind the Chou-Fasman al­go­rithm and be ­able to cal­cu­late the rel­a­tive
scores for the like­li­hood of a pro­tein se­quence form­ing an al­pha he­lix or a beta sheet (Con­
cepts and Exercises).
3. Learn how to use a slid­ing-window al­go­rithm and sec­ond­ary struc­ture pro­pen­sity scores to pre­
dict whether a sec­tion of a pro­tein is an al­pha he­lix or a beta sheet (Concepts and Exercises).
4. Use online secondary structure prediction (Chou-Fasman) software to predict protein
secondary structure and interpret the output (Concepts and Exercises).

Concepts
To un­der­stand how to use the Chou-Fasman al­go­rithm, one must first un­der­stand the prin­ci­ple
of pro­pen­si­ty and how pro­pen­sity val­ues are calculated. The word propensity means an inclina­
tion or natural tendency to behave in a particular way. For example, I have a propensity for nerdi­
ness. (Set phasers to stun!) The sec­ond­ary structure pro­pen­sity of an amino acid is equiv­a­lent to
the prob­a­bil­ity (or the like­li­hood) of an amino acid be­ing part of a par­tic­u­lar type of sec­ond­ary
struc­ture. Three pro­pen­si­ties are cal­cu­lated for each amino ac­id:

P(a) = the propensity of it being in an alpha helix
P(b) = the propensity of it being in a beta sheet
P(turn) = the propensity of it being in a turn

Propensities greater than 100 mean that an amino acid is found in that particular structure
more often than expected by random chance, while propensities less than 100 mean that the
amino acid is less likely to be found in that type of secondary structure.
Here are three ex­am­ples:

Amino Acid        P(a)   P(b)   P(turn)
Alanine (A)       142    83     66
Threonine (T)     83     119    96
Asparagine (N)    67     89     156

These pro­pen­sity val­ues in­di­cate that al­a­nines are more likely to be found in al­pha he­li­ces than
chance would dic­tate and less likely to be in beta sheets or turns. Threonines have a higher pro­
pen­sity to be in beta sheets than in al­pha he­li­ces and turns, and as­par­a­gines are more likely to
be in turns than in al­pha he­li­ces or beta sheets. To be clear, just be­cause al­a­nines have a higher
pro­pen­sity to be in al­pha he­li­ces does not mean that they can­not also be in beta sheets or turns,
but it does mean that if you find a re­gion of a pro­tein se­quence with more al­a­nines, this re­gion
is more likely to be an al­pha he­lix.

Calculating Propensities
How are pro­pen­si­ties cal­cu­lated? Well, like most great bioinformatics val­ues, pro­pen­si­ties are
based on real (ex­per­i­men­tal) data. In this case, amino acid sec­ond­ary structure pro­pen­si­ties are
based on how many times each amino acid is found in al­pha he­li­ces, beta sheets, and turns in
known pro­tein struc­tures. Figure 2.2.1 shows an ex­am­ple of how to cal­cu­late the al­pha he­lix pro­
pen­sity [P(a)] of ly­sines.

Reflection 1
• Based on the data in Figure 2.2.1, what is the pro­pen­sity for ly­sines be­ing in a beta sheet? A
turn?
• If the ex­pected value for ly­sine be­ing in a turn was 0.1 be­cause only 10% (0.1) of the amino
ac­ids in the da­ta­base were in turns, how would this change the P(turn) for ly­si­ne?
• Could you use these num­bers to de­ter­mine whether a re­gion of a pro­tein was likely to be
an al­pha he­lix?

Where's Alphie?
Like "Where's Waldo" but not as frustrating!
The table following Figure 2.2.1 shows the propensities of each amino acid for being in an alpha helix,
beta sheet, or turn. Use the values to determine which region of Protein Seq 2 might be
an alpha helix.

Protein Seq 2:
A E E M H L R N G I Q C Q W Y F

FIGURE 2.2.1. Calculation of P(a) for lysines. (A) A protein database with 10 proteins,
each with 300 amino acids evenly split among alpha helices, beta sheets, and turns. (B) To
calculate the P(a), first determine the underlying probability of any given amino acid being in
an alpha helix. From the database we know that it is 1/3 (0.33), since 1/3 of all the amino
acids in the database have been experimentally determined to be in alpha helices. Focusing
just on lysines, of the 100 lysines found in the database, most of them are in alpha helices
(0.8 or 80%), which suggests a propensity of lysines to be in alpha helices. The observed
fraction of lysines in alpha helices is 0.80 (80/100), while the expected value is 0.33 for any
amino acid to be in an alpha helix. The propensity is pretty easy to calculate: divide the
observed value by the expected value and multiply by 100, which here gives
0.80 / (1/3) × 100 = 240.
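
The observed-over-expected calculation in Figure 2.2.1 can be written as a one-line function. The sketch below is a minimal illustration; the function and argument names are assumptions introduced for this example, and the counts are the ones given in the figure caption.

```python
def propensity(count_in_structure, count_total, fraction_expected):
    """Observed fraction divided by expected fraction, multiplied by 100."""
    observed = count_in_structure / count_total
    return 100 * observed / fraction_expected


# Figure 2.2.1: 80 of the 100 lysines are in alpha helices,
# and 1/3 of all residues in the database are in alpha helices.
p_alpha_lysine = propensity(80, 100, 1 / 3)
print(round(p_alpha_lysine))
```

Using the rounded expected value 0.33 instead of 1/3 shifts the result slightly (to about 242), which is why it helps to keep the expected value as a fraction until the final step.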

Amino Acid           P(a)   P(b)   P(turn)
Alanine (A)          142    83     66
Arginine (R)         98     93     95
Asparagine (N)       67     89     156
Aspartic acid (D)    101    54     146
Cysteine (C)         70     119    119
Glutamic acid (E)    151    37     74
Glutamine (Q)        111    110    98
Glycine (G)          57     75     156
Histidine (H)        100    57     95
Isoleucine (I)       108    160    47
Leucine (L)          121    130    59
Lysine (K)           114    74     101
Methionine (M)       145    105    60
Phenylalanine (F)    113    138    60
Proline (P)          57     55     152
Serine (S)           77     75     143
Threonine (T)        83     119    96
Tryptophan (W)       108    137    96
Tyrosine (Y)         69     147    114
Valine (V)           106    170    50

Reflection 2
• How does search­ing for al­pha he­li­ces com­pare to hy­dro­pho­bic­ity plot­ting or BLAST?
• How many amino ac­ids have a high pro­pen­sity for be­ing in both al­pha he­li­ces and beta
sheets?
• Are there any re­gions in the se­quence that might be more likely to be beta sheets?

Below is the an­swer. The un­der­lined re­gion has a large pro­por­tion of amino ac­ids that tend to be
found in al­pha he­li­ces. The mid­dle re­gion is more likely to be a turn (ital­ics), and the right side is
more likely to be a beta sheet (blue).
A E E M H L R N G I Q C Q W Y F
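
One way to check an answer like this is to average the propensities over each candidate region. The sketch below is a simplification and not the full Chou-Fasman procedure: it only compares average P(a), P(b), and P(turn) for a stretch of residues. The propensity values are copied from the table above; the function names and the three region boundaries (first 6, next 3, last 7 residues) are assumptions made for illustration.

```python
# (P(a), P(b), P(turn)) for each amino acid, copied from the propensity table.
PROPENSITY = {
    "A": (142, 83, 66),  "R": (98, 93, 95),   "N": (67, 89, 156),
    "D": (101, 54, 146), "C": (70, 119, 119), "E": (151, 37, 74),
    "Q": (111, 110, 98), "G": (57, 75, 156),  "H": (100, 57, 95),
    "I": (108, 160, 47), "L": (121, 130, 59), "K": (114, 74, 101),
    "M": (145, 105, 60), "F": (113, 138, 60), "P": (57, 55, 152),
    "S": (77, 75, 143),  "T": (83, 119, 96),  "W": (108, 137, 96),
    "Y": (69, 147, 114), "V": (106, 170, 50),
}
LABELS = ("alpha helix", "beta sheet", "turn")


def average_propensities(segment):
    """Average P(a), P(b), and P(turn) over a stretch of residues."""
    n = len(segment)
    return tuple(sum(PROPENSITY[aa][i] for aa in segment) / n for i in range(3))


def best_label(segment):
    """Name the structure type with the highest average propensity."""
    averages = average_propensities(segment)
    return LABELS[averages.index(max(averages))]


seq = "AEEMHLRNGIQCQWYF"  # Protein Seq 2
# Region boundaries below are assumed for illustration only.
for segment in (seq[:6], seq[6:9], seq[9:]):
    averages = [round(x, 1) for x in average_propensities(segment)]
    print(segment, averages, best_label(segment))
```

With these assumed boundaries, the left region scores highest as an alpha helix, the middle as a turn, and the right as a beta sheet, in line with the answer described above.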

Exercises
Interactive ex­er­cise (the­o­ry)
Use the on­line ex­er­cise link be­low to learn how
to pre­dict the sec­ond­ary struc­ture of a pro­tein
se­quence. The Interactive Link ex­plains how to use
the teach­ing in­ter­ac­tive. Once you learn how it
works, solve the ac­tiv­ity prob­lem.

Chou-Fasman Interactive Link


Link:
http://kelleybioinfo.org/algorithms/default.php?o=9

Problem
1. Circle the “start” and “stop” re­gions of the pro­tein se­quence us­ing the Chou-
Fasman al­go­rithm for an al­pha he­lix.

2. Calculate the score for the al­pha he­lix.

3. Is it an al­pha he­lix or not? Answer yes or no be­low and in­di­cate why you
reached this con­clu­sion. (Hint: cal­cu­late the score for the ex­act same re­gion of
the se­quence be­ing a beta sheet. Which is greater, the al­pha he­lix or the beta
sheet score?)

SCORE FOR ALPHA HELIX: ________


IS IT AN ALPHA HELIX? ___ WHY or WHY NOT?

Lab Exercises (Practice)


In this part of the ex­er­cise, you will learn how to an­a­lyze pro­tein se­quences us­ing
the Chou-Fasman al­go­rithm on­line. You will also learn how to in­ter­pret the out­­put
from the pro­gram, in­clud­ing what the val­ues mean and how to find in­for­ma­tion
about the best match in the da­ta­base to your query se­quence.
You will also use the UniProt web­site and the ProtParam tool to an­a­lyze pro­
tein se­quences.

ProtScale Tutorial Link


Link:
http://kelleybioinfo.org/algorithms/tutorial/TPro1.pdf

Sample and lab exercise data:
http://kelleybioinfo.org/algorithms/data/DPro1.txt

Lab Exercise

>ProteinSequence22A
MWVLINLLILMIMVLISVAFLTLLERKILGYIQDRKGPNKIMLFGMFQPFSDALKLLSKEWFFFNYSNLFIYSPMLMFFLS
LVMWILYPWFGFMYYIEFSILFMLLVLGLSVYPVLFVGWISNCNYAILGSMRLVSTMISFEINLFFLVFSLMMMVESFSFN
EFFFFQNNIKFAILLYPLYLMMFTSMLIELNRTPFDLIEGESELVSGFNIEYHSSMFVLIFLSEYMNIMFMSVILSLMFYG
FKYWSIKFILIYLFHICLIIWIRGILPRIRYDKLMNMCWTEMLMLVMIYLMYLYFMKEFLCI

1. Answer the fol­low­ing ques­tions about ProteinSequence22A.

a. First of all, what is this protein? (BLAST on UniProt website
http://www.uniprot.org/blast/)

Use the ProtParam tool (http://web.expasy.org/protparam/) for the next
questions.

b. What is the mo­lec­u­lar weight?

c. What is the num­ber of pos­i­tively charged res­i­dues?

>ProteinSequence22B
MGPTSVPLVKAHRSSVSDYVNYDIIVRHYNYTGKLNISADKENSIKLTSVVFILICCFIILENIFVLLTIWKTKKFHRPMY
YFIGNLALSDLLAGVAYTANLLLSGATTYKLTPAQWFLREGSMFVALSASVFSLLAIAIERYITMLKMKLHNGSNNFRLFL
LISACWVISLILGGLPIMGWNCISALSSCSTVLPLYHKHYILFCTTVFTLLLLSIVILYCRIYSLVRTRSRRLTFRKNISK
ASRSSEKSLALLKTVIIVLSVFIACWAPLFILLLLDVGCKVKTCDILFRAEYFLVLAVLNSGTNPIIYTLTNKEMRRAFIR
IMSCCKCPSGDSAGKFKRPIIAGMEFSRSKSDNSSHPQKDEGDNPETIMSSGNVNSSS

Use ProtScale for the Chou-Fasman ques­tions.

2. Check out ProteinSequence22B for different secondary structure elements.
(Use a window size of 21.)

a. Check out the Chou-Fasman alpha-helical predictions. How many regions
exceed a threshold of 1.1? (The threshold of 1.1 was chosen because
results are only raw scores, and one should only consider strong peaks when
interpreting graphs.)

b. Draw/show the Chou-Fasman al­pha-helical plot be­low.


CHAPTER 03
SEQUENCE ALIGNMENT

Alignments of biological sequences (DNA, RNA, and protein) have been a
centerpiece of bioinformatics from its inception. Chapter 01 covered the
rapid, high-performance pairwise BLAST aligner, and in this section we
address a more sophisticated, albeit slower, method for generating pairwise
sequence alignments that is guaranteed to produce a mathematically
optimal alignment. We also discuss how pairwise alignments can be turned into
multiple sequence alignments (MSAs).

What Is a Sequence Alignment?


A pairwise sequence alignment is a match of the ordered chemical letters
between two different sequences. The goal of a sequence alignment is to vertically
align the positions of the sequences that are homologous to one another (i.e.,
were derived from a common ancestor). For example, let’s say I have two DNA
sequences, one from a mouse and one from a human, both of which encode
a very important protein, namely a globin protein involved in
binding or transporting oxygen (e.g., hemoglobin). Mice and humans are
mammals. As such, they have inherited this gene from a common ancestor they
shared long ago. Figure 3.1 shows the basic structures of the mouse and human
globin proteins. The goal is to create a sequence alignment in which we properly
align parts of the two different, but related, proteins with the same function.

Sequence Alignments: Nature’s Experimental Results


DNA, RNA, and protein sequence alignments are fundamental to bioinformatics
research. In addition to allowing rapid identification of molecular sequences (e.g.,
BLAST), sequence alignments underpin a majority of bioinformatics algorithms in
one way or another. This is because MSAs of related molecular sequences from
different organisms provide the equivalent of an experimental readout of millions
or even billions of years of evolution. Instead of spending years manipulating
every nucleotide or amino acid position in a gene to determine the functional
consequences, one can examine an MSA full of related sequences to determine
which nucleotides (or amino acid positions) have experienced mutations and which
have not.

FIGURE 3.1. Structures of the mouse and human globin proteins. The bumpy blobs
represent the three-dimensional structures of the mouse and human globin proteins.
Notice how the structures are superimposable. The goal is to create a sequence alignment
of the protein sequence, or underlying DNA sequence, in which the homologous
functional regions of the proteins are aligned. In the process of making a sequence
alignment between the mouse and human globin protein primary amino acid sequences,
the sequence corresponding to mouse region 1 (M REGION 1) should be aligned to
human region 1 (H REGION 1), and mouse region 2 should be aligned to human region 2.
Clearly, it would make no sense to align the amino acids (or the underlying DNA that
codes for these amino acids) of mouse region 1 to human region 2. This is an easy
problem when the structures of the homologous molecules are available. The problem
becomes harder when one has only the two sequences and no structure, and harder still
the more distantly related (and different) the sequences are from one another.

To find the re­ally im­por­tant nu­cle­o­tides or amino ac­ids, any change of which
would de­stroy the func­tion of the mac­ro­mol­e­cule, here’s a hint: look for the con­
served align­ment po­si­tions, the po­si­tions that have not changed in any of the
se­quences. The more con­strained or con­served a nu­cle­o­tide or amino acid se­
quence is, the more im­por­tant it likely is for the func­tion­ing of the mac­ro­mol­e­
cule. Mutations at these po­si­tions would al­most cer­tainly lead to the death of the
or­gan­ism, and nat­u­ral se­lec­tion would re­move it from the gene pool. Figure 3.2
il­lus­trates ex­am­ples of se­quence align­ments show­ing both highly con­served po­
si­tions and var­i­able po­si­tions.
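
Finding conserved positions amounts to checking each alignment column for a single shared character. The sketch below illustrates that idea under simple assumptions: the toy alignment is hypothetical (it is not the data in Figure 3.2), the function names are made up for this example, and only fully identical, gap-free columns are marked.

```python
def conserved_columns(alignment):
    """Return 0-based indices of columns where every sequence has the same residue."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    conserved = []
    for col in range(length):
        column = {row[col] for row in alignment}
        # A column is conserved if it holds one character and that is not a gap.
        if len(column) == 1 and "-" not in column:
            conserved.append(col)
    return conserved


def conservation_line(alignment):
    """Build a marker line with '*' under each fully conserved column."""
    marks = set(conserved_columns(alignment))
    return "".join("*" if i in marks else " " for i in range(len(alignment[0])))


msa = [  # hypothetical toy alignment
    "GGTCAGCCTGACC",
    "GGTCACTGTGACC",
    "AGTCATCATGACC",
]
for row in msa:
    print(row)
print(conservation_line(msa))
```

The asterisk line mimics the markers that alignment programs print beneath an MSA. Real programs also flag partially conserved columns, which this sketch does not attempt.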
While the con­served po­si­tions in se­quence align­ments in­di­cate func­tion­ally
critical nucleotides or amino acids, the so-called variable positions can also be
use­ful. The types and pat­terns of mu­ta­tions al­lowed at these var­i­able po­si­tions

can be used in a large num­ber of bioinformatics ap­proaches that we will deal with
in other chap­ters, in­clud­ing

• RNA structure prediction
• Motif searching
• Weight matrix construction
• Phylogenetic analysis
• Transition matrices
• Protein secondary-structure prediction
• Hidden Markov models

FIGURE 3.2. MSAs of DNA and protein sequences indicating specific conserved and
variable positions. (A) MSA of a transcription factor binding site for the estrogen receptor
protein. The underlined regions indicate the nucleotides that directly bind the protein
transcription factor. Any mutation to the conserved nucleotides indicated by an asterisk would
prevent the binding of the estrogen receptor transcription factor and prevent the transcription
(and therefore translation) of a protein involved in fertility. (B) MSA of 5 related bacterial outer
membrane protein sequences. The proteins are channels that allow ions to move in or out of
bacterial cells. Asterisks indicate amino acids conserved in all the outer membrane proteins.
The periods and colons indicate amino acid positions with less conservation.

What Are the Challenges in Aligning Sequences?


If se­quences have iden­ti­cal re­gions, the se­quence align­ment prob­lem is triv­i­al:
just slide the se­quences along­side each other un­til you find a per­fect match.
However, when align­ing se­quences from dif­fer­ent spe­cies, this is of­ten not pos­
si­ble. As or­gan­isms be­come more dis­tantly re­lated, or if the rate of evo­lu­tion of a
par­tic­u­lar gene is high, the se­quences be­come more di­ver­gent and more dif­fi­cult
to align. While the se­quences in the align­ment per­form more or less the same
func­tional role in the dif­fer­ent or­gan­isms, over time un­der­ly­ing DNA se­quences
di­verge through the pro­cess of mu­ta­tion. Nucleotides of the DNA se­quence cod­
ing for the gene can be re­placed (for in­stance, an A might change to a G), ad­di­
tional nu­cle­o­tides can be in­serted or de­leted, and some­times big stretches of the
DNA can even be completely inverted. These changes occur for a variety of reasons.
Errors in copying (genome replication1) of the sequence, errors in recombination,
and mutagenic compounds can all cause mutations. If the mutations do
not com­pletely dis­able the gene and kill the or­gan­ism, or pre­vent re­pro­duc­tion,
these can be in­her­it­ed; some­times the mu­ta­tions even prove ad­van­ta­geous.
This leaves the fol­low­ing prob­lem, il­lus­trated in Fig. 3.3: how to make an ac­cu­
rate se­quence align­ment be­tween se­quences with many dif­fer­ences that are not
even of the same length.
The most difficult and important aspect of proper sequence alignments is putting
gap characters in the right place. The gap characters represent insertions or
deletions, also called indels (Fig. 3.3). The reason for using the term indel is that
without knowing the history of the mutations, it is impossible to know if the gaps
in the alignment represent an insertion in one sequence or a deletion in the other.
To solve this problem, bioinformaticians invented types of dynamic programming
algorithms that have scoring schemes for nucleotide (or amino acid) matches,
mismatches, and gaps (indels). The activity section for this chapter explains how a
dynamic programming approach uses scores for matches, mismatches, and gaps
to determine the best alignment of two sequences. Generally, matches are
preferred, so they get a positive score, while mismatches and gaps are to be
avoided, so they get a zero or negative score (the score assigned to a gap is also
known as a gap penalty). This maximizes the matches and minimizes mismatches
or gaps, though it does not eliminate them.
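
To make the dynamic programming idea concrete, here is a minimal sketch of a global pairwise aligner. It uses the classic Needleman-Wunsch recurrence as one well-known instance of such an algorithm; the activity for this chapter may present the details differently. The function name and the default scores (+1 match, −1 mismatch, −1 gap) are assumptions chosen for illustration.

```python
def global_alignment(seq1, seq2, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment; returns (score, aligned1, aligned2)."""
    rows, cols = len(seq1) + 1, len(seq2) + 1
    # score[i][j] holds the best score for aligning seq1[:i] with seq2[:j].
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if seq1[i - 1] == seq2[j - 1] else mismatch
            diagonal = score[i - 1][j - 1] + step  # align the two characters
            up = score[i - 1][j] + gap             # gap in seq2
            left = score[i][j - 1] + gap           # gap in seq1
            score[i][j] = max(diagonal, up, left)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    aligned1, aligned2 = [], []
    i, j = len(seq1), len(seq2)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            step = match if seq1[i - 1] == seq2[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + step:
                aligned1.append(seq1[i - 1])
                aligned2.append(seq2[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned1.append(seq1[i - 1])
            aligned2.append("-")
            i -= 1
        else:
            aligned1.append("-")
            aligned2.append(seq2[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(aligned1)), "".join(reversed(aligned2))


best_score, top, bottom = global_alignment("ACGT", "ACT")
print(best_score)
print(top)
print(bottom)
```

Because every cell is filled from its three neighbors, the final score is the best achievable under the chosen scoring scheme. Changing the match, mismatch, or gap values can change which alignment is optimal.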

Issues in Sequence Alignment


The big­gest hur­dle to ac­cu­rate se­quence align­ment is the scor­ing sys­tem. With
DNA se­quences, the scor­ing sys­tem is fairly ar­bi­trary: +1 for a match, −1 for a
mis­match, and 0 for a gap is pretty com­mon. However, these num­bers can eas­
ily be changed. For in­stance, if the se­quences are closely re­lated, mean­ing that
there has been less evo­lu­tion­ary time for the ac­cu­mu­la­tion of in­ser­tion and de­le­
tion mu­ta­tions which would cre­ate gaps in the se­quence align­ment, one could
use a more neg­at­ive gap pen­alty score. For closely re­lated pro­tein-coding DNA
se­quences, most de­fault scor­ing schemes work very well. However, for dis­tantly
re­lated se­quences, it is of­ten dif­fi­cult to de­ter­mine the proper gap pen­al­ties.2
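
The effect of changing the scoring scheme is easy to see by scoring one fixed alignment under different gap penalties. This sketch uses the default values mentioned above (+1 match, −1 mismatch, 0 gap); the function name and the toy alignment are hypothetical and introduced only for illustration.

```python
def score_alignment(aligned1, aligned2, match=1, mismatch=-1, gap=0):
    """Sum column scores for two already-aligned sequences ('-' marks a gap)."""
    if len(aligned1) != len(aligned2):
        raise ValueError("aligned sequences must have equal length")
    total = 0
    for a, b in zip(aligned1, aligned2):
        if a == "-" or b == "-":
            total += gap
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


top, bottom = "ATGCCGTA", "ATG--GAA"  # hypothetical toy alignment
print(score_alignment(top, bottom))          # scheme described in the text
print(score_alignment(top, bottom, gap=-2))  # a stricter gap penalty
```

With a gap score of 0 the two gap columns cost nothing, whereas a gap penalty of −2 lowers the total by 4, which would push an aligner toward alignments with fewer gaps.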

FIGURE 3.3. Mutational history of a protein-coding gene after evolution from a
common ancestor. (A) The two unaligned sequences at the top are derived from a
hu­man and a fruit fly. These two sec­tions of DNA code for a pro­tein with the same cel­lu­lar
func­tion, but they are clearly not iden­ti­cal. (B) The boxes in the mid­dle show what mu­ta­tions
lead to the di­ver­gence in the se­quences. The se­quence on the left is the evo­lu­tion­ary
an­ces­tor of both the hu­man and the fruit fly. The lin­e­age lead­ing to hu­mans ac­cu­mu­lated
two sub­sti­tu­tion mu­ta­tions (shown in red), while the lin­e­age lead­ing to fruit flies ac­cu­mu­
lated one sub­sti­tu­tion mu­ta­tion (shown in green) and a de­le­tion of three nu­cle­o­tides (shown
in blue brack­ets). (C) The fi­nal se­quence align­ment shows how these mu­ta­tions lead to
mis­matches and how the de­le­tion makes one se­quence shorter than the other, which needs
to be ac­counted for by gap char­ac­ters when per­form­ing the align­ment.

It turns out that the protein sequences themselves are much easier to align
than the DNA se­quences that code for the pro­teins. Not only is there less re­dun­
dancy, but one can also use so­phis­ti­cated scor­ing schemes for matches and mis­
matches. These scor­ing schemes are known as the PAM and BLOSUM ma­tri­ces
(see Chap­ter 07 for de­tails on ma­trix cal­cu­la­tion) and pro­vide a score when an
amino acid in one se­quence matches the amino acid in the com­pared se­quence
and a mis­match score for ev­ery other amino acid. In protein sequence alignments
mis­matches can some­times be useful, as we will dis­cuss in Chap­ter 07. Activity 3.1
teaches how to use these scor­ing schemes to align pro­tein se­quences.
The most dif­fi­cult types of se­quences to align are non­cod­ing RNA or DNA se­
quences. Unlike pro­tein-coding genes, non­cod­ing DNA (e.g., pro­moter re­gions
up­stream of the cod­ing re­gion) and DNA that en­codes struc­tural RNA can ac­cu­
mu­late lots of in­ser­tions or de­le­tions and still func­tion. Insertions or de­le­tions in
pro­tein-coding genes most of­ten lead to frame­shift mu­ta­tions, which re­sult in
com­pletely new and non­func­tion­ing pro­tein se­quences which are usu­ally elim­i­
nated by nat­u­ral se­lec­tion. For RNA se­quences, align­ments are of­ten per­formed
by com­bin­ing in­for­ma­tion on the RNA’s struc­ture with a dy­namic pro­gram­ming
method. Other meth­ods, such as slid­ing win­dow al­go­rithms, can be used to align
non­cod­ing DNA.

Multiple-Sequence Alignment
In this chap­ter, we mainly fo­cus on how to gen­er­ate a se­quence align­ment for two
sequences. In practice, one usually wants to create an MSA. It turns out that MSAs
can be cre­ated by clever ex­ten­sion of pairwise se­quence align­ments. The pro­
gres­sive align­ment method de­scribed by Paulien Hogeweg and Ben Hesper in
1984 was ef­fec­tively in­te­grated by the cre­a­tors of the ClustalW pro­gram a de­cade
later. The lat­est ver­sion of this soft­ware, one of the most-cited bioinformatics pro­
grams, is called Clustal Omega, which you will learn to use in Activity 3.1.
The ba­sic ap­proach of all­the pro­gres­sive align­ment meth­ods is as fol­lows.

1. Make pairwise align­ments of all­the se­quences.


2. Produce a guide tree based on the dis­tances be­tween all­the pairs (see the
dis­tance method in Chap­ter 06 for an ex­am­ple).
3. Progressively align se­quence pairs us­ing the guide tree, start­ing with the most
sim­i­lar se­quences.

FIGURE 3.4. Progressive sequence alignment of four DNA sequences. At the top (left
to right), beginning with the unaligned DNA sequences, all possible pairs are aligned using
a dynamic programming approach. Distances are created from the sequence alignments
and used to build a guide tree. At the bottom (left to right), using the guide tree, the most
similar sequences are aligned in pairs. Consensus sequences are made from the pairs,
and then the pair of consensus sequences is aligned. Note how the final MSA retains the
gaps from the previous pairwise alignments as well as from the consensus pair alignment.

The key to this approach is that at each stage of the progressive alignment
the program always aligns only two sequences. Figure 3.4 shows an example of
a progressive MSA with four DNA sequences.
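
As a rough illustration of steps 1 and 2, the Python sketch below scores every pair of sequences with a simple global alignment and ranks the pairs from most to least similar, which is where a guide tree would start. It is not how Clustal is implemented. The four example sequences, the names `nw_score` and `join_order`, and the scoring values (match +1, mismatch 0, gap −1) are assumptions made for this example, and the guide tree and consensus alignment steps are left out.

```python
from itertools import combinations


def nw_score(a, b, match=1, mismatch=0, gap=-1):
    """Return the best global alignment score of a and b (score only)."""
    # prev holds the previous row of the dynamic programming grid.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def join_order(seqs):
    """Rank all sequence pairs from most similar to least similar."""
    scores = {
        (x, y): nw_score(seqs[x], seqs[y])
        for x, y in combinations(sorted(seqs), 2)
    }
    return sorted(scores, key=lambda pair: -scores[pair])


# Hypothetical sequences, chosen only to make two obvious pairs.
EXAMPLE_SEQS = {
    "s1": "GATTACA",
    "s2": "GATTTACA",
    "s3": "CCGTA",
    "s4": "CCGTTA",
}

order = join_order(EXAMPLE_SEQS)
```

Here the first two entries of `order` are the pairs a progressive aligner would align first; the consensus sequences of those pairs would then be aligned to each other.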

Notes
1. Retroviruses (such as HIV) have high rates of mutations that occur during genome replication
and result in many mutant viral variants. Most result in nonfunctional viruses, but others
result in new functioning variants of the virus that allow the virus to escape the immune
system. Cool but evil, sort of like Darth Vader.
2. Good sequence alignments are vital for proper data interpretation. For years, poor sequence
alignments that did not properly address large insertion or deletion mutations (gaps) led
researchers to conclude that birds were related to mammals. This changed when a research
group in China took a closer look and noticed that the sequence alignment was wrong.

ACTIVITY 3.1 DYNAMIC PROGRAMMING

Motivation
Alignments of different DNA, RNA, or protein sequences are a fundamental aspect of
bioinformatics. Sequence alignments allow similarity comparisons (e.g., BLAST) for function prediction
and organism identification. They also identify when mutations have occurred, such as the
occurrence of single nucleotide polymorphisms in the human genome or the evolution of viruses
during a pandemic. Multiple-sequence alignments are also critical for a number of other
bioinformatics approaches, including many of the other bioinformatics tools covered in this book.
The purpose of this activity is to teach a dynamic programming algorithm for the global
(end-to-end) alignment of any two DNA or protein sequences.¹ This algorithm, called the
Needleman-Wunsch method, uses a scoring system and a grid matrix to efficiently determine the best
alignment between a pair of DNA or protein sequences. Dynamic programming algorithms are
common in mathematics, computer science, and bioinformatics. An algorithm is considered
“dynamic programming” or “dynamic optimization” if it breaks down a problem into a set of
easily solvable subproblems, each of which is stored and used for the eventual solution. (If only this
would work with personal relationships.) The cool thing about these solutions is that, given a set
of simple assumptions, they are guaranteed to produce the optimal sequence alignment.² In this
case, the assumptions are a set of scores for the matching of two nucleotides, for the
mismatching of two nucleotides, and for gaps when there has been an insertion or a deletion. This chapter will
teach you how to solve dynamic programming problems for pairwise DNA and protein sequences
and then teach you how to use an online program for constructing MSAs. The Needleman-Wunsch
algorithm and dynamic programming can be a bit confusing at first glance, so make sure to read
through the tutorial on the website carefully and practice using the interactives.

Learning Objectives
1. Learn the importance of sequence alignment and the challenges imposed by different types
of mutations in generating an accurate sequence alignment (Motivation).
2. Use match, mismatch, and gap penalties to create a dynamic programming matrix for a
pairwise DNA sequence alignment (Concepts and Exercises).
3. Use the matrix to perform the traceback step and determine the final, best global
(end-to-end) alignment of two sequences (Concepts and Exercises).
4. Do the same for pairwise protein sequence alignment using the specialized protein scoring
tables PAM and BLOSUM (Concepts and Exercises).
5. Learn how to use the Clustal Omega program to perform pairwise and multiple-sequence
alignments (Concepts and Exercises).

Concepts
To prepare you to understand the principles and goals of dynamic programming, try the
following anticipatory exercise. The Needleman-Wunsch method, and other pairwise sequence
alignment methods, uses two-dimensional graphs to solve the alignment problem. Below you will
find graphs that indicate two possible alignments of the same pair of DNA sequences, Seq1 and
Seq2. Follow the paths to determine how to align the nucleotides (letters) of the two different
sequences. Note that one sequence is longer than the other but the alignment is “global,” which
means that the sequences must be aligned from end to end. This also means that, if the
sequences are of different lengths or very divergent from one another, you will need to adjust with
spaces or other characters to create gaps in the sequence so that the correct bases line up
properly. Most often hyphens are used to indicate these gaps (indels). Write the alignments below
the graphs.
To begin the problem, in each graph start at the top left square and follow the X’s. The X’s
indicate the path for each alignment. Notice how Seq1 is shorter than Seq2 but in the graph they
are aligned end to end. This means that you must have gaps to get them to align properly. Both
paths indicate that the G in Seq1 should be aligned to the G in Seq2. From there, both paths
“move” diagonally and have the A in Seq1 aligned to the A in Seq2. However, the next X in the
third square is diagonal in path 1 but vertical in path 2. In path 2, this means that there are no
bases in Seq1 to align with the G in Seq2, which creates a gap (indicated by a hyphen).
The first three positions of the alignments are given below. Try finishing the rest for both
paths and answer the reflection questions.

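Seq1: GATTTA
Seq2: GAGTTCA

PATH 1:                PATH 2:

    G A T T T A            G A T T T A
G   X . . . . .        G   X . . . . .
A   . X . . . .        A   . X . . . .
G   . . X . . .        G   . X . . . .
T   . . X . . .        T   . . X . . .
T   . . . X . .        T   . . . X . .
C   . . . . X .        C   . . . . X .
A   . . . . . X        A   . . . . . X

Path 1:
Seq1: GAT _________________
Seq2: GAG _________________

Path 2:
Seq1: GA- _________________
Seq2: GAG _________________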

Reflection
• How do you align the letters of the DNA when the path moves diagonally? Do the letters
always match?
• What happens when the paths move vertically? Would a horizontal move be different and, if
so, how?
• Could this type of graphing method be used for protein sequences?
• Which of the alignments seems better? Could some kind of scoring scheme help?

Below is the answer. Each path showed a different alignment for the same pair of sequences
(Seq1 and Seq2). Because Seq1 is shorter, in an end-to-end alignment every letter in Seq1 is
paired with a letter in Seq2, but not every Seq2 letter has a partner in Seq1. Thus, in order to get
the sequences to align properly, it is necessary to put some gaps (spaces) in the sequence
alignment, indicated by hyphens. The graph indicates what letters you should match up (a diagonal
move is a match or a mismatch). However, when the path moves vertically, it stays on one
sequence but moves along the other. This indicates that you must put a gap in the sequence
alignment in the sequence that does not move. (At this point, I'm feeling quite moved.) Since in
this case it is always Seq1, that is where all the gaps are placed. (A horizontal move would have
been a gap in Seq2.)

Path 1:
Seq1: GAT-TTA
Seq2: GAGTTCA

Path 2:
Seq1: GA-TTTA
Seq2: GAGTTCA

So which of the two alignments for these same sequences is better? The second alignment
path seems better, but we cannot know without a scoring scheme of some sort. Computers
need numbers, and that is where the Needleman-Wunsch algorithm comes in.
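
The path-following rule from the exercise can be written out as a short Python sketch. The one-letter move codes ("D" for diagonal, "V" for vertical, "H" for horizontal) and the function name `path_to_alignment` are assumptions made for this example; the first X in the top left square is treated as a diagonal move into that square.

```python
def path_to_alignment(seq_top, seq_side, moves):
    """Turn a path through the grid into a pair of aligned strings.

    seq_top runs across the top of the grid and seq_side runs down the side.
    """
    top, side = [], []
    i = j = 0  # next unused letter in seq_top and in seq_side
    for move in moves:
        if move == "D":    # diagonal: pair one letter from each sequence
            top.append(seq_top[i])
            side.append(seq_side[j])
            i += 1
            j += 1
        elif move == "V":  # vertical: advance the side sequence, gap on top
            top.append("-")
            side.append(seq_side[j])
            j += 1
        elif move == "H":  # horizontal: advance the top sequence, gap on side
            top.append(seq_top[i])
            side.append("-")
            i += 1
        else:
            raise ValueError(f"unknown move: {move!r}")
    if i != len(seq_top) or j != len(seq_side):
        raise ValueError("a global path must use every letter of both sequences")
    return "".join(top), "".join(side)


# Seq1 is across the top and Seq2 is down the side, as in the exercise.
path1 = path_to_alignment("GATTTA", "GAGTTCA", "DDDVDDD")
path2 = path_to_alignment("GATTTA", "GAGTTCA", "DDVDDDD")
```

The two results reproduce the Path 1 and Path 2 answers given above.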

Finding the path through the graph


Hopefully you know how to find the path through the graph and make a sequence alignment.
(Though if you don’t, keep reading.) But how do we determine the ideal path in the first place?
This is where dynamic programming comes in. Figure 3.1.1 shows how to use dynamic
programming for determining the best path through a two-dimensional graph. Note that in solving
the problem, the algorithm calculates the values for just one cell at a time. That is the essence
of dynamic programming: breaking a large problem into manageable subproblems. Then, after
the scores for all the cells are determined, the algorithm traces back through the graph to find
the best alignment. The Fig. 3.1.1 example uses the Needleman-Wunsch dynamic programming
method to align a pair of short DNA sequences: ACT and ACA. A lame problem, admittedly, but
Rome wasn't built in a day, was it? A critical aspect of the method is establishing a scoring
scheme for matches, mismatches, and gaps. Given these scores, the method is guaranteed to
provide the optimal solution. A typical scoring scheme might be something like the following.

Scoring Scheme: Match = +1; Mismatch = 0; Gap Penalty = −1
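
To see how a scoring scheme turns an alignment into a number, the sketch below applies this scheme to the two alignments from the anticipatory exercise. The function name `score_alignment` is an assumption made for this example.

```python
def score_alignment(row1, row2, match=1, mismatch=0, gap=-1):
    """Sum the column scores of two aligned strings of equal length."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for x, y in zip(row1, row2):
        if x == "-" or y == "-":
            total += gap        # a letter opposite a gap
        elif x == y:
            total += match      # identical letters
        else:
            total += mismatch   # different letters
    return total


path1_score = score_alignment("GAT-TTA", "GAGTTCA")
path2_score = score_alignment("GA-TTTA", "GAGTTCA")
```

Under this scheme Path 1 scores 3 and Path 2 scores 4, which puts a number on the impression that the second alignment is better.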



FIGURE 3.1.1. Solving a dynamic programming alignment problem. The value for
each cell is based on the values from the surrounding cells. (1) Create a grid with your two
sequences, adding an extra row and column (so, in this example, a 3-by-3 sequence
becomes a 4-by-4 grid). Start with a 0 score in the top left corner. (Gotta start somewhere!)
(2) Moving horizontally or vertically, use the gap penalty score. To fill in the top outside row,
move horizontally and add the gap penalty to the score from the previous cell. For instance,
moving right, add a −1 to the score from the original top left cell. Do the same to the leftmost
column going down from the initial starting cell. (3) To determine the score for an empty cell,
calculate the following three scores: the diagonal (match or mismatch), horizontal (gap
penalty), and vertical (gap penalty), adding each score to that of the starting cell. In this case,
the diagonal is a match (A to A), so we add +1 to the 0 that was in the top left cell, getting a
diagonal score of +1. Next, determine the gap scores for the cell coming from the vertical and
horizontal.³ The gap penalty in this example is −1. The horizontal and vertical starting cells each
have a score of −1, so the horizontal and vertical scores for the new cell are both −2. (4) Pick
the highest (maximal) of the three numbers, and that is the final score for the cell, in this case
+1. (5) The process repeats itself as you move to the next empty cell. (6) In this case, the
diagonal is a mismatch (C to A), so we add 0 to the starting cell’s score, giving us a −1
diagonal score. The vertical −1 gap penalty adds to the vertical starting score of +1, giving us a
final vertical score of 0. The horizontal −1 gap penalty adds to the horizontal starting score of
−2, giving us a final horizontal score of −3. So, the final score for the new cell is 0. (7) The
process is repeated as you continue to fill out the rest of the empty cells until the entire
graph is completed (8). (9) Solving the graph, traceback, and alignment. The
Needleman-Wunsch algorithm creates a global (end-to-end) sequence alignment.⁴ Once the graph
is solved, start the traceback at the bottom right corner (the highest final score) for the
global alignment. The arrows show the path of the traceback, which goes back to the
cell from which the highest number was derived. In this case, the “2” is the highest,
and it came from the diagonal cell. (10) Finish the traceback to the top left, and then
(11) follow the path to make the alignment (12).
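
The fill and traceback steps described in Fig. 3.1.1 can be sketched in Python as follows. This is a minimal illustration under the scoring scheme above (match +1, mismatch 0, gap −1), not the code behind the online interactive; the function name `needleman_wunsch` and the choice to prefer the diagonal move when scores tie are assumptions made for this example.

```python
def needleman_wunsch(seq_top, seq_side, match=1, mismatch=0, gap=-1):
    """Fill the grid, trace back, and return (grid, aligned_top, aligned_side)."""
    rows, cols = len(seq_side) + 1, len(seq_top) + 1
    grid = [[0] * cols for _ in range(rows)]

    # Steps 1-2: the extra top row and left column hold accumulated gap penalties.
    for j in range(1, cols):
        grid[0][j] = grid[0][j - 1] + gap
    for i in range(1, rows):
        grid[i][0] = grid[i - 1][0] + gap

    # Steps 3-8: each cell takes the best of the diagonal, vertical, and
    # horizontal scores.
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq_top[j - 1] == seq_side[i - 1] else mismatch
            grid[i][j] = max(
                grid[i - 1][j - 1] + pair,  # diagonal: match or mismatch
                grid[i - 1][j] + gap,       # vertical: gap in seq_top
                grid[i][j - 1] + gap,       # horizontal: gap in seq_side
            )

    # Steps 9-12: trace back from the bottom right corner to the top left.
    top, side = [], []
    i, j = rows - 1, cols - 1
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            pair = match if seq_top[j - 1] == seq_side[i - 1] else mismatch
            if grid[i][j] == grid[i - 1][j - 1] + pair:
                top.append(seq_top[j - 1])
                side.append(seq_side[i - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and grid[i][j] == grid[i - 1][j] + gap:
            top.append("-")
            side.append(seq_side[i - 1])
            i -= 1
        else:
            top.append(seq_top[j - 1])
            side.append("-")
            j -= 1
    return grid, "".join(reversed(top)), "".join(reversed(side))


grid, aligned_top, aligned_side = needleman_wunsch("ACT", "ACA")
```

For ACT and ACA the bottom right cell holds 2, and the traceback follows the diagonal all the way to the top left, so the two sequences are aligned without gaps.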

Exercises
Interactive exercise (theory)
Use the online Needleman-Wunsch alignment interactive link below to learn how
to create scoring matrices, solve tracebacks, and produce pairwise sequence
alignments. The Interactive Link explains how to use the teaching interactives. Once you
learn how it works, solve the activity problem.

Needleman-Wunsch DNA/Protein Alignment

Interactive Link:
http://kelleybioinfo.org/algorithms/default.php?o=8

Problems
1. Fill in the blank cells using the Needleman-Wunsch algorithm and then
complete the traceback. Using the traceback, write the sequence alignment in the
spaces below.

2. Fill in the blanks in this protein alignment using the Needleman-Wunsch
algorithm and the PAM250 scoring matrix, shown in Fig. 3.1.2 (PAM250 is
discussed further in Chapter 07). For this problem all the gap scores are −5, but
the match and mismatch scores come from the PAM250 matrix. For example,
an R-to-W mismatch according to the PAM250 matrix is +2, while an M-to-M
match is +6. (Make sure that you can find these values in the PAM250 matrix.)
To practice this type of problem, click on the Align Proteins button in the
interactive module.
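
The only change from the DNA case is where the diagonal score comes from: it is looked up in a table instead of being a fixed match or mismatch value. The sketch below shows that lookup using just the two PAM250 values quoted in problem 2; the names `PAM250_SUBSET`, `pair_score`, and `score_protein_alignment` are assumptions made for this example, and a real run would need the full matrix from Fig. 3.1.2.

```python
# Only the two PAM250 entries quoted in the text; the full matrix has
# an entry for every pair of amino acids.
PAM250_SUBSET = {("R", "W"): 2, ("M", "M"): 6}
GAP_SCORE = -5


def pair_score(a, b):
    """Score one alignment column: gap penalty or substitution-matrix lookup."""
    if a == "-" or b == "-":
        return GAP_SCORE
    # The matrix is symmetric, so try both orders of the pair.
    if (a, b) in PAM250_SUBSET:
        return PAM250_SUBSET[(a, b)]
    if (b, a) in PAM250_SUBSET:
        return PAM250_SUBSET[(b, a)]
    raise KeyError(f"no score in this subset for {a}/{b}")


def score_protein_alignment(row1, row2):
    """Sum the column scores of two aligned protein strings."""
    return sum(pair_score(a, b) for a, b in zip(row1, row2))


example_score = score_protein_alignment("RM-", "WMK")  # 2 + 6 - 5
```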

FIGURE 3.1.2. PAM250 scor­ing ma­trix.



Lab Exercises (Practice)


In this part of the exercise, you will learn how to use the Clustal Omega program,
which makes alignments of two or more sequences using dynamic programming
algorithms.

Clustal Omega Tutorial Link:
http://kelleybioinfo.org/algorithms/tutorial/TAli2.pdf

Sample and lab exercise data:
http://kelleybioinfo.org/algorithms/data/DAli2.txt

Lab Exercise
DNA multiple-sequence alignment
Unaligned DNA sequences for questions 1 to 3 below (and at the Sample and Lab
Exercise data link):

>BR110MP90
TAATATCAATAGAAGAATTAGCCAAAATTACGTCCTGTCAAACCCCCTATGGTAAATAGAAAAATAAATC
CGATAGCTCATAGGGATGAAGGAGTTAAAGTAATTTGGGAGCCATGGTATGTTGCGAGTCATCTAAAAATTT
TGATTCCAGTAGGAACTGCAATAATTATTGTGGCTGATGTGAAGTAAGCTCGAGTATCAACATCTATCCCTACTGT
AAATATATGATGGGCTCACACTACAAATCCTAGCAGACCAATTGCTATTATAGCATAAATTATTCCTAATAAAC
CGAAAGCTTCCTTTTTGCCACTTTCTTGTCTAATAATATGAGAAATTATTCCGAAACCAGGTAAAATTAGAATAT
AAACTTCAGGATGTCCGAAAAATCAAAATAAATGCTGATAAAGAATAGGATCCCCTCCACCTGATGGG
TCAAAGAAGGTAGTATTAATATTTCGATCTGTCAATAGTATAGTAATAGCTCCGGCTAATAC

>JE05
AATAATATCAATGGAAGAATTGGCTAAAACTACACCAGTTAATCCCCCTAAAGTAAAGAGGAAAATAAAT
CCAATAGCTCAAAGGGAGGAGGGGTTTAGGGTAATTTGAGACCCATGGTATGTAGCTAACCATCTAAAAATTT
TGATTCCAGTCGGAACTGCAATAATTATTGTGGCAGACGTGAAATAGGCGCGAGTATCTACATCTATTCCTACTG
TGAACATATGATGGGCTCATACTACAAAACCTAATAGTCCAATTGCTATTATAGCATAAATTATTCCCAATAAT
CCAAAAGCTTCCTTTTTTCCTCTTTTCTTGCCTAATAATATGAGAAATTATACCAAATCCTGGTAAAATTAAAATAT
AAACTTCAGGGTGCCCAAAAAATCAGAATAAGTGCTGATAGAGGATAGGGTCTCCTCCACCGGAGGGA
TCAAAAAAAGTAGTATTAATATTTCGGTCTGTTAAAAGTATAGTGATAGCCCCAGCTAAC

>JE06
GAATAATATCAATGGAAGAATTGGCTAAAACTACACCAGTTAATCCCCCTAAAGTAAAGAGGAAAATAAATC
CAATAGCTCAAAGGGAGGAGGGATTTAGGGTAATTTGAGACCCATGGTATGTAGCTAACCATCTAAAAATTT
TGATTCCAGTCGGAACTGCAATAATTATTGTGGCAGACGTGAAATAGGCGCGAGTATCTACATCTATTCCTACTGT
GAACATATGATGGGCCCATACTACAAAACCTAATAGTCCAATTGCTATTATAGCATAAATTATCCCCAATAATC
CAAAAGCTTCCTTTTTACCTCTTTCTTGCCTAATAATATGAGAAATTATACCAAATCCTGGTAAAATTAAAAATATA
AACTTCAGGGTGCCCAAAAAATCAGAATAAGTGTTGATAGAGGATAGGGTCTCCTCCACCGGAGGGAT
CAAAAAAAGTAGTATTAATATTTCGGTCTGTTAAAAGTATAGTGATAGCTCCAGCTAA

>JE11L95
TATCAATGGAAGAATTGGCTAAAATTACACCAGTTAATCCCCCTAAAGTAAAGAGGAAAATAAATCCAATAGCT
CAAAGGGAGGAGGGAGTCAGGGTAATTTGAGATCCATGGTATGTAGCCAATCATCTAAAAATTTTAATTCCAGTC
CGAACTGCAATAATTATTGTGGCAGATGTAAAATAGGCGCGAGTATCTACATCTATTCCTACTGTGAAGTATATG
GATGAGCTCATACTACAAAACCTAATAATCCAATTGCTATTATAGCATAAATTATTCCTAATAATC
CAAAAGCTTCCTTTTTTCCTCTTTCTTGCCTAATAATATGGGAAATTATACCAAATCCTGGTAAAATTAAAATATA
AACTTCAGGGTGCCCAAAAAATCAAAATAAGTGTTGATAGAGGATAGGGTCTCCTCCACCGGAGGGGT
CAAAAAAAGTAGTATTAATATTTCGGTCTGTTAAAAGTATAGTGATAGCCCCAGCTAATACCG

>JE17NB95
AATATCAATGGAAGAATTGGCTAAAACTACACCAGTTAATCCCCCTAAAGTAAAGAGGAAAATAAATCCAATAGCT
CAAAGGGAGGAGGGATTTAGGGTAATTTGAGACCCATGGTATGTAGCTAACCATCTAAAAATTTTGATTCCAGTC
GGAACTGCAATAATTATTGTGGCAGACGTGAAATAGGCGCGAGTATCTACATCTATCCCTACTGTGAACATNAT
GATGGGCTCATACTACAAAACCTAATAGTCCAATTGCTATTATAGCATAAATTATTCCCAATAATC
CAAAAGCTTCCTTTTTTCCTCTTTCTTGCCTAATAATATGAGAAATTATACCAAATCCTGGTAAAATTAAAATATA
AACTTCAGGGTGCCCAAAAAATCAGAATAAGTGTTGATAGAGGATAGGGTCTCCTCCACCGGAGGGAT
CAAAAAAAGTAGTATTAATATTTCGGTCTGTTAAAAGTATAGTGATAGCCCCAGCTAACACC

>JE19NB95
ATATCAATGGAAGAATTGGCTAAAACTACACCAGTTAATCCCCCTAAAGTAAAGAGGAAAATAAATCCAATAGCT
CAAAGGGAGGAGGGATTTAGGGTAATTTGAGACCCATGGTATGTAGCTAATCATCTAAAAATTTTGATTCCAGTCG
GAACTGCAATAATTATTGTGGCAGATGTAAAATAGGCGCGAGTATCTACATCTATTCCTACTGTGAACATATGAT
GAGCTCATACTACAAAACCTAATAATCCAATTGCTATTATAGCATAAATTATTCCCAATAATCCAAAAGCTTCCTTT
TTAAACCTCTTTCTTGCCTAATAATATGAGAAATTATACCAAATCCTGGTAAAATTAAAATATAAACTTCAGGGT
GCCCAAAAAATCAGAATAAGTGTTGATAGAGGATAGGGTCTCCTCCACCGGAGGGATCAAAAAAAGTAGTATTA
ATATTTCGGTCTGTTAAAAGTATAGTGATAGCCCCAGCTAACACT

>JE39M95
TAATATCAATGGAAGAATTGGCTAAAACTACACCAGTTAATCCCCCTAAAGTAAAGAGGAAAATAAATC
CAATAGCTCAAAGGGAGGAGGGATTTAGGGTAATTTGAGACCCATGGTATGTAGCTAACCATCTAAAAATTTT
GATTCCAGTCGGAACTGCAATAATTATTGTGGCAGACGTGAAATAGGCGCGAGTATCTACATCTATCCCTACTGT
GAATATATGATGGGCTCATACTACAAAACCTAATAGTCCAATTGCTATTATAGCATAAATTATTCCCAATAATC
CAAAAGCTTCCTTTTTTCCTCTTTCTTGCCTAATAATATGAGAAATTATACCAAATCCTGGTAAAATTAAAATATA
AACTTCAGGGTGCCCAAAAAATCAGAATAAGTGTTGATAGAGGATAGGGTCTCCTCCACCGGAGGGAT
CAAAAAAAGTAGTATTAATATTTCGGTCTGTTAAAAGTATAGTGATAGCCCCAGCTA

1. Use the Clustal Omega alignment program with the default matrix to align
the DNA sequence data. Write the first 10 positions of the alignment for
only the first 4 aligned sequences below (or the first 4 rows of the alignment
if you are using copy/paste).

2. This next section is designed to give you practice converting file formats.
Many programs accept only a few selected file formats. Many allow you to
use FASTA formats, but some need specialized formats, so practice converting
file types is helpful. The conversion website used here functions similarly to
the Clustal Omega site. Use https://www.ebi.ac.uk/Tools/sfc/emboss_seqret/
to convert the Clustal alignment you created in question 1 into a FASTA
format. (Note: use the ENTIRE file, including the header that begins with
“CLUSTAL” and the asterisks at the bottom that are not actually part of the
alignment.)

Write the title lines and the first 10 positions of the alignment for only the first
3 sequences in the FASTA file (or the first 3 sequences in the alignment if you
are using copy/paste).

3. Convert the Clustal alignment to the Nexus/paup interleaved format, which is
used often in phylogenetic analyses.

a. What kind of information does the header of the Nexus/paup file contain?

b. What do you think “ntax” and “nchar” refer to?

c. What is at the very last line of the file?
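
To make the idea of a format conversion concrete, the sketch below reads Clustal-style alignment text and rewrites it as FASTA in plain Python. It is not a substitute for the conversion website and handles only the simple layout assumed here: a header line beginning with "CLUSTAL", blocks of "name sequence" lines, and a conservation line that starts with spaces. The function name `clustal_to_fasta` and the two-sequence sample are made up for this example.

```python
def clustal_to_fasta(text, width=60):
    """Convert Clustal-formatted alignment text to FASTA text."""
    sequences = {}  # name -> aligned sequence, in order of first appearance
    for line in text.splitlines():
        # Skip blank lines, the CLUSTAL header, and the conservation line
        # (which starts with whitespace because it has no sequence name).
        if not line.strip() or line.startswith("CLUSTAL") or line[0].isspace():
            continue
        parts = line.split()
        if len(parts) < 2:
            continue
        name, chunk = parts[0], parts[1]
        sequences[name] = sequences.get(name, "") + chunk

    out = []
    for name, seq in sequences.items():
        out.append(">" + name)  # FASTA title line
        for start in range(0, len(seq), width):
            out.append(seq[start:start + width])
    return "\n".join(out) + "\n"


# Hypothetical two-block alignment used only to exercise the function.
SAMPLE_CLUSTAL = (
    "CLUSTAL O(1.2.4) multiple sequence alignment\n"
    "\n"
    "seqA      GAT-TTA\n"
    "seqB      GAGTTCA\n"
    "          **  * *\n"
    "\n"
    "seqA      CC\n"
    "seqB      C-\n"
    "          * \n"
)

fasta_text = clustal_to_fasta(SAMPLE_CLUSTAL)
```

Note how the interleaved blocks of the Clustal layout are joined into one continuous sequence per title line, and how the gap characters are kept so the FASTA output is still an alignment.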



Protein multiple-sequence alignment

Unaligned protein sequences for question 4 below (and at the Sample and lab
exercise data link):

>LCseedSfl
MKKLTVAISAVAASVLMAMSAQAAEIYNKDSNKLDLYGKVNAKHYFSSNDADDGDTTYVRLGFKGETQINDQLTG
FGQWEYEFKGNRAESQGSSKDKTRLAFAGLKFGDYGSIDYGRNYGVAYDIGAWTDVLPEFGGDTWTQTDVFM
TGRTTGVATYRNNDFFGLVDGLNFAAQYQGKNDRTDVTEANGDGFGFSTTYEYEGFGVGATYAKSDRTNDQVIY
GNNSLNASGQNAEVWAAGLKYDANNIYLATTYSETQNMTVFGNNHIANKAQNFEVVAQYQFDFGLRPSVAYLQSK
GKDLGAWGDQDLIEYIDVGATYYFNKNMSTFVDYKINLIDKSDFTKASGVATDDIVAVGLVYQF

>PhoEseedEco1
MKMKKSTLALVVMGIVASASVQAAEIYNKDGNKLDVYGKVKAMHYMSDNDSKDGDQSYIRFGFKGETQINDQL
TGYGRWEAEFAGNKAESDTAQQKTRLAFAGLKYKDLGSFDYGRNLGALYDVEAWTDMFPEFGGDSSAQTDNFM
TKRASGLATYRNTDFFGVIDGLNLTLQYQGKNENRDVKKQNGDGFGTSLTYDFGGSDFAISGAYTNSDRTNEQNLQ
SRGTGKRAEAWATGLKYDANNIYLATFYSETRKMTPITGGFANKTQNFEAVAQYQFDFGLRPSLGYVLSKGKDIEGI
GDEDLVNYIDVGATYYFNKNMSAFVDYKINQLDSDNKLNINNDDIVAVGMTYQF

>PhoEseedEco2
MKMKKSTLALVVMGIVASVSVQAAEIYNKDGNKLDVYGKVKAMHYMSDNDSKDGDQSYIRFGFKGETQINDQL
TGYGRWEAEFAGNKAESDTAQQKTRLAFAGLKYKDLGSFDYGRNLGALYDVEAWTDMFPEFGGDSSAQTDNFM
TKRASGLATYRNTDFFGVIDGLNLTLQYQGKNENRDVKKQNGDGFGTSLTYDFGGSDFAISGAYTNSDRTNEQNLQ
SRGTGKRAEAWATGLKYDANNIYLATFYSETRKMTPISGGFANKTQNFEAVAQYQFDFGLRPSLGYVLSKGKDIEGI
GDEDLVNYVDVGATYYFNKNMSAFVDYKINQLDSDNKLNINNDDIVAVGMTYQF

>PhoEseedEco4
MKKSTLALVVMGIVASASVQAAEIYNKDGNKLDVYGKVKAMHYMSDNDSKDGDQSYIRFGFKGETQINDQLT
GYGRWEAEFAGNKAESDTAQQKTRLAFAGLKYKDLGSFDYGRNLGALYDVEAWTDMFPEFGGDSSAQTDNFM
TKRASGLATYRNTDFFGVIDGLNLTLQYQGKNENRDVKKQNGDGFGTSLTYDFGGSDFAISGAYTNSDRTNEQNLQ
SRGTGKRAEAWATGLKYDANNIYLATFYSETRKMTPITGGFANKTQNFEAVAQYQFDFGLRPSLGYVLSKGKDIEGI
GDEDLVNYIDVGATYYFNKNMSAFVDYKINQLDSDNKLNINNDDIVAVGMTYQF

>PhoEseedSen1
MNKSTLAIVVSIIASASVHAAEVYNKNGNKLDVYGKVKAMHYMSDYDSKDGDQSYVRFGFKGETQINDQLT
GYGRWEAEFAGNKAESDSSQQKNRLAFAGLKLKDIGSFDYGRNLGALYDVEAWTDMFPEFGGDSSAQTDNF
MTKRASGLATYRNTDFFGIVDGLDLTLQYQGKNEDRDVKKQNGNGFGTSVSYDFGGSDFAVSGAYTLSDRTREQNLQ
RRGTGDKAEAWATGVKYDANDIYIATFYSETRNMTPVSGGFANKTQNFEAVIQYQFDFGLRPSLGYVLSKGKD
IEGVGSEDLVNYIDVGATYYFNKNMSAFVDYKINQLDSDNTLGINDDDIVAIGLTYQF

>PhoEseedSen2
MNKSTLAIVVSIIASASVHAAEVYNKNGNKLDVYGKVKAMHYMSDYDSKDGDQSYVRFGFKGETQINDQLT
GYGRWEAEFAGNKAESDSSQQKTRLAFAGLKLKDIGSFDYGRNLGALYDVEAWTDMFPEFGGDSSAQTDNFM
TKRASGLATYRNTDFFGIVDGLDLTLQYQGKNEDRDVKKQNGDGFGTSVSYDFGGSDFAVSGAYTLSDRTREQNLQ
RRGTGDKAEAWATGVKYDANDIYIATFYSETRNMTPVSGGFANKTQNFEAVIQYQFDFGLRPSLGYVLSKGKD
IEGVGSEDLVNYIDVGAIYYFNKNMSAFVDYKINQLDSDNTLGINDDDIVAIGLTYQF

>PhoEseedSfl
MKKSTLALVVMGIVASASVQAAEIYNKDGNKLDVYGKVKAMHYMSDNASKDGDQSYIRFGFKGETQINDQLT
GYGRWEAEFAGNKAESDTAQQKTRLAFAGLKYKDLGSFDYGRNLGALYDVEAWTDMFPEFGGDSSAQTDNFM
TKRASGLATYRNTDFFGVIDGLNLTLQYQGKNENRDVKKQNGDGFGTSLTYDFGGSDFAISGAYTNSDRTNEQNLQ
SRGTGKRAEAWATGLKYDANNIYLATFYSETRKMTPITGGFANKTQNFEAVAQYQFDFGLRPSLGYVLSKGKDIEGI
GDEDLVNYIDVGATYYFNKNMSAFVDYKINQLDSDNKLNINNDDTVAVGMTYQF

>PhoEseedSty
MNKSTLAIVVSIIASASVHAAEVYNKNGNKLDVYGKVKAMHYMSDYDSKDGDQSYVRFGFKGETQINDQLT
GYGRWEAEFASNKAESDSSQQKTRLAFAGLKLKDIGSFDYGRNLGALYDVEAWTDMFPEFGGDSSAQTDNFM
TKRASGLATYRNTDFFGIVDGLDLTLQYQGKNEDRDVKKQNGDGFGTSVSYDFGGSDFAVSGAYTLSDRTREQNLQ
RRGTGDKAEAWATGVKYDANDIYIATFYSETRNMTPVSGGFANKTQNFEAVIQYQFDFGLRPSLGYVLSKGKD
IEGVGSEDLVNYIDVGATYYFNKNMSAFVDYKINQLDSDNTLGINDDDIVAIGLTYQF

>nmpCseedEco1
MNIYRAVTSFFNNSSKKGLTMKKLTVAISAVAASVLMAMSAQAAEIYNKDSNKLDLYGKVNAKHYFSSNDADD
GDTTYARLGFKGETQINDQLTGFGQWEYEFKGNRAESQGSSKDKTRLAFAGLKFGDYGSIDYGRNYGVAYDIGAWT
DVLPEFGGDTWTQTDVFMTQRATGVATYRNNDFFGLVDGLNFAAQYQGKNDRSDFDNYTEGNGDGFGFSATYEYE
GFGIGATYAKSDRTDTQVNAGKVLPEVFASGKNAEVWAAGLKYDANNIYLATTYSETQNMTVFADHFVANKAQN
FEAVAQYQFDFGLRPSVAYLQSKGKDLGVWGDQDLVKYVDVGATYYFNKNMSTFVDYKINLLDKNDFTKEGANKSLI

>nmpCseedSty
MKLKLVAVAVTSLLAAGVVNAAEVYNKDGNKLDLYGKVHAQHYFSDDNGSDGDKTYARLGFKGETQINDQLTG
FGQWEYEFKGNRTESQGADKDKTRLAFAGLKFADYGSFDYGRNYGVAYDIGAWTDVLPEFGGDTWTQTDVFMT
GRTTGVATYRNTDFFGLVEGLNFAAQYQGKNDRDGAYESNGDGFGLSATYEYEGFGVGAAYAKSDRTNNQVKAA
SNLNAAGKNAEVWAAGLKYDANNIYLATTYSETLNMTTFGEDAAGDAFIANKTQNFEAVAQYQFDFGLRPSIAYLKS
KGKNLGTYGDQDLVEYIDVGATYYFNKNMSTFVDYKINLLDDSDFTKAAKVSTDNIVAVGLNYQF

4. Use Clustal Omega with the unaligned protein sequences.

a. Find the longest stretch of completely conserved positions. Write the amino
acids for this stretch of sequence below (which is the same in all the
sequences).

b. Find the region or regions of the alignment in which most or all of the
sequences have a 3-character gap (“---”). Write the amino acid letters for the
sequences that do NOT have gaps.

Notes
1. This is also called a pairwise sequence alignment.
2. BLAST is faster and highly accurate, but it is not guaranteed to produce an optimal sequence
alignment.
3. It is necessary to have the scores from the three surrounding cells, which is why we added
an extra row and column when we started the graph.
4. The Smith-Waterman variant of this algorithm produces a local alignment, which aligns two
sequences but does not constrain the alignment to be from end to end. This is important
when aligning, say, a short sequence to an entire genome. In the Smith-Waterman variant, all
the negative cell values are changed to 0 when the graph is calculated, which makes the
positive local alignments visible. Then the traceback starts at the highest-scoring cell, regardless
of where that value is in the graph, and proceeds until it reaches a 0 value.
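
The two changes described in note 4 can be sketched in Python as follows. This is a minimal illustration, not a production aligner. The function name `smith_waterman` and the example sequences are assumptions, and the mismatch score is set to −1 here (rather than the 0 used in Fig. 3.1.1) so that mismatches pull local scores down toward 0.

```python
def smith_waterman(seq_top, seq_side, match=1, mismatch=-1, gap=-1):
    """Return (aligned_top, aligned_side, score) for the best local alignment."""
    rows, cols = len(seq_side) + 1, len(seq_top) + 1
    grid = [[0] * cols for _ in range(rows)]  # top row and left column stay 0
    best, best_cell = 0, (0, 0)

    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq_top[j - 1] == seq_side[i - 1] else mismatch
            grid[i][j] = max(
                0,                          # negative values are changed to 0
                grid[i - 1][j - 1] + pair,
                grid[i - 1][j] + gap,
                grid[i][j - 1] + gap,
            )
            if grid[i][j] > best:
                best, best_cell = grid[i][j], (i, j)

    # Traceback starts at the highest-scoring cell and stops at a 0 value.
    top, side = [], []
    i, j = best_cell
    while i > 0 and j > 0 and grid[i][j] > 0:
        pair = match if seq_top[j - 1] == seq_side[i - 1] else mismatch
        if grid[i][j] == grid[i - 1][j - 1] + pair:
            top.append(seq_top[j - 1])
            side.append(seq_side[i - 1])
            i -= 1
            j -= 1
        elif grid[i][j] == grid[i - 1][j] + gap:
            top.append("-")
            side.append(seq_side[i - 1])
            i -= 1
        else:
            top.append(seq_top[j - 1])
            side.append("-")
            j -= 1
    return "".join(reversed(top)), "".join(reversed(side)), best


# A short query (GTT) located inside a longer, made-up sequence.
local_result = smith_waterman("AAGTTCC", "GTT")
```

Only the matching stretch is reported; the flanking letters of the longer sequence are left out of the alignment instead of being padded with gaps.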
CHAPTER 04
PATTERNS IN THE DATA

Early in the primordial days of molecular sequencing, analysis of DNA and
protein sequence alignments uncovered interesting nonrandom patterns of
nucleotide variation. One of the earliest discoveries came from the
comparison of the upstream (5′) promoter regions of protein-coding genes, right in
the midst of the binding site of the RNA polymerase. This region contained
what is known as a TATA box, which is vital for the correct binding of the RNA
polymerase and transcription of the messenger RNA. Analysis of the sequence
alignment containing the TATA box region, even visually, shows a significant bias
towards T and A nucleotides (Fig. 4.1). The TATA box turns out to be crucial for
binding the protein known, shockingly enough, as the TATA binding protein, which
is a critical part of the RNA polymerase complex. Experimental mutations of
these conserved nucleotides reduced or completely eliminated the process of
transcription. If such a mutation happened in a gene that was critical for organism
development or survival, the organism with the mutation would be eliminated via
natural selection and the mutation would not pass on to future generations.
Harsh, but true.
Such patterns proved to be extremely common across a vast variety of
genomes. Figure 4.2 illustrates a number of other examples using sequence logos.
Sequence logos were created by Tom Schneider and Mike Stephens back in the
dark ages (1990, when cell phones were the size of bricks and email was only for
nerds) and were designed as a simple visual means of illustrating the highly
conserved positions in a sequence alignment (DNA, RNA, or protein). These
positions were likely to be extremely important and, therefore, conserved by natural
selection. The calculation of sequence logos is quite simple and is based on
information theory. The level of conservation at a position of a sequence alignment,
Rseq, is calculated as follows:

R_{seq} = \log_2 N - \left( -\sum_{n=1}^{N} p_n \log_2 p_n \right)

where N is the number of distinct symbols, and p_n is the observed frequency of
symbol n at that position. For DNA or RNA, there are four symbols, A, G, C, and T
(U for RNA), and the maximum sequence logo score for a position is 2 if all the
nucleotides at a position are the same¹ (e.g., all A nucleotides). Protein sequences
have many more symbols, 20 total representing each of the most frequently
occurring amino acids, so the maximum bit score (if 100% of the amino acids at a
position are of one type) is 4.32.

FIGURE 4.1. Sequence alignment of regions 5′ upstream from 10 protein-coding
genes in the bacterial Escherichia coli genome. There is a clear bias towards T and
A nucleotides, though none of the sequences are completely identical.
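
The conservation calculation can be checked numerically with a few lines of Python. The sketch below computes Rseq for one alignment column exactly as in the formula above; the function name `rseq` and the example columns are assumptions made for this illustration, and no small-sample correction is applied.

```python
import math
from collections import Counter


def rseq(column, alphabet_size):
    """Conservation (in bits) of one alignment column.

    column is a string holding the symbol from each sequence at one position;
    alphabet_size is N (4 for DNA or RNA, 20 for protein).
    """
    counts = Counter(column)
    total = len(column)
    # Observed uncertainty at this position: -sum(p_n * log2(p_n)).
    uncertainty = -sum(
        (c / total) * math.log2(c / total) for c in counts.values()
    )
    return math.log2(alphabet_size) - uncertainty


fully_conserved_dna = rseq("AAAAAAAA", 4)    # every sequence has an A
evenly_mixed_dna = rseq("ACGT", 4)           # each nucleotide equally frequent
fully_conserved_protein = rseq("LLLLL", 20)  # every sequence has a leucine
```

A fully conserved DNA column gives 2 bits, an evenly mixed column gives 0 bits, and a fully conserved protein column gives log2(20), or about 4.32 bits, matching the maxima stated above.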
Figure 4.2 illustrates sequence logos made for sequence alignments of
noncoding DNA sequences.
Similar conserved patterns can be detected in protein sequence alignments.
These regions indicate amino acids that are critical in the functioning of the
protein. For instance, highly conserved amino acids in a transcription factor (TF) could
be critical in allowing the TF DNA-binding domain to bind DNA.

Sequence Motifs
The types of short sequence patterns shown in the figures here can also be
found in RNA and protein sequences, and are generally referred to as motifs.
One goal of bioinformatics has been to develop algorithms to automatically
identify functional motifs in the vast sea of DNA, RNA, and protein databanks. For
instance, we might want to search for all the alternate splice sites in the genome of
the fruit fly or find all the binding sites of a particular TF in the genome of the Black
Death bacterium.
The algorithms covered in this chapter are designed to take into account the
facts that motifs tend to be short and that, while certain positions tend to be
conserved, there is also a lot of positional variability among related motifs. The two
methods we discuss in the chapter are (i) a basic protein sequence motif
algorithm and (ii) a DNA motif search tool called a weight matrix.² In general, these
and other motif-searching algorithms have the following in common.

1. They are based on experimental data.
2. They find motifs too short for BLAST searches.
3. They find motifs too “fuzzy” for BLAST searches.
4. They allow “weighted” positional bias.
5. They generate functional hypotheses.

FIGURE 4.2. Logos for sequences involved in critical cell functions. The larger a letter
is drawn at a position, the more conserved (and important) that nucleotide is in the process.
(Top) Conserved sequences within E. coli promoter regions. (Middle) Exon and intron
splice sites. (Bottom) A site in a bacterial genome that binds a TF. The sequence logos were
generated using the online WebLogo software (http://weblogo.berkeley.edu/).

The importance of basing these motifs on experimental data cannot be overstated. They
must be built on alignments of experimentally tested sequences of known
function, and the more the better. Since the motifs themselves tend to be short and
have highly variable positions as well as conserved positions, one generally cannot
use BLAST to find matching motifs in new sequences. However, because the motifs
are short and fuzzy, matches found by motif-searching algorithms should be treated as
hypotheses, since many are likely to be false positives.

Notes
1. This is also known as the number of “bits.”
2. Chapter 07 discusses the general method called hidden Markov models, which can also be
used to find motifs.

ACTIVITY 4.1 PROTEIN SEQUENCE MOTIFS

Motivation
Protein sequence motifs are short stretches of amino acids with specific functional roles that are
found in many different types of proteins. For instance, the DNA-binding motif known as a zinc
finger is found in numerous different types of transcription factors (TFs), proteins that regulate
the transcription of DNA to messenger RNA. Although the amino acids in the binding motif are
similar across many different TFs, the other amino acids in these proteins can be very different,
yet they all need to bind DNA; hence, they have a DNA-binding motif. Being able to identify
motifs, given the primary protein sequence, can provide important insight into a functional aspect of
unknown proteins.
In this activity, you will learn how to turn a multiple-sequence alignment of known protein
motifs into a search pattern for scanning new proteins for the same motif. In the first step, you
will learn how to build a position-specific search pattern using the sequence alignment. In the
second step, you will learn how to scan new protein sequences for positive matches. You will
also learn how to use online motif searching software to search for matching motifs in protein
sequences.

Learning Objectives
1. Understand the ba­sics of se­quence mo­tifs and how de­tect­ing mo­tifs in pro­teins helps to
iden­tify im­por­tant func­tional as­pects (Motivation).
2. Learn how to build a se­quence mo­tif from align­ments of short pro­tein se­quences and use
them to search for matches within novel pro­tein se­quences (Concepts and Exercises).
3. Use se­quence mo­tif match­ing soft­ware to build and search for pro­tein mo­tifs (Concepts and
Exercises).

Concepts
This preliminary exercise will help you understand the principles behind protein sequence motif
searching. The alignment below contains a series of short sequences of a protein
motif in the DNA-binding domain of the androgen receptor (AR) from various different mammal
species. The AR binds the hormone testosterone, and once activated by the hormone binding,
the AR moves from the cell cytoplasm to the nucleus, where it binds the regulatory DNA
sequences of many genes. This binding helps to activate the transcription of genes which are
vital for male reproduction and development.

                   Positions
                   1  2  3  4  5  6
Human AR Motif:    R  V  L  E  G  Q
Dog AR Motif:      R  A  M  E  G  K
Camel AR Motif:    R  A  M  E  G  Q
Horse AR Motif:    R  V  M  E  G  K
Mouse AR Motif:    R  V  V  E  G  Q
Bear AR Motif:     R  A  L  E  G  K

Look at each position of the alignment. Can you make a pattern (known as a profile) that matches
all the sequences? Hint: the first letter in the pattern would be an R because they all have an
arginine (R) at the first position.

PROFILE:

FIND PROFILE MATCH IN THIS SEQUENCE:

MFWVYRVMEGKSK

Reflection
• Which po­si­tions are the most con­served? Most var­i­able?
• How might you search for your pat­tern in a da­ta­base of pro­tein se­quences? (How did the
BLAST al­go­rithm find the first match­ing “word”?)
• What if there were 10 pos­si­ble choices of amino acid at po­si­tion 2? How might you in­di­cate
a hy­per­var­i­able po­si­tion?

Below is the pro­file for this se­quence align­ment and a match to the pro­file in the test pro­tein
se­quence. The an­swer uses hy­phens to in­di­cate sep­a­rate po­si­tions of the mo­tif.

PROFILE: R - [V or A] - [L or M or V] - E - G - [Q or K]

PROFILE MATCH IN THIS SEQUENCE:

MFWVYRVMEGKSK (match: RVMEGK, residues 6 to 11)
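The same profile-building and matching steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `build_profile` and `profile_to_regex` are made up for this example, and a regular expression stands in for a real motif-search program.

```python
import re

# AR DNA-binding motif alignment from the text (one row per species).
alignment = ["RVLEGQ", "RAMEGK", "RAMEGQ", "RVMEGK", "RVVEGQ", "RALEGK"]

def build_profile(seqs):
    """Collect the residues observed in each alignment column (hypothetical helper)."""
    return [sorted(set(column)) for column in zip(*seqs)]

def profile_to_regex(profile):
    """Write conserved columns as a single letter and variable columns as a character class."""
    return "".join(c[0] if len(c) == 1 else "[" + "".join(c) + "]" for c in profile)

profile = build_profile(alignment)
pattern = profile_to_regex(profile)
match = re.search(pattern, "MFWVYRVMEGKSK")

print(pattern)                                         # R[AV][LMV]EG[KQ]
print(match.group(), match.start() + 1, match.end())   # RVMEGK 6 11
```

The character classes play the role of the "[V or A]" choices in the profile; positions are reported 1-based to match the residue numbering used in the text.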

In three of the positions of the sequence alignment, all the motifs have an identical amino acid.
This suggests that mutating any of these three amino acids would inhibit the binding of the AR
to its target DNA sequence. For instance, if one were to mutate the DNA sequence that coded
for this motif so that there was a D at position 4 instead of an E, the motif would probably no
longer bind the correct DNA sequence.
One especially bad consequence of such a mutation would be infertility. Since the AR is
vital in the process of spermatogenesis (development of sperm), individuals with such a mutation
could not reproduce and this mutation would not pass to the next generation (see Note 1). This
process of natural selection is the likely reason why one only observes an E at the 4th position of
this motif in all these species (or an R at the 1st, or a G at the 5th). Once a profile is generated
for a particular sequence motif, this profile can then be scanned across millions
of proteins in databases for matches.
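Scanning a profile across many proteins can be sketched as follows. The sketch writes the AR profile in PROSITE-style pattern notation (positions separated by hyphens, alternatives in square brackets, `x` for any residue, curly braces for excluded residues) and translates a small subset of that notation into a Python regular expression. The function names and the two "made_up" sequences are assumptions for illustration only; just the first sequence comes from the text.

```python
import re

def profile_to_prosite(profile):
    """Join columns with '-' and bracket any column with more than one residue."""
    return "-".join(c[0] if len(c) == 1 else "[" + "".join(c) + "]" for c in profile)

def prosite_to_regex(pattern):
    """Translate a small subset of PROSITE pattern syntax into a Python regex."""
    regex = pattern.replace("-", "")                        # '-' only separates positions
    regex = regex.replace("x", ".")                         # 'x' means any amino acid
    regex = regex.replace("{", "[^").replace("}", "]")      # {P} means anything but P
    regex = re.sub(r"\((\d+(?:,\d+)?)\)", r"{\1}", regex)   # x(2,4) becomes .{2,4}
    return regex

ar_profile = [["R"], ["A", "V"], ["L", "M", "V"], ["E"], ["G"], ["K", "Q"]]
prosite = profile_to_prosite(ar_profile)
regex = re.compile(prosite_to_regex(prosite))

# Hypothetical mini-database; only "test_protein" appears in the text.
database = {
    "test_protein": "MFWVYRVMEGKSK",
    "made_up_1": "MKRALEGQWW",
    "made_up_2": "MKRALDGQWW",   # D instead of E at motif position 4
}
hits = {name: [(m.start() + 1, m.group()) for m in regex.finditer(seq)]
        for name, seq in database.items()}

print(prosite)   # R-[AV]-[LMV]-E-G-[KQ]
print(hits)
```

The third sequence carries the E-to-D change at position 4 discussed above, so it produces no hit even though every other position fits the profile.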

Exercises
Interactive ex­er­cises (the­o­ry)
Use the on­line ex­er­cise link be­low to learn how to make pro­tein se­quence
­mo­tifs and use them to search for matches. The Interactive Link ex­plains how to
use the teach­ing in­ter­ac­tive. Once you learn how it works, solve the ac­tiv­ity
­prob­lem.

Sequence Motif Interactive Link


Link: http://kelleybioinfo.org/algorithms/default.php?o=4

Problem
Build the se­quence mo­tif and in­di­cate whether the pro­file you built matches the
pro­tein se­quences be­low.

Lab Exercises (Practice)


In this part of the ex­er­cise, you will learn how to use se­quence mo­tifs to search for
matches on­line. You will also learn how to in­ter­pret the out­­put from the pro­gram.

ScanProsite Tutorial Link


Link: http://kelleybioinfo.org/algorithms/tutorial/TMot2.pdf

Sample and lab exercise data: http://kelleybioinfo.org/algorithms/data/DMot2.txt

Lab Exercise
1. Make a pro­file man­u­ally from the fol­low­ing data us­ing the pat­tern syn­tax used
by PROSITE. See tu­to­rial and link for ScanProsite.

Protein se­quences for gen­er­at­ing a pro­file:

P2_DROME/200-226 CVVCGDKSSGKHY
DR_CANFA/547-573 CLICADEASGCHY
1H16_CAEEL/14-40 CAICQESAEGFHF
2H18_CAEEL/11-37 CEVCPDKTSYRHF
3H18_CAEEL/11-37 CPVCGDRTSLRHF
4H18_CAEEL/11-37 CPPCGDLTSPSHF
5H18_CAEEL/11-37 CDLCGDPSRGWHF
F6H18_CAEEL/11-37 CFQCLDWTAGANF
7H18_CAEEL/11-37 CTWCPDQTGWFHF
8H18_CAEEL/11-37 CWPCWDPTVGGHY
9H18_CAEEL/11-37 CEACGDKTLGYHF
0H18_CAEEL/11-37 CSFCGDKTIPRNF
AH18_CAEEL/11-37 CPRCQEDTQRYHY
BH18_CAEEL/11-37 CQECGDKTWWRNF

a. What is the pro­file?

b. Use the Option 2 “Submit MOTIFS” to search for hits with your pro­file in
the PROSITE da­ta­base (use the de­fault pa­ram­e­ters). Fill in the ta­ble be­low
with the Swiss-Prot/UniProtKB ac­ces­sion num­bers of hits to three dif­fer­ent
or­gan­isms, as well as the mo­tif se­quence from the or­gan­ism that matches
the pro­file you cre­ated, and the sci­en­tific and, if avail­­able, com­mon
name of the or­gan­ism. (The “Pattern” should be the se­quence from
the or­gan­ism that matches your search pat­tern.)

Swiss-Prot/UniProtKB Acc #  Pattern (Specific Match)  Organism

i. 

ii. 

iii. 

2. Use ScanProsite Option 1 to scan the pro­tein P04150 (UniProtKB Identifier)
and an­swer the fol­low­ing ques­tions.

a. What is the pro­tein?

b. What mo­tif(s) did you find (i.e., what amino acid se­quence or se­quences)?

c. Describe the mo­tif and what it binds.

3. Search P04150 using InterProScan at http://www.ebi.ac.uk/interpro/search/sequence-search.
Select the Advanced Options menu below the input window and select only 2 boxes:
PfamA and Prosite-Profiles. What major domains did you discover?

4. Use SMART (Simple Modular Architecture Research Tool) to study your
cool new sequence from a feral Chernobyl chicken (see Note 2). Use the normal, not the
genomic, version.
SMART home page: http://smart.embl-heidelberg.de/.

> Cher­no­byl chick­en


MQFAPLLLGVFLLCGSARGSDSSASNAITCFTRGLDLRKETEDVLCPANCPLWQFYVFGDGI
YASLSSVCGAAIHRGVITNAGGAVRVQTLPGQENYPAVHANGIQSQVLSRWASSFSVTPGTN
NLALEAVGRSVATARPATGKRPKKTLEKKAGNKDCKADIAFLIDGSYNIGQRRFNLQKNFV
GKVAVMLGIGTEGPHVGVVQASEHPKIEFYLKNFTAAKEVLFAIKELGFRGGNSNTGKALK
HAAQKFFSMENGARKGIPKIIVVFLDGWPSDDLEEAGIVAREFGVNVFIVSVAKPTTEELGM
VQDIGFIDKAVRCRNNGFFSYQMPSWFGTTKYVKPLVQKLCSHEQMLCSKTCYNSVNIGFLI
DGSSSVGESNFRLMLEFISNVAKAFEISDIGSKIATVQFTYDQRTEFSFTDYTTKEKVLSAIRNI
RYMSGGTATGDAISFTTRNVFGPVKDGANKNFLVILTDGQSYDDVRGPAVAAQKAGITVFS
VGVAWAPLDDLKDMASEPRESHTFFTREFTGLEQMVPDVINNNGICKDFLDSKQ

a. Search for outlier ho­mo­logues, PFAM, sig­nal pep­tides, and in­ter­nal re­
peats. What high-probability mo­tifs did you dis­cover? Can you de­scribe them
in words?

b. What is the name/function of the pro­tein? (Hint: BLAST the pro­tein se­quence.)
5. Use Ensembl at http://www.ensembl.org to answer the next few questions
about the homolog of the Chernobyl chicken protein in the mouse genome.
(Hint: use the Ensembl Tutorial in the BASICS section and search the mouse
genome.)

a. Which mouse chro­mo­some is it on?

b. How many ex­ons does it have?

c. What is the length of the longest mRNA transcript in nucleotides? Ensembl
reports lengths in bp (base pairs) because it is based on the DNA, but RNA is
single stranded and has no pairs, so the length should be given in nucleotides (nt).

Notes
1. Natural se­lec­tion re­quires both sur­vival AND re­pro­duc­tion, not to men­tion var­i­a­tion and her­i­
ta­bil­ity. Sequence align­ments are a pow­er­ful way to de­ter­mine the re­sults of na­ture’s ex­per­
i­men­ta­tion: what is/is not al­lowed in na­ture.
2. These fe­ral chick­ens prob­a­bly fed on the gamma ra­di­a­tion-eating fungi in­side the re­ac­tor core:
https://​www.​sciencedaily.​com/​releases/​2007/​05/​070522210932.​htm

ACTIVITY 4.2 POSITION-SPECIFIC WEIGHT MATRICES

Motivation
Transcription fac­tors (TFs) help to de­ter­mine whether a par­tic­u­lar gene (or set of genes) is tran­
scribed into mes­sen­ger RNA. So-called ac­ti­va­tor TFs help ac­ti­vate (in­crease) the tran­scrip­tion
of a gene. Most ac­ti­va­tor TFs bind DNA up­stream (5′) of the pro­tein-coding re­gion and make it
more likely that RNA po­ly­mer­ase will bind and tran­scribe the DNA tem­plate strand. RNA po­ly­
mer­ase might bind any­way, but ac­ti­va­tors make it more likely to hap­pen by di­rect or in­di­rect in­
ter­ac­tions with the RNA po­ly­mer­ase. Repressor TFs also bind DNA but do the op­po­site, mak­ing
it less likely that RNA polymerase will transcribe the DNA. TFs typically bind to specific
sequences of DNA, although how stringent these sequence requirements are can vary. In fact,
a given TF usually binds to several different, though highly similar, DNA sequences. In order
to predict these DNA binding sites, we need a method that takes this variation into account.
This was the idea behind the creation of the position-specific weight matrix (PSWM).
This ac­tiv­ity teaches how to cre­ate PSWMs us­ing se­quence align­ments of ex­per­i­men­tally
de­ter­mined TF bind­ing sites (TFBSs) and how to use a PSWM to search and find new se­quences
that match the PSWM. These matches could be bind­ing sites for the TF. This is sim­i­lar in prin­ci­ple
to the cre­a­tion of pro­tein se­quence mo­tifs, but with DNA se­quences. It’s the same idea, but
more math-y. After cal­cu­lat­ing the PSWM, you will learn how to scan DNA se­quences for high-
scoring matches to the PSWM and then use on­line soft­ware for this pur­pose.

Learning Objectives
1. Understand the prin­ci­ples of PSWMs and how they are used for de­tect­ing pro­tein-DNA bind­
ing sites (Motivation).
2. Be ­able to con­struct and cal­cu­late a PSWM and use it to scan a DNA se­quence for sig­nif­i­cant
matches (Concepts and Exercises).
3. Learn how to use PSWM soft­ware to de­tect DNA bind­ing sites of TFs (Concepts and
Exercises).

Concepts
This preliminary exercise should help you understand the principles behind PSWMs. The
sequence alignment below contains a series of short DNA sequences that have been
experimentally determined to bind the TF known as HotStuff (Donna Summer’s BFF TF).

Binding site se­quences for the HotStuff TF

1 2 3 4 5 6
———————————
A G C T A A
T G C T G A
A G C T C G
T G C T T G
A G T T A A
T G C T G G
A G C T T A
T G C T C A
———————————

How many of each nucleotide are at each position? (Fill in the table below.)

Position
1 2 3 4 5 6
A 4
G
C
T

What is the best match in the fol­low­ing se­quence?


(Hint: the align­ment has 6 po­si­tions.)

ACAATGCTCAAGGG

Reflection
• Which po­si­tions are the most con­served? The most var­i­able?
• Is po­si­tion 3 more weighted to­wards a C or a T? How would you de­scribe the weight of
po­si­tion 1?
• How might you score a match to this PSWM us­ing the fre­quency of the nu­cle­o­tides at each
po­si­tion?
• How many of each nucleotide should we see at each position if all bases were equally
likely? Hint: there are 4 nucleotides in DNA, and assume each is equally common (1/4 A,
1/4 G, 1/4 C, and 1/4 T).

Below are the answers showing the number of each nucleotide at every position and
the best match of the matrix in the sequence. Notice how the positions are weighted towards
particular nucleotides at certain positions. Position 1 is weighted equally between A and T but
away from G and C, while position 3 is heavily weighted towards C, though there is a slight
possibility of a T. There is no weight towards any of the nucleotides at position 5: all are
equally likely.

1 2 3 4 5 6
A 4 0 0 0 2 5
G 0 8 0 0 2 3
C 0 0 7 0 2 0
T 4 0 1 8 2 0
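The count table can be reproduced directly from the alignment. The short sketch below assumes only the eight HotStuff binding-site sequences listed earlier; the function name `count_matrix` is made up for this example.

```python
from collections import Counter

# The eight HotStuff binding-site sequences from the alignment above.
sites = ["AGCTAA", "TGCTGA", "AGCTCG", "TGCTTG",
         "AGTTAA", "TGCTGG", "AGCTTA", "TGCTCA"]

def count_matrix(seqs, alphabet="AGCT"):
    """Count how often each base occurs in each alignment column (hypothetical helper)."""
    columns = [Counter(col) for col in zip(*seqs)]
    return {base: [col[base] for col in columns] for base in alphabet}

counts = count_matrix(sites)
for base, row in counts.items():
    print(base, *row)
```

Each printed row corresponds to one row of the table, with one count per alignment position.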

To get the best match in a se­quence, we scan the se­quence from left to right, look­ing at six
nu­cle­o­tides at a time. Why six? Because the align­ment is six nu­cle­o­tides long. The win­ner is the
one with the best over­all match to the ma­trix.

ACAATGCTCAAGGG (best match: TGCTCA, positions 5 to 10)

The question then arises: how do we get a score for the best fit to the ma­trix? In or­der to answer
this question, we need one more trans­for­ma­tion of the ma­trix. In this trans­for­ma­tion, the fre­
quency of the nu­cle­o­tides is used to de­ter­mine a score for each nu­cle­o­tide at each po­si­tion. The
ma­trix trans­for­ma­tion is es­sen­tially the nat­u­ral log of the fre­quency of the base at each po­si­tion
di­vided by the expected fre­quency of that base at that po­si­tion.
For example, the observed frequency of A at position 1 is 0.5 (50%, or 4 out of 8 possible
nucleotides). At position 3, C has an observed frequency of 0.875 (87.5%, or 7 out of 8 possible
nucleotides). Clearly, these are more common than expected by chance. If there were no biases,
we would expect a frequency closer to 0.25 (25%, or 2 out of 8) of each nucleotide at each
position. Calculating the PSWM score is easy. Simply calculate the natural log of the likelihood of
the base at that position (see Note 1), which is the observed frequency ($f$) of each nucleotide ($i$)
at each position ($j$) divided by the expected frequency of each nucleotide ($p_i$):

$$\ln \frac{f_{i,j}}{p_i}$$

We will assume that the bases are all present at equal frequencies in the organisms. Since DNA
has four nucleotides, each base has a $p_i$ of 0.25: $p_A = p_G = p_T = p_C = 0.25$. (This is not
always the case, since some organisms are G/C rich while others are A/T rich, but close enough!)
This equation works until you realize that many of the frequencies in the table are 0, and you
cannot take the natural log of 0. However, with a few adjustments to the equation, we can
calculate a very close approximation:

$$\ln \frac{(n_{i,j} + p_i)/(N + 1)}{p_i}$$

where $n_{i,j}$ is the number of bases $i$ at position $j$ and $N$ is the total number of sequences
in the alignment. The numerator is very close to the frequency, but adding the extra $p_i$ (0.25
here) to each count ensures that the number is never 0. Dividing by $N + 1$ rather than $N$ keeps
the four adjusted frequencies at a position summing to 1, because the four added $p_i$ values sum
to 1.
Using our equation, we can then transform the following matrix of position-specific nucleotide
counts to a PSWM. Let’s start with position 1.

1 2 3 4 5 6
A 4 0 0 0 2 5
G 0 8 0 0 2 3
C 0 0 7 0 2 0
T 4 0 1 8 2 0

Calculating the PSWM score for the A at position 1 is as follows. Since $n_{A,1} = 4$ (counts of A’s
at position 1), $p_i = 0.25$, and $N = 8$ (number of sequences), we get the following:

$$\ln \frac{(4 + 0.25)/(8 + 1)}{0.25} = +0.64$$

This is the same value for T at position 1, and the 0s in the matrix all have the same value:

$$\ln \frac{(0 + 0.25)/(8 + 1)}{0.25} = -2.19$$
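The single-cell calculation can be checked with a short function. This is a sketch of the adjusted equation above, with the function name `pswm_score` and its argument names chosen for this example; it assumes equal background frequencies of 0.25. The unrounded values are printed because two-decimal values can differ in the last digit depending on whether they are rounded or truncated.

```python
import math

def pswm_score(n_ij, N, p_i=0.25):
    """ln of the adjusted observed frequency divided by the expected frequency p_i."""
    return math.log(((n_ij + p_i) / (N + 1)) / p_i)

print(pswm_score(4, 8))   # A or T at position 1: roughly +0.636
print(pswm_score(0, 8))   # any base never seen at a position: roughly -2.197
print(pswm_score(2, 8))   # a base seen exactly as often as expected: 0
```

A count of 2 out of 8 matches the expected frequency of 0.25, so its score is 0: the base is neither favored nor disfavored at that position.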

Using the PSWM equation (and a calculator, unless you’re a math savant), we can readily fill out
the PSWM values for position 1 and the rest of the positions (the highest PSWM score(s) at
each position are marked with an asterisk):

      1        2        3        4        5       6
A   +0.64*   −2.19    −2.19    −2.19    0.0*    +0.85*
G   −2.19    +1.29*   −2.19    −2.19    0.0*    +0.37
C   −2.19    −2.19    +1.17*   −2.19    0.0*    −2.19
T   +0.64*   −2.19    −0.59    +1.29*   0.0*    −2.19
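With the full matrix in hand, scanning a sequence means sliding a six-base window along it and summing the position-specific scores. The sketch below is one straightforward way to do this and is not the code behind any particular TFBS program; the names `pswm` and `scan` are assumptions for this example, and the counts are the ones tabulated earlier for the HotStuff alignment.

```python
import math

# Nucleotide counts per position from the HotStuff alignment (8 sequences).
counts = {
    "A": [4, 0, 0, 0, 2, 5],
    "G": [0, 8, 0, 0, 2, 3],
    "C": [0, 0, 7, 0, 2, 0],
    "T": [4, 0, 1, 8, 2, 0],
}
N = 8      # number of sequences in the alignment
P = 0.25   # assumed background frequency of every base

# Apply the adjusted log-ratio equation to every cell of the count matrix.
pswm = {base: [math.log(((n + P) / (N + 1)) / P) for n in row]
        for base, row in counts.items()}

def scan(sequence, matrix):
    """Score every window of the matrix width; return (score, 1-based start, window)."""
    width = len(next(iter(matrix.values())))
    results = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(matrix[base][j] for j, base in enumerate(window))
        results.append((score, start + 1, window))
    return results

best_score, best_start, best_window = max(scan("ACAATGCTCAAGGG", pswm))
print(best_window, best_start, round(best_score, 2))
```

For the test sequence, the top-scoring window is TGCTCA starting at position 5, with a total of about 5.25 using unrounded matrix values; every other window contains at least one base that was never observed at its position and scores far lower.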

Exercises
Interactive ex­er­cises (the­o­ry)
Now that you know the ba­sics of how to cal­cu­late a PSWM us­ing a se­quence align­ment, the
on­line ex­er­cises will give you prac­tice cal­cu­lat­ing ma­tri­ces and teach you how to scan se­quences
to find a re­gion with the high­est score given the ma­trix. The In­ter­ac­tive Link has a more detailed
ex­pla­na­tion of how to cal­cu­late a PSWM and how to use it to scan a se­quence look­ing for the
best match. Once you learn how it works, solve the ac­tiv­ity prob­lem.

Weight Matrix Interactive Link


Link: http://kelleybioinfo.org/algorithms/default.php?o=11

Problem
Enter the cor­rect val­ues in the boxes be­low (ex­cept the ones in the equa­tion).
First, fill in the num­ber of each of the four nu­cle­o­tides at po­si­tion 3. Second, use
these num­bers to cal­cu­late the PSWM val­ues for po­si­tion 3. Finally, fill in the
po­si­tion-specific scores for the nu­cle­o­tides in the high­lighted 5-base se­quence
win­dow and cal­cu­late the to­tal score.

Lab Exercises (Practice)


In this part of the ex­er­cise, you will learn how to use a pro­gram that searches se­
quences for TFBSs. This pro­gram uses PSWMs, like the ones you cre­ated ear­lier,
to de­tect po­ten­tial bind­ing sites in new se­quences.

Transcription Factor Binding Site Tutorial Link


Link: http://kelleybioinfo.org/algorithms/tutorial/TMot1.pdf

Sample and lab exercise data: http://kelleybioinfo.org/algorithms/data/DMot1.txt

Lab Exercise
>EP_1
GAGAGCGGGCAGGAGGCGGGTTGGGAGGGCGCGGAGCCCCGGGTTCGGGGGAGACTGGAG
GGGCGCACGTGCGGCCGGGTGCGAGCGCGCGGCGGGGGAGGCTGCGGGGCGGCGCGGGGG
CGCGCGCGGAGCCCGAGCGGCGGCGCCAGGTCACACAACCTGTTTTGGCGCCTGCGGGCG
CCTGGGCCCAAGGGTGCGACGCGGGGGCGCCTGAGCCGGGACACAGGGGGTGCGGTGAGC
GCCAGGCGCCGCGGGGAGTTAAAAAGTTCGGGACCTGAGCGGTGCGTGGTTCCGCGGTGG
CCGCCTCTTCCTGCCGCGCAGGCCGAGGGTCCCGACGGCGCCGCTCACCGCTCCGGGACT
CAGCCTTTCTGGGCCCGGCCTGCGGTTCCCTCGGGGCCGGGGAGAGGGTGGAGCGCGGGA
GGAGGGGCGCCGGGTGGGGACGCCCAGGCCCTTCGTCGGGGGAGGGCGCTCCACCCGGGC
TGGAGTTGCAGAGCCCAGCAGATCCCTGCGGCGTTCGCGAGGGTGGGACGGGAAGCGGGC
TGGGAAGTCGGGCCGAGGTGGGTGTGGGGTTCGGGGTGTATTTCGTCCACGAGCCGGGGA

Use the se­quence above with the LASAGNA TFBS search pro­gram to an­swer the
fol­low­ing ques­tions. (Under “Matrix-Derived Models”, choose “Use TRANSFAC
Matrices.”)

1. What are the names and scores of the TFs with the two high­est-scoring ma­tri­
ces (the two larg­est num­bers in the Score col­umn)?

2. What is the se­quence snip­pet within EP_1 se­quence (Sequence col­umn) that
matches the sec­ond-highest-scoring TF and there­fore could be its bind­ing
site?

3. In the results, select the name of the TF in the left column. Using the References
link, can you describe a known biological role of the highest-scoring TF?

Notes
1. Why calculate the log likelihood? Using the natural log (or any logarithm) of a number makes
calculations more convenient, especially when many numbers must be combined. For instance,
instead of multiplying frequencies, you can add the logs of these numbers.
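A small numerical check of this note, as a sketch: the six frequencies are the observed frequencies of T, G, C, T, C, and A at positions 1 to 6 of the HotStuff alignment, and the long list of 0.01 values is a made-up example chosen only to show what happens when many small numbers are multiplied.

```python
import math

# Observed frequencies of T, G, C, T, C, A at positions 1-6 of the HotStuff alignment.
freqs = [0.5, 1.0, 0.875, 1.0, 0.25, 0.625]

product = math.prod(freqs)                    # multiply the frequencies
sum_logs = sum(math.log(f) for f in freqs)    # add their natural logs instead

print(math.log(product), sum_logs)            # the two values agree

# Made-up example: multiplying 1,000 small frequencies underflows to zero,
# while the sum of their logs is still an ordinary number.
tiny = [0.01] * 1000
print(math.prod(tiny))                        # 0.0
print(sum(math.log(f) for f in tiny))
```

Because the log of a product equals the sum of the logs, adding per-position scores ranks windows in the same order as multiplying per-position ratios, without the risk of underflow.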
CHAPTER 05
RNA STRUCTURE PREDICTION

DNA is fundamentally boring, biochemically speaking. Sure, DNA contains
the blueprint for the entire structure and function of all living cells (big deal),
but the double helix is as stable and inactive a molecule as they get. The
two antiparallel strands combine via their complementary Watson-Crick
base pairings, with multiple hydrogen bonds per pairing. Without lowering
the energy of activation using protein helicase enzymes, one must bring DNA to
around 95°C (nearly the boiling point of water) in order to separate the strands of
the double helix. Indeed, this stability is part of what makes DNA such a robust
transmitter of biological information. DNA can be unwrapped, unzipped, copied, and
transcribed, and it then re-forms quickly into a perfect double helix.
DNA’s cousin RNA, on the other hand, is an entirely different story. The chemical
composition of RNA is very similar to that of DNA, but with a few changes
that make all the difference. Like DNA, RNA is a long polymer chain composed of
four repeating nucleotides in which the sugar components of the nucleotides are
joined together by phosphodiester bonds. RNA has a pentose sugar slightly different
from that of DNA (ribose versus deoxyribose), with an additional hydroxyl
group. RNA also features the nitrogenous base uracil in place of thymine. Most
importantly, as shown in Fig. 5.1, cellular RNA molecules are typically single
stranded rather than forming a DNA-style double helix.
The single-stranded nature of RNA makes it a particularly dynamic and interesting
molecule and gives RNA the potential for both structural flexibility and
biochemical activity. Because only one strand of RNA is synthesized, the nucleotides
are not bound to their complements on another strand, leaving the hydrogen
bonds open. This allows them to form interactions with proteins and other
RNA molecules and, most importantly, with themselves. Given the right sequence
of nucleotides, and sometimes a little help from proteins, RNA molecules
can form a variety of complex structures that perform critical cellular functions, as
depicted in Fig. 5.2.


FIGURE 5.1. RNA ver­sus DNA. The left side of the im­age shows a sin­gle strand of RNA,
as well as the chem­i­cal struc­tures of the four ni­trog­e­nous ba­ses that it com­prises. The
right side shows a dou­ble-stranded DNA he­lix and its four ni­trog­e­nous ba­ses. Credit:
NHGRI/Darryl Leja.

Roles of RNA in Cells


You already have some familiarity with structural RNA molecules from learning
about the process of transcription and translation (see Chapter 00). For example,
during translation, transfer RNA (tRNA) molecules (Fig. 5.2A) shuttle amino acids
to the ribosome, the protein “factory” of cells. The ribosome itself is also mostly
composed of RNA (see Note 1). The two subunits of the ribosome, called the small and large
subunits, each have structural RNA molecules called the small subunit RNA and
large subunit RNA, respectively, that form the core of this protein-RNA molecular
complex (Fig. 5.2B).
tRNAs, be­cause they are rel­a­tively small, crit­i­cal to cel­lu­lar func­tion, and highly
abun­dant, were the first crys­tal­lized RNA struc­tures. The three-dimensional ren­
der­ing of this mol­e­cule clearly re­vealed how RNA struc­tures form: by the RNA
fold­ing upon it­self. Figure 5.2 shows ex­am­ples of RNA struc­ture folds, in which
RNA mol­e­cules use hy­dro­gen bonds to bind to them­selves, mak­ing an­ti­par­al­lel

FIGURE 5.2. Some sweet, sweet RNA molecules and a few selected molecular interactions.
(A) A tRNA molecule. tRNAs are made from a single strand of RNA which
base-pairs with itself to form a structure capable of carrying amino acids to the ribosome.
(B) Ribosomal subunits are composed of RNA (yellow and orange) and proteins (blue) that
come together to form the ribosome, the cell’s protein-synthesizing machinery. (C) The
spliceosome, made of small nuclear RNAs and proteins, is part of the RNA splicing machinery
in eukaryotic cells. (D) Secondary structure of the mir210 microRNA. MicroRNAs regulate the
expression of other genes. (E) The CRISPR-Cas9 system, a genome editing system derived from
a defense that prokaryotic cells use against bacteriophage invaders, uses RNA at multiple steps
to guide the process. Panels A and B courtesy of Yikrazuul, under license CC BY-3.0. Panel C
reprinted from Will CL, Luhrmann R. 2011. Cold Spring Harb Perspect Biol 3(7), with permission.
Panel E courtesy of Mirus Bio LLC. http://www.mirusbio.com

structures. Figure 5.2A shows a two-dimensional flattened secondary structure
of a tRNA that indicates how the nucleotides complement each other, much like
in DNA. The paired regions are called stems, and the RNA must loop around to
fold back upon itself. These loop regions can have very interesting biochemical
properties. The unpaired nucleotides in these loops have open hydrogen bonds
that can interact with and bind other molecules, or even bind other regions within
the same molecule. For instance, the anticodon hairpin loop of tRNAs (Fig. 5.2A)
has open hydrogen bonds that allow the anticodon nucleotides of a particular
tRNA to fleetingly bind mRNA during translation. This binding process is critical
for attaching the correct amino acid during protein synthesis.
Figure 5.3 shows an ex­am­ple of a larger and more com­plex struc­tural RNA,
ri­bo­nu­cle­ase P (RN­ase P), the first nat­u­rally oc­cur­ring RNA mol­e­cule proven to

FIGURE 5.3. Secondary-structure diagram of the RNase P structural RNA showing
examples of RNA structural elements known to occur in this and other molecules.
Adapted from Maeda T, Furushita M, Hamamura K, Shiba T. 2001. FEMS Microbiol
Lett 198:141–146, with permission.

have enzymatic activity in the absence of a protein component. The secondary-structure
diagram illustrates a variety of structural elements found in the molecule.
Many of these elements were predicted by bioinformatics methods and
later verified experimentally in the laboratory. These types of structural elements
are also found in many other RNA molecules and together determine both the
secondary and much of the tertiary structure of the molecule.

Predicting RNA Structure


So much DNA, so little time! That should really be the subtitle of this book. RNA
molecules, like protein sequences, are also encoded in all organisms’ DNA,
and just like with proteins, there are a mess of them. While many of the large
structural RNAs have been well characterized, there are thousands of smaller
RNAs predicted to be in genomes that remain to be studied. With the rapid and
accelerating accumulation of DNA sequences, the question is: can we use DNA
sequences that code for potential structural RNA molecules to predict their
secondary or even tertiary structures?
The two methods we will cover for predicting RNA structure based on primary
sequence information are (i) thermodynamics-based prediction and (ii) mutual
information (MI). Thermodynamic methods use experimentally determined RNA
base-pairing and base-stacking values to determine what the most stable potential
RNA structure is for an RNA sequence. Given a DNA sequence encoding a
structural RNA, these methods first determine a series of potential ways in which
the sequence could fold upon itself. Then, using the experimentally predetermined
RNA stacking energies, the algorithm determines the free energy of all the
structures, and the one with the lowest free energy wins (like golf, but more exciting).
Figure 5.4 illustrates potential folds and energies for the same RNA sequence.
Since fold­ing al­go­rithms max­i­mize the num­ber of base-pairings, and as­sume
that the mol­e­cule folds only upon it­self with lots of stems, ther­mo­dy­nam­ics-
based pre­dic­tions tend to per­form more poorly with larger RNA struc­tures. Also,
the larger the struc­ture, the more pos­si­ble folds the RNA could make, which

FIGURE 5.4. Three RNA folds for the same se­quence. Fold 1 has the low­est free
en­ergy and would be pre­ferred by the ther­mo­dy­namic pre­dic­tion method. Reprinted
from Maeda T, Furushita M, Hamamura K, Shiba T. 2001. FEMS Microbiol Lett 198:141–146,
with per­mis­sion.

ex­po­nen­tially in­creases the com­pu­ta­tional time. These meth­ods also do not pre­
dict high­er-level in­ter­ac­tions such as pseudoknots and base tri­ples. However,
ther­mo­dy­namic meth­ods can be ex­tremely fast, re­quire only a sin­gle se­quence
to make pre­dic­tions, and tend to work well with small RNA mol­e­cules.
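The idea of folding a single sequence on itself to maximize base pairings can be illustrated with the classic Nussinov dynamic-programming recurrence. This is a deliberately simplified sketch and not the method of any particular program: it only counts base pairs, whereas the thermodynamic methods described above score experimentally determined stacking and loop energies, and it cannot represent pseudoknots. The function name `max_base_pairs`, the `min_loop` parameter (the minimum number of unpaired bases in a hairpin loop), and the example sequences are assumptions made for illustration.

```python
# Watson-Crick pairs plus the G-U wobble pair.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def max_base_pairs(seq, min_loop=3):
    """Maximum number of nested base pairs in seq (Nussinov-style recurrence)."""
    n = len(seq)
    # best[i][j] holds the best pair count for the subsequence seq[i..j].
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]                 # case 1: base j stays unpaired
            for k in range(i, j - min_loop):       # case 2: base j pairs with base k
                if (seq[k], seq[j]) in PAIRS:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1] if n else 0

print(max_base_pairs("GGGAAAUCCC"))   # 3: a three-pair stem closing a hairpin loop
print(max_base_pairs("GAAAC"))        # 1
print(max_base_pairs("AAAA"))         # 0: nothing can pair
```

Because the table is filled from short subsequences to long ones, the sketch never lists individual folds one by one, but its three nested loops still make the work grow quickly with sequence length.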
The second method, MI, addresses the problem of RNA secondary-structure
prediction very differently. Instead of folding a single RNA sequence by itself, MI
uses evolutionary information from many closely related sequences. Specifically,
MI analyzes the mutational variation in multiple-sequence alignments generated
from collections of the same RNA structure from many different organisms. In
these alignments, MI identifies variable positions in the sequence alignment in
which changes (mutations) at one position of the sequence alignment correlate
with changes at another position. These correlated positions mutually inform
one another, hence the name. The concept is that if two positions (or more in
some cases) always change at the same time, this provides evidence that the
positions are interacting within a sequence. Figure 5.5 provides an example of
how and why correlated mutations like this occur.
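One common way to put a number on "correlated positions" is the mutual information between two alignment columns, MI(i, j) = Σ f(a,b) · log2[ f(a,b) / (f(a) · f(b)) ], summed over the base pairs (a, b) observed at columns i and j. The sketch below computes it for a tiny made-up alignment; the function name, the four sequences, and the use of log base 2 (so that the result is in bits) are all assumptions for illustration, and real analyses need far more sequences.

```python
import math
from collections import Counter

def mutual_information(alignment, i, j):
    """Mutual information (in bits) between alignment columns i and j (0-based)."""
    n = len(alignment)
    col_i = Counter(seq[i] for seq in alignment)
    col_j = Counter(seq[j] for seq in alignment)
    pairs = Counter((seq[i], seq[j]) for seq in alignment)
    mi = 0.0
    for (a, b), count in pairs.items():
        f_ab = count / n                                   # joint frequency of a and b
        mi += f_ab * math.log2(f_ab / ((col_i[a] / n) * (col_j[b] / n)))
    return mi

# Hypothetical alignment: columns 0 and 5 covary (G with C, A with U),
# column 1 is conserved, and column 2 varies independently of column 0.
alignment = ["GCAAAC", "GCGAAC", "ACAAAU", "ACGAAU"]

print(mutual_information(alignment, 0, 5))   # 1.0 bit: the columns change together
print(mutual_information(alignment, 0, 2))   # 0.0: independent variation
print(mutual_information(alignment, 0, 1))   # 0.0: a conserved column carries no signal
```

The conserved column scores zero even though it may well be functionally important, which mirrors the point that MI needs variation in order to predict anything.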
The ex­am­ple in Fig. 5.5 is ac­tu­ally quite a com­mon pat­tern. Paired re­gions of­
ten show strong pat­terns of cor­re­lated mu­ta­tions be­cause stem re­gions and
pseudoknot re­gions are likely to be crit­i­cally im­por­tant to the sta­bil­ity and func­
tion of the mol­e­cule. Too many un­com­pen­sated mu­ta­tions will lead to a poorly
func­tion­ing mol­e­cule and could lead to death of the or­gan­ism. In other words,

FIGURE 5.5. Example of compensatory mutations in RNA structure. (A) Multiple-sequence
alignment of three related RNA sequences with the same function but from
different organisms. The two positions indicated in boldface are correlated. (B) The
structures below the alignment show the process of compensatory mutation. In this case, the
second (compensatory) mutation restores the stability of the RNA molecule disrupted by
the first (intermediate) mutation because base pairs in stem regions have a lower free
energy than in hairpin loops.

nat­u­ral se­lec­tion would take its course. For in­stance, mu­ta­tions that de­sta­bi­lize
the stems of a tRNA mol­e­cule could lead to a mol­e­cule that does not func­tion in
pro­tein syn­the­sis. No pro­tein syn­the­sis means no pro­teins, which means no cell.
Since MI cares only about correlated mutations in a sequence alignment,
more than just base pairs in stem regions can be detected. This means that
pseudoknots and other strange interactions, such as base triples, can be determined
given enough sequence data and variation. Clearly, the downside of MI is the
need for lots of sequences and also a significant amount of variation among the
sequences (no variation means no correlation or prediction). However, with the
growing sequence databases from thousands of new genomes, sequence data
are not really a limiting factor anymore.

Notes
1. The ri­bo­some is a com­plex RNA-protein mac­ro­mol­e­cule.

ACTIVITY 5.1 RNA STRUCTURE PREDICTION

Motivation
Most of the RNA diversity in cells comes in the form of mes­sen­ger RNA (mRNA) des­tined
for the ri­bo­some, where the mRNA is used in pro­tein syn­the­sis and later re­cy­cled. However,
there are also many critical structural RNA molecules that have a variety of basic cellular
functions and are not translated into proteins. In fact, the ribosome itself is largely made of two
struc­tural RNAs, one in the small sub­unit and one in the large sub­unit. Structural RNA mol­e­
cules called trans­fer RNAs (tRNAs) shep­herd the amino ac­ids to the ri­bo­some dur­ing pro­tein
syn­the­sis. Other struc­tural RNAs are in­volved in gene reg­u­la­tion and in­tron splic­ing, and
some even act as en­zymes.
As with proteins, predicting the structure of these RNAs helps us understand how they function.
This activity covers two algorithms for predicting RNA structure. The first uses thermodynamic
folding rules to find the best two-dimensional (secondary structural) fold of a given RNA
sequence. RNA strands easily fold on themselves and form hydrogen bonds, making helix-like
regions similar to those of DNA. The second method uses the principles behind mutual information
(MI). MI requires an alignment of different RNA sequences and looks for instances when mutations
are correlated to predict likely interacting RNA nucleotides. After mastering the principles of
these algorithms, you will learn how to use online RNA thermodynamic prediction and MI prediction
software and how to interpret their output.

Learning Objectives
1. Learn about the biological function of structural RNA molecules (Motivation).
2. Understand the principles of RNA folding and of secondary and tertiary structural elements
(Motivation).
3. Use free-energy ther­mo­dy­namic rules to choose the most sta­ble RNA fold among a se­ries of
folds for the same RNA se­quence (Concepts and Exercises).
4. Learn the prin­ci­ple of MI and how it can be used to pre­dict both sec­ond­ary and ter­tiary RNA
struc­tural el­e­ments (Concepts and Exercises).
5. Learn how to use RNA pre­dic­tion soft­ware and in­ter­pret the out­­put (Concepts and Exercises).

Concepts
Algorithm 1: ther­mo­dy­namic sec­ond­ary-structure pre­dic­tion
Problem: Given a set of pos­si­ble RNA sec­ond­ary struc­tural folds for a par­tic­u­lar RNA se­quence,
which is the best?

Solution: To an­swer this ques­tion, we will use the ther­mo­dy­namic prin­ci­ple of free en­ergy. The
RNA struc­ture with the low­est to­tal free en­ergy, i.e., the most sta­ble pre­dicted struc­ture, will be
cho­sen as the best pre­dic­tion.

To bet­ter un­der­stand the prin­ci­ples be­hind the ther­mo­dy­namic struc­ture pre­dic­tion method,
try the pre­pa­ra­tory ex­er­cise be­low. Using your brain and a pen­cil, try fold­ing the RNA se­quence
upon it­self by bring­ing com­ple­men­tary nu­cle­o­tides (A and U, G and C) to­gether. Start by draw­ing
lines con­nect­ing paired nu­cle­o­tides, and then draw the struc­ture sim­i­lar to the ones shown in
Fig. 5.5. (Hint: start by con­nect­ing the sec­ond nu­cle­o­tide and the last nu­cle­o­tide.)

Sequence: 5’ – G A G G U C G G A A G A C C U – 3’

Structure:

Reflection
• What types of RNA el­e­ments does your fold have?
• How many Wat­son-Crick pairings did you find? How many un­paired nu­cle­o­tides?
• If you were to mu­tate the fifth nu­cle­o­tide from a U to a G, how would this change the
struc­ture? What would this do to the sta­bil­ity of the mol­e­cule?
• Since A-U or U-A pairs have 2 hy­dro­gen bonds, and G-C or C-G pairings have three, how
many to­tal hy­dro­gen bonds are there in your struc­ture?

Below is the folded structure. Since the nucleotides are all connected from 5′ to 3′ via
phosphodiester (covalent) bonds, the molecule must twist around to be able to fold on itself.
This is what creates the hairpin loop structure.

Sequence: 5’ – G A G G U C G G A A G A C C U – 3’

Structure:

    BEST FOLD            SWITCH U to G at position 5

       G A                    G A
      G   A                  G   A
       C-G                    C-G
       U-A                    G A
       G-C                    G-C
       G-C                    G-C
       A-U 3'                 A-U 3'
    5' G                   5' G

The best fold has a stem and a hairpin loop structure. The total number of hydrogen bonds for
this fold is 13. It is the combination of these hydrogen bonds plus the stacking energy that
determines the free energy of the stem regions. The other main determinants of the molecule’s
free energy are the number and size of the loop regions, which raise the free energy of the
molecule and lower its stability. For instance, the mutation from a U to a G in the 5th position
of the sequence results in the creation of an internal 2-base loop because the nucleotides no
longer pair. It is easy to see how this addition would significantly raise the free energy of the
RNA structure.
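
The hydrogen-bond count above can be checked with a few lines of Python. This is only a minimal sketch: the function name `score_stem` and the list of stem positions are illustrative choices, and the bond counts (2 for A-U, 3 for G-C) are the ones given in the Reflection questions. It counts hydrogen bonds only and does not compute free energy.

```python
# Hydrogen bonds per Watson-Crick pair, as stated in the Reflection questions.
H_BONDS = {("A", "U"): 2, ("U", "A"): 2, ("G", "C"): 3, ("C", "G"): 3}

def score_stem(seq, stem):
    """Return (pairs that actually form, total hydrogen bonds).

    `stem` lists candidate pairs as 1-based positions (i, j); a candidate
    only counts if the two nucleotides are Watson-Crick complementary.
    """
    paired = [(i, j) for i, j in stem if (seq[i - 1], seq[j - 1]) in H_BONDS]
    bonds = sum(H_BONDS[(seq[i - 1], seq[j - 1])] for i, j in paired)
    return paired, bonds

wild_type = "GAGGUCGGAAGACCU"
# Candidate stem from the hint: position 2 pairs with 15, 3 with 14, and so on.
stem = [(2, 15), (3, 14), (4, 13), (5, 12), (6, 11)]
# The mutation discussed in the text: U -> G at position 5.
mutant = wild_type[:4] + "G" + wild_type[5:]

wt_pairs, wt_bonds = score_stem(wild_type, stem)
mut_pairs, mut_bonds = score_stem(mutant, stem)
print(len(wt_pairs), wt_bonds)
print(len(mut_pairs), mut_bonds)
```

In the mutant, the candidate pair at positions 5 and 12 (now G and A) drops out of the list, which corresponds to the internal 2-base loop described above.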

Calculating RNA free en­er­gy


Thermodynamic methods use empirically (experimentally) determined free energies of RNA
structural elements, including nucleotide pairings (A-U, U-A, G-C, C-G, and even G-U and U-G;
see note 1), stacking energies, hairpin loops, internal loops, and bulges. Specifically, these
methods use what are known as the nearest-neighbor rules, shown in Fig. 5.1.1.

FIGURE 5.1.1. Nearest-neighbor free-energy rules for RNA structures, first developed
by Turner and colleagues in 1999 and then updated in 2004 (see note 2). The first (uppermost)
table shows the free energies for base pairs stacked over other base pairs. For instance, an A-U
base pair stacked over a C-G base pair (A-U and C-G are “neighbors” in which the A and C
and the U and G are covalently bound) has a total free energy of −2.1 kcal/mol. This free energy
includes both the energy of the pairing (A to U) and the fact that it is next to a C-G pair.
The lower table shows the free energies of hairpin loops, internal loops, and bulge loops.
Notice how the pairings all have negative free-energy values (more stability) and the loops
have positive free energies (less stability) that vary depending on the size of the loop.

Comparing pos­si­ble struc­tures


The next step is to use the nearest-neighbor (Turner) free-energy values to determine the free
energy of a given structure. Drawing all the different possible RNA structures for a given RNA
sequence is beyond the scope of this book. However, it is important to point out that the
number of possible RNA structures scales exponentially with the length of the sequence. In fact,
an RNA sequence with 100 nucleotides has more than 10^25 possible structures. Thus, the
thermodynamic methods are restricted to working with shorter RNA sequences. Figure 5.1.2 shows
the steps of calculating and comparing the free energies of two small RNA secondary structures
for the same primary RNA sequence.
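
To get a feel for how quickly the number of structures grows, the sketch below counts secondary structures with a standard recurrence: the last base is either unpaired or paired with an earlier base, which splits the strand into an enclosed part and a remaining part. The assumptions are mine, not the book's: any base may pair with any other (sequence complementarity is ignored), pseudoknots are excluded, and `min_loop` (a hypothetical parameter name) sets the fewest unpaired bases allowed inside a hairpin. The exact count therefore depends on these choices.

```python
def count_structures(n, min_loop=1):
    """Count pseudoknot-free secondary structures on a strand of n bases.

    Assumes any two bases may pair as long as at least `min_loop`
    bases lie between them.
    """
    counts = [1] * (n + 1)  # counts[k] = number of structures on k bases
    for length in range(1, n + 1):
        total = counts[length - 1]  # case 1: the last base is unpaired
        # case 2: the last base pairs with base j (1-based), enclosing
        # (length - 1 - j) bases and leaving (j - 1) bases outside the pair
        for j in range(1, length - min_loop):
            total += counts[j - 1] * counts[length - 1 - j]
        counts[length] = total
    return counts[n]

for n in (5, 10, 20, 50, 100):
    print(n, count_structures(n, min_loop=3))
```

Even with a minimum hairpin loop of three bases, the count for 100 nucleotides comes out well above the 10^25 figure quoted above, which is why enumerating every fold by hand is impractical.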

FIGURE 5.1.2. Calculating the total free energies of two possible secondary structures for the same
RNA sequence: UCGCUGUUCCACAGGA. Structure 1 features a 5-nucleotide hairpin loop and a bulge.
Structure 2 features a 3-nucleotide hairpin loop and a bulge. Structure 1 has the lower free energy and,
therefore, is the more stable of the two structures.
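
The bookkeeping behind such a comparison is a simple sum: add the (negative) stacking energies of neighboring base pairs and the (positive) loop penalties, then pick the structure with the lowest total. The sketch below illustrates only that arithmetic. Apart from the −2.1 value quoted in the Fig. 5.1.1 caption, every number in the tables is a made-up placeholder, not a Turner parameter, and `fold_a` and `fold_b` are hypothetical folds rather than the two structures in Fig. 5.1.2.

```python
def total_free_energy(stacks, loops, stack_energy, loop_energy):
    """Sum stacking terms and loop penalties for one candidate structure."""
    return (sum(stack_energy[s] for s in stacks)
            + sum(loop_energy[l] for l in loops))

# PLACEHOLDER tables (kcal/mol). Only the A-U over C-G value is from the text.
stack_energy = {("AU", "CG"): -2.1, ("CG", "GC"): -2.0, ("GC", "GC"): -3.0}
loop_energy = {("hairpin", 4): 5.0, ("internal", 2): 1.0}

# Two hypothetical folds, each described by its stacked pairs and its loops.
folds = {
    "fold_a": ([("AU", "CG"), ("CG", "GC"), ("GC", "GC")], [("hairpin", 4)]),
    "fold_b": ([("AU", "CG"), ("GC", "GC")], [("hairpin", 4), ("internal", 2)]),
}

energies = {
    name: total_free_energy(stacks, loops, stack_energy, loop_energy)
    for name, (stacks, loops) in folds.items()
}
best = min(energies, key=energies.get)  # lowest free energy = most stable
print(energies, best)
```

With real nearest-neighbor tables substituted for the placeholders, the same function reproduces the kind of hand calculation shown in the figure.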

Algorithm 2: Mutual Information (MI)


The sec­ond method, known as MI, uses a com­par­at­ive ap­proach to RNA struc­ture pre­dic­tion.
Specifically, it com­pa­res the var­i­a­tion in RNA se­quences across re­lated or­gan­isms and de­ter­
mines whether pairs of po­si­tions in the se­quence align­ment co­vary. In other words, when there
is a mu­ta­tional change at one se­quence po­si­tion, is there a cor­re­spond­ing change at an­other
po­si­tion? If this oc­curs mul­ti­ple times at the same two po­si­tions, this is ev­i­dence that the po­si­
tions are cor­re­lated (i.e., mu­tu­ally in­for­ma­tive) and that the po­si­tions are in­ter­act­ing in the mol­e­cule.
The fol­low­ing pre­pa­ra­tory ex­er­cise should help you un­der­stand po­si­tional co­vari­a­tion in an
RNA se­quence align­ment and how this ev­i­dence can be used to pre­dict RNA struc­ture.

Below is a se­quence (se­quence 1) of a known RNA struc­ture:

Sequence 1: GAUCCUGCCUUCACGAUC

Here is the same se­quence aligned with two other se­quences:

Sequence 1: GAUCCUGCC--UUCACGAUC
Sequence 2: GACCCUGCC--UUCAGGGUC
Sequence 3: CAACCUGCCAGUUCACGUUG

The fig­ure be­low shows the known RNA struc­ture of se­quence 1 on the left. The other se­
quences (2 and 3) have point mu­ta­tions at po­si­tions 1, 3, 18, and 20. Fill in the dif­fer­ent nu­cle­o­
tides for po­si­tions 3 and 18 in RNA se­quence 2, and positions 1, 3, 18, and 20 in RNA se­quence 3
in the spaces pro­vided in the fig­ure. Also, fill in the ex­tra nu­cle­o­tides for the 2-base AG nu­cle­o­
tide indel in se­quence 3 at the top of the RNA struc­ture. (Notice that the other se­quences do
not have that ex­tra A and G, so the align­ment is filled in with gaps.)

Reflection
• How did the mu­ta­tions af­fect the stem struc­ture of the se­quence 2 RNA? The se­quence 3 RNA?
• How do po­si­tions 3 and 18 co­vary (i.e., how are they mu­tu­ally in­for­ma­tive)? How many
times do they change to­gether? How about po­si­tions 1 and 20?
• We know from the struc­ture of se­quence 1 that the po­si­tion 2 nu­cle­o­tide (A) is in­ter­act­ing
with the sec­ond-to-last nu­cle­o­tide (U) in the same se­quence. Would MI help us pre­dict this
in­ter­ac­tion? Why or why not?
• Notice that there seems to be an in­ser­tion of two ex­tra nu­cle­o­tides in the hair­pin loop of
se­quence 3. The hy­phens are put in the align­ment to ac­count for this indel mu­ta­tion
(in­ser­tion in se­quence 3 or a de­le­tion in the an­ces­tor of se­quences 1 and 2). How might
this change the free en­ergy of the struc­ture?

The figure below shows the answer. Notice how the change in a nucleotide at one part of the
stem in sequence 2 and sequence 3 (compared with sequence 1) correlates with a change at
another part of the same sequence that maintains the base pair and the stem structure. This is a
very common pattern in RNA sequence alignments (the most common, in fact). It makes sense
because the stem structures are critical for the stability of the molecule, as you know from the
previous section. Since the principle of MI relies on correlated changes (covariation) between
nucleotide positions in a sequence alignment as evidence of interaction, this means two things:
(i) if there are no changes, MI cannot predict interactions, and (ii) other correlated changes,
such as those in pseudoknots or base triples, can also be detected.
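
The covariation in the three-sequence alignment can also be put into numbers. The sketch below uses the standard Shannon definition of mutual information between two alignment columns, MI(i, j) = Σ p(x, y) · log2[p(x, y) / (p(x) · p(y))]; that this is the exact scoring used by any particular MI program is an assumption. The function name `mutual_information` is illustrative, and gap characters are simply treated as one more symbol.

```python
from collections import Counter
from math import log2

def mutual_information(alignment, i, j):
    """Mutual information (in bits) between 1-based alignment columns i and j."""
    pairs = [(seq[i - 1], seq[j - 1]) for seq in alignment]
    n = len(pairs)
    joint = Counter(pairs)                 # counts of (x, y) combinations
    left = Counter(x for x, _ in pairs)    # counts of x in column i
    right = Counter(y for _, y in pairs)   # counts of y in column j
    return sum(
        (c / n) * log2((c / n) / ((left[x] / n) * (right[y] / n)))
        for (x, y), c in joint.items()
    )

# The alignment from the preparatory exercise, copied as printed above.
alignment = [
    "GAUCCUGCC--UUCACGAUC",  # sequence 1
    "GACCCUGCC--UUCAGGGUC",  # sequence 2
    "CAACCUGCCAGUUCACGUUG",  # sequence 3
]

for i, j in [(1, 20), (2, 19), (3, 18)]:
    print(i, j, round(mutual_information(alignment, i, j), 3))
```

Positions 3 and 18 change together in every sequence and score highest, positions 1 and 20 covary once and score lower, and positions 2 and 19 never change and score zero. That last case is point (i) above: without variation, MI has nothing to work with, even though the two nucleotides are paired. With only three sequences these values are illustrative at best; real analyses use far larger alignments.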

Figure 5.1.3 shows many in­stances of co­vari­a­tion in a larger align­ment of 10 re­lated RNA se­
quences and how to iden­tify mutually in­for­ma­tive nu­cle­o­tide po­si­tions.

FIGURE 5.1.3. Principle of MI. This com­par­a­tive ap­proach re­quires a mul­ti­ple-sequence align­ment (A). The
method searches for cor­re­lated po­si­tions in the align­ment. For in­stance, in this align­ment, po­si­tions 2 and
14 ap­pear to be cor­re­lated (B) be­cause as po­si­tion 2 changes, from an A to a U, for ex­am­ple, there is a
cor­re­spond­ing change in po­si­tion 14 from a U to an A. All in­stances of cor­re­spond­ing changes are iden­ti­fied
(C), and this can be used to de­ter­mine the sec­ond­ary struc­ture of the RNA se­quences. Finally, the
pre­dicted struc­ture of the first se­quence in the align­ment is shown (D). All the cor­re­lated changes in this
ex­am­ple are base pairing in stem regions, but other changes are also pos­si­ble to pre­dict.

Exercises
Interactive ex­er­cises (the­o­ry)
Use the RNA free-energy and MI links be­low to learn how to de­ter­mine the best
RNA struc­ture for a sin­gle se­quence us­ing ther­mo­dy­namic cal­cu­la­tions and to
pre­dict in­ter­act­ing po­si­tions us­ing a mul­ti­ple-sequence align­ment and MI.

RNA Free-Energy Interactive Link:
http://kelleybioinfo.org/algorithms/default.php?o=3

Mutual Information Interactive Link:
http://kelleybioinfo.org/algorithms/default.php?o=10

Problems
1. Determine the RNA free en­ergy of the fol­low­ing se­quence. Show your work.

Total free en­er­gy: _____________



2. a. Draw lines be­low the ta­ble con­nect­ing pre­dicted in­ter­act­ing po­si­tions based
on the prin­ci­ple of MI.

b. Draw the pre­dicted struc­ture of the first se­quence in the align­ment (top
row) be­low.

Lab Exercises (Practice)


In this part of the ex­er­cise, you will learn how to use pro­grams for RNA fold­ing
and MI.

Mfold (Free Energy) Tutorial Link:
http://kelleybioinfo.org/algorithms/tutorial/TRna1.pdf

MatrixPlot (Mutual Information) Tutorial Link:
http://kelleybioinfo.org/algorithms/tutorial/TRna2.pdf

Sample and lab exercise data (for both Mfold and MatrixPlot):
http://kelleybioinfo.org/algorithms/data/DRna1.txt

Lab Exercise
Part 1. RNA fold­ing
Use Mfold to pre­dict the struc­ture of a tRNA se­quence and an­swer the fol­low­ing
ques­tions.

>Haemophilus_influenzae_tRNA
GGGGAUAUAGCUCAGUUGGGAGAGCGCUUGAAUGGCAUUCAAGAGGUCGUCGGUUC
GAUCCCGAUUAUCUCCACCA

1. What is Haemophilus influenzae?

2. What is a tRNA and what does it do?

3. Draw/show your tRNA se­quence in the anal­y­sis win­dow and an­swer the fol­
low­ing ques­tions or fol­low the di­rec­tions be­low.

a. Click on an en­ergy dot plot file link. What is this en­ergy plot tell­ing you?
Explain.

b. Draw or paste the top left cor­ner of the en­ergy dot plot from po­si­tions 1 to
30. Your an­swer should in­clude nine 10 × 10 po­si­tion squares.

c. Draw the RNA struc­tural el­e­ment pre­dicted in this re­gion of the dot plot.
You should be ­able to find this el­e­ment in the struc­ture 1 im­age fi­le.

d. Compare the pre­dicted RNA struc­ture 1 to struc­ture 2. How do they dif­fer,
and how might this change the free en­ergy of the mol­e­cule? Explain.

Part 2

Use MatrixPlot to predict the structure of the FASTA-aligned sequences included
with the sample data (RNA1 to RNA12).

http://​kelleybioinfo.​org/​algorithms/​data/​DRna1.​txt

1. Write the match­ing nu­cle­o­tide po­si­tions of the lon­gest pre­dicted stem-loop
re­gion (the lon­gest di­ag­o­nal) in the graph. Approximate po­si­tions will suf­fice
given the dif­fi­culty of read­ing the graph. (As the old saying goes, you get what
you pay for, and these bioinformatics websites are free.)

2. Write the num­bers of the pre­dicted in­ter­act­ing po­si­tions in­di­cated by the di­ag­
o­nal in the top left cor­ner of the MatrixPlot graph. Note: there are 345 aligned
se­quence po­si­tions (x and y axes in­clude po­si­tions 1 to 345).

3. Write the in­ter­act­ing RNA nu­cle­o­tides at these po­si­tions us­ing the RNA1
se­quence.

Notes
1. Hydrogen bonding of guanine with uracil (considered a noncanonical base pair) is very
common in RNA molecules.
2. Turner DH, Mathews DH. 2009. NNDB: the near­est neigh­bor pa­ram­et­er da­ta­base for pre­
dict­ing sta­bil­ity of nu­cleic acid sec­ond­ary struc­ture. Nucleic Acids Res 38:D280–D282.
CHAPTER 06
PHYLOGENETICS

One of the most remarkable, and arguably the most important, bioinformatics
discoveries revolutionized our entire understanding of life on Earth. In
1977, Carl Woese and colleagues at the University of Illinois manually
aligned pieces of ribosomal RNA (see note 1) sequences isolated from common
bacteria (Escherichia coli and cyanobacteria), a few eukaryotes (yeast and an
aquatic plant called duckweed), and some methane-generating “bacteria”
previously isolated from dairy cows. Using some simple calculations to estimate how
different the sequences were from one another and some parsimony-type
reasoning (see Activity 6.1), the authors generated a small phylogenetic tree that
completely upended our understanding of life on Earth (Fig. 6.1).
While the tree doesn’t look like much, it strongly suggested that the methanogenic
bacteria (e.g., Methanobacterium) and their relatives weren’t bacteria at all.
Rather, they comprised a major new branch in the tree of life. As you can imagine,
this result caused quite a stir and was greeted with much skepticism. This new
hypothesis of life overturned the basic 5-kingdom description of life that had been
accepted since the mid-20th century. However, the more sequences scientists
generated (including whole-genome sequences and alignments), and the better
the alignment and phylogenetic algorithms, the clearer and more solid the pattern
became (Fig. 6.2).

Ramifications of the “Big Tree”


This phy­log­eny not only changed our un­der­stand­ing of the evo­lu­tion of life but
also in­spired the de­vel­op­ment of mo­lec­u­lar tech­niques that al­lowed us to study
mi­crobes in ba­si­cally any en­vi­ron­ment with­out­ hav­ing to cul­ture them first. The
DNA tech­niques in­vented by Nor­man Pace and col­leagues (also at the University
of Il­li­nois) al­lowed re­search­ers to find novel mi­cro­bial life ev­ery­where, in­clud­ing

• Boiling geothermal hot springs more acidic than battery acid
• Every rock, soil, and plant surface on the planet
• Deep inside mineshafts, below frozen lakes in Antarctica, and in the sediments
at the bottom of the ocean


FIGURE 6.1. The “universal” phylogenetic tree circa 1987. The tree of life naturally
separated into three domains: the Eubacteria (now called Bacteria), the Eukaryota (now
Eukarya), and the then-newly identified group called the Archaebacteria because they
were bacteria-like. The Archaebacteria turned out to be fundamentally different from the
Bacteria and share many molecular and cellular aspects with the Eukarya, and they are
now called the Archaea.

• In clouds, industrial waste pits, shower curtains, and scalding steam vents on
Hawaiian volcanoes (see note 2)
• The mouth, gut, and surface of every animal on Earth

The num­ber of dis­cov­er­ies pro­ceed­ing from this re­search has been mind-
blow­ing. Since 1977 and the de­vel­op­ment of cul­ture-independent mo­lec­u­lar meth­
ods, we have learned that

• 99.999% of life is suspected to be microbial and has yet to be cultured (see note 3)
• Bacteria account for the majority of the biomass on the planet
• Microbes exist that can survive boiling water, pH near 0, and salt
concentrations greater than 25%
• Ocean life is mainly microbial, with every milliliter of seawater loaded with
bacteria, archaea, eukaryotic cells, and even more viruses
• The number of species on Earth is estimated at greater than one billion (see note 4)
• The number of microbial cells on planet Earth is ~10^29 and the number of viruses
is ~10^31 (see note 5)
• The human body hosts ~100 trillion microbes, by a widely quoted estimate 10 times
more than human cells

We are now using these techniques, inspired by the simplest of sequence
alignments and phylogenetics, to study microbes and their relationship to
human health, discover new pathogens, track sources of pollution, study the
oceans and the tundra, and discover new mechanisms to change the course of
evolution.

FIGURE 6.2. The expanding tree of life. New molecular methods, including whole-genome
sequencing and direct sequencing of DNA from complex environmental
samples, have greatly expanded our understanding of microbial diversity. The Bacteria
are indicated in blue, the Eukarya in red, and the Archaea in green. This tree is biased
towards bacterial lineages and only includes a few select representative sequences.
However, it does illustrate how our understanding of microbial diversity has grown.
A complete depiction of microbial diversity would include millions of branches and
would be impossible to display in a figure. Still doubted by some in the early 2000s, the
pattern of this phylogenetic tree continues to strengthen with the addition of genomic
sequences (see note 6).

Uses of Phylogenetics
The use of phylogenetics long predates the work of Woese and colleagues,
although the computational methods trace back only to the 1960s. Most
phylogenetic trees were built to describe the relationships among macrofauna and
macroflora such as plants, insects, birds, marmots, and whales. Prior to DNA
sequencing, phylogenetic analysis relied on morphological data (presence or absence of
scales, bones, and hair) and other visible characteristics. However, with the
development of DNA sequencing methods, nucleotides and amino acids became
the predominant data.
Beyond the big tree, phy­lo­ge­netic the­ory has had an enor­mous im­pact on our
un­der­stand­ing not only of the re­la­tion­ships among or­gan­isms but also of the re­la­
tion­ships among genes within ge­nomes, the pro­cess of mo­lec­u­lar evo­lu­tion, the
evo­lu­tion of gene fam­i­lies, pat­terns of re­com­bi­na­tion, and hor­i­zon­tal gene trans­
fer. Phylogenies have also been used to dis­cover new forms of life and the or­i­gins
of deadly vi­ruses and bac­te­ria. Figure 6.3 shows some ex­am­ples of the many
pos­si­ble uses of phy­lo­ge­net­ics.

FIGURE 6.3. Some uses of phylogenetic trees. (A) Pathogen identification. Sequence
alignment and phylogenetic analysis were used to show that the deadly severe acute
respiratory syndrome (SARS) virus was a type of coronavirus (a group that includes
common cold viruses). The numbers indicate the maximum possible bootstrap values
(see “The Bootstrap” later in this chapter). (B) Relationships among steroid receptors,
transcription factors that bind
ste­roid hor­mones like es­tro­gen (ER, es­tro­gen re­cep­tor) or tes­tos­ter­one (AR, an­dro­gen
re­cep­tor). This phy­log­eny shows that all­ ste­roid re­cep­tors were de­rived from an an­ces­tral
pro­tein that likely bound an es­tro­gen-like hor­mone. PR, pro­ges­ter­one re­cep­tor; GR,
glu­co­cor­ti­coid re­cep­tor; MR, min­er­al­o­cor­ti­coid re­cep­tor. The lam­prey se­quence outgroups
are high­lighted in red. Reprinted from Thorn­ton JW. 2001. Proc Natl Acad Sci U S A
98:5671–5676, with per­mis­sion. (C) Phylogeny of cul­tured and un­cul­tured my­co­bac­te­rial
spe­cies (Mycobacterium tu­ber­cu­lo­sis causes, you guessed it, tu­ber­cu­lo­sis). The bold­face
num­bers in­di­cate novel un­cul­tured spe­cies of my­co­bac­te­ria from the air of a hos­pi­tal
pool where the life­guards had been get­ting sick and cough­ing up blood. Reprinted from
Angenent LT, Kelley ST, St Amand A, Pace NR, Hernandez MT. 2005. Proc Natl Acad Sci
U S A 102:4860–4865, with per­mis­sion.

How To Interpret Phylogenetic Trees


There are many ways to draw phylogenetic trees and different aspects of trees
that require interpretation. A phylogeny is a hypothesis of the evolutionary
relationships of organisms or genes, usually based on molecular data. The minimum
information a phylogenetic tree shows is the relationships among the taxa. You will see
the word taxon (singular) or taxa (plural) used often with phylogenies. It comes from

FIGURE 6.4. Aspects of phy­lo­ge­netic trees. (A) The to­pol­ogy of the tree in­di­cates the
re­la­tion­ships among the taxa. The taxa at the tips (leaves, blue cir­cles) of the tree are
con­nected by branches to the other taxa via in­ter­nal nodes (red cir­cles). The nodes in­di­cate
the com­mon an­ces­tor, and the fewer the nodes be­tween taxa, the closer their phylo­
genetic re­la­tion­ship. Both trees are unrooted; how­ever, the bot­tom tree shows ad­di­tional
in­for­ma­tion in the form of branch lengths. Longer branches in­di­cate more evo­lu­tion­ary
change in the se­quence since the split from the com­mon an­ces­tor. (B) On the left is a
clad­o­gram-type tree rooted with the squid (an in­ver­te­brate) outgroup. Outgroups are used
to de­ter­mine the or­der of evo­lu­tion­ary events. Squid make a good outgroup in this case
be­cause they are “out­­side” the group of three ver­te­brates. The trees to the right have the
same to­pol­ogy, just side­ways. The ar­row shows that one can ro­tate taxa at a node with­out­
af­fect­ing the in­ter­pre­ta­tion, since evo­lu­tion­ary time al­ways ex­tends from the root node
out­­ward.

the word tax­on­omy and is a gen­eral term to en­com­pass any level of tax­o­nomic
or­ga­ni­za­tion. A taxon can re­fer to a spe­cies, ge­nus, or fam­ily of or­gan­isms, but it can
also re­fer to genes or even un­known groups. Figure 6.4 shows how to read dif­fer­ent
as­pects of phy­lo­ge­netic trees that can be used to de­scribe the re­la­tion­ships among
taxa, the amount of mu­ta­tional change since the taxa split from a com­mon an­ces­tor,
and even the sta­tis­ti­cal sup­port of the re­la­tion­ships.
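
The point that rotating taxa at a node does not change a tree can be made concrete with a small sketch. Here a rooted tree is written as nested Python tuples, and the helper `topology` (an illustrative name) converts it to nested frozensets, which ignore left/right order. The taxon names other than the squid outgroup are hypothetical stand-ins for the vertebrates in Fig. 6.4.

```python
def topology(tree):
    """Order-free form of a nested-tuple tree.

    Leaves (strings) are returned unchanged; every internal node becomes a
    frozenset of its children, so rotating branches at a node has no effect.
    """
    if isinstance(tree, str):
        return tree
    return frozenset(topology(child) for child in tree)

# Hypothetical taxa; squid is the outgroup, as in the figure.
tree_a = ("squid", ("fish", ("frog", "mouse")))
tree_b = ((("mouse", "frog"), "fish"), "squid")  # tree_a with nodes rotated
tree_c = ("squid", ("frog", ("fish", "mouse")))  # a different grouping

print(topology(tree_a) == topology(tree_b))
print(topology(tree_a) == topology(tree_c))
```

The first comparison treats the two drawings as the same hypothesis of relationships; the second does not, because frog and mouse are no longer each other's closest relatives.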

The Bootstrap
In Activity 6.1, we will cover two methods for creating phylogenetic trees using
multiple-sequence alignments of DNA or protein sequences: distance and parsimony.
These tree-building methods aim to determine the best phylogenetic tree
for a given set of taxa. However, while these methods can build a phylogeny, they
cannot by themselves determine the statistical significance of the tree.
This is where the bootstrap method (see note 7) comes in. Phylogenetic bootstrapping is
the most commonly used method for determining how well the data support the
relationships in the tree. A bootstrap is a common statistical procedure that creates
a random resampling of the data with replacement. In the case of phylogenetic
analysis, the bootstrap resamples the positions of the multiple-sequence alignment,
creating a new data set of the same size (same number of positions). In a
typical bootstrap analysis, many hundreds or thousands of bootstrap replicates are
performed, and each replicate creates a randomly sampled sequence alignment
(Fig. 6.5A). During each replicate, a phylogenetic tree is built from the randomly
assembled alignment. If one performs 1,000 bootstrap replicates, one has created
1,000 phylogenetic trees of the same set of taxa. In the final step, all the bootstrap
replicates are summarized into a single bootstrap consensus tree (Fig. 6.5B).

FIGURE 6.5. Example of bootstrap analysis. (A) Making two bootstrap data sets by
sampling with replacement. One hundred bootstrap replicates would make 100 data sets.
Notice how some positions have been sampled multiple times in a data set while others
not at all. (B) A phylogenetic tree is built for each data set. All the resulting trees are
combined to make a consensus tree. If all the trees have a particular node, this node is said
to have 100% bootstrap support. In a standard approach, if a node is found in fewer than
50% of the bootstrap trees, all the unsupported branches are collapsed into a single node.
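
The resampling step itself is short enough to sketch in Python. This is a minimal illustration under stated assumptions, not the code of any particular phylogenetics package: `bootstrap_alignment` is a hypothetical helper, the four-sequence alignment is made up, and tree building (which would follow each replicate) is left out.

```python
import random

def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement.

    Returns the replicate alignment (same number of sequences and columns)
    and the list of sampled 0-based column indices.
    """
    n_columns = len(alignment[0])
    sampled = rng.choices(range(n_columns), k=n_columns)  # with replacement
    replicate = ["".join(seq[c] for c in sampled) for seq in alignment]
    return replicate, sampled

# Made-up toy alignment: four taxa, eight aligned positions.
toy_alignment = ["ACGTTGCA", "ACGTAGCA", "ATGTAGGA", "ATGCAGGA"]

rng = random.Random(42)  # fixed seed so a run can be repeated
replicate, sampled = bootstrap_alignment(toy_alignment, rng)
print(sampled)
print(replicate)
```

Because whole columns are sampled, every sequence in a replicate keeps the same positions in the same order, and some original columns will typically appear more than once while others are missing, as in Fig. 6.5A.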
The idea behind the bootstrap in phylogenetics is simple: if every single tree in
all the bootstrap replicates (100% of the trees) shows taxon A closely related to
taxon B, this is the highest support that can be achieved and indicates that the
relationship of taxon A to taxon B is strongly supported by the data. For example, in the
SARS phylogeny (Fig. 6.3A), there is 100% support for the relationship of the SARS
virus to the group 2 coronaviruses. In other words, no matter what positions of the
alignment are selected, SARS is always closely related to the group 2 coronaviruses.
Bootstrap values of 95% or higher are considered significant, though in practice
values of 70% or higher are often treated as reliable.
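
Bootstrap support for a grouping is simply the percentage of replicate trees that contain it. The sketch below assumes each replicate tree has already been built and reduced to the set of groupings (clades) it contains; the four replicate trees and the taxon names A to D are invented for illustration, and `bootstrap_support` is a hypothetical helper.

```python
def bootstrap_support(clade, replicate_trees):
    """Percentage of replicate trees that contain the given clade."""
    clade = frozenset(clade)
    hits = sum(clade in tree for tree in replicate_trees)
    return 100.0 * hits / len(replicate_trees)

# Invented example: each replicate tree is the set of clades it contains.
replicate_trees = [
    {frozenset("AB"), frozenset("CD")},
    {frozenset("AB"), frozenset("BC")},
    {frozenset("AB"), frozenset("AC")},
    {frozenset("AB"), frozenset("BD")},
]

for clade in ("AB", "CD"):
    support = bootstrap_support(clade, replicate_trees)
    status = "kept" if support >= 50 else "collapsed"  # 50% rule from Fig. 6.5B
    print(clade, support, status)
```

Real analyses use hundreds or thousands of replicates rather than four, but the calculation is the same.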
It is important to keep in mind several things about the bootstrap. First, all the
bootstrap does is create randomized data sets. The bootstrap is a statistical
method, not a phylogenetic method. The phylogenetic analysis is performed
separately with each bootstrap data set. For instance, one can perform a
neighbor-joining, a maximum-parsimony, or a maximum-likelihood bootstrap analysis (as
we will see in Activity 6.1). Second, the bootstrap phylogeny is a consensus
phylogeny, not the best phylogeny for the data set. For instance, the best phylogeny
may resolve all the relationships, but the bootstrap often collapses the branches
that are not well supported (Fig. 6.5B). Third, the more bootstraps performed, the
better. However, some methods are very slow (like maximum parsimony and
especially maximum likelihood), and depending on the number of taxa, it may take
too much time to do thousands of bootstrap replicates.

Notes
1. The se­quences were ac­tu­ally RNAs iso­lated from or­gan­isms grown in cul­tures—a real pain in
the neck! Now we just se­quence the DNA that codes for the RNA (or any­thing else).
2. To read about those last two, see Kelley ST, Theisen U, Angenent LT, St Amand A, Pace
NR. 2004. Molecular analysis of shower curtain biofilm microbes. Appl Environ Microbiol
70:4187–4192 and Ellis DG, Bizzoco RW, Kelley ST. 2008. Halophilic Archaea determined
from geothermal steam vent aerosols. Environ Microbiol 10:1582–1590.
3. Locey KJ, Len­non JT. 2016. Scaling laws pre­dict global mi­cro­bial di­ver­sity. Proc Natl Acad
Sci U S A 113:5970–5975.
4. One study estimated the total number of eukaryotic species at 8.7 million (Mora C, Tittensor DP,
Adl S, Simpson AG, Worm B. 2011. How many species are there on Earth and in the ocean?
PLoS Biol 9:e1001127), while microbial diversity (mainly Bacteria and Archaea) has been
estimated at 1 trillion species (see note 3, above).
5. See Kallmeyer J, Pockalny R, Adhikari RR, Smith DC, D’Hondt S. 2012. Global dis­tri­bu­tion
of mi­cro­bial abun­dance and bio­mass in subseafloor sed­i­ment. Proc Natl Acad Sci U S A
109:16213–16216 and Whit­man WB, Coleman DC, Wiebe WJ. 1998. Prokaryotes: the un­
seen ma­jor­ity. Proc Natl Acad Sci U S A 95:6578–6583.
6. Why only 3 do­mains—why not a 4th, 5th, or more? Have we missed them? Are vi­ruses the
4th do­main? Stay tuned!
7. The term “bootstrap” comes from the idiom “pull yourself up by your own bootstraps,” which
means to do it without any outside assistance (i.e., all by yourself). Bootstraps can be found
on cowboy boots and are used to put them on. In the context of statistics or phylogenetics,
bootstrap methods resample the same data as used to generate the result and thereby
test the robustness of the result: “Pull the result up by its own data.”

ACTIVITY 6.1 PHYLOGENETIC ANALYSIS

Motivation
A phylogeny is a diagram that indicates the evolutionary relationships among organisms or the
molecular sequences (DNA, RNA, or protein) within organisms (see note 1). Phylogenetic trees
have been used to, among other things, determine the relationships among songbirds, study the
evolution of drug resistance in HIV, predict new gene functions, and discover new forms of
microbial life. Phylogenetic methods, algorithms for computing phylogenies, require
multiple-sequence alignments of DNA or protein sequences of the same gene from different organisms
to determine the evolutionary relationships among the organisms (see note 2). One may also make
sequence alignments of different but related genes (e.g., all gene sequences in the globin
family) to determine how the gene family evolved. Phylogenetic methods use information from the
mutations that have accumulated among the sequences over evolutionary time to determine how the
sequences are related to one another.
In this ac­tiv­ity, we will cover two ba­sic meth­ods for con­struct­ing the best phy­lo­ge­netic tree,
given a set of aligned se­quence data. The un­der­ly­ing prin­ci­ples be­hind these meth­ods are very
dif­fer­ent, but they have the same goal: find­ing the best phy­lo­ge­netic tree for the data. We will
also cover a sta­tis­ti­cal ap­proach for as­sess­ing the qual­ity of the phy­log­eny and use an on­line re­
source for build­ing phy­lo­ge­netic trees us­ing these meth­ods.

Learning Objectives
1. Learn the prin­ci­ples and goals of phy­lo­ge­net­ics, some uses of phy­log­e­nies, how to in­ter­pret
phy­lo­ge­netic trees, and how the boot­strap method can be used to de­ter­mine tree ac­cu­racy
(Motivation).
2. Be ­able to cal­cu­late a dis­tance ma­trix based on a mul­ti­ple-sequence align­ment and un­der­
stand how it can be used to build a neigh­bor-joining tree (Concepts and Exercises).
3. Use the max­i­mum-parsimony prin­ci­ple and set the­ory to de­ter­mine the an­ces­tral char­ac­ters
in a phy­lo­ge­netic tree (Concepts and Exercises).
4. Use the max­i­mum-parsimony prin­ci­ple to de­ter­mine the best among a set of phy­lo­ge­netic
trees (Concepts and Exercises).
5. Learn how to use an on­line pro­gram for build­ing phy­lo­ge­netic trees us­ing neigh­bor join­ing
and max­i­mum par­si­mony and run a boot­strap anal­y­sis (Concepts and Exercises).

Concepts
This ac­tiv­ity cov­ers two re­lated, but fun­da­men­tally dif­fer­ent, ap­proaches to es­ti­mat­ing the phy­lo­
ge­netic re­la­tion­ships among a set of aligned DNA or pro­tein se­quences. The first method we will
cover is re­ferred to as a dis­tance method be­cause it uses the over­all dis­tance (dis­sim­i­lar­ity)
among the se­quences in a mul­ti­ple-sequence align­ment3 to de­ter­mine the re­la­tion­ships. The
sec­ond method, re­ferred to as max­i­mum par­si­mony (MP), also uses a mul­ti­ple-sequence align­
ment but fo­cuses on spe­cific nu­cle­o­tides or amino acid po­si­tions in the align­ment to de­ter­mine
how the se­quences are re­lated to one an­other. In sci­ence, the prin­ci­ple of par­si­mony states that
in choos­ing be­tween com­pet­ing hy­poth­e­ses, the one with the few­est as­sump­tions should be
selected.4 In the case of phylogenetic trees, MP selects the phylogenetic tree that requires the
smallest number of changes (i.e., the maximally parsimonious tree).

Algorithm 1: dis­tance meth­od


The idea be­hind dis­tance-based meth­ods is sim­ple: the greater the sim­il­ar­ity of two se­quences
(the shorter their dis­tance), the more closely re­lated they are to each other. Distance val­ues
range from 0 to 1, with sequences that are identical (all DNA nucleotides or protein amino acids)
hav­ing a pairwise dis­tance value of 0 and se­quences that have noth­ing in com­mon hav­ing a dis­
tance value of 1.
The fol­low­ing ex­er­cise should help you un­der­stand the prin­ci­ples be­hind dis­tance cal­cu­la­
tions. Below is a mul­ti­ple-sequence align­ment of four DNA se­quences, each from a dif­fer­ent
spe­cies of mar­mot. By com­par­ing the se­quences in the align­ment, can you tell which spe­cies
are more closely re­lated? Which ones are dis­tantly re­lat­ed?

Species 1   A T A T T T C G A T
Species 2   A T C G T C C G G A
Species 3   G C C G T T C G C A
Species 4   G T A G T C G G A T

Closest (small­est dis­tance) spe­cies:

Farthest (great­est dis­tance) spe­cies:

Reflection
• What is the to­tal num­ber of iden­ti­cal nu­cle­o­tides be­tween spe­cies 1 and spe­cies 2? How
about spe­cies 1 and spe­cies 3?
• Could you use the num­ber of dif­fer­ences and the length (num­ber of po­si­tions) in the
se­quence align­ment to cal­cu­late a dis­tance score be­tween spe­cies 1 and 2?
• Could you use a sim­i­lar ap­proach with pro­tein se­quences? What would you com­pare?
• How many to­tal pairwise com­par­i­sons are there in this align­ment? If there were 100
sequences, would you offer your computer some candy if it helped you out?

The answer is as follows. The sequences for species 1 and species 4 have 4 out of 10 nucleotides
that are different between them. The same is true of species 2 and species 3. Since the
sequences are 10 nucleotides in length, the distance is 4 out of 10, or 0.4, for these pairs,
which makes them equally close matches. Species 1 and species 3, and species 3 and species 4, have
6 out of 10 nucleotides different, a distance of 0.6, which makes them equally distant matches.

Species 1   A T A T T T C G A T
Species 2   A T C G T C C G G A
Species 3   G C C G T T C G C A
Species 4   G T A G T C G G A T

Number of dif­fer­ent nu­cle­o­tides


1—2 5
1—3 6
1—4 4
2—3 4
2—4 5
3—4 6
Closest : 1 and 4, 2 and 3
Farthest : 1 and 3, 3 and 4
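
The counts in the table above can be reproduced with a short script. This is a minimal sketch, assuming the simple proportional distance described in the text (differences divided by alignment length); the function names `count_differences`, `p_distance`, and `n_pairs` are ours. The last function also answers the reflection question about how quickly the number of pairwise comparisons grows.

```python
from itertools import combinations

alignment = {
    "Species 1": "ATATTTCGAT",
    "Species 2": "ATCGTCCGGA",
    "Species 3": "GCCGTTCGCA",
    "Species 4": "GTAGTCGGAT",
}

def count_differences(seq_a, seq_b):
    """Number of aligned positions at which two sequences differ."""
    return sum(a != b for a, b in zip(seq_a, seq_b))

def p_distance(seq_a, seq_b):
    """Proportion of differing positions (0 = identical, 1 = nothing shared)."""
    return count_differences(seq_a, seq_b) / len(seq_a)

# One distance for every unordered pair of species.
distances = {
    (x, y): p_distance(alignment[x], alignment[y])
    for x, y in combinations(alignment, 2)
}

for (x, y), d in distances.items():
    print(f"{x} vs {y}: {d:.1f}")

def n_pairs(n):
    """Number of pairwise comparisons among n sequences: n(n - 1)/2."""
    return n * (n - 1) // 2

print(n_pairs(4), n_pairs(100))
```

With 4 sequences there are 6 comparisons; with 100 sequences there are 4,950, which is why this job is best left to the computer.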

Using dis­tances to build a phy­lo­ge­netic tree


Distance methods estimate the phylogenetic relationships among a set of taxa by (i) determining
all the pairwise distances between all sequences5 in an alignment and then (ii) using these
dis­tances to build a phy­lo­ge­netic tree. There are sev­eral ways to build phy­lo­ge­netic trees based
on a pairwise dis­tance ma­trix. We will fo­cus on a sim­ple and widely used method called the
neigh­bor-joining (NJ) al­go­rithm. As its name im­plies, it builds a phy­lo­ge­netic tree by “join­ing
neighbors” sequentially, resulting in a single best tree for all the sequences. NJ is very fast, even
with thou­sands of se­quences, and it can be used with any dis­tance ma­trix, which makes it the
most pop­u­lar al­go­rithm for dis­tance-based phy­lo­ge­netic an­a­ly­ses.
Figure 6.1.1 de­tails how to de­ter­mine a dis­tance ma­trix for a set of 4 re­lated DNA se­quences
us­ing a sim­ple dis­tance met­ric. Figure 6.1.2 de­scribes a (sim­pli­fied) it­er­a­tive pro­cess of us­ing a
dis­tance ma­trix to build an NJ phy­lo­ge­netic tree.

FIGURE 6.1.1. Calculating a dis­tance ma­trix with 4 aligned DNA se­quences.


(A) Distances are calculated between all pairs of sequences. Based on the sequence
alignment in the upper left corner, sequences S1 and S2 differ at two nucleotide positions
out of five total in the alignment, making the proportional difference 0.4. (B) All pairs of
distances are calculated to complete the matrix.

FIGURE 6.1.2. Using the distance matrix from Fig. 6.1.1 to build an NJ tree. (1) The
initial tree is a so-called star phylogeny in which all the species are attached to a central
node, and it has no structure. (2) The first step is to search the matrix for the nearest
neighbors: the sequences with the shortest distance. These are joined together on the
tree. (3) Next, the average distances are calculated for the two sequences that were
joined (S3 and S4) to all the other sequences in the matrix. This creates a new, smaller
distance matrix. (4) Repeat step 2 with the new distance matrix. Find the shortest
distance and join the taxa. (5) Recalculate the distance matrix and join nodes until there
are no more distances.

The on­line dis­tance ma­trix in­ter­ac­tive link (see Exercises) has ad­di­tional ex­pla­na­tions of how
to build a ma­trix for a set of se­quences and gen­er­ate a phy­lo­ge­netic tree. The dis­tance cal­cu­la­
tions per­formed here are the sim­plest pos­si­ble, and there are more so­phis­ti­cated met­rics that
ac­count for mu­ta­tional bi­ases and the pos­si­bil­ity of mu­ta­tion re­ver­sals. Also, the NJ tree-
build­ing ap­proach is more com­pli­cated than de­scribed and in­cludes es­ti­ma­tions of branch
lengths. However, the steps in Fig. 6.1.1 and 6.1.2 and the al­go­rithm in­ter­ac­tive show the ba­sic
prin­ci­ples of ma­trix build­ing and NJ tree con­struc­tion.
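
The simplified joining procedure of Fig. 6.1.2 can be sketched as follows. Please read this as an illustration of the figure's steps only (find the shortest distance, join, average, repeat) and not as the full NJ algorithm, which chooses pairs with a corrected criterion and also estimates branch lengths. The function name `simplified_join`, the tie-breaking rule, and the use of the marmot distances from the earlier exercise as input are our assumptions.

```python
def simplified_join(names, dist):
    """Cluster by repeatedly joining the closest pair (simplified, as in Fig. 6.1.2).

    names: list of sequence labels.
    dist:  dict mapping frozenset({a, b}) to the distance between a and b.
    Returns a nested tuple that records the order of joining.
    """
    clusters = list(names)
    dist = dict(dist)                      # work on a copy
    while len(clusters) > 1:
        # Step 2: find the pair with the shortest distance
        # (ties go to the first pair encountered).
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                a, b = clusters[i], clusters[j]
                d = dist[frozenset((a, b))]
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        joined = (a, b)
        others = [c for c in clusters if c not in (a, b)]
        # Step 3: the distance from the new node to every other cluster
        # is the average of the two old distances.
        for c in others:
            dist[frozenset((joined, c))] = (
                dist[frozenset((a, c))] + dist[frozenset((b, c))]
            ) / 2
        clusters = others + [joined]
    return clusters[0]

# Pairwise distances from the marmot DNA exercise.
raw = {
    ("S1", "S2"): 0.5, ("S1", "S3"): 0.6, ("S1", "S4"): 0.4,
    ("S2", "S3"): 0.4, ("S2", "S4"): 0.5, ("S3", "S4"): 0.6,
}
dist = {frozenset(pair): d for pair, d in raw.items()}

tree = simplified_join(["S1", "S2", "S3", "S4"], dist)
print(tree)
```

The nested result groups S1 with S4 and S2 with S3, in agreement with the closest pairs found by hand.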

Algorithm 2: max­i­mum par­si­mony (MP) meth­od


MP uses the same data (a mul­ti­ple-sequence align­ment) as dis­tance meth­ods, but it doesn’t
cre­ate a ma­trix of sim­i­lar­i­ties. Instead, MP ex­am­ines each po­si­tion in a mul­ti­ple-sequence align­
ment sep­a­rately to see if it has any use­ful in­for­ma­tion. If so, MP de­ter­mines what each of the
in­for­ma­tive po­si­tions says about how the se­quences are re­lat­ed.
To gain a sense of how this works, try the fol­low­ing prob­lem. In this mul­ti­ple-sequence align­
ment of pro­tein se­quences from 4 dif­fer­ent spe­cies of mar­mots, try to iden­tify amino ac­ids that
are shared between species. For example, species 1 and species 4 share a Y at position 2,
while species 2 and 3 share an A at the same position. Write down which positions are shared
be­tween the se­quences be­low.

            1 2 3 4 5 6 7 8 9 10
Species 1   V Y L G H E F Q K S
Species 2   V A L R H D F Q K W
Species 3   V A L R H D F Q F W
Species 4   V Y L G H E F Q F S

Positions with shared amino ac­ids


Species 1 and 2:
Species 1 and 3:
Species 1 and 4:
Species 2 and 3:
Species 2 and 4:
Species 3 and 4:

Reflection
• Based on your po­si­tional anal­y­sis of this mul­ti­ple-sequence align­ment, can you con­clude
what spe­cies are most closely re­lat­ed?
• What do po­si­tions 1 and 3 tell you about which spe­cies are more closely re­lat­ed?
• Are there any con­flict­ing data? In other words, do the shared amino ac­ids at some po­si­tions
con­tra­dict the pat­terns at other po­si­tions?
• Could a sim­i­lar ap­proach be used with DNA se­quences?

Here are the an­swers. Positions 2, 4, 6, and 10 in­di­cate a close re­la­tion­ship be­tween spe­cies 1
and 4 and spe­cies 2 and 3. Position 9 con­tra­dicts this re­sult, and the rest of the po­si­tions do not
say any­thing about which se­quence is re­lated to which ac­cord­ing to MP.

            1 2 3 4 5 6 7 8 9 10
Species 1   V Y L G H E F Q K S
Species 2   V A L R H D F Q K W
Species 3   V A L R H D F Q F W
Species 4   V Y L G H E F Q F S

Positions with shared amino ac­ids


Species 1 and 2: 1, 3, 5, 7, 8, 9
Species 1 and 3: 1, 3, 5, 7, 8
Species 1 and 4: 1, 3, 5, 7, 8, 2, 4, 6, and 10
Species 2 and 3: 1, 3, 5, 7, 8, 2, 4, 6, and 10
Species 2 and 4: 1, 3, 5, 7, 8
Species 3 and 4: 1, 3, 5, 7, 8, 9

The ba­sic prin­ci­ple be­hind MP is that only shared de­rived po­si­tions (called syn­ap­o­mor­phies)6 are
use­ful for de­ter­min­ing phy­lo­ge­netic re­la­tion­ships. In the example, this in­cludes po­si­tions 2, 4, 6,
9, and 10. Positions that are shared in all species are useless for MP, because they don’t give you
any in­for­ma­tion about whether spe­cies 1 is closer to spe­cies 2 than it is to spe­cies 3 or spe­cies 4.
These po­si­tions are con­sid­ered to be an­ces­tral char­ac­ter states (de­rived from a sin­gle com­mon
an­ces­tor spe­cies), and so the al­go­rithm ig­nores them.
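
The informative positions in the marmot protein alignment can also be found programmatically. The sketch below assumes the usual working definition of a parsimony-informative position: at least two different characters must each be shared by at least two sequences. The function name `informative_positions` is ours.

```python
from collections import Counter

proteins = {
    "Species 1": "VYLGHEFQKS",
    "Species 2": "VALRHDFQKW",
    "Species 3": "VALRHDFQFW",
    "Species 4": "VYLGHEFQFS",
}

def informative_positions(alignment):
    """Return 1-based positions where at least two different characters
    are each shared by at least two sequences."""
    seqs = list(alignment.values())
    result = []
    for i in range(len(seqs[0])):
        counts = Counter(seq[i] for seq in seqs)
        shared_states = [c for c, n in counts.items() if n >= 2]
        if len(shared_states) >= 2:
            result.append(i + 1)
    return result

print(informative_positions(proteins))
```

For this alignment the function reports positions 2, 4, 6, 9, and 10, the same positions identified above.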

Which phy­lo­ge­netic tree is the short­est (most par­si­mo­ni­ous)?


Unlike the NJ dis­tance method, MP does not build a sin­gle best tree. Instead, it chooses the
best, most (max­i­mally) par­si­mo­ni­ous tree among a set of trees. Figures 6.1.3 to 6.1.6 show how
MP de­ter­mines which of two pos­si­ble trees is the most par­si­mo­ni­ous. The al­go­rithm ac­tu­ally
has two dis­tinct but re­lated as­pects. The first part of the pro­cess uses prin­ci­ples of set the­ory to
re­duce the num­ber of pos­si­bil­i­ties at each node of the tree. The sec­ond part finds the min­i­mum
num­ber of changes that could have oc­curred to pro­duce the pat­tern of nu­cle­o­tides.
Figure 6.1.3 pro­vi­des a prob­lem with a small se­quence align­ment and two phy­lo­ge­netic trees
that show the pos­si­ble re­la­tion­ships among 5 se­quences (from 5 dif­fer­ent taxa). There are ac­tu­
ally 105 pos­si­ble trees with 5 dif­fer­ent taxa, but 2 are suf­fi­cient to show the ba­sic pro­cess.
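
To see where the figure of 105 comes from, the number of possible trees can be computed directly. This sketch assumes that the trees being counted are rooted and strictly bifurcating, for which the count is the product of the odd numbers 3 × 5 × ... × (2n − 3); that assumption is consistent with 105 trees for 5 taxa. The function name `rooted_tree_count` is ours.

```python
def rooted_tree_count(n_taxa):
    """Number of rooted, bifurcating trees for n taxa (n >= 2):
    the product 3 * 5 * ... * (2n - 3)."""
    count = 1
    for k in range(3, 2 * n_taxa - 2, 2):   # 3, 5, ..., 2n - 3
        count *= k
    return count

for n in (3, 4, 5, 10):
    print(n, rooted_tree_count(n))
```

The count grows so quickly (more than 34 million trees for only 10 taxa) that examining every tree by hand is out of the question for all but the smallest data sets.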

FIGURE 6.1.3. Two pos­si­ble phy­lo­ge­netic trees for one data set of 5 se­quences with
7 nu­cle­o­tide align­ment po­si­tions. The bot­tom half of the fig­ure shows the nu­cle­o­tides
of the first po­si­tion mapped to the tips of the phy­lo­ge­netic tree. Next, we will de­ter­mine
the min­i­mum num­ber of changes needed on the phy­log­eny to pro­duce this pat­tern of
nu­cle­o­tides at this po­si­tion (see Fig. 6.1.4 and 6.1.5).

In Fig. 6.1.4, set the­o­ry7 is used to de­ter­mine the most par­si­mo­ni­ous set of nu­cle­o­tides8 at
each node of the tree. The fol­low­ing de­scribes the set the­ory no­ta­tion for solv­ing this part of the
prob­lem. (You will likely find the con­cept eas­ier to grasp in the fig­ure and the on­line tu­to­ri­al/inter­
active.) The set the­ory logic is as fol­lows:

{node} = {higher node 1} ∩ {higher node 2}

Translated into En­glish, if the two higher nodes have some­thing in com­mon, the node con­tains
only what they have in com­mon (the in­ter­sec­tion, ∩). For ex­am­ple:

{node} = {A, G} ∩ {G}

{node} = {G}

But what hap­pens if they do not have any­thing in com­mon? Easy, you keep ev­ery­thing! This is
also known as the union (∪) of the two sets.

If {higher node 1} ∩ {higher node 2} = ∅

{node} = {higher node 1} ∪ {higher node 2}

For example:

If {higher node 1} ∩ {higher node 2} = {G} ∩ {A} = ∅

{node} = {higher node 1} ∪ {higher node 2} = {G} ∪ {A} = {G, A}

FIGURE 6.1.4. Using principles of set theory to reduce the number of possibilities at
each node. To solve the problem, start at the tips (leaves) of the tree and move towards the
base (root). (A) In tree 1, the first node is an A because the intersection of the two tips above
is A. In tree 2, the first node is either an A or a G. The intersection of {A} and {G} is the empty
set, so we keep everything: the union {A, G}. (B) Proceed down to the base of each tree
using the above tips/nodes to determine the most parsimonious possibilities at each node.

Figure 6.1.5 shows the sec­ond part of the pro­cess: de­ter­min­ing a path through the tree that
requires the fewest changes (i.e., is maximally parsimonious). The online tutorial allows
you to prac­tice with this con­cept of min­i­mi­za­tion.
Finally, the min­i­mi­za­tion pro­cess is re­peated for each po­si­tion in the se­quence align­ment (Fig.
6.1.6). While te­dious, it be­comes quicker when one re­al­izes that cer­tain po­si­tions can be ig­nored
be­cause they are par­si­mo­ni­ously un­in­for­ma­tive.
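
The set-theory pass and the step count can be combined in a short recursive function. This is a sketch under stated assumptions, not the book's interactive: it counts one step every time a union is needed, which gives the same minimum as tracing a path through the tree (the standard small-parsimony procedure, usually attributed to Fitch). The names `fitch` and `tree_length`, the nested-tuple tree encoding, and the five-taxon example are hypothetical and are not the data of Fig. 6.1.3.

```python
def fitch(tree, states):
    """Bottom-up set pass for one alignment position.

    tree:   a tip name (str) or a pair (left, right) of subtrees.
    states: dict mapping each tip name to its character at this position.
    Returns (set of candidate characters at this node, steps so far).
    """
    if isinstance(tree, str):                 # a tip: its set is its own character
        return {states[tree]}, 0
    left_set, left_steps = fitch(tree[0], states)
    right_set, right_steps = fitch(tree[1], states)
    common = left_set & right_set             # intersection
    if common:
        return common, left_steps + right_steps
    # Empty intersection: keep everything (union) and count one change.
    return left_set | right_set, left_steps + right_steps + 1

def tree_length(tree, alignment):
    """Total steps over all positions; alignment maps tip name -> sequence."""
    n_positions = len(next(iter(alignment.values())))
    total = 0
    for i in range(n_positions):
        column = {name: seq[i] for name, seq in alignment.items()}
        total += fitch(tree, column)[1]
    return total

# Hypothetical single position for five taxa.
position = {"t1": "A", "t2": "A", "t3": "G", "t4": "G", "t5": "G"}
tree_1 = ((("t1", "t2"), "t3"), ("t4", "t5"))
tree_2 = ((("t1", "t3"), "t4"), ("t2", "t5"))

print(fitch(tree_1, position))   # root set and number of steps for tree 1
print(fitch(tree_2, position))   # root set and number of steps for tree 2
```

In this made-up example the first tree needs 1 step and the second needs 2, so the first tree is the more parsimonious for this position. Calling `tree_length` with a full alignment repeats the count for every position and adds the results.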

FIGURE 6.1.5. Having al­ready min­i­mized the pos­si­bil­i­ties at each node, choose the
char­ac­ters (nu­cle­o­tides) at each node that re­quire the few­est changes. (A) Starting
at the root of the tree, the only op­tion in both trees is G. Moving up the right­most branch,
we see that no changes are re­quired. The po­si­tion doesn’t mu­tate (change) along the
branch, and 0 steps are re­quired. Moving up the left branch, if we choose G at the node,
no changes are re­quired. Choosing A would re­quire 1 mu­ta­tion (step) from a G to an A.
This is not the min­i­mum, so choos­ing G is pref­er­a­ble for both trees. (B) Complete the
minimization for all nodes. In tree 1, the path chosen requires 1 total change in the tree for
this po­si­tion, the G-to-A mu­ta­tion in­di­cated by the ar­row. No other path will be less than 1
for this po­si­tion in this tree. In tree 2, the path we choose has 2 mu­ta­tions, and this is also
the min­i­mum for this po­si­tion on this tree.

FIGURE 6.1.6. Finish the problem by calculating the number of steps for each position
on each tree. Then, add them up to determine the total number of steps across all
positions for each tree. In this case, tree 1 has a minimum total of 8 steps across all
positions in the alignment. This is one step shorter than is needed for tree 2, so tree 1
is the most parsimonious option.

For example, if all the nucleotides are the same at a position, there will be 0 changes (steps)
on all the possible trees at this position (see position 5 in Fig. 6.1.6). Similarly, if there has been
only 1 change in one of the sequences, and all the rest are the same, the number of changes
(steps) for this position will always be 1 no matter which tree is tested (position 4 in Fig. 6.1.6).
These positions are called uninformative characters because they cannot indicate which of the trees
is most parsimonious. In fact, MP tree-building software completely ignores these positions.
The on­line MP tu­to­rial ex­plains how to de­ter­mine the best phy­lo­ge­netic tree given a mul­ti­ple-
sequence align­ment:
http://kelleybioinfo.org/algorithms/interactive/IMax.pdf.

Exercises
Interactive ex­er­cises (the­o­ry)
Use the on­line phy­log­eny in­ter­ac­tive links be­low to learn how to build dis­tance
ma­tri­ces and phy­log­e­nies us­ing the dis­tance method and learn how to se­lect the
bet­ter phy­log­eny un­der the MP cri­te­rion. Once you learn how they work, solve the
ac­tiv­ity prob­lems.

Distance Matrix Interactive Link


Link: http://kelleybioinfo.org/algorithms/default.php?o=5

Maximum Parsimony Interactive Link


Link: http://kelleybioinfo.org/algorithms/default.php?o=12

Problems
1. Distance matrix problem: answer the questions below.

a. Fill in the ta­bles.

b. Determine the first node of the tree and write it out in set notation (see the
in­ter­ac­tive).

FIRST TREE NODE:

c. Recalculate the dis­tance ma­trix pro­por­tions af­ter join­ing the first node.

RECALCULATED DISTANCE MATRIX:



2. Parsimony problem: fill out the nodes in the tree below using parsimony.
Calculate the length of the tree for this position and write it next to each tree.

Lab Exercises (Practice)


In this part of the ex­er­cise, you will learn how to use a phy­lo­ge­netic anal­y­sis
web­site.

Phylogenetic Analysis Tutorial Link


Link: http://kelleybioinfo.org/algorithms/tutorial/TPhy1.pdf

Sample and lab ex­er­cise data (sam­ple data 1):


http://kelleybioinfo.org/algorithms/data/DPhy1.txt

Sample and lab ex­er­cise data (sam­ple data 2):


http://kelleybioinfo.org/algorithms/data/DPhy2.txt

Lab Exercise
Using the DNA and pro­tein mul­ti­ple-sequence align­ments (sam­ple data 1 and
sam­ple data 2, re­spec­tively), per­form the phy­lo­ge­netic an­a­ly­ses us­ing the on­line
phy­lo­ge­netic soft­ware de­scribed in the tu­to­ri­al.

1. Using the sam­ple data spec­i­fied be­low, draw/show the fol­low­ing re­sults.

NOTES:
• In the Workflow Settings, uncheck the box­es for Multiple Alignment and
Alignment curation. The se­quences are al­ready aligned and cu­rated. Also,
leave the Visualization box checked and choose TreeDyn if not al­ready
checked, and the workflow should be run “all at once.”
• Choose pro­tein or DNA/RNA on the Input Data page as ap­pro­pri­ate.
• Leave the rest of the de­fault set­tings on the Input Data page.

a. Sample data 1, NJ tree (FastDist + Neighbor). Draw/show the tree be­low.

b. Sample data 1, par­si­mony (TNT). Draw/show the tree be­low.

2. Analyze sam­ple data 2 us­ing the NJ ap­proach (FastDist + Neighbor). Draw the
top four branches of the tree or paste the tree be­low.

3. Perform a boot­strap anal­y­sis with 100 resamplings of the data us­ing sam­ple
data 1 with the NJ cri­te­rion se­lected (FastDist + Neighbor). Draw/show the
re­sults be­low. Circle the branch with the high­est boot­strap sup­port.

Notes
1. DNA and other mol­e­cules are most of­ten used to de­ter­mine the re­la­tion­ships among or­gan­
isms. However, some­times re­search­ers are in­ter­ested in the evo­lu­tion of the mol­e­cules
them­selves (e.g., gene fam­ily ex­pan­sion, the evo­lu­tion of drug re­sis­tance in HIV, and RNA
struc­tural chang­es).
2. Information from mul­ti­ple-sequence align­ments of dif­fer­ent gene se­quences within or­gan­
isms can also be com­bined to in­crease in­for­ma­tion and phy­lo­ge­netic ac­cu­ra­cy.
3. See mul­ti­ple-sequence align­ment in Chap­ter 03, Ac­tiv­ity 3.1.
4. Parsimony is also known as Oc­cam’s (or Ock­ham’s) ra­zor, named af­ter Wil­liam of Ock­ham, an
En­glish Fran­cis­can Friar (c. 1287–1347) from the town of Ock­ham. The town is also known
for open­ing its doors to Wil­liam and Ellen Craft, slaves who es­caped the United States af­ter
the pas­sage of the bar­baric Fugitive Slave Act of 1850 and who be­came im­por­tant fig­ures in
the ab­o­li­tion­ist move­ment. And it has a cool mill. They are very proud.
5. Distances can be cal­cu­lated be­tween pairs of aligned DNA/RNA se­quences, pro­tein se­
quences, and even or­gan­ism char­ac­ter­is­tics like num­ber of toes, hair color, be­hav­iors, and
other things.
6. Synapomorphy is de­rived from the Greek: “syn-” means “shared,” and “-morph” means “shape.”
7. In case you’ve for­got­ten (or never knew) how sets work, here is a short primer. The in­ter­sec­
tion of two sets is sym­bol­ized by ∩ and is equal to the set of what is shared be­tween the
two sets. The union of two sets is symbolized by ∪ and is equal to the set of all objects found in
either set. For example, if A = {1, 3} (set A includes the numbers 1 and 3) and if B = {3, 4}, then
A ∩ B = {3} because the number 3 is all they have in common, and A ∪ B = {1, 3, 4}, which is
all the numbers that appear in either set. And finally, there is the concept of the empty set, symbolized
by {} or ∅, a set with noth­ing in­side it. For ex­am­ple, if A = {1, 2} and B = {3, 4}, then A ∩ B = ∅
(the empty set) be­cause the two sets share noth­ing in com­mon.
8. This ex­am­ple uses DNA nu­cle­o­tides, but the data could eas­ily be amino ac­ids or even phys­i­
cal (mor­pho­log­i­cal) char­ac­ter­is­tics of or­gan­isms. Phylogeneticists re­fer to these gen­er­ally as
char­ac­ters and the num­ber of changes as steps.
CHAPTER 07
PROBABILITY: ALL MUTATIONS ARE NOT EQUAL (-LY PROBABLE)

You have already encountered the use of probability in several algorithms in
this book, in­clud­ing the Chou-Fasman pro­pen­si­ties in Chap­ter 02 as well
as the se­quence logos and po­si­tion-spe­cific weight ma­tri­ces in Chap­ter 04.
In this chap­ter, we dis­cuss the con­cepts be­hind and gen­er­a­tion of prob­a­bil­
ity ma­tri­ces that are used in many bioinformatics al­go­rithms. Most of the
chap­ter and ex­er­cises fo­cus on how to de­ter­mine the prob­a­bil­ity (like­li­hood1)
of amino ac­ids mu­tat­ing to other amino ac­ids. These amino acid sub­sti­tu­tion
ma­tri­ces—​Point Ac­cepted Mutation (PAM) and BLOcks SUb­sti­tu­tion Ma­trix
(BLOSUM)—​dra­mat­i­cally im­prove the per­for­mance of dif­fer­ent pro­tein align­ment
al­go­rithms. We also briefly dis­cuss the ad­vanced con­cept of hid­den Mar­kov
mod­els (HMMs), a pow­er­ful means for mak­ing prob­a­bil­ity ma­tri­ces “on the fly.”
HMMs are used in many bioinformatics ap­pli­ca­tions, in­clud­ing pre­dict­ing ge­no­mic
re­peat re­gions, trans­mem­brane pro­teins, and pro­tein-coding re­gions in ge­nomes
and clus­ter­ing dis­tantly re­lated pro­tein se­quences into fam­i­lies.

Protein (Amino Acid) Substitution Matrices


One early re­al­i­za­tion made while an­a­lyz­ing pro­tein mul­ti­ple-sequence align­ments
was that not all amino acid mutations occur with the same frequency. You should
have fa­mil­iar­ity with amino acid sub­sti­tu­tion ma­tri­ces from work­ing with them in the
pro­tein Needleman-Wunsch anal­y­sis in Ac­tiv­ity 3.1 (see PAM250 and BLOSUM62
links in Fig. 7.1).
Instead of us­ing fixed match and mis­match val­ues such as those used in DNA
se­quence align­ments, align­ments of pro­tein se­quences use a ma­trix of log odds2
scores for matches and mis­matches. The ra­dio but­ton on the web­site’s in­ter­ac­
tive mod­ule is set by de­fault to the PAM250 ma­trix, but one can also use the
BLOSUM62 ma­trix. Both PAM and BLOSUM ma­tri­ces con­tain log odds scores
for ev­ery amino acid chang­ing to ev­ery other amino acid, as well as not chang­ing
(Fig. 7.2). Note that the sub­sti­tu­tion ma­tri­ces are sym­met­ri­cal. For ex­am­ple, the
log odds score of a change of arginine (R) to serine (S) is the same as for S to R.


FIGURE 7.1. Screenshot of the Needleman-Wunsch interactive module for teaching
protein sequence alignments using amino acid substitution matrices. By default, the
traceback calculations use the PAM250 substitution matrix.

The BLOSUM ma­tri­ces also con­tain log odds scores, but they are cal­cu­lated
in a dif­fer­ent way, as you will learn later in the chap­ter.
These sub­sti­tu­tion ma­tri­ces have proven very help­ful for im­prov­ing the per­for­
mance of many al­go­rithms, par­tic­u­larly se­quence align­ment meth­ods such as
dy­namic pro­gram­ming and BLAST. The rea­son they are so use­ful in se­quence
align­ment is that the scores re­ally help dif­fer­en­ti­ate among pos­si­ble align­ments.
For in­stance, be­low are two small se­quence align­ments of the same query (QY)
se­quence to two dif­fer­ent sub­ject se­quences (S1 and S2).

QUERY SEQ : H L R W S
SUBJECT 1 : H L S E S
SUBJECT 2 : M L S W S

Alignment 1 Alignment 2
QY: H L R W S QY: H L R W S
S1: H L S E S S2: M L S W S

Using the PAM250 matrix in Fig. 7.2, one can score both by adding up the log odds
scores for all the alignment positions. For instance, in alignment 2 the score
for an H (histidine)-to-M (methionine) change is −2, that for an L-to-L change is +6,
and so forth.
In this case, even though both align­ments have two mis­matches, the PAM
scor­ing sys­tem tells us that align­ment 2 is a bet­ter align­ment be­cause it has a
higher over­all score.

Alignment 1                          Alignment 2
QY:  H   L   R   W   S               QY:  H   L   R   W   S
    +6  +6  +0  −7  +3   = +8            −2  +6  +0  +17  +3   = +24
S1:  H   L   S   E   S               S2:  M   L   S   W   S
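
The same arithmetic can be written as a short script. This is a minimal sketch: the dictionary holds only the scores quoted in the example above rather than the full 20 × 20 PAM250 matrix, and the names `PAM250_SUBSET`, `pair_score`, and `alignment_score` are ours.

```python
# Log odds scores quoted in the example above for the PAM250 matrix.
# Only the pairs needed here are listed; a real application
# would load the full 20 x 20 matrix.
PAM250_SUBSET = {
    ("H", "H"): 6, ("L", "L"): 6, ("R", "S"): 0, ("W", "E"): -7,
    ("S", "S"): 3, ("H", "M"): -2, ("W", "W"): 17,
}

def pair_score(a, b, matrix):
    """Look up a score; the matrix is symmetrical, so try both orders."""
    if (a, b) in matrix:
        return matrix[(a, b)]
    return matrix[(b, a)]

def alignment_score(query, subject, matrix):
    """Sum the log odds scores over all aligned positions (no gaps)."""
    return sum(pair_score(q, s, matrix) for q, s in zip(query, subject))

print(alignment_score("HLRWS", "HLSES", PAM250_SUBSET))  # alignment 1
print(alignment_score("HLRWS", "MLSWS", PAM250_SUBSET))  # alignment 2
```

The two totals are 8 and 24, matching the sums worked out by hand.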

FIGURE 7.2. The PAM250 sub­sti­tu­tion ma­trix. This ma­trix shows the log odds scores
of every amino acid changing to every other amino acid, or not changing at all. Positive
num­bers mean that the like­li­hood is higher than ex­pected by chance, 0 means the same
as chance, and neg­a­tive means less likely than chance. For ex­am­ple, al­a­nine not mu­tat­ing
(A to A on the ta­ble, top left red box) is more likely (+2) than al­a­nine mu­tat­ing to cys­te­ine
(A to C, −2, fifth-row red box). All the val­ues along the di­ag­o­nal are for the amino acid NOT
chang­ing. Typically, the log odds of no change are the high­est val­ues, but some sub­sti­tu­
tions are very com­mon. For in­stance, a change from an iso­leu­cine to a leu­cine (+2) is just
as likely as an al­a­nine not mu­tat­ing.

This dramatic difference between the alignment scores (remember this is a
log scale) is because W (tryptophan) has a very high score (+17) for not changing.
In align­ment 1, a change from E (glu­tamic acid) to W is very un­likely (−7) ac­cord­
ing to the PAM250 ma­trix. This is a 24-point dif­fer­ence on a log­a­rith­mic scale for
the align­ment of just one amino acid (tryptophan is weird). It should be noted that
while the BLOSUM62 ma­trix has dif­fer­ent val­ues, the trend is still the same (i.e.,
W to W is +11) even though the method for de­ter­min­ing the BLOSUM val­ues is
sig­nif­i­cantly dif­fer­ent (see Ac­tiv­ity 7.1).

What Determines Substitution Bias?


In the PAM250 substitution matrix, there is a high likelihood of an isoleucine (I)-to-leucine (L)
change and vice versa (+2; Fig. 7.2). On the other hand, there is a low
likelihood (−3) of isoleucine being changed to glycine (G). Why is this? There are
two fundamental reasons, which are illustrated in Fig. 7.3. First, isoleucine and
leucine are “chemical cousins.” These amino acids are similarly sized and have
similar biochemical properties; namely, they are both hydrophobic (Fig. 7.3, left
side, aliphatic grouping). A change of I to L in any given protein will usually have a
very modest effect on the protein’s function. However, isoleucine and glycine are
very different biochemically, and a substitution of one for the other could spell
disaster. See how in Fig. 7.3, left side, I and G are not in any shared groupings?
If such a substitution were to occur in a critical cellular protein, it could eliminate
an organism’s ability to function or reproduce (Darwin says, "Goodbye!").

FIGURE 7.3. Venn diagram of amino acid properties (left) and the genetic code (right).
Amino acids within circles of the Venn diagram are considered more biochemically similar.

Second, the like­li­hood of a sub­sti­tu­tion is also de­pen­dent upon how many nu­
cle­o­tide changes (mu­ta­tions) need to oc­cur in the un­der­ly­ing pro­tein-coding DNA.
For ex­am­ple, the ge­netic code ta­ble shows that only one nu­cle­o­tide in the first
po­si­tion of the co­don needs to change (AUU to CUU) to cause an iso­leu­cine-to-
leucine mu­ta­tion (Fig. 7.3, right side). On the other hand, an iso­leu­cine-to-glycine
mu­ta­tion re­quires two in­de­pen­dent mu­ta­tions in the co­don (AUU to GGU). Since
the prob­a­bil­ity that two in­de­pen­dent events will oc­cur is the prod­uct of each in­de­
pen­dent prob­a­bil­i­ty,3 the dif­fer­ences in co­don se­quences have a con­sid­er­able
im­pact on the like­li­hood of amino acid mu­ta­tions.
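
The effect of codon differences can be illustrated with a small calculation. This is a rough sketch only: the per-nucleotide mutation probability `p` is a hypothetical value chosen for illustration, and the calculation ignores which nucleotide replaces which. The function name `codon_differences` is ours.

```python
def codon_differences(codon_a, codon_b):
    """Number of nucleotide positions at which two codons differ."""
    return sum(x != y for x, y in zip(codon_a, codon_b))

# Hypothetical per-nucleotide mutation probability, for illustration only.
p = 0.001

for label, start, end in [("Ile -> Leu", "AUU", "CUU"),
                          ("Ile -> Gly", "AUU", "GGU")]:
    n = codon_differences(start, end)
    # Independent events multiply, so n required changes scale as p ** n.
    print(f"{label}: {n} change(s), relative probability about {p ** n:g}")
```

Under these assumptions, the two-mutation change from isoleucine to glycine is about a thousand times less probable than the one-mutation change to leucine, before biochemistry is even considered.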
Creating a prob­a­bil­ity model that takes into ac­count both the bio­chem­i­cal
prop­er­ties of the amino acid AND the like­li­hood of mu­ta­tions at the DNA level
would be dif­fi­cult, to say the least. Instead of do­ing this, a re­searcher named Mar­
ga­ret Dayhoff tried an­other ap­proach: count­ing the dif­fer­ent amino acid sub­sti­tu­
tions that oc­cur in na­ture.

PAM and BLOSUM


Dayhoff’s PAM amino acid sub­sti­tu­tion ma­trix and Henikoff and Henikoff’s
BLOSUM substitution matrix both estimate rates of substitutions using protein
multiple-sequence alignments. Rates of substitution are estimated by counting
the times amino acids change to other amino acids and asking
whether this is more likely or less likely than expected by chance. Finally, the
ra­tio of these val­ues is used to de­ter­mine the log odds scores. Activity 7.1 ex­
plains the dif­fer­ences be­tween PAM and BLOSUM and the steps of cal­cu­lat­
ing these ma­tri­ces.
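
The general shape of a log odds score can be sketched before working through Activity 7.1. The function below takes the ratio of an observed substitution frequency to the frequency expected by chance, takes the logarithm, and rounds to a whole number. The scaling factor of 10, the base-10 logarithm, the function name `log_odds`, and the example frequencies are assumptions made for illustration; the activity describes the actual PAM and BLOSUM procedures.

```python
import math

def log_odds(observed, expected, scale=10, base=10):
    """Scaled, rounded log odds score: scale * log_base(observed / expected)."""
    return round(scale * math.log(observed / expected, base))

# Hypothetical frequencies, chosen only to show the sign of the score.
print(log_odds(0.02, 0.01))    # twice as often as chance -> positive
print(log_odds(0.01, 0.01))    # same as chance           -> 0
print(log_odds(0.005, 0.01))   # half as often as chance  -> negative
```

This mirrors the way the matrix is read: positive scores mark substitutions seen more often than expected by chance, 0 marks chance, and negative scores mark substitutions seen less often.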
Multiple dif­fer­ent sub­sti­tu­tion ta­bles have been cre­ated us­ing both the PAM and
BLOSUM ap­proaches. The choice of us­ing a PAM or BLOSUM sub­sti­tu­tion ma­trix
for BLAST or other ap­pli­ca­tions should ide­ally be based on the over­all dis­sim­i­lar­ity
of the se­quences be­ing com­pared or aligned. The greater the dis­sim­i­lar­ity, the
higher the PAM sub­sti­tu­tion ma­trix num­ber that should be used. A PAM1 ma­trix
might be more ap­pro­pri­ate for very closely re­lated se­quences, while a PAM250
ma­trix may be more ap­pro­pri­ate for highly di­ver­gent se­quences. The num­bers
in­di­cate the ex­pected rate of mu­ta­tion among se­quences per 100 amino ac­ids.

• PAM1: 1% mutation rate (1 mutation per 100 amino acid positions)
• PAM50: 50% mutation rate (50 mutations per 100 positions)
• PAM250: 250 mutations per 100 positions (positions mutate more than one time)

Like PAM, BLOSUM num­bers also in­di­cate di­ver­gence lev­els of the se­quences
used to make the ma­tri­ces, but the val­ues go in the op­po­site di­rec­tion: lower num­
bers are used for more di­ver­gent se­quences, higher num­bers for less di­ver­gent
sequences. A BLOSUM62 matrix was generated from alignment blocks in which sequences
sharing 62% or more identity were clustered together, while a BLOSUM90 matrix used a 90%
clustering threshold. Approximate equivalences between the two families of matrices
are as fol­lows:

PAM BLOSUM
PAM100 BLOSUM90
PAM120 BLOSUM80
PAM160 BLOSUM60
PAM200 BLOSUM52
PAM250 BLOSUM45

When choos­ing what ma­trix to use, it is best to pick the ma­trix that fits the gen­
eral di­ver­gence of the se­quences be­ing aligned. In prac­tice, the choice of ma­trix
does not make a par­tic­u­larly big dif­fer­ence in alignments.

Hidden Mar­kov Models


The gen­er­al­ized prob­a­bi­lis­tic ap­proach known as HMMs has been broadly ap­plied
in bioinformatics. Outside the fields of engineering and computer sciences,
HMMs found ap­pli­ca­tion in the field of speech rec­og­ni­tion in the 1970s. Shortly
there­af­ter, early bioinformaticians ap­plied them to the anal­y­sis of bi­o­log­i­cal se­
quence data, es­pe­cially pro­tein se­quences, though they have also found a num­
ber of other ap­pli­ca­tions. While the math­e­mat­ics be­hind HMMs are be­yond the
scope of this book, it is im­por­tant to have an ap­pre­ci­a­tion for HMMs and how
they are used in bioinformatics.
In Ac­tiv­ity 2.1 you gained some fa­mil­iar­ity with the TMHMM server, an HMM
used to pre­dict the lo­ca­tion of trans­mem­brane do­mains in pro­tein se­quences.
HMMs have been used in bioinformatics to predict novel genes in genomes,
perform multiple-sequence alignments, and fold RNA structures.4 The genius of
HMMs is their abil­ity to find a se­quen­tial pat­tern given a suf­fi­cient set of known
ex­am­ples. For in­stance, pro­tein se­quences of known trans­mem­brane do­mains
pro­vide the in­put for cre­at­ing an HMM for de­tect­ing transmembrane domains
in protein sequences of unknown function.
An HMM can be defined as “a statistical Markov model in which the system being
modeled is assumed to be a Markov chain with unobserved (hidden) states.”5
Nice. But what is a Mar­kov chain, and what is meant by “hid­den states”? A Mar­
kov chain is a sta­tis­ti­cal model that says that the prob­a­bil­ity of the next item in a
se­quence de­pends only on the state of the pre­vi­ous item in the se­quence. For
in­stance, one could have a Mar­kov chain for pre­dict­ing the weather. In a Mar­kov
chain, the chance it will be sunny to­mor­row would de­pend on the weather the
pre­vi­ous day (rainy or sunny). In the trans­mem­brane do­main ex­am­ple, this might
mean that the prob­a­bil­ity of an amino acid in a se­quence (Y, for ex­am­ple) be­ing
part of a trans­mem­brane do­main is de­pen­dent on the pre­vi­ous amino acid in the
sequence (M, for example). In a Markov chain, one uses a transition matrix, which
has the probabilities of all the possible transitions from one state to another,
much like the PAM and BLOSUM matrices.
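To make the weather example concrete, here is a minimal Python sketch of a two-state Markov chain. The transition probabilities are invented for illustration (they are assumptions, not values from this chapter), and the function names are hypothetical.

```python
import random

# Hypothetical transition matrix: P(tomorrow's weather | today's weather).
# The numbers are illustrative assumptions, not measured values.
TRANSITIONS = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}

def next_state(today, rng):
    """Draw tomorrow's weather using only today's weather."""
    states = list(TRANSITIONS[today])
    weights = [TRANSITIONS[today][s] for s in states]
    return rng.choices(states, weights=weights, k=1)[0]

def simulate(start, days, seed=0):
    """Return a list of `days` states, beginning with `start`."""
    rng = random.Random(seed)
    chain = [start]
    for _ in range(days - 1):
        chain.append(next_state(chain[-1], rng))
    return chain

def path_probability(path):
    """Probability of a sequence of states, given its first state."""
    prob = 1.0
    for today, tomorrow in zip(path, path[1:]):
        prob *= TRANSITIONS[today][tomorrow]
    return prob

week = simulate("sunny", 7)
p = path_probability(["sunny", "sunny", "rainy"])  # 0.8 * 0.2
print(week, p)
```

Each row of the transition table sums to 1 because, whatever today's weather is, tomorrow must be either sunny or rainy.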
If one knows all the transition probabilities for, say, the amino acids in a trans­
membrane domain, the problem is easy. However, since these probabilities are
unknown they are considered hidden probabilities, and these hidden transitions
must be inferred from real existing data. To determine an HMM, one must have a
lot of known data of a par­tic­u­lar type. This could in­clude known trans­mem­brane
do­main se­quences, groups of in­tron se­quences (for in­tron pre­dic­tion), var­i­ous
al­pha he­li­ces, etc. The HMM pro­ce­dures “train” on spe­cific data sets and pro­duce
an HMM spe­cific for de­tect­ing new var­i­ants of the same type. In the lab ac­tiv­ity,
you will be us­ing the Pfam HMM-based da­ta­base to group a mys­tery se­quence
into a pro­tein fam­i­ly.
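The counting idea behind "training" can be sketched in a few lines of Python. This is a deliberate simplification: it estimates the transition probabilities of a plain Markov chain from example sequences, whereas real HMM software must also infer the hidden states. The training sequences and the function name are made up for illustration.

```python
from collections import Counter, defaultdict

def estimate_transitions(sequences):
    """Estimate P(next residue | current residue) by counting adjacent pairs."""
    pair_counts = defaultdict(Counter)
    for seq in sequences:
        for current, following in zip(seq, seq[1:]):
            pair_counts[current][following] += 1
    probs = {}
    for current, counts in pair_counts.items():
        total = sum(counts.values())
        probs[current] = {res: n / total for res, n in counts.items()}
    return probs

# Made-up "known examples", used only to show the counting step.
training = ["LLVLA", "LVLLA", "ALLVL"]
model = estimate_transitions(training)
print(model["L"])  # estimated probabilities of what follows a leucine
```

The more known examples that go into the counts, the more reliable the estimated probabilities become, which is why a large training set matters.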

Notes
1. In com­mon speech, prob­a­bil­ity and like­li­hood are used in­ter­chan­ge­ably, though there are
some dif­fer­ences—at least to sta­tis­tics nerds. Both terms are used in this chap­ter, though
like­li­hood is the more ap­pro­pri­ate term for the chance that one amino acid mu­tates into an­
other amino ac­id.
2. The log odds score is the log­a­rithm of the odds ra­tio. In the case of amino acid sub­sti­tu­tion
ma­tri­ces, the odds ra­tio is the ra­tio of the ob­served like­li­hood of a sub­sti­tu­tion to the like­li­
hood ex­pected by chance.
3. It has been estimated that the chance of you getting caught in a tornado is 1 × 10^-6, while you
have a 1 × 10^-7 chance of being bitten by a shark. So, the expected probability of you being
in a tornado AND being bitten by a shark is (1 × 10^-6) × (1 × 10^-7) = 1 × 10^-13. Unless you get
caught in a sharknado. Then the probability increases to 1 × 10^-1.
4. For a re­view, see Yoon B-J. 2009. Hidden Mar­kov mod­els and their ap­pli­ca­tions in bi­o­log­i­cal
se­quence anal­y­sis. Curr Genomics 10:402–415.
5. Black EF, Marini L, Vaidya A, Berman D, Willman M, Salomon D, Bartholomew A, Kenyon N,
McHenry K. 2014. Using hidden Markov models to determine changes in subject
data over time, studying the immunoregulatory effect of mesenchymal stem cells. Proc IEEE
Int Conf Escience 1:83–91.
ACTIVITY 7.1 GENERATING PAM AND BLOSUM SUBSTITUTION MATRICES

Motivation
Protein se­quence match­ing (e.g., BLAST), mul­ti­ple-sequence align­ments, and phy­lo­ge­netic an­a­
ly­ses have long used sub­sti­tu­tion ma­tri­ces to de­ter­mine the best align­ments or the best trees.
Substitution ma­tri­ces pro­vide scores, called log odds scores, in­di­cat­ing the like­li­hood of each of
the 20 most commonly occurring amino acids mutating into all the other amino acids or not
mutating at all. The two approaches taught in this chapter for creating substitution matrices are
PAM (Point Accepted Mutation) and BLOSUM (BLOcks SUbstitution Matrix). Both approaches
use ob­served pat­terns of amino acid sub­sti­tu­tions to gen­er­ate sub­sti­tu­tion ma­tri­ces. PAM de­ter­
mines these prob­a­bil­i­ties us­ing phy­lo­ge­netic trees, while BLOSUM ba­ses the prob­a­bil­i­ties on
con­served blocks of aligned se­quences. Both meth­ods also cal­cu­late how of­ten the mu­ta­tions
are ex­pected to oc­cur by chance based on the fre­quency of the amino acids. The scores are logs
of the ra­tios of the ob­served prob­a­bil­i­ties to the ex­pected prob­a­bil­i­ties.
This ac­tiv­ity will teach the prin­ci­ples of two dif­fer­ent ap­proaches for gen­er­at­ing amino acid
sub­sti­tu­tion ma­tri­ces, as well as how to cal­cu­late the log odds scores. The lab ex­er­cises will
show how these meth­ods are used in pro­tein BLAST an­a­ly­ses and in­tro­duce the Pfam (pro­tein
fam­ily) da­ta­base, which uses the more so­phis­ti­cated HMM ap­proach to clus­ter groups of func­
tion­ally re­lated pro­teins.

Learning Objectives
1. Learn the bi­o­log­i­cal prin­ci­ples be­hind sub­sti­tu­tion ma­tri­ces and how prob­a­bi­lis­tic ap­proaches
can be used in bioinformatics (Motivation).
2. Use phy­lo­ge­netic trees to es­ti­mate amino acid sub­sti­tu­tion rates and gen­er­ate PAM-like sub­
sti­tu­tion ma­tri­ces (Concepts and Exercises).
3. Use con­served blocks within pro­tein se­quence align­ments to es­ti­mate amino acid sub­sti­tu­
tion rates and gen­er­ate BLOSUM-like sub­sti­tu­tion ma­tri­ces (Concepts and Exercises).
4. Learn how PAM and BLOSUM scores are used in pro­tein BLAST anal­y­sis (Concepts and
Exercises).
5. Gain fa­mil­iar­ity with the HMM-based Pfam da­ta­base (Concepts and Exercises).

Concepts
To bet­ter un­der­stand the bi­o­log­i­cal prin­ci­ples be­hind amino acid sub­sti­tu­tion ma­tri­ces, try the
fol­low­ing ex­er­cise. On the next page is a Venn di­a­gram il­lus­trat­ing the bio­chem­i­cal sim­i­lar­i­ties
among the 20 most com­mon amino ac­ids. The curved lines en­com­pass let­ter des­ig­na­tions for
amino acids with similar biochemical properties, which are labeled outside the diagram. For
example, isoleucine (I), valine (V), and leucine (L) are all aliphatic amino acids that also belong to a
larger set of hydrophobic amino acids.
The protein sequence alignments below are of the same length and have the same number
of identities (4 identical matches out of 9, i.e., 4/9). Use the information in the Venn diagram to
determine which of the two matches is biochemically more likely.

Match 1
Query: F G Q V I P A K R
Subjt: F A N V M P A R E

Match 2
Query: F G Q V I P A K R
Subjt: F S C V F P A F V

Reflection
• Are some mis­matches more use­ful than oth­ers for de­ter­min­ing the bet­ter match? Why or
why not?
• Based on the Venn di­a­gram and your un­der­stand­ing of pro­teins, would an E-to-M mu­ta­tion
be more or less likely than an E-to-N mu­ta­tion? Why?
• Can you use the PAM log odds ma­trix be­low to score the two align­ments? The scores are
ad­di­tive and the more pos­i­tive, the more likely the sub­sti­tu­tion (or no sub­sti­tu­tion). For
ex­am­ple, a change from N to N is +2, while a change from N to Y or Y to N is −2.

Match 1
Query: F G Q V I P A K R
Subjt: F A N V M P A R E
SCORE:

Match 2
Query: F G Q V I P A K R
Subjt: F S C V F P A F V
SCORE:

Match 1 is the better match. The matching amino acids are the same between match 1 and
match 2, but the 5 mismatches (marked with asterisks below) in both alignments are very different.
Based on the Venn diagram, the mutations leading to the differences in match 1 seem more plausible.
K to R substitutes one positive amino acid for another, while K to F substitutes a hydrophilic for a
hydrophobic amino acid. The PAM250 matrix score also supports this assessment.

Match 1
Query: F G Q V I P A K R
Subjt: F A N V M P A R E
Diffs:   * *   *     * *

Match 2
Query: F G Q V I P A K R
Subjt: F S C V F P A F V
Diffs:   * *   *     * *
Match 1
Query: F G Q V I P A K R
Subjt: F A N V M P A R E
SCORE: 9+1+1+4+2+6+2+3-1=27

Match 2
Query: F G Q V I P A K R
Subjt: F S C V F P A F V
SCORE: 9+1-5+4+1+6+2-5-2=11
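The additive scoring above is easy to reproduce in Python. The sketch below is only an illustration: the dictionary holds just the pair scores that appear in the two worked answers (not the full 20 × 20 PAM250 matrix), and the function names are hypothetical.

```python
# Pair scores copied from the worked answers above; this is not the
# full PAM250 matrix, only the residue pairs found in the two matches.
PAIR_SCORES = {
    ("F", "F"): 9, ("G", "A"): 1, ("Q", "N"): 1, ("V", "V"): 4,
    ("I", "M"): 2, ("P", "P"): 6, ("A", "A"): 2, ("K", "R"): 3,
    ("R", "E"): -1, ("G", "S"): 1, ("Q", "C"): -5, ("I", "F"): 1,
    ("K", "F"): -5, ("R", "V"): -2,
}

def pair_score(a, b):
    """Look up a symmetric substitution score for residues a and b."""
    if (a, b) in PAIR_SCORES:
        return PAIR_SCORES[(a, b)]
    return PAIR_SCORES[(b, a)]

def alignment_score(query, subject):
    """Sum the per-column scores of an ungapped alignment."""
    if len(query) != len(subject):
        raise ValueError("sequences must be the same length")
    return sum(pair_score(q, s) for q, s in zip(query, subject))

query = "FGQVIPAKR"
match1 = alignment_score(query, "FANVMPARE")
match2 = alignment_score(query, "FSCVFPAFV")
print(match1, match2)
```

Both matches share the same four identities, so the difference between the two totals comes entirely from how the mismatched columns are scored.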

Sweet Lou
To cal­cu­late the like­li­hood of amino acid sub­sti­tu­tions, Dayhoff, the in­ven­tor of the PAM ma­trix,
came up with a clever idea. Instead of de­riv­ing a com­plex for­mula that in­cluded bio­chem­i­cal
properties and the genetic code, why not simply count how many times each amino acid
mutated to every other amino acid using available protein sequence alignments? Dayhoff also
calculated the number of times that each amino acid did not change at all.
To un­der­stand how count­ing can be used to cal­cu­late the fu­ture like­li­hood of cer­tain events,
con­sider an anal­ogy from the sport of base­ball. In base­ball, one of the most dif­fi­cult things to
do is to hit the ball with the bat.1 The best hitters manage to hit the ball effectively only 3 out
of every 10 attempts and are even less likely to hit the best of all outcomes, a home run.
Figure 7.1.1 shows the hitting statistics of the Detroit Tigers legend Lou Whitaker (nicknamed
Sweet Lou). One way to determine how likely it would be for Sweet Lou to hit a home run
would be to count how many times he actually hit a home run and compare that number to the
number of times he did something else, including striking out, getting to first base, and
walking. Based on his career stats, Sweet Lou hit a home run 3 out of every 100 times he tried to
hit the ball.
To ex­tend the base­ball anal­ogy a bit fur­ther, let’s cal­cu­late a log odds score for Sweet Lou hit­
ting a home run. We have the ob­served like­li­hood (0.03), but what is the ex­pected like­li­hood? In
this case, the ex­pected like­li­hood might be the like­li­hood of a typ­i­cal pro­fes­sional base­ball player
hitting a home run. If the typical player hits a home run 3 times out of every 1,000 times trying,
then the expected likelihood is 0.003. If we calculate the log odds score using the base 10
logarithm,2 then Lou Whitaker’s log odds score of hitting a home run ($S_{HR}$) would be

$$S_{HR} = \log_{10}\left(\frac{0.03}{0.003}\right) = 1.0$$

meaning that Lou Whitaker is 10 times more likely to hit a home run than the typical professional
player. If the value were −1.0, he would be 10 times less likely, and 0 means he would hit home
runs at the typical rate [$\log_{10}(1) = 0$].
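The same arithmetic can be checked with a short Python sketch. The function name is hypothetical; the numbers are the rounded likelihoods used in the text, plus the career totals quoted in the Fig. 7.1.1 caption.

```python
import math

def log_odds(observed, expected, base=10):
    """Log (in the given base) of observed likelihood over expected likelihood."""
    return math.log(observed / expected) / math.log(base)

# Career totals quoted in the Fig. 7.1.1 caption: 244 home runs, 8,570 at bats.
observed_exact = 244 / 8570          # about 0.028, rounded to 0.03 in the text

s_hr = log_odds(0.03, 0.003)         # Sweet Lou versus the typical player
s_typical = log_odds(0.003, 0.003)   # a typical player scores 0
s_worse = log_odds(0.0003, 0.003)    # a player 10 times less likely than typical
print(round(s_hr, 3), s_typical, round(s_worse, 3))
```

Changing the `base` argument to 2 or to `math.e` rescales the scores but does not change their sign, which is why the choice of base matters little for interpretation.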
Similar to the baseball analogy, we could use protein sequence data and counting to determine
the likelihood of a leucine (L) mutating to a tryptophan (W) or to a histidine (H) or not
mutating at all. Dayhoff realized that she could look for mutations using protein multiple-sequence
alignments. Both the Dayhoff PAM method and the BLOSUM method use counting to determine
observed likelihoods, though the counting is done in a different manner for each method. In
addition to the observed likelihoods, both methods also calculate an expected likelihood, which
is the likelihood that this would have happened by random chance just based on the frequencies of
the amino acids. The final score that is calculated for both methods is the log odds score, which
is the log of the ratio of the observed likelihood to the expected likelihood. A positive log odds
score indicates that the mutation is more likely than chance, while a negative score indicates
that it is less likely than chance.

FIGURE 7.1.1. Using existing data to determine likelihoods: a baseball analogy. The
figure shows 4 seasons of hitting statistics for Lou Whitaker (Sweet Lou), one of the
finest second basemen ever to play the game. Using the data at the bottom left, we
could calculate the likelihood that Lou would hit a home run (the HR column) the next
time he came up to bat. To do this, we could count the total number of home runs (HR)
Lou hit and divide that number by the total number of times he did anything else at bat,
including striking out (SO), walking (BB), or getting a regular hit. In his career, Sweet Lou
hit 244 home runs in 8,570 times at bat (AB), for a likelihood of about 0.03 (hit a home run 3
times out of 100 tries). The same type of calculation could be done to see how likely he
was to get to first base, walk, or strike out. Photo courtesy of Aaron Caldwell, under
license CC BY-2.0.

Calculating a PAM Matrix


The first step in cal­cu­lat­ing ei­ther a PAM or BLOSUM sub­sti­tu­tion ma­trix is to count the num­ber
of ob­served amino acid sub­sti­tu­tions and the num­ber of times amino ac­ids did not change.
To de­ter­mine what sub­sti­tu­tions oc­curred, Dayhoff con­structed phy­lo­ge­netic trees us­ing a
max­i­mum-parsimony ap­proach with the pro­tein mul­ti­ple-sequence align­ments that ex­isted
at the time.
The first part of the approach is to calculate the mutation probability $M_{i,j}$, which is the
likelihood that amino acid i will mutate into amino acid j. Figure 7.1.2 illustrates how a phylogenetic
tree can be used to count amino acid mutation patterns for an alanine (A) to a cysteine (C), which
can then be used to calculate $M_{A,C}$.
FIGURE 7.1.2. Counting alanine substitutions using a phylogenetic tree. (A) Phylogenetic
tree reconstruction of the relationship of 4 aligned sequences. The tree has been
inverted so that the root of the tree is at the top, with the 4 sequences from the alignment
at the bottom. The tree includes 3 reconstructed ancestral sequences, one
at the root (top) and two descendants (middle), giving rise to the 4 sequences (bottom). (B)
Counting the number of times in this data that alanine mutated to cysteine. (C) Counting the
number of times alanine remained unchanged. (D) Counting the number of times alanine
changed to another amino acid (in this case, leucine).
FIGURE 7.1.3. Observed amino acid substitution patterns for 10,000 events. The red
boxes indicate the mutations of A to C and C to A, which were combined to determine the
mutation rate. The PAM matrix also calculated the rate of not mutating. A did not change
9,867 times, making $M_{A,A}$ = 9,867/10,000 = 0.9867.

Figure 7.1.3 shows a part of the original substitution table used by Dayhoff to create the PAM
scoring matrix. The data set available in the late 1970s was very limited because few sequences
were available at the time for generating the observed and expected probabilities using the
phylogenetic counting approach like the one shown in Fig. 7.1.2. To make the calculations more
meaningful and easier to carry out, Dayhoff scaled the substitutions such that each amino acid
underwent 10,000 total events. The matrix was also made symmetrical, such that $M_{i,j} = M_{j,i}$, by
combining the counts of i to j and j to i and then dividing this over the total number of events (in
this case, 10,000). For example, in Fig. 7.1.3, there is 1 observed A-to-C mutation and 3 C-to-A
mutations, for a total of 4. Thus, $M_{A,C} = M_{C,A}$ = 4/10,000 = 0.0004.
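As a rough sketch of this counting step in Python, the snippet below combines the i-to-j and j-to-i counts and divides by the total number of events. Only the three counts quoted in the text are included, and the function name is hypothetical.

```python
TOTAL_EVENTS = 10_000  # Dayhoff's scaling: 10,000 total events per amino acid

def mutation_probability(counts, i, j, total=TOTAL_EVENTS):
    """Symmetrized M[i,j]: combine the i->j and j->i counts, divide by total."""
    if i == j:
        return counts[(i, i)] / total
    return (counts.get((i, j), 0) + counts.get((j, i), 0)) / total

# The counts quoted in the text for Fig. 7.1.3.
counts = {("A", "C"): 1, ("C", "A"): 3, ("A", "A"): 9867}

m_ac = mutation_probability(counts, "A", "C")
m_ca = mutation_probability(counts, "C", "A")
m_aa = mutation_probability(counts, "A", "A")
print(m_ac, m_ca, m_aa)
```

Because both directions are pooled before dividing, the function returns the same value for A to C and C to A, which is what makes the matrix symmetrical.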
Finally, to calculate the log odds score, one takes the natural log of the observed mutation
rate divided by the expected rate as

$$S_{i,j} = \log\frac{p_i \times M_{i,j}}{p_i \times p_j} = \log\frac{M_{i,j}}{p_j} = \log\frac{\text{observed frequency}}{\text{expected frequency}}$$

where $S_{i,j}$ is the log odds score for a substitution of amino acid i to j, $p_i$ and $p_j$ are the frequencies
of amino acids i and j, respectively, and $M_{i,j}$ is the mutation rate. These log odds scores were then
calculated for the transition of every amino acid to every other amino acid.
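A minimal Python sketch of this formula follows. The mutation probabilities are the ones worked out in the text, but the background frequencies are round placeholder numbers that I have assumed for illustration; they are not Dayhoff's published frequencies.

```python
import math

def pam_log_odds(m_ij, p_j):
    """S[i,j] = ln(M[i,j] / p[j]): observed rate over the chance expectation."""
    return math.log(m_ij / p_j)

# Hypothetical background frequencies (assumed values for illustration only).
p = {"A": 0.09, "C": 0.03}

s_aa = pam_log_odds(0.9867, p["A"])  # alanine staying alanine
s_ac = pam_log_odds(0.0004, p["C"])  # alanine changing to cysteine

# The unsimplified form of the formula gives the same score.
s_ac_full = math.log((p["A"] * 0.0004) / (p["A"] * p["C"]))
print(round(s_aa, 2), round(s_ac, 2))
```

With these assumed frequencies, not mutating scores positive (more likely than chance) and the A-to-C change scores negative (less likely than chance).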
The original PAM matrix was based on a data set that included 71 families of closely related
proteins. In order to account for substitution patterns among more distantly related proteins, Dayhoff
also introduced a scaling factor that projected the mutation rates for more distantly related
sequences. Different matrices were then created for higher levels of protein divergence. A PAM1
matrix assumed an average of 1 amino acid substitution per 100 amino acids, while a PAM250 matrix
assumed an average of 250 substitutions per 100 amino acids (each amino acid mutated multiple times).
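This projection is usually described as multiplying the PAM1 mutation probability matrix by itself, so that PAM250 corresponds to PAM1 raised to the 250th power. The sketch below shows the idea with a toy matrix for an imaginary three-letter alphabet; the matrix values and function names are assumptions for illustration, not Dayhoff's data.

```python
def mat_mul(a, b):
    """Multiply two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(m, n):
    """Multiply matrix m by itself so that it appears n times (n >= 1)."""
    result = m
    for _ in range(n - 1):
        result = mat_mul(result, m)
    return result

# Toy "PAM1" for three imaginary residues: each row sums to 1 and each
# residue has a 1% chance of changing in one step.
toy_pam1 = [
    [0.990, 0.005, 0.005],
    [0.005, 0.990, 0.005],
    [0.005, 0.005, 0.990],
]

toy_pam250 = mat_pow(toy_pam1, 250)
print([round(x, 3) for x in toy_pam250[0]])
```

After 250 steps the chance of a residue remaining itself has dropped well below 0.99, which reflects positions mutating more than once.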

Calculating a BLOSUM Matrix


As pre­vi­ously men­tioned, the log odds scores of the BLOSUM ma­trix are highly sim­i­lar in prin­ci­
ple to those of the PAM ma­trix. The scores are also based on the log ra­tio of the ob­served mu­
ta­tion rate to the ex­pected mu­ta­tion rate. However, in­stead of build­ing a phy­lo­ge­netic tree, the
cre­a­tors of the BLOSUM ma­trix, Henikoff and Henikoff, used the se­quence align­ments di­rectly.
Figure 7.1.4 shows the count­ing step of the al­go­rithm. Using align­ment col­umns in con­served
blocks of multiple-sequence alignments, the first stage is to count all the possible amino acid
pairings, referred to as tuples.

FIGURE 7.1.4. Counting amino acid tuples (pairs) in se­quence align­ment blocks.
(A) Multiple-sequence align­ment of four pro­teins. (B) The first step of the BLOSUM
algorithm is to identify blocks of positions in the sequence alignment without gaps. (C)
Using these blocks, the algorithm counts all the possible pairs within each column. These
are the observed substitutions. In the two columns analyzed in the figure, the algorithm
counted 6 NN tuples in the first col­umn and 4 EE tuples, 1 RE tuple, and 1 ER tuple in the
sec­ond col­umn.
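The pair-counting step can be sketched in Python as follows. The block below is a made-up example (it is not the alignment from Fig. 7.1.4), the function names are hypothetical, and mixed pairs such as RE and ER are pooled under one sorted label.

```python
from collections import Counter
from itertools import combinations

def count_column_pairs(column):
    """Count every unordered pair of residues within one alignment column."""
    pairs = Counter()
    for a, b in combinations(column, 2):
        pairs["".join(sorted((a, b)))] += 1
    return pairs

def count_block_pairs(block):
    """Sum the pair counts over all columns of an ungapped block."""
    totals = Counter()
    for column in zip(*block):
        totals += count_column_pairs(column)
    return totals

# Hypothetical ungapped block: four sequences, two columns.
block = ["NE", "NE", "NE", "NR"]
pairs = count_block_pairs(block)
print(dict(pairs))
```

A column of four identical residues yields 6 pairs because there are 6 ways to choose 2 sequences out of 4.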
These tuples are then used to calculate the observed frequencies of the pairings (Fig. 7.1.5).
Then the frequency of the individual amino acids in the tuples is used to calculate the expected
frequency of all the tuples (Fig. 7.1.6). Finally, with the observed and expected frequencies
calculated, the last step is to calculate the log odds scores as follows:

$$\text{log odds ratio} = 2 \times \log_2\left(\frac{P(O)}{P(E)}\right)$$

FIGURE 7.1.5. Calculation of ob­served fre­quen­cies [P(O)]. (A) The tuple ta­ble with the to­tal
num­ber of pairs. (B) Calculation of ob­served fre­quen­cies. If the amino acid tuple count is of the
same amino acid (i.e., NN), the tuple count is di­vided by the to­tal num­ber of ob­served tuples. If
the amino acid count is of two dif­fer­ent amino ac­ids (i.e., RE), one first com­bines the to­tal of both
pos­si­ble com­bi­na­tions (RE and ER) and then di­vi­des by the to­tal num­ber of ob­served tuples.
(C) Table of ob­served fre­quen­cies. The ze­ros in­di­cate tuples that have not yet been ob­served.
FIGURE 7.1.6. Calculation of expected frequencies [P(E)]. (A) Frequencies of each
amino acid in the conserved sequence blocks. Glutamic acid (E) comprises half (10/20 = 0.5)
of the total amino acids in the tuples from the highlighted sequence block in Fig. 7.1.4.
(B and C) The expected values of the tuples are calculated by multiplying the independent
probabilities (frequencies) of each amino acid. For instance, the expected frequency of a
QQ pair is the square of the independent frequency of Q. Since the matrix is symmetrical,
the expected frequency of an EQ pair is the combined frequency of EQ and QE.
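Putting the observed frequency P(O), the expected frequency P(E), and the log odds formula together, a BLOSUM-style score can be sketched in Python. The pair counts are invented for illustration and the function names are hypothetical; pairs that were never observed are simply left out, since the log of zero is undefined.

```python
import math
from collections import Counter

def observed_frequencies(pair_counts):
    """P(O): each pair count divided by the total number of pairs."""
    total = sum(pair_counts.values())
    return {pair: n / total for pair, n in pair_counts.items()}

def residue_frequencies(pair_counts):
    """Frequency of each amino acid among all residues in the pairs."""
    residues = Counter()
    for pair, n in pair_counts.items():
        for res in pair:
            residues[res] += n
    total = sum(residues.values())
    return {res: n / total for res, n in residues.items()}

def expected_frequency(pair, freqs):
    """P(E): p*p for identical pairs, 2*p*q for mixed pairs (both orders)."""
    a, b = pair
    if a == b:
        return freqs[a] ** 2
    return 2 * freqs[a] * freqs[b]

def blosum_score(pair, observed, freqs):
    """2 * log2(P(O) / P(E)) for a pair that was observed at least once."""
    return 2 * math.log2(observed[pair] / expected_frequency(pair, freqs))

# Hypothetical pair counts (mixed pairs already pooled, e.g. ER + RE).
pair_counts = {"NN": 6, "EE": 3, "ER": 3}
observed = observed_frequencies(pair_counts)
freqs = residue_frequencies(pair_counts)
scores = {pair: blosum_score(pair, observed, freqs) for pair in pair_counts}
print({pair: round(s, 2) for pair, s in scores.items()})
```

In this toy data NN pairs are observed twice as often as expected by chance, so the NN score is 2 × log2(2) = 2.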

BLOSUM ma­tri­ces, like the PAM ma­tri­ces, have also been de­signed for se­quences with var­
i­ous lev­els of di­ver­gence. Unlike the PAM ma­tri­ces, larger BLOSUM num­bers should be used
with more sim­i­lar se­quences. The BLOSUM80 ma­trix was con­structed us­ing pro­tein se­quence
align­ments that clus­tered to­gether at the 80% sim­i­lar­ity level, while the BLOSUM62 ma­tri­ces
clus­tered at 62%.
Exercises
Interactive ex­er­cise (the­o­ry)
Use the on­line PAM and BLOSUM in­ter­ac­tive links be­low to learn how these ma­
tri­ces are cre­ated us­ing ob­served amino acid sub­sti­tu­tions. The in­ter­ac­tive links
ex­plain how to use the teach­ing in­ter­ac­tives. Once you learn how they work, use
them to solve the prob­lems in the next sec­tion.

BLOSUM Interactive Link

Link: http://kelleybioinfo.org/algorithms/default.php?o=13

PAM Interactive Link

Link: http://kelleybioinfo.org/algorithms/default.php?o=14
Problems
Solving for a BLOSUM ma­trix
1. Circle the se­quence block in the mul­ti­ple-sequence align­ment be­low that
could be used for BLOSUM ma­trix cal­cu­la­tions.

2. Fill in the blanks us­ing the tuple val­ues in the ta­ble.


3. Fill in the blanks us­ing the ex­pect­ed-probability val­ues in the ta­ble.
Lab Exercises (Practice)


In this part of the ex­er­cise, you will ex­plore how PAM and BLOSUM are used in
BLAST protein searches. You will also learn how to use the hidden Markov
model-based Pfam database.

Blastp Tutorial Link

Link: http://kelleybioinfo.org/algorithms/tutorial/TProb1.pdf

Sample and lab exercise data: http://kelleybioinfo.org/algorithms/data/DProb1.txt

Pfam Tutorial Link

Link: http://kelleybioinfo.org/algorithms/tutorial/TProb2.pdf

Sample and lab exercise data: http://kelleybioinfo.org/algorithms/data/DProb1.txt
Lab Exercise
Part 1. BLOSUM and PAM: using blastp advanced parameters

1. Use the pro­tein BLAST (blastp) pro­gram at NCBI to per­form a BLAST search
with dif­fer­ent PAM and BLOSUM ma­tri­ces.

a. Search us­ing the prot1 se­quence from the sam­ple data. Adjust the blastp
ad­vanced pa­ram­e­ters so that the search is per­formed with the PAM250
ma­trix. Find a re­sult with an iden­tity of less than 85%. Write the an­swers
to the prob­lems be­low.

Protein func­tion:

NCBI ref­er­ence se­quence iden­ti­fi­er:

Organism (scientific name, and common name if available):

Identities:

Positives:

Gaps:

First 5 po­si­tions of the align­ment of Query to Sbjct:

b. Repeat the search with the prot1 se­quence, but this time per­form the
search with the BLOSUM90 ma­trix. Find the same match re­sult as in part
a and write the an­swers to the prob­lems be­low.

Identities:

Positives:

Gaps:

First 5 po­si­tions of the align­ment of Query to Sbjct:

c. Note any dif­fer­ences be­tween the two BLAST align­ments.


Part 2. Pfam da­ta­base

Search the Pfam da­ta­base with the prot2 se­quence from the sam­ple data and
write/draw the an­swers to the fol­low­ing ques­tions.

1. What sig­nif­i­cant Pfam match or matches did you find? Write the de­scrip­tion.

2. What are any known func­tions of the pro­tein fam­ily (fam­il­ies)?

3. What is the expected value (E value) of this match from the sequence search
results page?

4. Draw/show the first two po­si­tions of the HMM logo.

5. What does the logo say about po­ten­tially im­por­tant amino ac­ids in this pro­tein?

Notes
1. Not surprising since a hard ball about the size of a human fist is hurled at the batter at speeds
close to 100 miles per hour (160 km/h), often with a wicked spin.
2. The PAM matrix calculates the log odds score using the natural logarithm, while BLOSUM
uses the base 2 logarithm. It turns out that the base used is not all that important, but I find
base 10 easiest to interpret.
CHAPTER 08
BIOINFORMATICS PROGRAMMING: A PRIMER

The main goals of this hypertextbook are to teach the purpose of bioinformatics,
the algorithms underlying some of the more commonly used bioinformatics
programs, and how to use bioinformatics software to analyze
sequence data. The final chapter focuses on the next step in the evolution of
the bioinformatician: programming. While there are many terrific programs
already available for bioinformatics analysis, there comes a time when you may
need to re­for­mat a large data set for a par­tic­u­lar pro­gram, or you may need to
com­pile a Unix pro­gram, or you may de­cide that cut­ting and past­ing data sets into
web­sites will de­lay your grad­u­a­tion or pub­li­ca­tion by ap­prox­i­ma­tely a de­cade. The
pur­pose of this chap­ter is to fa­mil­iar­ize you with a widely used bioinformatics pro­
gram­ming en­vi­ron­ment, namely, the Unix op­er­at­ing sys­tem (OS), and two com­
monly used bioinformatics pro­gram­ming lan­guages: the R and Python lan­guages.
Many ex­cel­lent books, on­line tu­to­ri­als, and clas­ses ex­ist for teach­ing Unix, R, and
Python, and af­ter fin­ish­ing the ex­er­cises in this chap­ter you should be ready to
tackle some of these resources and learn on your own. In the primers below
you will find links to free re­sources and use­ful text­books for ad­di­tional self-tutorials
or in­struc­tions.

The Unix Operating System


The collection of software known as the OS runs all aspects of a computational
device (computer, phone, tablet, etc.). The OS controls and directs all the
processors and devices, it runs the application software, and it contains and controls the
computer file structure. The most commonly used computer OSs, namely,
Windows and the MacOS, have very friendly user interfaces that make it easy to run
programs, store data, and operate external devices. Smartphones are even
easier. While these interfaces make standard tasks like opening a spreadsheet and
storing photos simple, the windows-style interfaces make for poor programming
environments.
Unix-style OSs are not user friendly, but they are dynamite programming
environments. In a Unix system, one types specific instructions on the command line
into a special window called a terminal to make things happen. For instance,
instead of using the menu bar to make a new folder (in Unix systems, folders are
called directories) on your computer, you can quickly make one with a command
called “mkdir” (make directory). Let’s say you had a lot of secrets to keep: you
could make a directory called “TooManySecrets.”

$ mkdir TooManySecrets

Then, instead of clicking or touching the TooManySecrets directory on the
computer to open it and see the contents inside, you could instead use the cd
command, which stands for “change directory,” followed by the ls command, which
lists the files.

$ cd TooManySecrets
$ ls

It may seem awk­ward, but once you learn the com­mands and un­der­stand the
OS, with one com­mand you can lo­cate any file or any pro­gram on the com­puter
no mat­ter what di­rec­tory you are in. For ex­am­ple, in­stead of click­ing through four
dif­fer­ent fold­ers to find the fi­les in TooManySecrets, you might do the fol­low­ing
in­stead:

$ cd ~/Documents/ScottStuff/MISC/TooManySecrets

Ta-da! The / symbol tells the computer to look in a subdirectory. The ~ is the home
directory. So, this command changes directory by following a path from the home
directory to Documents to ScottStuff to MISC to TooManySecrets. To find the
path to the directory you are currently in, just type the command pwd (print working
directory) and then press the Enter key. In a Windows or Mac-style OS, to get to the
directory TooManySecrets, you would need to open Documents, then open ScottStuff,
then open MISC, and then search for the folder in that directory. The cd command does the
same thing without all the clicking.
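Since this chapter also introduces Python, it may help to see the same navigation steps scripted rather than typed. The sketch below uses only the Python standard library and works inside a throwaway temporary directory, so nothing in your real home directory is touched; the file name secret1.txt is a made-up placeholder.

```python
import os
import tempfile
from pathlib import Path

# Work in a scratch directory so the real home directory is left alone.
with tempfile.TemporaryDirectory() as scratch:
    secrets = Path(scratch) / "Documents" / "ScottStuff" / "MISC" / "TooManySecrets"
    secrets.mkdir(parents=True)          # like: mkdir (creating parent directories too)
    (secrets / "secret1.txt").touch()    # hypothetical file so there is something to list

    start = os.getcwd()
    os.chdir(secrets)                    # like: cd <path>
    working_dir = Path.cwd()             # like: pwd
    listing = sorted(os.listdir("."))    # like: ls
    os.chdir(start)                      # step back out before the scratch directory is removed

print(working_dir.name, listing)
```

The point is the same as on the command line: once you know the path, one instruction takes you straight to the directory.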
Even better, the Unix command line allows you to execute any software program
from anywhere on the computer, as long as that program is installed in a
directory that the system searches for commands (the directories listed in the PATH
variable). For example,

$ python

This command finds and opens the Python program. If configured correctly, you
can also open browsers, spreadsheets, or any other program. These are major
advantages for bioinformatics programming, and every bioinformatics programmer
needs to be competent with the Unix command line.

A short Unix tu­to­ri­al


This tu­to­rial pres­ents a short in­tro­duc­tion to the Unix/Linux OSs. Unix is an OS
de­vel­oped in the early 1970s that is widely used by pro­gram­mers. Linux1 is an
open-source ver­sion of the Unix OS de­vel­oped in the 1990s. Many com­put­ers
run this OS, and a lot of programming code has been developed to run on Unix-like
systems, so it is good for bioinformaticians to be familiar with this OS. After
the tutorial in this chapter, you can complete a more extensive tutorial at
http://www.ee.surrey.ac.uk/Teaching/Unix/

You can also learn from this ex­cel­lent, free on­line book, The Linux Command Line,
by Wil­liam Shotts:

http://linuxcommand.org/tlcl.php

There are also many other available online tutorials and books. The MacOS
comes with a Unix terminal, which can be used to complete the tutorial exercise.
If you are using Windows 10, a Bash shell command line tool is included for
developers, but you may need to activate it.2 Alternatively, you will need to
download and install a Unix emulator or a virtual machine on Windows, or install a flavor
of Linux on your personal computer.

Step 1: Open a ter­mi­nal win­dow

Linux/Unix Tutorial: The terminal window

Unix-like systems can be accessed by opening a terminal window and typing on
the command line. MacOS is built on a “flavor” of Unix.

Unix-like systems have a clear hierarchical directory structure. Every file and
folder (directory) in these systems is accessible from the command line, as long
as you know the file "path" to what you want to find.

To start playing around in Unix, open a terminal window that should look
something like the one below.
Step 2: Entering com­mands

Linux/Unix Tutorial: Entering commands

In the terminal window, you type in commands and hit “Enter” and things
happen! Here, I have typed the commands pwd and then ls.

The pwd command stands for “print working directory” and reports the directory
path of the current directory. The ls command lists all the files and directories
within the current directory.
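
For readers who later want the same information inside a script, here is a minimal Python sketch of what pwd and ls report. It is an illustration, not part of the original tutorial: the function names are hypothetical, and skipping names that begin with a dot is an assumption made to mimic the default behavior of ls.

```python
from pathlib import Path


def print_working_directory() -> str:
    """Return the absolute path of the current directory, like pwd."""
    return str(Path.cwd())


def list_directory(directory: str = ".") -> list:
    """Return the sorted names of visible files and folders, like ls."""
    names = [entry.name for entry in Path(directory).iterdir()]
    # Plain ls skips hidden entries, i.e., names starting with a dot.
    return sorted(name for name in names if not name.startswith("."))


if __name__ == "__main__":
    print(print_working_directory())
    for name in list_directory():
        print(name)
```

Passing a path such as list_directory("/tmp") lists a different directory, much as ls /tmp would.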

Step 3: More commands

Linux/Unix Tutorial: More commands

There are lots of commands you can use to navigate the operating system.
Approximately 20 to 30 commands are used all the time. Unix is tedious at first, but
it is much faster than clicking through a lot of folders and scrolling through
windows. Below are some example commands.
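
As a rough illustration of the kind of everyday commands meant here, the sketch below walks through Python counterparts of mkdir, cd, cp, mv, rm, and ls. The selection of commands, the file names, and the function name demo_commands are assumptions made for this example, and everything happens inside a temporary folder so that no real files are touched.

```python
import os
import shutil
import tempfile
from pathlib import Path


def demo_commands() -> list:
    """Run Python counterparts of common commands in a scratch folder."""
    start = Path.cwd()
    with tempfile.TemporaryDirectory() as scratch:
        project = Path(scratch) / "project"
        project.mkdir()                              # mkdir project
        os.chdir(project)                            # cd project
        try:
            Path("notes.txt").write_text("hello\n")  # create a small file
            shutil.copy("notes.txt", "copy.txt")     # cp notes.txt copy.txt
            shutil.move("copy.txt", "renamed.txt")   # mv copy.txt renamed.txt
            Path("notes.txt").unlink()               # rm notes.txt
            remaining = sorted(p.name for p in Path(".").iterdir())  # ls
        finally:
            os.chdir(start)                          # cd back before cleanup
    return remaining


if __name__ == "__main__":
    print(demo_commands())
```

The comment on each line names the Unix command that the Python call stands in for.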

Step 4: And even more commands

Linux/Unix Tutorial: And more commands

Introduction to R
R is a programming language and environment for statistical computing and
graphics. R has the tools of a commercial statistical package (e.g., SPSS, Systat,
or SAS) but is free of charge. (I wish I had discovered R years before I did.) I
regularly use R to test for statistical correlations, make box plots, and perform
analyses of variance, regressions, or dozens of other statistical tests. However,
the reason that R is a must for bioinformatics is that many other packages
(libraries), including many bioinformatics packages, are available for it. Dozens
of bioinformatics methods have been coded in R, and an R library is often the first
place where a new sequence analysis algorithm becomes accessible. At the time of
this writing, the open-source Bioconductor repository (http://www.bioconductor.org)
had more than 1,200 R packages for high-throughput data analysis. The CRAN (note 3)
repository also contains dozens of libraries (note 4) with powerful statistics and
algorithms for biological data analysis, such as “vegan,” “randomForest,” “ggplot2,”
and many multivariate analysis packages (note 5).

This tutorial teaches a few basics to get you going, including (i) how to install
R, (ii) how to read in a data file, and (iii) how to perform a few simple statistical
analyses.
To continue learning R, I recommend the tutorial at

http://www.cyclismo.org/tutorial/R/index.html

Another, more challenging tutorial can be found at

http://tryr.codeschool.com/levels/1/challenges/1

Finally, I recommend the excellent R Cookbook, by Paul Teetor.

Step 1: Installation instructions

R Tutorial: Installation

To install R, go to the R website: http://www.r-project.org

R Tutorial: Installation

Choose the closest CRAN Mirror site for download.



Step 2: Commands in R

R Tutorial: Opening the R console


Double-click the R icon and you should get a window that looks like this:

Step 3: Reading in a data set for statistical analysis

This next section requires that you download a data set and put it on your
computer desktop (note 6). The file used in the exercise can be found at

http://kelleybioinfo.org/algorithms/basics/programming/RTestData.txt

Notes
1. The sample names cannot start with a number. For instance, you cannot put
001 instead of S001. If you have numbers, put a letter in front of them.
2. Do not allow spaces in any of the names of variables. No funny symbols or
special characters. Letters and numbers only.
3. Empty cells in a data set, called a data frame in R, must be replaced by NA (for
“not available”).
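
These three rules are easy to check automatically before handing a file to R. The following Python sketch shows one possible way to do so; it is not from the original tutorial. The function name check_table is hypothetical, the file is assumed to be tab-delimited with a header row and the sample names in the first column, and the example rows are invented purely to exercise the checks.

```python
import csv
import io
import re

# Letters and numbers only, and the first character must be a letter.
VALID_NAME = re.compile(r"^[A-Za-z][A-Za-z0-9]*$")


def check_table(handle) -> list:
    """Return a list of problems found in a non-empty tab-delimited table."""
    problems = []
    rows = list(csv.reader(handle, delimiter="\t"))
    header, body = rows[0], rows[1:]
    for name in header:
        if not VALID_NAME.match(name):
            problems.append(f"bad column name: {name!r}")
    for line_number, row in enumerate(body, start=2):
        if not row:
            continue  # ignore completely blank lines
        if not VALID_NAME.match(row[0]):
            problems.append(f"line {line_number}: bad sample name {row[0]!r}")
        for column, cell in zip(header, row):
            if cell.strip() == "":
                problems.append(f"line {line_number}: empty cell in {column}; use NA")
    return problems


if __name__ == "__main__":
    # Invented example rows, not taken from RTestData.txt.
    text = "id\tstrep\tgum depth\nS001\t12.5\t3.1\n002\t\t2.8\n"
    for problem in check_table(io.StringIO(text)):
        print(problem)
```

An empty list means the file passed all three checks. The same function works on a real file opened with open("RTestData.txt", newline="").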

The next part loads the data into R, where it is stored as a data frame. In this
case, read the data in using the read.table function and assign it to the variable
d (for “data”). Note that many textbooks and tutorials assign values to variables
using the arrow symbol. For instance, the code shown in step 2 could be written

d <- read.table("Desktop/RTestData.txt", header=TRUE)

However, most programming languages use the equal sign to assign variables. I
think it looks much better than the arrow, and it works great in R. You can use
arrow symbols if you like, but it takes twice as many keystrokes.

Step 4: Analyzing your data

After you have installed R and loaded up the test data set, try out some of the
very exciting statistical analyses!

R Tutorial: Reading in a data set

Step 1: You need a text-only file that R can read. I created a tab-delimited text file called
RTestData.txt. The data are from a gum disease study and give the abundances of bacteria
found in the human mouth.

Description of RTestData: Below is the meaning of each column header.

id=Code for each patient. Two rows for each subject: one before and one
after gum cleaning.
strep=Percentage of Streptococcus bacteria
lepto=Percentage of Leptotrichia bacteria
prev=Percentage of Prevotella bacteria
fuso=Percentage of Fusobacteria bacteria
veil=Percentage of Veillonella bacteria
time=Time that sample was taken: 1–before gum cleaning; 2–after gum cleaning
status=Disease status: 1 is healthy, 2 is diseased gums
pocket=Average gum pocket depth across all the teeth in the mouth (in millimeters)
deepest=Depth of the deepest gum pocket in the mouth (in millimeters)

RTestData.txt can be downloaded by saving it as a text file or by copy/pasting data into a text file:
http://kelleybioinfo.org/algorithms/basics/programming/RTestData.txt

R Tutorial: Reading in a dataset (Mac and Linux)

Step 2: To read in your data set, you need to know where the data set is on your computer.
(I made it easy and put it on my desktop.) Then type the path to the folder/directory in the
console and hit return.

R Tutorial: Reading in a data set (Windows)

To read in your data set in Windows, you have to find the path to the file. To find
the path, right-click the data file and choose “Properties” at the bottom of the
menu. You will get a window that looks like this:

R Tutorial: Simple analyses

Note that after reading the data into the variable d, you access an individual
column with the $ operator (for example, d$strep). Because it is annoying to keep
typing the dollar sign, I copied the column to a new variable name like this:
strep=d$strep
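
For comparison, the same idea of pulling one column out of a table can be sketched in Python using only the standard library. This is an illustrative analogue rather than the R session from the tutorial: read_columns and numeric are hypothetical helpers, the file is assumed to be well formed and tab-delimited with a header row, and the three data rows are made-up numbers, not values from RTestData.txt.

```python
import csv
import io
from statistics import mean


def read_columns(handle) -> dict:
    """Read a tab-delimited table into a dict mapping column name to values."""
    reader = csv.DictReader(handle, delimiter="\t")
    columns = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name, cell in row.items():
            columns[name].append(cell)
    return columns


def numeric(values) -> list:
    """Convert text cells to floats, skipping NA (missing) entries."""
    return [float(value) for value in values if value != "NA"]


if __name__ == "__main__":
    # Made-up rows for illustration only.
    text = "id\tstrep\ttime\nS001\t40.0\t1\nS001\t50.0\t2\nS002\tNA\t1\n"
    d = read_columns(io.StringIO(text))
    strep = numeric(d["strep"])   # plays the role of strep=d$strep
    print(mean(strep))
```

Here d["strep"] plays the part of d$strep, and the NA entries are dropped before the average is taken.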

Introduction to Python
Python has become one of the most widely used programming languages in
bioinformatics, and for good reason. Not only is Python a flexible, fully functional
programming language used in applications around the world, but it was also designed
to be relatively easy to learn. Guido van Rossum (note 7), Python’s creator, combined
the style of the C programming language with the ease of a simpler teaching language
called ABC. The result was a clean and rapid scripting language with a simple
syntax that results in fewer bugs. Python really shines in bioinformatics because
it is wonderful for opening, reading, and parsing data files, and because it is a
tremendous so-called glue language. With Python it is easy to glue together many
different programs (written in R, Unix shell, C, or Python itself) into a powerful
analysis workflow.
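
To make the idea of a glue language concrete, the sketch below has Python launch another program, capture what it prints, and carry on working with the result. It uses the standard subprocess module. The helper name run_and_capture is hypothetical, and the external program is simply a second Python process so that the example runs on any system; in a real workflow it might be an R script or a Unix tool.

```python
import subprocess
import sys


def run_and_capture(command: list) -> str:
    """Run an external program and return its standard output as text."""
    result = subprocess.run(
        command,
        stdout=subprocess.PIPE,    # collect what the program prints
        universal_newlines=True,   # decode the output as text
        check=True,                # raise an error if the program fails
    )
    return result.stdout


if __name__ == "__main__":
    # Stand-in for any command-line tool: a child process that prints a sequence.
    child = [sys.executable, "-c", "print('ACGTACGT')"]
    sequence = run_and_capture(child).strip()
    # Python then continues the workflow with the captured output.
    print(len(sequence), sequence.count("G"))
```

Replacing the child command with a list such as ["ls", "-l"] would capture the output of a Unix command in the same way.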
The Python exercises in this chapter assume the use of version 3.x of the Python
programming language. The current version as of this writing is Python 3.6.
Python is available for download and installation at

https://www.python.org/

if it is not already on your system (check first). You can read more about why
Python is awesome at

https://www.python.org/about/success/

The tutorial here includes a short primer on using the Python Interpreter and
an example of how to write, save, and execute a simple program file. If you
want to learn more after finishing this, there are many excellent free tutorials,
including on the Python website.

Python tutorial:
https://docs.python.org/3/tutorial/index.html
Python for total beginners:
https://wiki.python.org/moin/BeginnersGuide/NonProgrammers
https://automatetheboringstuff.com/

There are loads of other resources and books for learning Python, including a
book called Python for Biologists, by Martin Jones.

Step 1: Invoking Python


Python on the command line

After installing Python, you should be able to invoke Python on the command line in
a terminal window by typing the name of the program, ‘python’, and hitting the
Enter key (on some systems the Python 3 command is ‘python3’). This opens the Python
Interpreter, where you can run Python code directly in the terminal.

This part assumes that you have installed a version of Python 3 and can run
the program in a terminal window. There are other ways to run Python, including
with a text editor like Emacs.

Step 2: Running a loop

The Python Interpreter: Instant feedback

Let’s do something a little more interesting. This code snippet prints the numbers
from 5 to 9. How would you change the loop to print numbers from 3 to 101? Try
it yourself!
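
A loop that behaves as described can be written as follows. This is a sketch that assumes a for loop over range; the wrapper function count_up is a hypothetical addition so that the loop can be reused with other limits. In the interpreter you can also type just the for line and the indented print line.

```python
def count_up(start: int, stop: int) -> list:
    """Print and return the whole numbers from start up to, but not including, stop."""
    printed = []
    for number in range(start, stop):
        print(number)
        printed.append(number)
    return printed


if __name__ == "__main__":
    # range(5, 10) stops before 10, so this prints 5, 6, 7, 8, and 9.
    count_up(5, 10)
```

Because range excludes its second argument, printing through 9 requires a stop value of 10. Keep that in mind when you try the exercise.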

Step 3: Saving a Python program to a file

This part shows you how to save a file and execute it using Python on the command
line. Most people save their Python files by adding a “.py” file extension at
the end of the name. This helps you remember that it is indeed a Python program
file, and many programs will also recognize it as a Python file by the file
extension. You will need to use a text editor to save your file; there are many to
choose from, including editors that come with your OS, such as Notepad or nano (note 8).

Python: Saving your programming code

Once you quit the Python interpreter (type Control-d, or exit(), to quit), all your
work disappears. The interpreter is nice for testing out simple functions or for doing
calculations, but your work is lost after you quit. To save your program, you need to
write it in a separate file. Below we save our work with the Emacs text editor, a
commonly used text editor for programming. (The nano program is an editor that often
comes with Linux. Type nano at the prompt.)
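
As an illustration of what such a saved file might contain, here is a small, self-contained program. The file name gc_content.py, the function name, and the example sequence are all assumptions chosen for this sketch; they do not come from the original tutorial.

```python
# gc_content.py (hypothetical file name)
# Run from the terminal with:  python gc_content.py


def gc_content(sequence: str) -> float:
    """Return the fraction of G and C bases in a DNA sequence."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0
    gc = sequence.count("G") + sequence.count("C")
    return gc / len(sequence)


if __name__ == "__main__":
    # Made-up example sequence.
    example = "ATGCGCTA"
    print(f"GC content of {example}: {gc_content(example):.2f}")
```

Unlike code typed into the interpreter, this file is still there after you quit, so you can rerun or edit it whenever you like.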

After you finish these primers you can work your way through the excellent
websites, online pdfs, and books mentioned throughout the chapter. Armed with this
knowledge, you are on your way towards rapid, custom-designed high-throughput
analysis of your own crazy data. Both R and Python have scores of freely
available bioinformatics-specific libraries that will allow you to implement the
algorithms found in the book and much more, including parsing data sets and
analyzing new data.

Biopython: http://biopython.org
Bioconductor: https://www.bioconductor.org

Certainly, one of the best ways to truly learn a programming language is to have
a project that you really want, or really need, to do. Perhaps you need to reformat
a data set to run with specialized analysis software (Python), or you need to
perform stepwise multiple regression analysis (R), or maybe you need to create a
rapid workflow that uses a series of unrelated programs (Unix). Programming can
be a challenging, error-filled journey, but it’s worth it when you experience that
feeling of glee as your code crunches through thousands of data points to
produce an analysis of unsurpassed beauty. Then you’re ready to take on the world!

Notes
1. Linux was created by a bloke named Linus Torvalds, ergo “Linus’s Unix,” shortened to “Linux.”
2. http://www.windowscentral.com/how-install-bash-shell-command-line-windows-10
3. From the R website: “CRAN is a network of ftp and web servers around the world that store
identical, up-to-date, versions of code and documentation for R.”
4. Inside R, packages can be readily installed by using the install.packages command. For
example, to install ggplot2, use install.packages("ggplot2"). To use the new library, type
library(ggplot2).
5. https://cran.r-project.org/web/views/Multivariate.html
6. To save this file from a web browser, you can, for example, click “Save As” to save as a text
file in Firefox, or click “Save As” and select “page source” in Safari. Also, you can copy/paste
into a text-only processor like nano or Emacs. To check that it is a text file, you can use the
command head or less. If you see a lot of garbage, it is not text-only. Note that Microsoft
Word is not a text-only processor.
7. Van Rossum was a massive Monty Python fan, hence the name: Python.
8. I prefer the Emacs editor, which can be installed on all OSs, though it does have a steep
learning curve.
INDEX

Adenine, nu­cle­o­tide base in DNA and pair­ing, 9, 10 BLAST (Basic Local Alignment Search Tool) al­go­rithm, 1
Alanine, pro­pen­si­ties for, 59 al­go­rithm ac­tiv­ity, 36–38
Alignment page, 1–3 BLAST It, 31–32
Amino ac­ids BLAST tu­to­rial, 40
base­ball anal­ogy for like­li­hood in sub­sti­tu­tion, char­ac­ter­iz­ing pro­tein in Escherichia coli
165–166 ge­nome, 35
chem­i­cal struc­tures of, 16 con­cepts of, 36–38
free en­er­gies for trans­fer of, 53 in­ter­ac­tive ex­er­cise, 38, 39
PAM and BLOSUM ma­tri­ces, 160–161 lab ex­er­cises, 40–44
pro­pen­si­ties, 59, 61 learn­ing ob­jec­tives, 36
sub­sti­tu­tion bias, 159–160 mas­sive parallelization of, 32, 34
sub­sti­tu­tion ma­tri­ces, 157–159 mo­ti­va­tion, 36
Venn di­a­gram of prop­er­ties and ge­netic code, 160 power of, 33–35
Androgen re­cep­tor, 136 re­sults of search of hy­po­thet­i­cal FX093345
Anthrax toxin, rep­re­sen­ta­tion of, 48 nu­cle­o­tide, 32, 33
Archaea, phy­lo­ge­netic tree of life, 134 se­quence align­ment, 158
Asparagine, pro­pen­si­ties for, 59 Blastp Tutorial Link, 175, 176
BLOSUM (Blocks Substitution Matrix), 157–159
Bacteria, phy­lo­ge­netic tree of life, 134 ac­tiv­ity gen­er­at­ing PAM and, 163–166
Bacterial gene, gen­eral struc­ture of, 19 cal­cu­la­tion of, 169–171
Baseball anal­ogy, like­li­hood in amino acid sub­sti­tu­tion, PAM and, 160–161
165–166 BLOSUM Interactive Link, 172
Bioinformatics, 5 Bootstrap
com­puter and, 6–8 anal­y­sis, 138
hid­den Mar­kov Models (HMMs), 161–162 phy­lo­ge­net­ics, 137–139
meth­ods, 48–50 term, 139n7
power of, 33–35
pro­tein, 47–48 Cher­no­byl chicken, se­quence, 100–101
Bioinformatics soft­ware, 1–3 Chou-Fasman al­go­rithm, 50, 58, 157
Python pro­gram­ming lan­guage, 187–190 Chou-Fasman Interactive Link, 61
R pro­gram­ming lan­guage, 183–187 Clustal Omega align­ment pro­gram, 72, 74, 85–86, 88
Unix op­er­at­ing sys­tem, 179–183 Clustal Omega Tutorial Link, 83
Biological da­ta­bases and data stor­age ClustalW pro­gram, 72
con­cepts, 21–25 Computer. See al­so Bioinformatics soft­ware
ex­er­cises, 25–28 bioinformatics, 6–8
learn­ing ob­jec­tives, 20–21 DNA in the, 8–11
mo­ti­va­tion, 20 pro­tein se­quences in the, 13–14, 16–17
Biological mol­e­cules, prop­er­ties, 5 RNA in, 11–13


Cyanobacteria, 133 Humans, DNA se­quence align­ment, 67, 68


Cytosine, nu­cle­o­tide base in DNA and pair­ing, 9, 10 Hydrophobicity Interactive Link, 54
Hydrophobicity plot­ting
Data stor­age, bi­o­log­i­cal da­ta­bases and, 20–28 ac­tiv­ity, 52–55
Deoxyribonucleic acid (DNA) con­cepts, 52–54
chem­i­cal struc­ture at atomic level, 8 lab ex­er­cises, 56–57
com­pu­ta­tional trans­la­tion of DNA se­quence, 49 learn­ing ob­jec­tives, 52
in the com­puter, 8–11 mo­ti­va­tion, 52
dou­ble he­lix struc­ture of, 111, 112
mul­ti­ple se­quence align­ment of, 69 Jones, Martin, 188
pro­gres­sive se­quence align­ment of four DNA
se­quences, 72 Kelley Bioinformatics
re­verse com­ple­men­ta­tion of DNA se­quence, 11 Alignment page, 1–3
ri­bo­nu­cleic acid (RNA) vs., 111, 112 web­site, 1
se­quence mo­tifs, 92
se­quence show­ing four nu­cle­o­tide ba­ses and Likelihood, amino acid sub­sti­tu­tion, 165–166
pair­ing, 9 Linux, 180, 190n1
Distance Matrix Interactive Link, 149 The Linux Command Line (Shotts), 181
DNA. See Deoxyribonucleic acid (DNA) Lysines, cal­cu­la­tion of pro­pen­sity, 59, 60
DNA Learning Center (DNALC), 12
Double he­lix, de­oxy­ri­bo­nu­cleic acid (DNA), 111, 112 Mar­kov mod­els, 48, 56, 70, 93n2
Dynamic pro­gram­ming hid­den Mar­kov mod­els (HMMs), 161–162
ac­tiv­ity, 74–78 MatrixPlot, pre­dict­ing RNA struc­ture, 128, 130
con­cepts, 74–76 MatrixPlot (Mutual Information) Tutorial Link, 128
lab ex­er­cises, 83 Maximum Parsimony Interactive Link, 149
learn­ing ob­jec­tives, 74 Methanobacterium, 133
mo­ti­va­tion, 74 Mfold (Free Energy) Tutorial Link, 128
solv­ing align­ment prob­lem, 77–78 Mineralocorticoid re­cep­tor, 136
Molecular bi­­ol­ogy, 5
Escherichia coli, 133 Mouse, DNA se­quence align­ment, 67, 68
char­ac­ter­iz­ing novel bac­te­rial pro­tein in Multiple se­quence align­ments (MSAs), 67, 72–73
ge­nome, 35 DNA and pro­tein se­quences, 69
con­served se­quences in pro­moter re­gions of, 93 lab ex­er­cises, 84–88
se­quence align­ment of re­gions in E. coli pro­gres­sive se­quence align­ment of four DNA
ge­nome, 92 se­quences, 72
Estrogen re­cep­tor, 136 Mutations, ret­ro­vi­ruses, 73n1
Eukaryota, phy­lo­ge­netic tree of life, 134 Mutual Information (MI)
Eukaryotes al­go­rithm for pre­dict­ing RNA struc­ture, 115–117,
gen­eral struc­ture of gene, 18 121–124
ge­netic code for, 17 ex­er­cise, 127
sim­pli­fied il­lus­tra­tion of RNA tran­scrip­tion in, 12 prin­ci­ple of, 124
Mutual Information Interactive Link, 125
FASTA pro­tein se­quence, 27, 28 Mycobacterium, 26
Mycobacterium tu­ber­cu­lo­sis, 136
GenBank, 7, 20, 21, 48 Myosin, rep­re­sen­ta­tion of, 48
Gene, mo­lec­u­lar struc­ture of, 17–19
Genetic in­for­ma­tion, mo­lec­u­lar struc­ture of a gene, National Center for Biotechnology Information (NCBI),
17–19 20–21, 25
Glucocorticoid re­cep­tor, 136 BLAST tool, 40–44
Guanine, nu­cle­o­tide base in DNA and pair­ing, 9, 10 ex­er­cise for Sulfolobus solfataricus, 28
ex­er­cise us­ing NCBI PubMed, 26
Hesper, Ben, 72 NCBI ORF finder, 43
Hidden Mar­kov Models (HMMs) Needleman-Wunsch DNA/Protein Alignment
bioinformatics, 161–162 Interactive Link, 79
def­i­ni­tion, 162 Needleman-Wunsch meth­od
Hogeweg, Paulien, 72 dy­namic pro­gram­ming al­go­rithm, 74, 76, 80–81
Human es­tro­gen re­cep­tor, 50 screen­shot of mod­ule, 158
Human ge­nome, small frac­tion of, 20 sub­sti­tu­tion ma­tri­ces, 157

Nucleosome core par­ti­cle:DNA frag­ment Propensities


com­plex, 48 cal­cu­lat­ing for amino ac­ids, 59–61
Nucleotide ba­ses, pair­ing in DNA se­quence, 9, 10 for ly­sines, 60
Protein
Oc­cam’s ra­zor, 154n4 bioinformatics, 47–48
cel­lu­lar pro­cesses, 48
Pace, Nor­man, 133 chem­i­cal struc­tures of amino ac­ids for, 16
PAM (Point Accepted Mutation) com­pu­ta­tional trans­la­tion, 49
ac­tiv­ity gen­er­at­ing BLOSUM and, 163–166 hy­dro­pho­bic­ity plot­ting ac­tiv­ity, 52–55
amino acid sub­sti­tu­tion pat­terns, 168 hy­dro­pho­bic re­gions in, 50
BLOSUM and, 160–161 se­quence mo­tifs, 92–93
cal­cu­la­tion of, 166–169 struc­ture as­pects, 15
sub­sti­tu­tion ma­trix, 157–159 trans­la­tion, 13
PAM Interactive Link, 172 Protein sec­ond­ary struc­ture
Parsimony, 154n4 cal­cu­lat­ing pro­pen­si­ties, 59–61
Patterns in data, 91–92 con­cepts, 58–59
logos for se­quences in crit­i­cal cell lab ex­er­cises, 63–65
func­tions, 93 learn­ing ob­jec­tives, 58
se­quence align­ment of re­gions in Escherichia coli mo­ti­va­tion, 58
ge­nome, 92 pre­dic­tion ac­tiv­ity, 58–62
se­quence mo­tifs, 92–93 Protein se­quence
Pfam (pro­tein fam­ily) da­ta­base, 48 in the com­puter, 13–14, 16–17
Pfam Tutorial Link, 175, 177 FASTA, 27, 28
Phylogenetic anal­y­sis mul­ti­ple se­quence align­ment of, 69
ac­tiv­ity, 140–148 Protein se­quence mo­tifs
con­cepts, 140–141 ac­tiv­ity, 94–101
dis­tance method, 141–143 con­cepts, 94–96
in­ter­ac­tive ex­er­cises, 149–151 lab ex­er­cises, 98–101
lab ex­er­cises, 152–153 learn­ing ob­jec­tives, 94
learn­ing ob­jec­tives, 140 mo­ti­va­tion, 94
max­i­mum par­si­mony (MP) method, 143–148 ProtParam tool, 64
mo­ti­va­tion, 140 ProtScale Tutorial Link, 56, 63
Phylogenetic Analysis Tutorial Link, 152 Python for Biologists (Jones), 188
Phylogenetics, uses of, 135, 136 Python pro­gram­ming lan­guage, 187–190
Phylogenetic tree, 133 tu­to­rial, 188
as­pects of, 137
boot­strap anal­y­sis, 138 Retroviruses, 73n1
boot­strap method, 137–139 Ribonucleic acid (RNA)
count­ing al­a­nine sub­sti­tu­tions us­ing, 166, cal­cu­lat­ing RNA free en­ergy, 120
167 com­pen­sa­tory mu­ta­tions in RNA struc­ture, 116
in­ter­pre­ta­tion of, 136–137 in the com­puter, 11–13
max­i­mum par­si­mony (MP) method, 145–148 CRISPR-Cas9 sys­tem, 113
set the­ory, 146, 154n7 de­oxy­ri­bo­nu­cleic acid (DNA) vs., 111, 112
uni­ver­sal, 134 fold­ing al­go­rithms, 115–117
us­ing dis­tances to build a, 142–143 folds for se­quence, 115
Phylogeny, 133–134, 136, 140 mes­sen­ger RNA (mRNA), 13
Position-specific weight ma­tri­ces pre­dict­ing struc­ture, 114–117
ac­tiv­ity, 102–108 ri­bo­somal sub­units, 113
con­cepts, 102–105 roles of, in cells, 112–114
lab ex­er­cises, 107–108 sec­ond­ary struc­ture, 113
learn­ing ob­jec­tives, 102 sec­ond­ary-structure di­a­gram of RN­ase P struc­tural
mo­ti­va­tion, 102 RNA, 114
Probability se­quence mo­tifs, 92
hid­den Mar­kov Models (HMMs), 161–162 sim­pli­fied il­lus­tra­tion of tran­scrip­tion in eu­kary­otes, 12
pro­tein sub­sti­tu­tion ma­tri­ces, 157–159 sim­pli­fied il­lus­tra­tion of trans­la­tion, 14
sub­sti­tu­tion bias, 159–160 sin­gle-stranded na­ture of, 111
Progesterone re­cep­tor, 136 spliceosome, 113
Progressive align­ment method, 72 trans­fer RNA, 112, 113

Ribonucleic acid (RNA) struc­ture pre­dic­tion cal­cu­lat­ing PAM ma­trix, 166–169


ac­tiv­ity, 118–128 con­cepts, 163–166
cal­cu­lat­ing RNA free en­ergy, 120, 121 de­ter­min­ing sub­sti­tu­tion bias, 159–160
com­par­ing pos­si­ble struc­tures, 121 in­ter­ac­tive ex­er­cise, 172–174
con­cepts, 118–124 lab ex­er­cises, 175–177
lab ex­er­cise, 129–130 learn­ing ob­jec­tives, 163
learn­ing ob­jec­tives, 118 mo­ti­va­tion for, 163
mo­ti­va­tion, 118 PAM and BLOSUM, 160–161
mu­tual in­for­ma­tion (MI) method, 115–117, 121–124 pro­tein (amino acid), 157–159
ther­mo­dy­namic sec­ond­ary-structure pre­dic­tion, Sulfolobus solfataricus, ex­er­cise us­ing, 28
115, 118–121
RNA. See Ribonucleic acid (RNA) Taxonomy, 137
RNA Free-Energy Interactive Link, 125 Thermodynamic sec­ond­ary-structure pre­dic­tion, RNA
R pro­gram­ming lan­guage, 183–187 struc­ture, 115, 118–121
tu­to­rial, 183–187 Threonine, pro­pen­si­ties for, 59
Thymine, nu­cle­o­tide base in DNA and pair­ing, 9, 10
ScanProsite Tutorial Link, 98 Torvalds, Linus, 190n1
Sequence align­ment, 67 Transcription, 12, 19n5
chal­lenges in, 70 Transcription Factor Binding Site Tutorial Link, 107
is­sues in, 70–71 Transcription fac­tors (TFs), 102
mu­ta­tional his­tory of pro­tein-coding gene, 71 Transcription fac­tors bind­ing sites (TFBSs), 102, 107
na­ture’s ex­per­i­men­tal re­sults, 67–70 Translation, 14, 19n5
po­si­tion-specific weight ma­tri­ces (PSWMs), Tree of life, 133
102–108 ex­pand­ing, 135
pro­gres­sive, of four DNA se­quences, 72 ram­i­fi­ca­tions of, 133–134
Sequence Motif Interactive Link, 96
Set the­ory, 146, 154n7 UniProt, 48
Severe acute re­spi­ra­tory syn­drome (SARS), 33, 136 Unix op­er­at­ing sys­tem, 179–183
Shotts, Wil­liam, 181 Linux/Unix tu­to­rial, 181–183
SMART (Simple Modular Architecture Research Tool), tu­to­rial, 180–181
100
Smith-Waterman var­i­ant of al­go­rithm, 89n4 Wat­son-Crick base pairings, 111, 119
Steroid hor­mones, 50 Weight Matrix Interactive Link, 105
Substitution ma­tri­ces Whitaker, Lou, 165, 166
ac­tiv­ity gen­er­at­ing PAM and BLOSUM, 163–171 Woese, Carl, 133, 135
cal­cu­lat­ing BLOSUM ma­trix, 169–171 www.​kelleybioinfo.​org, 1
