Pairwise Sequence Allignment
Pairwise Sequence Allignment
Pairwise Sequence Allignment
tics
Biological Research
Traditional biology
Small team working on a specialized topic
Small team working on limited numbers of
targets
( gene, enzyme) in a well defined
experiment to answer precise questions
New high-throughput biology
Large teams working on many targets
(thousand of
genes, enzymes) using cutting edge
technology
Large amount of information is being
created every day.
HelixInfoSystems
Paradigm Shift in
Biology
To use the flood of knowledge, which will
pour across the
computer networks of the world, biologists
not only must
become computer literate, but also change
their approach
to the problem ofHelixInfoSystems
understanding life.
The Problem
How to store and use
these
datas in a useful way?
HelixInfoSystems
The Solution
Leads to a new
insight
called
Bioinformatics.
HelixInfoSystems
What is Bioinformatics?
Using information technology to help solve
biological problems by designing novel and
incisive algorithms and methods of analysis
Bioinformatics = Data Mgmt + Knowledge
Discovery
Data Mgmt =Integration + Transformation +
Cleansing
Knowledge Discovery = Statistics +
Algorithms + Databases
HelixInfoSystems
Where is Bioinformatics
Used?
Genome Projects need to store and
organize DNA sequences Database
Development
How do we find protein coding regions,
introns and exons in genomic DNA
sequences? Gene Finding
Under which conditions is a certain gene
transcribed? Gene Expression Analysis
HelixInfoSystems
Where is Bioinformatics
Used?
What do we know about a specific
protein?
How can we compare protein
sequences? Sequence Alignment
Can we predict protein structures?
Protein Structure Prediction
HelixInfoSystems
The Aims of
Bioinformatics are threefold
The First
To organise data in a way that allows
researchers
to access existing information and
to submit new
entries as they are produced
HelixInfoSystems
10
The Second
HelixInfoSystems
11
The Third
To use these tools to analyse the data
and interpret the
results in a biologically meaningful
manner.
HelixInfoSystems
12
Challenges in
Bioinformatics
Explosion of information
HelixInfoSystems
13
Bioinformatics software
By combining computational
methods with
experimental biology, major
discoveries can be
made faster and more efficiently.
HelixInfoSystems
14
Software Development
Flow Chart
Analyze the
Problem
Refine
Develop Algorithm
Implementation
using
Programming
Further Experiment
Define
HelixInfoSystems
15
Lack of
Bioinformaticians
Software needs to be easier to access, use
and
understand
Biologists need to learn about the software,
its
limitations, and how to interpret its results
HelixInfoSystems
16
A Bioinformatician
should be a
Database user
Tool user
Database developer
Tool developer
Training, practicing or developing
Doing bioinformatics experiments
HelixInfoSystems
17
The future of
Bioinformaticians
In the future, bioinformatics is likely to
become more
central to the way biology is done
Bioinformatics we are woefully short in
terms of having a critical mass of people who
understand both biology and computational
approaches
HelixInfoSystems
18
Representing and
Retrieving Sequences
Definition
A sequence is a linear set of characters
(sequence elements) representing
nucleotides or amino acids
DNA composed of four nucleotides or
"bases": A,C,G,T
RNA composed of four also: A,C,G,U (T
transcribed as U)
proteins are composed of amino acids
(20)
HelixInfoSystems
20
Representation of
Sequences
characters
simplest
easy to read, edit, etc.
bit-coding
more compact, both on disk and
in memory
comparisons more efficient
HelixInfoSystems
21
Character representation
of sequences
DNA or RNA
use 1-letter codes (e.g., A,C,G,T)
protein
use 1-letter codes
can convert to/from 3-letter codes
(e.g., A = Ala = Alanine
C = Cys = Cysteine)
HelixInfoSystems
22
Representing uncertainty
in nucleotide sequences
It is often the case that we would like to
represent uncertainty in a nucleotide
sequence, i.e., that more than one base is
possible at a given position
to express ambiguity during sequencing
to express variation at a position in a
gene during evolution
to express ability of an enzyme to
tolerate more than one base at a given
position of a recognition site
HelixInfoSystems
23
Representing uncertainty in
nucleotide sequences
To do this for nucleotides, we use a
set of single character codes that
represent all possible combinations
of bases
This set was proposed and adopted
by the International Union of
Biochemistry and is referred to as
the I.U.B. code
HelixInfoSystems
24
A, C, G, T, U
R = A, G (puRine)
Y = C, T (pYrimidine)
S = G, C (Strong hydrogen
bonds)
W = A, T (Weak hydrogen bonds)
M = A, C (aMino group)
K = G, T (Keto group)
B = C, G, T (not A)
D = A, G, T (not C)
H = A, C, T (not G)
V = A, C, G (not T/U)
N = A, C, G, T/U (iNdeterminate)
X or - are sometimes used
HelixInfoSystems
25
Representing
uncertainty in protein
sequences
26
Sequence
Analysis
27
HelixInfoSystems
28
Frequency
analysis
29
Sequence
Alignment
Common mistake:
sequence similarity is not
homology!
Homologous sequences: derived
from a
common ancestor
Relation of
sequences
Relation of
sequences
Paralogs: Similar sequences within a single
organism
Measuring homology by
pairwise alignment
Similar
33
Database
searching
One
34
Multiple
alignment
Simultaneously aligning a number
of sequences to produce optimal
global or local alignments (of similar
subsequences).
Applications
global alignment : measurement
of evolutionary distance between
species
local alignment : identifying
subdomains HelixInfoSystems
within sequences(ex.35
exon/intron boundaries,
promoters,
HelixInfoSystems
36
Gene finding
Parts
DNA.
An approach - identify exactly
where four specific signal type can
be found.
The start codon. (ATG)
Beginning of each intron.(GT-)
End of each intron.(-AG)
The stop codon.
HelixInfoSystems
37
Gene finding
The
Analyzing protein
sequence
Proteins
3D-structure determines
structural and functional properties.
But we only know its 1D-sequence.
Prediction is needed.
Proteins structure
primary structure : sequence of
amino acids
secondary structure : folding of
amino sequence
HelixInfoSystems
39
Whole genome
analysis
Looking at genomes
in their entirety.
Still
Pairwise Sequence
Similarity
HelixInfoSystems
41
Why compare
sequences?
Understanding the function, role, and/or structure of
biomolecular sequences (DNA, RNA, proteins)
Evolution reuses, builds
modifies
successful structures.
on,
duplicates,
HelixInfoSystems
and
implies
42
Protein vs.
Amino acids
DNA
20 letter alphabet
43
Fundamental
Concepts
&
Definitions
Question?
Given two strings
HelixInfoSystems
45
Answer
One way of measuring their distance is to look
at what will take to transform X into Y
Edit distance
Edit operations performed on X to change to Y
I ) insertion, D ) deletion, S ) substitution
M ) match (no edit operation is performed)
Example
X = applied
Y = prince
HelixInfoSystems
46
Example
6 edit operations
Edit transcript DMSDMIIMD
appli__ed
_pr_ince_
Is a sequence of edits
unique?
HelixInfoSystems
47
Example continued
SSSSSMD
applied
prince_
DDMIIIIDDMD
app____lied
__princ__e_
DDDDDDDIIIIII
applied______
_______prince
6 edit operations
9 edit operations
13 edit operati
48
Hamming Distance
The number of mismatching
position between the two strings
Example: The Hamming distance
between "toned" and "roses" is 3.
HelixInfoSystems
49
Levenshtein Distance
The minimum number of edit
operation required to convert one
string to another.
Similar to Edit Distance
HelixInfoSystems
50
Alignments
-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A
Three elements:
Perfect matches
Mismatches
Gaps
HelixInfoSystems
51
Choosing
Alignments
There are many possible alignments
For example, compare:
-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A
to
------GCGCATGGATTGAGCGA
TGCGCC----ATTGATGACCA--
52
Scoring
Rule
Example Score =
(# matches) (# mismatches)
(# gaps) x 2
HelixInfoSystems
53
Example
-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A
54
The optimal
alignment
An
optimal There are two possible linear alignments
1.
GDVEKGKKIFIMKCSQ
alignment is one
| |||||
GCVEKGKIFINWCSQ
that exhibits the
2.
GDVEKGKKIFIMKCSQ
most
||| |||
correspondences,
GCVEKGKIFINWCSQ
and
the
least
GDVEKGKKIFIMKCSQ
differences. | ||||| ||| |||
GCVEKGK-IFINWCSQ
55
Methods of
Sequence Alignment
Dot Matrix
Dynamic
Programming
Word or K-tuple
HelixInfoSystems
56
Dot
Matrix
Established in 1970 by A.J. Gibbs and
G.A.McIntyre
Method for comparing two amino acid
or nucleotide sequences
HelixInfoSystems
57
Dot Matrix
Displays any possible sequence
alignments as diagnols in a matrix
Reveals the presence of any
Insertion/deletion
Direct and Inverted repeats
Limitation:does not show actual
region of alignment
HelixInfoSystems
58
Dynamic
Programming
DP is a method for solving certain kind of problems
DP can be applied when the solution of a
problem includes solutions to sub problems
We need to find a recursive formula for the
solution
We can recursively solve subproblems, starting
from the trivial case, and save their solutions in
memory
In the end well get the solution of the whole
problem
HelixInfoSystems
59
Word or KTuple
Align two sequences very quickly
Searches for identical short stretches
of sequences and joining them
HelixInfoSystems
60
Scoring Matrix
HelixInfoSystems
61
62
Scoring
Matrices
63
Unitary
Matrix
Simplest scoring scheme
Amino acids pairs are classified into
Identical
A
Non-identical
A 1
Identical pairs are scored 1
Non-identical pairs are scored 0R 0
Less effective for detection of N 0
weak similarities
HelixInfoSystems
2 types:
R
0
1
0
D 0 0
N
0
0
1
0
D
0
0
0
1
64
Percent Accepted
Mutation
PAM (Percent Accepted Mutation) /
MDM (Mutation Data Matrix ) / Dayhof
Similar sequences organized into
phylogenetic trees
Number of amino acid changes counted
Relative mutabilities evaluated
20 x 20 amino acid substitution matrix
calculated
HelixInfoSystems
65
Percent Accepted
Mutation
PAM 1: 1 accepted mutation event per
100 amino acids; PAM 250: 250
mutation events per 100
PAM 1 matrix can be multiplied by itself
N times to give transition matrices for
sequences that have undergone N
mutations
PAM 250: 20% similar; PAM 120: 40%;
PAM 80: 50%; PAM 60: 60%
HelixInfoSystems
66
BLOSUM
Larger set of sequences considered
Sequences organized into signature
blocks
Consensus sequence formed
60% identical: BLOSUM 60
80% identical: BLOSUM 80
HelixInfoSystems
67
HelixInfoSystems
68
BLOSUM 62
HelixInfoSystems
69
HelixInfoSystems
70
DNA
Mutations
A
PURINES: A, G
PYRIMIDINES C, T
Transitions:
AG, CT
Transversions:
AC, AT
CG, GT
HelixInfoSystems
71
HelixInfoSystems
0.99
0.99
72
Use of scoring
matrices
Deciding which scoring matrix you
should use in order of obtain the best
alignment results is a difficult task.
Selection of a "wrong" scoring matrix
will most probable strongly influence
on the outcome of the analysis
HelixInfoSystems
73
HelixInfoSystems
74
Selection of
Scoring Matrices
For closely related sequences choose
BLOSUM matrices created for highly
similar alignments, like BLOSUM80. You
can also select low PAM matrices such
as PAM1.
For distant related sequences, select
low BLOSUM matrices (for example
BLOSUM45) or high PAM matrices such
as PAM250.
HelixInfoSystems
75
Gap
Penalties
The cost of opening a gap is higher than
the gap extension penalty
This reflects the tendency for insertions
and deletions to occur over several
residues at a time
Scoring scheme should penalize new gaps
more
BLOSUM 62: -11 for gap opening; -1 for
gap extension
BLOSUM 50: -12/-1 penalty
HelixInfoSystems
76
Gap
Penalties
HelixInfoSystems
77
DOTPLO
A dotplot gives
T an overview of all possible al
A
T
T
C
Sequence 2
A
C
A
T
A
A C
A T
A C
Sequence 1
HelixInfoSystems
A C
78
Sequence 2
A
C
A
T
A
A C
A T T A C1 G
Sequence
A C
T A C A T T A C G T A C
One possible alignment:
A T A C A C T T 79
A
HelixInfoSystems
Window-based
approaches
Word Size
Window / Stringency
HelixInfoSystems
80
Word size
algorithm
Word
Size = 3
T A C G G T A T G
A C A G T A T C
C
T
A
T
G
A
C
A
T A C G G T A T G
A C A G T A T C
T A C G G T A T G
A C A G T A T C
T A C G G T A T G
T A C G G T A T G
A C A G T A T C
HelixInfoSystems
81
Window / stringency
T A C G G T A T G
Window = 5 / Stringency = 4
T C A G T A T C
T A C G G T A T G
T C A G T A T C
T A C G G T A T G
T C A G T A T C
T A C G G T A T G
T C A G T A T C
C
T
A
T
G
A
C
A
T A C G G T A T G
HelixInfoSystems
82
Considerati
ons method is
The window/stringency
more sensitive than the wordsize
method (ambiguities are permitted).
The smaller the window, the larger
the weight of statistical (unspecific)
matches.
With large windows the sensitivity for
short sequences is reduced.
HelixInfoSystems
83
Insertions / Deletions
inT a Dotplot
Sequence 2
A
C
T
G
T
C
A
T
T
Sequence 1
T A C T G - T C A T
| | | | |
| | | |
T A C T G T T C A T
HelixInfoSystems
84
Hemoglobin
-chain
Dotplot
(Window =
130 /
Stringency
= 9)
HelixInfoSystems
85
Hemoglobin
-chain
Dotplot
(Window
= 18 /
Stringency
= 10)
HelixInfoSystems
86
Conclusions
What about these alike areas?
Whats the best path through the dot
matrix?
How long do I extend it?
How can I zoom-in on it to see exactly
whats happening?
Where, specifically,
is this alignment; how
HelixInfoSystems
87
Dynamic
Programming
Dynamic programming
Dynamic
programming
guaranteed to
algorithms
are
89
Dynamic Programming
A computational method that works well
for some types of optimization problems.
It works well for problems that can be
solved in a stepwise manner, always
saving just the best solution so far at
each step. (Its efficient because you
disregard loser paths.)
Works well for pair-wise sequence
alignment.of proteins or nucleic acids.
HelixInfoSystems
90
Dynamic Programming
91
Needleman Wunsch
Three
steps
programming
in
dynamic
1. Initialization
2. Matrix fill (scoring)
3. Traceback (alignment)
HelixInfoSystems
92
Needleman and
Wunsch
(global alignment)
Sequence 1:
EE
Sequence 2:
HEAGAWGH
PAWHEAE
Scoring parameters:
Gap penalty:
penalty of 8
BLOSUM50
Linear gap
HelixInfoSystems
93
Step1: Initializaton
The
94
Score Matrix
(BLOSUM 50)
H
-2
-1
-1
-2
-1
-4
-2
-2
-1
-1
-2
-1
-3
-2
-1
-1
-3
-3
-3
-3
-3
15
-3
-3
-3
-3
10
-2
-2
-2
-3
-2
10
-1
-3
-1
-3
-3
-2
-1
-3
-2
-1
-1
-1
-3
-1
-3
-3
HelixInfoSystems
95
96
Creation of an alignment
path
matrix
If F(i-1,j-1),
F(i-1,j) and F(i,j-1) are known
we can calculate F(i,j)
Three possibilities:
xi and yj are aligned, F(i,j) = F(i-1,j1) + s(xi ,yj)
xi is aligned to a gap, F(i,j) = F(i1,j) - d
yj is aligned to a gap, F(i,j) = F(i,j1) - d
HelixInfoSystems
97
Creation of an
alignmentF(i,path
matrix
j) = F(i-1, j-1) + s(x ,y )
i
F(i, j) = max
F(i, j) = F(i-1, j) - d
F(i, j) = F(i, j-1) - d
F(i-1, j-1)
F(i, j-1)
s(xi ,yj)
F(i-1,j)
-d
-d
F(i, j)
HelixInfoSystems
98
Creation of an alignment
H
Ematrix
A
G
A
W
G
H
E
E
path
0
-8
-16 -24 -32 -40 -48 -56 -64 -72 -80
P
-8
A
-16
W
Boundary conditions
-24
F(i, 0) = -i d
-32
F(j, 0) = -j d
H
E
A
-40
E
-48
HelixInfoSystems
99
Creation of an
H
E
A
G
A
W
G
H
E
alignment
path matrix
0
-8
-16 -24 -32 -40 -48 -56 -64 -72
-2
P
-8
F(i, j) = max
-10
A
-16
-24
-9
-3
E
-80
P-H=-2
F(i, j) = F(i-1, j) - d
E-P=-1
H-A=-2
= -8 -8= -16
F(1,0) - d
= -8 -8= -16
= -2
E-A=-1
-32
F(1,0) + s(xi ,yj) = -8 -1 = -9
-40
-48
-56
= -9
-2 -1 = -3
-8 -2 = -10
F(1,2) = max -16 -8 = -24 = -10
-2 -8 = -10
= -2 -8 = -10
F(2,2) = max
-10 -8 = -18 = -3
HelixInfoSystems-9
-8 = -17
100
Backtrac
king
HelixInfoSystems
101
E
-16
A
-24
G
-32
A
-40
W
-48
G
-56
H
-64
E
-72
E
-80
-2
-9
-17
-17
-25
-25
-42
-49
-57
-65
-73
-12
-33
-20
-20
A -16
-10
-3
-4
-52
-60
-7
-15
-36
-13
-13
-44
-6
-28
-5
-5
W -24
-18
-11
-29
-37
-18
-13
-8
-9
-13
-7
-21
-3
-3
H -32
-14
-19
-22
-8
-16
-16
-9
-12
-15
-7
A -48
-30
-16
-3
-11
-11
-12
-12
-15
-11
3
3
-5
-5
E -40
E -56
-38
-24 -11 -6
-12 -14 -15 -12
Optimal global alignment: HEAG AWGHE-E
--P- AW-HEAE
0
P
-8
H
-8
HelixInfoSystems
-5
-9
102
2
1
1
Smith and
Waterman
(local alignment)
Two differences:
0
1. F(i, j) = max
H EAGAW G H E E
PAW H EAE
Scoring parameters:
Gap penalty:
BLOSUM
Linear gap penalty of 8
HelixInfoSystems
103
H
0
E
0
A
0
G
0
A
0
W
0
G
0
H
0
E
0
E
0
0
12
12
0
6
0
0
0
5
5
0
20
20
10
12
18
4
22
22
16
10
18
14
28
28
21
13
10
20
27
6
13
18
12
4
0
A WGH E
Optimal local alignment:
A W-H E
16
26
HelixInfoSystems
20
104
HelixInfoSystems
105
H
0
E
0
A
0
G
0
0
H
E
A
E
0
0
0
0
10
2
0
0
2
16
8
6
0
8
21
13
0
0
13
18
A
0
W
0
G
0
H
0
E
0
0
5
0
20
0
12
0
4
12
18
22
0
14
0
6
10
18
28
20
10
20
27
16
26
0
0
0
5
12
0
4
HelixInfoSystems
E
0
106
H
0
E
0
A
0
G
0
A
0
W
0
G
0
H
0
E
0
E
0
0
0
0
0
10
10
2
16
16
8
21
21
13
13
18
12
4
Second best local alignment:
HelixInfoSystems
0
HEA
HEA
107
Dynamic programming
Methods
108