Pairwise Sequence Allignment

Bioinforma
tics
Biological Research
Traditional biology
Small team working on a specialized topic
Small team working on limited numbers of
targets
( gene, enzyme) in a well defined
experiment to answer precise questions
New high-throughput biology
Large teams working on many targets
(thousand of
genes, enzymes) using cutting edge
technology
Large amount of information is being
created every day.
HelixInfoSystems
Paradigm Shift in
Biology
To use the flood of knowledge, which will
pour across the
computer networks of the world, biologists
not only must
become computer literate, but also change
their approach
to the problem ofHelixInfoSystems
understanding life.
Whats Really Next

Bridging the gap between the real
world of biology
and precise logical nature of
computers with an
interdisciplinary perspective.
HelixInfoSystems
The Problem
How to store and use
these
datas in a useful way?
HelixInfoSystems
The Solution
Leads to a new
insight
called
Bioinformatics.
HelixInfoSystems
What is Bioinformatics?
Using information technology to help solve
biological problems by designing novel and
incisive algorithms and methods of analysis
Bioinformatics = Data Mgmt + Knowledge
Discovery
Data Mgmt =Integration + Transformation +
Cleansing
Knowledge Discovery = Statistics +
Algorithms + Databases
HelixInfoSystems
Where is Bioinformatics
Used?
Genome Projects need to store and
organize DNA sequences Database
Development
How do we find protein coding regions,
introns and exons in genomic DNA
sequences? Gene Finding
Under which conditions is a certain gene
transcribed? Gene Expression Analysis
HelixInfoSystems
Where is Bioinformatics
Used?
What do we know about a specific
protein?
How can we compare protein
sequences? Sequence Alignment
Can we predict protein structures?
Protein Structure Prediction
HelixInfoSystems
The Aims of
Bioinformatics are threefold
The First
To organise data in a way that allows
researchers
to access existing information and
to submit new
entries as they are produced
HelixInfoSystems
10
The Second
To develop tools,algorithms and statistics

that aid
in the analysis of data.
HelixInfoSystems
11
The Third
To use these tools to analyse the data
and interpret the
results in a biologically meaningful
manner.
HelixInfoSystems
12
Challenges in
Bioinformatics
Explosion of information
Need for faster, automated analysis to process

large amounts of data
Need for integration between different types of
information (sequences, literature, annotations,
protein levels, RNA levels etc)
Need for smarter software to identify
interesting relationships in very large data sets
HelixInfoSystems
13
Bioinformatics software
By combining computational
methods with
experimental biology, major
discoveries can be
made faster and more efficiently.
HelixInfoSystems
14
Software Development
Flow Chart
Analyze the
Problem
Refine
Develop Algorithm
Implementation
using
Programming
Further Experiment
Define
HelixInfoSystems
15
Lack of
Bioinformaticians
Software needs to be easier to access, use
and
understand
Biologists need to learn about the software,
its
limitations, and how to interpret its results
HelixInfoSystems
16
A Bioinformatician
should be a
Database user
Tool user
Database developer
Tool developer
Training, practicing or developing
Doing bioinformatics experiments
HelixInfoSystems
17
The future of
Bioinformaticians
In the future, bioinformatics is likely to
become more
central to the way biology is done
Bioinformatics we are woefully short in
terms of having a critical mass of people who
understand both biology and computational
approaches
HelixInfoSystems
18
Representing and
Retrieving Sequences
Definition
A sequence is a linear set of characters
(sequence elements) representing
nucleotides or amino acids
DNA composed of four nucleotides or
"bases": A,C,G,T
RNA composed of four also: A,C,G,U (T
transcribed as U)
proteins are composed of amino acids
(20)
HelixInfoSystems
20
Representation of
Sequences
characters
simplest
easy to read, edit, etc.
bit-coding
more compact, both on disk and
in memory
comparisons more efficient
HelixInfoSystems
21
Character representation
of sequences
DNA or RNA
use 1-letter codes (e.g., A,C,G,T)
protein
use 1-letter codes
can convert to/from 3-letter codes
(e.g., A = Ala = Alanine
C = Cys = Cysteine)
HelixInfoSystems
22
Representing uncertainty
in nucleotide sequences
It is often the case that we would like to
represent uncertainty in a nucleotide
sequence, i.e., that more than one base is
possible at a given position
to express ambiguity during sequencing
to express variation at a position in a
gene during evolution
to express ability of an enzyme to
tolerate more than one base at a given
position of a recognition site
HelixInfoSystems
23
Representing uncertainty in
nucleotide sequences
To do this for nucleotides, we use a
set of single character codes that
represent all possible combinations
of bases
This set was proposed and adopted
by the International Union of
Biochemistry and is referred to as
the I.U.B. code
HelixInfoSystems
24
The I.U.B. Code
A, C, G, T, U
R = A, G (puRine)
Y = C, T (pYrimidine)
S = G, C (Strong hydrogen
bonds)
W = A, T (Weak hydrogen bonds)
M = A, C (aMino group)
K = G, T (Keto group)
B = C, G, T (not A)
D = A, G, T (not C)
H = A, C, T (not G)
V = A, C, G (not T/U)
N = A, C, G, T/U (iNdeterminate)
X or - are sometimes used
HelixInfoSystems
25
Representing
uncertainty in protein
sequences
Given the size of the amino acid

alphabet, it is not practical to design a
set of codes for ambiguity in protein
sequences
Fortunately, ambiguity is less common
in protein sequences than in nucleic acid
sequences
Could use bit-coding as for nucleic acids
but rarely done
HelixInfoSystems
26
Sequence
Analysis
- Exploring the unknown information

about the
Biological Sequences (DNA,RNA &
Protein) by
using various statistical
methods .
HelixInfoSystems
27
What kinds of analyses

are there?
HelixInfoSystems
28
Frequency
analysis
Determine the frequency of occurrence

of sequence elements.
Applications
using oligomer frequency to
distinguish coding and noncoding
regions.
frequency of amino acids for predicting
3D structure of protein, functionality,
location in the cell.
HelixInfoSystems
29
Sequence
Alignment
Common mistake:
sequence similarity is not
homology!
Homologous sequences: derived
from a
common ancestor
Relation of
sequences
Homologs: similar sequences in 2 different

organisms
derived from a common ancestor sequence.
Orthologs: Similar sequences in 2 different

organisms
that have arisen due to a speciation event
Functionality Retained.
Relation of
sequences
Paralogs: Similar sequences within a single
organism
that have arisen due to a gene duplication event.
Xenologs: similar sequences that have arisen

out of
horizontal transfer events (symbiosis, viruses,

etc)
Measuring homology by
pairwise alignment
Similar
protein sequence means

similar function.
Infer
the function of the new gene

from the known one.
Needs
alignment of nucleic acid

sequences or HelixInfoSystems
amino acid
33
Database
searching
One
of the most common

sequence alignment application is
querying of sequence databases.
Develop
database query system

for sequence querying.(that is,
making good sequence alignment
algorithm)
HelixInfoSystems
34
Multiple
alignment
Simultaneously aligning a number
of sequences to produce optimal
global or local alignments (of similar
subsequences).
Applications
global alignment : measurement
of evolutionary distance between
species
local alignment : identifying
subdomains HelixInfoSystems
within sequences(ex.35
Finding regions of interest in

nucleic acid sequence
Identification of gene
components.
exon/intron boundaries,
promoters,
HelixInfoSystems
36
Gene finding
Parts
of a gene is scattered over
DNA.
An approach - identify exactly
where four specific signal type can
be found.
The start codon. (ATG)
Beginning of each intron.(GT-)
End of each intron.(-AG)
The stop codon.
HelixInfoSystems
37
Gene finding
The
two major types : eukaryotic

and prokaryotic
eukaryotic gene finding is much
harder
presence of introns.
Coding density is low.
Underlying technologies : hidden
Markov models, decision trees,
HelixInfoSystems
38
neural networks...
Analyzing protein
sequence
Proteins
3D-structure determines
structural and functional properties.
But we only know its 1D-sequence.
Prediction is needed.
Proteins structure
primary structure : sequence of
amino acids
secondary structure : folding of
amino sequence
HelixInfoSystems
39
Whole genome
analysis
Looking at genomes
in their entirety.
Still
growing and needs explorations.

Examples
characterization of genes that can be
assigned to functional roles by homology
examination of specific oligomer content
across the whole genome
identify laterally transferred genes
prediction of events which caused
evolution of two genomes from a common
HelixInfoSystems
40
ancestor
Pairwise Sequence
Similarity
HelixInfoSystems
41
Why compare
sequences?
Understanding the function, role, and/or structure of
biomolecular sequences (DNA, RNA, proteins)
Evolution reuses, builds
modifies
successful structures.
on,
duplicates,
High sequence similarity usually

significant
functional or structural similarity.
HelixInfoSystems
and
implies
42
Protein vs.
Amino acids
DNA
20 letter alphabet
Each with specific chemical

properties essential for protein
function
Nucleotides
4 letter alphabet
Chemical properties of bases
not directly relevant to
function
Also need to search reverse
complement
HelixInfoSystems
43
Fundamental
Concepts
&
Definitions
Question?
Given two strings
what is their distance?
HelixInfoSystems
45
Answer
One way of measuring their distance is to look
at what will take to transform X into Y
Edit distance
Edit operations performed on X to change to Y
I ) insertion, D ) deletion, S ) substitution
M ) match (no edit operation is performed)
Example
X = applied
Y = prince
HelixInfoSystems
46
Example
6 edit operations
Edit transcript DMSDMIIMD
appli__ed
_pr_ince_
Is a sequence of edits
unique?
HelixInfoSystems
47
Example continued
SSSSSMD
applied
prince_
DDMIIIIDDMD
app____lied
__princ__e_
DDDDDDDIIIIII
applied______
_______prince
6 edit operations
9 edit operations
13 edit operati
Which one is better than

the other?
HelixInfoSystems
48
Hamming Distance
The number of mismatching
position between the two strings
Example: The Hamming distance
between "toned" and "roses" is 3.
HelixInfoSystems
49
Levenshtein Distance
The minimum number of edit
operation required to convert one
string to another.
Similar to Edit Distance
HelixInfoSystems
50
Alignments
-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A
Three elements:
Perfect matches
Mismatches
Gaps
HelixInfoSystems
51
Choosing
Alignments
There are many possible alignments
For example, compare:
-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A
to
------GCGCATGGATTGAGCGA
TGCGCC----ATTGATGACCA--
Which one is better?

HelixInfoSystems
52
Scoring
Rule
Example Score =
(# matches) (# mismatches)
(# gaps) x 2
HelixInfoSystems
53
Example
-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A
Score: (+1x13) + (-1x2) + (-2x4) = 3

------GCGCATGGATTGAGCGA
TGCGCC----ATTGATGACCA
Score: (+1x5) + (-1x6) + (-2x11) = -23

HelixInfoSystems
54
The optimal
alignment
An
optimal There are two possible linear alignments
1.
GDVEKGKKIFIMKCSQ
alignment is one
| |||||
GCVEKGKIFINWCSQ
that exhibits the
2.
GDVEKGKKIFIMKCSQ
most
||| |||
correspondences,
GCVEKGKIFINWCSQ
and
the
least
GDVEKGKKIFIMKCSQ
differences. | ||||| ||| |||
GCVEKGK-IFINWCSQ
Insertion of one break maximizes the identities.

HelixInfoSystems
55
Methods of
Sequence Alignment
Dot Matrix
Dynamic
Programming
Word or K-tuple
HelixInfoSystems
56
Dot
Matrix
Established in 1970 by A.J. Gibbs and
G.A.McIntyre
Method for comparing two amino acid
or nucleotide sequences
HelixInfoSystems
57
Dot Matrix
Displays any possible sequence
alignments as diagnols in a matrix
Reveals the presence of any
Insertion/deletion
Direct and Inverted repeats
Limitation:does not show actual
region of alignment
HelixInfoSystems
58
Dynamic
Programming
DP is a method for solving certain kind of problems
DP can be applied when the solution of a
problem includes solutions to sub problems
We need to find a recursive formula for the
solution
We can recursively solve subproblems, starting
from the trivial case, and save their solutions in
memory
In the end well get the solution of the whole
problem
HelixInfoSystems
59
Word or KTuple
Align two sequences very quickly
Searches for identical short stretches
of sequences and joining them
HelixInfoSystems
60
Scoring Matrix
HelixInfoSystems
61
Why need Scoring

Matrix?
To find the best alignment one needs to
examine all
possible alignments
To reflect the quality of the possible
alignments one
needs to score them
There can be different alignments with the
same highest score
Variations in the scoring scheme may
change the ranking of alignments
HelixInfoSystems
62
Scoring
Matrices
Table of values that describe

the probability
of a residue pair occurring in an
alignment
HelixInfoSystems
63
Unitary
Matrix
Simplest scoring scheme
Amino acids pairs are classified into
Identical
A
Non-identical
A 1
Identical pairs are scored 1
Non-identical pairs are scored 0R 0
Less effective for detection of N 0
weak similarities
HelixInfoSystems
2 types:
R
0
1
0
D 0 0
N
0
0
1
0
D
0
0
0
1
64
Percent Accepted
Mutation
PAM (Percent Accepted Mutation) /
MDM (Mutation Data Matrix ) / Dayhof
Similar sequences organized into
phylogenetic trees
Number of amino acid changes counted
Relative mutabilities evaluated
20 x 20 amino acid substitution matrix
calculated
HelixInfoSystems
65
Percent Accepted
Mutation
PAM 1: 1 accepted mutation event per
100 amino acids; PAM 250: 250
mutation events per 100
PAM 1 matrix can be multiplied by itself
N times to give transition matrices for
sequences that have undergone N
mutations
PAM 250: 20% similar; PAM 120: 40%;
PAM 80: 50%; PAM 60: 60%
HelixInfoSystems
66
BLOSUM
Larger set of sequences considered
Sequences organized into signature
blocks
Consensus sequence formed
60% identical: BLOSUM 60
80% identical: BLOSUM 80
HelixInfoSystems
67
Amino acids: chemical

properties
Aliphatic - G, A, V, L, I, P
Aromatic - F, Y, W
Uncharged polar - S, T, N,
Charged - D, E, H, K, R
Sulfur-containing - C, M
HelixInfoSystems
68
BLOSUM 62
Positive for similar amino acid
Common amino acid have low score

Rare amino acid
have high score
HelixInfoSystems
69
Nucleic Acid Scoring

Matrices
Two mutation models:
Uniform mutation rates (Jukes-Cantor)
Two separate mutation rates (Kimura)
Transitions
Transversions
HelixInfoSystems
70
DNA
Mutations
A
PURINES: A, G
PYRIMIDINES C, T
Transitions:
AG, CT
Transversions:
AC, AT
CG, GT
HelixInfoSystems
71
PAM1 DNA odds

matrices
A. Model of uniform mutation rates among nucleotides.
A
G
T
A
0.99
G 0.00333
0.99
T
0.00333
0.00333
0.99
C
0.00333
0.00333
0.00333
B. Model of 3-fold higher transitions than transversions.
A
G
T
A
0.99
G 0.006
0.99
T
0.002
0.002
0.99
C
0.002
0.002
0.006
HelixInfoSystems
0.99
0.99
72
Use of scoring
matrices
Deciding which scoring matrix you
should use in order of obtain the best
alignment results is a difficult task.
Selection of a "wrong" scoring matrix
will most probable strongly influence
on the outcome of the analysis
HelixInfoSystems
73
Correlations between the

PAM and BLOSUM
matrices
The BLOSUM matrices with low numbers
correspond to PAM matrices with high
numbers
HelixInfoSystems
74
Selection of
Scoring Matrices
For closely related sequences choose
BLOSUM matrices created for highly
similar alignments, like BLOSUM80. You
can also select low PAM matrices such
as PAM1.
For distant related sequences, select
low BLOSUM matrices (for example
BLOSUM45) or high PAM matrices such
as PAM250.
HelixInfoSystems
75
Gap
Penalties
The cost of opening a gap is higher than
the gap extension penalty
This reflects the tendency for insertions
and deletions to occur over several
residues at a time
Scoring scheme should penalize new gaps
more
BLOSUM 62: -11 for gap opening; -1 for
gap extension
BLOSUM 50: -12/-1 penalty
HelixInfoSystems
76
Gap
Penalties
Gap penalties are always a problem:

There is no formal way to calculate
their values.
you can follow recommended settings,
but these are based on trial and error
and not on a formal framework.
HelixInfoSystems
77
DOTPLO
A dotplot gives
T an overview of all possible al
A
T
T
C
Sequence 2
A
C
A
T
A
A C
A T
A C
Sequence 1
HelixInfoSystems
A C
78
In a dotplot each diagonal

corresponds to a
possible (ungapped) alignment
A
T
T
C
Sequence 2
A
C
A
T
A
A C
A T T A C1 G
Sequence
A C
T A C A T T A C G T A C
One possible alignment:
A T A C A C T T 79
A
HelixInfoSystems
Window-based
approaches
Word Size
Window / Stringency
HelixInfoSystems
80
Word size
algorithm
Word
Size = 3
T A C G G T A T G
A C A G T A T C
C
T
A
T
G
A
C
A
T A C G G T A T G
A C A G T A T C
T A C G G T A T G
A C A G T A T C
T A C G G T A T G
T A C G G T A T G
A C A G T A T C
HelixInfoSystems
81
Window / stringency
T A C G G T A T G
Window = 5 / Stringency = 4
T C A G T A T C
T A C G G T A T G
T C A G T A T C
T A C G G T A T G
T C A G T A T C
T A C G G T A T G
T C A G T A T C
C
T
A
T
G
A
C
A
T A C G G T A T G
HelixInfoSystems
82
Considerati
ons method is
The window/stringency
more sensitive than the wordsize
method (ambiguities are permitted).
The smaller the window, the larger
the weight of statistical (unspecific)
matches.
With large windows the sensitivity for
short sequences is reduced.
HelixInfoSystems
83
Insertions / Deletions
inT a Dotplot
Sequence 2
A
C
T
G
T
C
A
T
T
Sequence 1
T A C T G - T C A T
| | | | |
| | | |
T A C T G T T C A T
HelixInfoSystems
84
Hemoglobin
-chain
Dotplot
(Window =
130 /
Stringency
= 9)
HelixInfoSystems
85
Hemoglobin
-chain
Dotplot
(Window
= 18 /
Stringency
= 10)
HelixInfoSystems
86
Conclusions
What about these alike areas?
Whats the best path through the dot
matrix?
How long do I extend it?
How can I zoom-in on it to see exactly
whats happening?
Where, specifically,
is this alignment; how
HelixInfoSystems
87
Dynamic
Programming
Dynamic programming
Dynamic
programming
guaranteed to
algorithms
are
find the optimal scoring alignment or set of

alignments.
Dynamic programming algorithms are core to
computational sequence analysis.
HelixInfoSystems
89
Dynamic Programming
A computational method that works well
for some types of optimization problems.
It works well for problems that can be
solved in a stepwise manner, always
saving just the best solution so far at
each step. (Its efficient because you
disregard loser paths.)
Works well for pair-wise sequence
alignment.of proteins or nucleic acids.
HelixInfoSystems
90
Dynamic Programming
omatic procedure that finds the best align
optimal score depending on the chosen par
Needleman and Wunsch Algorithm

- Global Alignment Smith and Waterman Algorithm
- Local Alignment HelixInfoSystems
91
Needleman Wunsch
Three
steps
programming
in
dynamic
1. Initialization
2. Matrix fill (scoring)
3. Traceback (alignment)
HelixInfoSystems
92
Needleman and
Wunsch
(global alignment)
Sequence 1:
EE
Sequence 2:
HEAGAWGH
PAWHEAE
Scoring parameters:
Gap penalty:
penalty of 8
BLOSUM50
Linear gap
HelixInfoSystems
93
Step1: Initializaton
The
first step in the global alignment

dynamic programming approach is
to create a matrix with M + 1
columns and N + 1 rows
and N correspond to the size of

the sequences to be aligned.
HelixInfoSystems
94
Score Matrix
(BLOSUM 50)
H
-2
-1
-1
-2
-1
-4
-2
-2
-1
-1
-2
-1
-3
-2
-1
-1
-3
-3
-3
-3
-3
15
-3
-3
-3
-3
10
-2
-2
-2
-3
-2
10
-1
-3
-1
-3
-3
-2
-1
-3
-2
-1
-1
-1
-3
-1
-3
-3
HelixInfoSystems
95
Basic principles of dynamic

programming
- Creation of an alignment path mat
- Stepwise calculation of score values
- Backtracking (evaluation of the opt

HelixInfoSystems
96
Creation of an alignment
path
matrix
If F(i-1,j-1),
F(i-1,j) and F(i,j-1) are known
we can calculate F(i,j)
Three possibilities:
xi and yj are aligned, F(i,j) = F(i-1,j1) + s(xi ,yj)
xi is aligned to a gap, F(i,j) = F(i1,j) - d
yj is aligned to a gap, F(i,j) = F(i,j1) - d
HelixInfoSystems
97
Creation of an
alignmentF(i,path
matrix
j) = F(i-1, j-1) + s(x ,y )
i
F(i, j) = max
F(i, j) = F(i-1, j) - d
F(i, j) = F(i, j-1) - d
F(i-1, j-1)
F(i, j-1)
s(xi ,yj)
F(i-1,j)
-d
-d
F(i, j)
HelixInfoSystems
98
Creation of an alignment
H
Ematrix
A
G
A
W
G
H
E
E
path
0
-8
-16 -24 -32 -40 -48 -56 -64 -72 -80
P
-8
A
-16
W
Boundary conditions
-24
F(i, 0) = -i d
-32
F(j, 0) = -j d
H
E
A
-40
E
-48
HelixInfoSystems
99
Creation of an
H
E
A
G
A
W
G
H
E
alignment
path matrix
0
-8
-16 -24 -32 -40 -48 -56 -64 -72
-2
P
-8
F(i, j) = max
-10
A
-16
-24
-9
-3
E
-80
F(i, j) = F(i-1, j-1) + s(xi ,yj)
P-H=-2
F(i, j) = F(i-1, j) - d
E-P=-1
F(i, j) = F(i, j-1) - d
H-A=-2
F(0,0) + s(xi ,yj) = 0 -2 = -2

F(1,1) = max F(0,1) - d
= -8 -8= -16
F(1,0) - d
= -8 -8= -16
= -2
E-A=-1
-32
F(1,0) + s(xi ,yj) = -8 -1 = -9
-40
F(2,1) = max F(1,1) - d

F(2,0) - d
-48
-56
= -9
= -16 -8= -24
-2 -1 = -3
-8 -2 = -10
F(1,2) = max -16 -8 = -24 = -10
-2 -8 = -10
= -2 -8 = -10
F(2,2) = max
-10 -8 = -18 = -3
HelixInfoSystems-9
-8 = -17
100
Backtrac
king
HelixInfoSystems
101
E
-16
A
-24
G
-32
A
-40
W
-48
G
-56
H
-64
E
-72
E
-80
-2
-9
-17
-17
-25
-25
-42
-49
-57
-65
-73
-12
-33
-20
-20
A -16
-10
-3
-4
-52
-60
-7
-15
-36
-13
-13
-44
-6
-28
-5
-5
W -24
-18
-11
-29
-37
-18
-13
-8
-9
-13
-7
-21
-3
-3
H -32
-14
-19
-22
-8
-16
-16
-9
-12
-15
-7
A -48
-30
-16
-3
-11
-11
-12
-12
-15
-11
3
3
-5
-5
E -40
E -56
-38
-24 -11 -6
-12 -14 -15 -12
Optimal global alignment: HEAG AWGHE-E
--P- AW-HEAE
0
P
-8
H
-8
HelixInfoSystems
-5
-9
102
2
1
1
Smith and
Waterman
(local alignment)
Two differences:
0
1. F(i, j) = max
F(i, j) = F(i-1, j-1) + s(xi ,yj)

F(i, j) = F(i-1, j) - d
F(i, j) = F(i, j-1) - d
2. An alignment can now end anywhere in the matrix

Example:
Sequence 1
Sequence 2
H EAGAW G H E E
PAW H EAE
Scoring parameters:
Gap penalty:
BLOSUM
Linear gap penalty of 8
HelixInfoSystems
103
H
0
E
0
A
0
G
0
A
0
W
0
G
0
H
0
E
0
E
0
0
12
12
0
6
0
0
0
5
5
0
20
20
10
12
18
4
22
22
16
10
18
14
28
28
21
13
10
20
27
6
13
18
12
4
0
A WGH E
Optimal local alignment:
A W-H E
16
26
HelixInfoSystems
20
104
Extended Smith &

Waterman
To get multiple local alignments:

Delete regions around best path
Repeat backtracking
HelixInfoSystems
105
Extended Smith & Waterman

0
H
0
E
0
A
0
G
0
0
H
E
A
E
0
0
0
0
10
2
0
0
2
16
8
6
0
8
21
13
0
0
13
18
A
0
W
0
G
0
H
0
E
0
0
5
0
20
0
12
0
4
12
18
22
0
14
0
6
10
18
28
20
10
20
27
16
26
0
0
0
5
12
0
4
HelixInfoSystems
E
0
106
H
0
E
0
A
0
G
0
A
0
W
0
G
0
H
0
E
0
E
0
0
0
0
0
10
10
2
16
16
8
21
21
13
13
18
12
4
Second best local alignment:
HelixInfoSystems
0
HEA
HEA
107
Dynamic programming
Methods
Is computational intensive, for 2 sequence

comparison, requiring O(n2) space and
time.
So, not suitable for searching the best

aligned sequence in databases.
Cannot be used for >3 sequences.

Many heuristic methods are developed
e.g.,BLAST.
HelixInfoSystems
108

Pairwise Sequence Allignment

Uploaded by

Copyright:

Available Formats

Pairwise Sequence Allignment

Uploaded by

Document Information

Original Description:

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Pairwise Sequence Allignment

Uploaded by

Copyright:

Available Formats

Bioinforma

Whats Really Next

To develop tools,algorithms and statistics

Need for faster, automated analysis to process

The I.U.B. Code

Given the size of the amino acid

- Exploring the unknown information

What kinds of analyses

Determine the frequency of occurrence

Homologs: similar sequences in 2 different

Orthologs: Similar sequences in 2 different

that have arisen due to a gene duplication event.

Xenologs: similar sequences that have arisen

horizontal transfer events (symbiosis, viruses,

protein sequence means

the function of the new gene

alignment of nucleic acid

of the most common

database query system

Finding regions of interest in

of a gene is scattered over

two major types : eukaryotic

growing and needs explorations.

High sequence similarity usually

Each with specific chemical

what is their distance?

Which one is better than

Which one is better?

Score: (+1x13) + (-1x2) + (-2x4) = 3

Score: (+1x5) + (-1x6) + (-2x11) = -23

Insertion of one break maximizes the identities.

Why need Scoring

Table of values that describe

Amino acids: chemical

Positive for similar amino acid

Common amino acid have low score

Nucleic Acid Scoring

PAM1 DNA odds

Correlations between the

Gap penalties are always a problem:

In a dotplot each diagonal

find the optimal scoring alignment or set of

omatic procedure that finds the best align

optimal score depending on the chosen par

Needleman and Wunsch Algorithm

first step in the global alignment

and N correspond to the size of

Basic principles of dynamic

- Creation of an alignment path mat

- Stepwise calculation of score values

- Backtracking (evaluation of the opt

F(i, j) = F(i-1, j-1) + s(xi ,yj)

F(i, j) = F(i, j-1) - d

F(0,0) + s(xi ,yj) = 0 -2 = -2

F(2,1) = max F(1,1) - d

= -16 -8= -24

F(i, j) = F(i-1, j-1) + s(xi ,yj)

2. An alignment can now end anywhere in the matrix

Extended Smith &