CE6068 Lecture 4

Download as pdf or txt
Download as pdf or txt
You are on page 1of 82

CE6068

Bioinformatics and Computational Molecular


Advanced Sequence Alignment
Chia-Ru Chung
Department of Computer Science and Information Engineering
National Central University
2024/3/27
Outline

• Quick Review
• Introduction to Sequence Alignment
• Methods for Sequence Alignment
• Multiple Sequence Alignment

1
Quick Review
20240320 Exercise #20
• Gene expression profiling with DNA microarrays allows researchers to:
(A) Determine the three-dimensional structure of proteins
(B) Identify mutations in genomic DNA (C) Directly sequence entire genomes
(D) Observe the activity levels of genes under different conditions
DNA microarrays assess the activity levels of many genes simultaneously under various conditions, not
directly sequence genomes. They measure mRNA levels to understand gene expression patterns, unlike
sequencing technologies that read the nucleotide sequence of DNA. This distinction is crucial for their
application in research.

3
Microarray
Ref. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1684095

• Microarrays are a technology used to assess the presence and quantity of nucleic acid
sequences in a sample. They rely on predefined probes attached to a solid surface;
these probes hybridize with labeled target sequences in the sample, allowing for the
parallel analysis of thousands of genes at once.
• DNA microarrays, also known as DNA chips or biochips, are a technology used to
measure the expression levels of thousands of genes simultaneously or to genotype
multiple regions of a genome.

4
Major Purpose of Microarray
• Gene Expression Profiling: The primary purpose of microarrays is to measure the
expression levels of thousands of genes simultaneously within a biological sample. By
comparing gene expression profiles under different conditions, researchers can
identify genes that are upregulated or downregulated, providing insights into cellular
responses, disease mechanisms, and treatment effects.
• Genotyping and SNP Detection: Microarrays are also used for genotyping and
單核苷酸多態性
detecting single nucleotide polymorphisms (SNPs) across the genome. This
application is valuable in genetic studies, population genetics, and personalized
medicine. 5
Major Purpose of Sequencing
• Genome Sequencing: Determining the exact sequence of nucleotides in an organism's DNA.
This includes identifying all the genetic material in an organism, from coding regions (genes)
to non-coding regions.
• Transcriptome Analysis (RNA Sequencing): Sequencing RNA molecules to study the
transcriptome, the complete set of RNA transcripts produced by the genome under specific
circumstances. This analysis reveals which genes are being actively transcribed and how gene
expression changes in different conditions.
• Variant Analysis: Identifying mutations, SNPs, insertions, deletions, and structural variants
in the DNA. This application is crucial for understanding genetic diseases, evolutionary
biology, and breeding programs.
6
DNA Sequencing
• Whole-Genome Sequencing: Analyzing the entire genome of an organism to
discover genetic variations, understand evolutionary relationships, and identify
genetic predispositions to diseases.
• Targeted Sequencing: Focusing on specific regions of interest within the genome,
such as genes associated with particular traits or diseases, to study genetic variations
and mutations in detail.
• Metagenomics: Sequencing DNA from environmental samples to study the diversity
and functions of microbial communities without the need for culturing.
7
RNA Sequencing
• Gene Expression Profiling: Measuring the expression levels of genes across
different conditions, tissues, or developmental stages to understand gene function and
regulation.
• Alternative Splicing Analysis: Identifying different splice variants of a gene, which
can produce multiple protein isoforms with diverse functions.
• Non-coding RNA Analysis: Studying non-coding RNA species, such as microRNAs
and long non-coding RNAs, which play critical roles in gene regulation and cellular
processes.
8
Three Generations of Sequencing Technologies (1/4)

• First Generation: Sanger Sequencing


‐ Developed in the 1970s by Frederick Sanger, this method was the first to allow DNA sequencing. It is based on
chain termination or dideoxy sequencing.
‐ The process involves synthesizing a complementary DNA strand from a single-stranded template in the
presence of normal nucleotides and dideoxynucleotides (ddNTPs), which terminate the DNA chain when
incorporated. The resulting fragments of various lengths are then separated by capillary electrophoresis,
allowing the sequence to be read.
‐ Sanger sequencing was used for the first human genome project due to its high accuracy and reliability. It's
suitable for sequencing small DNA fragments (up to 1,000 base pairs) and is still used for applications
requiring precise, long-read sequencing, such as validating mutations identified by other methods.
9
Three Generations of Sequencing Technologies (2/4)

• Second Generation: Next-Generation Sequencing (NGS)


‐ Emerging in the mid-2000s, NGS technologies enabled massively parallel sequencing of millions of DNA
molecules simultaneously, drastically reducing the cost and time required for sequencing.
‐ NGS involves fragmenting DNA, adding adapters, and amplifying the fragments on a solid surface or in
emulsion droplets. Sequencing is performed in a massively parallel manner, often by synthesis (Illumina
sequencing) or by sequencing by ligation (SOLiD sequencing). The reads are then aligned to a reference
genome or assembled de novo.
‐ NGS revolutionized genomics with its ability to rapidly sequence whole genomes or transcriptomes, making it
invaluable for large-scale projects, comprehensive genetic screening, and personalized medicine. Its
applications include whole-genome sequencing, targeted resequencing, RNA sequencing (RNA-seq), and
epigenetic studies. 10
Three Generations of Sequencing Technologies (3/4)

• Third Generation: Single-Molecule Sequencing


‐ The latest advancements in sequencing technology focus on sequencing single DNA molecules without prior
amplification, providing real-time data and significantly longer read lengths.
‐ Technologies like Pacific Biosciences (PacBio) and Oxford Nanopore Technologies sequence DNA by
monitoring the incorporation of nucleotides into a growing DNA strand in real time. PacBio uses an optical
method to detect nucleotide incorporation, while Oxford Nanopore passes DNA strands through nanopores,
measuring changes in electrical current to determine the sequence.
‐ Third-generation sequencing offers several advantages, including the ability to generate long reads (over
10,000 base pairs), which is crucial for assembling complex genomes, resolving structural variations, and
characterizing full-length transcripts in transcriptome studies. It also reduces biases and errors associated with
PCR amplification. 11
Three Generations of Sequencing Technologies (4/4)

Ref. https://journals.asm.org/doi/10.1128/cmr.00056-16 12
FASTA
• FASTA is a text-based format for representing nucleotide sequences (DNA or RNA)
or peptide sequences, where each sequence is denoted by a single-letter code. It
begins with a single-line description, preceded by a greater-than (“>”) symbol,
followed by lines of sequence data.
• FASTA format is widely used for sequence storage, submission to databases, and as
input for various bioinformatics tools, including sequence alignment programs and
databases like BLAST. It's favored for its simplicity and compatibility with a wide
range of software.
13
FASTQ
• FASTQ is a widely used text-based format for storing both biological sequence data
(usually nucleotide sequences) and corresponding quality scores.
• Each sequence in a FASTQ file is accompanied by a quality score for each nucleotide,
indicating the confidence or accuracy of sequencing.
• A single FASTQ record consists of four lines: a sequence identifier with optional
description, the raw sequence letters, a separator (usually a plus sign, "+"), and the
quality scores encoded as ASCII characters.

14
FASTQ
• FASTQ is a widely used text-based format for storing both biological sequence data
(usually nucleotide sequences) and corresponding quality scores.
• Each sequence in a FASTQ file is accompanied by a quality score for each nucleotide,
indicating the confidence or accuracy of sequencing.
• A single FASTQ record consists of four lines: a sequence identifier with optional
description, the raw sequence letters, a separator (usually a plus sign, "+"), and the
Ref. https://www.mdpi.com/2078-2489/7/4/56

quality scores encoded as ASCII characters.

15
FASTQC
• FASTQC is a quality control tool for high-throughput sequencing data. It provides a
simple way to do some quality checks on raw sequence data coming from high
throughput sequencing pipelines.
• FASTQC reads FASTQ files, and for each file, it produces a report with various
quality metrics that help in identifying potential problems with the data. The report
includes metrics such as per base sequence quality, sequence duplication levels,
overrepresented sequences, and k-mer content.

16
FASTQC – Quality Scores (1/2)

Ref. https://liulab-dfci.github.io/bioinfo-combio/ngs.html

17
FASTQC – Quality Scores (2/2)

Ref. https://liulab-dfci.github.io/bioinfo-combio/ngs.html

18
Relationship among FASTA, FASTQ, and FastQC

• FASTA provides the basic format for representing biological sequences without
quality information.
• FASTQ builds upon FASTA by including quality scores for each base in the sequence,
reflecting the reliability of sequencing data.
• FastQC analyzes FASTQ files to evaluate the quality of sequencing data, guiding
researchers in optimizing data preprocessing steps such as trimming low-quality bases
or filtering out low-quality reads.

19
Earliest Research in Sequence Comparison
• Doolittle et al. (Science, July 1983) searched for platelet-derived growth factor (PDGF) in his
own DB. He found that PDGF is similar to v-sis onc gene.
‐ PDGF-2 1 SLGSLTIAEPAMIAECKTREEVFCICRRL?DR?? 34
‐ P28sis 61 LARGKRSLGSLSVAEPAMIAECKTRTEVFEISRRLIDRTN 100
• Riordan et al. (Science, Sept 1989) wanted to understand the cystic fibrosis gene:

Ref. 10.1126/science.2475911
20
Why Comparing Sequences is Necessary
• Biology has the following conjecture
Given two DNAs (or RNAs, or Proteins),
high similarity → similar function or similar 3D structure
• In bioinformatics, we always compare the similarity of two biological sequences.

21
Applications of Sequence Comparison (1/2)
• Inferring the biological function of gene (or RNA or protein)
When two genes look similar, we conjecture that both genes have similar function
• Finding the evolution distance between two species
Evolution modifies the DNA of species. By measuring the similarity of their genome,
we know their evolution distance

22
Applications of Sequence Comparison (2/2)
• Helping genome assembly 序列組裝
Based on the overlapping information of a huge amount of short DNA pieces, Human
genome project reconstructs the whole genome. The overlapping information is done
by sequence comparison.
• Finding common subsequences in two genomes
• Finding repeats within a genome

23
Introduction to Sequence Alignment
String Edit (1/2)
• String edit, also known as edit distance, refers to a way of quantifying how dissimilar
two strings (e.g., words, sequences) are to one another by counting the minimum
number of operations required to transform one string into the other.
• The operations typically considered in string edit include insertion, deletion, and
substitution of characters.
• This concept is foundational in computer science and bioinformatics, especially in
tasks involving text comparison, such as spell checking, plagiarism detection, and
sequence alignment.
25
String Edit (2/2)
• Given two strings A and B, edit A to B with the minimum number of edit operations:
‐ Replace a letter with another letter
‐ Insert a letter
‐ Delete a letter
• Example:
A = interestingly _i nterestingly
B = bioinformatics bioinformatics__
1011011011001111
Edit distance = 11
26
String Edit Problem (1/3)
• The String Edit Problem is a classic issue in computer science, focusing on
determining the minimum number of operations required to transform one string into
another.
• The core of this problem lies in quantifying the dissimilarity between two strings,
often with the aim of comparing texts or biological sequences (such as DNA, RNA, or
proteins).
• The operations considered in solving the String Edit Problem typically include
insertion, deletion, and substitution.
27
String Edit Problem (2/3)
• Instead of minimizing the number of edge operations, we can associate a cost function
to the operations and minimize the total cost. Such cost is called edit distance.
• Example:
A = interestingly _i__nterestingly
B = bioinformatics bioinformatics__
1011011011001111
Edit distance = 11

28
String Edit Problem (3/3)
• Let Σ be the alphabet set and ‘_’ be a special symbol representing a null symbol.
• The cost of each edit operation can be specified by a matrix σ where σ(x, y) equals the
cost of replacing x by y, for x, y ∈ Σ∪{_}.
• Note that σ(_, x) and σ(x, _) denote the cost of insertion and deletion, respectively.
• The cost of a set of edit operations E equals Σ(x,y)∈E σ(x, y).
• Example: if the costs of mismatch, insert, and delete are 1, 2, and 2, respectively, then
the total cost to transform interestingly to bioinformatics is 16.

29
Applications of String Edit in Bioinformatics
• String edit is particularly important in sequence alignment algorithms, where it's used
to compare genetic sequences (DNA, RNA) or protein sequences.
• The edit distance helps identify evolutionary relationships between sequences by
measuring their similarity or dissimilarity, assisting in the reconstruction of
phylogenetic trees, the prediction of functional similarities, and the understanding of
genetic diseases.

30
String Alignment Problem (1/5)
• Instead of using string edit, in computational biology, people like to use string
alignment.
• We use similarity function, instead of cost function, to evaluate the goodness of the
alignment.
• DEFINITION 2.1 in Algorithms in Bioinformatics: A Practical Introduction by Sung
An alignment of two sequences is formed by inserting spaces in arbitrary locations
along the sequences so that they end up with the same length and there are no two
spaces at the same position of the two augmented sequences.
31
String Alignment Problem (2/5)
• Consider two strings ACAATCC and AGCATGC.
• One of their alignment is
insert mismatch
A_CAATCC
match
AGCA_TGC
• In the above alignment, delete

‐ space (‘_’) is introduced to both strings


‐ There are 5 matches, 1 mismatch, 1 insert, and 1 delete.

32
String Alignment Problem (3/5)
• In general, we can associate a similarity score with every pair of aligned characters.
• Let Σ be the alphabet set and be a special symbol representing a null symbol.
• The similarity of a pair of aligned characters can be specified by a matrix δ, where δ(x,
y) equals the similarity of x and y for x, y ∈ Σ∪{_}.
• The aim of the global alignment problem is to find an alignment A which maximizes
Σ(x,y)∈A δ(x, y).

33
String Alignment Problem (4/5)
• The alignment has similarity score 7
A_CAATCC
AGCA_TGC
• The above alignment has the maximum score.
• Such alignment is called optimal alignment.
• String alignment problem tries to find the alignment with the maximum similarity
score!
• String alignment problem is also called global alignment problem.
34
String Alignment Problem (5/5)
• LEMMA 2.1 in Algorithms in Bioinformatics: A Practical Introduction by Sung
Let σ be the cost matrix of the edit distance problem and δ be the score matrix of the
global alignment problem. If δ(x, y) = −σ(x, y) for all x, y ∈ Σ∪{_}, the solution to the
edit distance problem is equivalent to the solution to the string alignment problem.
• String alignment problem and string edit distance are dual problems.
• Although string edit and global alignment are equivalent, people in computational
biology prefer to use global alignment to measure the similarity of DNA, RNA, and
protein sequences.
35
Methods for Sequence Alignment
Needleman-Wunsch Algorithm (1/2)
• Consider two strings S[1..n] and T [1..m].
• Define V(i, j) be the score of the optimal alignment between S[1..i] and T[1...j].
• Basis:
‐ V(0, 0) = 0
‐ V(0, j) = V(0, j-1) + δ(_, T[j])
Insert j times
‐ V(i, 0) = V(i-1, 0) + δ(S[i], _)
Delete i times

37
Needleman-Wunsch Algorithm (2/2)
• Recurrence: For i>0, j>0
match /mismatch
delete
insert
• In the alignment, the last pair must be either match/mismatch, delete, or insert.

match /mismatch delete insert 38


Example (1/4)
- A G C A T G C
• Sequences:
-
‣ Sequence A (horizontal, columns): AGCATGC A
‣ Sequence B (vertical, rows): ACAATCC C
• Scoring system: A

‣ +2 for a match A

‣ -1 for a mismatch T

‣ -1 for a gap C
C
39
Example (2/4)
- A G C A T G C
• Step 1: Initialization
- 0 -1 -2 -3 -4 -5 -6 -7
‣ First, we create a matrix with Sequence A on A -1
the top and Sequence B on the side. C -2
‣ The matrix dimensions will be (length of A -3

Sequence B + 1) x (length of Sequence A + 1). A -4

‣ We then initialize the first row and column T -5

based on the gap penalty. C -6


C -7
40
Example (3/4)
- A G C A T G C
• Step 2: Filling the Matrix
- 0 -1 -2 -3 -4 -5 -6 -7
We fill in each cell using the scoring rules, A -1 2
choosing the highest score from: C -2
‣ Diagonal cell + match/mismatch score A -3

‣ Left cell + gap penalty A -4

‣ Upper cell + gap penalty T -5


C -6
For cell [1,1] (comparing A vs A):
Max(diagonal + match, left + gap, up + gap) = Max(0+2, -1-1, -1-1) C -7
=2
41
Example (3/4)
- A G C A T G C
• Step 2: Filling the Matrix
- 0 -1 -2 -3 -4 -5 -6 -7
We fill in each cell using the scoring rules, A -1 2 1
choosing the highest score from: C -2
‣ Diagonal cell + match/mismatch score A -3

‣ Left cell + gap penalty A -4

‣ Upper cell + gap penalty T -5


C -6
For cell [1,2] (comparing g vs A):
Max(diagonal + mismatch, up + gap, left + gap) = Max(-1-1, -2-1, 2- C -7
1) = 1
42
Example (3/4)
- A G C A T G C
• Step 2: Filling the Matrix
- 0 -1 -2 -3 -4 -5 -6 -7
We fill in each cell using the scoring rules, A -1 2 1 0 -1 -2 -3 -4
choosing the highest score from: C -2 1 1 3 2 1 0 -1
‣ Diagonal cell + match/mismatch score A -3 0 0 2 5 4 3 2

‣ Left cell + gap penalty A -4 -1 -1 1 4 4 3 2

‣ Upper cell + gap penalty T -5 -2 -2 0 3 6 5 4


C -6 -3 -3 0 2 5 5 7
C -7 -4 -4 -1 1 4 4 7
43
Example (4/4)
• Step 3: Traceback
Starting from the bottom-right corner, we
trace back to the origin (top-left), choosing
paths based on the scoring decisions made
during matrix filling. The path indicates the
alignment. Sequence A: AGCA_TGC
Sequence B: A_CAATCC
The optimal alignment is not unique.
We may get another optimal alignment:
A − CAATCC and AGC − ATGC. 44
Local Alignment
• Given two long DNAs, both of them contain the same gene or closely related gene.
• Local alignment problem:
Given two strings S[1..n] and T[1..m],
among all substrings of S and T,
find substrings A of S and B of T whose global alignment has the highest score

45
Brute-force Solution
• Algorithm:
For every substring A = S[i’..i] of S,
For every substring B = T[j’..j] of T,
Compute the global alignment of A and B
Return the pair (A, B) with the highest score
• Time:
There are n2/2 choices of A and m2/2 choices of B.
The global alignment of A and B can be computed in O(nm) time.
In total, time complexity = O(n3m3)
46
Some Background
• X is a suffix of S[1..n] if X=S[k..n] for some k ≥ 1
• X is a prefix of S[1..n] if X=S[1..k] for some k ≤ n
• Example:
‐ Consider S[1..7] = ACCGATT
‐ ACC is a prefix of S, GATT is a suffix of S
‐ Empty string is both prefix and suffix of S

47
Smith-Waterman Algorithm (1/2)
• Consider two strings S[1..n] and T [1..m].
• Define V (i, j) be the maximum score of the global alignment of A and B over all
substrings A of S end at i and all substrings B of T end at j, where 1 ≤ i ≤ n and 1 ≤ j ≤ m.
all suffixes A of S[1..i] all suffixes B of T[1..j]

• Assume an empty string is also a substring of S ends at i and a substring of T ends at j.


• The score of the optimal local alignment is max1≤i≤n,1≤j≤m V (i, j).

48
Smith-Waterman Algorithm (2/2)
• The recursive formula for V (i, j) depending on two cases:
(1) either i = 0 or j = 0 and (2) both i > 0 and j > 0.
1) either i = 0 or j = 0:
The best alignment aligns the empty strings of S and T. Hence, we have
V (i, 0) = 0 for 0 ≤ i ≤ n and V (0, j) = 0 for 0 ≤ j ≤ m
2) both i > 0 and j > 0

49
Example (1/4)
- C T C A T G C
• Sequences:
-
‣ Sequence A (horizontal, columns): CTCATGC A
‣ Sequence B (vertical, rows): ACAATCG C
• Scoring system: A

‣ +2 for a match A

‣ -1 for a mismatch T

‣ -1 for a gap C
G
50
Example (2/4)
- C T C A T G C
• Step 1: Initialization
- 0 0 0 0 0 0 0 0
‣ Create a scoring matrix with rows and A 0
columns representing Sequence A and C 0
Sequence B, respectively, plus an extra row A 0

and column at the beginning for initialization. A 0

‣ Initialize the first row and column with 0s T 0

(since the Smith-Waterman algorithm allows C 0


G 0
alignments to start and end anywhere).
51
Example (3/4)
- C T C A T G C
• Step 2: Filling the Matrix
- 0 0 0 0 0 0 0 0
We fill in each cell using the scoring rules, A 0 0
choosing the highest score from: C 0
‣ Diagonal cell + match/mismatch score A 0

‣ Left cell + gap penalty A 0

‣ Upper cell + gap penalty T 0

‣ 0 C 0
For cell [1,1] (comparing C vs A): G 0
Max(0, diagonal + mismatch, left + gap, up + gap) = Max(0,0-1, 0-1, 0-1)=0
52
Example (3/4)
- C T C A T G C
• Step 2: Filling the Matrix
- 0 0 0 0 0 0 0 0
We fill in each cell using the scoring rules, A 0 0 0
choosing the highest score from: C 0
‣ Diagonal cell + match/mismatch score A 0

‣ Left cell + gap penalty A 0

‣ Upper cell + gap penalty T 0

‣ 0 C 0
For cell [1,2] (comparing T vs A): G 0
Max(0, diagonal + mismatch, left + gap, up + gap) = Max(0,0-1, 0-1, 0-1)=0
53
Example (3/4)
- C T C A T G C
• Step 2: Filling the Matrix
- 0 0 0 0 0 0 0 0
We fill in each cell using the scoring rules, A 0 0 0 0 2 1 0 0
choosing the highest score from: C 0 2 1 2 1 1 0 2
‣ Diagonal cell + match/mismatch score A 0 1 1 1 4 3 2 1

‣ Left cell + gap penalty A 0 0 0 0 3 3 2 1

‣ Upper cell + gap penalty T 0 0 2 1 2 5 4 3

‣ 0 C 0 2 1 4 3 4 4 6
G 0 1 1 3 3 3 6 5
54
Example (4/4)
CAATCG
C_AT_G
• Step 3: Traceback
Identify the highest score in the matrix.
Traceback from this highest score to where the
score drops to 0, determining the path of the
alignment. The traceback rule is to move from a
cell to its origin (diagonal for match/mismatch,
up/left for gaps) until you hit a 0, indicating the
start of the local alignment.

55
Smith-Waterman Algorithm – Complexity
• Time Complexity:
The time complexity of the Smith-Waterman algorithm is O(mn), as it involves filling
a matrix of size m x n where m and n are the lengths of the two sequences.
• Space Complexity:
The space complexity is also O(mn) due to the need to store the entire matrix for the
traceback process. For long sequences, this can be computationally expensive.

56
Semi-global Alignment (1/2)
• Semi-global alignment is similar to global alignment, in the sense that it tries to align
two sequences as a whole. The difference lies in the way it scores alignments.
• Semi-global alignment ignores some end spaces.
• Example: ignoring beginning and ending spaces of the second sequence.
ATCCGAA_CATCCAATCGAAGC Semi-global alignment assigns no cost to spaces that appear
______AGCATGCAAT______ before the first character and/or after the last character.
If we compute the optimal
‐ The score of the alignment is 14 global alignment, we get
ATCCGAACATCCAATCGAAGC
‐ 8 matches (score=16), 1 delete (score=-1), 1 mismatch (score=-1)
A___G__CATGCAAT______
‐ This alignment can be used to locate gene in a prokaryotic genome.
57
Semi-global Alignment (2/2)
• Example: ignoring beginning spaces of the 1st sequence and ending spaces of the 2nd
sequence
_________ACCTCACGATCCGA
TCAACGATCACCGCA________
‐ The score of above alignment is 9
‐ 5 matches (score=10), 1 mismatch (score=-1)
‐ This alignment can be used to find the common region of two overlapping sequences

58
How to Compute Semi-global Alignment
• In general, we can forgive spaces
‐ in the beginning or ending of S[1..n]
‐ in the beginning or ending of T[1..m]
• Semi-global alignment can be computed using the dynamic programming for global
alignment with some small changes.

59
Gaps
• A gap in an alignment is a maximal substring of contiguous spaces in either sequence of the
alignment
A_CAACTCGCCTCC
AGCA_______TGC
• Previous discussion assumes the penalty for insert/delete is proportional to the length of a gap!
• This assumption may not be valid in some applications, for examples:
‐ Mutation may cause insertion/deletion of a large substring. Such kind of mutation may be as likely
as insertion/deletion of a single base.
‐ Recall that mRNA misses the introns. When aligning mRNA with its gene, the penalty should not be
proportional to the length of the gaps.
60
General Gap Penalty (1/2)
• Definition: g(q) is denoted as the penalty of a gap of length q
• Global alignment of S[1..n] and T[1..m]:
‐ Denote V(i, j) be the score for global alignment between S[1..i] and T[1..j].
‐ Base cases:
V(0, 0) = 0
V(0, j) = -g(j)
V(i, 0) = -g(i)

61
General Gap Penalty (2/2)
• When i > 0 and j > 0, V (i, j) can be computed by the following recursive equation.

• To compute the optimal global alignment score under the general gap penalty, we just
fill in the table V row by row.
• Then, the optimal alignment score is stored in V (n,m).
• The optimal alignment can be recovered by back-tracing the dynamic programming
table V . 62
Scoring Function for DNA
• For DNA, since we only have 4 nucleotides, the score function is simple.
• The NCBI-BLASTN set the match score and the mismatch score to +2 and -1,
respectively.
• The WU-BLASTN and FastA set the match score and the mismatch score to +5 and -
4, respectively.
• The score function for NCBI-BLASTN is better for detecting homologous alignment
with 95% identity. For WU-BLASTN and FastA, their scoring function is more
suitable to detect homologous alignment with 65% identity.
63
Scoring Function for DNA
• In the evolution of real sequences, transitions are observed more often than
transversions.
• Transition transversion matrix:
嘌呤
‐ The four nucleotides are separated into two groups: purines (A and G) and
嘧啶
pyrimidines (C and T).
‐ It gives a mild penalty (-1) for replacing purine by purine or pyrimidine by
pyrimidine. For substitution between purine and pyrimidine, it gives a bigger
penalty (-5). Match score is +1.
64
Scoring Function for Protein
• There are two approaches to assign a similarity score.
• The first approach assigns a similarity score based on the chemical or physical
properties of amino acids, including hydrophobicity, charge, electronegative, and size.
The basic assumption is that an amino acid is more likely to be
• Another approach assigns a similarity score purely based on statistics.
‐ two popular scoring functions PAM and BLOSUM
‐ Both methods define the score as the log-odds ratio between the observed
substitution rate and the actual substitution rate
65
Point Accepted Mutation Score Matrix (1/2)
• Point Accepted Mutation (PAM) was developed by Dayhoff (1978).
• A point mutation means substituting one residue by another.
• It is called an accepted point mutation if the mutation does not change the protein’s
function or is not fatal.
• Two sequence S1and S2 are said to be 1 PAM diverged if a series of accepted point
mutation can convert S1 to S2 with an average of 1 accepted point mutation per 100
residues.

66
Point Accepted Mutation Score Matrix (2/2)
• To obtain a PAM-1 matrix, Dayhoff collected a set of ungapped alignments which is
constructed from high similarity amino acid sequences (usually > 85%).
• Then, a phylogenetic tree is constructed to identify the set of mutations.

67
BLOck SUbstitution Matrix
• PAM did not work well for aligning evolutionarily divergent sequences since the
matrix is generated by extrapolation.
• Henikoff and Henikoff (1992) proposed BLOSUM (BLOck SUbstitution Matrix).
• Unlike PAM, BLOSUM matrix is constructed directly from the observed alignment
(instead of extrapolation).

68
Relationship between BLOSUM and PAM
• Relationship between BLOSUM and PAM
‐ BLOSUM 80 ≈PAM 1
‐ BLOSUM 62 ≈PAM 120
‐ BLOSUM 45 ≈PAM 250
• BLOSUM 62 is the default matrix for BLAST 2.0

69
Multiple Sequence Alignment
Multiple Sequence Alignment
• Consider a set of k DNA/protein sequences S = {S1, S2, …, Sk}.
• A multiple alignment of S is a set of k equal-length sequences {S1' , S2′ , …, S𝑘′ }.
S𝑖′ is obtained by inserting gaps in to Si.
• The Multiple Sequence Alignment (MSA) problem is to find an alignment which
maximizes a certain similarity function or minimizes a certain distance function.
One popular similarity function is the Sum-of-Pair (SP) score

71
Sum-of-Pair (SP) Score (1/2)
• Consider the multiple alignment S′ of S.
• SP-score(a1, …, ak) = Σ1≤i<j≤kδ(ai, aj),
where ai can be any character or a space.
• The SP-score of S′ is
ΣxSP-score(S1′ [x], …, S𝑘′ [x]).

72
Sum-of-Pair (SP) Score (2/2)
• Assume the mismatch and match scores are 2 and −2, respectively.
• S1 = ACG--GAGA
• S2 = -CGTTGACA
• S3 = AC-T-GA-A
• S4 = CCGTTCAC-
• For position 1, SP-score(A,-,A,C) = 2δ(A,-) + 2δ(A,C) + δ(A,A) + δ(C,-) = -8
• SP-score= -8+12+0+0–6+0+12–10+0 = 0

73
Sum-of-Pair (SP) Distance
• Sum-of-Pair (SP) distance is a distance function.
• SP-dist(a1, …, ak) = Σ1≤i<j≤kδ(ai, aj),
where ai can be any character or a space.
• The SP-dist of S′ is
ΣxSP-dist(S1′ [x], …, S𝑘′ [x]).
• The problem of finding a multiple alignment which maximizes the SP score (or
minimizes the SP distance) is known to be NP-hard.
• It is unlikely to have a polynomial time solution for this problem.
74
Methods for Solving the MSA Problem
• Multiple sequence alignment methods can be classified as four types:
1) global optimization methods: the basic approach is dynamic programming
2) approximation algorithms
3) heuristic methods
4) probabilistic methods: assume certain model and try to learn the model parameters

75
Dynamic Programming Method (1/3)
• Let V(i1, i2, …, ik) = the SP-score of the optimal alignment of S1[1..i1], S2[1..i2], …,
Sk[1..ik].
• Observation: The last column of the optimal alignment should be either Sj[ij] or ‘-’.
• Hence, the score for the last column should be SP-score(S1[b1i1], S2[b2i2], …, Sk[bkik])
For (b1, b2, …, bk) ∈{0,1}k.
(Assume that Sj[0] = ‘-’.)

76
Dynamic Programming Method (2/3)
• Let V(i1, i2, …, ik) = the SP-score of the optimal alignment of S1[1..i1], S2[1..i2], …,
Sk[1..ik].
• Based on the observation, we have V(i1, i2, …, ik) = max(b1, b2, …, bk) ∈{0,1}k
{ V(i1-b1, …, ik-bk) + SP-score(S1[b1i1], …, Sk[bkik])}
• The SP-score of the optimal multiple alignment of S={S1, S2, …, Sk} is
V(n1, n2, …, nk), where ni is the length of Si.

77
Dynamic Programming Method (3/3)
• By filling-in the dynamic programming table,
We compute V(n1, n2, …, nk).
• By back-tracing,
We recover the multiple alignment.

78
Dynamic Programming Method – Complexity

• Time:
‐ The table V has n1n2…nk entries.
‐ Filling in one entry takes 2kk2time.
‐ Total running time is O(2kk2 n1n2…nk).
• Space:
‐ O(n1n2…nk) space to store the table V.
• Dynamic programming is expensive in both time and space. It is rarely used for
aligning more than 3 or 4 sequences.
79
Q &A
Thank you!

You might also like