Genomic Sequence Alignment
Genomic Sequence Alignment
Genomic Sequence Alignment
Gideon Greenspan
[email protected]
Lecture 3
Genomic Sequence Alignment
Genomic Sequence Alignment
• Homology
• Sequence similarity
• Dot plots
• Needleman-Wunsch alignment
• Alignment Variations
• Complexity
• BLASTN tool
2
Homology
• Similarity between species
• Before Darwin, based on morphology
– Similar bones in bat’s wing and whale’s flipper
• Darwin’s evolutionary theory
– Common ancestry
• Genetic sequences
– Evolution made visible
• Beware of incidental similarity
3
Types of Homology
GTCTTCGACGTCAGTCGGCACGACCTG
ACCGTACGTCTCAGTTAGCGTGATCAG
TCCGT
TCAGT TCGAGT TCAGT
TCGT 8
Scoring Similarity
• Assume independent mutation model
– Each site considered separately
• Score at each site
– Positive if the same
– Negative if different
• Sum to make final score
– Can be positive or negative
– Significance depends on sequence length
9
Substitutions Only
• Pretend there are no indels
– Sequences compared base-by-base
– Count the number of matches and mismatches
TTCGTCGTAGTCGGCTCGACCTG
GTACGTCTAGCGAGCGTGATCCT
TT-CGTCGTAGTCG-GC-TCGACC-TG
GTACGTC-TAG-CGAGCGT-GATCCT-
17 matches +34
2 mismatches - 2 Total score +24
8 indels - 8 A strong match
11
Choosing an Alignment
• Many different alignments are possible
– Should consider all possible
– Take the best score found
TT-CGTCGTAGTCG-GC-TCGACC-TG
+24
GTACGTC-TAG-CGAGCGT-GATCCT-
-TTCGT-CGTAGTC-GGCTCG-ACCTG
0
GTAC-GTCTA-GCGAGCGT-GATCC-T
12
Dot Plots
G T A G T C G G • Early method
T ® ®
• Sequences at
A ®
top and left
G ® ® ® ®
C ® • Dots indicate
G ® ® ® ® matched bases
A ® • Diagonal series
G ® ® ® ® show matched
C ® regions
13
Local Alignment
• Best score for aligning part of sequences
– Often beats global alignment score
• Similar algorithm: Smith-Waterman
– Table cells never score below zero
14
Gap Scores
• Example showed -1 score per indel
– So gap cost is proportional to its length
• Biologically, indels occur in groups
– We want our gap score to reflect this
• Standard solution: affine gap model
– Once-off cost for opening a gap
– Lower cost for extending the gap
– Changes required to algorithm
15
Complexity
• Complexity is determined by size of table
– Aligning a sequence of length m against one of
length n requires calculating (m ¥ n) cells
• Estimate: we calculate 108 cells per second
– Aligning two mRNA sequences of 8,000 bp
requires 64,000,000 cells fi 0.64 seconds
– Aligning an mRNA and a 107 bp chromosome
requires ~1011 cells fi 1,000 secs = 15 minutes
16
Complexity for GenBank
• GenBank contains 3 ¥ 1010 base pairs
– Searching an mRNA against GenBank requires
~2.5 ¥ 1014 cells fi 2.5 ¥ 106 secs = 1 month!
– So each computer could support just one
GenBank search per month
• We need to cut down on alignment
– Use a heuristic method to narrow down the part
of GenBank that could be of interest
17
Genomic Indexing
• Indexing can improve any kind of search
– Index need only be built once
– Accessing an index is instantaneous
• Keys in genomic index are short sequences
– Each key points to many sequences
• Search uses index to isolate candidates
– Alignment is applied only to likely matches
• FASTA and BLAST implementations
18
FASTA and BLAST
• FASTA (Lipman & Pearson 1985)
– First fast sequence searching algorithm
• Gapped BLAST (Altschul et al 1997)
– Originally without gaps (1990)
• BLAST benefits
– Search speed
– Ease of use
– Statistical rigor
19
BLAST Variations
Name Query type Database
blastn Genomic Genomic
blastp Protein Protein
blastx Translated genomic Protein
tblastn Protein Translated genomic
tblastx Translated genomic Translated genomic
Subsequence: only
use part of query
Database: nr = all
available, month = in last
Go! month, chromosome =
complete genomes, more…
21
BLASTN Results (1)
Query sequence
representation
Matched areas
of database
sequences
Multiple
matches on
sequence
22
BLASTN Results (2)
Sequence
Identifier
Sequence
description
Score and
E value
23
BLASTN Results (3)
Database
sequence
Score
breakdown
Sequence
positions
Matching
sites
24
BLASTN Scores
• Bit score
– Similar to alignment score
– Normalized to account for different schemes
– Higher means more significant
• E value
– Based on random database of similar size
– Number of hits of given score expected
– Lower means more significant
25