Genomic Sequence Alignment

Download as pdf or txt
Download as pdf or txt
You are on page 1of 25

Introduction to Bioinformatics

for Medical Research

Gideon Greenspan
[email protected]

Lecture 3
Genomic Sequence Alignment
Genomic Sequence Alignment
• Homology
• Sequence similarity
• Dot plots
• Needleman-Wunsch alignment
• Alignment Variations
• Complexity
• BLASTN tool
2
Homology
• Similarity between species
• Before Darwin, based on morphology
– Similar bones in bat’s wing and whale’s flipper
• Darwin’s evolutionary theory
– Common ancestry
• Genetic sequences
– Evolution made visible
• Beware of incidental similarity
3
Types of Homology

• Orthologs generated by speciation


• Paralogs by gene duplication 4
Homology Motivation
• Study evolution
– Compare different species
• Discover function
– Compare against known species
• Find crucial features
– Compare many examples
• Identify cause of disease
– Compare against healthy individuals
5
Homology via Sequences
• Study evolution
– Multiple alignment and phylogenetic trees
• Discover function
– BLAST searches against database
• Find crucial features
– Motif finding
• Identify cause of disease
– Detect variable sites in alignment
6
Sequence Similarity
• Molecular genetic evidence of homology
• Requires quantification
– Length of match
– Strength of match

GTCTTCGACGTCAGTCGGCACGACCTG
ACCGTACGTCTCAGTTAGCGTGATCAG

Strong but short match Long but weak match


7
Sequence Modifications
• Three types of mutation
– Substitution (point mutation)
– Insertion
Indel (replication slippage)
– Deletion

TCCGT
TCAGT TCGAGT TCAGT
TCGT 8
Scoring Similarity
• Assume independent mutation model
– Each site considered separately
• Score at each site
– Positive if the same
– Negative if different
• Sum to make final score
– Can be positive or negative
– Significance depends on sequence length
9
Substitutions Only
• Pretend there are no indels
– Sequences compared base-by-base
– Count the number of matches and mismatches

TTCGTCGTAGTCGGCTCGACCTG
GTACGTCTAGCGAGCGTGATCCT

9 matches +18 Total score +4


14 mismatches -14 A weak match
10
Including Indels
• Create an ‘alignment’
– Count matches within alignment
– Required if sequences are different length

TT-CGTCGTAGTCG-GC-TCGACC-TG
GTACGTC-TAG-CGAGCGT-GATCCT-
17 matches +34
2 mismatches - 2 Total score +24
8 indels - 8 A strong match
11
Choosing an Alignment
• Many different alignments are possible
– Should consider all possible
– Take the best score found

TT-CGTCGTAGTCG-GC-TCGACC-TG
+24
GTACGTC-TAG-CGAGCGT-GATCCT-
-TTCGT-CGTAGTC-GGCTCG-ACCTG
0
GTAC-GTCTA-GCGAGCGT-GATCC-T
12
Dot Plots
G T A G T C G G • Early method
T ® ®
• Sequences at
A ®
top and left
G ® ® ® ®
C ® • Dots indicate
G ® ® ® ® matched bases
A ® • Diagonal series
G ® ® ® ® show matched
C ® regions
13
Local Alignment
• Best score for aligning part of sequences
– Often beats global alignment score
• Similar algorithm: Smith-Waterman
– Table cells never score below zero

-ACGCTG Global: ACGCT Local:


CATG-T- +2 ATG-T +4

14
Gap Scores
• Example showed -1 score per indel
– So gap cost is proportional to its length
• Biologically, indels occur in groups
– We want our gap score to reflect this
• Standard solution: affine gap model
– Once-off cost for opening a gap
– Lower cost for extending the gap
– Changes required to algorithm
15
Complexity
• Complexity is determined by size of table
– Aligning a sequence of length m against one of
length n requires calculating (m ¥ n) cells
• Estimate: we calculate 108 cells per second
– Aligning two mRNA sequences of 8,000 bp
requires 64,000,000 cells fi 0.64 seconds
– Aligning an mRNA and a 107 bp chromosome
requires ~1011 cells fi 1,000 secs = 15 minutes

16
Complexity for GenBank
• GenBank contains 3 ¥ 1010 base pairs
– Searching an mRNA against GenBank requires
~2.5 ¥ 1014 cells fi 2.5 ¥ 106 secs = 1 month!
– So each computer could support just one
GenBank search per month
• We need to cut down on alignment
– Use a heuristic method to narrow down the part
of GenBank that could be of interest

17
Genomic Indexing
• Indexing can improve any kind of search
– Index need only be built once
– Accessing an index is instantaneous
• Keys in genomic index are short sequences
– Each key points to many sequences
• Search uses index to isolate candidates
– Alignment is applied only to likely matches
• FASTA and BLAST implementations
18
FASTA and BLAST
• FASTA (Lipman & Pearson 1985)
– First fast sequence searching algorithm
• Gapped BLAST (Altschul et al 1997)
– Originally without gaps (1990)
• BLAST benefits
– Search speed
– Ease of use
– Statistical rigor
19
BLAST Variations
Name Query type Database
blastn Genomic Genomic
blastp Protein Protein
blastx Translated genomic Protein
tblastn Protein Translated genomic
tblastx Translated genomic Translated genomic

• Genomic translations test all 6 possibilities


– 3x for codon frames, 2x for reverse complement
20
BLASTN Parameters
Query: bare sequence, FASTA format,
GenBank accession or identifier

Subsequence: only
use part of query

Database: nr = all
available, month = in last
Go! month, chromosome =
complete genomes, more…
21
BLASTN Results (1)
Query sequence
representation

Matched areas
of database
sequences

Multiple
matches on
sequence
22
BLASTN Results (2)
Sequence
Identifier

Sequence
description

Score and
E value
23
BLASTN Results (3)
Database
sequence

Score
breakdown

Sequence
positions

Matching
sites
24
BLASTN Scores
• Bit score
– Similar to alignment score
– Normalized to account for different schemes
– Higher means more significant
• E value
– Based on random database of similar size
– Number of hits of given score expected
– Lower means more significant
25

You might also like