Genomic Sequence Alignment

Introduction to Bioinformatics
for Medical Research
Gideon Greenspan
[email protected]
Lecture 3
Genomic Sequence Alignment
Genomic Sequence Alignment
• Homology
• Sequence similarity
• Dot plots
• Needleman-Wunsch alignment
• Alignment Variations
• Complexity
• BLASTN tool
2
Homology
• Similarity between species
• Before Darwin, based on morphology
– Similar bones in bat’s wing and whale’s flipper
• Darwin’s evolutionary theory
– Common ancestry
• Genetic sequences
– Evolution made visible
• Beware of incidental similarity
3
Types of Homology
• Orthologs generated by speciation

• Paralogs by gene duplication 4
Homology Motivation
• Study evolution
– Compare different species
• Discover function
– Compare against known species
• Find crucial features
– Compare many examples
• Identify cause of disease
– Compare against healthy individuals
5
Homology via Sequences
• Study evolution
– Multiple alignment and phylogenetic trees
• Discover function
– BLAST searches against database
• Find crucial features
– Motif finding
• Identify cause of disease
– Detect variable sites in alignment
6
Sequence Similarity
• Molecular genetic evidence of homology
• Requires quantification
– Length of match
– Strength of match
GTCTTCGACGTCAGTCGGCACGACCTG
ACCGTACGTCTCAGTTAGCGTGATCAG
Strong but short match Long but weak match

7
Sequence Modifications
• Three types of mutation
– Substitution (point mutation)
– Insertion
Indel (replication slippage)
– Deletion
TCCGT
TCAGT TCGAGT TCAGT
TCGT 8
Scoring Similarity
• Assume independent mutation model
– Each site considered separately
• Score at each site
– Positive if the same
– Negative if different
• Sum to make final score
– Can be positive or negative
– Significance depends on sequence length
9
Substitutions Only
• Pretend there are no indels
– Sequences compared base-by-base
– Count the number of matches and mismatches
TTCGTCGTAGTCGGCTCGACCTG
GTACGTCTAGCGAGCGTGATCCT
9 matches +18 Total score +4

14 mismatches -14 A weak match
10
Including Indels
• Create an ‘alignment’
– Count matches within alignment
– Required if sequences are different length
TT-CGTCGTAGTCG-GC-TCGACC-TG
GTACGTC-TAG-CGAGCGT-GATCCT-
17 matches +34
2 mismatches - 2 Total score +24
8 indels - 8 A strong match
11
Choosing an Alignment
• Many different alignments are possible
– Should consider all possible
– Take the best score found
TT-CGTCGTAGTCG-GC-TCGACC-TG
+24
GTACGTC-TAG-CGAGCGT-GATCCT-
-TTCGT-CGTAGTC-GGCTCG-ACCTG
0
GTAC-GTCTA-GCGAGCGT-GATCC-T
12
Dot Plots
G T A G T C G G • Early method
T ® ®
• Sequences at
A ®
top and left
G ® ® ® ®
C ® • Dots indicate
G ® ® ® ® matched bases
A ® • Diagonal series
G ® ® ® ® show matched
C ® regions
13
Local Alignment
• Best score for aligning part of sequences
– Often beats global alignment score
• Similar algorithm: Smith-Waterman
– Table cells never score below zero
-ACGCTG Global: ACGCT Local:

CATG-T- +2 ATG-T +4
14
Gap Scores
• Example showed -1 score per indel
– So gap cost is proportional to its length
• Biologically, indels occur in groups
– We want our gap score to reflect this
• Standard solution: affine gap model
– Once-off cost for opening a gap
– Lower cost for extending the gap
– Changes required to algorithm
15
Complexity
• Complexity is determined by size of table
– Aligning a sequence of length m against one of
length n requires calculating (m ¥ n) cells
• Estimate: we calculate 108 cells per second
– Aligning two mRNA sequences of 8,000 bp
requires 64,000,000 cells fi 0.64 seconds
– Aligning an mRNA and a 107 bp chromosome
requires ~1011 cells fi 1,000 secs = 15 minutes
16
Complexity for GenBank
• GenBank contains 3 ¥ 1010 base pairs
– Searching an mRNA against GenBank requires
~2.5 ¥ 1014 cells fi 2.5 ¥ 106 secs = 1 month!
– So each computer could support just one
GenBank search per month
• We need to cut down on alignment
– Use a heuristic method to narrow down the part
of GenBank that could be of interest
17
Genomic Indexing
• Indexing can improve any kind of search
– Index need only be built once
– Accessing an index is instantaneous
• Keys in genomic index are short sequences
– Each key points to many sequences
• Search uses index to isolate candidates
– Alignment is applied only to likely matches
• FASTA and BLAST implementations
18
FASTA and BLAST
• FASTA (Lipman & Pearson 1985)
– First fast sequence searching algorithm
• Gapped BLAST (Altschul et al 1997)
– Originally without gaps (1990)
• BLAST benefits
– Search speed
– Ease of use
– Statistical rigor
19
BLAST Variations
Name Query type Database
blastn Genomic Genomic
blastp Protein Protein
blastx Translated genomic Protein
tblastn Protein Translated genomic
tblastx Translated genomic Translated genomic
• Genomic translations test all 6 possibilities

– 3x for codon frames, 2x for reverse complement
20
BLASTN Parameters
Query: bare sequence, FASTA format,
GenBank accession or identifier
Subsequence: only
use part of query
Database: nr = all
available, month = in last
Go! month, chromosome =
complete genomes, more…
21
BLASTN Results (1)
Query sequence
representation
Matched areas
of database
sequences
Multiple
matches on
sequence
22
BLASTN Results (2)
Sequence
Identifier
Sequence
description
Score and
E value
23
BLASTN Results (3)
Database
sequence
Score
breakdown
Sequence
positions
Matching
sites
24
BLASTN Scores
• Bit score
– Similar to alignment score
– Normalized to account for different schemes
– Higher means more significant
• E value
– Based on random database of similar size
– Number of hits of given score expected
– Lower means more significant
25

Genomic Sequence Alignment

Uploaded by

Copyright:

Available Formats

Genomic Sequence Alignment

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Genomic Sequence Alignment

Uploaded by

Copyright:

Available Formats

Introduction to Bioinformatics

for Medical Research

• Orthologs generated by speciation

Strong but short match Long but weak match

9 matches +18 Total score +4

-ACGCTG Global: ACGCT Local:

• Genomic translations test all 6 possibilities

You might also like