Sequence Alignment and Searching
Identification of homology and homologous sites in related sequences
Inference of the evolutionary history that led to the differences in observed sequences
The Problem
Biological problem: finding a way to compare and represent similarity or dissimilarity between biomolecular sequences (DNA, RNA or amino acid).
Computational problem: finding a way to perform inexact or approximate matching of subsequences within strings of characters.
Sequence comparison and alignment is a central problem in computational biology: high sequence similarity usually implies structural or functional similarity.
Example: xyz is a subsequence within axayaz, but NOT a substring.
Characters in a substring must be contiguous.
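The distinction can be checked mechanically. The following is a minimal sketch; the function names are illustrative and not part of the original slides.

```python
def is_substring(pattern: str, text: str) -> bool:
    """True if the characters of pattern occur contiguously in text."""
    return pattern in text


def is_subsequence(pattern: str, text: str) -> bool:
    """True if the characters of pattern occur in text in order, gaps allowed."""
    remaining = iter(text)
    # `ch in remaining` consumes the iterator up to the first match,
    # so each pattern character must be found after the previous one.
    return all(ch in remaining for ch in pattern)


print(is_subsequence("xyz", "axayaz"))  # True
print(is_substring("xyz", "axayaz"))    # False
```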
According to sequence coverage: LOCAL or GLOBAL
According to number of sequences: TWO SEQUENCES (pairwise alignment)
Defining consensus sequences, protein structural motifs and domains, regulatory elements in DNA etc.
Determination of conserved residues and domains; Introductory step in molecular phylogenetic analysis
gctga-a--cg
--ct-ataatc
And another:
gctg-aa-cg
-ctataatc-
A simple picture that gives an overview of the similarities between two sequences
Dotplot showing identities between sequences (DOROTHYHODGKIN) and (DOROTHYCROWFOOTHODGKIN):
Letters corresponding to isolated matches are shown in non-bold type. The longest matching regions, shown in boldface, are DOROTHY and HODGKIN. Shorter matching regions, such as the OTH of dorOTHy and RO of doROthy and cROwfoot, are noise.
Dotplot showing identities between a repetitive sequence (ABRACADABRACADABRA) and itself. The repeats appear on several subsidiary diagonals parallel to the main diagonal.
Dotplot showing identities between the palindromic sequence MAX I STAY AWAY AT SIX AM and itself. The palindrome reveals itself as a stretch of matches perpendicular to the main diagonal.
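A dotplot of identities is simple to compute. The sketch below is one possible text rendering (the `dotplot` helper and the `*`/`.` symbols are assumptions made for illustration); it marks every position where the row character equals the column character, so long matching regions appear as diagonal runs.

```python
def dotplot(rows: str, cols: str) -> list[str]:
    """Return one text line per character of `rows`; '*' marks an identity."""
    return ["".join("*" if r == c else "." for c in cols) for r in rows]


a = "DOROTHYHODGKIN"
b = "DOROTHYCROWFOOTHODGKIN"
print("  " + b)
for ch, line in zip(a, dotplot(a, b)):
    print(ch + " " + line)
```

Plotting a sequence against itself in the same way, for example ABRACADABRACADABRA, shows the repeats as diagonals parallel to the main diagonal.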
The Hamming distance, defined between two strings of equal length, is the number of positions with mismatching characters. The Levenshtein, or edit distance, between two strings of not necessarily equal length, is the minimal number of 'edit operations' required to change one string into the other, where an edit operation is a deletion, insertion or alteration of a single character in either sequence.
Hamming distance = 2
Levenshtein distance = 3
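As a minimal sketch of the Hamming distance definition (the function name is illustrative, and the example strings AATT and AATG are chosen only for demonstration):

```python
def hamming_distance(s: str, t: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(s) != len(t):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(a != b for a, b in zip(s, t))


print(hamming_distance("AATT", "AATG"))  # 1
```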
Definition: The edit distance between two strings is defined as the minimum number of edit operations (insertions, deletions and substitutions) needed to transform the first string into the second. For emphasis, note that matches are not counted.
Example: AATT and AATG. Distance = 1 (one substitution)
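The edit distance can be computed by dynamic programming. This is a sketch of the standard row-by-row recurrence with unit cost for every operation; the function name is an assumption made for illustration.

```python
def edit_distance(s: str, t: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning s into t."""
    # prev[j] is the distance between the processed prefix of s and t[:j].
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        curr = [i]  # turning s[:i] into the empty string takes i deletions
        for j, b in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a
                curr[j - 1] + 1,         # insert b
                prev[j - 1] + (a != b),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


print(edit_distance("AATT", "AATG"))  # 1
```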
String alignment
An edit transcript is a way to represent a particular transformation of one string into another; it emphasizes point mutations in the model.
An alignment displays a relationship between two strings.
Global alignment means that, for each string, the entire string is involved in the alignment.
Examples:
AAGCA
AA_C_
Sequence divergence
Sequences may have diverged from a common ancestor through mutations:
Substitution (AAGC -> AAGT)
Insertion (AAG -> AAGT)
Deletion (AAGC -> AAG)
Scoring schemes
In molecular biology, certain changes are more likely to occur than others
Amino acid substitutions tend to be conservative.
In nucleotide sequences, transitions are more frequent than transversions.
-> We want to give different weights to different edit operations.
Example: a DNA substitution matrix:
     a   g   c   t
a   20  10   5   5
g   10  20   5   5
c    5   5  20  10
t    5   5  10  20
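To see how such a matrix is used, the sketch below scores the two candidate alignments of gctgaacg and ctataatc shown earlier. The gap penalty of -10 is a hypothetical value, since the slides give no gap score, and the function name is illustrative.

```python
BASES = "agct"
ROWS = [
    [20, 10, 5, 5],   # a
    [10, 20, 5, 5],   # g
    [5, 5, 20, 10],   # c
    [5, 5, 10, 20],   # t
]
# Look up a substitution score by (base, base) pair.
SCORE = {(x, y): ROWS[i][j]
         for i, x in enumerate(BASES)
         for j, y in enumerate(BASES)}

GAP = -10  # hypothetical gap penalty, not taken from the slides


def alignment_score(row1: str, row2: str, gap: int = GAP) -> int:
    """Sum column scores of an alignment; '-' in either row counts as a gap."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have equal length")
    total = 0
    for x, y in zip(row1, row2):
        total += gap if "-" in (x, y) else SCORE[x, y]
    return total


print(alignment_score("gctga-a--cg", "--ct-ataatc"))  # -25
print(alignment_score("gctg-aa-cg", "-ctataatc-"))    # 70
```

Under these assumed weights the second alignment scores higher, which is the kind of comparison a scoring scheme is meant to support.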
BLAST = Basic Local Alignment Search Tool
Use it when you have a nucleotide or protein sequence that you want to search against sequence databases.
(1) Choose the sequence (query)
(2) Select the BLAST program
(3) Choose the database to search
(4) Choose optional parameters
Then click BLAST.
Step 2: Choose the BLAST program
Program   Input     Database   Searches
blastn    DNA       DNA         1
blastp    protein   protein     1
blastx    DNA       protein     6
tblastn   protein   DNA         6
tblastx   DNA       DNA        36
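The program table can be written as a small lookup. This is only an illustrative sketch: the dictionary and function names are hypothetical, and the third field reads the numbers 1, 1, 6, 6, 36 as the number of reading-frame combinations compared, which is an interpretation of the slide rather than something it states.

```python
# program -> (query type, database type, searches per comparison)
PROGRAMS = {
    "blastn":  ("DNA", "DNA", 1),
    "blastp":  ("protein", "protein", 1),
    "blastx":  ("DNA", "protein", 6),    # query translated in 6 frames
    "tblastn": ("protein", "DNA", 6),    # database translated in 6 frames
    "tblastx": ("DNA", "DNA", 36),       # both translated: 6 x 6
}


def candidate_programs(query_type: str, database_type: str) -> list[str]:
    """List the programs that accept this query/database combination."""
    return [name for name, (q, d, _) in PROGRAMS.items()
            if q == query_type and d == database_type]


print(candidate_programs("DNA", "protein"))  # ['blastx']
print(candidate_programs("DNA", "DNA"))      # ['blastn', 'tblastx']
```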
You can...
choose the organism to search
turn filtering on/off
change the substitution matrix
change the expect (E) value
change the word size
change the output format
Alignment Views
Pairwise: standard BLAST alignment in pairs of query sequence and database match.
Query: 251 tgaccggtaacgaccgcaccctggacgtcatggcgctggatgtggtgtggacggcgga 3
"The central idea of the BLAST algorithm is to confine attention to segment pairs that contain a word pair of length W with a score of at least T." (Altschul et al., 1990)
Construct a dictionary of all the words in the query.
Initiate a local alignment for each word match between the query and the database (DB).
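The dictionary step can be sketched as follows. This simplified version finds exact word matches only and ignores the score threshold T; the function names are illustrative assumptions, not BLAST's actual implementation.

```python
from collections import defaultdict


def build_word_index(query: str, w: int) -> dict[str, list[int]]:
    """Map every word of length w in the query to its start positions."""
    index = defaultdict(list)
    for i in range(len(query) - w + 1):
        index[query[i:i + w]].append(i)
    return dict(index)


def find_word_hits(query: str, subject: str, w: int) -> list[tuple[int, int]]:
    """Scan the subject and report (query_pos, subject_pos) for each shared word."""
    index = build_word_index(query, w)
    hits = []
    for j in range(len(subject) - w + 1):
        for i in index.get(subject[j:j + w], []):
            hits.append((i, j))
    return hits


print(find_word_hits("DOROTHYHODGKIN", "DOROTHYCROWFOOTHODGKIN", 7))
# [(0, 0), (7, 15)]
```

Each reported pair would then be a starting point for a local alignment.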
Background
3 Stages
Speed gained by minimizing search space.
Alignments require word hits (word size = W).
[Figure: word hits between Sequence 1 and Sequence 2]
Threshold score = T
Neighborhood words of RGD (T = 12)
Word scores, highest first: 17 14 13 13 12 12 12 12 | 11 11 11 11 11 11 11 11 11
Words scoring 12 include: HGD NGD RGN
Words scoring 11 (below T): AGD MGD RAD RGQ RGS RND RSD SGD TGD
W and T modulate speed and sensitivity.
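The neighborhood idea can be illustrated with a toy example. BLAST applies it to protein words scored with an amino acid matrix; the sketch below instead reuses the small DNA substitution matrix from the scoring-schemes slide so that it stays self-contained. The function name and the threshold of 50 are assumptions chosen for illustration.

```python
from itertools import product

BASES = "agct"
ROWS = [
    [20, 10, 5, 5],   # a
    [10, 20, 5, 5],   # g
    [5, 5, 20, 10],   # c
    [5, 5, 10, 20],   # t
]
SCORE = {(x, y): ROWS[i][j]
         for i, x in enumerate(BASES)
         for j, y in enumerate(BASES)}


def neighborhood(word: str, threshold: int) -> dict[str, int]:
    """All words of the same length scoring at least `threshold` against `word`."""
    result = {}
    for candidate in product(BASES, repeat=len(word)):
        score = sum(SCORE[x, y] for x, y in zip(word, candidate))
        if score >= threshold:
            result["".join(candidate)] = score
    return result


print(neighborhood("acg", 50))
# {'aca': 50, 'acg': 60, 'atg': 50, 'gcg': 50}
```

Lowering the threshold enlarges the neighborhood (more sensitivity, more hits to process); raising it shrinks the neighborhood (more speed).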
A hit is made with one or several successive pairs of similar words. All the possible hits between the query sequence and sequences from databases are calculated in this way.
Step 3: Extension
Each hit is extended in both directions. Extension is terminated when the score drops more than X below the maximum score reached so far.
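A minimal sketch of ungapped extension with a drop-off rule is shown below. It assumes an exact seed word, a simple match/mismatch scheme (+5/-4, hypothetical values) and stops in each direction once the running score falls more than X below the best score seen so far; the function names are illustrative.

```python
def extend_one_way(query, subject, qi, sj, step, x_drop, match=5, mismatch=-4):
    """Walk from (qi, sj) in direction `step` (+1 or -1).

    Returns (best score gained, number of columns kept to reach it).
    """
    score = best = 0
    length = best_len = 0
    while 0 <= qi < len(query) and 0 <= sj < len(subject):
        score += match if query[qi] == subject[sj] else mismatch
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score > x_drop:
            break  # fell more than X below the best score seen so far
        qi += step
        sj += step
    return best, best_len


def extend_hit(query, subject, qi, sj, w, x_drop, match=5, mismatch=-4):
    """Extend an exact word hit of length w in both directions.

    Returns (score, query_start, query_end) with query_end exclusive.
    """
    seed = w * match
    right, r_len = extend_one_way(query, subject, qi + w, sj + w, +1,
                                  x_drop, match, mismatch)
    left, l_len = extend_one_way(query, subject, qi - 1, sj - 1, -1,
                                 x_drop, match, mismatch)
    return seed + left + right, qi - l_len, qi + w + r_len


# Seed "GTAC" at position 4 in both sequences, X = 6.
print(extend_hit("AAACGTACGTTT", "GGACGTACGTCC", 4, 4, 4, 6))  # (40, 2, 10)
```

The reported segment is the one ending at the best-scoring column in each direction, not at the column where the walk stopped.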
E-value (Expectation): the number of unrelated database sequences expected to yield the same or a higher score by pure chance.
HSP (high-scoring segment pair): aligned fragments of the query and a detected sequence with a similarity score exceeding a set cutoff value.

Score = 61 (27.8 bits), Expect = 1.8e-65, Sum P(4) = 1.8e-65
Identities = 10/17 (58%), Positives = 16/17 (94%)
Query:  81 SGDLSMLVLLPDEVSDL 97
           +GD+SM +LLPDE++D+
Sbjct: 259 AGDVSMFLLLPDEIADV 275
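The Identities and Positives counts in the example HSP can be recomputed from the alignment text. The sketch below relies on the BLAST output convention that the middle line shows the residue letter for an identity, `+` for a non-identical pair with a positive score, and a space otherwise; the function names are illustrative.

```python
def count_identities(query_row: str, subject_row: str) -> int:
    """Number of aligned columns holding the same residue."""
    return sum(q == s and q != "-" for q, s in zip(query_row, subject_row))


def summarize_midline(midline: str) -> tuple[int, int, int]:
    """Return (identities, positives, alignment length) from a BLAST midline."""
    identities = sum(ch.isalpha() for ch in midline)
    positives = sum(ch != " " for ch in midline)  # letters and '+' signs
    return identities, positives, len(midline)


query = "SGDLSMLVLLPDEVSDL"
mid   = "+GD+SM +LLPDE++D+"
sbjct = "AGDVSMFLLLPDEIADV"

ident, pos, n = summarize_midline(mid)
print(f"Identities = {ident}/{n} ({100 * ident // n}%)")  # Identities = 10/17 (58%)
print(f"Positives = {pos}/{n} ({100 * pos // n}%)")       # Positives = 16/17 (94%)
print(count_identities(query, sbjct) == ident)            # True
```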
The Expect value is used as a convenient way to create a significance threshold for reporting results. When the Expect value is increased from the default value of 10, a larger list with more low-scoring hits can be reported. (An E-value approaching zero indicates a significant alignment.) An E-value of 1 assigned to a hit can be interpreted as follows: in a database of the current size, one might expect to see 1 match with a similar score simply by chance.
Statistical Terminology
True positive: a hit returned from a database search which is homologous with the query sequence. GOOD
False positive: a hit returned from a database search which is not homologous with the query sequence. BAD
True negative: a sequence which is not homologous with the query sequence and is not returned from the database search. GOOD
False negative: a sequence which is homologous with the query sequence but is not returned from the database search. BAD
Sensitivity: a program which is sensitive picks up most true positives.
Selectivity: a program which is selective does not include false positives.
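These terms can be turned into numbers. The sketch below assumes the common formulas sensitivity = TP / (TP + FN) and selectivity = TP / (TP + FP); the slides describe the terms only in words, so the formulas and the example counts are assumptions.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Fraction of all true homologs that the search returned."""
    return tp / (tp + fn)


def selectivity(tp: int, fp: int) -> float:
    """Fraction of returned hits that are true homologs."""
    return tp / (tp + fp)


# Hypothetical search: 45 homologs found, 5 missed, 15 unrelated hits reported.
print(sensitivity(tp=45, fn=5))   # 0.9
print(selectivity(tp=45, fp=15))  # 0.75
```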
Conclusion
Treat BLAST searches as scientific experiments.
Don't just use the default parameters.
Thanks