Sequence Alignment and Searching

By: Sarika Goyal
What is the purpose of sequence alignment?
Identification of homology and homologous sites in related sequences
Inference of the evolutionary history that led to the differences in the observed sequences
The Problem
Biological problem: finding a way to compare and represent similarity or dissimilarity between biomolecular sequences (DNA, RNA or amino acid).
The Problem
Computational problem: finding a way to perform inexact or approximate matching of subsequences within strings of characters.
Sequence comparison and alignment is a central problem in computational biology: high sequence similarity usually implies structural or functional similarity.
Substring and subsequence
Example: xyz is a subsequence of axayaz, but NOT a substring.
Characters in a substring must be contiguous; characters in a subsequence only need to appear in the same order.
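The distinction can be checked mechanically. The following is a minimal sketch; the helper names is_subsequence and is_substring are illustrative and do not come from the slides.

```python
def is_subsequence(pattern, text):
    """True if the characters of pattern occur in text in order (gaps allowed)."""
    remaining = iter(text)
    # `ch in remaining` consumes the iterator up to the first match,
    # so each character must be found after the previous one.
    return all(ch in remaining for ch in pattern)


def is_substring(pattern, text):
    """True if pattern occurs in text as one contiguous block."""
    return pattern in text


print(is_subsequence("xyz", "axayaz"))  # True
print(is_substring("xyz", "axayaz"))    # False
```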
Types of comparisons and alignment methods
According to sequence coverage: LOCAL or GLOBAL
According to number of sequences: TWO SEQUENCES or THREE OR MORE SEQUENCES

TWO SEQUENCES (Pairwise alignment)
  LOCAL: Database search against query sequences; BLAST algorithm
  GLOBAL: Comparison of two sequences; first step in multiple sequence alignment
THREE OR MORE SEQUENCES (Multiple alignment)
  LOCAL: Defining consensus sequences, protein structural motifs and domains, regulatory elements in DNA etc.
  GLOBAL: Determination of conserved residues and domains; introductory step in molecular phylogenetic analysis
Introduction to sequence alignment

Given two text strings:
First string  = a b c d e
Second string = a c d e f
a reasonable alignment would be
a b c d e -
a - c d e f
For the sequences gctgaacg and ctataatc:
An uninformative alignment:

-------gctgaacg
ctataatc-------
An alignment without gaps:

gctgaacg
ctataatc
An alignment with gaps:

gctga-a--cg
--ct-ataatc
And another:
gctg-aa-cg
-ctataatc-
We must choose criteria so that an algorithm can choose the best alignment.
The dotplot (1)
A simple picture that gives an overview of the similarities between two sequences
Dotplot showing identities between sequences (DOROTHYHODGKIN) and (DOROTHYCROWFOOTHODGKIN):
Letters corresponding to isolated matches are shown in non-bold type. The longest matching regions, shown in boldface, are DOROTHY and HODGKIN. Shorter matching regions, such as the OTH of dorOTHy and RO of doROthy and cROwfoot, are noise.
The dotplot (2)
Dotplot showing identities between a repetitive sequence (ABRACADABRACADABRA) and itself. The repeats appear on several subsidiary diagonals parallel to the main diagonal.
The dotplot (3)
Dotplot showing identities between the palindromic sequence MAX I STAY AWAY AT SIX AM and itself. The palindrome reveals itself as a stretch of matches perpendicular to the main diagonal.
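The three dotplots described above can be regenerated as text. This is a minimal sketch that marks identities only (no windowing or filtering); the function names dotplot and show are illustrative, and the spaces are removed from the palindrome before plotting.

```python
def dotplot(rows, cols):
    """One text line per character of `rows`; '*' marks an identity with `cols`."""
    return ["".join("*" if r == c else "." for c in cols) for r in rows]


def show(rows, cols):
    """Print the dotplot with the column sequence as a header line."""
    print("  " + cols)
    for r, line in zip(rows, dotplot(rows, cols)):
        print(r + " " + line)


show("DOROTHYHODGKIN", "DOROTHYCROWFOOTHODGKIN")
show("ABRACADABRACADABRA", "ABRACADABRACADABRA")
show("MAXISTAYAWAYATSIXAM", "MAXISTAYAWAYATSIXAM")
```

In the first plot the runs DOROTHY and HODGKIN appear as two diagonal segments; in the second the repeats show up on diagonals offset by 7; in the third the anti-diagonal is completely filled.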
Dotplots and sequence alignments

Any path through the dotplot from upper left to lower right passes through a succession of cells, each of which picks out a pair of positions, one from the row and one from the column, that correspond in the alignment; or that indicates a gap in one of the sequences. The path need not pass through filled-in points only. However, the more filled-in points on the diagonal segments of the path, the more matching residues in the alignment.
Corresponding alignment:
DOROTHY--------HODGKIN
DOROTHYCROWFOOTHODGKIN
Measures of sequence similarity
Two measures of distance between two character strings:
The Hamming distance, defined between two strings of equal length, is the number of positions with mismatching characters.
The Levenshtein, or edit distance, between two strings of not necessarily equal length, is the minimal number of 'edit operations' required to change one string into the other, where an edit operation is a deletion, insertion or alteration of a single character in either sequence.

agtc
cgta      Hamming distance = 2

ag-tcc
cgctca    Levenshtein distance = 3
The Edit Distance between two strings
Definition: The edit distance between two strings is the minimum number of edit operations (insertions, deletions and substitutions) needed to transform the first string into the second. Note that matches are not counted.
Example: AATT and AATG: distance = 1 (one substitution)
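Both distances are short to compute. The sketch below uses the standard dynamic-programming recurrence for the edit distance with unit cost for every operation; the function names are illustrative, and the inputs are the examples from the slides.

```python
def hamming(a, b):
    """Number of mismatching positions; defined only for equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    # prev[j] is the distance between the prefix of a processed so far and b[:j].
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute, or free match
            ))
        prev = curr
    return prev[-1]


print(hamming("agtc", "cgta"))         # 2
print(levenshtein("agtcc", "cgctca"))  # 3
print(levenshtein("AATT", "AATG"))     # 1
```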
String alignment
An edit transcript is a way to represent a particular transformation of one string into another; it emphasizes point mutations in the model.
An alignment displays a relationship between two strings.
Global alignment means that, for each string, the entire string is involved in the alignment.
Example:
AAGCA
AA_C_
Sequence divergence
Sequences may have diverged from a common ancestor through mutations:
Substitution (AAGC -> AAGT)
Insertion (AAG -> AAGT)
Deletion (AAGC -> AAG)
Scoring schemes
In molecular biology, certain changes are more likely to occur than others:

Amino acid substitutions tend to be conservative.
In nucleotide sequences, transitions are more frequent than transversions.
-> We want to give different weights to different edit operations.
Example: a DNA substitution matrix:
     a   g   c   t
a   20  10   5   5
g   10  20   5   5
c    5   5  20  10
t    5   5  10  20
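With such a matrix, an alignment is scored column by column. The sketch below applies the matrix above to two of the gctgaacg / ctataatc alignments from the introduction. The slides give no gap cost, so the gap penalty of -10 per gap column is an assumption made purely for illustration.

```python
# Substitution matrix from the slide: transitions (a<->g, c<->t) score 10,
# transversions score 5, identities score 20.
SUBSTITUTION = {
    "a": {"a": 20, "g": 10, "c": 5, "t": 5},
    "g": {"a": 10, "g": 20, "c": 5, "t": 5},
    "c": {"a": 5, "g": 5, "c": 20, "t": 10},
    "t": {"a": 5, "g": 5, "c": 10, "t": 20},
}


def alignment_score(top, bottom, gap_penalty=-10):
    """Sum of column scores; gap_penalty is an assumed value, not from the slides."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    score = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            score += gap_penalty
        else:
            score += SUBSTITUTION[x][y]
    return score


print(alignment_score("gctgaacg", "ctataatc"))      # 80
print(alignment_score("gctg-aa-cg", "-ctataatc-"))  # 70
```

Under this particular gap penalty the ungapped alignment scores higher; a milder penalty would reverse the ranking, which is why the scoring criteria have to be fixed before an algorithm can pick a best alignment.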
BLAST: the workhorse of bioinformatics

http://www.ncbi.nlm.nih.gov/BLAST
BLAST = Basic Local Alignment Search Tool
Use it when you have a nucleotide or protein sequence that you want to search against sequence databases:
to determine what the sequence is
to find related sequences (homologs)
Different BLAST programs
DNA can be translated into six potential proteins

5' CAT CAA
5'  ATC AAC
5'   TCA ACT
5' CATCAACTACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC 3'
3' GTAGTTGATGTTGAGGTTTCTGTGGGAATGTGTAGTTGTTTGGATGGGTG 5'
5' GTG GGT
5'  TGG GTA
5'   GGG TAG
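The six reading frames can be enumerated directly from the sequence shown above. This is a minimal sketch that only splits each frame into codons (it does not translate them); the function names and the frame labels "+1" to "-3" are illustrative conventions.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Complement every base, then reverse, giving the other strand 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def codons(dna, offset):
    """Complete codons of dna starting at the given offset (0, 1 or 2)."""
    return [dna[i:i + 3] for i in range(offset, len(dna) - 2, 3)]


def six_frames(dna):
    """Three frames on the given strand and three on its reverse complement."""
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames["+%d" % (offset + 1)] = codons(dna, offset)
        frames["-%d" % (offset + 1)] = codons(rc, offset)
    return frames


dna = "CATCAACTACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC"
for name, frame in sorted(six_frames(dna).items()):
    print(name, " ".join(frame[:2]), "...")
```

The first two codons of each frame match the ones listed on the slide, for example CAT CAA for frame +1 and GTG GGT for frame -1.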
Four components to a BLAST search
(1) Choose the sequence (query)
(2) Select the BLAST program
(3) Choose the database to search
(4) Choose optional parameters
Then click BLAST
Step 1: Choose your sequence
The sequence can be entered in FASTA format or as an accession number.
Example of the FASTA format for a BLAST query
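The slide's own example did not survive extraction. As a stand-in, the sketch below shows the general shape of FASTA text (a header line starting with ">" followed by one or more sequence lines) and a small parser for it. The record name "example_query" and its description are hypothetical; the sequence is the 50-base example used earlier in the slides.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical record: the header text is made up for illustration.
example = """>example_query hypothetical nucleotide sequence
CATCAACTACAACTCCAAAGACACC
CTTACACATCAACAAACCTACCCAC
"""

for header, sequence in parse_fasta(example):
    print(header, len(sequence))
```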
Step 2: Choose the BLAST program

Program   Input (query)   Database   Frame comparisons
blastn    DNA             DNA        1
blastp    protein         protein    1
blastx    DNA             protein    6
tblastn   protein         DNA        6
tblastx   DNA             DNA        36
Step 3: Choose the database

nr = non-redundant (most general database)
dbest = database of expressed sequence tags
dbsts = database of sequence tag sites
pdb = sequences derived from the 3D structures of proteins
Patents = nucleotide sequences derived from the patent division of GenBank
Step 4a: Select optional search parameters
You can...
choose the organism to search
turn filtering on/off
change the substitution matrix
change the expect (E) value
change the word size
change the output format
[Screenshot: BLAST search form with the CD search, Entrez query (organism), filter (filtering), expect, word size and scoring matrix options]
Step 4b: optional formatting parameters
Alignment view, descriptions, alignments
BLAST format options
Alignment views
Pairwise: standard BLAST alignment in pairs of query sequence and database match.

Query: 251     tgaccggtaacgaccgcaccctggacgtcatggcgctggatgtggtgtggacggcgga 3
               |||||||||| ||||||| |||||||| |||||| |||||||||||||| |||||||
Sbjct: 248575  tgaccggtaaagaccgcagcttggacgtgatggcgatggatgtggtgtggacagcgga 248634

[Screenshot: BLAST output header showing the program, the query and the database]
High scores correspond to low E values.
Algorithm for BLAST
How a BLAST search works
"The central idea of the BLAST algorithm is to confine attention to segment pairs that contain a word pair of length W with a score of at least T." (Altschul et al., 1990)
BLAST algorithm principles

(Basic Local Alignment Search Tool)
Main idea:
1. Construct a dictionary of all the words in the query.
2. Initiate a local alignment for each word match between the query and the database (DB).
BLAST Original Version

Dictionary: all words of length k
Alignment initiated between words with an alignment score of at least T
Alignment: ungapped extensions until the score falls below a statistical threshold
Output: all local alignments with score > statistical threshold
Background
The BLAST Algorithm
3 stages:
1. Preprocessing of the query
2. Generation of hits
3. Extension of the hits
Step 1: Preprocessing of the query
Speed is gained by minimizing the search space.
Alignments require word hits (word size = W).
[Figure: word hits between Sequence 1 and Sequence 2]
Step 1: Preprocessing of the query (Contd.)

Neighborhood words of RGD, with threshold score T = 12:

Word   Score
RGD    17
KGD    14
QGD    13
RGE    13
EGD    12
HGD    12
NGD    12
RGN    12
------ T = 12 ------
AGD    11
MGD    11
RAD    11
RGQ    11
RGS    11
RND    11
RSD    11
SGD    11
TGD    11

W and T modulate speed and sensitivity.
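The neighborhood list can be reproduced by scoring every three-letter word against RGD and keeping those that reach the threshold. The slide does not name its substitution matrix; the sketch below assumes BLOSUM62, whose rows for R, G and D give exactly the scores listed above. Only those three rows are included, so this code works for the query word RGD and for no other word.

```python
from itertools import product

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# Assumption: rows of BLOSUM62 for the three residues of the query word RGD,
# in the column order given by AMINO_ACIDS.
ROWS = {
    "R": [-1, 5, 0, -2, -3, 1, 0, -2, 0, -3, -2, 2, -1, -3, -2, -1, -1, -3, -2, -3],
    "G": [0, -2, 0, -1, -3, -2, -2, 6, -2, -4, -4, -2, -3, -3, -2, 0, -2, -2, -3, -3],
    "D": [-2, -2, 1, 6, -3, 0, 2, -1, -1, -3, -4, -1, -3, -3, -1, 0, -1, -4, -3, -3],
}
SCORES = {res: dict(zip(AMINO_ACIDS, row)) for res, row in ROWS.items()}


def neighborhood(word, threshold):
    """All words of the same length scoring at least `threshold` against `word`."""
    hits = []
    for candidate in product(AMINO_ACIDS, repeat=len(word)):
        score = sum(SCORES[q][c] for q, c in zip(word, candidate))
        if score >= threshold:
            hits.append(("".join(candidate), score))
    # Highest score first, ties in alphabetical order.
    return sorted(hits, key=lambda hit: (-hit[1], hit[0]))


for neighbor, score in neighborhood("RGD", 12):
    print(neighbor, score)
```

Lowering the threshold from 12 to 11 grows the list from 8 to 17 words, which illustrates how T trades speed for sensitivity.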
Step 2: Generation of hits
A hit is made with one or several successive pairs of similar words. All the possible hits between the query sequence and sequences from the databases are calculated in this way.
Step 3: Extension
Each hit is extended in both directions. Extension is terminated when the score drops more than X below the maximum score reached so far.
[Figure: scanning the database (DB) with the query]
The BLAST Algorithm: Summary
BLAST Original Version

Example: k = 4, T = 4
The matching word GGTC initiates an alignment.
Extension to the left and right with no gaps.
Output:
GTAAGGTCC
GTTAGGTCC
[Figure: matrix with axis labels A C G A A G T A A G G T C C A G T and C C C T T C C T G G A T T G C G A]
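The example can be retraced with a small seed-and-extend sketch. Several choices here are assumptions, not taken from the slides: seeds are exact word matches (which is what T = k = 4 amounts to when a match scores 1), a mismatch scores -1, extension stops once the score falls more than x_drop = 2 below its best value, and the database string is the figure's second axis label read in reverse, the orientation that contains the reported hit. This is a toy illustration, not the NCBI implementation.

```python
from collections import defaultdict


def build_word_index(query, k):
    """Dictionary mapping every word of length k in the query to its positions."""
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)
    return index


def extend(query, db, qi, di, step, match, mismatch, x_drop):
    """Ungapped extension from (qi, di) in direction step; returns (gain, length)."""
    score = best = 0
    best_len = length = 0
    while 0 <= qi < len(query) and 0 <= di < len(db):
        score += match if query[qi] == db[di] else mismatch
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score > x_drop:
            break  # dropped too far below the best score seen
        qi += step
        di += step
    return best, best_len


def seed_and_extend(query, db, k=4, match=1, mismatch=-1, x_drop=2):
    """Find exact word hits, extend each without gaps, return unique HSPs."""
    index = build_word_index(query, k)
    hsps = set()
    for d in range(len(db) - k + 1):
        for q in index.get(db[d:d + k], ()):
            right, right_len = extend(query, db, q + k, d + k, 1, match, mismatch, x_drop)
            left, left_len = extend(query, db, q - 1, d - 1, -1, match, mismatch, x_drop)
            q_start, d_start = q - left_len, d - left_len
            length = left_len + k + right_len
            score = k * match + left + right
            hsps.add((score, query[q_start:q_start + length], db[d_start:d_start + length]))
    return sorted(hsps, reverse=True)


query = "ACGAAGTAAGGTCCAGT"
database = "AGCGTTAGGTCCTTCCC"  # assumed orientation, see the note above
for score, q_seg, d_seg in seed_and_extend(query, database):
    print(score, q_seg, d_seg)
```

Three overlapping words (AGGT, GGTC and GTCC) seed the same diagonal, and all of them extend to the single pair GTAAGGTCC / GTTAGGTCC, which is why the results are collected in a set.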
BLAST results: List of hits
E-value: the number of unrelated databank sequences expected to yield the same or a higher score by pure chance.
BLAST results: High scoring pairs (HSPs)
The HSP (high scoring pair) is the fundamental unit of the BLAST algorithm output: aligned fragments of the query and a detected sequence with a similarity score exceeding a set cutoff value.
Example HSP, with its E-value (Expectation) on the first line:

Score = 61 (27.8 bits), Expect = 1.8e-65, Sum P(4) = 1.8e-65
Identities = 10/17 (58%), Positives = 16/17 (94%)
Query:  81 SGDLSMLVLLPDEVSDL 97
           +GD+SM +LLPDE++D+
Sbjct: 259 AGDVSMFLLLPDEIADV 275
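The Identities and Positives counts in the HSP above can be recomputed from its three alignment rows. In this sketch, identities are columns where query and subject agree, and positives are identities plus the columns marked "+" on the middle line; the function name is illustrative, and the percentages are truncated to whole numbers to match the figures shown.

```python
def hsp_counts(query, midline, subject):
    """Return (identities, positives, length) for one aligned segment."""
    if not (len(query) == len(midline) == len(subject)):
        raise ValueError("the three rows must have the same length")
    identities = sum(q == s for q, s in zip(query, subject))
    positives = identities + midline.count("+")
    return identities, positives, len(query)


identities, positives, length = hsp_counts(
    "SGDLSMLVLLPDEVSDL",
    "+GD+SM +LLPDE++D+",
    "AGDVSMFLLLPDEIADV",
)
print("Identities = %d/%d (%d%%)" % (identities, length, identities * 100 // length))
print("Positives = %d/%d (%d%%)" % (positives, length, positives * 100 // length))
```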
BLAST Confidence measures

Score and bit-score: depend on the scoring method.
E-value (Expect value): the number of unrelated database sequences expected to yield the same or a higher score by pure chance.
The Expect value is used as a convenient way to create a significance threshold for reporting results. When the Expect value is increased from the default value of 10, a larger list with more low-scoring hits can be reported. An E-value approaching zero indicates a significant alignment. An E value of 1 assigned to a hit can be interpreted as follows: in a database of the current size, one might expect to see 1 match with a similar score simply by chance.
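For a bit score S', the expected number of chance hits in a search space of size m x n (query length times database length) is E = m * n * 2^(-S'). The sketch below applies that relationship directly. It ignores the effective-length corrections that real BLAST applies, and the lengths used in the example are hypothetical.

```python
def expect_value(bit_score, query_length, database_length):
    """E = m * n * 2**(-S') for bit score S'; no effective-length correction."""
    return query_length * database_length * 2.0 ** (-bit_score)


# Hypothetical search space: a 300-residue query against 10 million residues.
for bits in (20, 30, 40, 50):
    print(bits, expect_value(bits, 300, 10_000_000))
```

Each additional bit of score halves the E-value, and doubling the database size doubles it, which is why the same alignment can be significant in a small database and unremarkable in a large one.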
Statistical Terminology
True positive: a hit returned from a database search which is homologous with the query sequence. GOOD
False positive: a hit returned from a database search which is not homologous with the query sequence. BAD
True negative: a sequence which is not homologous with the query sequence and is not returned from the database search. GOOD
False negative: a sequence which is homologous with the query sequence but is not returned from the database search. BAD
Sensitivity: a program which is sensitive picks up most true positives.
Selectivity: a program which is selective does not include false positives.
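These two properties can be put into numbers. The sketch below assumes the common definitions: sensitivity as the fraction of all homologs that were returned, and selectivity as the fraction of returned hits that are homologs. The counts in the example are hypothetical.

```python
def sensitivity(true_positives, false_negatives):
    """Fraction of the homologous sequences that the search returned."""
    return true_positives / (true_positives + false_negatives)


def selectivity(true_positives, false_positives):
    """Fraction of the returned hits that are truly homologous."""
    return true_positives / (true_positives + false_positives)


# Hypothetical search: 50 hits returned, 45 of them homologs, 15 homologs missed.
tp, fp, fn = 45, 5, 15
print("sensitivity:", sensitivity(tp, fn))  # 0.75
print("selectivity:", selectivity(tp, fp))  # 0.9
```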
Conclusion
Treat BLAST searches as scientific experiments.
Don't simply use the default parameters; the defaults change from time to time.
BLAST is quite complicated, but it's very useful.
Thanks
