Bio 3

Download as pdf or txt
Download as pdf or txt
You are on page 1of 34

BIOINFORMATICS(BIOCOMPUTING)

(3)
ALIGNMENT AND MATCHING
DR. IBRAHIM ZAGHLOUL
DNA Sequencers
DNA Sequencing
• DNA sequencing refers to the general laboratory technique for determining the exact
sequence of nucleotides, or bases, in a DNA molecule.

• A DNA sequencer is a scientific instrument used to automate the DNA


sequencing process. Given a sample of DNA, a DNA sequencer is used to determine the
order of the four bases: G (guanine), C (cytosine), A (adenine) and T (thymine). This is then
reported as a text string, called a read.
DNA Sequencers
DNA Sequencers
Sequencing Reads
Alignment
Assembly
Evolution

1
Sequence conservation implies function

Alignment is the key to


• Finding important regions
• Determining function
• Uncovering the evolutionary forces
Sequence alignment
• Comparing DNA/protein sequences for
– Similarity
– Homology
• Prediction of function
• Construction of phylogeny: the history of the evolution of a species or
group.
• Shotgun assembly
– End-space-free alignment / overlap alignment
• Finding motifs

1
Homology

• Homology: Homology among DNA, or proteins is


inferred from their sequence similarity. Significant
similarity is strong evidence that two sequences are related
by evolutionary changes from a common ancestral
sequence.
• Orthologs (Different Species)
– Divergence follows speciation
– Similarity can be used to construct phylogeny between
species
• Paralogs (Same Species)
- Divergence follows duplication
1
Sequence Alignment

• Procedure of comparing two (pairwise) or more


(multiple) sequences by searching for a series of
individual characters that are in the same order in
the sequences

GCTAGTCAGATCTGACGCTA
| |||| ||||| |||
TGGTCACATCTGCCGC

18
Sequence Alignment

• Procedure of comparing two (pairwise) or more


(multiple) sequences by searching for a series of
individual characters that are in the same order
in the sequences

VLSPADKTNVKAAWGKVGAHAGYEG
||| | | | || | ||
VLSEGDWQLVLHVWAKVEADVAGEG

1
Sequence Alignment
AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC

-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

Definition
Given two strings x = x1x2...xM, y = y1y2…yN,

an alignment is an assignment of gaps to positions 0,…, M


in x, and 0,…, N in y, so as to line up each letter in one
sequence with either a letter, or a gap in the other sequence

2
Sources of variation
• Nucleotide substitution
– Replication error
– Chemical reaction
• Insertions or deletions (indels)
– Unequal crossing over
– Replication slippage
• Duplication
– a single gene (complete gene duplication)
– part of a gene (internal or partial gene duplication)
• Domain duplication
• Exon shuffling
– part of a chromosome (partial polysomy)

– an entire chromosome (aneuploidy or polysomy)


– the whole genome (polyploidy)

21
A simple alignment

• Let us try to align two short nucleotide sequences:


– AATCTATA and AAGATA
• Without considering any gaps (insertions/deletions) there
are 3 possible ways to align these sequences

AATCTATA AATCTATA AATCTATA


AAGATA AAGATA AAGATA

• Which one is better?

22
Scoring the alignments
• We need to have a scoring mechanism to evaluate alignments
– match score
– mismatch score
• We can have the total score as:
n

=1
i
match or mismatch score at position i

• For the simple example, assume a match score of 1 and a


mismatch score of 0:
AATCTATA AATCTATA AATCTATA
AAGATA AAGATA AAGATA
4 1 3
23
Simple alignment with gaps
• Considering gapped alignments vastly
increases the number of possible alignments.

• If gap penalty is -1 what will be the new


scores?

AATCTATA AATCTATA AATCTATA


AAG-AT-A AA-G-ATA AA--GATA
1 3 3

24
BLOSUM 62 matrix
String Definitions

A string S is a finite ordered list of characters.

Characters are drawn from an alphabet Σ.

Nucleic acid alphabet: { A, C, G, T }


Amino acid alphabet: { A, R, N, D, C, E, Q, G, H, I, L, K, M, F, P, S, T, W, Y, V }

Length of S, |S |, is the number of characters in S

ϵ is the empty string. | ϵ | = 0

26
String Definitions

• For strings S and T over Σ, their concatenation consists of the characters of


S followed by the characters of T, denoted ST

• S is a substring of T if there exist (possibly empty) strings u and v such that


T = uSv

• S is a prefix of T if there exists a string u such that T = Su.


If neither S nor u are ϵ, S is a proper prefix of T.
• Definitions of suffix and proper suffix are similar.

27
String Definitions

• We defined substring. Subsequence is similar except the


characters need not be consecutive.

• “cat” is a substring and a subsequence of “concatenate”

• “cant” is a subsequence of “concatenate”, but not a


substring

28
Exact matching

• Looking for places where a pattern P occurs as a substring


of a text T. Each such place is an occurrence or match.

• An alignment is a way of putting P’s characters opposite


T’s characters. It may or may not correspond to an
occurrence.

29
Exact Matching

30
Exact matching: naïve algorithm

31
Exact matching: naïve algorithm

32
Exact matching: naïve algorithm

33
Can we improve on the naïve algorithm?

P: word
T: There would have been a time for such a word
word

u doesn’t occur in P,so skip next two alignments

P: word
T: There would have been a time for such a word
word
word skip!
word skip!
word

34

You might also like