Burros Wheeler Transform - Bioinformatics
Burros Wheeler Transform - Bioinformatics
Burros Wheeler Transform - Bioinformatics
Computational Biology
April, 2019
Short read mapping
• Short read
– Second generation or mainly Illumina reads
– 35-300bp
• Mapping
– Finding the positions in the genome reads are coming from
• Applications
– Reference guided genome assembly
– Genotyping
– Analysis of *-seq data
Reference guided assembly
Exact pattern matching
• Given a pattern (read) and a text (genome), find all
positions in the text that matches the pattern?
• Input:
– Pattern p = p1…pm of length m
– Text t = t1…tn of length n
PatternMatching(p,t)
1 m length of pattern p
2 n length of text t
3 for i 1 to (n – m + 1)
4 if ti…ti+m-1 = p
5 output i
• AGCCGCATCT
• GCAT
• AGCCGCATCT
• GCAT
• AGCCGCATCT
• GCAT
• AGCCGCATCT
• GCAT
• AGCCGCATCT
• GCAT
• AGCCGCATCT
• GCAT
• AGCCGCATCT
• GCAT
• AGCAGCTAGCTAGT
• AGCTAGT
Knuth-Morris-Pratt Algorithm
• Linear time algorithm
• Avoids comparisons with positions in the text that
we have already checked
• AGCAGCTAGCTAGT
• AGCTAGT
Knuth-Morris-Pratt Algorithm
Pattern Matched so far
AGCTAGT AGCTAG
Proper prefix Proper suffix
ATCATG
TCATG
CATG Keyword Suffix
ATG Tree Tree
TG
G
Adapted from slides available at
www.bioalgorithms.info
Suffix Trees: Runtime
• Say the length of the text is n.
• Construction of the keyword tree takes O(n2) time.
• Constructing the suffix tree takes O(n) time.
• So our method takes O(n2 + n) = O(n2) time.
• However, there exists a method of calculating the suffix tree in O(n) time
without needing the keyword tree for the suffixes.
ATCATG
TCATG
CATG Keyword Suffix
ATG Tree Tree
TG
G
Adapted from slides available at
www.bioalgorithms.info
Pattern Matching with Suffix Trees
• To find any pattern in a text:
1. Build suffix tree for text of length n—O(n) time
2. Thread the pattern of length m through the suffix tree of
the text—O(m) time.