Lecture 56string Matching
Lecture 56string Matching
Lecture 56string Matching
Lecture # 5&6
String Matching Problem
Motivations: text-editing, pattern matching in DNA sequences
32.1
• Naive Algorithm
– Worst-case running time in O((n-m+1) m)
• Rabin-Karp
– Worst-case running time in O((n-m+1) m)
– Better than this on average and in practice
• Knuth-Morris-Pratt
– Worst-case running time in O(n + m)
Notation & Terminology
32.4
Complexity Analysis
There are two cases of consideration :
(i) Pattern found
The worst case occurs when the pattern is at last position and
there are spurious hits all the way. Example, T =
AAAAAAAAAAB, P = AAAB.
To move pattern one position right, m comparisons are made.
Searchable text in T has a length (n – m). Hence, in worst case
algorithm runs in O(m*(n – m)) time.
(ii) Pattern not found
In the best case, the searchable text does not contain any of the
prefixes of the pattern. Only one comparison requires moving
pattern one position right.
Questions based on Naive String Matching
Q1:
Suppose T = 1011101110
P = 111
Find all the Valid Shift /Shifts
Q#2:
a) Show the comparisons the naive string
matcher makes for the pattern P = 0001 in
the text T = 000010001010001.
Q#3
T= PLANINGANDANALYASIS and P = AND
32.5
Rabin-Karp Algorithm (continued)
p = 31415
spurious
hit
Rabin-Karp Algorithm (continued)
Rabin-Karp Algorithm (continued)
d is radix q is modulus
Preprocessing
(m)
Preprocessing
(m)
Matching loop invariant: when line 10 executed
ts=T[s+1..s+m] mod q
((n-m+1)m) rule out spurious hit
(m)
Try all
possible
shifts
Compute-Prefix-Function (p)
1 m length[p] //’p’ pattern to be matched
2 Π[1] 0
3 k0
4 for q 2 to m
5 do while k > 0 and p[k+1] != p[q]
6 do k Π[k]
7 If p[k+1] = p[q]
8 then k k +1
9 Π[q] k
10 return Π
Example: compute Π for the pattern ‘p’ below:
Pa b a b a c a
Initially: m = length[p] = 7
q 1 2 3 4 5 6 7
Π[1] = 0 p a b a b a c a
k=0 Π 0 0
q 1 2 3 4 5 6 7
p a b a b a c a
Step 1: q = 2, k=0
Π 0 0 1
Π[2] = 0
q 1 2 3 4 5 6 7
p a b a b a c A
Π 0 0 1 2
Step 2: q = 3, k = 0,
Π[3] = 1
Step 4: q = 5, k =2 q 1 2 3 4 5 6 7
Π[5] = 3 p a b a b a c a
Π 0 0 1 2 3
q 1 2 3 4 5 6 7
Step 5: q = 6, k = 3
Π[6] = 1 p a b a b a c a
Π 0 0 1 2 3 0
q 1 2 3 4 5 6 7
Step 6: q = 7, k = 1 p a b a b a c a
Π[7] = 1 Π 0 0 1 2 3 0 1
Note: KMP finds every occurrence of a ‘p’ in ‘S’. That is why KMP does not terminate in step 12, rather it searches
remainder of ‘S’ for any more occurrences of ‘p’.
Illustration: given a String ‘S’ and pattern ‘p’ as follows:
b a c b a b a b a b a c a c a
S
p a b a b a c a
Let us execute the KMP algorithm to find
whether ‘p’ occurs in ‘S’.
For ‘p’ the prefix function, Π was computed previously and is as follows:
q 1 2 3 4 5 6 7
p a b A b a c a
Π 0 0 1 2 3 0 1
Initially: n = size of S = 15;
m = size of p = 7
Step 1: i = 1, q = 0
comparing p[1] with S[1]
S b a c b a b a b a b a c a a b
p a b a b a c a
P[1] does not match with S[1]. ‘p’ will be shifted one position to the right.
Step 2: i = 2, q = 0
comparing p[1] with S[2]
S b a c b a b a b a b a c a a b
p a b a b a c a
P[1] matches S[2]. Since there is a match, p is not shifted.
Step 3: i = 3, q = 1
Comparing p[2] with S[3] p[2] does not match with S[3]
S b a c b a b a b a b a c a a b
p a b a b a c a
Backtracking on p, comparing p[1] and S[3]
Step 4: i = 4, q = 0
comparing p[1] with S[4] p[1] does not match with S[4]
S b a c b a b a b a b a c a a b
p a b a b a c a
Step 5: i = 5, q = 0
comparing p[1] with S[5] p[1] matches with S[5]
S b a c b a b a b a b a c a a b
p a b a b a c a
Step 6: i = 6, q = 1
Comparing p[2] with S[6] p[2] matches with S[6]
S b a c b a b a b a b a c a a b
p a b a b a c a
Step 7: i = 7, q = 2
Comparing p[3] with S[7] p[3] matches with S[7]
S b a c b a b a b a b a c a a b
p a b a b a c a
Step 8: i = 8, q = 3
Comparing p[4] with S[8] p[4] matches with S[8]
S b a c b a b a b a b a c a a b
p a b a b a c a
Step 9: i = 9, q = 4
Comparing p[5] with S[9] p[5] matches with S[9]
S b a c b a b a b a b a c a a b
p a b a b a c a
S b a c b a b a b a b a c a a b
p a b a b a c a
Backtracking on p, comparing p[4] with S[10] because after mismatch q = Π[5] = 3
S b a c b a b a b a b a c a a b
p a b a b a c a
Step 12: i = 12, q = 5
Comparing p[6] with S[12] p[6] matches with S[12]
S b a c b a b a b a b a c a a b
p a b a b a c a
S b a c b a b a b a b a c a a b
p a b a b a c a
Pattern ‘p’ has been found to completely occur in string ‘S’. The total number of shifts
that took place for the match to be found are: i – m = 13 – 7 = 6 shifts.
Running - time analysis
In the above pseudocode for computing the prefix The for loop beginning in step 5 runs ‘n’ times, i.e., as
function, the for loop from step 4 to step 10 long as the length of the string ‘S’. Since step 1
runs ‘m’ times. Step 1 to step 3 take to step 4 take constant time, the running time is
constant time. Hence the running time of dominated by this for loop. Thus running time of
compute prefix function is Θ(m). matching function is Θ(n).
Knuth-Morris-Pratt Algorithm
(m) in (n)
# characters matched
using
amortized scan text left-to-right
analysis
(m+n)
next character does not match
(n)
next character matches
Is all of P matched?
using
amortized
analysis Look for next match
Knuth-Morris-Pratt Algorithm