Lecture 5 & 6: String Matching


CSE408

String Matching Algorithm

Lecture # 5&6
String Matching Problem
Motivations: text-editing, pattern matching in DNA sequences

[Figure 32.1]

Text: array T[1...n], with n ≥ m
Pattern: array P[1...m]

Array element: character from a finite alphabet Σ
Pattern P occurs with shift s in T if P[1...m] = T[s+1...s+m], where 0 ≤ s ≤ n – m
String Matching Algorithms

• Naive Algorithm
– Worst-case running time in O((n-m+1) m)

• Rabin-Karp
– Worst-case running time in O((n-m+1) m)
– Better than this on average and in practice

• Knuth-Morris-Pratt
– Worst-case running time in O(n + m)
Notation & Terminology

• Σ* = set of all finite-length strings formed using characters from alphabet Σ
• Empty string: ε
• |x| = length of string x
• w is a prefix of x: w ⊏ x (example: ab ⊏ abcca)
• w is a suffix of x: w ⊐ x (example: cca ⊐ abcca)
• prefix and suffix are transitive relations
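
The prefix and suffix relations can be checked directly with Python's built-in string methods. This is only a small sketch; the helper names `is_prefix` and `is_suffix` are illustrative and not part of the lecture.

```python
def is_prefix(w: str, x: str) -> bool:
    """True when w is a prefix of x, i.e. x = w + y for some string y."""
    return x.startswith(w)


def is_suffix(w: str, x: str) -> bool:
    """True when w is a suffix of x, i.e. x = y + w for some string y."""
    return x.endswith(w)


print(is_prefix("ab", "abcca"))    # True
print(is_suffix("cca", "abcca"))   # True
print(is_prefix("", "abcca"))      # True: the empty string is a prefix of every string
```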


Working Mechanism
• The naive algorithm is a simple brute-force approach. It compares
the first character of the pattern with the current character of the
text. If they match, the pointers into both strings are advanced. If
they do not match, the pattern pointer is reset to the start of the
pattern and the text pointer moves to the position just after the
start of the current attempt. This process is repeated until the end
of the text.
• The naive approach does not require any pre-processing.
Given text T and pattern P, it directly starts comparing both
strings character by character.
• After each attempt, it shifts the pattern one position to the
right.
The following example illustrates the working of the naive string
matching algorithm. Here, t and p are indices of text and pattern
respectively.
[Figure: worked example of the naive matcher on a text T]
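
The mechanism described above can be sketched in a few lines of Python. The function name `naive_match` and the sample strings are illustrative assumptions; indices are 0-based, so a returned value s corresponds to shift s in the 1-based definition P[1...m] = T[s+1...s+m].

```python
def naive_match(text: str, pattern: str) -> list[int]:
    """Return every shift s at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    shifts = []
    for s in range(n - m + 1):      # try every possible shift
        p = 0                       # index into the pattern
        while p < m and text[s + p] == pattern[p]:
            p += 1                  # characters agree: advance
        if p == m:                  # the whole pattern matched at shift s
            shifts.append(s)
    return shifts


print(naive_match("acaabc", "aab"))   # [2]
```

No pre-processing is done: every shift starts again from the first pattern character.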
Naive String Matching

Worst-case running time is in Θ((n-m+1)m)

[Figure 32.4]
Complexity Analysis
There are two cases to consider:
(i) Pattern found
The worst case occurs when the pattern is at the last position and
every earlier shift matches all but the last character of the
pattern. Example: T = AAAAAAAAAAB, P = AAAB.
At each shift, m comparisons are made before the pattern moves one
position right. There are n – m + 1 possible shifts. Hence, in the
worst case the algorithm runs in O(m(n – m + 1)) time.
(ii) Pattern not found
In the best case, the text does not contain the first character of
the pattern. Only one comparison is then needed before moving the
pattern one position right, so the algorithm runs in O(n – m + 1) time.
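
To make the two cases concrete, the sketch below counts character comparisons made by a naive matcher. The function name `count_comparisons` and the all-B best-case text are illustrative assumptions; the worst-case strings are the ones from the example above.

```python
def count_comparisons(text: str, pattern: str) -> int:
    """Number of character comparisons the naive matcher performs."""
    n, m = len(text), len(pattern)
    comparisons = 0
    for s in range(n - m + 1):
        for j in range(m):
            comparisons += 1
            if text[s + j] != pattern[j]:
                break               # mismatch: move to the next shift
    return comparisons


worst = count_comparisons("AAAAAAAAAAB", "AAAB")   # n = 11, m = 4
best = count_comparisons("BBBBBBBBBBB", "AAAB")    # first character never matches
print(worst, best)   # 32 8
```

Here 32 = (n – m + 1) · m = 8 · 4, and 8 = n – m + 1, matching the two bounds.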
Questions based on Naive String Matching

Q1:

Suppose T = 1011101110 and P = 111.
Find all the valid shifts.
Q2:
Show the comparisons the naive string matcher makes for the
pattern P = 0001 in the text T = 000010001010001.
Q3:
T = PLANINGANDANALYASIS and P = AND

a. Find the number of comparisons for valid shifts.
b. Find the valid shifts.
c. Find the pattern indices for valid shifts.
Rabin-Karp algorithm
• Like the naive algorithm, the Rabin-Karp algorithm also slides the
pattern one position at a time.
• But unlike the naive algorithm, the Rabin-Karp algorithm first
compares the hash value of the pattern with the hash value of the
current substring of the text.
• Only if the hash values match does it start matching individual
characters.
• So the Rabin-Karp algorithm needs to calculate hash values for the
following strings:
– the pattern itself
– all the substrings of the text of length m
• Since we need to efficiently calculate hash values for all the
substrings of size m of the text, we must have a hash function that
has the following property:
– The hash at the next shift must be efficiently computable from the
current hash value and the next character in the text. In other
words, hash(txt[s+1 .. s+m]) must be efficiently computable from
hash(txt[s .. s+m-1]) and txt[s+m], i.e.,
  hash(txt[s+1 .. s+m]) = rehash(txt[s+m], hash(txt[s .. s+m-1]))
– rehash must be an O(1) operation.
• The number of possible characters is higher than 10 (256 in
general) and the pattern length can be large.
• So the numeric value of a window cannot practically be stored in an
integer variable.
• Therefore, the numeric value is calculated using modular
arithmetic, so that the hash values fit in an integer variable
(a memory word).
• To rehash, we take off the most significant digit and add the new
least significant digit to the hash value. Rehashing is done using
the following formula:
  hash(txt[s+1 .. s+m]) = (d · (hash(txt[s .. s+m-1]) – txt[s] · h) + txt[s+m]) mod q
  where
  hash(txt[s .. s+m-1]): hash value at shift s
  hash(txt[s+1 .. s+m]): hash value at the next shift (shift s+1)
  d: number of characters in the alphabet
  q: a prime number
  h: d^(m-1) mod q
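
The rehash formula can be checked with a short Python sketch. The function names, the sample text, and the choices d = 256 and q = 101 are assumptions made for illustration; characters are mapped to numbers with `ord`, and 1 ≤ m ≤ len(txt) is assumed.

```python
def rolling_hashes(txt: str, m: int, d: int = 256, q: int = 101) -> list[int]:
    """Hash of every length-m window of txt, each derived from the previous one in O(1)."""
    h = pow(d, m - 1, q)                 # d^(m-1) mod q
    cur = 0
    for ch in txt[:m]:                   # first window, computed in O(m)
        cur = (d * cur + ord(ch)) % q
    hashes = [cur]
    for s in range(len(txt) - m):
        # drop txt[s] (most significant digit), append txt[s+m] (least significant digit)
        cur = (d * (cur - ord(txt[s]) * h) + ord(txt[s + m])) % q
        hashes.append(cur)
    return hashes


def direct_hash(window: str, d: int = 256, q: int = 101) -> int:
    """Reference value: the window read as a base-d number, reduced mod q."""
    value = 0
    for ch in window:
        value = value * d + ord(ch)
    return value % q


txt, m = "abracadabra", 4
expected = [direct_hash(txt[s:s + m]) for s in range(len(txt) - m + 1)]
print(rolling_hashes(txt, m) == expected)   # True
```

Python's `%` always returns a non-negative result for a positive modulus, so the subtraction inside the formula needs no extra correction here; in languages where `%` can return a negative value, q would have to be added back.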
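Question 1:
• For string matching, working modulo q = 11, how many spurious
hits does the Rabin-Karp matcher encounter in the text
T = 31415926535....... when looking for the pattern P = 26?
o T = 31415926535.......
o P = 26
o q = 11 is given (it is the working modulus, not derived from the length of T)
o P mod q = 26 mod 11 = 4
o Now find every length-2 window of T whose value mod q is 4. A window equal to 26 is a valid match; any other such window is a spurious hit...
..contd..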
Complexity:

The running time of RABIN-KARP-MATCHER in the worst case is
O((n-m+1)m), but it has a good average-case running time. If the
expected number of valid shifts is small, O(1), and the prime q is
chosen to be at least the pattern length m, then the Rabin-Karp
algorithm can be expected to run in time O(n+m) plus the time
required to process spurious hits.
Rabin-Karp Algorithm

• Assume each character is a digit in radix-d notation (e.g. d = 10)
• p = decimal value of pattern
• t_s = decimal value of substring T[s+1..s+m] for s = 0, 1, ..., n-m
• Strategy:
– compute p in O(m) time (which is in O(n))
– compute all t_s values in a total of O(n) time
– find all valid shifts s in O(n) time by comparing p with each t_s
• Compute p in O(m) time using Horner’s rule:
– p = P[m] + d(P[m-1] + d(P[m-2] + ... + d(P[2] + d·P[1])))
• Compute t_0 similarly from T[1..m] in O(m) time
• Compute the remaining t_s values in O(n-m) time:
– t_(s+1) = d(t_s - d^(m-1)·T[s+1]) + T[s+m+1]
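
Horner's rule and the window recurrence can be tried out on decimal digit strings without any modulus. This is a sketch under assumptions: the function names and the digit string are illustrative, indices are 0-based, and 1 ≤ m ≤ len(text) is assumed.

```python
def horner(digits: str, d: int = 10) -> int:
    """Value of a digit string in radix d, computed with Horner's rule in O(m)."""
    value = 0
    for ch in digits:
        value = d * value + int(ch)
    return value


def window_values(text: str, m: int, d: int = 10) -> list[int]:
    """Values of all length-m windows, each obtained from the previous one in O(1)."""
    t = horner(text[:m], d)              # first window
    values = [t]
    high = d ** (m - 1)                  # weight of the high-order digit
    for s in range(len(text) - m):
        # remove the high-order digit, shift left one place, bring in the next digit
        t = d * (t - high * int(text[s])) + int(text[s + m])
        values.append(t)
    return values


print(horner("31415"))                   # 31415
print(window_values("2359023141", 5))    # [23590, 35902, 59023, 90231, 2314, 23141]
```

Python integers are unbounded, so the exact values work here; with fixed-width machine words the values overflow quickly, which is why the algorithm reduces everything modulo q.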
Rabin-Karp Algorithm

p and t_s may be large, so compute them modulo q.

[Figure 32.5: Rabin-Karp example with p = 31415, using t_(s+1) = d(t_s - d^(m-1)·T[s+1]) + T[s+m+1] and showing a spurious hit]

Rabin-Karp Algorithm (continued)

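Annotations on the RABIN-KARP-MATCHER pseudocode:

• d is the radix; q is the modulus.
• h is the high-order digit position for an m-digit window; computing it takes Θ(m), which is in O(n).
• Preprocessing: Θ(m).
• Matching loop invariant: whenever line 10 is executed, t_s = T[s+1..s+m] mod q.
• The matching loop tries all possible shifts. Ruling out a spurious hit costs Θ(m), so the loop costs Θ((n-m+1)m) in the worst case.

Worst-case running time is in Θ((n-m+1)m)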


Rabin-Karp Algorithm (continued)

Assume reducing mod q behaves like a random mapping from Σ* to Z_q.

Estimate: the chance that t_s ≡ p (mod q) is 1/q, so the number of spurious hits is in O(n/q).

Expected matching time = O(n) + O(m(v + n/q)), where v = number of valid shifts.

If v is in O(1) and q ≥ m, the average-case running time is in Θ(n + m).
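
The whole matcher, including the check that separates valid shifts from spurious hits, might look like the sketch below. It assumes decimal digit strings (d = 10); the function name, the sample text, and q = 13 are illustrative choices, and shifts are reported 0-based.

```python
def rabin_karp(text: str, pattern: str, d: int = 10, q: int = 13):
    """Return (valid shifts, number of spurious hits) for digit strings."""
    n, m = len(text), len(pattern)
    matches, spurious = [], 0
    if m == 0 or m > n:
        return matches, spurious
    h = pow(d, m - 1, q)                       # high-order digit weight, mod q
    p = t = 0
    for i in range(m):                         # preprocessing
        p = (d * p + int(pattern[i])) % q
        t = (d * t + int(text[i])) % q
    for s in range(n - m + 1):                 # try all possible shifts
        if p == t:                             # hash hit: verify character by character
            if text[s:s + m] == pattern:
                matches.append(s)
            else:
                spurious += 1                  # same hash, different string
        if s < n - m:                          # roll the hash to the next window
            t = (d * (t - int(text[s]) * h) + int(text[s + m])) % q
    return matches, spurious


print(rabin_karp("2359023141526739921", "31415"))   # ([6], 1)
```

In this run the window 67399 has the same value mod 13 as 31415 (both are 7), so it is counted as a spurious hit and rejected by the explicit comparison.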


The Knuth-Morris-Pratt Algorithm

Knuth, Morris and Pratt proposed a linear-time algorithm for the
string matching problem.
A matching time of O(n) is achieved by avoiding comparisons with
elements of ‘S’ that have previously been involved in a comparison
with some element of the pattern ‘p’ to be matched, i.e.,
backtracking on the string ‘S’ never occurs.
Components of KMP algorithm

• The prefix function, Π
– The prefix function Π for a pattern encapsulates knowledge about
how the pattern matches against shifts of itself.
– This information can be used to avoid useless shifts of the
pattern ‘p’. In other words, it avoids backtracking on the string ‘S’.
• The KMP Matcher
– With string ‘S’, pattern ‘p’ and prefix function ‘Π’ as inputs, it
finds the occurrences of ‘p’ in ‘S’ and reports the number of shifts
of ‘p’ after which each occurrence is found.
The prefix function, Π
The following pseudocode computes the prefix function, Π:

Compute-Prefix-Function (p)
1  m ← length[p]        // ‘p’ is the pattern to be matched
2  Π[1] ← 0
3  k ← 0
4  for q ← 2 to m
5      do while k > 0 and p[k+1] != p[q]
6             do k ← Π[k]
7         if p[k+1] = p[q]
8            then k ← k + 1
9         Π[q] ← k
10 return Π
Example: compute Π for the pattern ‘p’ below:

p = a b a b a c a

Initially: m = length[p] = 7, Π[1] = 0, k = 0

Step 1: q = 2, k = 0, Π[2] = 0
Step 2: q = 3, k = 0, Π[3] = 1
Step 3: q = 4, k = 1, Π[4] = 2
Step 4: q = 5, k = 2, Π[5] = 3
Step 5: q = 6, k = 3, Π[6] = 0
Step 6: q = 7, k = 0, Π[7] = 1

After iterating 6 times, the prefix function computation is complete:

q  1 2 3 4 5 6 7
p  a b a b a c a
Π  0 0 1 2 3 0 1
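
The computation can be reproduced with a direct Python transcription of Compute-Prefix-Function. This is a sketch: the function name is illustrative, and an unused slot `pi[0]` is kept so that the indices q and k line up with the 1-based pseudocode while the Python string itself stays 0-based.

```python
def compute_prefix_function(p: str) -> list[int]:
    """Return [Π[1], ..., Π[m]] for pattern p."""
    m = len(p)
    pi = [0] * (m + 1)          # pi[0] is unused padding for 1-based indexing
    k = 0
    for q in range(2, m + 1):
        # p[k] here is p[k+1] in the 1-based pseudocode, p[q - 1] is p[q]
        while k > 0 and p[k] != p[q - 1]:
            k = pi[k]           # fall back to the next shorter border
        if p[k] == p[q - 1]:
            k += 1
        pi[q] = k
    return pi[1:]


print(compute_prefix_function("ababaca"))   # [0, 0, 1, 2, 3, 0, 1]
```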
The KMP Matcher
The KMP Matcher, with pattern ‘p’, string ‘S’ and prefix function ‘Π’ as input, finds a match of p in S.
The following pseudocode computes the matching component of the KMP algorithm:

KMP-Matcher(S, p)
1  n ← length[S]
2  m ← length[p]
3  Π ← Compute-Prefix-Function(p)
4  q ← 0                               // number of characters matched
5  for i ← 1 to n                      // scan S from left to right
6      do while q > 0 and p[q+1] != S[i]
7             do q ← Π[q]              // next character does not match
8         if p[q+1] = S[i]
9            then q ← q + 1            // next character matches
10        if q = m                     // is all of p matched?
11           then print “Pattern occurs with shift” i – m
12                q ← Π[q]             // look for the next match

Note: KMP finds every occurrence of ‘p’ in ‘S’. That is why KMP does not terminate in step 12; rather, it searches the
remainder of ‘S’ for any more occurrences of ‘p’.
Illustration: given a string ‘S’ and pattern ‘p’ as follows:

S: b a c b a b a b a b a c a a b
p: a b a b a c a

Let us execute the KMP algorithm to find whether ‘p’ occurs in ‘S’.
For ‘p’ the prefix function Π was computed previously and is as follows:

q  1 2 3 4 5 6 7
p  a b a b a c a
Π  0 0 1 2 3 0 1

Initially: n = size of S = 15; m = size of p = 7
Step 1: i = 1, q = 0
Comparing p[1] with S[1]
S: b a c b a b a b a b a c a a b
p: a b a b a c a
p[1] does not match S[1]. ‘p’ will be shifted one position to the right.

Step 2: i = 2, q = 0
Comparing p[1] with S[2]
S: b a c b a b a b a b a c a a b
p:   a b a b a c a
p[1] matches S[2]. Since there is a match, p is not shifted.

Step 3: i = 3, q = 1
Comparing p[2] with S[3]
S: b a c b a b a b a b a c a a b
p:   a b a b a c a
p[2] does not match S[3]. Backtracking on p: q = Π[1] = 0, so p[1] is compared with S[3]. It does not match either.

Step 4: i = 4, q = 0
Comparing p[1] with S[4]
S: b a c b a b a b a b a c a a b
p:       a b a b a c a
p[1] does not match S[4].

Step 5: i = 5, q = 0
Comparing p[1] with S[5]
S: b a c b a b a b a b a c a a b
p:         a b a b a c a
p[1] matches S[5].

Step 6: i = 6, q = 1
Comparing p[2] with S[6]
S: b a c b a b a b a b a c a a b
p:         a b a b a c a
p[2] matches S[6].

Step 7: i = 7, q = 2
Comparing p[3] with S[7]
S: b a c b a b a b a b a c a a b
p:         a b a b a c a
p[3] matches S[7].

Step 8: i = 8, q = 3
Comparing p[4] with S[8]
S: b a c b a b a b a b a c a a b
p:         a b a b a c a
p[4] matches S[8].

Step 9: i = 9, q = 4
Comparing p[5] with S[9]
S: b a c b a b a b a b a c a a b
p:         a b a b a c a
p[5] matches S[9].

Step 10: i = 10, q = 5
Comparing p[6] with S[10]
S: b a c b a b a b a b a c a a b
p:         a b a b a c a
p[6] does not match S[10]. Backtracking on p: after the mismatch q = Π[5] = 3, so p[4] is compared with S[10]. It matches, and the pattern is now aligned two positions further right:
S: b a c b a b a b a b a c a a b
p:             a b a b a c a

Step 11: i = 11, q = 4
Comparing p[5] with S[11]
S: b a c b a b a b a b a c a a b
p:             a b a b a c a
p[5] matches S[11].

Step 12: i = 12, q = 5
Comparing p[6] with S[12]
S: b a c b a b a b a b a c a a b
p:             a b a b a c a
p[6] matches S[12].

Step 13: i = 13, q = 6
Comparing p[7] with S[13]
S: b a c b a b a b a b a c a a b
p:             a b a b a c a
p[7] matches S[13].

Pattern ‘p’ has been found to occur completely in string ‘S’. The total number of shifts
that took place for the match to be found is i – m = 13 – 7 = 6.
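
The trace can be checked with a compact Python sketch of the matcher. It is self-contained, so it recomputes the prefix function internally; the name `kmp_search` is illustrative, everything is 0-based, and `pi[j]` here plays the role of Π[j+1] in the pseudocode.

```python
def kmp_search(s: str, p: str) -> list[int]:
    """Return every shift at which pattern p occurs in string s."""
    n, m = len(s), len(p)
    if m == 0:
        return []
    # Prefix function: pi[j] = length of the longest proper prefix of
    # p[:j + 1] that is also a suffix of it.
    pi = [0] * m
    k = 0
    for j in range(1, m):
        while k > 0 and p[k] != p[j]:
            k = pi[k - 1]
        if p[k] == p[j]:
            k += 1
        pi[j] = k
    shifts = []
    q = 0                              # number of characters matched
    for i in range(n):                 # scan s from left to right, never backing up
        while q > 0 and p[q] != s[i]:
            q = pi[q - 1]              # next character does not match
        if p[q] == s[i]:
            q += 1                     # next character matches
        if q == m:                     # all of p matched
            shifts.append(i + 1 - m)
            q = pi[q - 1]              # look for the next match
    return shifts


print(kmp_search("bacbabababacaab", "ababaca"))   # [6]
```

The result agrees with the shift i – m = 6 found in the trace. Because q is reset to the prefix-function value rather than 0 after a full match, overlapping occurrences are reported as well.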
Running-time analysis

• Compute-Prefix-Function (Π)

1  m ← length[p]        // ‘p’ is the pattern to be matched
2  Π[1] ← 0
3  k ← 0
4  for q ← 2 to m
5      do while k > 0 and p[k+1] != p[q]
6             do k ← Π[k]
7         if p[k+1] = p[q]
8            then k ← k + 1
9         Π[q] ← k
10 return Π

In the above pseudocode for computing the prefix function, the for loop from step 4 to step 9
runs m – 1 times. Steps 1 to 3 take constant time. The inner while loop runs at most m – 1 times
in total, because k increases by at most 1 per for-loop iteration and every while iteration
decreases it. Hence the running time of Compute-Prefix-Function is Θ(m).

• KMP Matcher

1  n ← length[S]
2  m ← length[p]
3  Π ← Compute-Prefix-Function(p)
4  q ← 0
5  for i ← 1 to n
6      do while q > 0 and p[q+1] != S[i]
7             do q ← Π[q]
8         if p[q+1] = S[i]
9            then q ← q + 1
10        if q = m
11           then print “Pattern occurs with shift” i – m
12                q ← Π[q]

The for loop beginning in step 5 runs n times, i.e., once per character of the string ‘S’.
Apart from the Θ(m) call in step 3, steps 1 to 4 take constant time, so the running time of
matching is dominated by this for loop. By the same argument as above, the inner while loop
runs at most n times in total. Thus the running time of the matching function is Θ(n).
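Knuth-Morris-Pratt Algorithm

Annotations on the KMP-MATCHER pseudocode:

• Compute-Prefix-Function: Θ(m), which is in O(n), using amortized analysis
• q: number of characters matched
• for loop: scan text left-to-right; Θ(n) using amortized analysis
• while loop: next character does not match
• first if: next character matches
• second if: is all of P matched? If so, look for the next match
• Total: Θ(m+n)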
Knuth-Morris-Pratt Algorithm

Amortized Analysis: Potential Method

Φ(k) = k, where k = current state of the algorithm

• Initial potential value: k = 0
• Potential is never negative, since Π[k] >= 0 for all k
• Potential decreases in each iteration of the while loop
• Potential increases by <= 1 in each execution of the for loop body
• Amortized cost of the loop body is in O(1)
• Θ(m) loop iterations, so the total is Θ(m), which is in O(n)
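
The potential argument can be observed by counting how often the while loop body actually runs. The sketch below is an illustrative, 0-based version of the prefix-function computation; the function name and the counter are assumptions added for the experiment.

```python
def prefix_function_with_count(p: str):
    """Return (prefix function of p, total number of while-loop iterations)."""
    m = len(p)
    pi = [0] * m
    k = 0                          # the potential: it never drops below 0
    while_iterations = 0
    for j in range(1, m):
        while k > 0 and p[k] != p[j]:
            k = pi[k - 1]          # each iteration strictly decreases k
            while_iterations += 1
        if p[k] == p[j]:
            k += 1                 # k grows by at most 1 per for-loop iteration
        pi[j] = k
    return pi, while_iterations


print(prefix_function_with_count("ababaca"))   # ([0, 0, 1, 2, 3, 0, 1], 2)
print(prefix_function_with_count("aaaab"))     # ([0, 1, 2, 3, 0], 3)
```

Since k rises at most m – 1 times and every while iteration lowers it, the counter can never exceed m – 1, which is what keeps the total cost linear.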
