Pattern Matching Algo
Pattern Matching Algo
Pattern Matching Algo
• Strings
• Pattern matching algorithms
– Brute-force algorithm
– Boyer-Moore algorithm
– Knuth-Morris-Pratt algorithm
– Demos
1
Strings
• Let P be a string of size m
• A string is a sequence of
characters – A substring P[i .. j] of P is the
subsequence of P consisting
• Examples of strings: of the characters with ranks
– Java program between i and j
– HTML document – A prefix of P is a substring of
– DNA sequence the type P[0 .. i]
– Digitized image – A suffix of P is a substring of
• An alphabet S is the set of the type P[i ..m - 1]
possible characters for a • Given strings T (text) and P
family of strings (pattern), the pattern matching
problem consists of finding a
• Example of alphabets:
substring of T equal to P
– ASCII
• Applications:
– Unicode
– Text editors
– {0, 1}
– Search engines
– {A, C, G, T}
– Biological research
Algorithms: String Matching: Rajeev Wankar
3
Pattern compress
2
String-matching problem
text T a c a a b c
pattern P a a b
S =2
Brute-Force Algorithm
3
Brute-Force Algorithm
a c a a b c a c a a b c a c a a b c a c a a b c
a a b a a b a a b a a b
S =0 S =1 S =2 S =3
1 2 3 4
pattern P a a a b
1 2 3 4 5 6 7 8 9 10 11 12 13
text T a a a a a a a a a a a a a
a a a b
a a a b
Algorithms: String Matching: Rajeev Wankar
8
4
Worst-case Analysis
Rabin-Karp Algorithm
5
Rabin-Karp Algorithm
Division Theorem
0 ≤ r < n and a = q · n + r
E.g., 23 = 3 · 7 + 2, -19 = -3 · 7 + 2
Rabin-Karp Algorithm
Modular Equivalence
6
Rabin-Karp Algorithm
Modular Arithmetic
Rabin-Karp Algorithm
• Given a text T [1..n], let ts denote the decimal value of the length-
m substring T [(s+1)..(s+m)] for s=0,1,…,(n-m)
7
Rabin-Karp Algorithm
• Other t1, t2,…, tn-m can be computed in O(n-m) time since ts+1 can
be computed from ts in constant time
Rabin-Karp Algorithm
• We can compute p, t0, t1,…, tn-m in O(n+m) time
8
Rabin-Karp Algorithm
• RABIN-KARP-MATCHER(T, P, d, q)
1 n ← length[T]
2 m ← length[P]
3 h ← dm-1 mod q
4 p←0
5 t0 ← 0
6 for i ← 1 to m . Preprocessing.
7 do p ← (dp + P[i]) mod q
8 t0 ← (dt0 + T[i]) mod q
9 for s ← 0 to n - m . Matching.
10 do if p = ts
11 then if P[1 m] = T [s + 1 s + m]
12 then print "Pattern occurs with shift" s
13 if s < n - m
14 then ts+1 ← (d(ts - T[s + 1]h) + T[s + m + 1]) mod q
Rabin-Karp Algorithm
• How values modulo 13 are computed
9
Rabin-Karp Algorithm
Problem of Spurious Hits
Rabin-Karp Algorithm
10
Rabin-Karp Algorithm
• Basic structure like the naïve algorithm, but uses
modular arithmetic as described
a p a t t e r n m a t c h i n g a l g o r i t h m
1 3 5 11 10 9 8 7
r i t h m r i t h m P r i t h m r i t h m
2 4 6
r i t h m r i t h m r i t h m
11
Last-Occurrence Function
• Example:
α a b c d
S = {a, b, c, d}
L(α) 5 6 4 -1
– P = abacab
Algorithm lastOccurenceFunction(P, m, S)
for each character α S
do L[α] = 0
for j 1 to m
do L[P[α]] j
return L
j 1 2 3 4 5 6
P(j) a b a c a b
α 97 98 99 100
L(α) 5 6 4 -1
12
Boyer-Moore’s Algorithm (2)
1+l
Algorithms: String Matching: Rajeev Wankar
25
Example
a b a c a a b a d c a b a c a b a a b b
1
a b a c a b
4 3 2 13 12 11 10 9 8
a b a c a b a b a c a b
5 7
a b a c a b a b a c a b
6
a b a c a b
13
Example
a b a c a a b a d c a b a c a b a a b b
1
a b a c a b
4 3 2 13 12 11 10 9 8
a b a c a b a b a c a b
5 7
a b a c a b a b a c a b
6
a b a c a b
Example
a b a c a a b a d c a b a c a b a a b b
1
a b a c a b
4 3 2 13 12 11 10 9 8
a b a c a b a b a c a b
5 7
a b a c a b a b a c a b
6
a b a c a b
14
Example
a b a c a a b a d c a b a c a b a a b b
1
a b a c a b
4 3 2 13 12 11 10 9 8
a b a c a b a b a c a b
5 7
a b a c a b a b a c a b
6
a b a c a b
Example
a b a c a a b a d c a b a c a b a a b b
1
a b a c a b
4 3 2 13 12 11 10 9 8
a b a c a b a b a c a b
5 7
a b a c a b a b a c a b
6
a b a c a b
15
Example
a b a c a a b a d c a b a c a b a a b b
1
a b a c a b
4 3 2 13 12 11 10 9 8
a b a c a b a b a c a b
5 7
a b a c a b a b a c a b
6
a b a c a b
Example
a b a c a a b a d c a b a c a b a a b b
1
a b a c a b
4 3 2 13 12 11 10 9 8
a b a c a b a b a c a b
5 7
a b a c a b a b a c a b
6
a b a c a b
16
Analysis
Time Complexity
When the pattern does occur in the text, running time of the
original algorithm is O(nm) in the worst case.
17
Knuth-Morris-Pratt Algorithm (KMP’s) Algorithm (1)
• Knuth-Morris-Pratt’s algorithm
preprocesses the pattern to find
matches of prefixes of the
pattern with the pattern itself
• The prefix function F(i) is
defined as the size of the largest
prefix of P[0..j] that is also a . . a b a a b x . . . . .
proper suffix of P[1..j]
• Knuth-Morris-Pratt’s algorithm
modifies the brute-force
a b a a b a
algorithm so that if a mismatch
occurs at P[j] T[i] we set j
j F(j - 1)
a b a a b a
F(j - 1)
18
i 1 2 3 4 5 6 7 8 9 10
P[i] a b a b a b a b c a
F[i] 0 0 1 2 3 4 5 6 0 1
P8 a b a b a b a b c a
P6 a b a b a b a b c a F[8]=6
P4 a b a b a b a b c a F[6]=4
P2 a b a b a b a b c a F[4]=2
P0 a b a b a b a b c a F[2]=0
i 1 2 3 4 5 6 7 8 9 10
P[i] a b a b a b a b c a
F[i] 0 0 1 2 3 4 5 6 0 1
P8 a b a b a b a b c a
P6 a b a b a b a b c a F[8]=6
P4 a b a b a b a b c a F[6]=4
P2 a b a b a b a b c a F[4]=2
P0 a b a b a b a b c a F[2]=0
19
KMP’s Algorithm (2)
20
Example
a b a c a a b a c c a b a c a b a a b b
1 2 3 4 5 6
a b a c a b
7
a b a c a b
8 9 10 11 12
a b a c a b
13
a b a c a b
j 0 1 2 3 4 5
P[j] a b a c a b 14 15 16 17 18 19
F(j) 0 0 1 0 1 2 a b a c a b
21