String Matching
String Matching
String Matching
(KCS 503)
UNIT-5
String Matching
From:
Asif Khan
Department of CSE
GNIOT, Greater Noida
Reference:
Thomas H. Corman et al.
The MIT Press
String Matching
Text-editing programs frequently need to find all
occurrences of a pattern in the text. Typically, the
text is a document being edited, and the pattern
searched for is a particular word supplied by the
user.
Efficient algorithms for this problem called “string
matching”. These algorithms can greatly aid the
responsiveness of the text-editing program.
Among their many other applications, string-
matching algorithms search for particular patterns in
DNA sequences. Internet search engines also use
them to find Web pages relevant to queries.
String Matching
We formalize the string-matching problem as
follows.
We assume that the text is an array T[1..n] of length
n and that the pattern is an array P[1..m] of length
m≤ n.
We further assume that the elements of P and T are
characters drawn from a finite alphabet ∑. For
example, we may have ∑={0,1} or ∑={a,b…,z}.
The character arrays P and T are often called strings
of characters.
String Matching
We say that pattern P occurs with shift s in text T if
0 < s < n-m and T[s+1..s+m] = P[1..m]. If P occurs
with shift s in T , then we call s a valid shift;
otherwise, we call s an invalid shift.
The string-matching problem is the problem of
finding all valid shifts with which a given pattern P
occurs in a given text T.
String Matching
for i= 1 to m
if P[i] != T[s+i] break;
if i=m+1 Print “--------”,
The naive string-matching algorithm
Following figures portray the naive string-matching
procedure as sliding a “template” containing the
pattern over the text.
The naive string-matching algorithm
Procedure NAIVE-STRING-MATCHER takes time
O((n-m+1)m), and this bound is tight in the worst
case.
For example, consider the text string an(a string of n
a’s) and the pattern am. For each of the n-m+1
possible values of the shift s, the loop on line 4 to
compare corresponding characters must execute m
times to validate the shift.
The worst-case running time is thus θ((n-m+1)m),
which is θ(n2), if m=n/2.
The Rabin-Karp algorithm
Rabin and Karp proposed a string-matching
algorithm that performs well in practice.
The Rabin-Karp algorithm uses θ(m) preprocessing
time, and its worst-case running time is
θ((n-m+1)m).
Based on certain assumptions, however, its average-
case running time is better.
This algorithm makes use of elementary number-
theoretic notions such as the equivalence of two
numbers modulo a third number.
The Rabin-Karp algorithm
For expository purposes, let us assume that
∑={0,1,2…..,9}, so that each character is a decimal
digit. In the general case, we can assume that each
character is a digit in radix-d notation, where d=|∑|.
We can then view a string of k consecutive
characters as representing a length-k decimal
number. The character string 31415 thus
corresponds to the decimal number 31,415.
The Rabin-Karp algorithm
Let we are given a pattern P =[1..m], let p denote its
corresponding decimal value. In a similar manner,
given a text T =[1..n], let ts denote the decimal value
of the length-m substring T[s+1..s+m], for s = 0,1…
n-m. Certainly, ts = p if and only if T[s+1..s+m] = P
=[1..m]; thus, s is a valid shift if and only if ts = p.
If we could compute p in time θ(m) and all the ts
values in a total of θ(n-m+1) time, then we could
determine all valid shifts s in time θ(m) + θ(n-m+1)
= θ(n) by comparing p with each of the ts values.
The Rabin-Karp algorithm
Until now, we have intentionally overlooked one
problem: p and ts may be too large to work with
conveniently. If P contains m characters, then we
cannot reasonably assume that each arithmetic
operation on p (which is m digits long) takes
“constant time.” Fortunately, we can solve this
problem easily by compute p and the ts values
modulo a suitable modulus q. We can compute p
modulo q in θ(m) time and all the ts values modulo q
in θ(n-m+1) time. If we choose the modulus q as a
prime, then we can perform all the necessary
computations with single-precision arithmetic.
The Rabin-Karp algorithm
The Rabin-Karp Algorithm
T= 1 3 2 6 3 2 3 1 4 5 3 6 1
P= 2 3 1 4
Let q=7 is given, So
p = 4, h=10m-1 mod q =103 mod 7= 1000 mod 7 =6
t= 3 1 0 2 4 4 2 4 0 6