String Matching


Design & Analysis of Algorithms

(KCS 503)
UNIT-5
String Matching

From:
Asif Khan
Department of CSE
GNIOT, Greater Noida

Reference:
Thomas H. Cormen et al.
Introduction to Algorithms, The MIT Press
String Matching
Text-editing programs frequently need to find all
occurrences of a pattern in the text. Typically, the
text is a document being edited, and the pattern
searched for is a particular word supplied by the
user.
Efficient algorithms for this problem are called
“string-matching” algorithms. These algorithms can
greatly aid the responsiveness of the text-editing
program.
Among their many other applications, string-
matching algorithms search for particular patterns in
DNA sequences. Internet search engines also use
them to find Web pages relevant to queries.
String Matching
We formalize the string-matching problem as
follows.
We assume that the text is an array T[1..n] of length
n and that the pattern is an array P[1..m] of length
m≤ n.
We further assume that the elements of P and T are
characters drawn from a finite alphabet ∑. For
example, we may have ∑ = {0,1} or ∑ = {a,b,…,z}.
The character arrays P and T are often called strings
of characters.
String Matching
We say that pattern P occurs with shift s in text T if
0 ≤ s ≤ n-m and T[s+1..s+m] = P[1..m]. If P occurs
with shift s in T, then we call s a valid shift;
otherwise, we call s an invalid shift.
The string-matching problem is the problem of
finding all valid shifts with which a given pattern P
occurs in a given text T.
String Matching

This is an example of the string-matching problem,
where we want to find all occurrences of the pattern
P = abaa in the text T = abcabaabcabac. The pattern
occurs only once in the text, at shift s = 3, which we
call a valid shift.
The naive string-matching algorithm
The naive algorithm finds all valid shifts using a
loop that checks the condition P[1..m] = T[s+1..s+m]
for each of the n-m+1 possible values of s.

NAIVE-STRING-MATCHER(T, P)
1  n = T.length
2  m = P.length
3  for s = 0 to n - m
4      for i = 1 to m
5          if P[i] ≠ T[s+i] then break
6      if i = m + 1
7          print “Pattern occurs with shift” s
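The loop above can be sketched in Python (0-indexed shifts; slicing performs the inner character-comparison loop, and the function name is ours, not CLRS's):

```python
def naive_string_matcher(T, P):
    """Check every possible shift; O((n-m+1)m) in the worst case."""
    n, m = len(T), len(P)
    shifts = []
    for s in range(n - m + 1):
        # compare corresponding characters of P and T[s..s+m-1]
        if T[s:s + m] == P:
            shifts.append(s)
    return shifts

print(naive_string_matcher("abcabaabcabac", "abaa"))  # [3]
```

On the slide's example, the only valid (0-indexed) shift is s = 3.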
The naive string-matching algorithm
Following figures portray the naive string-matching
procedure as sliding a “template” containing the
pattern over the text.
The naive string-matching algorithm
Procedure NAIVE-STRING-MATCHER takes time
O((n-m+1)m), and this bound is tight in the worst
case.
For example, consider the text string aⁿ (a string of n
a’s) and the pattern aᵐ. For each of the n-m+1
possible values of the shift s, the loop on line 4 that
compares corresponding characters must execute m
times to validate the shift.
The worst-case running time is thus θ((n-m+1)m),
which is θ(n²) when m = ⌊n/2⌋.
The Rabin-Karp algorithm
Rabin and Karp proposed a string-matching
algorithm that performs well in practice.
The Rabin-Karp algorithm uses θ(m) preprocessing
time, and its worst-case running time is
θ((n-m+1)m).
Based on certain assumptions, however, its average-
case running time is better.
This algorithm makes use of elementary number-
theoretic notions such as the equivalence of two
numbers modulo a third number.
The Rabin-Karp algorithm
For expository purposes, let us assume that
∑ = {0,1,2,…,9}, so that each character is a decimal
digit. In the general case, we can assume that each
character is a digit in radix-d notation, where d=|∑|.
We can then view a string of k consecutive
characters as representing a length-k decimal
number. The character string 31415 thus
corresponds to the decimal number 31,415.
The Rabin-Karp algorithm
Suppose we are given a pattern P[1..m]; let p denote its
corresponding decimal value. Similarly, given a text
T[1..n], let ts denote the decimal value of the
length-m substring T[s+1..s+m], for s = 0,1,…,n-m.
Certainly, ts = p if and only if T[s+1..s+m] =
P[1..m]; thus, s is a valid shift if and only if ts = p.
If we could compute p in time θ(m) and all the ts
values in a total of θ(n-m+1) time, then we could
determine all valid shifts s in time θ(m) + θ(n-m+1)
= θ(n) by comparing p with each of the ts values.
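Computing p in θ(m) time uses Horner's rule; a minimal sketch (digits given as a list, function name is ours):

```python
def horner_value(digits, d=10):
    """Horner's rule: evaluate a radix-d digit string in Theta(m) time."""
    p = 0
    for digit in digits:
        p = d * p + digit  # shift accumulated value left by one digit, add next
    return p

print(horner_value([3, 1, 4, 1, 5]))  # 31415
```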
The Rabin-Karp algorithm
Until now, we have intentionally overlooked one
problem: p and ts may be too large to work with
conveniently. If P contains m characters, then we
cannot reasonably assume that each arithmetic
operation on p (which is m digits long) takes
“constant time.” Fortunately, we can solve this
problem easily by computing p and the ts values
modulo a suitable modulus q. We can compute p
modulo q in θ(m) time and all the ts values modulo q
in θ(n-m+1) time. If we choose the modulus q as a
prime, then we can perform all the necessary
computations with single-precision arithmetic.
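A sketch of the whole algorithm in Python, following the description above (decimal digits, d = 10; since hashes can collide, an equal hash is verified character by character to rule out spurious hits; names are ours):

```python
def rabin_karp_matcher(T, P, d=10, q=13):
    """Return all valid shifts of digit-list P in digit-list T."""
    n, m = len(T), len(P)
    h = pow(d, m - 1, q)              # value of the high-order digit position, mod q
    p = 0
    t = 0
    for i in range(m):                # preprocessing: Theta(m)
        p = (d * p + P[i]) % q
        t = (d * t + T[i]) % q
    shifts = []
    for s in range(n - m + 1):
        if p == t:                    # hash hit: verify to rule out spurious hits
            if T[s:s + m] == P:
                shifts.append(s)
        if s < n - m:                 # rolling update: drop T[s], append T[s+m]
            t = (d * (t - T[s] * h) + T[s + m]) % q
    return shifts

print(rabin_karp_matcher([1, 3, 2, 6, 3, 2, 3, 1, 4, 5, 3, 6, 1],
                         [2, 3, 1, 4], d=10, q=7))  # [5]
```

Python's `%` already returns a nonnegative result for negative operands, which is why the rolling update needs no explicit correction.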
The Rabin-Karp algorithm
T = 1 3 2 6 3 2 3 1 4 5 3 6 1
P = 2 3 1 4
Let q = 7 be given. So
p = 2314 mod 7 = 4,  h = 10^(m-1) mod q = 10³ mod 7 = 1000 mod 7 = 6
t = 3 1 0 2 4 4 2 4 0 6   (the values ts for s = 0,1,…,9)

The rolling update from t0 to t1:
t1 = (10 (t0 - T[1]·h) + T[5]) mod 7
   = (10 (3 - 1·6) + 3) mod 7
   = -27 mod 7 = 7 - (27 mod 7) = 1
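The numbers above can be checked mechanically; this sketch recomputes p, h, and every ts using the rolling update:

```python
T = [1, 3, 2, 6, 3, 2, 3, 1, 4, 5, 3, 6, 1]
P = [2, 3, 1, 4]
d, q, m = 10, 7, len(P)

h = pow(d, m - 1, q)                      # 10^3 mod 7 = 6
p = 0
for digit in P:
    p = (d * p + digit) % q               # 2314 mod 7 = 4

t = [0] * (len(T) - m + 1)
for digit in T[:m]:
    t[0] = (d * t[0] + digit) % q         # t_0 = 1326 mod 7 = 3
for s in range(len(T) - m):
    # rolling update: drop T[s], shift left one digit, append T[s+m], all mod q
    t[s + 1] = (d * (t[s] - T[s] * h) + T[s + m]) % q

print(p, h, t)  # 4 6 [3, 1, 0, 2, 4, 4, 2, 4, 0, 6]
```

Note that t5 = 4 = p is the genuine match (shift s = 5), while t7 = 4 is a spurious hit that the verification step would reject.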
String matching with finite automata
Many string-matching algorithms build a finite
automaton that scans the text string T for all
occurrences of the pattern P. This section presents a
method for building such an automaton. These
string-matching automata are very efficient: they
examine each text character exactly once, taking
constant time per text character. The matching time
used—after preprocessing the pattern to build the
automaton—is therefore θ(n). The time to build the
automaton, however, can be large if ∑ is large.
String matching with finite automata
In order to specify the string-matching automaton
corresponding to a given pattern P[1..m], we first
define an auxiliary function σ, called the suffix
function corresponding to P. σ(x) is the length of the
longest prefix of P that is also a suffix of x:
σ(x) = max {k : Pk ⊐ x}
where Pk ⊐ x means that Pk (the prefix P[1..k]) is a
suffix of x.
We define the string-matching automaton that
corresponds to a given pattern P[1..m] as follows:
•The state set Q is {0,1,…,m}. The start state q0 is
state 0, and state m is the only accepting state.
•The transition function is defined by the following
equation, for any state q and character a:
String matching with finite automata
δ(q, a) = σ(Pq a)
where Pq a denotes the string P[1..q] followed by the
character a.
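A sketch of this construction in Python. σ is evaluated here by brute force, so building δ is slow for long patterns (the θ-large-∑ cost mentioned above), but the matcher itself examines each text character exactly once. Function names are ours:

```python
def compute_transition_function(P, alphabet):
    """Build delta[q][a] = sigma(P_q a) for all states q and characters a."""
    m = len(P)
    delta = [{} for _ in range(m + 1)]
    for q in range(m + 1):
        for a in alphabet:
            # sigma(P_q a): largest k such that P[:k] is a suffix of P[:q] + a
            k = min(m, q + 1)
            while k > 0 and P[:k] != (P[:q] + a)[-k:]:
                k -= 1
            delta[q][a] = k
    return delta

def finite_automaton_matcher(T, delta, m):
    """Scan T once, reporting 0-indexed shifts where the accept state m is reached."""
    shifts = []
    q = 0
    for i, a in enumerate(T):        # each text character examined exactly once
        q = delta[q][a]
        if q == m:
            shifts.append(i - m + 1)
    return shifts

delta = compute_transition_function("abaa", "abc")
print(finite_automaton_matcher("abcabaabcabac", delta, 4))  # [3]
```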
The Knuth-Morris-Pratt algorithm
We now present a linear-time string-matching
algorithm due to Knuth, Morris, and Pratt. This
algorithm avoids computing the transition function δ
altogether, and its matching time is θ(n), using just an
auxiliary function π, which we precompute from the
pattern in time θ(m) and store in an array π[1..m].
The prefix function for a pattern: The prefix
function π for a pattern encapsulates knowledge about
how the pattern matches against shifts of itself. We
can take advantage of this information to avoid testing
useless shifts in the naive pattern-matching algorithm
and to avoid precomputing the full transition function
for a string-matching automaton.
The Knuth-Morris-Pratt algorithm
In general, it is useful to know the answer to the
following question:
Given that pattern characters P[1..q] match text
characters T[s+1..s+q], what is the least shift s’ > s
such that for some k < q,
P[1..k] = T[ s’+1..s’+k];
where s’+k = s+q.
In other words, knowing that Pq ⊐ Ts+q (that is, Pq is
a suffix of Ts+q), we want the longest proper prefix
Pk of Pq that is also a suffix of Ts+q.
We can precompute the necessary information by
comparing the pattern against itself,
The Knuth-Morris-Pratt algorithm
The Knuth-Morris-Pratt algorithm
We formalize the information that we precompute as
follows. Given a pattern P[1..m], the prefix function
for the pattern P is the function
π : {1,2,…,m} → {0,1,…,m-1} such that
π[q] = max {k : k < q and Pk ⊐ Pq}
That is, π[q] is the length of the longest prefix of P
that is a proper suffix of Pq.
The complete prefix function for the pattern
P = ababaca:
q    : 1 2 3 4 5 6 7
P[q] : a b a b a c a
π[q] : 0 0 1 2 3 0 1
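The prefix-function computation and the matcher can be sketched in Python. Arrays are 0-indexed here, so pi[q-1] plays the role of π[q] from the definition above:

```python
def compute_prefix_function(P):
    """Compute pi by matching the pattern against itself; Theta(m) amortized."""
    m = len(P)
    pi = [0] * m
    k = 0                        # length of the current matched prefix
    for q in range(1, m):
        while k > 0 and P[k] != P[q]:
            k = pi[k - 1]        # fall back to the next-longest candidate prefix
        if P[k] == P[q]:
            k += 1
        pi[q] = k
    return pi

def kmp_matcher(T, P):
    """Return all 0-indexed valid shifts of P in T; Theta(n) matching time."""
    n, m = len(T), len(P)
    pi = compute_prefix_function(P)
    shifts = []
    q = 0                        # number of pattern characters matched so far
    for i in range(n):
        while q > 0 and P[q] != T[i]:
            q = pi[q - 1]        # shift the pattern without rescanning T
        if P[q] == T[i]:
            q += 1
        if q == m:
            shifts.append(i - m + 1)
            q = pi[q - 1]        # look for the next, possibly overlapping, match
    return shifts

print(compute_prefix_function("ababaca"))       # [0, 0, 1, 2, 3, 0, 1]
print(kmp_matcher("abcabaabcabac", "abaa"))     # [3]
```

The prefix-function values printed for ababaca agree with the table above.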
THANK YOU
