String Matching
Spring 2014:
This will be presented in half-day "special topic" format. There will be no screencast,
quiz or problem set. We will have a brief lecture followed by an in-class activity,
based on sections 32.1 (the naive algorithm) and 32.3 (string matching with finite
state automata), both of which are reviewed in these notes.
While you are not likely to need to design or implement a string matching algorithm
(they are built into modern software tools), questions about string matching
algorithms reportedly still show up in job interviews (study KMP section 32.4 for
this). It is also worth being familiar with this topic for the concept of finite-state
automata as a form of table-driven algorithm, and for the idea of algorithms that
construct other algorithms (here, the FSA).
Outline
1. String Matching Introduction
2. Naive (Brute Force) String Matching
3. Matching with Finite State Automata
4. Knuth-Morris-Pratt Algorithm
5. FYI: Rabin-Karp Algorithm
Given a text T[1 .. n] and a pattern P[1 .. m] (m ≤ n) over an alphabet Σ, the pattern P
occurs with shift s if T[s+1 .. s+m] = P[1 .. m]; such a shift is called valid. The string
matching problem is the problem of finding all valid shifts.
A simple example: in the text T = abcabaabcabac, the pattern P = abaa occurs with the
single valid shift s = 3.
Applications
Brief History
In 1970 Cook proved a theoretical result implying that an O(M + N) algorithm is
possible. His theorem was not intended to be practical, but Knuth and Pratt followed the structure of the theorem to
design an algorithm. Around the same time, Morris derived the algorithm
independently as part of his design of a text editor. Eventually they found out about
each other's work and published it in 1976. Also in 1976, Boyer and Moore found
another efficient algorithm not covered here: see the Sedgewick book. In 1980 Rabin
& Karp came up with a modification to brute force that searches on chunks of text of
size M using a hash function. Researchers continue to come up with new algorithms.
Algorithms
The algorithms we cover are summarized by this table from the text (n = |T|, m = |P|):

Algorithm             Preprocessing time    Matching time
Naive                 0                     O((n − m + 1) m)
Rabin-Karp            Θ(m)                  O((n − m + 1) m)
Finite automaton      O(m |Σ|)              Θ(n)
Knuth-Morris-Pratt    Θ(m)                  Θ(n)
Today we will cover how these algorithms work, but not any
proofs of correctness, which you may find in the text.
Comment on String Matching Time: The test of whether "x == y" takes Θ(t + 1)
time, where t is the length of the longest string z such that z ⊏ x and z ⊏ y. (The "1" is
included to cover the case where t = 0, since a positive amount of time must be
expended to determine this fact.) This comparison loop will be implicit in some of our
pseudocode.
The obvious approach is to start at the first character of T, T[1], and then step
through T and P together, checking to see whether the characters match.
Once P has been checked at a given shift s (whether or not it matched there), go on to
checking at shift s + 1 (since we are looking for all matches), up until s = |T| − |P|
(which is n − m).
Example
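As a concrete illustration, here is a minimal Python sketch of the brute force matcher
(using 0-based shifts), run on the text and pattern from the introductory example:

def naive_match(T, P):
    """Return all valid shifts (0-based) at which pattern P occurs in text T."""
    n, m = len(T), len(P)
    shifts = []
    for s in range(n - m + 1):       # try every possible shift
        if T[s:s + m] == P:          # implicit character-by-character comparison
            shifts.append(s)
    return shifts

print(naive_match("abcabaabcabac", "abaa"))   # [3]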
Analysis
No preprocessing is required. The outer loop tries n − m + 1 shifts, and each check may
compare up to m characters, so the worst-case matching time is Θ((n − m + 1) m).
Inefficiencies
The brute force method does not use information about what has been matched, and
does not use information about recurrences within the pattern itself. Consideration of
these factors leads to improvements.
Algorithms that construct (or simulate) finite state automata can be very efficient
matchers, but they require some preprocessing to construct.
Finite state automata are widely used in computability theory as well as in practical
algorithms. It's worth knowing about them even if you are not doing string matching.
A finite state automaton (FSA) M is a tuple (Q, q0, A, Σ, δ), where Q is a finite set of
states, q0 ∈ Q is the start state, A ⊆ Q is the set of accepting states, Σ is a finite input
alphabet, and δ : Q × Σ → Q is the transition function.
The FSA starts in state q0. As each character (symbol) of the input string is read, it
uses δ to determine what state to transition into. Whenever M is in a state belonging to
A, the input read so far is accepted.
We define the final state function φ : Σ* → Q such that φ(w) is the final state M ends
up in after reading w:
φ(ε) = q0
φ(wa) = δ(φ(w), a) for w ∈ Σ*, a ∈ Σ
Let's see how they work by an example. Consider the string matching automaton for
P = ababaca. The start state is 0 and the only accepting state is 7. In the text's state
diagram, any transitions not drawn (e.g., reading c while in states 1, 2, 3, 4, 6, or 7)
go back to state 0. The full transition function is given by this table:

state    a   b   c
  0      1   0   0
  1      1   2   0
  2      3   0   0
  3      1   4   0
  4      5   0   0
  5      1   4   6
  6      7   0   0
  7      1   2   0

A successful match passes through the sequence of states 0, 1, 2, 3, 4, 5, 6, 7, reading
the pattern characters a, b, a, b, a, c, a in turn; in the textbook's figure these
transitions are shaded in the table and shown as darkened arrows in the diagram.
We can run this automaton continuously on a text T. If and when state 7 is entered
after reading T[i], we output a match: it occurs with shift i − m, i.e., it begins at
position i − m + 1.
The following code simulates any FSA on input text T, given the FSA's table δ and
pattern length m:
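A minimal Python sketch of the matcher, assuming δ is given as a dictionary mapping
(state, character) pairs to next states, as in the table above, with missing entries
treated as transitions to state 0, and using 0-based text indexing:

def finite_automaton_matcher(T, delta, m):
    """Simulate the string matching automaton: delta maps (state, char) -> state,
    m is the pattern length (the accepting state)."""
    q = 0                                 # start in state q0 = 0
    for i, a in enumerate(T):
        q = delta.get((q, a), 0)          # read one character, follow delta
        if q == m:                        # accepting state reached: a match ends at T[i]
            print("Pattern occurs with shift", i - m + 1)   # 0-based start index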
Unlike the brute force approach, past work is not thrown away:
the transitions following either failure to match a character or
success to match the entire pattern pick up with the input read so
far.
For example, consider a failure transition: at i = 5, ababa has been matched in
T[1 .. 5], and c was expected but T[6] = b was read instead. Rather than starting over,
the FSA transitions to state δ(5, b) = 4, indicating that the pattern prefix P4 = abab
matches the present text suffix T[3 .. 6].
This makes FSAs much more efficient than brute force in the matching phase. In fact,
matching is Θ(n). But how do we build them?
In general, the FSA is constructed so that the state number tells us how much of a
prefix of P has been matched.
If the pattern P is of length m and the FSA is in state m, then the pattern has
been matched.
If the state number is smaller than m, then the state number is the length of the
prefix of P matched.
To define δ we first need the suffix function for P: let σ(x) be the length of the
longest prefix of P that is also a suffix of x. For example, with P = ab:
σ(ε) = 0
σ(ccaca) = 1
σ(ccab) = 2
(For simplicity, we leave out the subscript P when it is clear from the context.) Then
we can define the automaton for pattern P[1 .. m] as:
Q = {0, 1, ..., m}
q0 = 0
A = {m}
δ(q, a) = σ(Pq a) for any state q and character a. (Pq a is the concatenation of
the first q characters of P with the character a.)
By defining δ(q, a) = σ(Pq a), the state of the FSA keeps track of the longest prefix of
the pattern P that has matched the input text T so far.
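A tiny Python sketch of σ under this definition (a helper added here for illustration;
it is not part of the text's pseudocode):

def sigma(P, x):
    """Length of the longest prefix of P that is also a suffix of x."""
    k = min(len(P), len(x))
    while k > 0 and not x.endswith(P[:k]):
        k -= 1
    return k

print(sigma("ab", "ccaca"), sigma("ab", "ccab"))   # 1 2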
In order for a substring of T ending at T[i] to match some prefix Pj, that prefix Pj
must be a suffix of T[1 .. i].
We design δ such that the state q = φ(T[1 .. i]) gives the length of the longest prefix
of P that matches a suffix of T[1 .. i]. We have φ(T[1 .. i]) = σ(T[1 .. i]): after reading
the first i characters of T, the automaton is in the state given by the suffix function.
Given a match so far to Pq (which may be ε) and reading character a, there are two
kinds of transitions:
If a = P[q+1], the character extends the match and the automaton moves forward to
state q + 1 (these are the transitions followed during a successful match).
Otherwise, the automaton falls back to state σ(Pq a), the length of the longest prefix
of P that is still a suffix of the input read so far.
Preprocessing Procedure
The following procedure computes the transition function δ from a pattern P[1 .. m].
The nested loops process all combinations of states q and characters a needed for the
cells of the table representing δ.
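A straightforward Python sketch of this preprocessing, following the definition
δ(q, a) = σ(Pq a) directly (Sigma is assumed to be given as a string or list of
characters):

def compute_transition_function(P, Sigma):
    """Build delta for the string matching automaton of P: delta[(q, a)] = sigma(P_q a)."""
    m = len(P)
    delta = {}
    for q in range(m + 1):               # every state 0 .. m
        for a in Sigma:                  # every input character
            x = P[:q] + a                # the string P_q followed by a
            k = min(m, q + 1)            # longest candidate prefix of P
            while k > 0 and not x.endswith(P[:k]):
                k -= 1                   # shrink until P_k is a suffix of P_q a
            delta[(q, a)] = k
    return delta

delta = compute_transition_function("ababaca", "abc")
print(delta[(5, 'b')])                   # 4, as in the failure example above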
Analysis
The straightforward procedure runs in O(m^3 |Σ|) time: the nested loops contribute a
factor of (m + 1)|Σ|, the inner loop that decreases k runs O(m) times, and each suffix
test compares O(m) characters. The text notes that δ can in fact be computed in
O(m |Σ|) time. Once δ is built, matching takes Θ(n).
Knuth-Morris-Pratt Algorithm
The Knuth-Morris-Pratt algorithm improves on the FSA algorithm by avoiding
computation of δ.
π[q] is the length of the longest prefix of P that is a proper suffix of Pq. This
information enables fast amortized computation of δ(q, a) on the fly.
The KMP matcher operates like the FSA matcher, but using π instead of δ. There is
some additional overhead at matching time, but asymptotic performance of matching
remains Θ(n).
How π Works
Given P[1 .. m], the prefix function π for P is π : {1, 2, ..., m} → {0, 1, ..., m−1} such
that
π[q] = max { k : k < q and Pk is a suffix of Pq }.
For example, for P = ababaca, π is as follows (let's use the formula above to check a
few of the entries, particularly π[5]: the largest k < 5 with Pk a suffix of P5 = ababa is
k = 3, since aba is both a prefix and a suffix of ababa):

i      1  2  3  4  5  6  7
P[i]   a  b  a  b  a  c  a
π[i]   0  0  1  2  3  0  1
The table is structured such that a failure at a given position will drive the system to
hop to the next smaller prefix that could be matched. (Here I am skipping over the
textbook's discussion of π* and Lemma 32.5, and using examples instead).
For example, for P = ababaca, suppose q = 5 (ababa has been matched) and we are
reading the next text character. Consider what happens when that character is c, b or a:
If it is c, it matches P[6], so we advance to q = 6.
If it is b, P[6] does not match, so we fall back to q = π[5] = 3; the character then
matches P[4], so we advance to q = 4.
If it is a, we fall back from 5 to π[5] = 3, then (since P[4] = b does not match) to
π[3] = 1, then to π[1] = 0; finally P[1] = a matches and we advance to q = 1.
Thus, π helps us retain information from prior comparisons, shifting back to the state
for the maximum prefix matched so far, rather than starting over.
Both the KMP matcher and the algorithm for computing π are similar to the
FSA's Compute-Transition-Function (take the time to compare them when you read
the text).
The KMP matching code is sketched below. The "skipping" discussed above happens in
the inner while loop, which repeatedly falls back via π on a mismatch. Successful
matching of the next character is handled by the if that advances q. After a complete
match is reported, we jump to state π[m] to continue looking for the next occurrence.
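A Python sketch of the matcher (0-based text indexing; compute_prefix_function,
sketched in the next subsection, returns π as a list indexed 1 .. m):

def kmp_matcher(T, P):
    """Find all occurrences of P in T using the prefix function pi instead of delta."""
    n, m = len(T), len(P)
    pi = compute_prefix_function(P)      # Theta(m) preprocessing
    q = 0                                # number of pattern characters matched so far
    for i in range(n):                   # scan the text left to right
        while q > 0 and P[q] != T[i]:
            q = pi[q]                    # skip: fall back to the next smaller matchable prefix
        if P[q] == T[i]:
            q += 1                       # the next character matched
        if q == m:                       # all of P has been matched
            print("Pattern occurs with shift", i - m + 1)   # 0-based start index
            q = pi[q]                    # jump to the state for the longest proper prefix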
Computing π
The code for computing π is very similar to that of KMP-Matcher, because the
procedure is in effect matching P against itself:
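A Python sketch (π is returned as a list indexed 1 .. m, with an unused entry at index 0,
to stay close to the text's 1-based notation):

def compute_prefix_function(P):
    """pi[q] = length of the longest proper prefix of P that is a suffix of P[:q]."""
    m = len(P)
    pi = [0] * (m + 1)                   # pi[1 .. m]; pi[0] is unused
    k = 0                                # length of the prefix matched so far
    for q in range(2, m + 1):            # pi[1] = 0 by definition
        while k > 0 and P[k] != P[q - 1]:
            k = pi[k]                    # fall back, just as in the matcher
        if P[k] == P[q - 1]:
            k += 1
        pi[q] = k
    return pi

print(compute_prefix_function("ababaca")[1:])   # [0, 0, 1, 2, 3, 0, 1]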
Rabin-Karp Algorithm
There is an entirely different approach that functions in some respects like the brute
force approach, but instead of testing character by character it tests on whole
substrings of length m by using a hash function. If the hash function matches, it then
uses brute force checking to verify the match.
The basic idea is to interpret each length-m string over a d-character alphabet as an
m-digit number in radix d; then P matches T[s+1 .. s+m] exactly when the
corresponding numbers are equal. Two modifications are needed to make this idea into
a practical string matching algorithm: shifting and hashing.
Shifting
Rather than recomputing the entire sum for each shift, the next value can be computed
efficiently by a mathematical "shift":
t_(s+1) = d (t_s − d^(m−1) T[s+1]) + T[s+m+1]
that is, drop the high-order digit, shift left one position, and bring in the new
low-order digit. For example, shifting a window over 5-digit decimal numbers, the value
after 31415 is 14152 = 10 · (31415 − 3 · 10000) + 2 when the next text character is 2.
(The mod will be explained below.)
Hashing
These numbers can get very large. Even on a machine that allows arbitrarily large
numbers, this is still a problem: we can no longer assume that comparing two numbers
takes constant time.
The solution is to compute the numbers modulo the largest prime number q such
that dq still fits in one computer word (which can be manipulated with one machine
level instruction).
Now you see why it is called "hashing:" it is like the division method of hashing.
Working modulo q introduces a problem: two different strings can have the same hash
value (a "spurious hit"), so equal hash values no longer guarantee a match. To solve
this problem, when the hashes of P[1 .. m] and T[s+1 .. s+m] are equal, the
Rabin-Karp algorithm verifies the match with a brute force comparison.
This is still saving on comparisons over the brute force method, as we do not compare
characters for failed hash matches: the strings cannot be the same.
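A Python sketch of the matcher; the radix d and prime modulus q are parameters (small
defaults are used here for illustration rather than the largest prime fitting in a
machine word), and ord() supplies the character values:

def rabin_karp_matcher(T, P, d=256, q=101):
    """Find occurrences of P in T by comparing rolling hash values mod a prime q."""
    n, m = len(T), len(P)
    h = pow(d, m - 1, q)                 # value of the high-order digit position, mod q
    p = t = 0
    for i in range(m):                   # preprocessing: hash P and the first window of T
        p = (d * p + ord(P[i])) % q
        t = (d * t + ord(T[i])) % q
    for s in range(n - m + 1):           # try every shift
        if p == t and T[s:s + m] == P:   # hash hit, then brute force verification
            print("Pattern occurs with shift", s)
        if s < n - m:                    # roll the hash: drop T[s], bring in T[s + m]
            t = (d * (t - ord(T[s]) * h) + ord(T[s + m])) % q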
The initialization of h computes the value of the high-order digit position
(d^(m−1) mod q) that will be subtracted when shifting.
The preprocessing loop computes the hash code p for P and the initial hash code t for
the first length-m substring of T (t will be updated by shifting digits as discussed).
Like the brute-force algorithm, the main loop iterates over all possible shifts s, but
instead of comparing character by character, the hash codes are compared first. Only
if they match is a brute force comparison done.
Analysis
Preprocessing is Θ(m): the preprocessing loop executes m times, and with the modular
arithmetic keeping the numbers within the size of one computer word, the numerical
work in each iteration is O(1).
The worst case matching time is still Θ((n − m + 1)m), like brute force, because it is
possible that every hash code matches and therefore every brute force comparison has
to be done.
But this is highly unlikely in any realistic application (where we are searching for
somewhat rare patterns in large texts). For a constant number of valid shifts, the text
offers an argument that the expected matching time is O(n + m), or O(n) since m ≤ n.
Extensions
Dan Suthers
Last modified: Wed Apr 23 00:32:05 HST 2014
Images are from the instructor's material for Cormen et al. Introduction to Algorithms,
Third Edition.