Data Structures and Algorithms - Narasimha Karumanchi

15.1 Introduction

To understand the importance of string algorithms, let us consider the case of entering a URL
(Uniform Resource Locator) in any browser (say, Internet Explorer, Firefox, or Google
Chrome). You will observe that after typing a prefix of the URL, a list of all possible URLs is
displayed. That means, the browsers are doing some internal processing and giving us the list of
matching URLs. This technique is sometimes called auto-completion.

Similarly, consider the case of entering a directory name in the command line interface (in both
Windows and UNIX). After typing a prefix of the directory name, if we press the tab key, we
get a list of all matching directory names. This is another example of auto-completion.

In order to support these kinds of operations, we need a data structure which stores the string data
efficiently. In this chapter, we will look at the data structures that are useful for implementing
string algorithms.

We start our discussion with the basic problem on strings: given a string, how do we search for a
substring (pattern) in it? This is called the string matching problem. After discussing various string
matching algorithms, we will look at different data structures for storing strings.

15.2 String Matching Algorithms

In this section, we concentrate on checking whether a pattern P is a substring of another string T
(T stands for text). Since we are trying to match a fixed string P, these algorithms are sometimes
called exact string matching algorithms. To simplify our discussion, let us assume that the given
text T has length n and the pattern P we are trying to match has length m. That means, T has the
characters from 0 to n – 1 (T[0 ...n – 1]) and P has the characters from 0 to m – 1 (P[0 ...m – 1]).
In C/C++, this functionality is provided by the library function strstr().

In the subsequent sections, we start with the brute force method and gradually move towards
better algorithms.
• Brute Force Method
• Rabin-Karp String Matching Algorithm
• String Matching with Finite Automata
• KMP Algorithm
• Boyer-Moore Algorithm
• Suffix Trees

15.3 Brute Force Method

In this method, for each possible position in the text T we check whether the pattern P matches or
not. Since the length of T is n, we have n – m + 1 possible choices for comparisons. This is
because we do not need to check the last m – 1 locations of T as the pattern length is m. The
following algorithm searches for the first occurrence of a pattern string P in a text string T.

Algorithm
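A minimal C++ sketch of this method (a sketch only; the function name BruteForceMatch is illustrative). It returns the index of the first occurrence of P in T, or -1 if there is none:

#include <string>

int BruteForceMatch(const std::string& T, const std::string& P) {
    int n = (int)T.size(), m = (int)P.size();
    for (int i = 0; i <= n - m; i++) {       // n - m + 1 placements of P
        int j = 0;
        while (j < m && T[i + j] == P[j])    // compare up to m characters
            j++;
        if (j == m)                          // all m characters matched
            return i;
    }
    return -1;                               // pattern not found
}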
Time Complexity: O((n – m + 1) × m) ≈ O(n × m). Space Complexity: O(1).

15.4 Rabin-Karp String Matching Algorithm

In this method, we will use the hashing technique: instead of checking P at each possible
position in T, we check only whether the hash of P and the hash of the current m characters of T
give the same result.

Initially, apply the hash function to the first m characters of T and check whether this result and
P's hash value are the same or not. If they are not the same, then go to the next character of T and
again apply the hash function to m characters (by starting at the second character). If they are the
same, then we compare those m characters of T with P character by character, since two different
strings can hash to the same value.

Selecting Hash Function

At each step, since we are finding the hash of m characters of T, we need an efficient hash
function. If the hash function takes O(m) complexity at every step, then the total complexity is O(n
× m). This is worse than the brute force method, because on top of the comparisons we are also
applying the hash function.

Our objective is to select a hash function which takes O(1) complexity for finding the hash of m
characters of T every time. Only then can we reduce the total complexity of the algorithm. If the
hash function is not good (worst case), the complexity of the Rabin-Karp algorithm is O((n – m +
1) × m) ≈ O(n × m). If we select a good hash function, the complexity of the Rabin-Karp
algorithm is O(m + n). Now let us see how to select a hash function which can
compute the hash of m characters of T at each step in O(1).

For simplicity, let's assume that the characters used in string T are only integers. That means, all
characters in T ∈ {0, 1, 2, ..., 9}. Since all of them are integers, we can view a string of m
consecutive characters as a decimal number. For example, the string '61815' corresponds to the
number 61815. With the above assumption, the pattern P is also a decimal value, and let us
assume that the decimal value of P is p. For the given text T[0..n – 1], let t(i) denote the decimal
value of the length-m substring T[i.. i + m – 1] for i = 0, 1, ..., n – m. So, t(i) == p if and only if
T[i..i + m – 1] == P[0..m – 1].

We can compute p in O(m) time using Horner’s Rule as:
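p = P[m – 1] + 10 × (P[m – 2] + 10 × (P[m – 3] + ... + 10 × (P[1] + 10 × P[0]) ...))

(a standard reconstruction of the rule: process the digits of P from left to right, multiplying the running value by 10 before adding each digit).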

The code for the above assumption is:
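A minimal C++ sketch under this digits-only assumption (the helper name HornerValue is ours):

#include <string>

// Computes the decimal value of the digit string P[0..m-1] by Horner's rule.
// Assumption (from the discussion above): every character is a digit '0'..'9'.
int HornerValue(const std::string& P) {
    int value = 0;
    for (char c : P)
        value = value * 10 + (c - '0');  // shift left one digit, add the next
    return value;
}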


We can compute all t(i), for i = 0, 1, ..., n – m, in a total of O(n) time. The value of t(0)
can be computed from T[0.. m – 1] in O(m) time, just as p was. To compute the remaining values
t(1), t(2), ..., t(n – m), observe that t(i + 1) can be computed from t(i) in constant time.
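In symbols, with the decimal assumption above, the update is (a standard reconstruction):

t(i + 1) = 10 × (t(i) – 10^(m – 1) × T[i]) + T[i + m]

that is, drop the leading digit, shift the rest left by one position, and append the next digit of T.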

For example, if T = "123456" and m = 3, then t(0) = 123, and t(1) = 234 can be obtained from
t(0) as follows.

Step-by-step explanation:

First: remove the leading digit: 123 – 100 × 1 = 23

Second: multiply by 10 to shift it: 23 × 10 = 230

Third: add the next digit T[3] = 4: 230 + 4 = 234

The algorithm runs by comparing t(i) with p. When t(i) == p, we have found the substring P
in T, starting from position i.
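Putting the pieces together, here is a minimal C++ sketch of the search. One assumption beyond the text: values are reduced modulo a prime q (our choice) so that they fit in machine integers, which is why a hash match is re-verified character by character.

#include <string>

// Rabin-Karp search under the digits-only assumption above.
// Returns the index of the first occurrence of P in T, or -1.
int RabinKarpMatch(const std::string& T, const std::string& P) {
    const long long q = 1000000007;            // a prime modulus (our choice)
    int n = (int)T.size(), m = (int)P.size();
    if (m == 0 || m > n) return -1;
    long long p = 0, t = 0, h = 1;             // h = 10^(m-1) mod q
    for (int k = 0; k < m - 1; k++) h = (h * 10) % q;
    for (int k = 0; k < m; k++) {              // Horner's rule for p and t(0)
        p = (p * 10 + (P[k] - '0')) % q;
        t = (t * 10 + (T[k] - '0')) % q;
    }
    for (int i = 0; i <= n - m; i++) {
        if (t == p && T.compare(i, m, P) == 0) // verify on a hash match
            return i;
        if (i < n - m) {                       // t(i+1) from t(i) in O(1)
            t = (t - (T[i] - '0') * h % q + q) % q;  // drop leading digit
            t = (t * 10 + (T[i + m] - '0')) % q;     // shift and append
        }
    }
    return -1;
}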

15.5 String Matching with Finite Automata

In this method we use a finite automaton, which is a concept from the Theory of Computation
(ToC). Before looking at the algorithm, let us first look at the definition of a finite automaton.

Finite Automata

A finite automaton F is a 5-tuple (Q,q0,A,∑,δ), where


• Q is a finite set of states
• q0 ∈ Q is the start state
• A ⊆ Q is a set of accepting states
• ∑ is a finite input alphabet
• δ is the transition function that gives the next state for a given current state and input
How Does a Finite Automaton Work?

• The finite automaton F begins in state q0


• Reads characters from ∑ one at a time
• If F is in state q and reads input character a, F moves to state δ(q, a)
• At the end, if its state is in A, then we say F accepted the input string read so far
• Otherwise, the input string is rejected

Example: Let us assume that Q = {0, 1}, q0 = 0, A = {1}, ∑ = {a, b}, with δ as shown in the
transition table/diagram. This automaton accepts strings that end in an odd number of a's; e.g., abbaaa is
accepted, aa is rejected.
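The table itself is not reproduced here; one transition function consistent with the stated language (a reconstruction, not necessarily the book's exact figure) is δ(0, a) = 1, δ(1, a) = 0, and δ(q, b) = 0 for both states, so the state tracks the parity of the current trailing run of a's.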

Important Notes for Constructing the Finite Automata

For building the automaton, first we start with the initial state. The FA will be in state k if k
characters of the pattern have been matched. If the next text character is equal to the pattern
character c, we have matched k + 1 characters and the FA enters state k + 1. If the next text
character is not equal to the pattern character, then the FA goes to a state 0, 1, 2, ..., or k, depending on
how many initial pattern characters match the text characters ending with c.

Matching Algorithm

Now, let us concentrate on the matching algorithm.


• For a given pattern P[0.. m – 1], first we need to build a finite automaton F
○ The state set is Q = {0,1,2, ...,m}
○ The start state is 0
○ The only accepting state is m
○ Time to build F can be large if ∑ is large
• Scan the text string T[0.. n – 1] to find all occurrences of the pattern P[0.. m – 1]
• String matching is efficient: Θ(n)
○ Each character is examined exactly once
○ Constant time for each character
○ But the time to compute δ (transition function) is O(m|∑|). This is
because δ has O(m|∑|) entries. If we assume |∑| is constant then the
complexity becomes O(m).

Algorithm:
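A minimal C++ sketch of the scanning phase (assumption: the transition table delta has already been built as described above, with delta[k][c] giving the next state from state k on character c; FiniteAutomataMatch is an illustrative name):

#include <string>
#include <vector>

// Scans T with a precomputed automaton for P; m is the accepting state.
// Returns the starting index of the first match, or -1.
int FiniteAutomataMatch(const std::string& T,
                        const std::vector<std::vector<int>>& delta, int m) {
    int q = 0;                              // start in state 0
    for (int i = 0; i < (int)T.size(); i++) {
        q = delta[q][(unsigned char)T[i]];  // one transition per character
        if (q == m)                         // accepting state: full match
            return i - m + 1;               // match begins m - 1 positions back
    }
    return -1;
}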

Time Complexity: Θ(n) for scanning, plus O(m|∑|) for building δ, which is O(m) if |∑| is constant, as noted above.

15.6 KMP Algorithm

As before, let us assume that T is the string to be searched and P is the pattern to be matched. This
algorithm was presented by Knuth, Morris and Pratt. It takes O(n) time complexity for searching a
pattern. To achieve O(n) time complexity, it avoids re-comparing the characters of T that were
previously compared with some element of the pattern P.

The algorithm uses a table which is generally called the prefix function, prefix table, or failure
function F. First we will see how to fill this table, and later how to search for a pattern using it.
The prefix function F for a pattern stores the knowledge about how the pattern matches
against shifts of itself. This information can be used to avoid useless shifts of the pattern P. In
other words, this table can be used to avoid backtracking on the string T.

Prefix Table
As an example, assume that P = a b a b a c a. For this pattern, let us follow the step-by-step
instructions for filling the prefix table F. Initially: m = length[P] = 7 and F[0] = 0.

Step 1: i = 1, j = 0, F[1] = 0

Step 2: i = 2, j = 0, F[2] = 1

Step 3: i = 3, j = 1, F[3] = 2

Step 4: i = 4, j = 2, F[4] = 3

Step 5: i = 5, j = 3; P[5] does not match P[3], so j falls back to F[2] = 1 and then to F[0] = 0, giving F[5] = 0

Step 6: i = 6, j = 0, F[6] = 1

At this step the filling of the prefix table is complete: F = [0, 0, 1, 2, 3, 0, 1].
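A minimal C++ sketch of this construction (a sketch under the convention that F[i] is the length of the longest proper prefix of P[0..i] that is also a suffix of it):

#include <string>
#include <vector>

std::vector<int> PrefixTable(const std::string& P) {
    int m = (int)P.size();
    std::vector<int> F(m, 0);            // F[0] = 0 by definition
    int i = 1, j = 0;                    // j = length of the current border
    while (i < m) {
        if (P[i] == P[j])                // border extends by one character
            F[i++] = ++j;
        else if (j > 0)                  // fall back to a shorter border
            j = F[j - 1];
        else                             // no border ends at position i
            F[i++] = 0;
    }
    return F;
}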

Matching Algorithm

The KMP algorithm takes pattern P, string T and prefix function F as input, and finds a match of P
in T.
Time Complexity: O(m + n), where m is the length of the pattern and n is the length of the text to
be searched. Space Complexity: O(m).
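A minimal C++ sketch of the matcher, reusing PrefixTable from the previous sketch (KMPMatch is an illustrative name; it returns the starting index of the first match, or -1):

#include <string>
#include <vector>

int KMPMatch(const std::string& T, const std::string& P) {
    std::vector<int> F = PrefixTable(P); // O(m) preprocessing
    int n = (int)T.size(), m = (int)P.size();
    int i = 0, j = 0;
    while (i < n) {
        if (T[i] == P[j]) {              // characters match: advance both
            i++; j++;
            if (j == m) return i - m;    // all of P matched
        } else if (j > 0) {
            j = F[j - 1];                // shift P using the prefix table
        } else {
            i++;                         // no prefix to reuse; advance in T
        }
    }
    return -1;
}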

Now, to understand the process let us go through an example. Assume that T = b a c b a b a b a b
a c a c a and P = a b a b a c a. Since we have already filled the prefix table, let us use it and go to
the matching algorithm. Initially: n = size of T = 15; m = size of P = 7.

Step 1: i = 0, j = 0, comparing P[0] with T[0]. P[0] does not match with T[0]. P will be shifted
one position to the right.

Step 2: i = 1, j = 0, comparing P[0] with T[1]. P[0] matches with T[1]. Since there is a match, P
is not shifted.

Step 3: i = 2, j = 1, comparing P[1] with T[2]. P[1] does not match with T[2]. Backtracking on P,
comparing P[0] and T[2].

Step 4: i = 3, j = 0, comparing P[0] with T[3]. P[0] does not match with T[3].

Step 5: i = 4, j = 0, comparing P[0] with T[4]. P[0] matches with T[4].

Step 6: i = 5, j = 1, comparing P[1] with T[5]. P[1] matches with T[5].

Step 7: i = 6, j = 2, comparing P[2] with T[6]. P[2] matches with T[6].

Step 8: i = 7, j = 3, comparing P[3] with T[7]. P[3] matches with T[7].

Step 9: i = 8, j = 4, comparing P[4] with T[8]. P[4] matches with T[8].


Step 10: i = 9, j = 5, comparing P[5] with T[9]. P[5] does not match with T[9]. Backtracking on
P: after the mismatch, j = F[j – 1] = F[4] = 3, so we now compare P[3] with T[9]. P[3] matches
with T[9].

Step 11: i = 10, j = 4, comparing P[4] with T[10]. P[4] matches with T[10].

Step 12: i = 11, j = 5, comparing P[5] with T[11]. P[5] matches with T[11].

Step 13: i = 12, j = 6, comparing P[6] with T[12]. P[6] matches with T[12].

Pattern P has been found to occur completely in string T. The total number of shifts that took place
for the match to be found is: i – m = 13 – 7 = 6 shifts.

Notes:
• KMP performs the comparisons from left to right
• KMP algorithm needs a preprocessing (prefix function) which takes O(m) space and
time complexity
• Searching takes O(n + m) time complexity (does not depend on alphabet size)
15.7 Boyer-Moore Algorithm

Like the KMP algorithm, this algorithm also does some pre-processing, and we call its table the
last function. The algorithm scans the characters of the pattern from right to left, beginning with the
rightmost character. During the testing of a possible placement of pattern P in T, a mismatch is
handled as follows: let us assume that the current character being matched is T[i] = c and the
corresponding pattern character is P[j]. If c is not contained anywhere in P, then shift the pattern P
completely past T[i]. Otherwise, shift P until an occurrence of character c in P gets aligned with
T[i]. This technique avoids needless comparisons by shifting the pattern relative to the text.

The last function takes O(m + |∑|) time and the actual search takes O(nm) time. Therefore the
worst-case running time of the Boyer-Moore algorithm is O(nm + |∑|). This indicates that the
worst-case running time is quadratic (when m is proportional to n), the same as the brute force algorithm.
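A minimal C++ sketch using only the last-occurrence (bad character) heuristic described above; the full Boyer-Moore algorithm also employs a good-suffix rule, which is omitted here (BoyerMooreMatch is an illustrative name):

#include <algorithm>
#include <string>
#include <vector>

int BoyerMooreMatch(const std::string& T, const std::string& P) {
    int n = (int)T.size(), m = (int)P.size();
    if (m == 0 || m > n) return -1;
    std::vector<int> last(256, -1);          // last[c]: rightmost index of c in P
    for (int k = 0; k < m; k++)
        last[(unsigned char)P[k]] = k;
    int i = m - 1, j = m - 1;                // compare P against T right to left
    while (i < n) {
        if (T[i] == P[j]) {
            if (j == 0) return i;            // full match found at index i
            i--; j--;
        } else {
            // align the last occurrence of T[i] in P with T[i]
            // (or shift P completely past T[i] if it does not occur in P)
            i += m - std::min(j, 1 + last[(unsigned char)T[i]]);
            j = m - 1;
        }
    }
    return -1;
}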
• The Boyer-Moore algorithm is very fast on the large alphabet (relative to the length
of the pattern).
• For the small alphabet, Boyer-Moore is not preferable.
• For binary strings, the KMP algorithm is recommended.
• For the very shortest patterns, the brute force algorithm is better.

15.8 Data Structures for Storing Strings

If we have a set of strings (for example, all the words in the dictionary) and a word which we
want to search in that set, in order to perform the search operation faster, we need an efficient
way of storing the strings. To store sets of strings we can use any of the following data structures.
• Hash Tables
• Binary Search Trees
• Tries
• Ternary Search Trees

15.9 Hash Tables for Strings

As seen in the Hashing chapter, we can use hash tables for storing integers or strings. In this
case, the keys are nothing but the strings. The problem with the hash table implementation is that we
lose the ordering information: after applying the hash function, we do not know where a key will
map to. As a result, some queries take more time. For example, to find all the words starting with
the letter "K", with the hash table representation we need to scan the complete hash table. This is
because the hash function takes the complete key as input, and words that share a prefix need not
map to nearby locations.
