Data Structures and Algorithms - Narasimha Karumanchi
15.1 Introduction
To understand the importance of string algorithms, let us consider the case of entering a URL
(Uniform Resource Locator) in any browser (say, Internet Explorer, Firefox, or Google
Chrome). You will observe that after typing the prefix of the URL, a list of all possible URLs is
displayed. That means, the browsers are doing some internal processing and giving us the list of
matching URLs. This technique is sometimes called auto-completion.
Similarly, consider the case of entering a directory name in the command-line interface (in both
Windows and UNIX). After typing the prefix of the directory name, if we press the Tab key, we
get a list of all matching directory names. This is another example of auto-completion.
In order to support these kinds of operations, we need a data structure which stores the string data
efficiently. In this chapter, we will look at the data structures that are useful for implementing
string algorithms.
15.2 String Matching Algorithms
We start our discussion with the basic problem of strings: given a string, how do we search a
substring (pattern)? This is called a string matching problem. After discussing various string
matching algorithms, we will look at different data structures for storing strings.
In the subsequent sections, we start with the brute force method and gradually move towards
better algorithms.
• Brute Force Method
• Rabin-Karp String Matching Algorithm
• String Matching with Finite Automata
• KMP Algorithm
• Boyer-Moore Algorithm
• Suffix Trees
15.3 Brute Force Method
In this method, for each possible position in the text T we check whether the pattern P matches or
not. Since the length of T is n, we have n – m + 1 possible choices for comparisons. This is
because we do not need to check the last m – 1 locations of T as the pattern length is m. The
following algorithm searches for the first occurrence of a pattern string P in a text string T.
Algorithm
Time Complexity: O((n – m + 1) × m) ≈ O(n × m). Space Complexity: O(1).
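A minimal Python sketch of the brute-force method described above (the function name is illustrative, not from the text):

```python
def brute_force_match(T, P):
    """Return the index of the first occurrence of P in T, or -1."""
    n, m = len(T), len(P)
    for i in range(n - m + 1):        # n - m + 1 candidate positions
        j = 0
        while j < m and T[i + j] == P[j]:
            j += 1
        if j == m:                    # all m characters matched
            return i
    return -1
```

Each candidate position costs at most m comparisons, giving the O((n – m + 1) × m) bound stated above.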
15.4 Rabin-Karp String Matching Algorithm
In this method, we use a hashing technique: instead of checking each possible
position in T directly, we check only whether the hash of P and the hash of the current m characters of T give the same
result.
Initially, apply the hash function to the first m characters of T and check whether this result and
P’s hash are the same. If they are not the same, then go to the next character of T and
again apply the hash function to m characters (starting at the second character). If they are the
same, then we compare those m characters of T with P.
At each step, since we are computing the hash of m characters of T, we need an efficient hash
function. If the hash function takes O(m) time at every step, then the total complexity is O(n
× m). This is worse than the brute force method, because we are both applying the hash function
and doing the comparisons.
Our objective is to select a hash function that takes O(1) time to compute the hash of m
characters of T at each step. Only then can we reduce the total complexity of the algorithm. If the
hash function is not good (worst case), the complexity of the Rabin-Karp algorithm is O((n – m +
1) × m) ≈ O(n × m). If we select a good hash function, the complexity of the Rabin-Karp
algorithm is O(m + n). Now let us see how to select a hash function which can
compute the hash of m characters of T at each step in O(1).
For simplicity, let us assume that the characters used in string T are only digits. That means, all
characters in T ∈ {0, 1, 2, ..., 9}. Since all of them are digits, we can view a string of m
consecutive characters as a decimal number. For example, the string ′61815′ corresponds to the
number 61815. With the above assumption, the pattern P is also a decimal value; let us
assume that the decimal value of P is p. For the given text T[0..n – 1], let t(i) denote the decimal
value of the length-m substring T[i..i + m – 1] for i = 0, 1, ..., n – m. So, t(i) == p if and only if
T[i..i + m – 1] == P[0..m – 1].
The algorithm runs by comparing t(i) with p. When t(i) == p, we have found the substring P
in T, starting from position i.
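Under the text’s simplifying assumption that all characters are decimal digits, the rolling computation of t(i) can be sketched as follows (illustrative; real implementations take the values modulo a prime to keep them small):

```python
def rabin_karp(T, P):
    """Find the first occurrence of digit-string P in digit-string T, or -1."""
    n, m = len(T), len(P)
    if m == 0:
        return 0
    if m > n:
        return -1
    p = int(P)                     # decimal value of the pattern
    t = int(T[:m])                 # t(0): value of the first m characters
    high = 10 ** (m - 1)           # weight of the leading digit
    for i in range(n - m + 1):
        if t == p and T[i:i + m] == P:   # verify, to rule out collisions
            return i
        if i < n - m:
            # rolling update in O(1): drop T[i], shift left, append T[i + m]
            t = (t - int(T[i]) * high) * 10 + int(T[i + m])
    return -1
```

The update from t(i) to t(i + 1) costs O(1), which is exactly the property needed for the O(m + n) bound.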
15.5 String Matching with Finite Automata
In this method, we use finite automata, a concept from the Theory of Computation
(ToC). Before looking at the algorithm, let us first look at the definition of finite automata.
Finite Automata
Example: Let us assume that Q = {0, 1}, q0 = 0, A = {1}, ∑ = {a, b}, with δ(q, d) as shown in the
transition table/diagram. This automaton accepts strings that end in an odd number of a’s; e.g., abbaaa is
accepted, aa is rejected.
For building the automaton, we first start with the initial state. The FA will be in state k if k
characters of the pattern have been matched. If the next text character is equal to the pattern
character c, we have matched k + 1 characters and the FA enters state k + 1. If the next text
character is not equal to the pattern character, then the FA goes to one of the states 0, 1, 2, ..., k, depending on
how many initial pattern characters match the text characters ending with c.
Matching Algorithm
Algorithm:
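One way to sketch the automaton construction and the matching loop in Python (illustrative; the transition function δ is built naively here, which is acceptable for short patterns):

```python
def build_transition(P, alphabet):
    """delta[k][c]: next state after reading c in state k (k chars matched)."""
    m = len(P)
    delta = [{} for _ in range(m + 1)]
    for k in range(m + 1):
        for c in alphabet:
            # longest prefix of P that is a suffix of P[:k] + c
            s = min(m, k + 1)
            while s > 0 and P[:s] != (P[:k] + c)[k + 1 - s:]:
                s -= 1
            delta[k][c] = s
    return delta

def fa_match(T, P):
    """Return the start index of the first match of P in T, or -1."""
    delta = build_transition(P, set(T) | set(P))
    state, m = 0, len(P)
    for i, c in enumerate(T):
        state = delta[state][c]
        if state == m:                 # accepting state: full match
            return i - m + 1
    return -1
```

Once the table is built, the scan itself reads each text character exactly once, so searching takes O(n) time.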
15.6 KMP Algorithm
As before, let us assume that T is the string to be searched and P is the pattern to be matched. This
algorithm was presented by Knuth, Morris and Pratt. It takes O(n) time for searching a
pattern. To achieve this, it avoids re-comparing characters of T that were
previously compared against some element of the pattern P.
The algorithm uses a table, generally called the prefix function (also known as the prefix table or failure
function) F. First we will see how to fill this table, and later how to search for a pattern using
it. The prefix function F for a pattern stores knowledge about how the pattern matches
against shifts of itself. This information can be used to avoid useless shifts of the pattern P, which
means the table can be used to avoid backtracking on the string T.
Prefix Table
As an example, assume that P = a b a b a c a. For this pattern, let us follow the step-by-step
instructions for filling the prefix table F. Initially: m = length[P] = 7, F[0] = 0 and F[1] = 0.
Step 1: i = 1, j = 0, F[1] = 0
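The table-filling steps above can be sketched in Python (0-based indexing; for P = ababaca this produces F = [0, 0, 1, 2, 3, 0, 1]):

```python
def prefix_table(P):
    """F[i]: length of the longest proper prefix of P[:i+1] that is also its suffix."""
    m = len(P)
    F = [0] * m
    j = 0                      # length of the currently matched prefix
    for i in range(1, m):
        while j > 0 and P[i] != P[j]:
            j = F[j - 1]       # fall back to the next shorter border
        if P[i] == P[j]:
            j += 1
        F[i] = j
    return F
```

Each character of P is examined a constant amortized number of times, giving the O(m) preprocessing bound noted below.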
Matching Algorithm
The KMP algorithm takes pattern P, string T and prefix function F as input, and finds a match of P
in T.
Time Complexity: O(m + n), where m is the length of the pattern and n is the length of the text to
be searched. Space Complexity: O(m).
Step 1: i = 0, j = 0, comparing P[0] with T[0]. P[0] does not match with T[0]. P will be shifted
one position to the right.
Step 2: i = 1, j = 0, comparing P[0] with T[1]. P[0] matches with T[1]. Since there is a match, P
is not shifted.
Step 3: i = 2, j = 1, comparing P[1] with T[2]. P[1] does not match with T[2]. Backtracking on P,
comparing P[0] and T[2].
Step 4: i = 3, j = 0, comparing P[0] with T[3]. P[0] does not match with T[3].
Step 11: i = 10, j = 4, comparing P[4] with T[10]. P[4] matches with T[10].
Step 12: i = 11, j = 5, comparing P[5] with T[11]. P[5] matches with T[11].
Step 13: i = 12, j = 6, comparing P[6] with T[12]. P[6] matches with T[12].
Pattern P has been found to completely occur in string T. The total number of shifts that took place
for the match to be found is: i – m = 13 – 7 = 6 shifts.
Notes:
• KMP performs the comparisons from left to right
• KMP algorithm needs a preprocessing (prefix function) which takes O(m) space and
time complexity
• Searching takes O(n + m) time complexity (does not depend on alphabet size)
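The KMP search described above can be sketched in Python as follows (the prefix-table construction is repeated inline so the sketch is self-contained; names are illustrative):

```python
def kmp_search(T, P):
    """Return the index of the first occurrence of P in T, or -1. O(n + m) time."""
    n, m = len(T), len(P)
    if m == 0:
        return 0
    # build the prefix (failure) table F for P
    F = [0] * m
    j = 0
    for i in range(1, m):
        while j > 0 and P[i] != P[j]:
            j = F[j - 1]
        if P[i] == P[j]:
            j += 1
        F[i] = j
    # scan T; the index i never moves backwards in T
    j = 0
    for i in range(n):
        while j > 0 and T[i] != P[j]:
            j = F[j - 1]          # shift P using the prefix table
        if T[i] == P[j]:
            j += 1
        if j == m:
            return i - m + 1      # match starts here
    return -1
```

Because i only advances, there is no backtracking on T, which is the key to the O(n + m) bound.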
15.7 Boyer-Moore Algorithm
Like the KMP algorithm, this algorithm also does some preprocessing, and we call the resulting table the last
function. The algorithm scans the characters of the pattern from right to left, beginning with the rightmost
character. During the testing of a possible placement of pattern P in T, a mismatch is handled as
follows: let us assume that the current character being matched is T[i] = c and the corresponding
pattern character is P[j]. If c is not contained anywhere in P, then shift the pattern P completely
past T[i]. Otherwise, shift P until an occurrence of character c in P gets aligned with T[i]. This
technique avoids needless comparisons by shifting the pattern relative to the text.
The last function takes O(m + |∑|) time to build and the actual search takes O(nm) time in the worst case. Therefore the
worst-case running time of the Boyer-Moore algorithm is O(nm + |∑|). This indicates that the
worst-case running time is quadratic (when n == m), the same as for the brute force algorithm.
• The Boyer-Moore algorithm is very fast on the large alphabet (relative to the length
of the pattern).
• For the small alphabet, Boyer-Moore is not preferable.
• For binary strings, the KMP algorithm is recommended.
• For the very shortest patterns, the brute force algorithm is better.
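A minimal Python sketch of the right-to-left scan using only the last-occurrence rule described above (illustrative; names are not from the text):

```python
def boyer_moore(T, P):
    """Return the index of the first occurrence of P in T, or -1."""
    n, m = len(T), len(P)
    if m == 0:
        return 0
    if m > n:
        return -1
    last = {c: j for j, c in enumerate(P)}   # last occurrence of each char in P
    i = m - 1                                # index in T
    j = m - 1                                # index in P (scan right to left)
    while i < n:
        if T[i] == P[j]:
            if j == 0:
                return i                     # full match found
            i -= 1
            j -= 1
        else:
            l = last.get(T[i], -1)
            # align the last occurrence of T[i] in P, or jump past T[i] entirely
            i += m - min(j, l + 1)
            j = m - 1
    return -1
```

On large alphabets, mismatches usually hit the `last.get(T[i], -1) == -1` case, so the pattern jumps m positions at a time, which is why Boyer-Moore is fast in that setting.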
15.8 Data Structures for Storing Strings
If we have a set of strings (for example, all the words in a dictionary) and a word which we
want to search for in that set, then in order to perform the search operation faster, we need an efficient
way of storing the strings. To store sets of strings we can use any of the following data structures.
• Hash Tables
• Binary Search Trees
• Tries
• Ternary Search Trees
15.9 Hash Tables for Storing Strings
As seen in the Hashing chapter, we can use hash tables for storing integers or strings. In this
case, the keys are nothing but the strings. The problem with a hash table implementation is that we
lose the ordering information – after applying the hash function, we do not know where a key will
map to. As a result, some queries take more time. For example, to find all the words starting with
the letter “K”, with the hash table representation we need to scan the complete hash table. This is
because the hash function takes the complete key, hashes it, and we do not know the
location of each word.
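As an illustration of why prefix queries are slow with hashing (a toy example, not from the text):

```python
# With a hash-based set, a prefix query degenerates to a full scan of every key,
# since hashing destroys any ordering among the keys.
words = {"King", "Kite", "Cat", "Dog", "Karma"}   # sample dictionary (illustrative)

def words_with_prefix(table, prefix):
    # must touch every stored key: O(total number of keys)
    return sorted(w for w in table if w.startswith(prefix))
```

The trie and ternary search tree structures discussed next avoid this full scan by organizing keys by their prefixes.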