A Fast String Matching Algorithm: H N Verma, Ravendra Singh M.Tech (CSE-0104cs09mt16) RKDF IST Bhopal, India

Download as pdf or txt
Download as pdf or txt
You are on page 1of 7

ISSN:2229-6093

Ravendra Singh et al, Int. J. Comp. Tech. Appl., Vol 2 (6),1877-1883

A FAST STRING MATCHING ALGORITHM


1
H N Verma, 2Ravendra Singh
1
Department of CSE, Sachdeva Institute of Technology, Mathura, India, [email protected]
2
M.Tech(CSE-0104cs09mt16) RKDF IST Bhopal, India, [email protected]

ABSTRACT

The pattern matching is a well known and important task of the pattern discovery process in todays world
for finding the nucleotide or amino acid sequence patterns in protein sequence databases. Although pattern
matching is commonly used in computer science, its applications cover a wide range, including in editors,
information retrieval. In this paper we propose a new pattern matching algorithm that has an improved
performance compare to the well known algorithms in the literature so far. Our proposed algorithm has
been evolved after the comparatively study of the well known algorithms like Boyer Moore , Horspool and
Raita. When we are talking about the overall performance of the proposed algorithm it has been improved
using the shift provided by the Horspool search bad-character and by defining a fixed order of comparison.
The proposed algorithm has been compared with other well known algorithm.

1. INTRODUCTION
Pattern matching problem has attract a lot of After a whole match or a mismatch of the
interest throughout the history of computer pattern, the text window is shifted in the forward
science, pattern matching has been used in direction until the window is positioned at the (n-
various computer application for several decades m+1) position of the text. This approach is
these algorithm are applied in most of the similar to the brute-force algorithm. In brute-
Operating Systems, Editors, Search Engines on force algorithm, window is shifted to the right by
the internet, retrieval of information and one character after an attempt.
searching Protein or DNA sequence pattern in
genome and protein sequence databases. For a large character of text the brute-force
algorithm is not efficient to perform this task.
Theoretical studies of different algorithm suggest We have several well known algorithms in the
various possible means by which those literature so far and these have their own
algorithms are likely to perform but in some advantages and limitation based on the method
cases they fail to predict the actual performance, they use to calculate the shift value(the number
here we demonstrate that better methods can be of characters the window should move forward).
devised from theoretical analysis by extensive The algorithms vary in the order in which
experimentation and modification of the existing character comparisons are made and the distance
algorithm. by which the window is shifted on the text after
each attempt.
Pattern matching can be defined as the text is an
array T[1n] of length n and that the pattern is The efficiency of the algorithm lies in two
an array p[1m] of length m<=n. We further phases:
assume that the element of P and T are character 1. The preprocessing phases
drawn from a finite alphabet . For example, we 2. The searching phase
may have = {0, 1}, = {a, b ,z} etc. The
character arrays P and T are often called strings The preprocessing phase collects the full
of characters. All pattern matching algorithm information about the pattern and uses this
scan the text with the help of a window which is information in searching phase. In searching
equal to the length of the pattern. phase, pattern is compared with the sliding
window from right to left or left to right until a
The first process is to align the left ends of the match or mismatched occur. The efficiency of
pattern with the text window and then compared pattern-matching algorithm is calculated on the
the corresponding characters of the window and basis of the search phase. The efficiency of the
the pattern. This process is known as an attempt. search is achieved by changing the order in
which the characters are compared at each

IJCTA | NOV-DEC 2011 1877


Available [email protected]
ISSN:2229-6093
Ravendra Singh et al, Int. J. Comp. Tech. Appl., Vol 2 (6),1877-1883

attempt and moving the window on text, hence 3. PROPOSED ALGORITHM


calculating the shift value.
This proposed algorithm is evolved after the
2. DESCRIPTION OF WELL KNOWN comparatively study of the well know algorithms
ALGORITHM like Horspool and Raita. This new algorithm
compares the right most character of the pattern
Nave string matching algorithm: The nave to the text window, if it is matched then the left
algorithm finds the entire valid shifts by most character of the pattern and the sliding
comparing each character of the pattern to text window is compared, if it also matched then the
character, if the whole pattern is match to the middle character of the pattern is compared to
text than it has a valid shift otherwise in case of the corresponding position of the text window, if
mismatch it shift the pattern by one character and it is matched than it again compare the characters
again compare the each character of the pattern staring from the second last character (m-1) of
to the text in this manner it finds entire valid the pattern to second character of the pattern to
shift of the pattern in the text. the text window[7]. In case of match or
mismatch the skip of the window is achieved by
Horspool Algorithm (Horspool): The the Horspool Bad character (Hbc) shift value for
Horspool[2] algorithm that has the shift value of the character that is placed next to the window.
the right most character of the sliding window. This new algorithm first compare the last
In pre processing phase the shift value are character of the pattern and second the first
computed for all characters. In an attempt, we character of the pattern with corresponding
compare the pattern from the right to the left characters of the text window, because by doing
until the pattern match or mismatch occurs. The this the probability of the finding the pattern in
right most character in the window is used as the the text window is increased over the
index to obtain the shift value. In case of comparison of the one by one character of the
mismatch i.e. character does not occur in the pattern in the text window[12]. This new
pattern, the window is shifted by the length of algorithm postpones the comparison of the
the pattern. Otherwise the window is shifted neighboring characters. Hence the probability of
according to the right most character in the an exact match increases between the pattern and
pattern. the text window[9].

Raita Algorithm: In Raita[3] algorithm, first it 3.1. PREPROCESSING PHASE


compare the last character of the pattern with the
corresponding(i.e. nth character of the text) text In preprocessing phase, the Horspool Bad
character of the window, if they match, than it character (Hbc) function is generated for all the
compare with the first character of the pattern characters in the alphabet set. A Horspool bad
with the left most text character of the window, character table is create as the value of Horspool
if it again match then it compare the middle Bad character (Hbc) for a particular alphabet is
character of the pattern with the corresponding defined as the right most position of that
text character of the sliding window. And finally, character in the pattern-1. If the character does
it they match than again compare the characters not occur in the pattern , then the value is equal
starting from the second character to second last to m (length of the pattern). The skip values for
character of the pattern. In this case middle each character are stored in the Horspool Bad
character of the pattern compare again while Character (Hbc) table and these skip values are
comparing from second character to second last used in the searching phase. In case of mismatch,
character of the pattern. the skip value of the right most character that is
stored in the Horspool Bad Character (Hbc) table
Boyer-Moore algorithm (BM): The Boyer is used to shift the pattern in an attempt. This
Moore[11] algorithm uses two different tables, process is repeated until the match or mismatch
one is Boyer-Moore Bad Character (bmBc) and of the pattern occur in the text. When the
another is Boyer-Moore Good Suffix (bmGs), a alphabet set is large then the character occurring
bad character table stores the shift value that are probability in the pattern is less and this provide
obtained on the occurrence of the character in the a maximum skip of the window. This proposed
pattern. Another good suffix table contains the algorithm has Horspool Bad Character (Hbc)
matching shift values for each character of the over the Boyer-Moore Bad Character (BmBc)
pattern. for the following reasons:

IJCTA | NOV-DEC 2011 1878


Available [email protected]
ISSN:2229-6093
Ravendra Singh et al, Int. J. Comp. Tech. Appl., Vol 2 (6),1877-1883

1. When the alphabet size is large and the pattern //The function for matching.
length is too small, then Boyer-Moores bad- //Input is Pattern and Text string.
character technique is not affected. //flag is a variable to check whether pattern
occurs in the text or not at all.
2. Boyer-Moore algorithm has two different //The valid shift values are from 0 to n-m.
tables to skip the window, one is Boyer-Moores /*First the last character of Pattern and Text
Bad Character and another is Boyer-Moores Window is compared, in case of a match, the
good suffix to calculate the skip of the window. first character of Pattern and the Text Window is
While Horspools Bad Character is always have compared, in case of these two matches, the
the values to be 1 and so Horspools algorithm mid(if m is even, then lower median) character
works independently. of the Pattern and the Text character will be
compared. If it also matches then we compare
3.2 SEARCHING PHASE second-last to second character of Pattern with
corresponding text window. If all characters
The proposed algorithm search the pattern in the matches then we get a message that Pattern has
window, first it match the right most character of been occurred. In case of mismatch at anytime
the pattern to the text window, in case of a match we compute the shift value as shift +
the left most character of the pattern to the text L[Text[shift+m]].*/
window is compared, if these match, then the
middle character of the pattern to the text matching( )
window is compared, otherwise in case of {
mismatch the pattern is shifted by the shift value int flag=0;
calculated by the Horspool Bad Character (Hbc) int shift = 0;
value of the right most character of the pattern.
while(shift <= n-m)
In case of matched middle character of the {
pattern, it compare the second last character of if(Pattern[m]== Text[shift+m])
the pattern (m-1) to second character of the {
pattern to the text window. Middle character of if(Pattern[1]==Text[shift+1])
the pattern is compared twice to the {
corresponding character of text window. This if(Pattern[m/2] == Text[shift + m/2])
procedure is repeated until the shift does not {
reach to n-m+1. i = m-1;
k = 0;
3.3. C LANGUAGE IMPLEMENTATION while(i>=2&&Pattern[i]==Text[shift+i])
{
//This function computes the Horspool Bad k = k + 1;
Character, Hbc, table and store it in an array L. i = i - 1;
//Input is Pattern and the number of elements in }
as ALPHABET. if(k==m-2)
//m is the length of Pattern. {
printf("\n\n\t Pattern occurs with shift %d", shift);
preprocessing( ) flag=1;
{ }
int i; }
}
for(i=0; i<ALPHABETS; i++) }
L[i] = m; shift = shift + L[Text[shift+m]];
}
for(i=1; i<=m-1; i++)
L[Pattern[i]] = m-i; if(flag == 0)
} printf("\n\n\t Pattern does not found");
else
printf("\n\n\t No more pattern exists");
}

IJCTA | NOV-DEC 2011 1879


Available [email protected]
ISSN:2229-6093
Ravendra Singh et al, Int. J. Comp. Tech. Appl., Vol 2 (6),1877-1883

4. WORKING EXAMPLE In second attempt:


The shift is 1.
To test the proposed algorithm, the following
sequence has been considered for the test run BBARBERBABCDEEAACDEEBBRBSRBAR
EERBERBERBARBER
Text 2 7 35 4 1
T=BBARBERBABCDEEAACDEEBBRBSRB BARBER
AREERBERBERBARBER
First, last R of text window is compared with last
Pattern P = BARBER R of pattern, then the first character B of text
window is compared with B of pattern, then
Length of pattern m=6 and, middle R of pattern is compared and matched
with 4th character of text. Then finally patterns
Length of text n=44 second last to second characters are compared
and matched with 6th character to 3rd character
4.1. PREPROCESSING PHASE of text window respectively.

The Horspool Bad Character table is obtained Pattern occurs with shift 1.
with a size [10] that store the character and its New shift value is calculated as shift + Hbc[R]
corresponding skip value. The set of alphabets i.e. 1 + 3 = 4
ALPHABET () we have taken, includes all
capital and small English letters, all digits and Total number of attempt is 2;
some special characters and punctuation marks. Total number of character comparison: 7.

.. ... In third attempt:


A B C D E F R S The shift is 4.
.

.. ... BBARBERBABCDEEAACDEEBBRBSRBAR
Hbc 4 2 6 6 1 6 3 6 EERBERBERBARBER
.
1
BARBER

4.2. SEARCHING PHASE The patterns last character is compared with


tenth character of Text. The mismatched
In first attempt: character is B, The new shift value is calculated
The shift is 0. as shift + Hbc[B] i.e.4 + 2 = 6.

BBARBERBABCDEEAACDEEBBRBSRBAR In fourth attempt:


EERBERBERBARBER The shift is 6.
1
BARBER BBARBERBABCDEEAACDEEBBRBSRBAR
EERBERBERBARBER
Here the comparison between the last character 1
of the pattern and the window is done, and a BARBER
mismatch occur so shift value is needed to shift
the pattern, the shift value is calculated from the Now patterns last character is compared with
Horspool Bad Character table Hbc[E]. Hence the Texts 12th character. A mismatch occurs. The
window is moved by the Horspool Bad mismatched character is D. The new shift value
Character shift value of Hbc[E] =1.The new shift is shift + Hbc[D] i.e. 6 + 6 = 12.
value is shift + 1 i.e. 0 + 1 = 1.
In fifth attempt:
The shift value is 12.

BBARBERBABCDEEAACDEEBBRBSRBAR
EERBERBERBARBER
1
BARBER

IJCTA | NOV-DEC 2011 1880


Available [email protected]
ISSN:2229-6093
Ravendra Singh et al, Int. J. Comp. Tech. Appl., Vol 2 (6),1877-1883

In this attempt, the comparison of the last In ninth attempt:


character of the pattern and the text window is The shift value is 26.
carried out and a mismatch occur, the window is
shifted based on the value shift + Hbc[D].The BBARBERBABCDEEAACDEEBBRBSRBAR
new shift value is 12 + 6 = 18. EERBERBERBARBER
2 3
In sixth attempt: 54 1 BAR
The shift value is 18. BER

BBARBERBABCDEEAACDEEBBRBSRBAR Again, a match occurs with the right most


EERBERBERBARBER character of the pattern and the text window, the
1 first character of the pattern is also matched with
BARBER the text window, and the middle character of the
pattern is matched with the text window. Then
Here Patterns last character is compared with we compare second-last character of pattern with
Texts twenty-fourth character and a mismatch 31st character of text. It is a match. Then we
occurs. So the window is shifted based on the compare third-last character of pattern with 30th
shift value shift + Hbc[B] i.e. 18 + 2 = 20. character of text. It is a mismatch and
mismatched character is E. The new shift value
In seventh attempt: is shift + Hbc[R] i.e. 26 + 3 = 29.
The shift value is 20.
In tenth attempt:
BBARBERBABCDEEAACDEEBBRBSRBAR The shift value is 29.
EERBERBERBARBER
2 3 41 BBARBERBABCDEEAACDEEBBRBSRBAR
BARBER EERBERBERBARBER
2 1
In this attempt, last character of the pattern is BARBER
matched with the text window, then the first
character of the pattern is also matched with the In this attempt, last character of the pattern is
text window, then the middle character* of the matched with the last character of text window,
pattern is matched with the text window. Then, but the first character of the pattern is
the second last character of the pattern is mismatched with the first character of text
compared and a mismatch occurs with the text window, so window is shifted by shift + Hbc[R]
window character S, so the window is shifted by i.e. 29 + 3 = 32.
shift + Hbc[R] i.e. 20 + 3 = 23.
In eleventh attempt:
In eighth attempt: The shift value is 32.
The shift value is 23.
BBARBERBABCDEEAACDEEBBRBSRBAR
BBARBERBABCDEEAACDEEBBRBSRBAR EERBERBERBARBER
EERBERBERBARBER 2 7 3 54 1
2 3 41 BARBER
BARBER
Here again, a match occurs with the right most
Here again, a match occurs with the right most character of the pattern and the text window, the
character of the pattern and the text window, first character of the pattern is also matched with
then the first character of the pattern is also the text window, and the middle character of the
matched with the text window, and the middle pattern is matched with the text window. Then
character of the pattern is matched with the text we compare patterns 5th, 4th and 3rd characters
window. Next comparison is in between with texts 37th, 36th and 35th characters
windows A and patterns E. We get a mismatch respectively and these all are matched. Next
and the mismatched character is A. So the comparison is in between patterns 2nd character
window is shifted based on the shift value shift + and texts 34th character, which is a mismatch
Hbc[R] i.e. 23 + 3 = 26. and the mismatched character is E. So the
_ window is shifted based on the shift value shift +
*If length of the Pattern is even then out of two, Hbc[R] i.e. 32 + 3 = 35.
the lower median is selected[4].

IJCTA | NOV-DEC 2011 1881


Available [email protected]
ISSN:2229-6093
Ravendra Singh et al, Int. J. Comp. Tech. Appl., Vol 2 (6),1877-1883

In twelfth attempt: 5. ANALYSIS


The shift value is 35.
First for loop of preprocessing function iterates
BBARBERBABCDEEAACDEEBBRBSRBAR (||) times and the second for loop iterates (m)
EERBERBERBARBER times. Since, generally || > m, the complexity of
2 3 41 preprocessing is (||).
BARBER
In matching function the outer while loop
Again, a match occurs with the right most iterates O(n-m+1) times. Experiments show that
character of the pattern and the right most the exact number of iterations is far less than n-
character of the text window, the first character m+1. The inner while loop iterates O(m) times.
of the pattern is also matched with the first Thus the total complexity of the matching phase
character of the text window, and the middle is O(m(n-m+1)).
character of the pattern is matched with the
middle character of the text window. Then we 6. CONCLUSION
compare second-last character of the pattern with
the 40th character of the text and get a mismatch. We have proposed a new algorithm for string
The mismatched character is A. So the window matching, by defining a new order of comparison
is shifted based on the shift value shift + Hbc[R] between given Pattern and the text Window[1].
i.e. 35 + 3 = 38. We have explained our algorithm by an example.
The example strings are taken such that in which
In thirteenth attempt: cases are covered. Our algorithm is checked and
The shift value is 38. tested on many random strings. The complexity
analysis is also given. Although the given
BBARBERBABCDEEAACDEEBBRBSRBAR algorithm is not much better than other existing
EERBERBERBARBER algorithms in complexity, but practically its
2 7 3 54 1 running time is faster than other existing
BARBER algorithms. Hence the proposed algorithm is
efficient and faster as it can be observed from the
Here again, a match occurs with the right most given chart and the graph.
character of the pattern and the right most
character of the text window, the first character
Pattern VS (our
of the pattern is also matched with the first BM Raita Horspool
Length algorithm)
character of the text window, and the middle
2 238 207 214 119.435620
character of the pattern is matched with the
middle character of the text window. Then 4 189 174 175 95.5909075
second-last to second character of pattern is 6 175 165 166 82.4739625
compared and matched with the second-last to 8 166 159 157 75.5494650
second character of text window, thus the whole 10 163 156 154 73.9011125
pattern is matched with the text window. 12 157 153 151 71.0164925
14 154 151 150 70.6044025
Pattern occurs with shift =38. 16 153 151 149 65.2472575
18 152 151 147 64.8351675
Now the window is shifted based on the shift
20 151 148 147 62.7747250
value shift + Hbc[R] i.e. 38 + 3 = 41, which is
Table1: Comparison of our algorithm(VS) with
more than n-m. We get a message-
other well known algorithms[6]
No more pattern exists.
The graph on next page clearly shows the
difference of our algorithm with other known
algorithms. Our algorithm is applicable in many
areas like Biology[5], Computer science,
Medical etc. It also works fine with multiple
pattern matching[8].

IJCTA | NOV-DEC 2011 1882


Available [email protected]
ISSN:2229-6093
Ravendra Singh et al, Int. J. Comp. Tech. Appl., Vol 2 (6),1877-1883

250

200
Average Time in 10-2 mSeconds

150 BM
Raita
Horspool
100 VS

50

0
2 4 6 8 10 12 14 16 18 20
Pattern Length

Figure 1: Comparison Chart with other well


known algorithms

7. REFERENCES 7. Olivier Danvy and Henning Korsholm Rohde,


Obtaining the Boyer-Moore string-matching
1. R. Manivannan and S. K. Srivasta, Semi algorithm by partial evaluation. Department
automatic method for string matching, of Computer Science University of Aarhus,
Information Technology Journal, 2011, 10(1), 2005, pp. 1-9.
195-200. 8. Castelo AT, Martins W, Gao GR, TROLL--
2. Horspool R. N., Practical fast searching in tandem repeat occurrence locator.
strings. Software-Practice Experience 1980, Bioinformatics 2002, 18(4):634-636.
10(6), 501-506. 9. Yoginder S Dandass, Shane C Burgess, Mark
3. Raita T., Tuning the Boyer-Moore-Horspool Lawrence and Susan M Bridges,
string-searching algorithm. Software-Practice Accelerating string set matching in FPGA
Experience 1992, 22(10), 879-884. Hardware for Bioinformatics Research, BMC
4. Tim Bell, Matt Powell, Amar Mukherjee and Bioinformatics 2008, 9:197.
Don Adjeroh, Searching BWT compressed 10. Mansi R. H., J. Q. Alnihoud, 2010, An
text with the Boyer-Moore algorithm and efficient ASCII-based algorithm for single
binary search. University of Central Florida, pattern matching, Information Technology
USA, 2001, pp. 1-10. Journal, 2010, 9: 453-459.
5. Rahul Thathoo, Ashish Virmani, S. Sai 11. Boyer R. S., Moore J. S., A fast string
Lakshmi, N. Balakrishnan and K. Sekar1, searching algorithm. Commun. ACM 1977,
TVSBS: A fast exact pattern matching 20, 762-772.
algorithm for biological sequences. India, 12. Domenico Cantone and Simone Faro,
2006, pp. 47-53. Forward-fast-search: Another fast variant of
6. S. S. Sheik, Sumit K. Aggarwal, Anindya the Boyer-Moore string matching algorithm.
Poddar, N. Balakrishnan and K. Sekar, A fast Dipartimento di Matematica e informatica,
pattern matching algorithm, J. Chem. Inf. Universita di Catania, Italy, 2003, pp. 10-24.
Comput. Sci. 2004, 44, 1251-1256.

IJCTA | NOV-DEC 2011 1883


Available [email protected]

You might also like