A Fast String Matching Algorithm: H N Verma, Ravendra Singh M.Tech (CSE-0104cs09mt16) RKDF IST Bhopal, India
A Fast String Matching Algorithm: H N Verma, Ravendra Singh M.Tech (CSE-0104cs09mt16) RKDF IST Bhopal, India
A Fast String Matching Algorithm: H N Verma, Ravendra Singh M.Tech (CSE-0104cs09mt16) RKDF IST Bhopal, India
ABSTRACT
The pattern matching is a well known and important task of the pattern discovery process in todays world
for finding the nucleotide or amino acid sequence patterns in protein sequence databases. Although pattern
matching is commonly used in computer science, its applications cover a wide range, including in editors,
information retrieval. In this paper we propose a new pattern matching algorithm that has an improved
performance compare to the well known algorithms in the literature so far. Our proposed algorithm has
been evolved after the comparatively study of the well known algorithms like Boyer Moore , Horspool and
Raita. When we are talking about the overall performance of the proposed algorithm it has been improved
using the shift provided by the Horspool search bad-character and by defining a fixed order of comparison.
The proposed algorithm has been compared with other well known algorithm.
1. INTRODUCTION
Pattern matching problem has attract a lot of After a whole match or a mismatch of the
interest throughout the history of computer pattern, the text window is shifted in the forward
science, pattern matching has been used in direction until the window is positioned at the (n-
various computer application for several decades m+1) position of the text. This approach is
these algorithm are applied in most of the similar to the brute-force algorithm. In brute-
Operating Systems, Editors, Search Engines on force algorithm, window is shifted to the right by
the internet, retrieval of information and one character after an attempt.
searching Protein or DNA sequence pattern in
genome and protein sequence databases. For a large character of text the brute-force
algorithm is not efficient to perform this task.
Theoretical studies of different algorithm suggest We have several well known algorithms in the
various possible means by which those literature so far and these have their own
algorithms are likely to perform but in some advantages and limitation based on the method
cases they fail to predict the actual performance, they use to calculate the shift value(the number
here we demonstrate that better methods can be of characters the window should move forward).
devised from theoretical analysis by extensive The algorithms vary in the order in which
experimentation and modification of the existing character comparisons are made and the distance
algorithm. by which the window is shifted on the text after
each attempt.
Pattern matching can be defined as the text is an
array T[1n] of length n and that the pattern is The efficiency of the algorithm lies in two
an array p[1m] of length m<=n. We further phases:
assume that the element of P and T are character 1. The preprocessing phases
drawn from a finite alphabet . For example, we 2. The searching phase
may have = {0, 1}, = {a, b ,z} etc. The
character arrays P and T are often called strings The preprocessing phase collects the full
of characters. All pattern matching algorithm information about the pattern and uses this
scan the text with the help of a window which is information in searching phase. In searching
equal to the length of the pattern. phase, pattern is compared with the sliding
window from right to left or left to right until a
The first process is to align the left ends of the match or mismatched occur. The efficiency of
pattern with the text window and then compared pattern-matching algorithm is calculated on the
the corresponding characters of the window and basis of the search phase. The efficiency of the
the pattern. This process is known as an attempt. search is achieved by changing the order in
which the characters are compared at each
1. When the alphabet size is large and the pattern //The function for matching.
length is too small, then Boyer-Moores bad- //Input is Pattern and Text string.
character technique is not affected. //flag is a variable to check whether pattern
occurs in the text or not at all.
2. Boyer-Moore algorithm has two different //The valid shift values are from 0 to n-m.
tables to skip the window, one is Boyer-Moores /*First the last character of Pattern and Text
Bad Character and another is Boyer-Moores Window is compared, in case of a match, the
good suffix to calculate the skip of the window. first character of Pattern and the Text Window is
While Horspools Bad Character is always have compared, in case of these two matches, the
the values to be 1 and so Horspools algorithm mid(if m is even, then lower median) character
works independently. of the Pattern and the Text character will be
compared. If it also matches then we compare
3.2 SEARCHING PHASE second-last to second character of Pattern with
corresponding text window. If all characters
The proposed algorithm search the pattern in the matches then we get a message that Pattern has
window, first it match the right most character of been occurred. In case of mismatch at anytime
the pattern to the text window, in case of a match we compute the shift value as shift +
the left most character of the pattern to the text L[Text[shift+m]].*/
window is compared, if these match, then the
middle character of the pattern to the text matching( )
window is compared, otherwise in case of {
mismatch the pattern is shifted by the shift value int flag=0;
calculated by the Horspool Bad Character (Hbc) int shift = 0;
value of the right most character of the pattern.
while(shift <= n-m)
In case of matched middle character of the {
pattern, it compare the second last character of if(Pattern[m]== Text[shift+m])
the pattern (m-1) to second character of the {
pattern to the text window. Middle character of if(Pattern[1]==Text[shift+1])
the pattern is compared twice to the {
corresponding character of text window. This if(Pattern[m/2] == Text[shift + m/2])
procedure is repeated until the shift does not {
reach to n-m+1. i = m-1;
k = 0;
3.3. C LANGUAGE IMPLEMENTATION while(i>=2&&Pattern[i]==Text[shift+i])
{
//This function computes the Horspool Bad k = k + 1;
Character, Hbc, table and store it in an array L. i = i - 1;
//Input is Pattern and the number of elements in }
as ALPHABET. if(k==m-2)
//m is the length of Pattern. {
printf("\n\n\t Pattern occurs with shift %d", shift);
preprocessing( ) flag=1;
{ }
int i; }
}
for(i=0; i<ALPHABETS; i++) }
L[i] = m; shift = shift + L[Text[shift+m]];
}
for(i=1; i<=m-1; i++)
L[Pattern[i]] = m-i; if(flag == 0)
} printf("\n\n\t Pattern does not found");
else
printf("\n\n\t No more pattern exists");
}
The Horspool Bad Character table is obtained Pattern occurs with shift 1.
with a size [10] that store the character and its New shift value is calculated as shift + Hbc[R]
corresponding skip value. The set of alphabets i.e. 1 + 3 = 4
ALPHABET () we have taken, includes all
capital and small English letters, all digits and Total number of attempt is 2;
some special characters and punctuation marks. Total number of character comparison: 7.
.. ... BBARBERBABCDEEAACDEEBBRBSRBAR
Hbc 4 2 6 6 1 6 3 6 EERBERBERBARBER
.
1
BARBER
BBARBERBABCDEEAACDEEBBRBSRBAR
EERBERBERBARBER
1
BARBER
250
200
Average Time in 10-2 mSeconds
150 BM
Raita
Horspool
100 VS
50
0
2 4 6 8 10 12 14 16 18 20
Pattern Length