A Two Way Pattern Matching Algorithm Using Sliding Patterns

Download as pdf or txt
Download as pdf or txt
You are on page 1of 5

2010 3rd International Conforence on Advanced Computer Theory and Engineering(ICACTE)

A Two Way Pattern Matching Algorithm Using Sliding Patterns

Approach to reduce comparisons

Radhakrishna.V B.Phaneendra,V.Sangeeth Kumar


Department of CSE Department of CSE
BITS, Narsampet BITS, Narsampet
Warangal, A.P, India Warangal, A.P, India
e-mail: [email protected] e-mail:[email protected]

Ahstract- The string matching problem has attracted a lot of Although data is memorized in various ways the text
interest throughout the history of computer science and is remains the main form to exchange information. This is
becoming still crucial to computing industry. This problem particularly evident in the literature or linguistics where data
arises in many computer packages in the form of spell checkers, is composed of huge corpus and dictionaries. This apply as
search engines on the internet, matching of DNA strands well to computer science where large amount of data is
firewalls etc. Also as the quality of data available in these fields
stored in linear files and this is also the case in molecular
tends to double every 18 months, this seeds the need of more
biology because biological molecules can often be
efficient algorithms even though the speed and storage capacity
approximated as sequence of nucleotides or amino acids.
of computers increase regularly. In this paper we present an
Further as the quality of data available in these fields tends to
efficient pattern matching algorithm based on preprocessing of
double every 18 months, this seeds the need of more efficient
the pattern string by considering three consecutive characters
algorithms even though the speed and storage capacity of
of the text that immediately follow the aligned pattern window
in an event of mismatch between pattern and text character.
computers increase regularly.
The algorithm makes use of two sliding patterns. Both patterns Applications require two kinds of solution depending on
slide in parallel over the text until the first occurrence of the whether pattern or text is given first. Algorithms based on
pattern is found or a failure occurs. the use of automata or combinatorial properties of strings are
commonly implemented to preprocess the pattern and solve
Keywords- pattern matching, preprocessing, string matching the first kind of problem. The notion of indexes realized by
automata or trees is used by second class of problems.
In this paper the idea is to preprocess the pattern string
I. INTRODUCTION and hence this falls under the problem of first class. Section
The string matching problem has attracted a lot of II in this paper describes the related works carried out in the
interest throughout the history of computer science and is literature, Section III and IV describes the proposed
becoming still crucial to computing industry. String algorithm with working example.
matching is finding occurrence of a pattern string in the huge
II. RELATED WORKS
text string. This problem occurs in many computer packages
in the form of spell checkers, search engines on the internet, Many promising data structures and algorithms
matching of DNA strands firewalls etc. discovered by theoretical community are never implemented
String matching algorithms works as follows. Initially the or tested at all. This is because the actual performance of the
pattern string p[l...m] of length m to be searched for is algorithm is not analyzed as only the working of the
aligned to the left most end of the text string T[l...n] of algorithm in practice is taken care. There have been several
length n. The search process begins by comparing the pattern pattern matching algorithms designed in the literature till
characters with the characters of the text string and if a date as discussed below.
mismatch occurs the pattern is shifted by an amount defined In the naive or Brute Force technique [6] the string to be
by the algorithm being used. This is the key criteria where searched is aligned to the left end of the text and each
behaviors of most of the pattern algorithms differ. The speed pattern character is compared with the corresponding text
at which a mismatch between a pattern and the text is character. In this process if a mismatch occurs or the pattern
detected and the amount of shift defined on an event of character exhausts the pattern is shifted by one unit towards
mismatch is the key issue for judging the criteria of any right. The search is again resumed from the start of the
pattern matching algorithm and decides the efficiency of a pattern until the text is exhausted or match is found.
pattern matching algorithm. Naturally the number of comparisons being performed is
Monitoring number of comparisons performed by each very more as each time the pattern string is shifted right by
algorithm was chosen as a way to compare the pattern search only one unit towards right. The worst case comparison of
algorithms [3]. the algorithm is O(mn).

978-1-4244-6542-2/$26.00 © 2010 IEEE V2-666


2010 3rd International Conference on Advanced Computer Theory and Engineering(ICACTE)

The number of comparisons can be reduced if we can are computed based on the mathematical formulation shown
move the pattern string by more than one unit. This was the below.
idea of KMP algorithm [4]. The KMP algorithm compares
the pattern from left to right with the text just as BF Bad Character of shift right
algorithm. When a mismatch occurs the KMP algorithm
moves the pattern to the right by maintaining the longest
m+2 if p[m-l]=a
overlap of a prefix of the pattern with a suffix of a part of
m+l if p[m-l]p[m-2]=ba
the text that matched the pattern so far. The KMP algorithm
does almost 2n text comparisons and the worst case Bad char of i+3 if p[i]p[i+l]p[i+2]=abc
complexity of the algorithm is 0 (m+n). Shiftr [a, b, c] =min 1 if p[O]p[l] =bc
Boyer-Moore algorithm[2] differs from other algorithms 2 if prO] =c
in one feature. Instead of comparing the pattern characters m+3 Otherwise
from left to right the comparison is done from the right
towards left by starting comparison from the rightmost
character of the pattern. In case of a mismatch it uses two
Begin
functions last occurrence function and good suffix function.
Shift=shiftr=m+3
If the text character does not exist in the pattern then the last
for (each character pi where Pj=O to m-3)
occurrence function returns m where m is the length of the
{Next[i) =m-i, next[i) =i+2}
pattern string. So the maximum shift possible was m.The
if P [m-l)=a {shift=1,shiftr=m+2}
worst case running time of the algorithm is 0 (mn).
else if p[m-l)p[m-2)=ba {shift=2,shiftr=m+l}
In [7] the authors propose the pattern matching based on
else if p[i)p[i+l]p[i+2]=abc {shift=next[i],shiftr=next[i]}
two sliding windows known as TSW algorithm.
else if p[O)p[l)=bc {shift=m+l,shiftr=l}
In this paper the idea is to reduce the number of
else if prO] =c {shiftl=m+2, shiftr=2}
comparisons being performed by obtaining as much shift as
End
possible. A shift value of more than length of the pattern is
obtained when the pattern gets mismatch with the text for
Figure 1: Algorithm for shift left and shifts right function
one instance.

B. Searching Phase
III. PROPOSED ALGORITHM
In this phase the pattern string to be searched for in the
The proposed algorithm consist of two phases given text string is aligned at both ends of the text and
a) Preprocessing phase scanned simultaneously from left and right ends for the
b) Searching phase occurrence of the pattern string.
A. Preprocessing phase
Initially the search process begins by searching for the
pattern string from left end of the text. In case of a mismatch
In this phase the pattern to be searched for in the given text while searching from left, the pattern window is shifted to
string is preprocessed i.e. it mainly deals with the right by a displacement defined by shift-left function. Then
computation of the shift-left and shift-right values which are the search turns to right end of the text and when a
later used in the searching process. To calculate shift-left mismatch occurs while scanning for the occurrence of the
and shift-right, the algorithm uses three consecutive pattern from right end of the text, the shift-right function is
characters a, b, c which are aligned immediately after the used to shift the pattern towards left. This process is
pattern window. Initially the indices of three consecutive continued until the pattern is found or the left and right
characters in the text string from the left are indices over cross each other.
(m+1),(m+2),(m+3) for a,b,c respectively represented
mathematically as shown below. Step1: Initially the search process begins by aligning the
pattern to both ends of the text i.e. by aligning p [0] to T [0]
Bad Character of shift left at the left end and p[m] to T[n] at the right end respectively.
Instead of searching all the characters of the pattern
1 if p[m-I]=a sequentially through p[O] ....p[m-I], only the dead end
characters of the pattern are compared and if satisfied
2 if p[m-I]p[m-2]=ba
remaining characters are compared. i.e. P[O] is compared
Bad char of m-i if p[i]p[i+1]p[i+2] =abc
with corresponding text character if it satisfies then p[m-I]
Shiftl [a, b, c] = min m+I if p[O]p[1]=bc
is compared with the corresponding text character aligned
m+2 if p[O]=c
and in an event of success the remaining pattern characters
m+3 Otherwise say p[m-2] ....p[I] are compared to reduce the number of
comparisons.
In the same way, the indices of three successive characters When a mismatch occurs during comparison from both
a,b,c from right are (n-m-3),(n-m-2),(n-m-I) respectively sides the algorithm goes to step2 otherwise the search

V2-667
2010 3rd International Conference on Advanced Computer Theory and Engineering(ICACTE)

process continues until a pattern is found. On success it


returns the position of the pattern. When L and R cross each The entire search process for the pattern P in the text T is
other the search is said to be failure, which makes the illustrated with the help of the example below as shown in
algorithm to stop the process. the Figure 3 below.

Step2: In this step when a mismatch occurs during the o I 2 3 4 56789 W U U 0 W U � n � w w � n D U


search process the shift-left function is used to shift the
pattern by using the immediate three characters which are
placed to the right side of the left window and shift-right
A Ae c eTA A TTee
function is used to shift the pattern by using the immediate
three characters which are placed to the left side of the right
pattern window until the first occurrence of the pattern is
found in the text from either sides or until both the windows
(a)
are positioned beyond.
-'24 252627 2829 311 31 32 33 34 3536373839 40 41 42 43 44 4546

Tnt
The detailed algorithm for the same is given below.
-

AGAG XGA GA AT eGAA T (' A


L=O;lltext index used from left
R=n-m;lltext index used from right PalUmAfutshift
found=false;llinitial value
While(L<=R) do
1=r=m-l;llpattern index for left and right
if(P[O]=T[LD (b)
while(I>O) do
if(P[I]=T[L+lD 1--; o I 2 3 4 5 678 9 10 U 12 0 W 151617 1819 W 21 22 23 24
Else
break;
;.� T eTA Xc ATeA TA lieTA A T Tee
end while
if(P[O]=T[RD � WU � 1 2 3 4 567 89
while(r>O) do
if (P[r] ==T[r+RD r--; PalttmAl\ershift
else (c)
break;
U 252627W 29 311 D 32 D 34 ! • 37 � � w « a a « I •

end while
if(I=O){Pattern found left side atL} exit from
outer loop
Ten -
AGAG AGAGA ATe AATeG A ATe A
if(r==O){pattern found right side at:R} exit from
outer loop
L=L+shiftL (text substring (L+m,L+m+3)); (d)
R=R-shiftR (text substring(R-3, R));
if (L>R){pattern not found} exit from outer loop Figure 3: Working Example of proposed algorithm
end while
First attempt: In the first attempt the pattern aligned to
Figure 2: Searching algorithm the left end of the text is considered and the pattern
character p[O] is compared with the text character T[O]. In
this case there is a mismatch between the pattern character
IV. WORKING EXAMPLE G and text character A. Therefore we take the immediate
Consider the text and pattern string as shown below where three characters following the text at indices 8,9,10. Here
text length n=47 and pattern length m=8 T[8]T[9]T[10] = TCA. Also the pattern characters at indices
3, 4, 5 are TCA respectively. i.e. p[3]p[4]p[5]=TCA makes
Text i=3, i+1=4, i+2=5. The shift value is computed by using the
shift left function given by m-i = 8-3 =5 in this case. Hence
ATCTAACATCATAACCCTAATTGGCAGAGAGAGAA
the pattern is shifted to the right 5 units. This makes the
TCAATCGAATCA
pattern characters p [O] ...p [7] aligned to T [5] ...T [12].
Pattern
Now as there is a failure while searching from left, we shift
GAATCAAT

V2-668
2010 3rd International Conference on Advanced Computer Theory and Engineering(ICACTE)

the search process to continue from right end of the text in


the next step.
COlllll1l1'isons
Second attempt: In the second attempt the pattern window
aligned to the right end of the text is considered for search
mo �------� r-------�
process and the pattern character p[O] is compared with the -+-P rqJosed
text character T[39]. Since there is a mismatch between the ___ �tithm
pattern character G and text character T, we consider the
immediate three characters to the left of the text at indices
BR
36, 37, 38 respectively and evaluate the shift value obtained
-+-BM
by using shift right function. This is because we are
searching for the text string from the right end. Now the text
characters at these indices are T[36]T[37]T[38]=CAA and --KMP
the pattern characters at the indices 4,5,6 are C,A,A
respectively. In other words p[4]p[5]p[6]=CAA which
makes i=4. The shift value is given by i+3 = 4+3=7 in this
case. Hence the pattern is shifted by 7 units towards left as 4 5 6 7 8 9 10 11
shown in the fig below. This makes the pattern characters
Pattern LengUl
T[32].....T[39] aligned to p[O] ...p[7]. Since there is a failure
in the search process from right end we stop the search
Figure 4 : Comparison of proposed algorithm with other pattern
process and continue the search from the left side.
matching algorithms

Third attempt: Now the pattern character p[O] is compared


with the text character T[5]. In this case there is a mismatch
The Figure 5 below shows the comparison of TSW
between the pattern character G and text character A.
Because of a mismatch we consider the 3 immediate algorithm and proposed algorithm
characters after the text at indices 13, 14, 15 which are A, C,
C respectively. In other words T[13]T[14]T[15]=ACC
where a=A,b=C,c=C. The shift value obtained by using COMPARISONS
shift-left function is given by m+3 i.e 8+3 = 11 in this case. ----

Hence the pattern is shifted to right by 11 units which �O Tr===============i 1


makes the pattern characters p[O] ...p[7] aligned to
� 400 I -+-PROPOSED ALGORITHM
--- TSVV I
T[16] ...T[23] respectively. o 350
.!tl
� 3]0
Fourth attempt: Because of a mismatch while searching
from the left end of the text previously so in the fourth � 30
attempt the search is continued by considering the pattern 8 2]0
aligned to the right end of the text at indices 32...39. Now ..
o
150
the pattern character p[O] at position 1 is compared to T[32] o 100
and there is a match. This is followed by comparing the
Z 50
pattern character p[7] at position 8 to the text character T[39]
and there is a match. Because both the dead end characters O +--.---r--��---.--.--'---.�
of the pattern p[O] and p[7] are matched we now start 4 5 6 7 8 9 10 11 12
comparing the remaining characters of the pattern
p[6] ....p[l] with the text characters T[38] ...T[33] as long as Pl1tern length
there is a match.
Figure 5: Comparison of TSW algorithm and proposed algorithm
In this case all the characters of the pattern match the
corresponding text which makes the search success and
hence the position of the pattern is returned.
The purpose of making a constraint on dead end V. CONCLUSION
characters of the pattern is to reduce the number of
comparisons if any. In this paper we present an efficient pattern matching
algorithm based on preprocessing of the pattern string by
considering three consecutive characters of the text in an
The Figure 4 below shows the comparison of the various event of mismatch between pattern and text character. The
pattern matching algorithms as mentioned with the proposed idea of considering three consecutive characters is from the
algorithm fact that occurrence of three successive characters is less
frequent than the other possibilities because of which even

V2-669
2010 3rd International Conference on Advanced Computer Theory and Engineering(ICACTE)

the shift value obtained is also more compared to KMP or


Boyer-Moore algorithms. The concept of searching from
both sides makes the algorithm efficient when the pattern is
even at the end of the text. The reduction in the number of
attempts and comparisons is seen on using of the proposed
algorithm.

REFERENCES

[I] Wang, Y. and H. Kobayashi, 2006. High performance pattern


matching algorithm for network security.
URL:http: //paper. ijcsns.org/07 bookl200610/
_

200610A3.pdf
[2] BOYER, R. S. AND MOORE, J. S. 1977. A fast string
searching algorithm. Commun. ACM 20, 10,762-772.
[ 3] T.BERRY AND S.RAVINDRAN A Fast String Matching Algorithm
And Experimental Results, Proceedings of Prague Stringology
Club, workshop'99
[4] Knuth, D'£', J.H. Morris and V.R. Pratt, 1977. Fast pattern matching
in strings. SIAM J. Comput., 6: 323-350.7.
[5 ] Horspool, R.N., 1980. Practical fast searching in strings. Software
Practice Experience, 10: 501-506.
[6] Handbook of Exact String Matching Algorithms
[7] A Fast Pattern Matching Algorithm Using Two Sliding Windows,
Journal of computer science, Vol 6 2008

V2-670

You might also like