Lecture Notes On Pattern Matching Algorithms
Lecture Notes On Pattern Matching Algorithms
Lecture Notes On Pattern Matching Algorithms
1. Introduction
1 n
text ababaa
pattern ababbabab
1 m
mismatch
At first the pattern is set to the left end of the text, and matching process starts. After a
mismatch is found, pattern is shifted one place right and a new matching process starts,
and so on. The pattern and text are in arrays pat[1..m] and text[1..n] respectively.
1. j:=1;
2. while j <= n-m+1 do begin
3. i:=1;
4. while (i<=m) and (pat[i]=text[j]) do begin
5. i:=i+1;
6. j:=j+1
7. end;
8. if i<=m then j:=j-i+2 /* shift the pattern one place right */
9. else write(“found at “, j-i+1)
10. end.
The worst case happens when pat and text are all a’s but b at the end, such as pat =
aaaaab and text = aaaaaaaaaaaaaaaaaaaaaaaaaaaaab. The time is obviously O(mn). On
average the situation is not as bad, but in the next section we introduce a much better
algorithm. We call the operation pat[i]=text[j] a comparison between characters, and
48
measure the complexity of a given algorithm by the number of character comparisons.
The rest of computing time is proportional to this measure.
When we shift the pattern to the right after a mismatch is found at i on the pattern and j
on the text, we did not utilise the matching history for the portion pat[1..i] and
text[j-i+1..j]. If we can get information on how much to shift in advance, we can shift the
pattern more, as shown in the following example.
Example 1.
1 2 3 4 5 6 7 8 9 10 11 12 13 14
text a b a b a a b b a b a b b a
pattern a b a b b
a b a b b
a b a b b
a b a b b
After mismatch at 5, there is no point in shifting the pattern one place. On the other hand
we know “ab” repeats itself in the pattern. We also need the condition pat[3] <> pat[5]
to ensure that we do not waste a match at position 5. Thus after shifting two places, we
resume matching at position 5. Now we have a mismatch at position 6. This time shifting
two places does not work, since “ab” repeats itself and we know that shifting two places
will invite a mismatch. The condition pat[1]<>pat[4] is satisfied. Thus we shift pat three
places and resume matching at position 6, and find a mismatch at position 8. For a
similar reason to the previous case, we shift pat three places, and finally we find the
pattern at position 9 of the text. We spent 15 comparisons between characters. If we
followed Algorithm 1, we would spend 23 comparisons. Confirm this.
The information of how much to shift is given by array h[1..m], which is defined by
h[1] = 0
text xxx a
pattern x x x c xxx b pat[i]=b
pattern shifted xxx c
h[i]
49
A B
Analysing the pattern xxx c xxx b
s i
The meaning of h[i] is to maximise the portion A and B in the above figure, and require
b<>c. The value of h[i] is such maximum s. Then in the main matching process, we can
resume matching after we shift the pattern after a mismatch at i on the pattern to position
h[i] on the pattern, and we can keep going with the pointer j on the text. In other words,
we need not to backtrack on the text. The maximisation of such s, (or minimisation of
shift), is necessary in order not to overlook an occurrence of the pattern in the text. The
main matching algorithm follows.
1. i:=1; j:=1;
2. while (i<=m) and (j<=n) do begin
3. while (i>=0) and (pat[i]<>text[j] do i:=h[i];
4. i:=i+1; j:=j+1
5. end
6. if i <= m then write(“not found”)
7. else write(“found at”, j-i+1)
f(1) = 0
The definitions of h[i] and f(i) are similar. In the latter we do not care about
pat[s]<>pat[i]. The computation of h is like pattern matching of pat on itself.
t
x x x x a
x x x x a x x x x b
t i-1 i
h[f(i)]
x x
f(i)
x x x x a
x x x x a x x x x b
50
Algorithm 3. Computation of h
1. t:=0; h[1]:=0;
2. for i:=2 to m do begin
3. /* t = f(i-1) */
4. while (t>0) and (pat[i-1]<>pat[t] do t:=h[t];
5. t:=t+1;
6. /* t=f(i) */
7. if pat[i]<>pat[t] then h[i]:=t else h[i]:=h[t]
8. end
Example. pat = a b a b b
Finally we have
i | 1 2 3 4 5
------------------------------
pat | a b a b b
f | 0 1 1 2 3
h | 0 1 0 1 3
The time of Algorithm 2 is clearly O(n), since each character in the text is examined at
most twice, which gives the upper bound on the number of comparisons. Also, whenever
51
a mismatch is found, the pattern is shifted to the right. The shifting can not take place
more than n-m_1 times. The analysis of Algorithm 3 is a little more tricky. Trace the
changes on the value of t in the algorithm. We have a doubly nested loop, one by the
outer for and the other by while. The value of t can be increased by one at line 5, and m-
1 times in total, which we regard as income. If we get into the while loop, the value of t
is decreased, which we regard as expenditure. Since the total income is m-1, and t can
not go to negative, the total number of executions of t:=h[t] can not exceed m-1. Thus the
total time is O(m).
Summarising these analyses, the total time for the KMP algorithm, which includes
the pre-processing of the pattern, is O(m+n), which is linear.
In the KMP algorithm, we started matching process from the left end of the pattern. In
the Boyer-Moore algorithm, we start matching from the right end of algorithm. By doing
pre-processing on the pattern, we know how much to shift the pattern over the text when
a mismatch is detected. See the following figure.
Text
xxxxx
xxxxx
Pattern
xx
If the matched portion or a part of it exists within the pattern, we make an alignment as
shown in the figure, and start the matching process from the right end of the pattern. For
a random text and pattern, the probability that the matched portion appears in the pattern
is very small, which means that the pattern can move the distance of m, which in turn
gives us the time of O(n/m). As the parameter m is in the denominator, this complexity is
called sub-linear, much faster than the KMP algorithm for a random text and pattern.
52
j
row 1
row k-1
i row R
We call the m rows of the pattern row 1, ..., row m. Some of the rows may be equal.
Thus we can compute the h function of row 1, ..., row m for the vertical pattern
matching. We prepare a one-dimensional array a[1..n] to keep track of the partial
matching. If a[j]=k, it means we have found row 1, ..., row k-1 of the pattern in the
region of the text from the point (i,j) towards up and left. If the newly found row R is not
found to be any row of the pattern, a[j] is set to 1. If row R belongs to the pattern, we use
the h function of the vertical pattern. While R <> row k, we perform k:=h[k].
After this, if R matches row k, a[j] is set to k+1. If a[j] becomes m+1, the pattern is
found. By a similar reasoning to the one-dimensional case, we can show that the time is
O(m2 + n2), which is linear. This is much better than the exhaustive method with
O(m2n2), as the data size is enormous in two-dimensions. There is an algorithm with sub-
linear time, which is better than Bird’s.
53
h function
1 2 3 4 5
-------------------------
pat 1 2 3 1 3
h 0 1 1 0 2
j
a a a b a b a
b a a b b a b a[j]=2
i a a b a b a b changed to a[j]=1
j
a a a b a b a
b a a b b a b a[j]=2
i a a a a b b b changed to a[j]=3
54
Lecture Notes on Computational Geometry
struct point {
int x, y; (x, y)
}
In this section, we develop an algorithm for determining whether two line segments
intersect. Let two line segments be q1 and q2. That is, the two end points of q1 are q1.p1
and q1.p2, and those of q2 are q2.p1 and q2.p2. We use the following condition for
intersection. Note that a line segment defines an infinite (straight) line
q2.p2
q1.p1 q2
q1
q1.p2
q2.p1
Condition: The two end points q1.p1 and q1.p2 must be on the opposite sides of the line
defined by q2, and the two end points q2.p1 and q2.p2 must be on the opposite sides of
the line defined by q1.
As an auxiliary function, we make the function “turn” that tests whether the angle made
by three points p0, p1, and p2. The function test whether the vector p0 -> p2 is ahead of
the vector p0 -> p1 anti-clock wise. If it is true, it returns 1, and -1 otherwise. See the
following figure. The function value is 1 in this case.
p2
p1
p0
55
The function “intersect” together with “turn” is given in the following.
struct point { int x, y; } dy1 p1 p2
dy2
struct segment { struct point p1, p2;}
trun = -1
int turn {struct point p0, struct point p1, struct point p2)
{ /* The last three cases are when three points are on a straight line */ dx1 dx2
int dx1, dx2, dy1, dy2;
dx1 = p1.x - p0.x; dy1 = p1.y - p0.y;
dx2 = p2.x - p0.x; dy2 = p2.y - p0.y;
if (dx1*dy2 > dy1*dx2) return 1; p1 p2
if ( dx1*dy2 < dy1*dx2) return -1;
if ((dx1*dx2 < 0)||(dy1*dy2 < 0) return -1;
if ((dx1*dx1 + dy1*dy1) < (dx2*dx2 + dy2*dy2)) return 1; p0
return 0;
}
int intersect(struct segment s1, struct segment s2)
{ if(((s1.p1.x==s1.p2.x)&&(s1.p1.y==s1.p2.y))||
((s2.p1.x==s2.p2.x)&&(s2.p2.x==s2.p2.x)))
/* case 1: One of two segments shrinks to one point */
return ((turn(s1.p1, s1.p2, s2.p1)*turn(s1.p1,s1.p2,s2.p2))<=0)
|| ((turn(s2.p1, s2.p2, s1.p1)*turn(s2.p1,s2.p2,s1.p2))<=0);
else /* case 2: Neither of segments is a single point */
return ((turn(s1.p1, s1.p2, s2.p1)*(turn(s1.p1, s1.p2, s2.p2)) <= 0)
&&((turn(s2.p1, s2.p2, s1.p1)*trun(s2.p1, s2.p2, s1.p2))) <= 0);
}
main( ) /* for test */
{ p2 (10, 10)
struct segment test1, test2;
test1.p1.x = 0; test1.p1.y = 0; p2 (3, 7)
test1.p2.x = 10; test1.p2.y = 10;
test2.p1.x = 8; test2.p1.y = 1;
test2.p2.x = 3; test2.p2.y = 7; test1 test2
printf(“ %d\n”, intersect(test1, test2));
} p1(0, 0) p1 (8, 1)
This program correctly computes whether two segments intersect even in special cases
like one point is on the other segment or one end point of one segment and one point of
the other segment fall on the same point .
3. Interior point
We test if a point is inside a polygon. The polygon in not necessarily convex. We draw a
semi-infinite line from the given point. If it intersects with the edges of the polygon an
odd number of times, it is inside, and outside if even. When we traverse vertices of the
56
polygon, we check if they are on the semi-infinite line. If so we skip them and a both-
infinite line is used for intersection test. See the following figure.
p[5]
p[3]
v
p[4]
p[2]
p[1]
This program is easy to design if we use the function “intersect”. We give the polygon by
array p[1], ..., p[n] of points. The program follows.
57
p[2].x = 5; p[2].y = 1;
p[3].x = 4; p[3].y = 5;
p[4].x = 9; p[4].y = 2;
p[5].x = 2; p[5].y = 8;
v.x = 3; v.y = 3;
printf(“ %d\n”, inside(p, n, v));
A convex polygon is a polygon such that the inner angle at each vertex is less than 180
degrees. The convex hull of n points is the smallest convex polygon that includes those
points. We first describe the package wrapping method with O(n2) time and then
Graham’s algorithm with O(n log n) time.
Package wrapping: We start from a specified point and select the point we first
encounter by going anti-clockwise, and repeat this process by going to the selected point.
In the following figure we start from point 1 and select point 2, ..., 6. Finally we select 1
and finish.
5 4
1 2
We need to compare the angles between segments. The C library function arctan will
give us the angle in radian. If we just determine which of two angles is greater, we can
use the following function “angle”, which is more efficient.
58
{
/* p[1] through p[n] are vertices of a polygon */
int i, m, min; double theta, prev;
struct point t;
min = 1;
for (i=2; i<=n; i++) if (p[i].y<p[min].y) min = i;
t = p[1]; p[1] = p[min]; p[min] = t;
p[n+1] = p[1];
min = n; theta = 360.0;
for (i=2; i<=n; i++)
if (angle(p[1], p[i]) < theta) {min = i; theta = angle(p[1], p[min]);
}
for (m=2; m<=n; m++) {
t = p[m]; p[m] = p[min]; p[min] = t; /* swap p[m] and p[min] */
prev = theta;
min = n + 1; theta = 360.0;
/*** This part finds the point p[min] such that angle(p[m], p[min]) is minimum
and greater than the previous angle. ***/
for (i=m+1; i<=n+1; i++)
if ((angle(p[m], p[i]) < theta )&& (angle(p[m], p[i]) >= prev))
{min = i; theta = angle(p[m], p[min]);}
if (min == n+1) return m; /* m is the number of points on the hull */
}
}
As the program has a double nested loop structure, the computing time is O(n2), which is
not very efficient. In the following, we introduce Graham’s algorithm invented by R. L.
Graham in 1972, which runs in O(n log n) time.
Graham’s algorithm: We start from the same point as before, that is, the lowest point
p[1]. We sort the angles angle(p[1], p[i]) for i=1, ..., n in increasing order by any fast
sorting algorithm, such as quicksort. Let p[1], ..., p[n] be sorted in this way. Then we
pick up p[2], p[3], ...in this order. We keep selecting points as long as we go anti-
clockwise from the previous angle. If we hit a clockwise turn, we throw away already
selected points until we recover clockwise angle. See the following figure.
p[5]
p[4]
p[3]
p[2]
p[1]
59
After we select up to p[4], we throw away p[3] and p[4], and connect p[2] and p[5]. To
determine clockwise turn or anti-clockwise turn, we use the function ‘turn’, introduced
before. The main part of the algorithm follows.
The main complexity is that of sorting. The time for the rest is O(n) as we examine each
point once. Hence the total time is O(n log n).
60
5. Voronoi Diagram
Suppose we have n points on a plane such as a square. The Voronoi diagram separates
those points by line segments or semi-infinte lines in such a way that in any closed region
there is only one point and any point in the region is closer to the point than to any other
point. If we imagine a big circle surrounding those points, we can define a region with
the semi-infinite lines and the circle. The regions define territories for those points, so to
speak.
Example.
Note that the three perpendiculars of a triangle meet at one point. In this example five
points are surrounded by line segments and the outer circle, and the five regions are
defined. The line segment between two points, say, a and b, is the perpendicular of the
points a and b. Any pair of two points which are in adjacent regions are connected by a
dotted line. Those dotted lines define triangles, and the resulting diagram is called the
Delauney diagram.
61
There are a number of algorithms for Voronoi diagrams. The following is based on the
divide-and-conquer paradigm.
1. Sort the n points according to the x-co-ordinates and call the p[1], ..., p[n].
2. Call Voronoi(1, n, P, V).
3 Define Voronoi(l, r, P, V) as follows:
4. If there are only three or less in P, draw the diagram directly.
5. Let m=(l+r).
6. Let P1 = {p[l], ..., p[m]}.
7. Call Voronoi(l, m, P1, V1).
8. Let P2 = {p[m+1], ..., p[r]}..
9. Call Voronoi(m, r, P2, V2).
10. Obtain the zipping line Z for V1 and V2 by algorithm M
11. Delete portions of line segments in V1 exceeding Z to the right.
12. Delete portions of line segments in V2 exceeding Z to the left.
13. Return the resulting diagram as V
Algorithm M
The time for Algorithm M is O(n). As usual, we can establish the following recurrence
for the computing time.
T(3) = c1
T(n) = 2T(n) + c2n
The solution is T(n) = O(n log n). Together with the sorting, the total time is O(n logn).
62
Example.
Legends
63