Efficient Algorithms For Normalized Edit Distance: Abdullah N. Arslan, Department of Computer Science
ABDULLAH N. ARSLAN¹, Department of Computer Science, University of California, Santa Barbara, CA 93106, USA. E-mail: [email protected]
ÖMER EĞECİOĞLU², Department of Computer Science, University of California, Santa Barbara, CA 93106, USA. E-mail: [email protected]
ABSTRACT: A common model for computing the similarity of two strings X and Y of lengths m and n respectively, with m ≥ n, is to transform X into Y through a sequence of edit operations, called an edit sequence. The edit operations are of three types: insertion, deletion, and substitution. A given cost function assigns a weight to each edit operation. The amortized weight for a given edit sequence is the ratio of its weight to its length, and the minimum of this ratio over all edit sequences is the normalized edit distance. Existing algorithms for normalized edit distance computation with proven complexity bounds require O(mn²) time in the worst case. We give provably better algorithms: an O(mn log n)-time algorithm when the cost function is uniform, i.e., the weights of edit operations depend only on the type but not on the individual symbols involved, and an O(mn log m)-time algorithm when the weights are rational.

Keywords: Edit distance, normalized edit distance, algorithm, dynamic programming, fractional programming, ratio minimization.
1 Introduction
Measuring the similarity of strings is a basic problem in computer science with applications in many fields such as computational biology, text processing, optical character recognition, image and signal processing, error correction, information retrieval, pattern recognition, and pattern matching in large databases. Given two strings X and Y over a finite alphabet whose lengths are m and n respectively, with m ≥ n, we consider sequences of weighted edit operations (insertions, deletions, and substitutions of characters) by means of which X is transformed into Y. If we call each such sequence an edit sequence, then the ordinary (conventional) edit distance problem (ED) seeks an edit sequence with minimum total weight over all edit sequences. The edit distance between X and Y is defined as the weight of such a sequence. Although the edit distance is a useful measure for the similarity of two strings, for some applications the lengths of the strings compared need to be taken
¹ Supported in part by a UCSB-COR research grant 1999-2000.
² Supported in part by NSF Grant No. CCR-9821038.
into account. Two binary strings of length 1000 differing in 1 bit may have the same edit distance as two binary strings of length 2 differing in 1 bit, although one would most likely say that only the 1000-bit strings are almost equal. Marzal and Vidal [7] considered an alternate measure, called normalized edit distance (NED), between X and Y that takes the lengths into account. If we define the amortized weight of an edit sequence as the ratio of the total weight of the sequence to its length, NED is the minimum amortized weight over all edit sequences. Marzal and Vidal showed that NED yields better results in empirical experiments. However, it seems that the computation of NED requires significantly more work than that of ordinary edit distance algorithms, which are mostly based on dynamic programming. It is also interesting to note that NED cannot be obtained by post-normalization, i.e., first computing the ordinary edit distance and then dividing it by the length of the corresponding edit sequence [7]. Ordinary edit distance can be computed in O(mn) time [13, 17, 5], or O(mn/log n) time if the weights are rational [8]. In order for NED computations to be advantageous, the computational complexity of an algorithm for the latter should not significantly exceed these bounds. There are several algorithms to compute NED, both sequential [7, 11, 16] and parallel [4]. Observing that the length of an edit sequence lies between m and m + n inclusive, an O(mn²)-time dynamic programming algorithm can be developed for this problem [7]. Furthermore, it has been noted that NED can be formulated as a special case of constrained edit distance problems [11]. By adapting the techniques used for the constrained edit distance problems, NED can be computed in O(smn) time, where s is the number of substitutions in an optimal edit sequence [11]. But since s can be as large as n, the worst-case time complexity remains O(mn²).
Another approach to NED computation uses fractional programming [16]: an iterative method in which an ordinary edit distance algorithm is invoked at each iteration. Experimental results on both randomly generated synthetic data and real applications, performed by Vidal, Marzal, and Aibar [16], suggest that the number of iterations necessary for the NED computation with this method is bounded by a small constant. This implies an experimental O(mn) time complexity for NED computation, but, as argued by Vidal, Marzal, and Aibar, a mathematical proof of a theoretical bound for this algorithm seems difficult, although it may be possible to analyze at least the average time complexity under a reasonable probabilistic model. In this paper, we direct our attention to NED computations for common types of cost functions. First, we analyze the case when the cost function is uniform, i.e. the weight of an edit operation does not depend on the symbols involved in the operation, but only on its type. In this case, we classify the edit operations into four types: insertions, deletions, matching substitutions, and non-matching substitutions. Second, we analyze the case when the weights are rational, i.e. they are integral multiples of a fixed real number, but not necessarily uniform. These and more restricted cases of cost functions have been studied in the literature in the context of ordinary edit distances [12, 15, 10, 8]. The previously suggested algorithms for NED do not specialize to faster than O(mn²)-time algorithms for the cases we analyze in this paper. We propose provably
better algorithms for NED computation in these cases. Our algorithms are similar to the fractional programming normalized edit distance algorithm in that they iteratively solve certain subproblems that involve shortest path computations. Each of these subproblems is solved by a standard ordinary edit distance algorithm, such as the one proposed by Wagner and Fischer [17]. But unlike the fractional programming algorithm, our algorithms perform a binary search for the optimum value of NED by making use of a technique developed by Megiddo [9] for the optimization of the ratio of two linear functions. The resulting worst-case time complexities of our algorithms are O(mn log n) when the weights are uniform (Corollary 4.4), and O(mn log m) when the weights are rational (Corollary 4.9). The outline of this paper is as follows. Section 2 consists of preliminaries: definitions, notation used, and the problem statement. In section 3 we describe the optimization of the ratio of two linear functions and Megiddo's method. Section 4 outlines our algorithms and gives proofs of their worst-case time complexities. This is followed by remarks in section 5 and conclusions in section 6.
2 Definitions
Let X = x_1 x_2 ⋯ x_m and Y = y_1 y_2 ⋯ y_n be two strings over an alphabet Σ, with m, n ≥ 0, not both null. We assume that the edit operations applicable to the symbols of X to transform it into Y are of three types: inserting a symbol into X, deleting a symbol from X, or substituting a symbol of Σ for a symbol of X. The substitution operation can further be broken down into matching and non-matching substitutions. More formally, for 1 ≤ i ≤ m, the allowable edit operations are
(1) Insertion: any symbol s ∈ Σ can be inserted before or after x_i,
(2) Deletion: the symbol x_i can be deleted,
(3) Substitution: the symbol x_i can be replaced by a symbol s ∈ Σ. A substitution operation is
    (3-a) a matching substitution if s = x_i,
    (3-b) a non-matching substitution if s ≠ x_i.
A common framework for edit distances is the edit graph G_{X,Y,γ} of the strings X and Y and a given cost function γ. The edit graph is a weighted directed acyclic graph having (m+1)(n+1) lattice points (i, j), for 0 ≤ i ≤ m and 0 ≤ j ≤ n, as vertices. The cost function γ determines the edge weights, as we will explain later. The top-left extreme point of this rectangular grid is labeled (0, 0) and the bottom-right extreme point is labeled (m, n), as shown in figure 1. The arcs of G_{X,Y,γ} are divided into three types corresponding to edit operations:
(1) Horizontal arcs: {((i−1, j), (i, j)) | 0 < i ≤ m, 0 ≤ j ≤ n} (deletions),
(2) Vertical arcs: {((i, j−1), (i, j)) | 0 ≤ i ≤ m, 0 < j ≤ n} (insertions),
(3) Diagonal arcs: {((i−1, j−1), (i, j)) | 0 < i ≤ m, 0 < j ≤ n} (substitutions).
If x_i = y_j, then the diagonal arc ((i−1, j−1), (i, j)) is a matching diagonal arc, otherwise a non-matching diagonal arc. Figure 1 illustrates an example edit graph for X = aba and Y = bab.
[Figure 1: the lattice of vertices (0,0), …, (3,3) with horizontal, vertical, and diagonal arcs labeled by the corresponding edit operations.]

FIG. 1: The edit graph G_{X,Y,γ} for the strings X = aba and Y = bab. The cost function is not specified in this case.
An edit path in G_{X,Y,γ} is a directed path from (0, 0) to (m, n). The steps of an edit path correspond to an edit sequence as follows: a horizontal arc ((i−1, j), (i, j)) corresponds to the deletion of x_i (i.e. x_i → ε), a vertical arc ((i, j−1), (i, j)) corresponds to the insertion of y_j immediately before x_i (i.e. ε → y_j), and a diagonal arc ((i−1, j−1), (i, j)) corresponds to the substitution of the symbol y_j for x_i (i.e. x_i → y_j), where ε represents the null string. In figure 1, the arcs are labeled accordingly. A cost function γ assigns a weight to each edit operation, turning G_{X,Y,γ} into a weighted graph. γ is given as a cost matrix whose rows and columns are indexed by the symbols in Σ ∪ {ε}. The entry at row s1 and column s2 is the weight γ_{s1→s2} of the edit operation s1 → s2. Note that ε → ε is undefined. Figure 2 shows an example. We first consider certain graph problems in G_{X,Y,γ} and give the definitions of the ordinary and normalized edit distances in terms of these. Let P_{X,Y} denote the set of all edit paths between X and Y in G_{X,Y,γ}. W(p) denotes the sum of the weights of the arcs in p ∈ P_{X,Y} (W is called the path-weight function with respect to γ), and L(p) denotes the path-length function, which gives the number of arcs in p ∈ P_{X,Y}. The shortest path problem S_{X,Y,γ} is the weight-minimization problem such that
S*_{X,Y,γ} = min_{p ∈ P_{X,Y}} W(p),   (2.1)

     | ε   a   b
  ---+------------
   ε | -   7   7
   a | 9   0   5
   b | 9   5   0

FIG. 2. General representation of a cost function γ as a cost matrix.
where for any optimization problem M, we use M* to denote its optimum value. The ordinary edit distance ED_{X,Y,γ} is the weight of an edit sequence with minimum weight. ED_{X,Y,γ} = S*_{X,Y,γ} when the cost function γ satisfies a pair of conditions. First, γ should not include negative weights, for otherwise ED_{X,Y,γ} is undefined even though the definition (2.1) yields some value for S*_{X,Y,γ}. In the presence of negative weights, some sequence of edit operations may be repeated, undoing the effect of earlier operations and decreasing the total weight. This situation is similar to the shortest path problem in arbitrary weighted graphs with negative cycles. Second, the triangle inequality must hold for γ: for all triples of edit operations s1 → s3, s1 → s2, s2 → s3 with s1, s2, s3 ∈ Σ ∪ {ε},

γ_{s1→s3} ≤ γ_{s1→s2} + γ_{s2→s3}.

To see the relevance of this, assume all edit operations have some large weight except for γ_{c→f} = γ_{f→h} = 0, and consider the strings cat and hat. The optimum ordinary edit distance can be achieved by performing two operations on the first index position of cat, via cat → fat → hat. But then the resulting ordinary edit distance is smaller than S*_{X,Y,γ}. The problem is that this optimal edit sequence does not correspond to any edit path in P_{X,Y}. An optimal edit sequence can be captured in the set of edit paths only if no multiple editing is allowed on any index position. Consider the problem of finding the minimum amortized weight of the paths in P_{X,Y}. We denote this problem by R_{X,Y,γ}. Thus
R*_{X,Y,γ} = min_{p ∈ P_{X,Y}} W(p)/L(p).   (2.2)

The normalized edit distance problem NED_{X,Y,γ} seeks the minimum amortized weight over all possible edit sequences. NED_{X,Y,γ} = R*_{X,Y,γ} when the two conditions discussed above for the cost function hold. We assume that the cost function is non-negative and satisfies the triangle inequality, which has been the common practice throughout the literature. Note that if m = n = 0 then both NED_{X,Y,γ} and R*_{X,Y,γ} are undefined. Figure 3 shows the resulting edit graph G_{X,Y,γ} for the strings of figure 1 and the cost function of figure 2. The paths p1 and p2 in figure 3 are optimal edit paths for the ordinary and normalized edit distances, respectively. Note that the post-normalized value for p1 is W(p1)/L(p1) = 15/3 = 5, whereas the optimum value of NED as defined in (2.2) is achieved by p2 with W(p2)/L(p2) = 16/4 = 4.
FIG. 3: (a) An optimal edit path p1 (for ED) with weight 15, ED_{X,Y,γ} = 15. (b) An optimal edit path p2 (for NED) with amortized weight 16/4 = 4, NED_{X,Y,γ} = 4.

Given an alphabet Σ, let ⟨e_1, e_2, …, e_d⟩ be an ordering of the edit operations, where d = (|Σ|+1)² − 1 and |Σ| denotes the size of Σ. For strings X and Y, define F = {(f_{e_1}(p), f_{e_2}(p), …, f_{e_d}(p)) | p ∈ P_{X,Y}}, where f_e(p) denotes the number of arcs in p corresponding to the edit operation e. W and L can be thought of as linear functions from F to the real numbers. For a given p ∈ P_{X,Y},

W(p) = Σ_{1≤i≤d} γ_{e_i} f_{e_i}(p),   L(p) = Σ_{1≤i≤d} f_{e_i}(p).   (2.3)

Consequently, NED_{X,Y,γ} can be viewed as the minimization problem whose objective function in (2.2) is the ratio of the two linear functions in (2.3).
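To make the definitions concrete, the shortest path problem (2.1) can be solved by the classical dynamic program over the edit graph. The following is a minimal Python sketch (ours, not part of the paper; the function name and the dictionary encoding of the cost matrix are illustrative choices):

```python
# Minimal sketch (not from the paper): the shortest path problem (2.1)
# solved by the standard dynamic program over the edit graph G_{X,Y,gamma}.
# The null string epsilon is encoded as "", and gamma as a dict keyed by
# (s1, s2) pairs for the edit operation s1 -> s2.

def edit_graph_shortest_path(X, Y, gamma):
    m, n = len(X), len(Y)
    INF = float("inf")
    # D[i][j] = weight of the cheapest path from (0, 0) to (i, j)
    D = [[INF] * (n + 1) for _ in range(m + 1)]
    D[0][0] = 0
    for i in range(m + 1):
        for j in range(n + 1):
            if i > 0:            # horizontal arc: delete x_i
                D[i][j] = min(D[i][j], D[i - 1][j] + gamma[(X[i - 1], "")])
            if j > 0:            # vertical arc: insert y_j
                D[i][j] = min(D[i][j], D[i][j - 1] + gamma[("", Y[j - 1])])
            if i > 0 and j > 0:  # diagonal arc: substitute y_j for x_i
                D[i][j] = min(D[i][j], D[i - 1][j - 1] + gamma[(X[i - 1], Y[j - 1])])
    return D[m][n]

# Cost matrix of figure 2: insertions 7, deletions 9, matches 0, mismatches 5.
gamma = {}
for s1 in "ab":
    gamma[(s1, "")] = 9
    gamma[("", s1)] = 7
    for s2 in "ab":
        gamma[(s1, s2)] = 0 if s1 == s2 else 5

print(edit_graph_shortest_path("aba", "bab", gamma))  # -> 15
```

With the cost matrix of figure 2, the sketch returns 15 for X = aba and Y = bab, agreeing with ED_{X,Y,γ} = 15 in figure 3; the normalized distance, by contrast, is 16/4 = 4 and cannot be obtained by normalizing this value.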
3 Optimizing the Ratio of Two Linear Functions

For z = (z_1, z_2, …, z_n) ∈ D, for some domain D ⊆ IR^n, let A be the problem of minimizing c_0 + c_1 z_1 + ⋯ + c_n z_n, and let B be the problem of minimizing (a_0 + a_1 z_1 + ⋯ + a_n z_n)/(b_0 + b_1 z_1 + ⋯ + b_n z_n), where the denominator is assumed to be always positive in D. For λ real, let A(λ) denote the parametric problem of minimizing (a_0 + a_1 z_1 + ⋯ + a_n z_n) − λ(b_0 + b_1 z_1 + ⋯ + b_n z_n) in D. This is problem A with c_i = a_i − λ b_i, i = 0, 1, …, n.

Problem A:    minimize_{z ∈ D}  c_0 + c_1 z_1 + ⋯ + c_n z_n
Problem B:    minimize_{z ∈ D}  (a_0 + a_1 z_1 + ⋯ + a_n z_n) / (b_0 + b_1 z_1 + ⋯ + b_n z_n)
Problem A(λ): minimize_{z ∈ D}  (a_0 + a_1 z_1 + ⋯ + a_n z_n) − λ (b_0 + b_1 z_1 + ⋯ + b_n z_n)
Dinkelbach's algorithm [3] can be used to solve B when a solution method for A is available. The algorithm uses the parametric method of an optimization technique known as fractional programming. In fact, this method is also applicable to optimization problems involving ratios of functions more general than linear. The thesis of the parametric method is that an optimal solution to B can be achieved via the solution of A(λ). In fact,
λ = B*   iff   A*(λ) = 0.
That is, λ is the optimum value of problem B iff the optimum value of the parametric problem A(λ) is zero. Dinkelbach's algorithm starts with an initial value for λ and repeatedly solves A(λ). At each instance of the parametric problem, an optimal solution z of A(λ) yields a ratio for B. This new ratio is either equal to λ, in which case it is optimum, or better (smaller) than λ. If it is equal to λ, then the algorithm terminates. Otherwise, the ratio is taken as the new value of λ and A(λ) is solved again. It can be shown that, when continued in this fashion, the algorithm finds an optimal solution to B in finitely many steps if D is a finite set. Furthermore, even if D is not finite, convergence to B* is guaranteed to be superlinear. Various properties of Dinkelbach's algorithm and fractional programming can be found in [2, 3, 6, 14]. Megiddo [9] introduced a general technique for developing an algorithm for B given an algorithm for A. The resulting algorithm for B is the algorithm for A with c_i = a_i − λ b_i, for i = 0, 1, …, n, where λ is treated as a variable, not a constant. That is, the algorithm is the same as that for A except that the coefficients are not simple constants but linear functions of the parameter λ. Instead of repeatedly solving A(λ) with improved values of λ, this alternative solution simulates the algorithm for A over these coefficients. The assumption is that the operations among coefficients in the algorithm for A are limited to comparisons and additions. Additions of linear functions are linear and can be computed immediately, but comparisons among linear functions need to be done with some care. The algorithm needs to keep track of the interval in which the optimum value B* lies. This is essential because comparisons in the algorithm for A now correspond to comparisons among linear functions, and the outcomes may vary depending on the interval under consideration for λ. The algorithm starts with the initial interval (−∞, +∞) for B*.
If the functions to be compared intersect, then their intersection point λ0 determines two subintervals of the current interval. To decide which of the two subintervals contains B*, the algorithm for A is called for help, and the problem A(λ0) is solved. The new interval and the result of the comparison are determined from the sign of the optimum value A*(λ0), as will be explained later. With Megiddo's technique, if A is solvable using O(p(n)) comparisons and O(q(n)) additions, then B can be solved in time O(p(n)(p(n) + q(n))). We refer the reader to Megiddo's paper [9] for the details of this approach. Megiddo also showed that for some problems the critical values of λ, the values which affect the outcomes of comparisons, can be precomputed. In such cases these values give the possible candidates for the endpoints of the smallest interval which eventually contains the optimum value B*. Whenever this can be done, binary search can be used to find B* as follows: if A*(λ) = 0, then λ = B*, and an optimal solution z of A(λ) is also an optimal solution of B. On the other hand, if A*(λ) > 0, then a larger λ, and if A*(λ) < 0, then a smaller λ should be tested (i.e. problem A(λ) should be solved with a new value of λ). This procedure continues until the correct value B* is found. Let λ0 be the smallest value in the set for which A*(λ0) is greater than or equal to zero. Then an optimal solution z of A(λ0) yields the optimum value B* of B. A smaller number of invocations of the algorithm for A may reduce the time complexity of solving B significantly, which is the case in problems such as minimum ratio cycles and minimum ratio spanning trees [9].
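For intuition, Dinkelbach's iteration from the beginning of this section can be sketched over a finite domain. This is an illustrative toy (ours, not from the paper); the tuple encoding of candidate solutions is an assumption made for the example:

```python
# Illustrative toy (not from the paper): Dinkelbach's parametric method for
# minimizing the ratio W(z)/L(z) over a finite domain D, given that the
# parametric problem A(lam) -- minimize W(z) - lam*L(z) -- is solvable.

def dinkelbach(D, W, L, lam0=0.0):
    lam = lam0
    while True:
        z = min(D, key=lambda z: W(z) - lam * L(z))  # solve A(lam)
        if abs(W(z) - lam * L(z)) < 1e-12:           # A*(lam) = 0  =>  lam = B*
            return lam, z
        lam = W(z) / L(z)                            # strictly smaller ratio

# Candidate solutions encoded as (weight, length) pairs standing in for paths.
paths = [(15, 3), (16, 4), (23, 5)]
ratio, best = dinkelbach(paths, W=lambda p: p[0], L=lambda p: p[1])
print(ratio, best)  # -> 4.0 (16, 4)
```

Starting from λ = 0, the iteration visits λ = 5, then λ = 4, at which point A*(4) = 0 and the minimum ratio 4 is certified.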
4 Algorithms for NED

We analyze the NED problem assuming two different types of commonly used cost functions: uniform weights and rational weights. In each case, problem A is the shortest path problem S_{X,Y,γ} with the objective function in (2.1), and problem B is the normalized edit distance problem NED_{X,Y,γ}, whose objective function is given in (2.2). We call the parametric problem A(λ) the parametric shortest path problem, whose objective is defined as follows: for any real λ,

S_{X,Y,γ}(λ) = min_{p ∈ P_{X,Y}} [ W(p) − λ L(p) ].   (4.1)

In the cases of cost functions we analyze, we consider two different formulations of the parametric shortest path problem in terms of the shortest path problem. We apply Megiddo's search technique either on a precomputed set of values, or on a range of values in which the smallest distance between any two distinct values is larger than some threshold we precompute.
Suppose the cost function γ^u is uniform, and let γ_I, γ_D, γ_M, and γ_N denote the weights of an insertion, a deletion, a matching substitution, and a non-matching substitution, respectively. Let h(p), v(p), d_M(p), and d_N(p) denote the number of horizontal, vertical, matching diagonal, and non-matching diagonal arcs of an edit path p. Then

W^u(p) = γ_D h(p) + γ_I v(p) + γ_M d_M(p) + γ_N d_N(p),
L(p) = h(p) + v(p) + d_M(p) + d_N(p).   (4.2)

Furthermore, from the structure of the graph G_{X,Y,γ} we have the identities

m = h(p) + d_M(p) + d_N(p),
n = v(p) + d_M(p) + d_N(p).

Therefore we can rewrite W(p) and L(p) as linear functions of the two variables d_M(p) and d_N(p) only, by using the expressions h(p) = m − d_M(p) − d_N(p) and v(p) = n − d_M(p) − d_N(p) in (4.2). In this case, W^u and L are linear functions from the set {(d_M(p), d_N(p)) | p ∈ P_{X,Y}} to the real numbers, and the numerator and denominator in (2.2) are given by the linear functions

W^u(p) = m γ_D + n γ_I + (γ_M − γ_I − γ_D) d_M(p) + (γ_N − γ_I − γ_D) d_N(p),
L(p) = m + n − d_M(p) − d_N(p).   (4.3)
As we show next, the set of possible values of λ can be precomputed efficiently.

PROPOSITION 4.1
For m and n not both equal to zero, let

Q = { q(r, s) | r, s ≥ 0 and r + s ≤ min{m, n} }, where
q(r, s) = [ m γ_D + n γ_I + (γ_M − γ_I − γ_D) r + (γ_N − γ_I − γ_D) s ] / ( m + n − r − s ).   (4.4)

Then
1. |Q| = O(n²),
2. for any two strings X and Y over Σ of lengths m and n, {W^u(p)/L(p) | p ∈ P_{X,Y}} ⊆ Q,
3. for all λ ∈ Q, λ ≥ 0.

That is, the possible amortized weights for the paths in P_{X,Y} are all included in the set Q, whose cardinality is O(n²) (assuming m ≥ n), and whose elements are non-negative.
PROOF. Since r + s ≤ min{m, n} = n, by definition q(r, s) takes on O(n²) distinct values, proving the claim about the size of Q. To see that Q includes all amortized weights, note that the expression used to generate the elements of Q in (4.4) uses the same expressions as those for W^u and L in (4.3), except that the variables r and s replace d_M(p) and d_N(p) respectively, and for any p ∈ P_{X,Y} it is necessary that d_M(p) + d_N(p) ≤ min{m, n}. Conversely, for non-negative integers r, s with r + s ≤ min{m, n}, it is easy to see that there are strings X = x_1 x_2 ⋯ x_m, Y = y_1 y_2 ⋯ y_n and an edit path p in P_{X,Y} such that d_M(p) = r, d_N(p) = s, and q(r, s) is the amortized weight of the path p, provided that not both m and n are zero. Therefore the set Q in the proposition is actually the union of the ratios {W^u(p)/L(p) | p ∈ P_{X,Y}}, where X and Y vary over all strings of lengths m and n over Σ. Since amortized weights are non-negative for the paths in P_{X,Y}, Q does not include a negative number.

PROPOSITION 4.2
For λ ∈ Q, the optimum value S_{X,Y,γ^u}(λ) of the parametric shortest path problem can be formulated in terms of the optimum value S*_{X,Y,γ^{u′}} of a shortest path problem.

PROOF. For any λ, using (4.3) we have

S_{X,Y,γ^u}(λ) = min_{p ∈ P_{X,Y}} [ W^u(p) − λ L(p) ]
 = min_{p ∈ P_{X,Y}} [ m γ_D + n γ_I + (γ_M − γ_I − γ_D) d_M(p) + (γ_N − γ_I − γ_D) d_N(p) − λ ( m + n − d_M(p) − d_N(p) ) ]
 = ( min_{p ∈ P_{X,Y}} [ m γ_D + n γ_I + (γ_M + λ − γ_I − γ_D) d_M(p) + (γ_N + λ − γ_I − γ_D) d_N(p) ] ) − λ(m + n)
 = S*_{X,Y,γ^{u′}} − λ(m + n),

where

γ^{u′} = ( γ_I, γ_D, γ_M + λ, γ_N + λ ).   (4.5)

Thus computing S_{X,Y,γ^u}(λ) involves solving the shortest path problem S*_{X,Y,γ^{u′}} and performing some simple arithmetic afterwards. Note that all λ ∈ Q are non-negative; therefore γ^{u′} and G_{X,Y,γ^{u′}} have no negative weights.
Based on this result, we propose the following algorithm UniformNED to solve the NED problem with uniform weights. The algorithm first generates the set of numbers Q, which includes all possible amortized weights, i.e. potential optimal values of NED, as described in proposition 4.1. Next, the optimum value NED_{X,Y,γ^u} is sought in this set by simulating a binary search. At each iteration, the median of the current set is found, and with this value a parametric shortest path problem instance is created. This parametric problem is solved by using the formulation in proposition 4.2. If the optimum value of the parametric problem is zero, then the median is the optimum value of NED and the algorithm terminates, returning the median. If it is negative, then the search needs to be directed to smaller values; otherwise, if it is positive, the search space is reduced to the larger values. In both cases, the non-optimal half is removed from the set. This process may remove all the elements but one from Q; the remaining element is returned as the optimum value of NED. The main steps of the algorithm are shown in figure 4. The algorithm is clearly
ALGORITHM UniformNED
Step 1 : If m = n = 0 then return(−1), signalling an undefined result.
Step 2 : Return trivial answers: if m = 0 then return(γ_I); if n = 0 then return(γ_D).
Step 3 : Generate the set Q (of proposition 4.1).
Step 4 : While |Q| > 1 do
Step 5 :     Find the median λ_med of Q.
Step 6 :     v ← S_{X,Y,γ^u}(λ_med) (use proposition 4.2).
Step 7 :     If v = 0 then return(λ_med)
Step 8 :     else if v < 0 then remove from Q λ_med and the elements larger than λ_med
Step 9 :     else (if v > 0) remove from Q λ_med and the elements smaller than λ_med.
Step 10: End (while)
Step 11: Return the element in Q.
Step 12: End.

FIG. 4. Algorithm UniformNED.
correct when m or n is zero. Otherwise {W^u(p)/L(p) | p ∈ P_{X,Y}} ≠ ∅, and the correctness in this case follows from the facts that there exists λ ∈ Q such that S_{X,Y,γ^u}(λ) = 0 (take λ = NED_{X,Y,γ^u}), and that if S_{X,Y,γ^u}(λ) = 0 then λ ∈ {W^u(p)/L(p) | p ∈ P_{X,Y}}. Note that any shortest path algorithm A for G_{X,Y,γ^{u′}} with non-negative weights can be used to compute S*_{X,Y,γ^{u′}} as part of the computation of S_{X,Y,γ^u}(λ) in step 6 of algorithm UniformNED, provided that γ^{u′}, as defined in (4.5), does not cause a conflict with algorithm A's assumption on the edge weights of G_{X,Y,γ^{u′}}.

THEOREM 4.3
If algorithm A computes S*_{X,Y,γ^{u′}} (without the assumption of the triangle inequality for γ^{u′}) with time complexity C(m, n), then NED_{X,Y,γ^u} can be computed in time O(n²) + C(m, n) O(log n).

PROOF. We show that this complexity result can be achieved by using algorithm UniformNED. Step 3 of the algorithm takes O(n²) time. The while loop iterates O(log n) times. If the linear-time median finding algorithm [1] is used in step 5, then the total time (from start to completion of the loop) spent in steps 5, 8, and 9 to find the
medians and remove the elements from Q is O(n²), since the size of Q is halved after each iteration. Solving a parametric problem in step 6, which by proposition 4.2 involves a shortest path computation and simple arithmetic, takes C(m, n) time using algorithm A. The remaining steps take constant time. Thus the resulting time complexity is as expressed in the theorem.

Wagner and Fischer's ordinary edit distance algorithm [17], whose time complexity is O(mn) and space complexity is O(n), actually computes S*_{X,Y,γ^{u′}}. It can be used as algorithm A to solve the parametric shortest path problems in algorithm UniformNED. Since C(m, n) = O(mn) in this case, and the set Q requires O(n²) space, we have

COROLLARY 4.4
Algorithm UniformNED computes the normalized edit distance NED_{X,Y,γ^u} using O(mn log n) operations and O(n²) space.
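The steps above can be sketched compactly as follows (our simplified illustration, not the paper's implementation: binary search over the sorted candidate set replaces median finding plus removal, and a small tolerance stands in for the exact test v = 0):

```python
# Sketch of algorithm UniformNED (simplified illustration).

def uniform_shortest_path(X, Y, gI, gD, gM, gN):
    """Ordinary edit distance DP with uniform weights for insertion,
    deletion, matching substitution, and non-matching substitution."""
    m, n = len(X), len(Y)
    D = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = D[i - 1][0] + gD
    for j in range(1, n + 1):
        D[0][j] = D[0][j - 1] + gI
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = gM if X[i - 1] == Y[j - 1] else gN
            D[i][j] = min(D[i - 1][j] + gD, D[i][j - 1] + gI, D[i - 1][j - 1] + sub)
    return D[m][n]

def uniform_ned(X, Y, gI, gD, gM, gN):
    m, n = len(X), len(Y)
    if m == 0:
        return gI
    if n == 0:
        return gD
    # Step 3: the candidate amortized weights Q of proposition 4.1.
    Q = sorted({(m * gD + n * gI + (gM - gI - gD) * r + (gN - gI - gD) * s)
                / (m + n - r - s)
                for r in range(min(m, n) + 1)
                for s in range(min(m, n) + 1 - r)})
    # Steps 4-11: binary search guided by the sign of S(lam), where by
    # proposition 4.2 S(lam) = S*_{gamma_u'} - lam*(m+n) with
    # gamma_u' = (gI, gD, gM + lam, gN + lam).
    lo, hi = 0, len(Q) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        lam = Q[mid]
        v = uniform_shortest_path(X, Y, gI, gD, gM + lam, gN + lam) - lam * (m + n)
        if abs(v) < 1e-9:
            return lam
        if v < 0:       # lam too large: discard lam and everything above it
            hi = mid - 1
        else:           # lam too small: discard lam and everything below it
            lo = mid + 1
    return Q[lo]
```

On the example of figure 3 (γ_I = 7, γ_D = 9, γ_M = 0, γ_N = 5, X = aba, Y = bab), the sketch returns 4, the value of NED_{X,Y,γ} computed there.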
Next suppose that the weights in γ are integral. Consider the set of amortized weights

Q = { W(p)/L(p) | p ∈ P_{X,Y} },

and let q1 = W(p1)/L(p1) and q2 = W(p2)/L(p2) be any two distinct elements of Q.

PROPOSITION 4.5
If all the weights in γ are integral, then |q1 − q2| ≥ 1/(m + n)².

PROOF. Since

|q1 − q2| = |W(p1)L(p2) − W(p2)L(p1)| / ( L(p1)L(p2) ),

the claim follows from the facts that |W(p1)L(p2) − W(p2)L(p1)| ≥ 1 (the distance between any two distinct integers), and L(p1)L(p2) ≤ (m + n)², since L(p1), L(p2) ≤ m + n.

PROPOSITION 4.6
For λ real and a cost function γ with integral weights, the optimum value S_{X,Y,γ}(λ) of the parametric shortest path problem can be formulated in terms of the optimum value S*_{X,Y,γ′} of a shortest path problem.
PROOF. For any λ we have

S_{X,Y,γ}(λ) = min_{p ∈ P_{X,Y}} [ Σ_{1≤i≤d} γ_{e_i} f_{e_i}(p) − λ Σ_{1≤i≤d} f_{e_i}(p) ]
 = min_{p ∈ P_{X,Y}} [ Σ_{1≤i≤d} (γ_{e_i} − λ) f_{e_i}(p) ]
 = S*_{X,Y,γ′},

where γ′_{e_i} = γ_{e_i} − λ for all i, 1 ≤ i ≤ d.

From proposition 4.6, S_{X,Y,γ}(λ) is the optimum value of the shortest path problem in G_{X,Y,γ′}. This definition of the parametric problem was used in the fractional programming normalized edit distance algorithm [16]. We want to emphasize that G_{X,Y,γ′} may not lead to a valid instance of the ordinary edit distance problem, because γ′ may include a negative weight. We propose the following algorithm IntegerNED for the NED problem with integer weights. The algorithm first computes δ, the smallest possible gap between any two distinct values for NED_{X,Y,γ}, using the expression in proposition 4.5. The algorithm maintains an interval [e, f] such that the optimum value of NED_{X,Y,γ} lies in [eδ, fδ], where e and f are appropriate integral values. Initially e is set to zero, and f is set to γ_max/δ, where γ_max is the maximum weight in γ. The algorithm iteratively solves a parametric shortest path problem with parameter kδ, where k is the median of the integers in [e, f]. At each iteration the interval is updated according to the sign of the optimum value of the parametric problem, as explained in Megiddo's search technique. The effective search space is the set of integers in [e, f]; each iteration reduces this space by half. The iterations end whenever the optimum value of the parametric shortest path problem is zero, upon which the algorithm terminates by returning the parameter kδ as the optimum value of NED_{X,Y,γ}, or whenever there remain no integers between e and f. In the latter case, the algorithm solves a parametric shortest path problem with parameter fδ. An optimal path for this parametric problem yields the optimum amortized weight (the optimum value of NED_{X,Y,γ}), with which the algorithm terminates. The main steps of the algorithm are shown in figure 5.

The invariant for the while loop in step 4 is that e ≤ f, and NED_{X,Y,γ} is in [eδ, fδ]. We can prove that it holds by induction on the number of iterations. At the beginning (iteration zero) the invariant is true, since e and f are initialized to zero and γ_max/δ respectively, and NED_{X,Y,γ} is in [0, γ_max]. The proof of the inductive step follows from the discussion of Megiddo's search technique. The algorithm returns the parameter value if during one of the iterations the optimum value of the parametric shortest path problem is zero, in which case the algorithm is correct. Otherwise, the while loop terminates with the following conditions being true: e ≤ f and e + 1 ≥ f (i.e. e = f or e + 1 = f), and NED_{X,Y,γ} is in [eδ, fδ]. Since the minimum distance
ALGORITHM IntegerNED
Step 1 : If m = n = 0 then return(−1), signalling an undefined result.
Step 2 : δ ← 1/(m + n)²
Step 3 : [e, f] ← [0, γ_max (m + n)²]
Step 4 : While (e + 1 < f) do
Step 5 :     k ← ⌊(e + f)/2⌋
Step 6 :     v ← S_{X,Y,γ}(kδ) (use proposition 4.6).
Step 7 :     If v = 0 then return(kδ)
Step 8 :     else if v < 0 then f ← k
Step 9 :     else e ← k
Step 10: End while
Step 11: Solve S_{X,Y,γ}(fδ) to find the optimum value S_{X,Y,γ}(fδ) and the length L(p) of the corresponding optimal path p.
Step 12: W(p) ← S_{X,Y,γ}(fδ) + fδ L(p)
Step 13: Return W(p)/L(p)
Step 14: End.

FIG. 5. Algorithm IntegerNED.
between any two possible distinct values for NED_{X,Y,γ} is δ, there are two cases to consider:

(i) NED_{X,Y,γ} = eδ or NED_{X,Y,γ} = fδ,
(ii) NED_{X,Y,γ} is in (eδ, fδ) (in which case there is only one possible value for NED_{X,Y,γ} in (eδ, fδ)).

In both cases, an optimal path p for the parametric shortest path problem with parameter fδ yields the optimum value of NED_{X,Y,γ}, because each new solution to a parametric problem yields a ratio no larger than the parameter value, as pointed out in the description of Dinkelbach's algorithm. Note that in step 12 of the algorithm, after determining the optimum value of the parametric shortest path problem and the length of the corresponding optimal path, we use the expression in definition (4.1) to determine the weight of the optimal path.

THEOREM 4.7
If algorithm A computes S*_{X,Y,γ′} and the length of the corresponding optimal path (without the assumption of non-negativity for γ′) with time complexity C(m, n) and space complexity U(m, n), then, provided that all the costs in γ are integral, NED_{X,Y,γ} can be computed in time O(C(m, n) log m) using U(m, n) space.

PROOF. We show that this complexity result can be achieved by using algorithm IntegerNED. The while loop iterates O(log(γ_max (m + n)²)) times, because the search space on which binary search is performed is included in the set of integers in the range [0, γ_max (m + n)²]. Solving the parametric problem in step 6 takes C(m, n) time using algorithm A, since by proposition 4.6 it involves a shortest path computation and some simple arithmetic. Given that algorithm A computes the
length of an optimal path along with the optimum value of the parametric problem, step 11 can be performed within the time and space complexities of algorithm A. The remaining steps take constant time. Therefore the resulting time complexity is O(C(m, n) log(γ_max (m + n)²)). The space complexity of the algorithm is the same as that of algorithm A, since the space required by the steps other than those which involve the parametric shortest path computations is only constant.

The ordinary edit distance algorithm of Wagner and Fischer [17] uses a very simple dynamic programming formulation which can easily be modified to keep track of the length of an optimal path and return this length in addition to the optimum value of the shortest path problem. This modification does not increase the time and space complexities of the algorithm. The modified algorithm is suitable as algorithm A in IntegerNED. Since C(m, n) = O(mn) and U(m, n) = O(n) in this case, we have

COROLLARY 4.8
Algorithm IntegerNED computes the normalized edit distance NED_{X,Y,γ} using O(mn log m) operations and O(n) space, provided that all the costs in γ are integral.
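A sketch of IntegerNED along the same lines (ours; simplified, with floating-point δ in place of exact rational arithmetic, the cost function encoded as a dictionary keyed by operation pairs, and a tolerance standing in for the exact test v = 0):

```python
# Sketch of algorithm IntegerNED (simplified illustration; integral costs).
# The parametric shortest path uses the shifted costs gamma' = gamma - lam
# of proposition 4.6; negative weights are harmless in the acyclic edit
# graph. The null string is encoded as "".

def parametric_sp(X, Y, gamma, lam):
    """Return (S(lam), length of a corresponding optimal path)."""
    m, n = len(X), len(Y)
    INF = float("inf")
    D = [[(INF, 0)] * (n + 1) for _ in range(m + 1)]
    D[0][0] = (0.0, 0)
    for i in range(m + 1):
        for j in range(n + 1):
            if i == 0 and j == 0:
                continue
            cands = []
            if i > 0:            # deletion arc
                w, l = D[i - 1][j]
                cands.append((w + gamma[(X[i - 1], "")] - lam, l + 1))
            if j > 0:            # insertion arc
                w, l = D[i][j - 1]
                cands.append((w + gamma[("", Y[j - 1])] - lam, l + 1))
            if i > 0 and j > 0:  # substitution arc
                w, l = D[i - 1][j - 1]
                cands.append((w + gamma[(X[i - 1], Y[j - 1])] - lam, l + 1))
            D[i][j] = min(cands)
    return D[m][n]

def integer_ned(X, Y, gamma):
    m, n = len(X), len(Y)
    delta = 1.0 / (m + n) ** 2   # gap bound of proposition 4.5
    e, f = 0, max(gamma.values()) * (m + n) ** 2
    while e + 1 < f:
        k = (e + f) // 2
        v, _ = parametric_sp(X, Y, gamma, k * delta)
        if abs(v) < 1e-12:
            return k * delta
        if v < 0:
            f = k
        else:
            e = k
    # Steps 11-13: recover W(p)/L(p) from S(f*delta) via definition (4.1).
    v, length = parametric_sp(X, Y, gamma, f * delta)
    return (v + f * delta * length) / length
```

On the strings and integral cost matrix of figures 1-3, the sketch returns 4, matching NED_{X,Y,γ} = 4 obtained there.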
If the weights of edit operations in γ are integral multiples of a fixed real number r, the same time complexity can be achieved as follows: Let γ_int be an integral cost function such that γ = r γ_int. Then the following relation holds by the definition in (2.2):

    NED_{X,Y,γ} = r NED_{X,Y,γ_int}    (4.6)
Therefore:

COROLLARY 4.9
The normalized edit distance NED_{X,Y,γ} can be computed in O(mn log m) time and O(n) space, provided that all the costs in γ are rational.
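The reduction behind (4.6) amounts to clearing denominators. A small sketch, assuming costs are given as Python Fractions (the dictionary keys are illustrative, not the paper's notation):

```python
from fractions import Fraction
from math import lcm

def to_integer_costs(cost):
    """Scale rational costs to integers: cost = r * int_cost, where r is
    the reciprocal of the lcm L of the denominators.  By (4.6) the
    normalized distance under cost is r times that under int_cost."""
    L = lcm(*(c.denominator for c in cost.values()))
    r = Fraction(1, L)
    int_cost = {op: int(c * L) for op, c in cost.items()}
    return r, int_cost
```

For example, the costs (1/2, 3/4, 1, 0) scale by L = 4 to the integral costs (2, 3, 4, 0) with r = 1/4.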
5 Remarks
In some cases, an optimal edit path may be desired in addition to the normalized edit distance. This can be accomplished by solving one additional parametric shortest path problem. An optimal edit path for the parametric problem S_{X,Y,γ}(NED_{X,Y,γ}) has amortized weight NED_{X,Y,γ} for any cost function γ. Therefore, following the computation of NED_{X,Y,γ}, it suffices to solve S_{X,Y,γ}(NED_{X,Y,γ}) for an optimal edit path. This additional computation can be done in O(mn) space.

In theorems 4.3 and 4.7, we expressed the time complexities of the algorithms UniformNED and IntegerNED in terms of the time complexity of a shortest path algorithm A which is used to solve the parametric shortest path problems. To improve these time complexities one may want to use faster shortest path algorithms.
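The path-recovery step described above can be sketched by rerunning the parametric dynamic program with back-pointers at λ = NED_{X,Y,γ}. A hypothetical Python illustration (operation names and cost-dictionary keys are assumptions, not the paper's notation):

```python
from fractions import Fraction

def optimal_edit_path(x, y, cost, lam):
    """One extra parametric shortest-path computation with back-pointers;
    invoked with lam equal to the normalized edit distance, it returns an
    edit path whose amortized weight equals lam."""
    m, n = len(x), len(y)
    D = [[None] * (n + 1) for _ in range(m + 1)]
    B = [[None] * (n + 1) for _ in range(m + 1)]   # back-pointers
    D[0][0] = Fraction(0)
    for i in range(m + 1):
        for j in range(n + 1):
            if i == 0 and j == 0:
                continue
            cands = []
            if i > 0:
                cands.append((D[i - 1][j] + cost['del'] - lam, 'del'))
            if j > 0:
                cands.append((D[i][j - 1] + cost['ins'] - lam, 'ins'))
            if i > 0 and j > 0:
                op = 'match' if x[i - 1] == y[j - 1] else 'sub'
                cands.append((D[i - 1][j - 1] + cost[op] - lam, op))
            D[i][j], B[i][j] = min(cands)
    path = []                                      # trace back from (m, n)
    i, j = m, n
    while (i, j) != (0, 0):
        op = B[i][j]
        path.append(op)
        if op == 'del':
            i -= 1
        elif op == 'ins':
            j -= 1
        else:                                      # match or substitute
            i, j = i - 1, j - 1
    path.reverse()
    return path
```

With unit costs and zero-cost matches, calling this on "ab" and "b" at λ = 1/2 returns the path delete-then-match, whose amortized weight is indeed 1/2.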
For edit graphs, ordinary edit distance algorithms perform better than general shortest path algorithms because of the special structure of the graph. But not every fast ordinary edit distance algorithm is a suitable candidate as algorithm A for the computation of shortest paths in our algorithms. In general, ordinary edit distance algorithms assume that the weights are non-negative and/or that the triangle inequality holds for the weights. Proposition 4.6 gives a general formulation of the parametric shortest path problem in terms of the shortest path problem. The formulation preserves the triangle inequality, but it may introduce negative weights into the cost function. To the best of our knowledge, the algorithm in [17] is the fastest algorithm A in the presence of negative weights. In the case of a uniform cost function, we gave a different formulation for the parametric problem in proposition 4.2. Under this formulation the triangle inequality may be destroyed, but non-negativity of the weights is preserved. Even though there are more candidate algorithms that can be used as algorithm A in this case, the algorithm in [17] turns out to be the only feasible one. Next we discuss the reasons for this.

Ordinary edit distance algorithms which assume a fixed cost function (e.g. γ_u = (1, 1, 0, 1)) and achieve fast running times by exploiting this constant nature of the weights [15, 10] are evidently not suitable as algorithm A in UniformNED, because each execution of step 6 of the algorithm uses a different cost function γ_u'. Masek and Paterson gave a C(m,n) = O(mn/log n)-time algorithm [8] which can be used as algorithm A here, but then it is no longer true that the running time of UniformNED is as indicated by theorem 4.3. The assumption in the algorithm of Masek and Paterson is that the weights are integral multiples of a positive real constant r.
We can easily show that if this assumption holds for γ_u, then it also holds for the γ_u' that arises in UniformNED: Suppose that δ_I, δ_D, δ_M, and δ_N are all integral multiples of a positive real number r, and let λ be in Q. If λ = 0, then the assertion is true for the weights in γ_u', since γ_u' = γ_u in this case. If λ > 0, then there exist positive integers a and b such that λ = ar/b, because of the way we generate the set Q in proposition 4.1. Therefore, in this case, δ_I, δ_D, δ_M + λ, and δ_N + λ are integral multiples of the positive real number r/b. There appears to be no other obvious choice of the new constant for which the assumption holds for γ_u'. But the algorithm of Masek and Paterson takes time proportional to some function of 1/r, which is normally absorbed in the big O since it is independent of m and n. The discrepancy among the magnitudes of the constants for different parametric problems of UniformNED can be on the order of m, since b has to be, in some cases, as large as the maximum path length m + n. Therefore, even though it may be possible to achieve O(mn/log n) with γ_u, solving some parametric problems using this A would take significantly (by about a factor of m) more time. Ukkonen's O(dn)-time output-size sensitive algorithm [15], where d is the actual edit distance, does not seem applicable because it assumes that the weights are non-negative and fulfill the triangle inequality. Even though there are alternate formulations of a parametric shortest path problem S_{X,Y,γ_u}(λ) in terms of a shortest path problem in G_{X,Y,γ_u'} with different cost functions γ_u', the triangle inequality cannot always be maintained while simultaneously keeping the weights non-negative. For example,
setting γ_u' = (δ_I − λ, δ_D − λ, δ_M − λ, δ_N − λ) maintains the triangle inequality at the cost of allowing negative weights, while setting γ_u' = (δ_I, δ_D, δ_M + λ, δ_N + λ) guarantees non-negativity of the weights but possibly destroys the triangle inequality.
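A toy numerical check of this trade-off, with a hypothetical uniform cost function and an exaggerated λ chosen only to make both defects visible:

```python
def reweightings(u, lam):
    """The two parametric reformulations discussed above, for a uniform
    cost function u = (dI, dD, dM, dN)."""
    dI, dD, dM, dN = u
    down = (dI - lam, dD - lam, dM - lam, dN - lam)  # triangle inequality kept
    up = (dI, dD, dM + lam, dN + lam)                # non-negativity kept
    return down, up

# With u = (1, 1, 0, 1) and lam = 2.5 (purely illustrative values):
down, up = reweightings((1, 1, 0, 1), 2.5)
# 'down' has a negative weight: its zero-cost operation becomes 0 - 2.5 = -2.5.
# 'up' breaks the triangle inequality: that operation now costs 2.5, which
# exceeds the cost 1 + 1 = 2 of a deletion followed by an insertion.
```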
6 Conclusion
In the absence of theoretical time complexity results for the application of fractional programming to normalized edit distance computation, we have improved the time complexity of normalized edit distance computation from O(mn^2) to O(mn log n) when the weights of the edit operations are uniform, and to O(mn log m) when the weights are rational. The worst-case and expected time complexities of the fractional programming formulation of NED need to be investigated for a better assessment of the quality of the algorithms presented in this paper, keeping in mind that our time complexity bounds are worst-case. Real applications may favor the fractional programming based NED algorithm because of its easy implementation, observed experimental performance, and the generality of the cost functions it can handle. It seems that while our algorithms have improved worst-case time complexities, the fractional programming normalized edit distance algorithm may actually be provably faster on average.
References
[1] M. Blum, R. W. Floyd, V. R. Pratt, R. L. Rivest, and R. E. Tarjan. Time bounds for selection. Journal of Computer and System Sciences, 7(4):448–461, August 1973.
[2] B. D. Craven. Fractional Programming. Helderman Verlag, Berlin, 1988.
[3] W. Dinkelbach. On nonlinear fractional programming. Management Science, 13(7):492–498, March 1967.
[4] Ö. Eğecioğlu and M. Ibel. Parallel algorithms for fast computation of normalized edit distances. Proceedings of the Eighth IEEE Symposium on Parallel and Distributed Processing (SPDP'96), pages 496–503, October 1996.
[5] Z. Galil and R. Giancarlo. Data structures and algorithms for approximate string matching. Journal of Complexity, 4(1):33–72, March 1988.
[6] T. Ibaraki. Parametric approaches to fractional programs. Mathematical Programming, 26(3):345–362, August 1983.
[7] A. Marzal and E. Vidal. Computation of normalized edit distances and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(9):926–932, September 1993.
[8] W. J. Masek and M. S. Paterson. A faster algorithm computing string edit distances. Journal of Computer and System Sciences, 20(1):18–31, February 1980.
[9] N. Megiddo. Combinatorial optimization with rational objective functions. Mathematics of Operations Research, 4(4):414–424, November 1979.
[10] E. W. Myers. An O(ND) difference algorithm and its variations. Algorithmica, 1(2):251–266, 1986.
[11] B. J. Oommen and K. Zhang. The normalized string editing problem revisited. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(6):669–672, June 1996.
[12] S. V. Rice, H. Bunke, and T. A. Nartker. Classes of cost functions for string edit distance. Algorithmica, 18(2):271–280, 1997.
[13] D. Sankoff and J. B. Kruskal. Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison. Addison-Wesley, Reading, MA, 1983.
[14] M. Sniedovich. Dynamic Programming. Marcel Dekker, New York, 1992.
[15] E. Ukkonen. Algorithms for approximate string matching. Information and Control, 64:100–118, 1985.
[16] E. Vidal, A. Marzal, and P. Aibar. Fast computation of normalized edit distances. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(9):899–902, September 1995.
[17] R. A. Wagner and M. J. Fischer. The string-to-string correction problem. Journal of the Association for Computing Machinery, 21(1):168–173, January 1974.