
Optimal multiway search trees for variable size keys

1984, Acta Informatica

This paper considers the construction of optimal search trees for a sequence of $n$ keys of varying sizes, under various cost measures. Constructing optimal search cost multiway trees is NP-hard, although it can be done in pseudo-polynomial time $O(n^3 L)$ and space $O(n^2 L)$, where $L$ is the page size limit. An optimal space multiway search tree is obtained in $O(n^3)$ time and $O(n^2)$ space, while an optimal height tree is obtained in $O(n^2 \log^2 n)$ time and $O(n)$ space, both having additionally minimal root sizes. The monotonicity principle does not hold for the above cases. Finding optimal search cost weak B-trees is NP-hard, but a weak B-tree of height 2 and minimal root size can be constructed in $O(n \log n)$ time. In addition, if its root is restricted to contain $M$ keys then a different algorithm is applied, having time complexity $O(nM \log n)$. The latter solves a problem posed by McCreight.

Acta Informatica 21, 47-60 (1984)
© Springer-Verlag 1984

Jayme Luiz Szwarcfiter

Universidade Federal do Rio de Janeiro, Núcleo de Computação Eletrônica, Caixa Postal 2324, 20.001 - Rio de Janeiro - RJ, Brasil

1. Introduction

Multiway trees for variable size keys were discussed by Knuth [10] and McCreight [11]. The present paper considers the construction of some optimal trees of this kind. We show that the problem of constructing an optimal cost (i.e. search cost) multiway search tree is NP-hard. However, it can be solved in pseudo-polynomial time $O(n^3 L)$ and space $O(n^2 L)$, where $L$ is the given page size limit. Optimal space multiway search trees are constructed in $O(n^3)$ time and $O(n^2)$ space, while the optimal height problem is solved in time $O(n^2 \log^2 n)$ and space $O(n)$. These two algorithms are described by a common formulation and find optimal trees with minimal root sizes. Next it is shown that the problem of constructing an optimal cost weak B-tree is NP-hard, although it also admits a pseudo-polynomial time solution.
Weak B-trees are similar to B-trees, except that the lower and upper page limits can be independent. We then describe an $O(n \log n)$ time algorithm for finding a weak B-tree of height 2 and minimal root size. If additionally the root is restricted to be formed by a given number $M$ of keys then a different process is applied, requiring $O(nM \log n)$ time and $O(n)$ space. This algorithm solves a problem posed by McCreight in [11], where polynomial time algorithms were described only for the cases $M = 2$ and $3$.

As for fixed size keys, an algorithm to construct an optimal cost binary search tree in $O(n^3)$ time was described by Gilbert and Moore [3]. Knuth [9] presented a similar algorithm but additionally introduced gap weights and proved a monotonicity principle that decreased the time complexity to $O(n^2)$. Garey [2] included a height restriction in the trees. The space bound for these algorithms is $O(n^2)$. Vaishnavi, Kriegel and Wood [12] and Gotlieb [4] described optimal cost multiway search tree algorithms, both having $O(n^3 t)$ time and $O(n^2 t)$ space complexities, where $t$ is the maximum number of keys per node. Gotlieb [4] and Gotlieb and Wood [5] showed that the monotonicity principle does not extend to multiway search trees. However, it does apply when the gap weights are absent [4], which reduces the running time in this case to $O(n^2 t)$. Finally, Itai [7] described a general technique for reducing the factor $t$ to $\log t$ in the above time complexities.

Basically, the monotonicity principle is the fact that the rightmost key in the root of an optimal tree does not need to move left (right) when a new key is added at the right (left) of the key sequence. This restricts the number of candidates to consider for the root of the new optimal tree. Examples show that this property does not hold for the present multiway search tree problems. Unlike the fixed size key case [4], it fails also for the search cost criterion with no gap weights.

2. Preliminaries

Let $E = (e_1, \ldots, e_n)$ be a sequence of elements called keys, each $e_i$ having a positive integer size $s_i$ and an arbitrary finite value $y_i$, satisfying $y_i < y_{i+1}$, $1 \le i < n$. A multiway search tree for $E$ is an ordered rooted tree $T$ such that each key of $E$ is assigned to exactly one node of $T$, while each node $x$ keeps a subset $E(x)$ of keys satisfying: (i) $E(x) = \emptyset \Leftrightarrow x$ is a leaf, (ii) each non-leaf $x$ has exactly $|E(x)| + 1$ sons, and (iii) if $y$ is the $k$-th son of $x$ in the ordering of $T$ and $e_j$ is an arbitrary key of $E(y)$ then exactly $k - 1$ among the keys $e_i \in E(x)$ satisfy $y_i < y_j$. The $n + 1$ leaves of $T$ are called gaps and denoted $g_0, \ldots, g_n$, respectively.

The size of a node $x$ is the sum of the sizes of the keys of $E(x)$. Letting $L$ be an integer, $T$ has page limit (or limit) $L$ whenever $\mathrm{size}(x) \le L$ for any node $x$ of $T$. The level of $x$ is the number of nodes in the path between $x$ and the root of $T$. The height of $T$ is the maximum level among the non-leaf nodes and its space is the total number of non-leaf nodes.

Suppose there are associated to $E$ non-negative real key weights $p_1, \ldots, p_n$ and gap weights $q_0, \ldots, q_n$. Let $W = \sum_{1 \le i \le n} p_i + \sum_{0 \le j \le n} q_j$ and define $y_0 = -\infty$ and $y_{n+1} = +\infty$. Then $p_i/W$ and $q_j/W$ are the probabilities that the search argument has value $y_i$ and a value strictly between $y_j$ and $y_{j+1}$, respectively. The search cost (or cost) of $T$ is the sum

$$\sum_{1 \le i \le n} p_i \,\mathrm{level}(x(e_i)) + \sum_{0 \le j \le n} q_j \,(\mathrm{level}(g_j) - 1),$$

where $x(e_i)$ is the node of $T$ containing key $e_i$.

Let $L_1, L_2$ be integers, $0 < L_1 < L_2$. A weak B-tree [6] of limits $(L_1, L_2)$ is a multiway search tree $T$ of limit $L_2$ such that: (i) $\mathrm{size}(x) \ge L_1$ for any non-leaf $x \ne \mathrm{root}(T)$, and (ii) all leaves of $T$ have the same level. A B-tree [1] is a weak B-tree with $L_1 = \lceil L_2/2 \rceil$.

If the sizes of the keys are fixed then consider $s_i = 1$, $1 \le i \le n$. For this case, all known optimal cost algorithms employ dynamic programming using the following decomposition. Let $\langle e_i, \ldots, e_j \rangle$ be the key sequence and $T$ the corresponding optimal cost tree of limit $M$. The problem of finding $T$ is decomposed into the two subproblems of finding optimal cost trees $T_L$ and $T_R$ of limit $M$ for $\langle e_i, \ldots, e_{k-1} \rangle$ and $\langle e_{k+1}, \ldots, e_j \rangle$, respectively. Suppose the root $r$ of $T$ consists of $m$ keys, $1 \le m \le M$ (Fig. 1).

Case 1. $m = 1$. Then $e_k$ is the key in $r$.

Case 2. $m > 1$. Then $e_k$ is the rightmost key in $r$ and the root of $T_L$ is restricted to $m - 1$ keys.

[Fig. 1. The decomposition rule: Case 1 and Case 2]

The cost of $T$ can be computed from $e_k$ and the costs of $T_L$ and $T_R$. This decomposition was first used in [3] for binary trees (Case 1 only) and generalized in [7].

3. Optimal Cost Multiway Search Trees

In this section we use optimal tree to mean an optimal cost multiway search tree. The problem of constructing an optimal tree for a given key sequence is shown to be NP-hard, but a pseudo-polynomial time algorithm is described.

The NP-completeness proof is a simple transformation from the partition problem [8]. An instance of partition is a set $A$ of elements, each one with a non-negative integer value. The question is to decide whether $A$ can be partitioned into two subsets both having the same sum of values of their elements.

Let $E = (e_1, \ldots, e_n)$ be a sequence of keys with sizes $s_i$, key weights $p_i$ and gap weights $q_j$, $1 \le i \le n$ and $0 \le j \le n$. Let $L$ and $C$ be positive integers, $L \ge s_i$, $1 \le i \le n$.

Theorem 1. Deciding whether there exists a multiway search tree for $E$ having limit $L$ and cost $\le C$ is NP-complete. It remains so even if all gap weights are zero and each key size equals the corresponding key weight.

Proof. Consider an arbitrary instance of the partition problem, namely a set $A = \{a_1, \ldots, a_n\}$, where each $a_i$ has a non-negative integer value $v_i$. Denote $\frac{1}{2} \sum_{a_j \in A} v_j$ by $b$. Define a key sequence $E = (e_1, \ldots, e_n)$ such that $s_i = p_i = v_i$, $1 \le i \le n$, and $q_j = 0$, $0 \le j \le n$. It follows that there exists a subset $A' \subseteq A$ satisfying $\sum_{a_j \in A'} v_j = b$ iff there exists a multiway search tree for $E$ having limit $b$ and cost $\le 3b$. Such a tree would have height 2 and the subset of keys in the root would be in one-to-one correspondence with $A'$. ∎

However, an optimal tree can be constructed in pseudo-polynomial time. Let $E = (e_1, \ldots, e_n)$ be a key sequence as above and $L$ the limit of the desired optimal tree $T$ for $E$. For $0 \le i < j \le n$ and $0 \le m \le L$ define

$$w(i,j) = \sum_{i < k \le j} p_k + \sum_{i \le k \le j} q_k,$$

and

$$\alpha(i,j,m) = \begin{cases} \infty, & \text{when } s_k > m \text{ for all } k,\ i < k \le j, \\ \text{the cost of the optimal tree of limit } L \text{ for } (e_{i+1}, \ldots, e_j) \text{ such that the root has size} \le m, & \text{otherwise.} \end{cases}$$

For $0 \le i \le n$ and $0 \le m \le L$ define $w(i,i) = 0$ and $\alpha(i,i,m) = q_i$.

Clearly, $\alpha(0,n,L)$ is the cost of the desired optimal tree. Applying the decomposition rule of Sect. 2, let $e_k$ be the rightmost key in the root of $T$. Then $T_R$ is an optimal tree of limit $L$. So is $T_L$, except that in Case 2 its root is restricted to size at most $m - s_k$. Therefore $\alpha(0,n,L)$ can be obtained by the following computation. For $0 \le i < j \le n$ and $0 \le m \le L$, let $\alpha(i,j,m) = \infty$ when $s_k > m$ for all $k$, $i < k \le j$. Otherwise

$$\alpha(i,j,m) = \min_{\substack{i < k \le j \\ s_k \le m}} \bigl\{ \min\,[\alpha(i,k-1,m-s_k),\ \alpha(i,k-1,L) + w(i,k-1)] + p_k + \alpha(k,j,L) + w(k,j) \bigr\}.$$

The above algorithm requires $O(n^3 L)$ time and $O(n^2 L)$ space. The time bound cannot be reduced by an application of the monotonicity principle. For example, in Fig. 2, $e_2$ is the rightmost key in the root of the optimal tree for the key sequence $\langle e_2, e_3 \rangle$. When adding $e_1$ at the left the rightmost key moves right.

[Fig. 2. Failure of the monotonicity principle for the cost criterion. Panels: optimal cost tree of page limit 2 for $\langle e_2, e_3 \rangle$; optimal cost tree of page limit 2 for $\langle e_1, e_2, e_3 \rangle$. Key sequence:]

    i      0   1   2   3
    s_i    -   1   2   1
    p_i    -   2   2   1
    q_i    0   0   0   0

4. Optimal Height or Space Multiway Search Tree

Throughout this section, an optimal tree means either an optimal height or an optimal space multiway search tree, according to the desired minimization criterion. An algorithm is described for finding an optimal tree for a given key sequence and limit. Among the possible optimal trees, the algorithm chooses one with minimal root size. It uses the decomposition of Sect. 2, as follows.

Let $T$ be an optimal tree of limit $L$, having space $S$ and height $H$. Suppose $T$ is space optimal. Then $T_R$ is space optimal and of limit $L$. Let $S'$ be the space of $T_R$. In Case 1, $T_L$ is space optimal, has limit $L$ and space $S - S' - 1$. Similarly in Case 2, except that the root is restricted to size at most $L - s_k$ and its space is $S - S'$. Suppose now $T$ is height optimal. Then $T_L$ has height at most $H - 1$ in Case 1 and at most $H$ otherwise. The height of $T_R$ is always no more than $H - 1$. Clearly, at least one of $T_L$ and $T_R$ is height optimal, but we can restrict the search to the case in which both are.

A quasi multiway search tree of limit $L$ is a multiway search tree of limit $L$, except for the root whose size is unbounded. A multiway (or quasi multiway) search tree has parameter $z$ when its height or space is $z$, according to the case in consideration. Denote by $Z$ the parameter of the optimal tree.

Let $E = (e_1, \ldots, e_n)$, each key $e_i$ having size $s_i$. For given $L$ and $z > 0$ define:

$$\alpha(i,j,z) = \begin{cases} 0, & \text{when } i > j, \\ \text{the minimal size of the root of a quasi multiway search tree of limit } L \text{ and parameter} \le z \text{ for the key sequence } (e_i, \ldots, e_j), & \text{otherwise.} \end{cases}$$

When $z \le 0$ define $\alpha(i,j,z) = \infty$ for all $i, j$. Now let

$$\sigma(i,j,z) = \begin{cases} \infty, & \text{when } \alpha(i,j,z') > L \text{ for all } z',\ 1 \le z' \le z, \\ \min\,\{z' \mid \alpha(i,j,z') \le L,\ 1 \le z' \le z\}, & \text{otherwise,} \end{cases} \qquad \text{(i)}$$

$$\beta(i,j,z) = \begin{cases} \infty, & \text{when } \sigma(i,j,z) = \infty, \\ 0, & \text{otherwise.} \end{cases} \qquad \text{(ii)}$$

In other words, $\sigma(i,j,z)$ equals the parameter of the optimal tree of limit $L$ for $(e_i, \ldots, e_j)$, provided it is $\le z$. In this case, $\beta(i,j,z) = 0$.
If the parameter of this tree is greater than $z$ then $\sigma(i,j,z) = \beta(i,j,z) = \infty$.

The computation of $\alpha$ is as follows. For $1 \le i \le j \le n$,

$$\alpha(i,j,1) = \sum_{i \le k \le j} s_k. \qquad \text{(iii)}$$

Then for $1 \le i \le j \le n$ and $z = 2, 3, \ldots$

$$\alpha(i,j,z) = \min_{i \le k \le j} \bigl\{ \min\,[\alpha(i,k-1,f(k,j,z)),\ \beta(i,k-1,f(k,j,z)-1)] + s_k \bigr\}, \qquad \text{(iv)}$$

where

$$f(k,j,z) = \begin{cases} z - \beta(k+1,j,z-1) & \text{for height minimization,} \\ z - \sigma(k+1,j,z-1) & \text{for space minimization.} \end{cases} \qquad \text{(v)}$$

The process starts by computing (iii), i.e. the value of each $\alpha$ for $z = 1$. Then it proceeds to (iv) for $z > 1$. It stops at the least $z$ such that $\sigma(1,n,z) < \infty$. The terminating $z$ satisfies $z = Z$. Clearly, $\sigma(1,n,Z) = Z$ and $\alpha(1,n,Z)$ is the (minimal) root size of the final optimal tree. After each $\alpha$ is calculated, the corresponding $\sigma$ is evaluated by (i) and then $\beta$ by (ii). All computations are common to both height and space minimization, except $f$. Each computation of $\alpha(i,j,z)$ by (iv) involves the evaluation by (v) of the function $f(k,j,z)$, for $i \le k \le j$. Each evaluation of $\sigma$, $\beta$ or $f$ can be done in constant time, provided some previously computed values were kept. Therefore the algorithm can be implemented in $O(n^3 Z)$ time and $O(n^2 Z)$ space.

Lemma 1. $\sigma(i,j,z) = \infty \Rightarrow \sigma(i,j+1,z) = \infty$; $\sigma(i,j,z) < \infty \Rightarrow \sigma(i,j,z+1) = \sigma(i,j,z)$; and $\sigma(i,j,z) < \infty$ and $\sigma(i,j+1,z) = \infty \Rightarrow \sigma(i,j,z) = z$ and $\sigma(i,j+1,z+1) = z+1$.

The proof is straightforward.

By using Lemma 1 it is possible to improve the algorithm. Observe that whenever $\sigma(i,j,z) = \infty$ there is no need to compute $\alpha(i,j',z)$ for $j' > j$, since $\sigma(i,j',z) = \infty$. Also, when $\sigma(i,j,z) < \infty$ we can avoid the computation of $\alpha(i,j,z')$ for $z' > z$, since $\sigma(i,j,z') = \sigma(i,j,z)$. Finally, if $\sigma(i,j,z) < \infty$ and $\sigma(i,j+1,z) = \infty$, necessarily $\alpha(i,j+1,z) > L$ and $\alpha(i,j+1,z+1) \le L$. Consequently, for each pair $i, j$ such that $1 \le i \le j \le n$, $\alpha$ does not need to be computed more than twice (obtaining a value $\alpha > L$ at most once and $\alpha \le L$ exactly once). The above observations lead to the following formulation.
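For comparison with that formulation, the basic computation (iii) to (v) can be sketched directly. The Python below is only a minimal sketch for the height criterion, assuming keys are numbered from 1 and that every key fits in a page; the names `min_height_tree`, `alpha` and `fits` are illustrative and do not come from the paper, and `fits` stands in for the test $\beta = 0$. It follows the unimproved $O(n^3 Z)$ scheme, not the improved algorithm.

```python
from functools import lru_cache
from math import inf


def min_height_tree(sizes, L):
    """Return (Z, root_size): the minimal height and, for it, the minimal root size.

    sizes[k-1] is the size s_k of key e_k; L is the page limit.
    """
    n = len(sizes)
    if n == 0:
        return 0, 0
    if max(sizes) > L:
        raise ValueError("a key larger than the page limit fits in no node")
    prefix = [0]
    for s in sizes:
        prefix.append(prefix[-1] + s)

    @lru_cache(maxsize=None)
    def alpha(i, j, z):
        # Minimal root size of a quasi tree (unbounded root) of height <= z
        # for the keys e_i, ..., e_j.
        if z <= 0:
            return inf
        if i > j:
            return 0
        if z == 1:                            # (iii): every key sits in the root
            return prefix[j] - prefix[i - 1]
        best = inf
        for k in range(i, j + 1):             # e_k is the rightmost key of the root
            if not fits(k + 1, j, z - 1):     # T_R must have height <= z - 1
                continue
            if fits(i, k - 1, z - 1):         # Case 1: T_L hangs below the root
                left = 0
            else:                             # Case 2: the root of T_L joins the root
                left = alpha(i, k - 1, z)
            best = min(best, left + sizes[k - 1])
        return best

    def fits(i, j, z):
        # Plays the role of beta(i, j, z) = 0. alpha is non-increasing in z,
        # so testing the largest allowed height is enough.
        return alpha(i, j, z) <= L

    z = 1
    while alpha(1, n, z) > L:                 # least z with a proper tree
        z += 1
    return z, alpha(1, n, z)
```

With key sizes (2, 2, 2, 2, 2, 1) and page limit 4, this sketch returns height 2 with root size 3; dropping the first key gives height 2 with root size 2.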
Height or Space Minimization Algorithm

In the initial step, let $z = 0$ and for $1 \le i \le n$ define each key $e_i$ as unfinished and $j(i) = i$. In the general step, if there are no unfinished keys the process terminates. Otherwise label as unlocked each still unfinished key, increase $z$ by one, perform the locking procedure below and repeat the general step.

Locking Procedure

Verify whether there are still unlocked keys. If not, the procedure terminates. Otherwise, choose arbitrarily an unlocked key $e_i$ and compute $\alpha(i,j(i),z)$ and the corresponding $\sigma$ and $\beta$. Check whether $\sigma(i,j(i),z) < \infty$. In the affirmative case, increase $j(i)$ by one and, if $j(i)$ becomes $n + 1$, redefine $e_i$ as both locked and finished. When $\sigma(i,j(i),z) = \infty$ just relabel $e_i$ as locked (but still unfinished). In any case, repeat the locking procedure.

The new algorithm runs in $O(n^3)$ time and $O(n^2)$ space.

Now, let us restrict to height minimization, i.e. consider $z$ as the height. The following definitions are useful. For $z > 0$,

$\mathrm{right}_z(e_i) = \max k$, $i \le k \le n$, such that the optimal tree for $(e_i, \ldots, e_k)$ has height $\le z$.

$\mathrm{left}_z(e_j) = \min k$, $1 \le k \le j$, such that the optimal tree for $(e_k, \ldots, e_j)$ has height $\le z$.

The values of $\mathrm{right}_z$ and $\mathrm{left}_z$ are clearly not independent. It follows that

$\mathrm{left}_z(e_j) = \min k$, $1 \le k \le j$, such that $\mathrm{right}_z(e_k) \ge j$.

Therefore, given $\mathrm{right}_z(e_i)$ for each $i$, $1 \le i \le n$, all the $\mathrm{left}_z$ values can be computed in $O(n)$ time.

For $1 \le i \le k \le j \le n$ and height $z > 1$ define the candidate $Q_k = Q_k(i,j,z)$ with value $^*Q_k(i,j,z)$ given by

$$^*Q_k(i,j,z) = \min\,\{\alpha(i,k-1,f(k,j,z)),\ \beta(i,k-1,f(k,j,z)-1)\} + s_k$$

and let

$$Q(i,j,z) = \{Q_k(i,j,z) \mid {}^*Q_k(i,j,z) < \infty,\ i \le k \le j\}.$$

Then (iv) can be rewritten as

$$\alpha(i,j,z) = \min\,\{{}^*Q_k \mid Q_k \in Q(i,j,z)\}.$$

In other words, the candidates are the operands of the minimization (iv) and consequently $\alpha(i,j,z)$ equals the minimal value among the candidates $Q_k(i,j,z)$.

Lemma 2. Let $1 \le i \le j < n$ and height $z > 1$.
If $\beta(i,j,z) = 0$ then $Q(i,j+1,z)$ can be constructed from $Q(i,j,z)$ as follows:

$$Q(i,j+1,z) = [\,Q(i,j,z) \cup \{Q_{j+1}(i,j+1,z)\}\,] - EX(i,j,z),$$

where

$$EX(i,j,z) = \{Q_k(i,j,z) \mid \beta(k+1,j+1,z-1) = \infty,\ i \le k \le j\}.$$

Proof. Since $\beta(i,j,z) = 0$, $\alpha(i,j,z) \le L$ and therefore $^*Q_{j+1}(i,j+1,z) = \min\,\{\alpha(i,j,z),\ \beta(i,j,z-1)\} + s_{j+1} < \infty$. Then $Q_{j+1}(i,j+1,z) \in Q(i,j+1,z)$. As for the exclusions (the candidates of $EX$), if $\beta(k+1,j+1,z-1) = \infty$ then $Q_k(i,j+1,z) \notin Q(i,j+1,z)$, $i \le k \le j$. When $\beta(k+1,j+1,z-1) = 0$ it follows that $\beta(k+1,j,z-1) = 0$ and since $\beta(i,k-1,z) = 0$ we conclude that $^*Q_k(i,j,z) = {}^*Q_k(i,j+1,z)$. The latter corresponds to the common candidates of $Q(i,j,z)$ and $Q(i,j+1,z)$. ∎

Lemma 3. Let $1 \le i \le j < n$ and height $z \ge 1$. If $\beta(i,j,z) = 0$ and $\beta(i,j+1,z) = \infty$ then

$$Q(i,j+1,z+1) = \{Q_k \mid {}^*Q_k = s_k,\ \mathrm{left}_z(e_{j+1}) \le k \le j+1\}.$$

Proof. If $k < \mathrm{left}_z(e_{j+1})$ then $\beta(k+1,j+1,z) = \infty$, consequently $^*Q_k = \infty$ and $Q_k \notin Q(i,j+1,z+1)$. In addition, if $k > j+1$ then $Q_k$ is not a candidate of $Q(i,j+1,z+1)$. Suppose now $\mathrm{left}_z(e_{j+1}) \le k \le j+1$. Because $\beta(i,j,z) = 0$, $\beta(i,k-1,z) = 0$. Because $\beta(k+1,j+1,z) = 0$, $\beta(k+1,j,z) = 0$. Therefore $^*Q_k(i,j+1,z+1) = \min\,\{\alpha(i,k-1,z+1),\ \beta(i,k-1,z)\} + s_k = s_k$ and $Q_k \in Q(i,j+1,z+1)$. ∎

A possibility for further improving the height minimization algorithm is to use a priority queue to contain the sets $Q(i,j,z)$. The central point then becomes updating the queue. We next describe a method for it.

A simple change in the locking procedure is to make the choice of the unlocked key no longer arbitrary. Instead, we shall always select the unlocked key $e_i$ with maximum $i$. Suppose such $e_i$ has been chosen and that $\alpha(i,j,z)$, $j < n$, has been calculated, obtaining $\beta(i,j,z) = 0$. We now follow the next computations. Initially, since $\beta(i,j,z) = 0$, $e_i$ remains unlocked and is again chosen in the locking procedure. We prepare the computation of $\alpha(i,j+1,z)$.
Use Lemma 2 to obtain $Q(i,j+1,z)$ from $Q(i,j,z)$. This corresponds to inserting in the queue the candidate of value $^*Q_{j+1}(i,j+1,z) = \min\,\{\alpha(i,j,z),\ \beta(i,j,z-1)\} + s_{j+1}$ and removing from it the candidates of the set $EX(i,j,z)$. The latter is identified by iteratively checking for $k = i, i+1, \ldots$ whether $\beta(k+1,j+1,z-1) = \infty$ and stopping when this becomes false. As long as we keep choosing the same key $e_i$ in the locking procedure, no candidate is removed more than once from the priority queue until $e_i$ becomes locked. Therefore $O(n)$ deletions may occur until $e_i$ gets locked again. This means $O(n^2 H)$ deletions overall, where $H$ is the height of the optimal tree, i.e. $O(n^2 \log^2 n)$ time. This is the dominant factor in the time complexity of the algorithm. Note that $H$ is $O(\log n)$.

If the above computation of $\alpha(i,j+1,z)$ results in $\beta(i,j+1,z) = 0$ and $j + 1 < n$, repeat the same argument and construct $Q(i,j+2,z)$ from $Q(i,j+1,z)$, and so on. Otherwise, if $\beta(i,j+1,z) = \infty$ then $e_i$ becomes locked but still unfinished. When $e_i$ is eventually unlocked again and chosen in a new computation of the locking procedure, the value to be calculated is $\alpha(i,j+1,z+1)$. Lemma 3 indicates directly the contents of $Q(i,j+1,z+1)$. We then disregard the current priority queue and construct a new one in $O(n)$ time, containing the values of $Q(i,j+1,z+1)$. A key may become locked $O(H)$ times. Therefore $O(n^2 \log n)$ time is needed overall for these constructions.

From Lemma 3 we observe that the computation of $Q(i,j+1,z+1)$ depends on knowing $\mathrm{left}_z(e_{j+1})$. Since $\beta(i,j,z) = 0$ and $\beta(i,j+1,z) = \infty$ we have $\mathrm{right}_z(e_i) = j$. This means that each time the set of unfinished keys becomes locked (at the end of the locking procedure), the corresponding $\mathrm{right}_z$ values are all known. At this moment, and as observed before, we can compute all $\mathrm{left}_z$ values in $O(n)$ time, i.e. $O(n \log n)$ overall.

The remaining main operation is the actual minimization in the priority queue to obtain the $\alpha$ values.
This requires $O(n^2 \log n)$ time overall. The time complexity is therefore $O(n^2 \log^2 n)$.

As for the space complexity, observe that the only $\alpha$ which we need to remember is $\alpha(i,j,z)$ when updating the priority queue for $Q(i,j+1,z)$. When this occurs, $\alpha(i,j,z)$ is precisely the last one calculated. Therefore constant space suffices for the $\alpha$'s. Values of $\beta$ corresponding to $z - 1$ or $z$ are needed in general, when performing the computations for height $z$. But they can be easily obtained from the $\mathrm{left}_{z-1}$ or $\mathrm{right}_z$ values, respectively, which must then be available and occupy $O(n)$ space. The priority queue contains at most $O(n)$ values. The space complexity is therefore $O(n)$.

The above strategy applies for $z > 1$. The case $z = 1$ should be done first and consists of calculating $\mathrm{left}_1(e_i)$, using (iii).

The monotonicity principle cannot be applied to any height or space minimization algorithm seeking minimal root size. See Fig. 3.

[Fig. 3. Failure of the monotonicity principle for either height or space criterion. Panels: optimal height and space tree of page limit 4 for $(e_2, e_3, e_4, e_5, e_6)$, having minimal root size; optimal height and space tree of page limit 4 for $(e_1, e_2, e_3, e_4, e_5, e_6)$, having minimal root size. Key sequence:]

    i      1   2   3   4   5   6
    s_i    2   2   2   2   2   1

5. Optimal Cost Weak B-Trees

In this section we show that the problem of constructing an optimal cost weak B-tree is NP-hard. Denote by $E = (e_1, \ldots, e_n)$ a sequence of keys with sizes $s_i$, key weights $p_i$ and gap weights $q_j$, $1 \le i \le n$ and $0 \le j \le n$. Let $L_1$, $L_2$ and $C$ be positive integers, $L_1 < L_2$.

Theorem 2. Deciding whether there exists a weak B-tree for $E$ having limits $(L_1, L_2)$ and cost $\le C$ is NP-complete. It remains so even if all gap weights are zero and each key size equals the corresponding key weight.

Proof. Consider an arbitrary instance of the partition problem, namely a set $A = \{a_1, \ldots, a_m\}$, each $a_i$ having a non-negative integer value $v_i$ associated with it. Let $b = \frac{1}{2} \sum_{a_i \in A} v_i$ be an integer, otherwise the solution is trivial. Construct a key sequence $E = (e_1, \ldots, e_{2m+3})$ formed by four types of keys:

Type 1. key $e_1$, with $s_1 = p_1 = m(b+1)$;
Type 2. key $e_2$, with $s_2 = p_2 = m$;
Type 3. keys $e_3, e_5, \ldots, e_{2m+3}$, with $s_k = p_k = 1$, $k = 3, 5, \ldots, 2m+3$; and
Type 4. keys $e_4, e_6, \ldots, e_{2m+2}$, with $s_k = p_k = m\,v_{(k-2)/2}$, $k = 4, 6, \ldots, 2m+2$.

Let all $q_i = 0$, $0 \le i \le 2m+3$. It follows that there exists a subset $A' \subseteq A$ such that $\sum_{a_i \in A'} v_i = b$ if and only if there exists a weak B-tree for $E$ with limits $(1, mb+m)$ and cost $\le 5mb + 5m + 2$. Such a tree would have height 2 and the type 4 keys of the root would be in one-to-one correspondence with $A'$. ∎

Again the NP-completeness is not strong. An optimal cost weak B-tree can be found in pseudo-polynomial time by appropriately extending the algorithm of Sect. 3.

6. Weak B-Trees of Height 2

Given a key sequence $E = (e_1, \ldots, e_n)$, each $e_i$ with size $s_i$, and given integers $L_1, L_2$ with $0 < L_1 < L_2$, we first consider the problem of finding a weak B-tree $T$ for $E$ having limits $(L_1, L_2)$, height 2 and minimal root size. We assume that $\sum s_i > L_2$, otherwise there is no reason for a tree of height 2. Observe that $T$ can be determined just by finding the subset of keys which forms the root. In order to compute this subset, we construct an acyclic digraph $G$ with vertex set $\{v_0, v_1, \ldots, v_{n+1}\}$. $G$ has one directed edge $(v_i, v_j)$ and a distance $d_{ij}$ for each $i, j$, $0 \le i < j \le n+1$. Each distance is defined as follows:

$$d_{ij} = \begin{cases} s_j, & \text{when } L_1 \le \sum_{i < k < j} s_k \le L_2, \\ \infty, & \text{otherwise,} \end{cases}$$

where $s_{n+1} = 0$.

Let the length of a path $P$ in $G$ be the sum of the distances of the edges of $P$. It follows that the weak B-trees of limits $(L_1, L_2)$ and height 2 are in one-to-one correspondence with the $v_0$-$v_{n+1}$ paths in $G$ of length $\le L_2$. Denote by $D_i$ the length $d_{ij} + \ldots + d_{k,n+1}$ of the shortest $v_i$-$v_{n+1}$ path $P_i$ in $G$. Compute the shortest path $P_0$ of length $D_0$ from $v_0$ to $v_{n+1}$.
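The graph formulation above can be sketched as a direct dynamic program over the vertices $v_{n+1}, v_n, \ldots, v_0$. The Python below is a hedged sketch of the quadratic version only, not of any faster refinement; the function name `height2_root` and the `nxt` successor array are illustrative, and setting $D_{n+1} = 0$ (the empty path) is an assumption made here so that the recursion has a starting point.

```python
from math import inf


def height2_root(sizes, L1, L2):
    """Return (root_size, root_key_indices) for a height-2 weak B-tree, or None.

    sizes[k-1] is s_k; non-root nodes must have total size in [L1, L2] and the
    root total size at most L2. Key indices in the result are 1-based.
    """
    n = len(sizes)
    s = [0] + list(sizes) + [0]           # s[1..n], with s[n+1] = 0 as in the text
    prefix = [0] * (n + 2)
    for k in range(1, n + 2):
        prefix[k] = prefix[k - 1] + s[k]

    D = [inf] * (n + 2)                   # D[i]: shortest v_i -> v_{n+1} distance
    nxt = [None] * (n + 2)                # successor of v_i on that path
    D[n + 1] = 0                          # assumption: the empty path has length 0
    for i in range(n, -1, -1):
        for j in range(i + 1, n + 2):
            child = prefix[j - 1] - prefix[i]     # sum of s_k for i < k < j
            if L1 <= child <= L2 and s[j] + D[j] < D[i]:
                D[i] = s[j] + D[j]        # edge (v_i, v_j) has distance s_j
                nxt[i] = j
    if D[0] > L2:                         # root would exceed the upper limit
        return None
    root, v = [], nxt[0]
    while v != n + 1:                     # inner vertices of the path = root keys
        root.append(v)
        v = nxt[v]
    return D[0], root
```

For sizes (2, 2, 2, 2, 2, 1) with limits (1, 4) this sketch returns a root of size 4 formed by $e_2$ and $e_4$; ties between equally short paths are broken by the loop order, which is an arbitrary choice of the sketch.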
If $D_0 > L_2$ the desired weak B-tree $T$ cannot exist. Otherwise the root of $T$ is formed by the keys of $\{e_j \mid v_j \in P_0 - \{v_0, v_{n+1}\}\}$.

A straightforward implementation of the above process gives an $O(n^2)$ time and space algorithm for finding the tree $T$. However, it is possible to improve it by taking advantage of the special distribution of the edge distances. For $0 \le k \le n+1$ define the candidate $Q_k$ with value $^*Q_k = s_k + D_k$. For $0 \le i \le n+1$, let

$$Q(i) = \{Q_k \mid {}^*Q_k < \infty \text{ and } d_{ik} < \infty,\ i < k \le n+1\}.$$

The shortest distances from $v_i$ to $v_{n+1}$ can be computed by

$$D_i = \begin{cases} \infty, & \text{if } Q(i) = \emptyset, \\ \min\,\{{}^*Q_k \mid Q_k \in Q(i)\}, & \text{otherwise,} \end{cases} \qquad \text{(i)}$$

for $i = n+1, n, \ldots, 0$. We use a priority queue to contain the values of the candidates of $Q(i)$. The point again is updating it after each iteration. The following functions are useful. For $0 \le i \le n$, let

$$\sigma_1(i) = \begin{cases} n+2, & \text{when } \sum_{i < k \le n+1} s_k < L_1, \\ \min\,\{j \mid \sum_{i < k < j} s_k \ge L_1,\ i+1 \le j \le n+1\}, & \text{otherwise,} \end{cases} \qquad \text{(ii)}$$

$$\sigma_2(i) = \begin{cases} n+1, & \text{when } i = n, \\ \max\,\{j \mid \sum_{i < k < j} s_k \le L_2,\ i+1 \le j \le n+1\}, & \text{otherwise.} \end{cases} \qquad \text{(iii)}$$

In other words, if the shortest distance $D_i$ is finite and $v_j$ is the vertex that follows $v_i$ in $P_i$ then $i < \sigma_1(i) \le j \le \sigma_2(i)$. In this case, $^*Q_j = D_i$. There is no difficulty computing all $\sigma_1(i)$ and $\sigma_2(i)$, $i = n, n-1, \ldots, 0$, in $O(n)$ time. The following lemma shows how to construct $Q(i-1)$ from $Q(i)$.

Lemma 4. If $0 < i \le n+1$ and $Q(n+1) = \emptyset$ then $Q(i-1) = [\,Q(i) \cup IN(i)\,] - EX(i)$, where

$$IN(i) = \{Q_k \mid {}^*Q_k < \infty,\ \sigma_1(i-1) \le k \le \min\,\{\sigma_1(i) - 1,\ \sigma_2(i-1)\}\}$$

and

$$EX(i) = \{Q_k \mid {}^*Q_k < \infty,\ \max\,\{\sigma_1(i),\ \sigma_2(i-1) + 1\} \le k \le \sigma_2(i)\}.$$

Proof. If $\sigma_1(i-1) > \sigma_2(i-1)$ then $\sigma_1(i-1) = n+2$ and $\sigma_1(i) > \sigma_2(i)$. Then $Q(i) = IN(i) = EX(i) = \emptyset$ and consequently $Q(i-1) = \emptyset$. Otherwise $\sigma_1(i-1) \le \sigma_2(i-1)$ and suppose first $\sigma_1(i) > \sigma_2(i)$. Then $Q(i) = EX(i) = \emptyset$ and since $\sigma_1(i) = n+2$, $\sigma_2(i-1) \le \sigma_1(i) - 1$ and consequently $Q(i-1)$ coincides with $IN(i)$. Finally consider $\sigma_1(i-1) \le \sigma_2(i-1)$ and $\sigma_1(i) \le \sigma_2(i)$. Let $^*Q_k < \infty$.
If $k < \sigma_1(i-1)$ or $k > \sigma_2(i)$ then $Q_k \notin Q(i)$ and $Q_k \notin Q(i-1)$. If $k < \sigma_1(i)$ then $Q_k \notin Q(i)$, but $Q_k \in Q(i-1)$ when $\sigma_1(i-1) \le k \le \sigma_2(i-1)$. Therefore we include the candidates of $IN(i)$. If $\sigma_2(i-1) < k$ then $Q_k \notin Q(i-1)$, but $Q_k \in Q(i)$ when $\sigma_1(i) \le k \le \sigma_2(i)$. Therefore we exclude the candidates of $EX(i)$. The remaining possibility is $\sigma_1(i) \le k \le \sigma_2(i-1)$. In this case $Q_k \in Q(i)$ and $Q_k \in Q(i-1)$. The latter candidates remain unchanged. ∎

The algorithm then follows. Initially, define $Q(n+1) = \emptyset$ and compute $\sigma_1(i)$ and $\sigma_2(i)$ for each $i$, $i = n, n-1, \ldots, 0$, using (ii) and (iii) respectively. Subsequently, for $i = n+1, n, \ldots, 0$ compute $D_i$ using (i). The priority queue which contains the candidates of $Q(i)$ is updated using Lemma 4, after each iteration $i$. There are $O(n)$ minimizations, inclusions and exclusions in the process. We therefore need $O(n \log n)$ time. The space complexity is $O(n)$.

Now, consider the following problem. We wish to find a weak B-tree $T$ as in the above case, except that additionally its root is required to be formed by a given number $M$ of keys [11]. A solution can be constructed using the following dynamic programming algorithm, based again on the decomposition of Sect. 2. Let $e_k$ be the rightmost key in the root of $T$. Then $L_1 \le \sum_{k < i \le n} s_i \le L_2$. In Case 1, $L_1 \le \sum_{1 \le i < k} s_i \le L_2$. In Case 2, $T_L$ is a weak B-tree for $(e_1, \ldots, e_{k-1})$ of limits $(L_1, L_2)$, height 2, minimal root size and having $M - 1$ keys in the root.

For $1 \le i \le j \le n$ and $m > 0$, define

$$\alpha(i,j,m) = \begin{cases} \text{the minimal size of the root of a weak B-tree for } (e_i, \ldots, e_j) \text{ of limits } (L_1, L_2), \text{ height 2 and having } m \text{ keys in the root;} \\ \infty, \text{ whenever the above tree does not exist.} \end{cases}$$

For $1 \le i \le j \le n$, define:

$$\alpha(i,j,0) = \begin{cases} 0, & \text{when } L_1 \le \sum_{i \le k \le j} s_k \le L_2, \\ \infty, & \text{otherwise.} \end{cases}$$

The problem can be solved by computing $\alpha(1,n,M)$ using the following equation. For $2m+1 \le j \le n$ and $1 \le m \le M$,

$$\alpha(1,j,m) = \min_{2m \le k \le j-1} \{\alpha(1,k-1,m-1) + s_k + \alpha(k+1,j,0)\}.$$

Observe that if $\alpha(1,j,m)$ is finite then necessarily $j \ge 2m+1$ and the rightmost key $e_k$ in the root of the corresponding weak B-tree satisfies $2m \le k \le j-1$. Using the functions $\sigma_1$ and $\sigma_2$ as defined above we can compute each $\alpha(i,j,0)$ in constant time, as follows. If $1 \le i \le j \le n$,

$$\alpha(i,j,0) = \begin{cases} 0, & \text{when } \sigma_1(i-1) \le j+1 \text{ and } \sigma_2(i-1) \ge j+1, \\ \infty, & \text{otherwise.} \end{cases}$$

Therefore we can compute $\alpha(1,n,M)$ in $O(n^2 M)$ time and $O(n)$ space. The time bound can be improved as described below. For $1 \le m \le M$ and $2m \le k$ define the candidate $Q_k$ with value $^*Q_k = \alpha(1,k-1,m-1) + s_k$. For $1 \le j \le n$ and $1 \le m \le M$, let

$$Q(j,m) = \{Q_k \mid {}^*Q_k \le L_2 \text{ and } \alpha(k+1,j,0) = 0,\ 2m \le k \le j-1\}.$$

Then $\alpha(1,j,m)$ can be computed by

$$\alpha(1,j,m) = \begin{cases} \infty, & \text{if } Q(j,m) = \emptyset, \\ \min\,\{{}^*Q_k \mid Q_k \in Q(j,m)\}, & \text{otherwise,} \end{cases}$$

for $1 \le j \le n$ and $1 \le m \le M$. A priority queue is used to keep the sets $Q(j,m)$. For a fixed $m$, update the queue as follows. If $j < 2m+1$ then $Q(j,m) = \emptyset$. Otherwise we construct $Q(j+1,m)$ from $Q(j,m)$. Include each candidate $Q_k$ such that $^*Q_k \le L_2$, $\alpha(k+1,j,0) = \infty$ and $\alpha(k+1,j+1,0) = 0$. Exclude each $Q_k \in Q(j,m)$ which satisfies $\alpha(k+1,j,0) = 0$ and $\alpha(k+1,j+1,0) = \infty$. Each candidate to be changed can be identified in constant time, since the next possible inclusion and exclusion are $Q_{q+1}$ and $Q_p$, respectively, where $q = \max\,\{k \mid Q_k \in Q(j,m)\}$ and $p = \min\,\{k \mid Q_k \in Q(j,m)\}$. For each $m$, there can be $O(n)$ minimizations, inclusions and exclusions. The time and space complexities are therefore $O(nM \log n)$ and $O(n)$, respectively.

7. Conclusions

The construction of optimal multiway search trees and optimal weak B-trees for variable size keys has been considered. In particular, constructing optimal cost trees is NP-hard in both cases, although both admit pseudo-polynomial time algorithms. But in many applications the limits of the trees are orders of magnitude less than the number of keys. Clearly, the algorithms are polynomial for this class of problems.
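To make the role of the limit $L$ in the pseudo-polynomial bound concrete, the cost recurrence of Sect. 3 can be sketched as a memoized recursion over the triples $(i, j, m)$. The Python below is a minimal sketch under the conventions $w(i,i) = 0$ and $\alpha(i,i,m) = q_i$ of that section; the function name `optimal_search_cost` and the prefix-sum helpers are illustrative, and only the optimal cost is returned, not the tree itself.

```python
from functools import lru_cache
from math import inf


def optimal_search_cost(sizes, p, q, L):
    """Cost alpha(0, n, L) of an optimal cost multiway search tree of limit L.

    sizes[k-1], p[k-1]: size and weight of key e_k; q[j]: weight of gap g_j
    (so len(q) == len(sizes) + 1).
    """
    n = len(sizes)
    s = [0] + list(sizes)
    pw = [0] + list(p)
    P = [0] * (n + 1)                     # P[k] = p_1 + ... + p_k
    for k in range(1, n + 1):
        P[k] = P[k - 1] + pw[k]
    Q = [0] * (n + 2)                     # Q[j+1] = q_0 + ... + q_j
    for j in range(n + 1):
        Q[j + 1] = Q[j] + q[j]

    def w(i, j):
        # Total weight of keys e_{i+1..j} and gaps g_{i..j}; w(i, i) = 0.
        if i == j:
            return 0
        return (P[j] - P[i]) + (Q[j + 1] - Q[i])

    @lru_cache(maxsize=None)
    def alpha(i, j, m):
        # Optimal cost for (e_{i+1}, ..., e_j) with root size at most m.
        if i == j:
            return q[i]
        best = inf
        for k in range(i + 1, j + 1):     # e_k is the rightmost key of the root
            if s[k] > m:
                continue
            left = min(alpha(i, k - 1, m - s[k]),             # Case 2
                       alpha(i, k - 1, L) + w(i, k - 1))      # Case 1
            best = min(best, left + pw[k] + alpha(k, j, L) + w(k, j))
        return best

    return alpha(0, n, L)
```

The number of distinct states is $O(n^2 L)$ and each state scans $O(n)$ values of $k$, which is where the $O(n^3 L)$ bound comes from. On the key sequence of Fig. 2 with page limit 2, the sketch gives cost 4 for the sizes (2, 1) with weights (2, 1), and cost 7 for the sizes (1, 2, 1) with weights (2, 2, 1), all gap weights being zero.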
It might well be worthwhile to consider an alternative strategy for defining weak B-trees of variable size keys, namely to adopt as lower limit a given number of keys, while maintaining the size as upper limit. The optimal cost problem remains of course NP-hard, but the trees become easier to manipulate. For example, an optimal height weak B-tree can be found in polynomial time if this alternative is adopted. When controlling nodes only by sizes, the construction of a weak B-tree of height 2 can also be carried out in polynomial time as described in Sect. 6. The case when the height is $\ge 3$ would bear further investigation.

Acknowledgements. To Ysmar V. Silva Fº for all the discussions and insightful remarks.

References

1. Bayer, R., McCreight, E.: Organization and maintenance of large ordered indexes. Acta Informat. 1, 173-189 (1971)
2. Garey, M.R.: Optimal binary search trees with restricted maximal depth. SIAM J. Comput. 2, 101-110 (1974)
3. Gilbert, E.N., Moore, E.F.: Variable-length binary encodings. Bell System Tech. J. 38, 933-968 (1959)
4. Gotlieb, L.: Optimal multiway search trees. SIAM J. Comput. 10, 422-433 (1981)
5. Gotlieb, L., Wood, D.: The construction of optimal multiway search trees and the monotonicity principles. Internat. J. Comput. Math. Sect. A 9, 17-24 (1981)
6. Huddleston, S., Mehlhorn, K.: A new data structure for representing sorted lists. Acta Informat. 17, 157-184 (1982)
7. Itai, A.: Optimal alphabetic trees. SIAM J. Comput. 5, 9-18 (1976)
8. Karp, R.M.: Reducibility among combinatorial problems. In: Complexity of Computer Computations. Miller, R.E., Thatcher, J.W. (eds.) New York: Plenum Press 1972
9. Knuth, D.E.: Optimum binary search trees. Acta Informat. 1, 14-25 (1971)
10. Knuth, D.E.: The Art of Computer Programming, Vol. 3: Sorting and searching. Reading (Mass.): Addison-Wesley 1973
11. McCreight, E.M.: Pagination of B*-trees with variable length records. Comm. ACM 20, 670-674 (1977)
12. Vaishnavi, V.K., Kriegel, H.P., Wood, D.: Optimum multiway search trees. Acta Informat. 14, 119-133 (1980)

Received March 22, 1983/November 8, 1983