Fast approximate string matching

1988, Software: Practice and Experience

Approximate string matching is an important operation in information systems because an input string is often an inexact match to the strings already stored. Commonly known accurate methods are computationally expensive as they compare the input string to every entry in the stored dictionary. This paper describes a two-stage process. The first uses a very compact ngram table to preselect sets of roughly similar strings. The second stage compares these with the input string using an accurate method to give an accurately matched set of strings. A new similarity measure based on the Levenshtein metric is defined for this comparison. The resulting method is both computationally fast and storage-efficient.

SOFTWARE-PRACTICE AND EXPERIENCE, VOL. 18(4), 387-393 (APRIL 1988)

Fast Approximate String Matching

O. OWOLABI AND D. R. MCGREGOR
Department of Computer Science, University of Strathclyde, Glasgow G1 1HX, U.K.

KEY WORDS: Dictionary; Strings; Approximate match; Similarity measure; n-gram

INTRODUCTION

Approximate string matching has become an important operation in information systems. Users often make spelling and typing mistakes in the process of data and query entry. To avoid the failures that may result, approximate string matching can be performed to retrieve similar information, or the correct set of information, where tolerable errors are present in either stored or query strings. A survey of applicable techniques can be found in Reference 1.

The most straightforward way to carry out matching is to compute some similarity function between the search string and every string in the dictionary. The results may be used in different ways. For example, the dictionary string that is most similar, according to the similarity measure, may then be regarded as the most likely substitute for an erroneous string.
Alternatively, it may be that the user requires a set of similar terms which can be used to effect a more complete recall of information. A widely recommended similarity measure is that computed in the dynamic programming method (Reference 2), which reckons the distance between two strings as the number of single-character transformations that will turn one string into the other. The most similar string is that which has the lowest transformation cost. The problem with such methods is that they are computationally expensive. The method is thus unsatisfactory where similarities must be computed over large dictionaries in real time. The n-gram method that is described below is sometimes preferred for the reason that it avoids this exhaustive search. It is however hampered by the large table that has to be manipulated. The improved n-gram method described later employs a technique that overcomes the drawbacks by reducing the size of the required table.

0038-0644/88/040387-07$05.00 (C) 1988 by John Wiley & Sons, Ltd. Received 24 June 1987; Revised 16 November 1987

The resultant method requires only small resources of storage and computation, but produces a superset of the required strings. It is thus well suited to the task of selecting an initial set of strings from which the best matches are selected using the dynamic programming method.

INVERTED n-GRAM MATRIX

In the n-gram method, the similarity of any two strings is based on the comparison of their n-grams. The larger the number of n-grams they possess in common, the more similar the strings (Reference 3). Although in general any n characters could be extracted to form the n-grams to be compared, we have used only adjacent n-grams. For example, for a string with p characters, a[1..p], there are p - 2 such trigrams: a[1..3], a[2..4], ..., a[p-2..p]. The n-grams usually employed are substrings of length 2 (digram) or 3 (trigram).
The shorter the substring, the lower the precision of matching and the greater the recall. For example, using single-character substrings results in anagrams being identified as definitely as exact matches.

The n-gram method of finding the most similar string to a search string uses an abstraction of the dictionary in the form of sets of binary arrays, usually maintained as a set of inverted files in bit array format, one such file per n-gram. Considered as a matrix, there is a row for each n-gram, and a column for each dictionary entry (Figure 1). The n-gram table is set up when the vocabulary is read in, and later consulted for matching purposes. As each string is read in, an identifier equal to its dictionary position is derived, and the column bits corresponding to its n-grams in the table are set. The row address of the bit-list corresponding to each n-gram is computed by a simple address calculation.

To search for an unknown string, we look up the inverted files for each of the n-grams in the unknown. The results are combined by forming a sum vector, with each element corresponding to its dictionary string. The vector element values are obtained by taking the n-grams of the unknown string one at a time, and computing the row addresses of the corresponding bit-lists. Each bit-list is then examined to determine the bits that are set, and hence the dictionary entries which also contain the n-grams, and for each of these the corresponding vector element is incremented. At the end, each vector element contains, for every dictionary string, the number of n-grams it has in common with the unknown search string. The similarity value can then be computed as the percentage

100(2c)/(s+d)

where s and d are the numbers of n-grams in the search and dictionary strings respectively, and c is the number of common n-grams.
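The similarity percentage above can be sketched in C (the paper's implementation language). This is a minimal illustration of the formula, not the authors' code; the function names are ours, and for brevity the common-trigram count is computed by direct comparison rather than through the inverted bit arrays.

```c
/* Sketch of the n-gram similarity 100*(2c)/(s+d), using adjacent trigrams. */
#include <string.h>

/* Number of adjacent trigrams in a string of length len: max(len - 2, 0). */
static int trigram_count(const char *s)
{
    int len = (int)strlen(s);
    return len > 2 ? len - 2 : 0;
}

/* Count trigram positions of a whose trigram also occurs somewhere in b. */
static int common_trigrams(const char *a, const char *b)
{
    int la = trigram_count(a), lb = trigram_count(b), c = 0;
    for (int i = 0; i < la; i++)
        for (int j = 0; j < lb; j++)
            if (memcmp(a + i, b + j, 3) == 0) { c++; break; }
    return c;
}

/* Percentage similarity 100*(2c)/(s+d) between search and dictionary string. */
static int ngram_similarity(const char *search, const char *dict)
{
    int s = trigram_count(search), d = trigram_count(dict);
    if (s + d == 0) return 0;
    return 100 * 2 * common_trigrams(search, dict) / (s + d);
}
```

For instance, 'binary' against 'binarx' shares three of four trigrams on each side, giving 100*(2*3)/(4+4) = 75 per cent.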
The simple n-gram method has three disadvantages:
(a) the large space required to store the binary arrays;
(b) the relatively low accuracy, as the matching depends on the somewhat arbitrary choice of n-grams used in setting up the inverted dictionary, and the possibility of 'n-gram anagrams' being selected in an approximate match even though grossly dissimilar overall to the input string;
(c) the set-up time for a large dictionary is also a serious overhead (Table I).

[Figure 1. n-gram table: a bit matrix in which an address calculation selects the row for each n-gram, and the columns correspond to numeric string identifiers 0, 1, ..., m-1.]

The main advantage of this method is the low overhead in computing the matching set of dictionary entries - a characteristic of inverted file access. The new method presented here was devised to use the best features of both the n-gram method and the dynamic programming method in such a way as to combine speed and accuracy with low storage requirements.

THE COMBINED TWO-STAGE METHOD

Our improved method is to use the n-gram approach as a coarse pre-filter yielding a superset of the required strings. A dynamic programming stage is then used to extract the approximate matches as defined by the similarity measure. The inverted n-gram bit array method was fast but required a large table. With some 25,000 dictionary entries and some 17,576 possible n-grams for single-case letters alone, the matrix occupies 45 Mbytes of storage. Operations on such a large table place heavy I/O loads on a system, resulting in long elapsed times (Table II). The dynamic programming method was also too slow; programmed in C and running on a Sun 3 system it required 1 min 37 s to locate a match (Table II).
Stage 1: a fast approximate n-gram filter

To provide a more practically attractive approach, we compressed the bit matrix of the n-gram approach, even though in so doing, the accuracy of the matches derived was reduced. We could do this provided:
(a) that the approximate n-gram method always delivers a superset of the required strings;
(b) that the size of the superset is such that it can be handled reasonably quickly by the second stage.

The first phase is speeded up because the reduction of the table size results in reduced I/O loading, and hence in an overall saving of elapsed time as well as space, whereas the second phase handles only a small fraction of the total number of strings in the dictionary.

The representation used is a set of binary arrays, each of length related to the dictionary size (Figure 1). It works just like the simple n-gram method, except for the modifications noted below. Each binary array represents one of the 17,576 (= 26 x 26 x 26) possible three-letter combinations of the alphabet. If the letter combination corresponding to the binary array is present in string j, the position in the array corresponding to j is 1. To retrieve the subset of the dictionary most similar to a given string, the three-letter combinations of the string are computed. For the string 'binary', say, these will be bin, ina, nar, ary. By an inverted look-up of the bit arrays which have been previously set up, a counter is incremented for every bit position set to one in each bit array corresponding to the n-grams of the input string.

Storage considerations

To save space in a large dictionary the bit arrays are 'overloaded', making them much shorter than the total number of strings. In effect, each bit is made to correspond to a number of entries.
To preserve the 'valid complete superset' property, the bit is set when the n-gram entry is present in any one of the corresponding mapped entries. This is in contrast to the usual method of allocating a bit position to each string in the dictionary. An 'overloading' factor of 16 seems to be appropriate in practice, as it reduces the size of a bit array by a factor of 16 while ensuring that the number of strings examined for every good score is not unduly large. The implication of this is that each position in a bit array will indicate the presence or otherwise of the corresponding letter combination in up to 16 strings. A further advantage of this is that strings that lie close to each other in a sorted dictionary are grouped together.

Another storage-conserving measure is the use of a number of bit arrays much less than the maximum possible (511 in place of 17,576). Whereas the bit arrays were formerly reduced in length (i.e. horizontal compression), we now reduce the number of bit arrays (i.e. vertical compression). This works in a manner similar to a hash addressing mechanism. The two reduction factors were determined by experimentation (see Table III). As was mentioned above, grouping adjacent strings together is advantageous in a sorted dictionary. This is due to the fact that strings that bear close similarity to each other are not scattered over several different groups. This leads to a better performance. Table IV shows the performance for the various reduction factors in the case of an unsorted dictionary.

In this manner a considerable amount of space is saved in storing the bit arrays. This is shown below:
(i) Using 17,576 bit arrays for a 20,000-word dictionary requires 17,576 x 20,000 bits = 3.6 x 10^8 bits = 4.5 x 10^7 bytes = 45 Mbytes.
(ii) With the measures stated above, however, the requirement is 511 x (20,000/16) bits = 6.5 x 10^5 bits = 8 x 10^4 bytes = 80 kbytes.
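The two compressions described above can be sketched as follows. This is our reconstruction, not the published program: the names ROWS, OVERLOAD, gram_row and so on are ours, and the trigram-to-row hash is only one plausible choice. Vertical compression maps each trigram to one of 511 bit arrays by hashing; horizontal compression makes one bit cover a group of 16 adjacent dictionary entries.

```c
/* Sketch of the compressed stage-1 filter: 511 hashed bit arrays,
 * each bit covering a group of OVERLOAD adjacent dictionary entries. */
#include <limits.h>
#include <string.h>

#define ROWS     511   /* bit arrays, in place of 26*26*26 = 17,576 */
#define OVERLOAD 16    /* dictionary entries sharing one bit        */
#define MAXWORDS 32    /* small demo dictionary                     */
#define GROUPS   ((MAXWORDS + OVERLOAD - 1) / OVERLOAD)

static unsigned char table[ROWS][(GROUPS + CHAR_BIT - 1) / CHAR_BIT];

/* Hash a trigram to one of the ROWS bit arrays (hypothetical hash). */
static int gram_row(const char *g)
{
    return ((g[0] * 26 + g[1]) * 26 + g[2]) % ROWS;
}

static void set_bit(int row, int group)
{
    table[row][group / CHAR_BIT] |= (unsigned char)(1u << (group % CHAR_BIT));
}

static int get_bit(int row, int group)
{
    return (table[row][group / CHAR_BIT] >> (group % CHAR_BIT)) & 1;
}

/* Index dictionary entry j: for each of its trigrams, set the bit of
 * j's group, so the group bit is 1 if ANY member contains the trigram
 * (this preserves the 'valid complete superset' property). */
static void index_word(int j, const char *w)
{
    for (int i = 0; i + 2 < (int)strlen(w); i++)
        set_bit(gram_row(w + i), j / OVERLOAD);
}

/* Stage-1 score for one group: how many trigrams of the search string
 * have that group's bit set.  The caller applies the threshold and, on
 * a good score, passes all OVERLOAD members of the group to stage 2. */
static int group_score(int group, const char *search)
{
    int score = 0;
    for (int i = 0; i + 2 < (int)strlen(search); i++)
        if (get_bit(gram_row(search + i), group))
            score++;
    return score;
}
```

Because distinct trigrams can hash to the same row, the filter may over-count and admit extra candidates, but it can never miss a true match, which is exactly the superset property required of the first stage.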
Thus the required space is reduced by a factor of about 500.

The first threshold function

The next step is to determine which strings are possible candidates for further matching. This is done by computing for each string the function

(100 x number of common triples)/(number of triples in search string).

A string is selected for further matching where this function exceeds a given threshold. In this case, for every bit position for which this is satisfied, the group of strings corresponding to it is further examined. The threshold has to be set according to the requirements of a particular application. This will determine the size of the set returned.

Match sensitivity

The computation of the letter combinations in a string was modified in line with user requests. This required a lower acceptance criterion for short strings and also a bias of the matching towards the beginning of a string. Using the example that has already been given above, the modification will yield the combinations b, bi, bin, ina, nar, ary for the string 'binary'. The effect of this can be seen in comparing the strings 'part' and 'parts':

part: p, pa, par, art
parts: p, pa, par, art, rts

The above function will yield 80 per cent, as opposed to 66 per cent without this modification. The result of this is greater recall in the matching process.

Stage 2: dynamic programming

In this second stage, the superset isolated by the initial filter is passed to the dynamic programming (DP) process to compute the approximate matching set of strings. A threshold function is used to reduce the final set to the best matching elements, eliminating any widely dissimilar strings selected by the approximate filter. For each distance returned by the DP method, the function

(100 x (ab - 2 x d))/ab

is computed, where ab is the sum of the lengths of the two strings compared, and d is the value returned by the DP method.
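The two pieces above, the sensitivity-modified gram list and the stage-2 similarity, can be sketched in C. These are our illustrations, not the paper's code: prefix_grams, lev and dp_similarity are hypothetical names, and lev is the standard dynamic-programming (Levenshtein) edit distance standing in for the DP method cited as Reference 2.

```c
/* Sketches of the modified gram list and the stage-2 similarity. */
#include <string.h>

/* Write the sensitivity-modified gram list of s into out (each entry a
 * NUL-terminated string of up to 3 chars); return the count.  The 1- and
 * 2-character prefixes are added ahead of the adjacent trigrams, so
 * "binary" yields b, bi, bin, ina, nar, ary. */
static int prefix_grams(const char *s, char out[][4])
{
    int len = (int)strlen(s), n = 0;
    for (int k = 1; k <= 2 && k <= len; k++) {   /* prefixes, e.g. b, bi */
        memcpy(out[n], s, (size_t)k);
        out[n++][k] = '\0';
    }
    for (int i = 0; i + 2 < len; i++) {          /* adjacent trigrams */
        memcpy(out[n], s + i, 3);
        out[n++][3] = '\0';
    }
    return n;
}

/* Levenshtein distance by dynamic programming (strings under 64 chars). */
static int lev(const char *a, const char *b)
{
    int la = (int)strlen(a), lb = (int)strlen(b);
    int d[64][64];
    for (int i = 0; i <= la; i++) d[i][0] = i;
    for (int j = 0; j <= lb; j++) d[0][j] = j;
    for (int i = 1; i <= la; i++)
        for (int j = 1; j <= lb; j++) {
            int cost = (a[i-1] == b[j-1]) ? 0 : 1;
            int m = d[i-1][j] + 1;               /* deletion */
            if (d[i][j-1] + 1 < m) m = d[i][j-1] + 1;        /* insertion */
            if (d[i-1][j-1] + cost < m) m = d[i-1][j-1] + cost; /* substitution */
            d[i][j] = m;
        }
    return d[la][lb];
}

/* Stage-2 percentage similarity 100*(ab - 2*d)/ab, where ab is the sum
 * of the two lengths and d the DP distance. */
static int dp_similarity(const char *a, const char *b)
{
    int ab = (int)(strlen(a) + strlen(b));
    return ab ? 100 * (ab - 2 * lev(a, b)) / ab : 100;
}
```

For 'part' against 'parts', ab = 9 and d = 1, so the stage-2 similarity is 100 x 7 / 9, about 78 per cent; identical strings score 100 and wholly dissimilar strings of equal length score 0 or below, which is what makes the function usable as a cut-off threshold.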
The string is returned where the computed function exceeds the given threshold. Setting the threshold controls the number and similarity of strings returned.

PERFORMANCE

The new method was tested for performance, and the results compared with both the dynamic programming and the n-gram methods. The C programs were run with the same set of search strings and a 25,000-word dictionary. The results shown below represent the average response times on a Sun 3 workstation with 4 Mbytes of main store. Table I shows the table set-up time for each method, and Table II shows the average time to compute the best match sets for a single string. Table III shows the average time to match a single string for the various combinations of the reduction factors.

A set of strings was used in the evaluation, individual strings ranging between 5 and 10 characters in length, and the elapsed time for matching the complete set was measured directly at the console. The average time was not sensitive to the particular strings chosen, though, of course, there was a measurable variation as the thresholds were lowered to cause more searching in the second stage. The tables were obtained using a threshold of 60 in both stages. It can be seen that the best performance is obtained with a maximum of 511 n-grams and an overloading factor of 16 for the strings. All times are in seconds. The above figures are for a sorted dictionary. Table IV shows the figures for an unsorted dictionary.
Table I

Method                  Table size    Set-up time
Dynamic programming     0             0
Simple n-gram           45 Mbytes     48 min
Combined method         80 kbytes     30 s

Table II

Method                  Time
Dynamic programming     1 min 37 s
Simple n-gram           1.5 s
Combined method         0.5 s

Table III

                   Overloading factor
n-grams       1       8       16      32
251           2.9     1.3     0.7     1.6
511           1.5     0.8     0.5     0.6
1011          1.8     6.0     0.8     1.3

Table IV

                   Overloading factor
n-grams       1       8       16      32
251           2.9     2.3     9.0     27.0
511           5.0     1.4     2.6     9.5
1011          1.7     1.0     5.5     6.0

These figures demonstrate that the new method has succeeded very well in outperforming the other two methods. It also compares favourably with another method, based on a fourth-generation language approach, which returns the best match from a 1000-word dictionary in about 1 second (Reference 4). For our 25,000-word dictionary, the best average times for matching a single string are 0.5 s for the sorted dictionary, and 1.0 s for the unsorted dictionary. The better performance using the sorted dictionary is because, when the table is 'overloaded', groups of strings are allocated a single bit in the n-gram table. In the case of the sorted dictionary each group is at least partially similar, hence the 'hits' are much better clustered than in the unsorted case. This minimizes the number of hits processed by the second stage, and hence improves performance.

SUMMARY

The approximate string matching method described in this paper is sufficiently fast and storage-efficient (even when dictionaries in excess of 25,000 words are in use) to be suitable for use in many typical on-line applications. It can thus be used to improve 'user friendliness' and recall of information in every-day applications. The method discussed is compact and fast, yet it does not trade accuracy for speed. The reason for its power lies with the fact that it works in two stages, first preselecting a small set of strings, which are then analysed more thoroughly.
A significant saving in space is also made by compressing the look-up bit array table, and this in turn reduces elapsed times by reducing I/O loading. Sorting the dictionary results in improved performance.

REFERENCES

1. J. L. Peterson, 'Computer programs for detecting and correcting spelling errors', Comm. ACM, 23, (12), 676-687 (1980).
2. R. A. Wagner and M. J. Fischer, 'The string-to-string correction problem', J. ACM, 21, (1), 168-173 (1974).
3. R. C. Angell, G. E. Freund and P. Willett, 'Automatic spelling correction using a trigram similarity measure', Informat. Proc. and Manage., 19, 255-261 (1983).
4. M. A. Bickel, 'Automatic correction to misspelled names: a fourth-generation language approach', Comm. ACM, 30, (3), 224-228 (1987).