A LINEAR TIME, CONSTANT SPACE DIFFERENCING ALGORITHM

Randal C. Burns and Darrell D. E. Long
Department of Computer Science
University of California, Santa Cruz
Santa Cruz, California 95064

ABSTRACT

An efficient differencing algorithm can be used to compress versions of files for both transmission over low bandwidth channels and compact storage. This can greatly reduce network traffic and execution time for distributed applications which include software distribution, source code control, file system replication, and data backup and restore. An algorithm for such applications needs to be both general and efficient: able to compress binary inputs in linear time. We present such an algorithm for differencing files at the granularity of a byte. The algorithm uses constant memory and handles arbitrarily large input files. While the algorithm makes minor sacrifices in compression to attain linear runtime performance, it outperforms the byte-wise differencing algorithms that we have encountered in the literature on all inputs.

I. INTRODUCTION

Differencing algorithms compress data by taking advantage of statistical correlations between different versions of the same data sets. Strictly speaking, they achieve compression by finding common sequences between two versions of the same data that can be encoded using a copy reference. We define a differencing algorithm to be an algorithm that finds and outputs the changes made between two versions of the same file by locating common sequences to be copied and unique sequences to be added explicitly. A delta file (Δ) is the encoding of the output of a differencing algorithm. An algorithm that creates a delta file takes as input two versions of a file, a base file and a version file to be encoded, and outputs a delta file representing the incremental changes made between versions:

    F_base + F_version → Δ(base, version)    (1)

Reconstruction, the inverse operation, requires the base file and a delta file to rebuild a version.

One encoding of a delta file consists of a linear array of editing directives (Figure 1). These directives are copy commands, references to a location in the base file where the same data exists, and add commands, instructions to add data into the version file, followed by the data to be added. While there are other representations [12, 1, 3], in any encoding scheme a differencing algorithm must have found the copies and adds to be encoded. So, any encoding technique is compatible with the methods that we present.

[Figure 1: The copies found between the base and version file are encoded as copy commands in the delta file. Unmatched sequences are encoded as an add command followed by the data to be added.]
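To make the copy and add directives concrete, the sketch below shows one possible in-memory form of a delta and the reconstruction step. The tuple layout and function name are illustrative assumptions for exposition only; the implementation evaluated later in the paper serializes deltas with Reichenberger codewords [8].

```python
# A minimal sketch (illustrative, not the paper's encoding) of a delta file as
# a list of copy/add directives and of reconstruction: base + delta -> version.

def apply_delta(base: bytes, delta) -> bytes:
    """Rebuild the version file from the base file and a delta.

    Each directive is either
      ("copy", offset, length) -- copy `length` bytes from `offset` in the base
      ("add", data)            -- append the literal bytes `data`
    """
    out = bytearray()
    for directive in delta:
        if directive[0] == "copy":
            _, offset, length = directive
            out += base[offset:offset + length]
        elif directive[0] == "add":
            _, data = directive
            out += data
        else:
            raise ValueError("unknown directive: %r" % (directive[0],))
    return bytes(out)


if __name__ == "__main__":
    base = b"the quick brown fox jumps over the lazy dog"
    # The version inserts "very " before "lazy": one copy, one add, one copy.
    delta = [("copy", 0, 35), ("add", b"very "), ("copy", 35, 8)]
    assert apply_delta(base, delta) == b"the quick brown fox jumps over the very lazy dog"
```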
Several potential applications of version differencing motivate the need for a compact and efficient differencing algorithm. An efficient algorithm could be used to distribute software over a low bandwidth network such as a modem or the Internet. Upon releasing a new version of software, the version could be differenced with respect to the previous version. With compact versions, a low bandwidth channel can effectively distribute a new release of dynamically self-updating software in the form of a binary patch. This technology has the potential to greatly reduce time to market for a new version and ease the distribution of software customizations. For replication in distributed file systems, differencing can reduce by a large factor the amount of information that needs to be updated by transmitting deltas for all of the modified files in the replicated file set.

In distributed file system backup and restore, differential compression would reduce the time to perform file system backup, decrease network traffic during backup and restore, and lessen the storage needed to maintain a backup image [7]. Backup and restore can be limited by both bandwidth on the network, often 10 MB/s, and poor throughput to secondary and tertiary storage devices, often 500 KB/s to tape storage. Since resource limitations frequently make backing up just the changes to a file system infeasible over a single night or even weekend, differential file compression has great potential to alleviate bandwidth problems by using available processor cycles to reduce the amount of data transferred. This technology can be used to provide backup and restore services on a subscription basis over any network, including the Internet.

Differencing has its origins in both longest common subsequence (LCS) algorithms and the string-to-string correction problem [13]. Some of the first applications of differencing updated the screens of slow terminals by sending a set of edits to be applied locally rather than retransmitting a screen full of data. Another early application was the UNIX diff utility, which used the LCS method to find and output the changes to a text file. diff was useful for source code development and primitive document control.

LCS algorithms find the longest common sequence between two strings by optimally removing symbols in both files, leaving identical and sequential symbols. (A substring contains all consecutive symbols between and including its first and last symbol, whereas a subsequence may omit symbols with respect to the corresponding string.) While the LCS indicates the sequential commonality between strings, it does not necessarily detect the minimum set of changes. More generally, it has been asserted that string metrics that examine symbols sequentially fail to emphasize the global similarity of two strings [4]. Miller and Myers [6] established the limitations of LCS when they produced a new file compare program that executed at four times the speed of the diff program while producing significantly smaller deltas. The edit distance [10] proved to be a better metric for the difference of files, and techniques based on this method enhanced the utility and speed of file differencing. The edit distance assigns a cost to edit operations such as delete a symbol, insert a symbol, and copy a symbol. For example, the LCS between strings xyz and xzy is xy, which neglects the common symbol z. Using the edit distance metric, z may be copied between the two files, producing a smaller change cost than LCS. In the string-to-string correction problem [13], an algorithm minimizes the edit distance to minimize the cost of a given string transformation.
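As a small worked illustration of the xyz/xzy example above (not taken from the paper), a standard dynamic-programming LCS of the two strings has length two, so one of the three shared symbols is necessarily lost, whereas a copy-based edit model can reuse all three.

```python
# A standard dynamic-programming LCS, shown only to illustrate the example
# above; it is not part of the differencing algorithms discussed in the paper.

def lcs(a: str, b: str) -> str:
    m, n = len(a), len(b)
    # dp[i][j] = length of the LCS of a[:i] and b[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    # Recover one LCS by walking back through the table.
    out, i, j = [], m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))

print(lcs("xyz", "xzy"))  # "xy": only two of the three shared symbols survive
```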
Tichy [10] adapted the string-to-string correction problem to file differencing using the concept of block move. Block move allows an algorithm to copy a string of symbols rather than an individual symbol. He then applied the algorithm to a source code revision control package and created RCS [11]. RCS detects the modified lines in a file and encodes a delta file by adding these lines and indicating lines to be copied from the base version of a file. We term this differencing at line granularity. The delta file is a line by line edit script applied to a base file to convert it to the new version. Although the SCCS version control system [9] precedes RCS, RCS generates "minimal" line granularity delta files and is the definitive previous work in version control.

While line granularity may seem appropriate for source code, the concept of revision control needs to be generalized to include binary files. This allows data, such as edited multimedia, to be revised with the same version control and recoverability guarantees as text. Whereas revision control is currently a programmer's tool, binary revision control systems will enable the publisher, film maker, and graphic artist to realize the benefits of strict versioning. It also enables developers to place bitmap data, resource files, databases, and binaries under their revision control system. Some previous packages have been modified to handle binary files, but in doing so they imposed an arbitrary line structure. This results in delta files that achieve little or no compression as compared to storing the versions uncompressed.

Recently, an algorithm appeared that addresses differential compression of arbitrary byte streams [8]. The algorithm modifies the work of Tichy [10] to work on byte-wise data streams rather than line oriented data. This algorithm adequately manages binary sources and is an effective developer's tool for source code control. However, the algorithm exhibits execution time quadratic in the size of the input, O(M × N) for files of size M and N. The algorithm also uses memory linearly proportional to the size of the input files, O(M + N). To find matches the algorithm implements the greedy method, which we will show to be optimal under certain constraints. The algorithm will then be used as a basis for comparison.

As we are interested in applications that operate on all data in a network file system, quadratic execution time renders differencing prohibitively expensive. While it is a well known result that the majority of files are small, less than 1 kilobyte [2], a file system has a minority of files that may be large, ten to hundreds of megabytes. In order to address the differential compression of large files, we devised a differencing algorithm that runs in both linear time, O(M + N), and constant space, O(1).

Section II outlines the greedy differencing algorithm, proves it optimal, and establishes that the algorithm takes quadratic execution time. Section III presents the linear time differencing algorithm. Section IV analyzes the linear time algorithm for run-time and compression performance. Section V presents an experimental comparison of the linear time algorithm and the greedy algorithm. We conclude in Section VI that the linear time algorithm provides near optimal compression and the efficient performance required for distributed applications.
II. GREEDY METHODS FOR FILE DIFFERENCING

Greedy algorithms often provide simple solutions to optimization problems by making what appears to be the best decision, the greedy decision, at each step. For differencing files, a greedy algorithm takes the longest match it can find at a given offset on the assumption that this match provides the best compression. Greedy makes a locally optimal decision with the hope that this decision is part of the optimal solution over the input. For file differencing, we prove that the greedy algorithm provides an optimal encoding of a delta file and show that the greedy technique requires time proportional to the product of the sizes of the input files. Then we present an algorithm which approximates the greedy algorithm in linear time and constant space by finding the match that appears to be the longest without performing an exhaustive search for all matching strings.

A. Examining Greedy Delta Compression

For our analysis, we consider delta files constructed by a series of editing directives: "add commands", followed by the data to be added, and "copy commands" that copy data from the base file into the version file using an offset and length (Figure 1). Given a base file and another version of that base file, the greedy algorithm for constructing differential files finds the longest copy in the base file from the first offset in the version. It then looks for the longest copy starting at the next offset. If, at a given offset, it cannot find a copy, the symbol at this offset is marked to be added and the algorithm advances to the following offset. For an example of a greedy differencing algorithm refer to the work of Reichenberger [8].

We now prove that the greedy algorithm is optimal for a simplified file encoding scheme. In this case an optimal algorithm produces the smallest output delta. For binary differencing, symbols in the file may be considered bytes and a file a stream of symbols. However, this proof applies to differencing at any granularity. We introduce and use the concept of cost to mean the length (in bits) for the given encoding of a string of symbols.

Claim: Given a base file B, a version of that base file V, and an alphabet of symbols Σ, we make the following assumptions:

- All symbols in the alphabet Σ appear in the base file.
- A copy of any length may be encoded with a unit cost c.
- Copying a string of length 1, with maximum cost c × 1, provides an encoding as compact as adding the same string.

Under these assumptions we can state:

Theorem 1: The greedy algorithm finds an optimal encoding of the version file V with respect to the base file B.

Proof: Since all symbols in the alphabet Σ appear in the base file B, a symbol or string of symbols in the version file V may be represented in a differential file D exclusively by a copy or series of copies from B. Since we have assumed a unit cost function for encoding all copies, and this cost is less than or equal to the cost of adding a symbol in the version file, there exists an optimal representation P of V with respect to B which only copies strings of symbols from B. In order to prove the optimality of a greedy encoding G, we require the intermediate result of Lemma 1.

Lemma 1: For an arbitrary number of copies encoded, the length of version file data encoded by the greedy encoding is greater than or equal to the length of data encoded by the optimal encoding.

Proof (by induction): We introduce p_i to be the length of the i-th copy in the optimal encoding P and g_i to be the length of the i-th copy in the greedy encoding G.

At file offset 0 in V, P has a copy command of length p_1. G encodes a matching string of length g_1, which is the longest string starting at offset 0 in V. Since G encodes the longest possible copy, g_1 ≥ p_1.

The lengths of version file data encoded in P and G after n copies are respectively given by

    p_1 + p_2 + ... + p_n  and  g_1 + g_2 + ... + g_n.    (2)

Given that G and P have encoded n − 1 copies and the current offset in G is greater than or equal to the current offset in P, we can conclude that after G and P encode an n-th copy, the offset in G for n copies is greater than or equal to the offset in P:

    g_1 + g_2 + ... + g_n ≥ p_1 + p_2 + ... + p_n.    (3)

G encodes a copy of length g_n and P encodes a copy of length p_n. If equation (3) did not hold, P would have found a copy of length p_n, starting at the offset reached by its first n − 1 copies, that extends past the offset reached by the first n copies of G. A substring of this copy would be a string starting at G's current offset of length greater than g_n. As G always encodes the longest matching string, in this case g_n, this is a contradiction and equation (3) must hold.

Having established Lemma 1, we conclude that the number of copy commands that G uses to encode V is less than or equal to the number of copies used by P. However, since P is an optimal encoding, the number of copies P uses to encode V is less than or equal to the number that G uses. We can therefore state that size(G) = size(P) = c × N, where N is the number of copy commands in the greedy encoding.

We have shown that the greedy algorithm provides an optimal encoding of a version file. Practical elements of the algorithm weaken our assumptions. Yet, the greedy algorithm consistently reduces files to near optimal and should be considered a minimal differencing algorithm.
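For concreteness, the following is a minimal sketch of the greedy strategy just described, reusing the illustrative copy/add tuples from the earlier sketch. It performs a brute-force longest-match search at every version offset purely for clarity; an efficient implementation, such as Reichenberger's [8], would index footprints of string prefixes instead, and the function name and structure here are assumptions, not the implementation analyzed below.

```python
# A minimal sketch of the greedy method: at each version offset, take the
# longest match found anywhere in the base file, otherwise add the byte.
# The brute-force search is kept only for clarity; a practical implementation
# indexes footprints (hashes of fixed-length string prefixes).

def greedy_delta(base: bytes, version: bytes):
    delta = []             # ("copy", offset, length) and ("add", data) tuples
    pending = bytearray()  # literal bytes awaiting a single add directive
    v = 0
    while v < len(version):
        best_off, best_len = 0, 0
        for b in range(len(base)):
            length = 0
            while (b + length < len(base) and v + length < len(version)
                   and base[b + length] == version[v + length]):
                length += 1
            if length > best_len:
                best_off, best_len = b, length
        if best_len > 0:
            if pending:                      # flush accumulated literals
                delta.append(("add", bytes(pending)))
                pending.clear()
            delta.append(("copy", best_off, best_len))
            v += best_len
        else:
            pending.append(version[v])       # no copy possible at this offset
            v += 1
    if pending:
        delta.append(("add", bytes(pending)))
    return delta
```

Under these conventions, feeding the output of greedy_delta to the apply_delta sketch shown earlier reproduces the version file exactly.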
B. Analysis of Greedy Methods

Common strings may be quickly identified as they also have common footprints. In this case a footprint is the value of a hash function over a fixed length prefix of a string. The greedy algorithm must examine all matching footprints and extend the matches in order to find the longest matching string. The number of matching footprints between the base and version file can grow with respect to the product of the sizes of the input files, i.e. O(M × N) for files of size M and N, and the algorithm uses time proportional to this value. In practice, many files elicit this worst case behavior. In both database files and executable files, binary zeros are stuffed into the file for alignment. This "zero stuffing" creates frequently occurring footprints which must all be examined by the algorithm.

Having found a footprint in the version file, the greedy algorithm must compare this footprint to all matching footprints in the base file. This requires it to maintain a canonical listing of all footprints in one file, generally kept by computing and storing a hash value over all string prefixes [8]. Consequently, the algorithm uses memory proportional to the size of the input, O(N), for a size N base file.

III. VERSIONING IN LINEAR TIME

Having motivated the need to difference all files in a file system, and understanding that not all files are small [2], we improve upon both the runtime performance bound and the runtime memory utilization of the greedy algorithm. Our algorithm intends to find matches in a greedy fashion but does not guarantee to execute greedy exactly.
A. A Linear Time Differencing Algorithm

The linear algorithm modifies the greedy algorithm in that it attempts to take the longest match at a given offset by taking the longest matching string at the first matching string prefix beyond the offset at which a previous match was encoded; we term this the next match policy. In many instances matching strings are sequential between file versions, i.e. they occur in the same order in both files. When strings that match are sequential, the next matching prefix approximates the best match extremely well. In fact this property holds for all changes that are insertions and deletions (Figure 2). We expect many files to exhibit this property, most notably mail, database, image and log files.

[Figure 2: Simple file edits consist of insertions, deletions and combinations of both (encoding an insertion, encoding a deletion, and encoding an insertion and a deletion). The linear time algorithm finds and encodes all modifications that meet the simple edit criteria.]

The linear time differencing algorithm takes as input two files, usually versions of each other, and using one hash table performs the following actions:

1. Start file pointers b_offset in the base file and v_offset in the version file at file offset zero. Store the start position in the version file as v_start. We call this state "hashing mode".

2. Create a footprint for each offset by hashing a prefix of bytes. For each footprint:

   (a) If there is no entry in the table at that footprint value, make an entry in the hash table. An entry indicates the file and offset from which the footprint was hashed.

   (b) If there is an entry at a footprint value and the entry is from the other version of the file, verify that the prefixes are identical. If the prefixes prove to be the same, matching strings have been found; continue with step 3.

   (c) If there is an entry at a footprint value and the entry is from the same file, retain the existing hash entry.

   Advance both v_offset and b_offset one byte, hash the prefixes, and repeat step 2.

3. Having found a match at step 2b, leave hashing mode and enter "identity mode". Given matching prefixes between some offset v_copy in the version file and some offset b_copy in the base file, match bytes forward in the files to find the longest match of length l. Set v_offset and b_offset to the ends of the match.

4. Encode the region of the version file from v_start to v_copy using an add codeword followed by the data to be added. Encode the region from v_copy to v_offset in the version file using a copy codeword encoding l, the length of the copy found, and b_copy, the offset in the base file.

5. Flush the hash table to remove the information about the files previous to this point. Set v_start to v_offset and repeat step 2.

By flushing the hash table, the algorithm enforces the next match policy. Note that a match can be between the current offset in one version of the file and a previous offset in the other version. After flushing the hash table, the algorithm effectively remembers the first instance of every footprint that it has seen since encoding the last copy. A sketch of this loop appears below.
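The sketch that follows walks through the five steps above in Python. It is illustrative rather than definitive: it substitutes Python's built-in hash of the raw prefix for a Karp-Rabin footprint, uses an ordinary dictionary (bounded only by flushing) in place of a fixed-size footprint table, assumes a prefix length of 16 bytes, and reuses the directive tuples from the earlier sketches.

```python
# A sketch of the hashing-mode / identity-mode loop described in steps 1-5.
# PREFIX, the dict-based table, and the use of Python's hash() are assumptions
# made for illustration, not the parameters of the authors' implementation.

PREFIX = 16  # footprint prefix length in bytes (an assumed value)

def linear_delta(base: bytes, version: bytes):
    delta = []
    table = {}        # footprint -> (which file, offset)
    b = v = 0         # b_offset, v_offset
    v_start = 0       # first version byte not yet encoded

    while v + PREFIX <= len(version) or b + PREFIX <= len(base):
        match = None
        # Step 2: footprint the prefix at each current offset and probe the table.
        for which, data, off in (("base", base, b), ("version", version, v)):
            if off + PREFIX > len(data):
                continue
            prefix = data[off:off + PREFIX]
            fp = hash(prefix)
            entry = table.get(fp)
            if entry is None:
                table[fp] = (which, off)                  # 2(a): remember it
            elif entry[0] != which:
                other = base if which == "version" else version
                if other[entry[1]:entry[1] + PREFIX] == prefix:
                    # 2(b): verified match between the two files
                    match = (off, entry[1]) if which == "version" else (entry[1], off)
                    break
            # 2(c): entry from the same file -> retain the existing entry
        if match is None:
            b += 1                                        # advance and rehash
            v += 1
            continue
        # Step 3: identity mode, extend the match forward byte by byte.
        v_copy, b_copy = match
        length = PREFIX
        while (b_copy + length < len(base) and v_copy + length < len(version)
               and base[b_copy + length] == version[v_copy + length]):
            length += 1
        # Step 4: emit the unmatched region as an add, then the copy.
        if v_copy > v_start:
            delta.append(("add", version[v_start:v_copy]))
        delta.append(("copy", b_copy, length))
        # Step 5: flush the table (next match policy) and resume hashing.
        b, v = b_copy + length, v_copy + length
        v_start = v
        table.clear()
    if v_start < len(version):
        delta.append(("add", version[v_start:]))          # trailing literals
    return delta


if __name__ == "__main__":
    old = b"A" * 400 + b"the middle section" + b"B" * 400
    new = b"A" * 400 + b"an edited middle section" + b"B" * 400
    out = bytearray()
    for cmd in linear_delta(old, new):
        out += old[cmd[1]:cmd[1] + cmd[2]] if cmd[0] == "copy" else cmd[1]
    assert bytes(out) == new
```

The check at the bottom rebuilds the version from the base and the emitted directives; a real delta would of course be serialized with copy and add codewords rather than kept as Python tuples.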
IV. ANALYSIS OF THE LINEAR TIME ALGORITHM

We often expect the changes between two versions of a file to be simple edits: insertions of information and deletions of information. This property implies that the common strings that occur in these files are sequential. An algorithm can then find all matching strings in a single pass over the input files. After finding a match, we can limit our search space for subsequent matches to only the file offsets greater than the end of the previous matching string.

Many files exhibit insert and delete only modifications, in particular mail files and database files. Mail files have messages deleted out of the middle of the file and data appended to the end. Relational database files operate on tables of records, appending records to the end of a table, modifying records in place, and deleting them from the middle of the table. System logs have an even more rigid format, as they are append only files.

When a match is found and the algorithm enters identity mode, if the match is not spurious (Section IV.B), the pointers are "synchronized", indicating that the current offset in the version file represents the same data as the offset in the base file. The algorithm's two phases, hashing and identity, represent the synchronization of file offsets and copying from synchronized offsets. When the identity test fails, the files differ and the file offsets are again "out of synch". Then, the algorithm enters hashing mode to regain the common location of data in the two files.

We selected the Karp-Rabin hashing function [5] for generating footprints as it can be calculated incrementally, i.e. a footprint may be evaluated from the footprint at the previous offset and the last byte of the current string prefix. This technique requires fewer operations when calculating the value of overlapping footprints sequentially. Our algorithm always hashes successive offsets in hashing mode and realizes significant performance gains when using this function.
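A minimal sketch of the incremental (rolling) footprint computation that makes Karp-Rabin attractive here: the footprint of the window starting at offset i + 1 is derived from the footprint at offset i in constant work. The radix and modulus below are arbitrary illustrative constants, not the parameters used by the authors.

```python
# Rolling (Karp-Rabin style) footprints over a byte string. B and M are
# illustrative constants only.

B = 257                # radix
M = (1 << 31) - 1      # modulus (a Mersenne prime keeps values word-sized)

def footprints(data: bytes, prefix_len: int):
    """Yield (offset, footprint) for every prefix_len-byte window of data."""
    if len(data) < prefix_len:
        return
    msb = pow(B, prefix_len - 1, M)   # weight of the byte that slides out
    fp = 0
    for byte in data[:prefix_len]:    # hash the first window from scratch
        fp = (fp * B + byte) % M
    yield 0, fp
    for i in range(1, len(data) - prefix_len + 1):
        fp = ((fp - data[i - 1] * msb) * B + data[i + prefix_len - 1]) % M
        yield i, fp

def hash_window(window: bytes) -> int:
    """From-scratch footprint of one window, used only to check the rolling form."""
    fp = 0
    for byte in window:
        fp = (fp * B + byte) % M
    return fp

data = b"abracadabra"
assert all(fp == hash_window(data[i:i + 4]) for i, fp in footprints(data, 4))
```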
A. Performance Analysis

The presented algorithm operates both in linear time and constant space. At all times, the algorithm maintains a hash table of constant size. After finding a match, hash entries are flushed and the same hash table is reused to find the next matching prefix. Since this hash table neither grows nor is deallocated, the algorithm operates in constant space, roughly the size of the hash table, on all inputs. Since the maximum number of hash entries does not necessarily depend on the file input size, the size of the hash table need not grow with the size of the file. The maximum number of hash entries is bounded by twice the number of bytes between the end of the previous copied string and the following matching prefix. On highly correlated files, we would expect a small maximum number of hash entries since we expect to find matching strings frequently.

The algorithm operates in time linear in the size of the input files as we are guaranteed to advance either the base file offset or the version file offset by one byte each time through the inner loop of the program. In identity mode, both the base file offset and the version file offset are incremented by one byte at each step, whereas in hashing mode, each time a new offset is hashed, at least one of the offsets is incremented, as matching prefixes are always found between the current offset in one file and a previous offset in the other. Therefore, identity mode proceeds through the input at as much as twice the rate of hashing mode. Furthermore, the byte identity function is far easier to compute than the Karp-Rabin [5] hashing function. On highly correlated files, we expect the algorithm to spend more time in identity mode than it would on less correlated versions. We can then state that the algorithm executes faster on more highly correlated inputs, and the linear algorithm operates best on its most common input, similar version files.

B. Sub-optimal Compression

The algorithm achieves less than optimal compression when the algorithm falsely believes that the offsets are synchronized, when the assumption that all changes between versions consist of insertions and deletions fails to hold, or when the implemented hashing function exhibits less than ideal behavior.

Due to the assumption of changes being only inserts and deletes, the algorithm fails to find rearranged strings. Upon encountering a rearranged string, the algorithm takes the next match it can find. This leaves some string in either the base file or in the version file that could be compressed and encoded as a copy, but will be encoded as an add, achieving no additional compression. In Figure 3, the algorithm fails to find the copy of tokens ABCD since the string has been rearranged. In this simplified example we have selected a prefix of length one. The algorithm encodes EFG as a copy and flushes the hash table, removing symbols ABCD that previously appeared in the base file. When hashing mode restarts, the match has been missed and will be encoded as an add.

The algorithm is also susceptible to spurious hash collisions as a result of taking the next match rather than the best match. These collisions indicate that the algorithm believes that it has found synchronized offsets between the files when in actuality the collision just happens to be between strings that match by chance. In Figure 3, the algorithm misses the true start of the string ABCDEF in the base file (the best match) in favor of a previous string (the next match). Upon detecting and encoding a "spurious" match, the algorithm achieves some degree of compression, just not the best compression. Furthermore, the algorithm never bypasses "synchronized offsets" in favor of a spurious match. This also follows directly from choosing the next match and not the best match. This result may be generalized: given an ideal hash function, the algorithm never advances the file offsets past a point of synchronization.

[Figure 3: Sub-optimal compression may be achieved due to the occurrence of spurious matches or rearranged strings. The encoded matches are shaded.]

Hashing functions are, unfortunately, not ideal. Consequently, the algorithm may also experience the blocking of footprints. For a new footprint, if there is another footprint from the same file already occupying that entry in the hash table, the second footprint is ignored and the first one retained. In this instance, we term the second footprint blocked. This is the correct procedure to implement the next match policy assuming that all footprints represent a unique string. However, hash functions generally hash a large number of inputs to a smaller number of keys and are therefore not unique. Strings that hash to the same value may differ, and the algorithm loses the ability to find strings matching the discarded string prefix.

Footprint blocking could be addressed by any rehash function or by hash chaining. However, this solution would destroy the constant space utilization bound on the algorithm. Instead of a rehash function, we propose to address footprint blocking by scanning both forwards and backwards in identity mode. This simple modification allows the algorithm to go back and find matches starting at a prefix that was hash blocked. The longer the matching string, the less likely that the match will be blocked, as this requires consecutive blocked prefixes. Under this solution, the algorithm still operates in constant space, and although matches may still be blocked, the probability of blocking a match decreases geometrically with the length of the match.
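A sketch of the proposed forward-and-backward scan in identity mode, under the same illustrative conventions as the earlier sketches; the parameter names are assumptions, and the backward loop stops at the first already-encoded version byte so that no region is encoded twice.

```python
# Extending a verified prefix match both forwards and backwards in identity
# mode. The backward scan lets a copy begin before the matched footprint, so a
# match whose true starting prefix was blocked in the hash table can still be
# recovered in full. v_start marks the first version byte not yet encoded.

def extend_match(base: bytes, version: bytes,
                 b_copy: int, v_copy: int,
                 prefix_len: int, v_start: int):
    # Forward scan, exactly as in plain identity mode.
    length = prefix_len
    while (b_copy + length < len(base) and v_copy + length < len(version)
           and base[b_copy + length] == version[v_copy + length]):
        length += 1
    # Backward scan toward lower offsets, stopping at the encoded boundary.
    while (b_copy > 0 and v_copy > v_start
           and base[b_copy - 1] == version[v_copy - 1]):
        b_copy -= 1
        v_copy -= 1
        length += 1
    return b_copy, v_copy, length
```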
V. EXPERIMENTAL RESULTS

We compared the Reichenberger [8] greedy algorithm against our linear time algorithm to experimentally verify the performance improvements and quantify the amount of compression realized. The algorithms were run against multiple types of data that are of interest to potential applications. Data include mail files, modified and recompiled binaries, and database files. Both algorithms, where appropriate, were implemented with the same facilities. This includes the use of the Reichenberger codewords for encoding copy and add commands in the delta file, memory mapped I/O, the use of the same prefix length for footprint generation, and the use of the Karp-Rabin hashing algorithm in both cases. Karp-Rabin hashing is used by the Reichenberger algorithm since it also realizes benefits from incremental hashing, by sequentially hashing one whole file before searching for common strings.

The linear algorithm outperforms the Reichenberger algorithm on all inputs, operating equally as fast on very small inputs and showing significant performance gains on all inputs larger than a few kilobytes. The performance curve of the Reichenberger algorithm grows quadratically with the file input size. The algorithm consistently took more than 10 seconds to difference a 1 MB file; extrapolating this curve to a 10 MB file, the algorithm would complete in slightly more than 15 minutes. Depending upon the machine, the linear algorithm can compress as much as several megabytes per second. Currently, the data throughput is I/O bound when performing the byte identity function and processor bound when performing the hashing function. The relative data rates are approximately 10 MB/s and 280 KB/s for identity and hashing mode respectively. These results were attained on an IBM 43P PowerPC with a 133 MHz processor and a local F/W SCSI hard drive on a 10 MB/s data bus.

In Figure 4a, the runtime performance of the Reichenberger algorithm grows in a quadratic fashion, whereas the linear time algorithm exhibits growth proportional to the file input size. We also show that our algorithm's execution time continues to grow linearly on large input files in Figure 4b. There is a high amount of variability in the time performance of the linear algorithm on a file of any given size, depending upon how long the algorithm spends in hashing mode as compared to identity mode.

[Table 1: The relative size of the delta files measured in percent size as compared to the version file. Total is the compression over the sum of all files. The "*" indicates a data set of files too large for the Reichenberger algorithm.]
Most of the data that we experimented on shows a high degree of compressibility, with some instances of databases showing the best compressibility. Our data sets (Table 1) include: Mail 1, electronic mail files less than 1 MB from a UNIX network file system; Mail 2, files greater than 1 MB from the same system; DB 1, the weekly backup of a university student information database; DB 2, the same data from a different pair of weeks; and Binary, all executable versions of both algorithms over their development cycle. The total compressibility represents the ratio between the sum of the sizes of the delta files and the sum of the sizes of the version files. Generally, the mean file size is larger than the total, indicating that larger files in any data set tend to be more compressible. The standard deviation indicates the volatility of the data set, with a high deviation showing data with more files that have been significantly altered.

Compressibility figures depend totally on the input data and are not meant to indicate that delta file compression achieves these kinds of results on all inputs. Rather, the data is, in our experience, representative of the compression that can be achieved on versions in a typical UNIX file system. We consistently noted that the data sets with larger files also tended to be more compressible. This is verified by the data in Table 1: Mail 2 consists of files larger than 1 MB and is 10% more compressible than the files in Mail 1, files less than 1 MB.

The linear algorithm consistently compressed data to within a small factor of the compression the greedy algorithm realizes. On all the mail files less than 1 MB, the linear algorithm achieved compression to less than 12% of the original size, whereas the Reichenberger algorithm compressed the files less than 3% more, to slightly under 9%. The relative compression of the algorithms is displayed in Figure 4c. Points on the unit slope line represent files that were compressed equally by both methods. The Reichenberger encoding is consistently equal to or more compact than the linear algorithm, but only by a small factor.

Experimental results indicate the suitability of our methods to many applications, as compression by a factor of 30 or more is feasible on many data sets. The results also indicate that the linear algorithm consistently performs well in compressing versions when compared with the greedy algorithm. The linear algorithm provides near optimal compression and does so in linear time.

[Figure 4: Experimental results. (a) A comparison of the algorithms' execution time performance on inputs of less than 2 MB. (b) The linear algorithm's execution time performance on inputs as large as 8 MB. (c) Relative compression of the linear algorithm with respect to the Reichenberger algorithm for delta files less than 100 KB.]

VI. SUMMARY AND CONCLUSIONS

We have described a differencing algorithm that executes in both linear time and constant space.
This algorithm executes significantly faster than the greedy algorithm and provides compression comparable to the greedy method, which has been shown to provide optimal compression. The linear algorithm approximates the greedy algorithm by taking the next matching string following the previous match, rather than exhaustively searching for the best match over the whole file. This next match policy corresponds highly with the best match when files are versions with insert and delete modifications. The algorithm enforces the next match policy by synchronizing pointers between two versions of a file to locate similar data.

Experiments have shown the linear time algorithm to consistently compress data to within a small percentage of the greedy algorithm and to execute significantly faster on inputs of all sizes. Results have also shown many types of data to exhibit high correlation among versions, and differencing can efficiently compress the representation of these files.

We envision a scalable differencing algorithm as an enabling technology that permits files of any size and format to be placed under version control, and allows the transmission of new versions of files over low bandwidth channels. File differencing can mitigate the transmission time and network traffic for any application that manages distributed views of changing data. This includes replicated file systems and distributed backup and restore. A technology that was previously relegated to source code control may be generalized with this algorithm and applied to address network resource limitations for distributed applications.

VII. ACKNOWLEDGMENTS

We wish to thank and credit Dr. Robert Morris of the IBM Almaden Research Center for his innovation in the application of delta compression. We also wish to thank Professor David Helmbold and the research group of Dr. Ronald Fagin for their assistance in verifying and revising our methods, Mr. Norm Pass for his support of this effort, and the University of California, Santa Cruz, which provided us with file system data.

VIII. REFERENCES

[1] ALDERSON, A. A space-efficient technique for recording versions of data. Software Engineering Journal 3, 6 (June 1988), 240-246.

[2] BAKER, M. G., HARTMAN, J. H., KUPFER, M. D., SHIRRIFF, K. W., AND OUSTERHOUT, J. K. Measurements of a distributed file system. In Proceedings of the 13th ACM Symposium on Operating Systems Principles (Oct. 1991).

[3] BLACK, A. P., AND BURRIS, C. H., JR. A compact representation for file versions: A preliminary report. In Proceedings of the 5th International Conference on Data Engineering (1989), IEEE, pp. 321-329.

[4] EHRENFEUCHT, A., AND HAUSSLER, D. A new distance metric on strings computable in linear time. Discrete Applied Mathematics 20 (1988), 191-203.

[5] KARP, R. M., AND RABIN, M. O. Efficient randomized pattern-matching algorithms. IBM Journal of Research and Development 31, 2 (1987), 249-260.

[6] MILLER, W., AND MYERS, E. W. A file comparison program. Software - Practice and Experience 15, 11 (Nov. 1985), 1025-1040.

[7] MORRIS, R. Conversations regarding differential compression for file system backup and restore, Feb. 1996.

[8] REICHENBERGER, C. Delta storage for arbitrary non-text files. In Proceedings of the 3rd International Workshop on Software Configuration Management, Trondheim, Norway, 12-14 June 1991 (June 1991), ACM, pp. 144-152.

[9] ROCHKIND, M. J. The source code control system. IEEE Transactions on Software Engineering SE-1, 4 (Dec. 1975), 364-370.

[10] TICHY, W. F. The string-to-string correction problem with block moves. ACM Transactions on Computer Systems 2, 4 (Nov. 1984).

[11] TICHY, W. F. RCS - A system for version control. Software - Practice and Experience 15, 7 (July 1985), 637-654.

[12] TSOTRAS, V., AND GOPINATH, B. Optimal versioning of objects. In Proceedings of the Eighth International Conference on Data Engineering, Tempe, AZ, USA, 2-3 Feb. 1992 (Feb. 1992), IEEE, pp. 358-365.

[13] WAGNER, R., AND FISCHER, M. The string-to-string correction problem. Journal of the ACM 21, 1 (Jan. 1974), 168-173.