Waterman - 1984 - Efficient Sequence Alignment Algorithms
Waterman - 1984 - Efficient Sequence Alignment Algorithms
Waterman - 1984 - Efficient Sequence Alignment Algorithms
(1984) 108,333-337
Introduction
With the advent of rapid methods for determining nucleic acid sequences,
there is increased interest in computer methods for comparing these sequen-
ces. Sequencing is estimated to be proceeding at the rate of lo6 bases per
year and various data bases are being structured to organize the sequences
and attendant information. Relevant portions of the data base are searched
for sequences similar to ones recently determined; and rapid, efficient, and
meaningful algorithms are necessary.
The Needleman & Wunsch (1970) algorithm was the first rapid method
in the biological literature for determining sequence homology and was
followed by the metrics of Sankoff (1972) and Sellers (1974a,b)for finding
the distance between two sequences. Kruskal(l983) recently presented an
extensive review of these and related methods and applications.
Sellers' work was generalized to include multiple insertion/deletions by
Waterman, Smith & Beyer (1976). A review of the use of these techniques
' < in sequence analysis was given by Smith, Fitch & Waterman (1981). Fitch
* & Smith (1983) recently studied these algorithms for a region of DNA
coding for alpha and beta hemoglobin in chicken, where the alignment is
assumed known by comparison of the many corresponding protein sequen-
I
ces, and they determined that a specific range of weights for multiple
insertion/deletions were necessary to obtain correct alignments. Therefore,
it is important to include these in application algorithms.
334 M . S. W A T E R M A N
Linear Insertion/Deletions
To make the discussion specific, the algorithm of Waterman et al. (1976)
will now be presented. a = alaz . . . a, and b = blb2 . . . b,,, are the sequences
of interest and D(a, b) will denote the (minimum) distance between a and
b where d ( x , y) is a dissimilarity measure between the sequence elements,
and deletions of length k are assigned weight w(k). We take Die= w(i),
Do, = w( j ) and D,, = D(alaz. . . a,, bl b2 . . . b,) and proceed by
D,, = min { D,-l,l-l+ d ( a,, b,), min { D,,,-k+ w ( k): 1I k I j } ,
min{Dz-l,l+w(l): l s l s i } } .
Of course, the final result is D,,,,,,= D(a, b). The computational efficiency
of the algorithm is of order
( i +j ) = O(n’m + m’n)
1.1
Concave Insertion/Deletions
The intuitive argument for multiple insertion/deletion functions is that .
a deletion of fourteen bases (say) should not be thought of as fourteen
independent single deletions but as one deletion event which has weight
less than the sum the weights of fourteen single deletions. This reasoning
implies a general inequality
w(k+l)s w(k)+w(l), k, I r l .
It is possible to give a slightly improved set of inequalities which proves
useful in this problem. If we agree that bases are increasingly easier to
EFFICIENT S E Q U E N C E A L I G N M E N T ALGORITHMS 335
insert (delete) as the insertion (deletion) length grows, then the resulting
inequalities are
w ( m + k + I ) - w ( m + k ) S w(k+I)-w(k), k,l,m>l.
If equality is required, then linearity follows:
w ( k ) = a + b ( k - 1).
The assumption of linearity is now transparent; w ( k ) linear means that
after the first deletion (insertion) each successive addition of a base has
equal cost. We argue that strict inequality could be preferable. Since the
function w is increasing and has decreasing differences, it is concave down-
ward, here simply referred to as concave. Such functions as
w(k)=a+blog(k), a, b>O
are concave and have intuitive appeal.
Next we make the assumption of linearity, w ( k) = a + b( k - 1) and review
Gotoh’s proof. He presents the above recursion for Dij in the form
Dij=min {Di-l,j-l+d(ai, bj), FiSj},
where
Ei,j= min {DLj-k + w( k): 1 Ik 5 j } ,
Fi,j= min { Di-si+ w( 1 ) : 1 5 1 Ii},
where E,,,, = Foo= Doo= 0,Eio= Dio= w ( i ) , Foj = DOj= w ( j ) . Gotoh then
observes
E,j=min{Di,j-l+a,min{Di,j--k+w(k): 2 5 k ~ j } }
+ w(l+ 1): 11
=min {Dhj-l+a, min {DLj-l-l I5 j - 1))
+
= min { Di,j-l a, min { Di,j-l-I + w (1 ) : 1 I1 5 j - 1) + b}
= min {Di,j-l+a, Ei,j-l + b}.
Of course
Fi,j=min {Di-l,j+a,Fi-l,j+b}
and the algorithm is of order O(nm) or O(n’) in case n = m.
Y Now we study the problem of e.fficient algorithms for w concave. First
of all, set
Ei,j=min{D,k+ w ( j - k ) : k ~ [ O , j - l ] } .
This means that, for some 1 E [0, j - 13,
Di,,+W(j-I)aDi,k+W(j-k), Osksj-1.
336 M. S . W A T E R M A N
Conclusion
More development of sequence comparison algorithms will occur in the
near future. The rapidly increasing data base will force those working on
algorithms for molecular biology to construct algorithms that are more
efficient and that answer increasingly more special and involved questions.
We are, for example, applying our results on concave deletion functions to
algorithms for RNA secondary structure. One of the interesting aspects
will be the effects that molecular biology and computer science will have
as they continue to communicate and cooperate.
EFFICIENT S E Q U E N C E A L I G N M E N T ALGORITHMS 337
The Department of Biochemistry and Biophysics of the University of California
at San Francisco and System Development Foundation are each gratefully acknow-
ledged for providing partial support of this work.
REFERENCES
FITCH,W. M. & SMITH,T. F. (1983).Roc. natn. Acad Sci. U.S.A. 80, 1382.
GOTOH, 0.(1982),J. mol. Biol. 162, 705.
GOAD, W. B. & KANEHISA,M. I. (1982).Nucleic Acids Res. 10, 247.
KRUSKAL,J. B. (1983).SZAM Rev. 25,201.
NEEDLEMAN,S . & WUNSCH, C. (1970).J. mol. Biol. 42,245.
SANKOFF,D.(1972).Roc. natn. Acad. Sci U.S.A. 69,4.
, SMITH,T . F., FITCH, W. M. & WATERMAN,M. S. (1981).J. mol. Evol. 18, 38.
SELLERS,P. (1974a).J. Comb. Theory Ser. A 16, 253.
SELLERS,P. (1974b).J. SZAM 26,787.
WATERMAN,M.S., SMITH,T. F. & BEYER, W. A. (1976).Adv. Math. 20,367.
WATERMAN,M.S. & SMITH,T. F. (1978).Math. Bio. Sci 42,257.