Waterman - 1984 - Efficient Sequence Alignment Algorithms

Download as pdf or txt
Download as pdf or txt
You are on page 1of 5

J. theor. Biol.

(1984) 108,333-337

Efficient Sequence Alignment Algorithms


MICHAELS. WATERMAN
Department of Mathematics, University of Southern California, Los
Angeles, California 90089-1113, W.S.A.
(Received 23 December 1982, and in revised form 25 July 1983)
Sequence alignments are becoming more important with the increase of
nucleic acid data. Fitch and Smith have recently given an example where
multiple insertion/deletions (rather than a series of adjacent single inser-
tion/deletions) are necessary to achieve the correct alignment. Multiple
insertionJdeletions are known to increase computation time from O(n 2 )
to O ( n 3 ) although Gotoh has presented an O(n2)algorithm in the case
the multiple insertion/deletion weighting function is linear. It is argued in
this paper that it could be desirable to use concave weighting functions.
For that case, an algorithm is derived that is conjectured to be O ( n 2 ) .

Introduction
With the advent of rapid methods for determining nucleic acid sequences,
there is increased interest in computer methods for comparing these sequen-
ces. Sequencing is estimated to be proceeding at the rate of lo6 bases per
year and various data bases are being structured to organize the sequences
and attendant information. Relevant portions of the data base are searched
for sequences similar to ones recently determined; and rapid, efficient, and
meaningful algorithms are necessary.
The Needleman & Wunsch (1970) algorithm was the first rapid method
in the biological literature for determining sequence homology and was
followed by the metrics of Sankoff (1972) and Sellers (1974a,b)for finding
the distance between two sequences. Kruskal(l983) recently presented an
extensive review of these and related methods and applications.
Sellers' work was generalized to include multiple insertion/deletions by
Waterman, Smith & Beyer (1976). A review of the use of these techniques
' < in sequence analysis was given by Smith, Fitch & Waterman (1981). Fitch
* & Smith (1983) recently studied these algorithms for a region of DNA
coding for alpha and beta hemoglobin in chicken, where the alignment is
assumed known by comparison of the many corresponding protein sequen-
I
ces, and they determined that a specific range of weights for multiple
insertion/deletions were necessary to obtain correct alignments. Therefore,
it is important to include these in application algorithms.
334 M . S. W A T E R M A N

In this paper we review recent work of Gotoh (1982) who presented an


O(n’) algorithm for the case of linear insertion/deletion functions. An
extension to concave insertion/deletion functions is obtained here which
has computation complexity close to O(n’).

Linear Insertion/Deletions
To make the discussion specific, the algorithm of Waterman et al. (1976)
will now be presented. a = alaz . . . a, and b = blb2 . . . b,,, are the sequences
of interest and D(a, b) will denote the (minimum) distance between a and
b where d ( x , y) is a dissimilarity measure between the sequence elements,
and deletions of length k are assigned weight w(k). We take Die= w(i),
Do, = w( j ) and D,, = D(alaz. . . a,, bl b2 . . . b,) and proceed by
D,, = min { D,-l,l-l+ d ( a,, b,), min { D,,,-k+ w ( k): 1I k I j } ,
min{Dz-l,l+w(l): l s l s i } } .
Of course, the final result is D,,,,,,= D(a, b). The computational efficiency
of the algorithm is of order
( i +j ) = O(n’m + m’n)
1.1

which, when n = m,is O ( n 3 ) .


Paul Haeberli of the University of Wisconsin brought it to our attention
that this algorithm can be improved to O(n’) when w ( k) is a linear function
of k, although he did not carefully state the condition on w(k) or give a
complete proof. Much the same observation was made by Goad & Kanehisa
(1982) regarding the secondary structure algorithm of Waterman & Smith
(1978). More recently Gotoh (1983) gave a complete and clear proof for
the new algorithm.

Concave Insertion/Deletions
The intuitive argument for multiple insertion/deletion functions is that .
a deletion of fourteen bases (say) should not be thought of as fourteen
independent single deletions but as one deletion event which has weight
less than the sum the weights of fourteen single deletions. This reasoning
implies a general inequality
w(k+l)s w(k)+w(l), k, I r l .
It is possible to give a slightly improved set of inequalities which proves
useful in this problem. If we agree that bases are increasingly easier to
EFFICIENT S E Q U E N C E A L I G N M E N T ALGORITHMS 335
insert (delete) as the insertion (deletion) length grows, then the resulting
inequalities are
w ( m + k + I ) - w ( m + k ) S w(k+I)-w(k), k,l,m>l.
If equality is required, then linearity follows:
w ( k ) = a + b ( k - 1).
The assumption of linearity is now transparent; w ( k ) linear means that
after the first deletion (insertion) each successive addition of a base has
equal cost. We argue that strict inequality could be preferable. Since the
function w is increasing and has decreasing differences, it is concave down-
ward, here simply referred to as concave. Such functions as
w(k)=a+blog(k), a, b>O
are concave and have intuitive appeal.
Next we make the assumption of linearity, w ( k) = a + b( k - 1) and review
Gotoh’s proof. He presents the above recursion for Dij in the form
Dij=min {Di-l,j-l+d(ai, bj), FiSj},
where
Ei,j= min {DLj-k + w( k): 1 Ik 5 j } ,
Fi,j= min { Di-si+ w( 1 ) : 1 5 1 Ii},
where E,,,, = Foo= Doo= 0,Eio= Dio= w ( i ) , Foj = DOj= w ( j ) . Gotoh then
observes
E,j=min{Di,j-l+a,min{Di,j--k+w(k): 2 5 k ~ j } }
+ w(l+ 1): 11
=min {Dhj-l+a, min {DLj-l-l I5 j - 1))
+
= min { Di,j-l a, min { Di,j-l-I + w (1 ) : 1 I1 5 j - 1) + b}
= min {Di,j-l+a, Ei,j-l + b}.
Of course
Fi,j=min {Di-l,j+a,Fi-l,j+b}
and the algorithm is of order O(nm) or O(n’) in case n = m.
Y Now we study the problem of e.fficient algorithms for w concave. First
of all, set
Ei,j=min{D,k+ w ( j - k ) : k ~ [ O , j - l ] } .
This means that, for some 1 E [0, j - 13,
Di,,+W(j-I)aDi,k+W(j-k), Osksj-1.
336 M. S . W A T E R M A N

If k 2 1, w ( j - I + 1) - w ( j - I ) Iw( j - k + 1 ) - w( j - k), and


Di,/+ w ( j - I ) + w ( j - I + 1 )- w ( j - I)
5 Di,k +W(j- k)+ W ( j - k+ 1 ) - W ( j - k)

The utility of the last equation is that minimization can be reduced to


those Dl, associated with
S ( i )= { I : =
EIJ+1 D,J+w ( 1 ) ) .
In fact
E4,= min { D b k + w ( j - k): k E S ( i ) } .
The computational complexity of this algorithm depends on the rate of
growth of S(i) as a function of sequence length. On a row of length m,
how many times will length one deletions be optimal? We conjecture that
this function does not grow faster than log (m).
For further improvement of the algorithm, notice that whenever
Dl,,-1 + W ( 1) < D l , k + W ( j - k),j - 1 > k E s(i),
it is possible to calculate when D1.k can be superior to Dl,,-l.That is, solve
Dl,,-l+ w( 1 + h ) = D1.k + w ( j - k + h ) .
If j + h > m,then D1.k never need be considered again and k can be removed
from S ( i ) . Otherwise, D l k need not be considered until calculation of El,,+,,.
While we cannot establish the computational complexity of this last result
on first principles, the practical implication is an O ( n 2 )algorithm.

Conclusion
More development of sequence comparison algorithms will occur in the
near future. The rapidly increasing data base will force those working on
algorithms for molecular biology to construct algorithms that are more
efficient and that answer increasingly more special and involved questions.
We are, for example, applying our results on concave deletion functions to
algorithms for RNA secondary structure. One of the interesting aspects
will be the effects that molecular biology and computer science will have
as they continue to communicate and cooperate.
EFFICIENT S E Q U E N C E A L I G N M E N T ALGORITHMS 337
The Department of Biochemistry and Biophysics of the University of California
at San Francisco and System Development Foundation are each gratefully acknow-
ledged for providing partial support of this work.

REFERENCES

FITCH,W. M. & SMITH,T. F. (1983).Roc. natn. Acad Sci. U.S.A. 80, 1382.
GOTOH, 0.(1982),J. mol. Biol. 162, 705.
GOAD, W. B. & KANEHISA,M. I. (1982).Nucleic Acids Res. 10, 247.
KRUSKAL,J. B. (1983).SZAM Rev. 25,201.
NEEDLEMAN,S . & WUNSCH, C. (1970).J. mol. Biol. 42,245.
SANKOFF,D.(1972).Roc. natn. Acad. Sci U.S.A. 69,4.
, SMITH,T . F., FITCH, W. M. & WATERMAN,M. S. (1981).J. mol. Evol. 18, 38.
SELLERS,P. (1974a).J. Comb. Theory Ser. A 16, 253.
SELLERS,P. (1974b).J. SZAM 26,787.
WATERMAN,M.S., SMITH,T. F. & BEYER, W. A. (1976).Adv. Math. 20,367.
WATERMAN,M.S. & SMITH,T. F. (1978).Math. Bio. Sci 42,257.

You might also like