New Upper Bounds on Various String Manipulation Problems

Christos Makris 1,3, Yannis Panagis 1,2, Katerina Perdikuri 1,2, Spyros Sioutas 1,2, Evangelos Theodoridis 1,2, Athanasios Tsakalidis 1,2, and Kostas Tsichlas 1,2

1 Department of Computer Engineering and Informatics, University of Patras, 26500 Patras, Greece
{makri, perdikur, panagis, sioutas, theodori, tsihlas}@ceid.upatras.gr
2 Research Academic Computer Technology Institute, 61 Riga Feraiou Str., 26221 Patras, Greece
[email protected]
3 Department of Applied Informatics in Management & Finance, Technological Educational Institute of Mesolonghi, Mesolonghi, Greece
[email protected]
Abstract. In this chapter we deal with various string manipulation problems originating from the fields of computational biology and musicology. These problems are: "approximate string matching with gaps", "inference of maximal pairs in a set of strings" and "handling of weighted sequences". We provide new upper bounds for solving these problems and, for the third, we propose a novel data structure for the representation of weighted sequences, which inherits most of the properties of the suffix tree.
1 Introduction
String Manipulation Techniques arise in a variety of practical applications such as word processors, information retrieval systems, molecular sequence databases and music analysis programs. In this chapter we focus on the application of well-known data structures to solving string manipulation problems efficiently.
In the following paragraphs we will introduce the Approximate Matching Problem with Gaps, the Model Inference Problem in Multiple Strings
and the Weighted Suffix Tree, an efficient data structure for solving string
manipulation problems in weighted sequences.
The practical importance of these data structuring applications appears in the fields of computerized music analysis and computational biology. The algorithms to be presented can be easily used in the analysis of
musical works in order to discover similarities between different musical
entities that may lead to establishing a “characteristic signature” [3, 4].
This can be accomplished by noticing that a musical score can be represented as a string and by defining the alphabet to be the set of notes
in the chromatic or diatonic notation or the set of intervals that appear
between notes.
On the other hand, two of the most important goals in computational molecular biology are finding regularities in nucleic or protein sequences, and finding features that are common to a set of such sequences.
Both imply inferring patterns, unknown at first, from one or more strings.
In Computational Biology, DNA and Protein sequences can be seen as
long texts over specific alphabets. When dealing with DNA sequences the
alphabet consists of the four nucleotides, while in the case of protein sequences, the alphabet consists of the twenty amino acids. Those sequences
represent the genetic code of living beings. Searching specific sequences
over those texts appears as a fundamental operation for problems such
as assembling the DNA chain from the pieces obtained by experiments,
looking for given DNA chains or determining how different two genetic
sequences are.
Regularities in molecular sequences may come in many guises.
They may correspond to approximate repetitions randomly dispersed
along the sequence, or to repetitions that occur in a periodic or approximately periodic fashion, or else to tandem arrays. The length and number
of repeated elements one wishes to be able to identify may be highly variable. Patterns common to a set of sequences may likewise present diverse
forms. For various problems in molecular biology, in particular the study
of gene expression and regulation, it is important to be able to infer what
has been called "structured patterns". Structured patterns make it possible to identify conserved elements recognized by different parts of the same protein
or macromolecular complex, or by various complexes that then interact
with each other.
In molecular biology the binding site of a regulatory protein can be modeled as a weighted sequence. Each base in a candidate motif instance makes some positive, negative or neutral contribution to the binding stability of the DNA-protein complex. The weights assigned to each character can be thought of as modeling those effects. If the sum of the individual contributions is greater than a threshold, the DNA-protein complex can be considered stable enough to be functional. In this chapter we present the weighted suffix tree, a data structure for handling weighted sequences.
In the next paragraph we give the basic definitions to be used in the
following sections.
1.1 Basic Definitions
A string is a sequence of zero or more symbols drawn from an alphabet Σ. The set of all nonempty strings over the alphabet Σ is denoted by Σ+. A string x of length n is represented by x1..n = x1 x2 · · · xn, where xi ∈ Σ for 1 ≤ i ≤ n, and n = |x| is the length of x. The empty string is the empty sequence (of zero length) and is denoted by ε; we write Σ∗ = Σ+ ∪ {ε}. The string xy is the concatenation of two strings x and y. The concatenation of k copies of x is denoted by xk and is called the kth power of x.
A string w is a substring of x if x = uwv for some u, v ∈ Σ∗. A string w is a prefix of x if x = wu for some u ∈ Σ∗, and a proper prefix if u ∈ Σ+. Similarly, w is a suffix of x if x = uw for some u ∈ Σ∗. A string u that is both a proper prefix and a suffix of x is called a border of x. If x has a nonempty border, it is called periodic; otherwise, x is said to be primitive.
Definition 1. Given a string p called the pattern and a longer string t,
called the text, the exact pattern matching problem is to find all occurrences, if any, of pattern p inside the text t.
According to the above definition, in the problem of exact pattern matching one is interested in finding all occurrences of a given pattern ("structured" or "non-structured") in a given input sequence. A "non-structured" pattern p is a string of length m, while a "structured" pattern can be defined as an ordered collection of k "boxes" Bi and k − 1 intervals of distances, called gaps, one between each pair of successive boxes (see Fig. 1). Each gap gi can have a minimum value mini and a maximum value maxi, or a fixed length.
Fig. 1. A structured pattern model: boxes B1, B2, . . . , Bk separated by gaps g1, . . . , gk−1, where each gap gi ranges between mini and maxi.
When we consider the approximate version of this problem we do not
require a perfect matching but a matching that is good enough to satisfy
certain criteria. The problem of finding substrings in a text similar to
a given pattern has been extensively studied in recent years because it
has a variety of applications including file comparison, spelling correction,
information retrieval, searching for similarities among biosequences and
computerized music analysis.
One of the most common variants of the approximate string matching problem is that of finding substrings that match the pattern with at most k differences. In this case, k defines the approximation extent of the matching (the edit distance with respect to the three edit operations: substitution, insertion and deletion).
Definition 2. Given a pattern p, a text t, an integer k and an edit distance function d(), the approximate pattern matching problem is to find the set of all text positions j such that there exists i with d(p, ti..j) ≤ k.
Considering the above problem under a variety of similarity or distance rules (e.g., the Hamming distance), we can define a set of related problems. The basic idea in approximate pattern matching is to locate occurrences of a pattern that can tolerate certain ranges of error. The notion of error is defined either locally or globally, as follows.
Definition 3. Let δ and γ be integers. Two symbols a, b of an alphabet Σ are said to be δ-approximate, denoted a =δ b, if and only if |a − b| ≤ δ. In that manner, two strings x, y are δ-approximate, denoted x =δ y, if and only if |x| = |y| and xi =δ yi for all i. Also, two strings x, y are γ-approximate, denoted x =γ y, if and only if |x| = |y| and Σ_{i=1}^{|x|} |xi − yi| ≤ γ. Finally, we say that two strings are (δ, γ)-approximate if both conditions are satisfied.
The error in the first case (δ-approximation) is defined locally for
each symbol. In the second case (γ-approximation) the error is defined
in a more global sense, allowing an uneven distribution of the error among the symbols. Efficient algorithms for approximately matching patterns to
text strings are given in [3, 4, 7].
Another set of problems arises when we allow gaps in the approximate matching problem. The problem of pattern matching with gaps was
introduced in [5] and is defined as follows.
Definition 4. Given a pattern p and a text t, find all occurrences of p in t such that pi = t_{j_i} for all i ∈ {1, . . . , m}, where m is the length of p. Note that p occurs at position j1 of t with a gap sequence G = (g1, g2, . . . , gm−1), where gi = j_{i+1} − j_i for all i ∈ {1, . . . , m − 1} and j1 < j2 < · · · < jm.
The different versions of the problem of matching with gaps result
from the different constraints posed on the structure of the gaps. In all
the above cases the pattern p is given. In the case where the pattern p is
not given we consider the Model Identification Problem.
Definition 5. Given a set of strings S = {S1 , S2 , · · · , Sk } we seek a
“structured pattern” P that occurs in every Si , ∀i ∈ {1, · · · , k}. In the
special case where the pattern P consists of two identical boxes B (with
various restrictions on gaps) we address the Maximal Pairs Identification
Problem.
The suffix tree is a fundamental data structure supporting a wide
variety of efficient string searching algorithms. In particular, the suffix
tree is well known to allow efficient and simple solutions to many problems
concerning the identification and location either of a set of patterns or
repeated substrings (contiguous or not) in a given sequence. The reader
can find an extended literature on such applications in [9].
Definition 6. We define the suffix tree T(S) of S as the compressed trie of all the suffixes of S$, where $ ∉ Σ. Let L(v) denote the path-label of node v in T(S), which results by concatenating the edge labels along the path from the root to v. Leaf v of T(S) is labeled with index i iff L(v) = Si..n. We define the leaf-list LL(v) of v as a list of the leaf-labels in the subtree below v.
Linear time algorithms for suffix tree construction are presented in [13, 14].
In the case of weighted sequences we consider, for each position of a word w, the presence of a set of characters, each with a given probability of appearance. Thus we define the concept of a weighted word w as follows:

Definition 7. A weighted word w = w1, . . . , wn is a sequence of positions, where each position wi consists of a set of ordered pairs. Each pair has the form (s, πi(s)), where πi(s) is the probability of having the character s at position i. For every position wi, 1 ≤ i ≤ n, Σs πi(s) = 1.
For example, if we consider the DNA alphabet Σ = {A,C,G,T} the
word w shown in Fig. 2 represents a word having 11 letters: the first
four are definitely ACTT, the fifth can be either A or C each with 0.5
probability of appearance, letters 6 and 7 are T and C, and letter 8
can be A, C or T with probabilities 0.5, 0.3 and 0.2 respectively, and finally letters 9 to 11 are T. Some of the words that can be produced are w1 = ACTTATCATTT, w2 = ACTTCTCATTT, etc. The probability of presence of a word is the cumulative probability, which is calculated by multiplying the respective probabilities of appearance of each character in every position. For the above example, π(w1) = π1(A) ∗ π2(C) ∗ π3(T) ∗ · · · ∗ π11(T) = π5(A) ∗ π8(A) = 0.25. Similarly, π(w2) = π5(C) ∗ π8(A) = 0.25. The definition of substring can easily be extended to accommodate weighted substrings.

Position: 1 2 3 4    5      6 7    8      9 10 11
          A C T T (A, 0.5)  T C (A, 0.5)  T  T  T
                  (C, 0.5)      (C, 0.3)
                                (T, 0.2)

Fig. 2. Example of a weighted word with three weighted positions; pairs with probability 0, such as (G, 0), are omitted. Positions consisting of a single character indicate that this character appears with probability 1.
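To make the cumulative-probability computation concrete, here is a minimal Python sketch; the dictionary-based encoding of the weighted word is an assumption made for illustration, not a format prescribed in this chapter.

```python
# A hypothetical encoding of the weighted word of Fig. 2:
# one dict {character: probability} per position.
w = [{'A': 1.0}, {'C': 1.0}, {'T': 1.0}, {'T': 1.0},
     {'A': 0.5, 'C': 0.5},
     {'T': 1.0}, {'C': 1.0},
     {'A': 0.5, 'C': 0.3, 'T': 0.2},
     {'T': 1.0}, {'T': 1.0}, {'T': 1.0}]

def word_probability(word, w):
    """Cumulative probability that the weighted word w spells out `word`:
    the product of the per-position probabilities."""
    p = 1.0
    for c, pos in zip(word, w):
        p *= pos.get(c, 0.0)  # a character absent from a position has probability 0
    return p

print(word_probability("ACTTATCATTT", w))  # 0.25, as computed for w1 above
print(word_probability("ACTTCTCATTT", w))  # 0.25, as for w2
```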
2 Approximate Matching with Gaps
In this section we present algorithms for various versions of the problem of approximate matching with gaps. These different versions of the problem, which arise from the different constraints posed on the structure of the gaps, were introduced in [5, 6] and are the following: (i) δ-occurrence and (δ, γ)-occurrence with α-bounded gaps, (ii) δ-occurrence minimizing the total difference of gaps, (iii) δ-occurrence and (δ, γ)-occurrence with ε-bounded-difference gaps and (iv) δ-occurrence of a set of strings with bounded gaps. In the relevant literature [5, 6], algorithms for the following problems are also considered: (i) δ-occurrence and (δ, γ)-occurrence with strictly bounded gaps, (ii) δ-occurrence and (δ, γ)-occurrence with unbounded gaps and (iii) δ-occurrence minimizing the total sum of gaps. We will not discuss these techniques here.
2.1 δ-Occurrence and (δ, γ)-Occurrence with α-Bounded Gaps
The δ-Occurrence with α-Bounded Gaps problem is defined as follows.
Definition 8. Given a text string t = t1 , . . . , tn , a pattern p = p1 , . . . , pm
and integers α, δ, check whether there is a δ-occurrence of p in t with gaps
whose sizes are bounded by constant α.
The problem is solved by employing an incremental procedure based on dynamic programming. First, we define the set of all non-empty prefixes of pattern p to be Prefixes(p) = {π1, π2, . . . , πm}, where πi = p1..i. The proposed algorithm computes the entries of a matrix D0..m,0..n. The entry Di,j of the matrix designates the position of the last δ-occurrence with α-bounded gaps of prefix πi up to position j in text t; otherwise it has the value 0. The entries of the matrix are computed as follows:
Di,j = j,       if (tj =δ pi) and (j − Di−1,j−1 ≤ α + 1) and (Di−1,j−1 > 0);
Di,j = Di,j−1,  if (tj ≠δ pi) and (j − Di,j−1 < α + 1);
Di,j = 0,       otherwise.
If Di,j = j, then there is a match between tj and pi while the prefix
πi−1 has a δ-occurrence at a position given by Di−1,j−1 and the formed
gap is ≤ α. If Di,j = Di,j−1 then there is no match between tj and pi .
This means that it is not possible to extend the δ-occurrence of prefix
πi−1 to πi and so we store in Di,j the previous value of Di,j−1 as long as
the gap invariant is not violated. In any other case we store in Di,j the
value 0. The boundary conditions of matrix D are as follows:
D0,0 = 1, Di,0 = 0 and D0,j = j
The above algorithm runs in O(mn) time and uses O(mn) space. The space can be reduced to O(n), since the computation of each row depends only on the previous row. If we want to retrieve a match, we
can use the matrix D and perform a trace-back procedure. The time
complexity of this procedure is O(m) while the space complexity remains
O(mn). Another option would be to use Hirschberg’s divide and conquer
technique [10]. In this case the space complexity is reduced to O(m + n).
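The recurrence can be realized row by row, as in the Python sketch below; the function name and the integer encoding of symbols (so that tj =δ pi becomes |tj − pi| ≤ δ) are our own illustration choices, and only the end positions of matches are reported, not the trace-back.

```python
def delta_occurrences_bounded_gaps(t, p, delta, alpha):
    """DP of Definition 8: return the 1-based text positions j where a
    delta-occurrence of p with alpha-bounded gaps ends.  Symbols are
    integers; only two rows of D are kept, giving O(n) working space."""
    n, m = len(t), len(p)
    prev = list(range(n + 1))  # row 0: D[0][j] = j
    prev[0] = 1                # boundary condition D[0][0] = 1
    for i in range(1, m + 1):
        cur = [0] * (n + 1)    # D[i][0] = 0
        for j in range(1, n + 1):
            match = abs(t[j - 1] - p[i - 1]) <= delta
            if match and prev[j - 1] > 0 and j - prev[j - 1] <= alpha + 1:
                cur[j] = j               # extend the occurrence of the prefix
            elif not match and j - cur[j - 1] < alpha + 1:
                cur[j] = cur[j - 1]      # carry the last occurrence forward
        prev = cur
    return [j for j in range(1, n + 1) if prev[j] == j]

# Example: pattern (1, 2) occurs in text (1, 9, 2) with delta = 0, alpha = 1.
print(delta_occurrences_bounded_gaps([1, 9, 2], [1, 2], 0, 1))  # [3]
```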
In the sequel we are going to explore the problem of computing (δ, γ)-occurrences of a pattern with α-bounded gaps. This problem is defined as follows:
Definition 9. Given a string t = t1 , . . . , tn , a pattern p = p1 , . . . , pm and
integers α, δ, γ, check if there is a (δ, γ)-occurrence of p with α-bounded
gaps in t.
We will follow an approach similar to the one previously used. The computation of D is performed exactly in the same way, except that as we scan each symbol of the text (that is, as we complete each column of matrix D) we maintain for each pi a min-queue ([8]) storing the occurrences of πi−1,
such that the gap invariant is not violated. All occurrences stored in this
queue satisfy the gap invariant with respect to the current position, while
the order key is the approximation error. When we find a δ-occurrence of
pi that extends an occurrence of πi−1 to an occurrence of πi, we add it
to the queue of symbol pi+1 . When we scan a text symbol we may also
delete the first element inserted in the queue since the gap invariant may
be violated.
When we encounter a δ-occurrence of pi we form an occurrence of πi by choosing, as the occurrence of πi−1, the minimum-error occurrence among those stored in the queue corresponding to pi. We also keep the costs of occurrences in an auxiliary matrix C, which is constructed simultaneously with matrix D. When D(i, j) ≠ 0, the corresponding C(i, j) contains the cost of the occurrence of πi at position j of the text, that is, the sum of all δ-errors introduced at the symbols of the occurrence. In this way, if there is an occurrence in row m of matrix D, then by using matrix C we can deduce whether this is a γ-occurrence as well. The time complexity of the algorithm is O(nm) while the space complexity becomes O(nm + mα) = O(nm).
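The min-queues of [8] support insertion, removal of the oldest element and constant-time retrieval of the minimum. A standard way to obtain this behaviour, sketched below with a class name and (key, payload) interface of our own choosing, pairs a FIFO queue with a monotone deque of candidate minima.

```python
from collections import deque

class MinQueue:
    """FIFO queue that also reports its minimum key in amortized O(1) time.
    Keys play the role of the approximation errors used as order keys above."""
    def __init__(self):
        self.items = deque()  # all (key, payload) pairs, oldest first
        self.mins = deque()   # keys in non-decreasing order: minimum candidates
    def push(self, key, payload=None):
        self.items.append((key, payload))
        while self.mins and self.mins[-1][0] > key:
            self.mins.pop()   # dominated entries can never be the minimum again
        self.mins.append((key, payload))
    def pop_oldest(self):
        item = self.items.popleft()
        if self.mins and self.mins[0] == item:
            self.mins.popleft()
        return item
    def min(self):
        return self.mins[0]   # the (key, payload) pair with the smallest key
```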
2.2 δ-Occurrence Minimizing Total Difference of Gaps
This problem is formally stated as follows:
Definition 10. Given a text string t = t1, . . . , tn, a pattern p = p1, . . . , pm and an integer δ, check if there is a δ-occurrence of p with gaps minimizing the quantity Σ_{i=1}^{m−2} Gi, where Gi = |gi − gi+1|.
To solve this problem we construct a directed acyclic graph (DAG) H = (V, E) by creating, for each symbol pi, 1 ≤ i ≤ m, a node v_i^j, 1 ≤ j ≤ n, whenever pi =δ tj. In this way, for each symbol pi of pattern p we may create as many as n nodes v_i^j, producing at most nm nodes in total. The construction is made so that the nodes are divided into layers, where each layer corresponds to the δ-occurrences of a specific symbol pi of pattern p.
The set of edges E is constructed as follows. Edges among nodes of the same layer are forbidden; this is because we would like the edges to represent all the different δ-occurrences of pattern p in text t. We introduce a directed edge between two nodes v_i^j and v_{i′}^{j′} if and only if i′ = i + 1 and j′ > j. All nodes v_1^j that correspond to a δ-occurrence of the symbol p1, i.e., the nodes lying in the first layer, are connected to a node s (s → v_1^j). All nodes that correspond to a δ-occurrence of the symbol pm are connected to a node d (v_m^j → d). To each edge we assign a cost proportional to the length of the gap defined by this occurrence: the edge e = (v_i^j, v_{i+1}^{j′}) is given weight w_e = j′ − j. Edges starting from s or ending at d are given zero weight. It is obvious that in the worst case the edge set will have size O(n²m). Concluding, we can compute the set V in O(nm) time, while the set E is computed in O(n²m) time. Thus, the time complexity as well as the space complexity is asymptotically O(n²m).
From graph H we construct a new graph H′, obtained by contracting every pair of nodes v_i^j and v_{i+1}^{j′} that are connected by a directed edge e = (v_i^j, v_{i+1}^{j′}) into a single node v′_e. If the edge f = (v_{i+1}^{j′}, v_{i+2}^{j″}) also exists in H (in H′ it is represented by the node v′_f), then in H′ we introduce the edge e′ = (v′_e, v′_f). The weight of the edge e′ in H′ is defined as the difference between the gap lengths corresponding to the weights of edges e and f in H. The new graph H′ may have as many as O(n²m) nodes and O(n²m) edges. Because H′ is a DAG (Directed Acyclic Graph), we can compute a shortest path in O(n²m) time by topologically sorting it. The computation of the shortest path in this graph provides us with the solution to this problem, using O(n²m) space and O(n²m) time.
The space complexity can be reduced from O(n2 m) to O(n2 ) since it
is possible to simulate the shortest path computation during the construction of H ′ . Note that during the construction of H ′ we need the nodes
and edges that correspond to three consecutive symbols pi−1 , pi and pi+1
of the pattern p when the current symbol scanned is pi+1 . By scanning
through the pattern and constructing the required parts of H ′ , one can
compute the shortest path to each new node by taking the minimum over
the incoming edges.
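The computation can be illustrated by the following Python sketch, which treats each node of H′ as a pair of text positions δ-matching consecutive pattern symbols and propagates shortest-path costs layer by layer; the naming and the integer encoding of symbols are ours, and the sketch favours clarity over attaining the O(n²m) bound.

```python
def min_total_gap_difference(t, p, delta):
    """Shortest path over the contracted DAG H': a node is a pair (j, j2) of
    0-based text positions delta-matching consecutive pattern symbols, and
    an edge to (j2, j3) costs |(j3 - j2) - (j2 - j)|, i.e. |g_{i+1} - g_i|.
    Returns the minimum total gap difference of a delta-occurrence of p in t,
    or None if there is no occurrence."""
    n, m = len(t), len(p)
    assert m >= 3, "the objective involves at least two consecutive gaps"
    occ = [[j for j in range(n) if abs(t[j] - p[i]) <= delta] for i in range(m)]
    INF = float('inf')
    # first layer of H': placements of (p1, p2), all with cost 0
    dist = {(j, j2): 0 for j in occ[0] for j2 in occ[1] if j2 > j}
    for i in range(1, m - 1):          # extend by the occurrences of p[i+1]
        nxt = {}
        for (j, j2), c in dist.items():
            for j3 in occ[i + 1]:
                if j3 > j2:
                    cost = c + abs((j3 - j2) - (j2 - j))
                    if cost < nxt.get((j2, j3), INF):
                        nxt[(j2, j3)] = cost
        dist = nxt
    return min(dist.values()) if dist else None
```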
A procedure similar to the one described above can be used to solve the problem of δ-occurrences and (δ, γ)-occurrences with ε-bounded-difference gaps. The time and space complexities remain O(n²m) and O(n²), respectively (for more details the reader can refer to [6]).
2.3 δ-Occurrence of a Set of Strings with Bounded Gaps
Let S1, S2, . . . , Sk ∈ Σ∗ be a set of k strings. The problem is formulated as follows.
Definition 11. Given a set of strings, a text string t = t1, . . . , tn, and integers δ and ∆, check if there is a δ-occurrence of each string Si in text t such that if t[l1 . . .] =δ S1, t[l2 . . .] =δ S2, . . . , t[lk . . .] =δ Sk, then g1 = l2 − l1 ≤ ∆ and, in general, gi = li+1 − li ≤ ∆.
The solution is quite similar to that of Subsection 2.1. Define the pattern p to be p = S1 S2 . . . Sk. Then construct matrix D by finding in each row i the δ-occurrences of the substring Si, exactly as we did when we tried to find the δ-occurrence of a single symbol pi. This can easily be accomplished in time linear in the length of Si, by a series of character-by-character comparisons. Thus, the whole time complexity is O(n(|S1| + |S2| + · · · + |Sk|)) and the space consumption is O(n(|S1| + |S2| + · · · + |Sk|)).
3 Model Inference in Multiple Strings
A very interesting problem that arises in computational molecular biology is the "Model Inference Problem". In this problem we seek to identify all the regularities (repeated substrings or structures), not known a priori, in a nucleic or protein sequence (Fig. 3).
Fig. 3. Inference of regularities: repeated occurrences of an unknown substring x, separated by gaps.
Towards this goal, the first step to be made is to infer the occurrences of pairs of equal substrings. The space (number of characters) between the end of the first of the equal substrings and the start of the second one is called the gap. Gusfield [9] demonstrates a basic technique for finding all maximal pairs in a string of length n, without any restriction on gaps, in O(n + a) time and O(n) space, where a is the number of reported pairs. His method is based upon the suffix tree and one of its basic properties, which we discuss below. Brodal et al. [1] extended Gusfield's algorithm and proposed two new ones that compute all maximal pairs in an input string of length n whose gaps are restricted. When the gaps are both lower and upper bounded, the algorithm works in O(n log n + a) time, and when the gaps are only lower bounded, O(n + a) time is needed, where a is the size of the output. Both algorithms use linear space.
In this section we are going to consider a more general version of this problem, where repetitive structures are sought in several strings together and not only in a particular string (Fig. 4). This kind of information points out features common to a set of sequences, which carry important biological meaning. We consider the problem of identifying the occurrences of maximal pairs in multiple strings and describe the methodology introduced in [12].

Fig. 4. Regularities occurring in multiple strings.
3.1 Problem Definitions
Consider that we have a set of strings S = {S1, S2, . . . , Sk}, where each of these strings is constructed over the alphabet Σ and the total length of the strings is Σ_{1≤i≤k} |Si| = n. Our goal is to find all the maximal pairs that appear simultaneously in all strings Si (1 ≤ i ≤ k). To this end, we will use the suffix tree.
The key feature of the suffix tree is that for any leaf i, the concatenation of the edge-labels (substrings assigned to edges) on the path from the root to leaf i spells out the suffix of t starting at position i. Two significant details are: (i) for each internal node, each outgoing edge-label starts with a distinct, nonempty character; (ii) the path-label of a leaf node is the concatenation of the edge-labels on the path from the root to that leaf (see Fig. 5).
Definition 12. A left-maximal pair in a string t is a pair of identical
substrings α and β in t such that the character to the immediate left of α
is different from the character to the immediate left of β.
Definition 13. A right-maximal pair is defined analogously, as a pair of identical substrings that cannot be extended to the right.
Notice that the suffix tree easily detects the right-maximal pairs, because at any internal node u the suffixes that lie in different subtrees of u start with the same path-label of u and then, immediately to the right, a different character occurs. The latter holds because every such suffix corresponds to a path from the root to a leaf, touching u, that follows a different edge out of u (see Fig. 5b).

Fig. 5. a) The suffix tree T(xabxac$). b) Two suffixes i and j that share the path-label of an internal node u and branch on different edges below it.
Definition 14. A maximal pair in a string t is a pair of identical substrings α and β in t such that the characters to the immediate left and right of α are different from the characters to the immediate left and right of β, respectively. In other words, extending α and β in either direction destroys the equality of the two substrings. A maximal pair is thus a pair that is both left- and right-maximal.
Remark 1. Having the suffix tree of a string t at our disposal, we can organize, in time linear in the length of t, the indexes of the suffixes so that the detection of the left-maximal pairs is convenient. At every leaf we keep an array of |Σ| lists, one list per letter of the alphabet: {L1{}, L2{}, . . . , L|Σ|{}}. These structures are called leaf-lists. Every index i is kept in the list that corresponds to the character at position i − 1 (immediately to the left). Running a bottom-up process on the suffix tree, every internal node u examines all the leaf-lists of its children. For every possible pair of its children k, l (this guarantees the right-maximality) the node generates the maximal pairs, combining all the elements of every sub-list of the first with the elements of every sub-list of the other, skipping at the same time sub-lists that correspond to the same character (this guarantees the left-maximality). After reporting the produced maximal pairs, the parent node u merges all the leaf-lists into one, concatenating the sub-lists that correspond to the same character of the alphabet. Hereafter, the merged leaf-list is assigned to the parent node. The merging of leaf-lists at every internal node takes O(|Σ|) = O(1) time, therefore the total time needed is O(n) plus the size of the output. Consequently, O(n + a) time and linear space O(n) are needed, because each of the n indexes occurs only once in a list.
Due to the above remark we can detect the maximal pairs in one string t without caring about the gap between the equal substrings. Trying to find the maximal pairs for either fixed or bounded gaps, and at the same time indicating the ones that lie in a set of strings, is a bit more complicated.
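As a concrete reference for these definitions, the following naive Python sketch reports all maximal pairs of a single string by brute force; it only illustrates left- and right-maximality and has none of the efficiency of the suffix-tree procedure of Remark 1.

```python
from collections import defaultdict

def naive_maximal_pairs(t):
    """Report all maximal pairs (i, j, length): equal substrings
    t[i:i+l] == t[j:j+l] that cannot be extended to the left or right.
    A string boundary counts as a mismatching character.  Cubic work
    and space -- for illustration only."""
    n = len(t)
    occ = defaultdict(list)                 # substring -> its start positions
    for i in range(n):
        for l in range(1, n - i + 1):
            occ[t[i:i + l]].append(i)
    pairs = []
    for s, starts in occ.items():
        l = len(s)
        for a in range(len(starts)):
            for b in range(a + 1, len(starts)):
                i, j = starts[a], starts[b]
                left = i == 0 or j == 0 or t[i - 1] != t[j - 1]
                right = i + l == n or j + l == n or t[i + l] != t[j + l]
                if left and right:
                    pairs.append((i, j, l))
    return pairs

print(naive_maximal_pairs("xabxac"))        # [(0, 3, 2)] -- the pair "xa"
```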
3.2 Algorithms
Initially we discuss the problem of identifying maximal pairs with arbitrary gaps in a set of strings. We are going to extend Remark 1 to detect when a pair reported from a specific string Si occurs in all the other strings Sj (∀j ≠ i) as well. Towards this goal we use a generalized suffix tree for all the strings of set S.
A generalized suffix tree GST(S) contains all the suffixes of all the strings of S and can be built in time linear in the sum of the lengths of the strings. Contrary to a suffix tree, the GST(S) may store more than one index at a leaf. Each such index belongs to a different string Si, indicating that there are common suffixes among the strings. A method similar to the one described in Remark 1 is used, but the processing of all the strings is combined. For this purpose we use new leaf-lists that contain one leaf-list of Remark 1 for every string Si (Fig. 6).
Fig. 6. A composite leaf-list: one array of sub-lists L1, . . . , L|Σ| for each of the strings S1, S2, . . . , Sk.
A bottom-up process runs again on the generalized suffix tree, with every internal node u receiving all the leaf-lists of its children. For every possible pair of its children k, l it generates the maximal pairs for each string Si of the set, using the sub leaf-lists that correspond to Si exactly as described in Remark 1. The produced maximal pairs for every string Si are not reported immediately, because we need to examine whether at least one pair is generated for every string Si.
In order to achieve this, a temporary array of k lists is used to store, in a different list, the produced maximal pairs for every string Si. In O(k) time it can be verified whether some list has length equal to zero and, consequently, whether every string reports at least one pair. If no list is empty, we report all the lists. After reporting, the concatenation step follows, where the sub leaf-lists for each of the strings Si are concatenated into one sub leaf-list (as in Remark 1). The second step takes O(|Σ|k) time, which is O(1) when both |Σ| and k are treated as constants. Thus the total space and time needed is linear in the sum of the lengths of the strings (Σ_{1≤i≤k} |Si| = n) plus the size of the output.
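The reporting test itself is simple; a sketch with names of our own choosing follows.

```python
def report_common_pairs(pairs_per_string):
    """pairs_per_string[i] holds the maximal pairs generated at the current
    internal node for string S_i.  The all() call is the O(k) emptiness
    test: reporting happens only if every one of the k lists is non-empty."""
    if all(pairs_per_string):
        for i, pairs in enumerate(pairs_per_string, start=1):
            for pair in pairs:
                print(f"S{i}: maximal pair {pair}")
```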
Introducing the constraint that the length of the gaps has to be bounded by a constant b, the previous approach must be extended. This constraint further complicates things, as the following example illustrates. In the previous bottom-up method, at an internal node u the candidate pairs of string Si are produced by taking all the possible combinations of pairs from the lists Si → Lx{} and Si → Lz{}, ∀x ≠ z. With the new constraint, an index i of the first list does not have to be combined with all the indexes of the second one (denoted by j), but only with the at most 2b indexes that meet the gap constraint (|j − (i + |path label|)| ≤ b).
In order to achieve this, the Si → Lx{} lists cannot be implemented as linear lists, because searching for the proper j's in all the other lists would incur O(n²) time. If we implement these lists as AVL trees (as in [1]), we trade merging time (now we do not simply concatenate but merge the lists) for searching time and, as the following four lemmas show, this leads to an O(n log n) algorithm.
Lemma 1. Two AVL trees of size at most n and m, with n ≤ m, can be merged in O(log C(n + m, n)) time, where C(n + m, n) denotes a binomial coefficient.

Proof. See [2].
Lemma 2. Given two sorted lists of elements e1, e2, . . . , en and a1, a2, . . . , am (n ≤ m), structured as two AVL trees T and T′, we can find qi = min{x ∈ T′ | x ≥ ei} for all 1 ≤ i ≤ n in O(log C(n + m, n)) time.

Proof. The basic idea is to use the merging algorithm of Brown and Tarjan [2] and, instead of performing a real merge, merely find where each element would have to be inserted, by keeping a pointer.
Lemma 3. Let T be an arbitrary binary tree with n leaves. The sum, over all internal nodes u ∈ T, of the terms log((n1 + n2)/n1), where n1 and n2 are the numbers of leaves in the two subtrees of u (and n1 ≤ n2), is O(n log n). This lemma is known as the "smaller-half trick".

Proof. This can be proved by induction on the number of leaves of the binary tree. See [1].
Lemma 4. The step of producing the candidate pairs and the step of merging at an internal node u with degree d (d ≤ |Σ|) of the generalized suffix tree can be simulated in a binary tree.

Proof. As mentioned above, for every string Si and every internal node u, the process generates, for every possible pair of children k, l of u (this guarantees the right-maximality), the maximal pairs, combining all the elements of every sub-list of the first with the elements of every sub-list of the other, skipping at the same time sub-lists that correspond to the same character (this guarantees the left-maximality). Assuming that u has degree d, if we transform u into a binary tree with d leaves, where every internal edge has an empty string label, each of the initial children of u enters the binary tree at a different leaf. If we perform the same process in the binary tree, using at every node just the two existing leaf-lists, from the leaves towards the root of the binary tree, we will have considered every possible pair of children k, l of the initial node u. For example, if u has four children 1, 2, 3, 4, the produced binary tree will have four leaves and height two. In the bottom-up method the pairs of leaf-lists (1,2) and (3,4) are examined first and then merged. At the next level the pair (1 ∪ 2, 3 ∪ 4) is examined, which is equivalent to the pairs (1,3), (1,4), (2,3) and (2,4). So all the possible pairs are formed.
Finally, assuming that we have the generalized suffix tree (GST) T of the set S, with at most Σ_{1≤i≤k} |Si| = n leaves, we can transform T into a binary tree (a process called binarization) as described in Lemma 4 in linear time, because every node of the GST has degree at most |Σ| and thus can be transformed into a binary tree in O(1) time. Consequently, if we run the bottom-up process on the "binarized" tree and perform at every internal node u the producing and merging steps, using the smaller leaf-list (the one that corresponds to the subtree with the smaller number of leaves) against the other, then Lemmas 1, 2 and 3 ensure that overall the process takes O(n log n) time. Consequently, searching for maximal pairs with bounded gaps in a set of strings can be performed in linear space and O(n log n + a) time, where a is the size of the output.
4 Weighted Sequences: Data Structures and Algorithms
In this section, we present a data structure for storing the set of suffixes
of a weighted sequence with probability of appearance greater than 1/k,
where k is a given constant. We use as fundamental data structure the
suffix tree, incorporating the notion of probability of appearance for every
suffix stored in a leaf. Thus, the introduced data structure is called the
Weighted Suffix Tree (abbrev. WST).
4.1 Data Structure Description
The weighted suffix tree can be considered as a generalisation of the
ordinary suffix tree to handle weighted sequences. We give a construction
of this structure in the next section. The constructed structure inherits
all the interesting string manipulation properties of the ordinary suffix
tree. However, it is not straightforward to give a formal definition as with
its ordinary counterpart. A quite informal definition appears below.
Definition 15. Let S be a weighted sequence. For every suffix starting at position i we define a list of possible weighted substrings, so that the probability of appearance of each one of them is greater than 1/k. Denote each of them by Si,j, where j is the rank of the substring in some arbitrary numbering. We define the weighted suffix tree W ST(S) of a weighted sequence S as the compressed trie of a portion of all the weighted substrings starting within each suffix Si of S$, $ ∉ Σ, having a probability of appearance greater than 1/k. Let L(v) denote the path-label of node v in W ST(S), which results by concatenating the edge labels along the path from the root to v. Leaf v of W ST(S) is labeled with index i if there exists j > 0 such that L(v) = Si,j and π(Si,j) ≥ 1/k, where j denotes the j-th weighted substring starting at position i. We define the leaf-list LL(v) of v as a list of the leaf-labels in the subtree below v.
We will use an example to illustrate the above definition. Consider again the weighted sequence shown in Fig. 2 and suppose that we are interested in storing all suffixes with probability of appearance greater than a predefined parameter. We will construct the suffix tree for the sequence, incorporating the notion of probability of appearance for each suffix.
For the above sequence and 1/k = 1/4 we have the following possible prefixes for every suffix:
– Prefixes for suffix x1..11 : S1,1 = ACTTATCATTT, π(S1,1) = 0.25, and S1,2 = ACTTCTCATTT, π(S1,2) = 0.25.
– Prefixes for suffix x2..11 : S2,1 = CTTATCATTT, π(S2,1) = 0.25, and S2,2 = CTTCTCATTT, π(S2,2) = 0.25, etc.

The weighted suffix tree for the above substrings appears in Fig. 7.

Fig. 7. A Weighted Suffix Tree example: the WST built for the weighted word of Fig. 2, with leaves labeled by the substrings Si,j they store.
4.2 The Construction of the Weighted Suffix Tree
In this paragraph we describe an efficient algorithm for constructing the WST of a given weighted sequence w = w1..n of length n. First we describe the naive approach, which is quadratic in time. As already discussed, the weighted suffix tree (which stores all substrings with probability of appearance greater than 1/k, where k is a given constant) is a generalized suffix tree (GST) that can be built as follows.
Step 1: For each i (1 ≤ i ≤ n), generate all possible weighted suffixes of the weighted sequence starting at position i with probability of appearance greater than 1/k.
Step 2: Construct the generalized suffix tree GST of the list of all possible weighted suffixes.
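A hedged sketch of Step 1, reusing the dictionary encoding of the earlier example, appears below; the branching enumeration is our own reading of "generate all possible weighted suffixes", and the terminating $ of the GST is omitted.

```python
def weighted_suffixes(w, k):
    """Naive Step 1: for each position i of the weighted sequence w (a list
    of {character: probability} dicts), list the maximal plain substrings
    starting at i whose cumulative probability is at least 1/k, together
    with that probability."""
    threshold = 1.0 / k
    n = len(w)
    result = {}
    for i in range(n):
        finished, frontier = [], [("", 1.0)]
        for pos in range(i, n):
            nxt = []
            for s, p in frontier:
                ext = [(s + c, p * q) for c, q in w[pos].items()
                       if p * q >= threshold]
                nxt.extend(ext)
                if not ext:                 # cannot be extended any further
                    finished.append((s, p))
            frontier = nxt
            if not frontier:
                break
        finished.extend(frontier)           # spellings that reached the end of w
        result[i] = [(s, p) for s, p in finished if s]
    return result

# With the word w of Fig. 2 and k = 4, result[0] holds exactly the two
# suffixes S1,1 = ACTTATCATTT and S1,2 = ACTTCTCATTT, each with probability 0.25.
```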
The above naive approach is not optimal, since the construction takes O(n²) time. In the following paragraphs we present an alternative, efficient approach. The exact steps of our construction methodology are:
Step 1: Scan all the positions i (1 ≤ i ≤ n) of the weighted sequence and mark each one according to the following criteria:
– mark position i black, if none of the possible characters listed at position i has probability of appearance greater than 1 − 1/k,
– mark position i gray, if at least one of the possible characters listed at position i has probability of appearance greater than 1 − 1/k (but smaller than 1),
– and finally mark position i white, if one of the possible characters has probability of appearance equal to 1.
Notice that the following holds: at white positions only one character can appear, thus we can call them solid positions; at black positions, since no character appears with probability greater than 1 − 1/k, more than one character appears with probability greater than 1/k, hence we can call them branching positions. At gray positions only one character eventually survives, since all the possible characters except one have probability of appearance less than 1/k, which implies that they cannot produce an eligible substring (i.e., one with π(substring) ≥ 1/k). During the first step we also maintain a list B of all the black positions.
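The marking step admits a direct sketch under the same encoding (the function name and colour strings are ours).

```python
def mark_positions(w, k):
    """Step 1 of the efficient construction: classify each position of the
    weighted sequence w as white (solid), gray, or black (branching), and
    collect the list B of black positions."""
    marks, B = [], []
    for i, pos in enumerate(w):
        top = max(pos.values())
        if top == 1.0:
            marks.append("white")
        elif top > 1.0 - 1.0 / k:
            marks.append("gray")
        else:
            marks.append("black")
            B.append(i)
    return marks, B

# For the word of Fig. 2 and k = 4, positions 5 and 8 (1-based) are black
# and all the others are white.
```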
Step 2: Scan all the positions in B from left to right. At each black position i a list of possible substrings starting from this position is created. The production of the possible substrings is done as follows: moving rightwards, we extend the current substrings by adding the single surviving character whenever we encounter a white or gray position (only one possible choice), and we create new substrings at black positions, where potentially many choices are provided. The process is illustrated in Fig. 8. At this point we define for every produced substring two cumulative probabilities, π′ and π′′. The first one measures the actual substring probability, while the second one is defined by temporarily treating gray positions as white. The generation of a substring stops when it meets a black position and π′′ (which skips gray positions) has reached the 1/k threshold. We call this position the extended position. Notice that the actual substring may be shorter, as π′ (which incorporates gray positions) may have met the 1/k threshold earlier. For every substring we store the difference D between the actual ending position and the extended one, as shown in Fig. 9. Notice that only the actual substrings need to be represented in the GST.
Step 3: Having produced all the substrings from every black position,
we insert the actual substrings in the generalised suffix tree in the
following way. For every substring we initially insert the corresponding
extended substring in the GST and then remove from it the redundant
portion D.

Fig. 8. Producing all possible substrings from left to right.

Fig. 9. Insertion of substrings in the GST: the actual substring X occupies positions i to i + f − 1 and its extended version X′ reaches position i + f′ − 1; D and D1, . . . , D5 mark the redundant portions of X′ and of its suffixes.

To further illustrate the case, suppose that X′ = Xi..i+f′−1 is the extended substring of the actual substring X = Xi..i+f−1 (f ≤ f′) that begins at black position i of the weighted sequence in Fig. 9.
Observe the following two facts:
– There is no need to insert every suffix of X in the GST apart from
those starting to the left of the next black position i′ , as all the
other suffixes will be taken into account when step 2 is executed
for i′ .
– A suffix of X′ may extend to the right of position i + f − 1, where the actual substring X ends, since the cumulative probability of a suffix is at least that of X (cf. Fig. 9). No suffix can end, though, at a position greater than i + f′ − 1, where the extended substring ends.
We keep every leaf storing a suffix of X′ in a list L. Let Dj denote the redundant portion of the suffix Xi+j..i+f′−1 of X′ (cf. Fig. 9). After we have inserted the extended substring and the proper suffixes using McCreight's algorithm [13], we have to remove all the Dj's from the GST. Starting from the leaf corresponding to the entire X′, we move up the tree by D characters. At the current position we eliminate the extra portion of X′, storing X. The next redundancy, of length D1, is at the end of Xi+1..i+f′−1. We locate this suffix using the suffix link. Let λd = |Dd−1| − |Dd| for d > 1, and λ1 = D − D1. After using the suffix link we may also have to descend by λ1 characters. At this position we store the correct suffix (possibly extending it by up to λ1 characters after position i + f − 1). We continue the elimination procedure for the remaining suffixes of X′, as outlined above. The entire process costs at most Σ_{d>0} λd = O(D), which is the time required to complete the suffix tree construction.
Note: The above description implicitly assumes that there are no positions i where πi(σ) < 1/k for all σ ∈ Σ. If this is not the case, the sequence can be divided into subsequences within which this assumption holds, and these subsequences can be processed separately, according to the previous algorithm.
4.3 Time and Space Analysis on the Construction of the WST
The time and space complexity analysis for the construction of the WST
is based on the combination of the following lemmas:
Lemma 5. At most O(|Σ|^(log k / log(k/(k−1)))) substrings can start at each branching position i (1 ≤ i ≤ n) of the weighted sequence.
Proof. Consider for example position i and the longest substring u which starts at that position. If we suppose that u is λ characters long, its cumulative probability will be π(u1..λ) = πi(u1) ∗ πi+1(u2) ∗ · · · ∗ πi+λ−1(uλ). In order to produce this substring we have to pass through l black positions of the weighted sequence. Recall that at black positions none of the possible characters has probability of appearance greater than π̂ = 1 − 1/k. Assuming that there are no gray positions that could reduce the cumulative probability, π(u1..λ) is less than or equal to π̂^l (taking only black positions into account). For this substring to be stored, its cumulative probability must satisfy π̂^l ≥ 1/k and thus, taking logarithms (all logarithms are base 2), l ≤ log k / log(k/(k−1)). For example, typical values of l are approximately 21.9 for k = 20 and 1046 for k = 200.
Thus, regardless of whether or not the gray positions are considered, u includes at most l = O(1) black positions, or in other words, positions where new substrings are produced. Hence, every position i of the weighted sequence can be the starting point of at most |Σ|^l substrings.
Lemma 6. The number of substrings with probability greater than or
equal to 1/k is at most O(n).
Proof. If every position i of the weighted sequence is the starting point of a
constant number of substrings (Lemma 5), the total number of substrings
is O(n).
Lemma 7. Step 2 of the construction algorithm takes O(n) time.
Proof. Suppose that the weighted sequence is divided into windows Nj, j ≥ 1, each containing l = log k / log(k/(k−1)) black positions (cf. Fig. 10). Notice that a window can contain more than l positions of all types, and that Σ_{j≥1} |Nj| = n. Let us consider window Ni. Step 2 scans the black positions inside Ni. Every black position will generate O(1) substrings (according to Lemma 5) and none of them is going to exceed window Ni+1, because it cannot be extended over more than l black positions. Thus, the length of the substrings will be at most |Ni| + |Ni+1|, and for window Ni step 2 costs at most O(l²(|Ni| + |Ni+1|)) = O(|Ni| + |Ni+1|) time. Summing up the costs for all windows, we conclude that step 2 incurs a total cost of O(Σi (|Ni| + |Ni+1|)) = O(n).

Fig. 10. Time cost for step 2: the weighted sequence divided into windows N1, N2, N3, . . .
Lemma 8. Step 3 of the construction algorithm takes O(n) time.
Proof. Consider again the window scheme of the previous lemma, and in particular window Ni. In step 3 we insert into the WST the extended substrings that correspond to that window. Each one of them has length at most |Ni| + |Ni+1|. The cost of inserting those extended substrings into the WST using McCreight's algorithm is O(l · (|Ni| + |Ni+1|)) = O(|Ni| + |Ni+1|), and the cost of repairing the WST (as described in step 3) is O(l · D). Since D is always smaller than |Ni| + |Ni+1|, for window Ni step 3 costs O(|Ni| + |Ni+1|) time. Summing the costs over all windows, step 3 takes O(n) time in total.
Based on the previous lemmas we derive the following theorem.
Theorem 1. The time and space complexity of constructing the WST is linear in the length of the weighted sequence.

Proof. The WST, which is a compact trie data structure, stores O(n) substrings (by Lemma 6) and thus the space is O(n). None of the three construction steps takes more than O(n) time, so the total time complexity is O(n).
4.4 Applications
The WST is endowed with most of the sequence manipulation capabilities
of the Generalized Suffix Tree. Some of its applications are listed below.
Exact pattern matching: There are two versions of the exact pattern matching problem: one where the pattern is unweighted and one where it is weighted. Pattern matching with an unweighted pattern proceeds as with the regular suffix tree matching procedure. Thus, if after having spelled the entire pattern from the tree root we end up at an internal node, all the leaves of its subtree are reported. When the pattern is weighted, on the other hand, its non-weighted counterparts are derived (see the sketch below) and pattern matching reduces to the previous case. For a pattern of size m, pattern matching takes O(m + a) time, where a is the number of reported occurrences.
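The derivation of the non-weighted counterparts of a weighted pattern can be sketched as follows, under the same dictionary encoding used earlier (the function name is ours).

```python
def plain_counterparts(p, k):
    """Enumerate the plain strings spelled by a weighted pattern p with
    cumulative probability at least 1/k; each of them is then matched on
    the WST with the ordinary suffix-tree procedure."""
    threshold = 1.0 / k
    partial = [("", 1.0)]
    for pos in p:
        partial = [(s + c, q * prob) for s, q in partial
                   for c, prob in pos.items() if q * prob >= threshold]
    return [s for s, _ in partial]
```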
Finding repetitions in weighted sequences: The WST can be used to compute all the repetitions in a given weighted sequence, each repetition having probability of appearance greater than 1/k. The WST with parameter 1/k is constructed and the repetition-finding problem is then reduced to a depth-first traversal of the tree, during which a leaf-list is kept for each internal node. If the size of the list is at least two, this constitutes a repetition and its elements are reported. Consequently, the problem can be solved in O(n + a) time, where n is the sequence length and a is the answer size (cf. [11]).
Sequence alignment: In the sequence alignment problem we are looking for the best alignment of two sequences S1 and S2, i.e., one that minimises the edit distance of S1 and S2 (or, in other words, maximises their similarity measure). The suffix tree can be used in combination with dynamic programming to produce a hybrid dynamic programming method that is faster than dynamic programming alone (for more details see [9]).
When we consider the sequence alignment problem for two weighted sequences, we have to incorporate the notion of probability into the produced alignment. The edit distance between two weighted sequences is labeled with the probability of appearance of the respective weighted factors. The WST can be used instead of the ordinary suffix tree to efficiently compute the alignment of all pairs of weighted substrings of two given weighted sequences.
Longest common substring of weighted sequences: The generalized weighted suffix tree is built for two weighted sequences w1 and w2. Subsequently, a traversal of the tree computes the internal node with the greatest string depth. The string spelled at this node is the longest common weighted substring of the two weighted sequences. The process can be completed in O(n1 + n2) time, where n1, n2 are the sizes of w1, w2, respectively.
5 Conclusions
In this chapter we have presented efficient Data Structuring Applications
and Algorithms for solving string manipulation problems derived from
different research areas.
We initially discussed the Approximate Matching Problem with Gaps, and subsequently the Model Inference Problem in Multiple Strings. Finally, we presented the Weighted Suffix Tree, an efficient data structure for solving string manipulation problems in weighted sequences.
In our view, the techniques presented above can be a useful tool for every researcher who wants to study other string manipulation problems.
For the Approximate Matching Problem with Gaps it would be interesting to explore new versions of the problem, as well as algorithms more efficient than those previously described. We should point out that the presented algorithms were based mainly on dynamic programming and on the equivalence of some optimization problems on strings to problems on graphs.
For the Model Inference Problem it would be interesting to design efficient algorithms that identify not only pairs of equal substrings in multiple strings, but collections of equal substrings that meet some restrictions on gaps. A naive approach (although not sufficient) could be to report maximal pairs and combine them into triples, quadruples and so on.
Finally, the Weighted Suffix Tree can solve a variety of string manipulation problems in weighted sequences.
References
1. Brodal, G.S., Lyngsø, R.B., Pedersen, C.N.S., Stoye, J.: Finding maximal pairs with bounded gaps. Journal of Discrete Algorithms, Vol. 1(1), pages 77–104, 2000.
2. Brown, M.R., Tarjan, R.E.: A fast merging algorithm. Journal of the ACM, Vol. 26(2), pages 211–226, 1979.
3. Cambouropoulos, E., Crawford, T., Iliopoulos, C.S.: Pattern processing in melodic sequences: challenges, caveats and prospects. In Proc. of the AISB'99 Convention (Artificial Intelligence and Simulation of Behaviour), pages 42–47, 1999.
4. Cambouropoulos, E., Crochemore, M., Iliopoulos, C.S., Mouchard, L., Pinzon, Y.: Algorithms for computing approximate repetitions in musical sequences. In Proc. of the 10th Australasian Workshop on Combinatorial Algorithms, pages 129–144, 1999.
5. Crochemore, M., Iliopoulos, C.S., Pinzon, Y., Rytter, W.: Finding motifs with gaps. Unpublished manuscript.
6. Crochemore, M., Iliopoulos, C.S., Makris, Ch., Rytter, W., Tsakalidis, A., Tsichlas, K.: Approximate string matching with gaps. Nordic Journal of Computing, Vol. 9(1), pages 54–66, 2002.
7. Crochemore, M., Iliopoulos, C.S., Lecroq, T., Pinzon, Y.J.: Approximate string matching in musical sequences. In M. Balik and M. Simanek, editors, Proceedings of the Prague Stringology Club Workshop (PSCW'01), pages 26–36, Czech Technical University, Prague, Czech Republic, 2001.
8. Gajewska, H., Tarjan, R.E.: Deques with heap order. Information Processing Letters, Vol. 22(4), pages 197–200, 1986.
9. Gusfield, D.: Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, New York, 1997.
10. Hirschberg, D.S.: A linear space algorithm for computing maximal common subsequences. Communications of the ACM, Vol. 18(6), pages 341–343, 1975.
11. Iliopoulos, C.S., Makris, Ch., Panagis, Y., Perdikuri, K., Theodoridis, E., Tsakalidis, A.: Efficient algorithms for handling molecular weighted sequences. In 3rd IFIP International Conference on Theoretical Computer Science, 2004, to appear.
12. Iliopoulos, C.S., Makris, C., Sioutas, S., Tsakalidis, A., Tsichlas, K.: Identifying occurrences of maximal pairs in multiple strings. In Proceedings of the 13th Annual Symposium on Combinatorial Pattern Matching, Lecture Notes in Computer Science, pages 133–143, 2002.
13. McCreight, E.M.: A space-economical suffix tree construction algorithm. Journal of the ACM, Vol. 23(2), pages 262–272, 1976.
14. Ukkonen, E.: On-line construction of suffix trees. Algorithmica, Vol. 14(3), pages 249–260, 1995.