
Algorithms on Strings

Maxime Crochemore, Christophe Hancart and Thierry Lecroq
Cambridge University Press

Table of contents

Preface

1 Tools
  1.1 Strings and automata
  1.2 Some combinatorics
  1.3 Algorithms and complexity
  1.4 Implementation of automata
  1.5 Basic pattern matching techniques
  1.6 Borders and prefixes tables

2 Pattern matching automata
  2.1 Trie of a dictionary
  2.2 Searching for several strings
  2.3 Implementation with failure function
  2.4 Implementation with successor by default
  2.5 Searching for one string
  2.6 Searching for one string and failure function
  2.7 Searching for one string and successor by default

3 String searching with a sliding window
  3.1 Searching without memory
  3.2 Searching time
  3.3 Computing the good suffix table
  3.4 Automaton of the best factor
  3.5 Searching with one memory
  3.6 Searching with several memories
  3.7 Dictionary searching

4 Suffix arrays
  4.1 Searching a list of strings
  4.2 Searching with common prefixes
  4.3 Preprocessing the list
  4.4 Sorting suffixes
  4.5 Sorting suffixes on bounded integer alphabets
  4.6 Common prefixes of the suffixes

5 Structures for index
  5.1 Suffix trie
  5.2 Suffix tree
  5.3 Contexts of factors
  5.4 Suffix automaton
  5.5 Compact suffix automaton

6 Indexes
  6.1 Implementing an index
  6.2 Basic operations
  6.3 Transducer of positions
  6.4 Repetitions
  6.5 Forbidden strings
  6.6 Search machine
  6.7 Searching for conjugates

7 Alignments
  7.1 Comparison of strings
  7.2 Optimal alignment
  7.3 Longest common subsequence
  7.4 Alignment with gaps
  7.5 Local alignment
  7.6 Heuristic for local alignment

8 Approximate patterns
  8.1 Approximate pattern matching with jokers
  8.2 Approximate pattern matching with differences
  8.3 Approximate pattern matching with mismatches
  8.4 Approximate matching for short patterns
  8.5 Heuristic for approximate pattern matching with differences

9 Local periods
  9.1 Partitioning factors
  9.2 Detection of powers
  9.3 Detection of squares
  9.4 Sorting suffixes

Bibliography
Index

Preface

This book presents a broad panorama of the algorithmic methods used for processing texts. It is therefore a book on algorithms, but one whose object is the handling of texts by computers. The idea of this publication results from the observation that the rare books entirely devoted to the subject are primarily research monographs. This is surprising, because the problems of the field have been known since the development of advanced operating systems, and the need for effective solutions has become crucial with the massive use of data processing in office automation and in many other sectors of society. In written or vocal form, text is the only reliable vehicle of abstract concepts. It therefore remains the privileged support of information systems, despite significant efforts towards the use of other media (graphic interfaces, virtual reality systems, synthesized movies, etc.). This aspect is further reinforced by the use of knowledge databases, legal, commercial or others, which develop on the Internet thanks in particular to Web services.
The content of the book covers the formal elements and technical bases required in the fields of information retrieval, of automatic indexing for search engines, and, more generally, of software systems that include the editing, processing and compression of texts. The methods that are described apply to the automatic processing of natural languages, to the treatment and analysis of genomic sequences, to the analysis of musical sequences, to problems of safety and security related to data flows, and to the management of textual databases, to cite only some immediate applications. The selected subjects address pattern matching, the indexing of textual data, the comparison of texts by alignment, and the search for local regularities. In addition to their practical interest, these subjects have theoretical and combinatorial aspects which provide astonishing examples of algorithmic solutions.
The goal of this work is principally educational. It is aimed primarily at graduate and undergraduate students, but it can also be used by software designers.
We warmly thank the researchers who took time to read and comment on the preliminary outlines of this book. They are: Saïd Abdeddaïm, Marie-Pierre Béal, Christian Charras, Sabine Mercier, Laurent Mouchard, Gregory Kucherov, Johann Pelfrêne, Bruno Petazzoni, Mathieu Raffinot, Giuseppina Rindone, Marie-France Sagot. Remaining errors are ours.
Finally, extra elements to the contents of the book are accessible on the site http://chl.univ-mlv.fr/ or from the web pages of the authors.

Maxime Crochemore, Christophe Hancart, Thierry Lecroq
London and Rouen, April 2006

1 Tools

This chapter presents the algorithmic and combinatorial framework in which the following chapters are developed. It first specifies the concepts and notation used to work on strings, languages and automata. The rest is mainly devoted to the introduction of chosen data structures for implementing automata, to the presentation of combinatorial results, and to the design of elementary pattern matching techniques. This organization is based on the observation that efficient algorithms for text processing rely on one or the other of these aspects.
Section 1.2 provides some combinatorial properties of strings that occur in numerous correctness proofs of algorithms or in their performance evaluation. They are mainly periodicity results. The formalism for the description of algorithms is presented in Section 1.3, which is especially centered on the type of algorithm presented in the book, and introduces some standard objects related to queues and automata processing. Section 1.4 details several methods to implement automata in memory; these techniques contribute in particular to results of Chapters 2, 5 and 6. The first algorithms for locating strings in texts are presented in Section 1.5. The sliding window mechanism, the notions of search automaton and of bit vectors that are described in this section are also used and improved in Chapters 2, 3 and 8, in particular. Section 1.6 is the algorithmic jewel of the chapter. It presents two fundamental algorithmic methods used for text processing. They are used to compute the border table and the prefix table of a string, two essential tables for string processing that synthesize a part of the combinatorial properties of a string. Their utilization and adaptation are considered in Chapters 2 and 3, and they also come back occasionally in other chapters.
Finally, we can note that intuition for combinatorial properties or algorithms sometimes relies on figures, whose style is introduced in this chapter and kept thereafter.

1.1 Strings and automata

In this section we introduce notation on strings, languages and automata.

Alphabet and strings

An alphabet is a finite non-empty set whose elements are called letters. A string on an alphabet A is a finite sequence of elements of A. The sequence of zero letters is called the empty string and is denoted by ε. For the sake of simplification, delimiters and separators usually employed in sequence notation are removed, and a string is written as the simple juxtaposition of the letters that compose it. Thus ε, a, b and baba are strings on any alphabet that contains the two letters a and b. The set of all the strings on the alphabet A is denoted by A∗, and the set of all the strings on the alphabet A except the empty string ε is denoted by A+.
The length of a string x is defined as the length of the sequence associated with the string x and is denoted by |x|. We denote by x[i], for i = 0, 1, . . . , |x| − 1, the letter at index i of x, with the convention that indices begin with 0. When x ≠ ε, we say more specifically that each index i = 0, 1, . . . , |x| − 1 is a position on x. It follows that the i-th letter of x is the letter at position i − 1 on x and that

x = x[0]x[1] . . . x[|x| − 1].

Thus an elementary definition of the identity between any two strings x and y is: x = y if and only if |x| = |y| and x[i] = y[i] for i = 0, 1, . . . , |x| − 1.
The set of letters that occur in the string x is denoted by alph(x). For instance, if x = abaaab, we have |x| = 6 and alph(x) = {a, b}.
The product – we also say the concatenation – of two strings x and y is the string composed of the letters of x followed by the letters of y. It is denoted by xy, or also x · y to show the decomposition of the resulting string. The neutral element for the product is ε. For every string x and every natural number n, we define the n-th power of the string x, denoted by x^n, by x^0 = ε and x^k = x^(k−1)x for k = 1, 2, . . . , n. We denote respectively by zy^(−1) and x^(−1)z the strings x and y when z = xy. The reverse – or mirror image – of the string x is the string x∼ defined by

x∼ = x[|x| − 1]x[|x| − 2] . . . x[0].

A string x is a factor of a string y if there exist two strings u and v such that y = uxv. When u = ε, x is a prefix of y; and when v = ε, x is a suffix of y. The string x is a subsequence¹ of y if there exist |x| + 1 strings w0, w1, . . . , w|x| such that y = w0 x[0] w1 x[1] . . . x[|x| − 1] w|x|; in a less formal way, x is a string obtained from y by deleting |y| − |x| letters. A factor or a subsequence x of a string y is proper if x ≠ y. We denote respectively by x ≼fact y, x ≺fact y, x ≼pref y, x ≺pref y, x ≼suff y, x ≺suff y, x ≼sseq y and x ≺sseq y when x is a factor, a proper factor, a prefix, a proper prefix, a suffix, a proper suffix, a subsequence and a proper subsequence of y. One can verify that ≼fact, ≼pref, ≼suff and ≼sseq are orderings on A∗.

¹ We avoid the common use of “subword” because it has two definitions in the literature: one of them is factor and the other is subsequence.

The lexicographic order, denoted by ≤, is an order on strings induced by an order on the letters denoted by the same symbol. It is defined as follows: for x, y ∈ A∗, x ≤ y if and only if either x ≼pref y, or x and y can be decomposed as x = uav and y = ubw with u, v, w ∈ A∗, a, b ∈ A and a < b.
Thus ababb < abba < abbaab, assuming a < b.
Let x be a non-empty string and y be a string. We say that there is an occurrence of x in y, or, more simply, that x occurs in y, when x is a factor of y. Every occurrence, or every appearance, of x can be characterized by a position on y. Thus we say that an occurrence of x starts at the left position i on y when y[i . . i + |x| − 1] = x (see Figure 1.1). It is sometimes more suitable to consider the right position i + |x| − 1 at which this occurrence ends.

Figure 1.1 An occurrence of string aba in string babaababa at (left) position 1.

For instance, the left and right positions at which the string x = aba occurs in the string y = babaababa are:

  i                0  1  2  3  4  5  6  7  8
  y[i]             b  a  b  a  a  b  a  b  a
  left positions      1        4     6
  right positions              3        6     8

The position of the first occurrence, pos(x), of x in yA∗ is the minimal left position at which an occurrence of x starts in yA∗. With the notation on languages recalled thereafter, we have:

pos(x) = min{|u| : {ux}A∗ ∩ {y}A∗ ≠ ∅}.

The square bracket notation for the letters of strings is extended to factors. We define the factor x[i . . j] of the string x by

x[i . . j] = x[i]x[i + 1] . . . x[j]

for all integers i and j satisfying 0 ≤ i ≤ |x|, −1 ≤ j ≤ |x| − 1 and i ≤ j + 1. When i = j + 1, the string x[i . . j] is the empty string.

Languages

Any subset of A∗ is a language on the alphabet A. The product defined on strings is extended to languages as follows:

XY = X · Y = {xy : (x, y) ∈ X × Y}

for all languages X and Y. We extend as well the notion of power: X^0 = {ε} and X^k = X^(k−1)X for k ≥ 1. The star of X is the language

X∗ = ∪(n≥0) X^n,

and we denote by X+ the language

X+ = ∪(n≥1) X^n.

Note that these two latter notations are compatible with the notations A∗ and A+. In order not to overload the notation, a language reduced to a single string can be named by the string itself when it does not lead to any confusion. For instance, the expression A∗abaaab denotes the language of the strings in A∗ having the string abaaab as a suffix, assuming {a, b} ⊆ A.
The notion of length is extended to languages as follows:

|X| = Σ(x∈X) |x|.

In the same way, we define alph(X) by

alph(X) = ∪(x∈X) alph(x)

and X∼ by

X∼ = {x∼ : x ∈ X}.

The sets of factors, prefixes, suffixes and subsequences of the strings of a language X are particular languages that are often considered in the rest of the book; they are respectively denoted by Fact(X), Pref(X), Suff(X) and Subs(X).
The right context of a string y relatively to a language X is the language

y^(−1)X = {y^(−1)x : x ∈ X}.

The equivalence relation defined by the identity of right contexts is denoted by ≡X, or simply² ≡. Thus y ≡ z if and only if y^(−1)X = z^(−1)X, for y, z ∈ A∗. For instance, when A = {a, b} and X = A∗{aba}, the relation ≡ admits four equivalence classes: {ε, b} ∪ A∗{bb}, {a} ∪ A∗{aa, bba}, A∗{ab} and A∗{aba}. For every language X, the relation ≡ is an equivalence relation compatible with concatenation. It is called the right syntactic congruence associated with X.

² As in all the rest of the book, notations are indexed by the object to which they refer only when this object could be ambiguous.
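These notions on occurrences are easy to check mechanically. The short Python sketch below – ours, not part of the book's pseudocode – recomputes the left and right positions of the occurrences of x = aba in y = babaababa given above; since x occurs in y itself, the minimum of the left positions is also pos(x).

def left_positions(x, y):
    """Left positions i on y such that y[i .. i + |x| - 1] = x."""
    m = len(x)
    return [i for i in range(len(y) - m + 1) if y[i:i + m] == x]

x, y = "aba", "babaababa"
lefts = left_positions(x, y)
print(lefts)                             # [1, 4, 6]: left positions
print([i + len(x) - 1 for i in lefts])   # [3, 6, 8]: right positions
print(min(lefts))                        # 1 = pos(aba)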
Regular expressions and languages

The regular expressions on an alphabet A and the languages they describe, the regular languages, are recursively defined as follows:

• 0 and 1 are regular expressions that respectively describe ∅ (the empty set) and {ε};
• for every letter a ∈ A, a is a regular expression that describes the singleton {a};
• if x and y are regular expressions respectively describing the regular languages X and Y, then (x)+(y), (x).(y) and (x)* are regular expressions that respectively describe the regular languages X ∪ Y, X · Y and X∗.

The priority order of operations on the regular expressions is *, ., then +. Usual writing simplifications allow one to omit the symbol . and some pairs of parentheses. The language described by a regular expression x is denoted by Lang(x).

Automata

An automaton M on the alphabet A is composed of a finite set Q of states, of an initial state³ q0, of a set T ⊆ Q of terminal states and of a set F ⊆ Q × A × Q of arcs – or transitions. We denote the automaton M by the quadruplet

(Q, q0, T, F).

³ The standard definition of automata considers a set of initial states rather than a single initial state, as we do in the entire book. We leave the reader to convince himself that it is possible to build a correspondence between any automaton defined in the standard way and an automaton with a single initial state that recognizes the same language.

We say of an arc (p, a, q) that it leaves the state p and that it enters the state q; state p is the source of the arc, letter a its label, and state q its target. The number of arcs outgoing from a given state is called the outgoing degree of the state. The incoming degree of a state is defined in a dual way. By analogy with graphs, the state q is a successor by the letter a of the state p when (p, a, q) ∈ F; in the same case, we say that the pair (a, q) is a labeled successor of the state p.
A path of length n in the automaton M = (Q, q0, T, F) is a sequence of n consecutive arcs

⟨(p0, a0, p0′), (p1, a1, p1′), . . . , (pn−1, an−1, pn−1′)⟩

that satisfies pk′ = pk+1 for k = 0, 1, . . . , n − 2. The label of the path is the string a0a1 . . . an−1, its origin the state p0, and its end the state pn−1′. By convention, there exists for each state p a path of null length with origin and end p; the label of such a path is ε, the empty string. A path in the automaton M is successful if its origin is the initial state q0 and if its end is in T. A string is recognized – or accepted – by the automaton if it is the label of a successful path. The language composed of the strings recognized by the automaton M is denoted by Lang(M).
Often, more than its formal notation, a diagram illustrates how an automaton works. We represent the states by circles and the arcs by arrows directed from source to target, labeled by the corresponding letter. When several arcs have the same source and the same target, we merge the arcs, and the label of the resulting arc becomes an enumeration of the letters. The initial state is distinguished by a short incoming arrow and the terminal states are double circled. An example is shown in Figure 1.2.
A state p of an automaton M = (Q, q0, T, F) is accessible if there exists a path in M starting at q0 and ending in p. A state p is co-accessible if there exists a path in M starting at p and ending in T.
An automaton M = (Q, q0, T, F) is deterministic if for every pair (p, a) ∈ Q × A there exists at most one state q ∈ Q such that (p, a, q) ∈ F. In such a case, it is natural to consider the transition function δ: Q × A → Q of the automaton, defined for every arc (p, a, q) ∈ F by δ(p, a) = q and undefined elsewhere. The function δ is easily extended to strings: it is enough to consider its extension δ̄: Q × A∗ → Q recursively defined by δ̄(p, ε) = p and δ̄(p, wa) = δ(δ̄(p, w), a) for p ∈ Q, w ∈ A∗ and a ∈ A. It follows that the string w is recognized by the automaton M if and only if δ̄(q0, w) ∈ T. Generally, the function δ and its extension δ̄ are denoted in the same way.

Figure 1.2 Representation of an automaton on the alphabet A = {a, b, c}. The states of the automaton are numbered from 0 to 4, its initial state is 0 and its terminal states are 2 and 4. The automaton possesses 3 × 5 = 15 arcs. The language that it recognizes is described by the regular expression (a+b+c)*(aa+aba), i.e. the set of strings on the three-letter alphabet {a, b, c} ending with aa or aba.

The automaton M = (Q, q0, T, F) is complete when for every pair (p, a) ∈ Q × A there exists at least one state q ∈ Q such that (p, a, q) ∈ F.

Proposition 1.1 For every automaton, there exists a deterministic and complete automaton that recognizes the same language.

To complete an automaton is not difficult: it is enough to add a sink state to the automaton, then to make it the target of all undefined transitions. It is a bit more difficult to determinize an automaton, that is, to transform an automaton M = (Q, q0, T, F) into a deterministic automaton recognizing the same language. One can use the so-called method of construction by subsets: let M′ be the automaton whose states are the subsets of Q, whose initial state is the singleton {q0}, whose terminal states are the subsets of Q that intersect T, and whose arcs are the triplets (U, a, V) where V is the set of successors by the letter a of the states p belonging to U; then M′ is a deterministic automaton that recognizes the same language as M. In practical applications, we do not construct the automaton M′ entirely, but only its part accessible from the initial state {q0}.
A language X is recognizable if there exists an automaton M such that X = Lang(M). The statement of a fundamental theorem of automata theory, which establishes the link between recognizable languages and regular languages on a given alphabet, follows.

Theorem 1.2 (Kleene theorem) A language is recognizable if and only if it is regular.

If X is a recognizable language, the minimal automaton of X, denoted by M(X), is determined by the right syntactic congruence associated with X. It is the automaton whose set of states is {w^(−1)X : w ∈ A∗}, whose initial state is X, whose set of terminal states is {w^(−1)X : w ∈ X}, and whose set of arcs is {(w^(−1)X, a, (wa)^(−1)X) : (w, a) ∈ A∗ × A}.

Proposition 1.3 The minimal automaton M(X) of a language X is the automaton having the smallest number of states among the deterministic and complete automata that recognize the language X. The automaton M(X) is the homomorphic image of every automaton recognizing X.

We often say of an automaton that it is minimal even though it is not complete. Actually, such an automaton is indeed minimal if one takes care to add a sink state.
Each state of an automaton, or even sometimes each arc, can be associated with an output: a value or a set of values associated with the state or the arc.
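The construction by subsets lends itself to a direct implementation. The following Python sketch – ours, not the book's pseudocode – keeps the arcs as a set of triplets (p, a, q), as in the definition above, and builds only the accessible part of the determinized automaton, as advocated above. The small input automaton at the end is our own example.

def determinize(q0, T, F):
    """Subset construction restricted to the accessible part.
    F is a set of arcs (p, a, q); returns (states, initial, terminals, delta)."""
    initial = frozenset({q0})
    states, terminals, delta = {initial}, set(), {}
    todo = [initial]
    while todo:
        U = todo.pop()
        if U & set(T):
            terminals.add(U)
        by_letter = {}
        for (p, a, q) in F:          # group the outgoing arcs of U by label
            if p in U:
                by_letter.setdefault(a, set()).add(q)
        for a, V in by_letter.items():
            V = frozenset(V)
            delta[(U, a)] = V
            if V not in states:
                states.add(V)
                todo.append(V)
    return states, initial, terminals, delta

# A non-deterministic automaton recognizing the strings ending with aa:
F = {(0, "a", 0), (0, "b", 0), (0, "a", 1), (1, "a", 2)}
states, initial, terminals, delta = determinize(0, {2}, F)
print(len(states))   # 3 accessible subsets: {0}, {0,1} and {0,1,2}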
1.2 Some combinatorics

We consider the notion of periodicity on strings, for which we give the basic properties. We begin by presenting two families of strings that have interesting periodicities and combinatorial properties with regard to the questions of repeats examined in several chapters.

Some specific strings

Fibonacci numbers are defined by the recurrence:

F0 = 0,
F1 = 1,
Fn = Fn−1 + Fn−2 for n ≥ 2.

These famous numbers satisfy a great many remarkable properties. Among those, we just give two:

• for every natural number n ≥ 2, gcd(Fn, Fn−1) = 1;
• for every natural number n, Fn is the nearest integer to Φ^n/√5, where Φ = (1 + √5)/2 = 1.61803 . . . is the golden ratio.

Fibonacci strings are defined on the alphabet A = {a, b} by the following recurrence:

f0 = ε,
f1 = b,
f2 = a,
fn = fn−1 fn−2 for n ≥ 3.

Note that the sequence of lengths of the strings is exactly the sequence of Fibonacci numbers, that is, Fn = |fn|. Here are the first ten Fibonacci numbers and strings:

  n    Fn    fn
  0    0     ε
  1    1     b
  2    1     a
  3    2     ab
  4    3     aba
  5    5     abaab
  6    8     abaababa
  7    13    abaababaabaab
  8    21    abaababaabaababaababa
  9    34    abaababaabaababaababaabaababaabaab

The interest in Fibonacci strings is that they satisfy many combinatorial properties and contain a large number of repeats.
The de Bruijn strings considered here are defined on the alphabet A = {a, b} and are parameterized by a non-null natural number. A non-empty string x ∈ A+ is a de Bruijn string of order k if each string on A of length k occurs once and only once in x. A first example: ab and ba are the only two de Bruijn strings of order 1. A second example: the string aaababbbaa is a de Bruijn string of order 3, since its factors of length 3 are the eight strings of A^3, that is, aaa, aab, aba, abb, baa, bab, bba and bbb, and each of them occurs exactly once in it.
The existence of a de Bruijn string of order k ≥ 2 can be verified with the help of the automaton defined as follows:

• the states are the strings of the language A^(k−1),
• the arcs are of the form (av, b, vb) with a, b ∈ A and v ∈ A^(k−2);

the initial state and the terminal states are not given (an illustration is shown in Figure 1.3).

Figure 1.3 The order 3 de Bruijn automaton on the alphabet {a, b}. The initial state of the automaton is not given.

We note that exactly two arcs exit each of the states, one labeled by a, the other by b; and that exactly two arcs enter each of the states, both labeled by the same letter. The graph associated with the automaton thus satisfies the Euler condition: the outgoing degree and the incoming degree of each state are identical. It follows that there exists an Eulerian circuit in the graph. Now, let

⟨(u0, a0, u1), (u1, a1, u2), . . . , (un−1, an−1, u0)⟩

be the corresponding path. The string u0 a0 a1 . . . an−1 is a de Bruijn string of order k, since each arc of the path is identified with a factor of length k. It follows in the same way that a de Bruijn string of order k has length 2^k + k − 1 (thus n = 2^k with the previous notation). It can also be verified that the number of de Bruijn strings of order k is exponential in k. The de Bruijn strings are often used as examples of limit cases, in the sense that they contain all the factors of a given length.
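Both families of strings are easy to generate and test. Here is a short Python sketch (ours, not from the book) that produces the Fibonacci strings by their recurrence and checks the de Bruijn property by brute force on the example above.

def fibonacci_strings(n):
    f = ["", "b", "a"]              # f0 = ε, f1 = b, f2 = a
    while len(f) <= n:
        f.append(f[-1] + f[-2])     # fn = fn-1 fn-2
    return f[:n + 1]

f = fibonacci_strings(9)
print(f[5])                   # abaab
print([len(w) for w in f])    # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]

from itertools import product

def is_de_bruijn(x, k, alphabet="ab"):
    """True iff every string of length k on the alphabet occurs exactly once in x."""
    factors = sorted(x[i:i + k] for i in range(len(x) - k + 1))
    return factors == sorted("".join(w) for w in product(alphabet, repeat=k))

print(is_de_bruijn("aaababbbaa", 3))          # True: each factor of A^3 occurs once
print(len("aaababbbaa") == 2 ** 3 + 3 - 1)    # True: length 2^k + k - 1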
Periodicity and borders

Let x be a non-empty string. An integer p such that 0 < p ≤ |x| is called a period of x if

x[i] = x[i + p] for i = 0, 1, . . . , |x| − p − 1.

Note that the length of a non-empty string is a period of this string, so that every non-empty string has at least one period. We can thus define without any ambiguity the period of a non-empty string x as the smallest of its periods. It is denoted by per(x). For instance, 3, 6, 7 and 8 are periods of the string x = aabaabaa, and the period of x is per(x) = 3. We note that if p is a period of x, its multiples kp are also periods of x when k is an integer satisfying 0 < k ≤ ⌊|x|/p⌋.

Proposition 1.4 Let x be a non-empty string and p an integer such that 0 < p ≤ |x|. Then the five following properties are equivalent:
1. The integer p is a period of x.
2. There exist two unique strings u ∈ A∗ and v ∈ A+ and an integer k > 0 such that x = (uv)^k u and |uv| = p.
3. There exist a string t and an integer k > 0 such that x ≼pref t^k and |t| = p.
4. There exist three strings u, v and w such that x = uw = wv and |u| = |v| = p.
5. There exists a string t such that x ≼pref tx and |t| = p.

Proof 1 ⇒ 2: if x = (uv)^k u with v ≠ ε and k > 0, then k is the quotient of the integer division of |x| by p. Now, if a triplet (u′, v′, k′) satisfies the same conditions as the triplet (u, v, k), we have k′ = k and then, due to the equality of lengths, |u′| = |u|. It follows immediately that u′ = u and v′ = v. This shows the uniqueness of the decomposition, if it exists. Now let k and r be respectively the quotient and the remainder of the Euclidean division of |x| by p, and let u and v be the two factors of x defined by u = x[0 . . r − 1] and v = x[r . . p − 1]. Then x = (uv)^k u and |uv| = p. This demonstrates the existence of the triplet (u, v, k) and ends the proof of the property.
2 ⇒ 3: it is enough to consider the string t = uv.
3 ⇒ 4: let w be the suffix of x defined by w = t^(−1)x. As x ≼pref t^k, w is also a prefix of x. Hence the existence of two strings u (= t) and v such that x = uw = wv and |u| = |v| = |t| = p.
4 ⇒ 5: since uw ≼pref uwv, we have x ≼pref tx with |t| = p by simply setting t = u.
5 ⇒ 1: let i be an integer such that 0 ≤ i ≤ |x| − p − 1. Then:

x[i + p] = (tx)[i + p]   (since x ≼pref tx)
         = x[i]          (since |t| = p).

This shows that p is a period of x.

We note in particular that Property 3 can be expressed in a more general way by replacing ≼pref by ≼fact (Exercise 1.4).
A border of a non-empty string x is a proper factor of x that is both a prefix and a suffix of x. Thus ε, a, aa and aabaa are the borders of the string aabaabaa. The notions of borders and of periods are dual, as shown by Property 4 of the previous proposition (see Figure 1.4). The proposition that follows expresses this duality in different terms.

Figure 1.4 Duality between the notions of borders and periods. String aa is a border of string aabaabaa; it corresponds to the period 6 = |aabaabaa| − |aa|.

We introduce the function Border: A∗ → A∗ defined for every non-empty string x by:

Border(x) = the longest border of x.

We say of Border(x) that it is the border of x. For instance, the border of every string of length 1 is the empty string, and the border of the string aabaabaa is aabaa. Also note that, when defined, the border of a border of a given string x is also a border of x.

Proposition 1.5 Let x be a non-empty string and n be the largest integer k for which Border^k(x) is defined (thus Border^n(x) = ε). Then

⟨Border(x), Border^2(x), . . . , Border^n(x)⟩    (1.1)

is the sequence of borders of x in decreasing order of length, and

⟨|x| − |Border(x)|, |x| − |Border^2(x)|, . . . , |x| − |Border^n(x)|⟩    (1.2)

is the sequence of periods of x in increasing order.

Proof We proceed by recurrence on the length of strings. The statement of the proposition is valid when the length of the string x is equal to 1: the sequence of borders is reduced to ⟨ε⟩ and the sequence of periods to ⟨|x|⟩.
Let x be a string of length greater than 1. Every border of x different from Border(x) is a border of Border(x), and conversely. It follows by the recurrence hypothesis that the sequence (1.1) is exactly the sequence of borders of x. Now, if p is a period of x, Proposition 1.4 ensures the existence of three strings u, v and w such that x = uw = wv and |u| = |v| = p. Then w is a border of x and p = |x| − |w|. It follows that the sequence (1.2) is the sequence of periods of x.

Lemma 1.6 (Periodicity lemma) If p and q are periods of a non-empty string x and satisfy

p + q − gcd(p, q) ≤ |x|,

then gcd(p, q) is also a period of x.

Proof By recurrence on max{p, q}. The result is straightforward when p = q = 1 and, more generally, when p = q. We can thus assume in the rest of the proof that p > q.
From Proposition 1.4, the string x can be written both as uy with |u| = p and y a border of x, and as vz with |v| = q and z a border of x.
The quantity p − q is a period of z. Indeed, since p > q, y is a border of x of length strictly less than the length of the border z; thus y is a border of z. It follows that |z| − |y| is a period of z, and |z| − |y| = (|x| − q) − (|x| − p) = p − q.
But q is also a period of z. Indeed, since p > q and gcd(p, q) ≤ p − q, we have q ≤ p − gcd(p, q). On the other hand, we have p − gcd(p, q) = p + q − gcd(p, q) − q ≤ |x| − q = |z|. It follows that q ≤ |z|. This shows that the period q of x is also a period of its factor z.
Moreover, we have (p − q) + q − gcd(p − q, q) = p − gcd(p, q) which, as seen above, is a quantity no more than |z|. We can therefore apply the recurrence hypothesis to max{p − q, q} relatively to the string z, and we obtain that gcd(p, q) is a period of z.
The conditions on p and q (the one of the lemma and gcd(p, q) ≤ p − q) give q ≤ |x|/2. And as x = vz and z is a border of x, v is a prefix of z; its length q is, moreover, a multiple of gcd(p, q). Let t be the prefix of x of length gcd(p, q). Then v is a power of t and z is a prefix of a power of t. It follows then by Proposition 1.4 that x is a prefix of a power of t, and thus that |t| = gcd(p, q) is a period of x, which ends the proof.

To illustrate the periodicity lemma, let us consider a string x that admits both 5 and 8 as periods. If we assume moreover that x is composed of at least two distinct letters, gcd(5, 8) = 1 is not a period of x and, by application of the lemma, the length of x is strictly less than 5 + 8 − gcd(5, 8) = 12. It is the case, for instance, for the four strings of length greater than 7 that are prefixes of the string abaababaaba of length 11. Another illustration of the result is proposed in Figure 1.5.

Figure 1.5 Application of the Periodicity lemma. String abaababaaba of length 11 possesses 5 and 8 as periods. It is not possible to extend it to the left nor to the right while keeping these two periods. Indeed, if 5 and 8 are periods of some string but 1 – the greatest common divisor of 5 and 8 – is not, then this string is of length less than 5 + 8 − gcd(5, 8) = 12.
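The duality between borders and periods can be observed directly. The following naive Python sketch (ours; efficient algorithms for computing borders are the subject of Section 1.6) enumerates the borders and the periods of the running example aabaabaa.

def borders(x):
    """Borders of x in decreasing order of length (naive)."""
    return [x[:k] for k in range(len(x) - 1, -1, -1) if x.endswith(x[:k])]

def periods(x):
    """Periods of x in increasing order (naive test of the definition)."""
    return [p for p in range(1, len(x) + 1)
            if all(x[i] == x[i + p] for i in range(len(x) - p))]

x = "aabaabaa"
print(borders(x))                             # ['aabaa', 'aa', 'a', '']
print(periods(x))                             # [3, 6, 7, 8]
print([len(x) - len(b) for b in borders(x)])  # [3, 6, 7, 8], as in Proposition 1.5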
We wish to show in what follows that one cannot weaken the condition required on the periods in the statement of the periodicity lemma. More precisely, we give examples of strings x that have two periods p and q such that p + q − gcd(p, q) = |x| + 1 but that do not satisfy the conclusion of the lemma. (See also Exercise 1.5.)
Let β: A∗ → A∗ be the function defined by β(uab) = uba for every string u ∈ A∗ and all letters a, b ∈ A.

Lemma 1.7 For every natural number n ≥ 3, β(fn) = fn−2 fn−1.

Proof By recurrence on n. The result is straightforward when 3 ≤ n ≤ 4. If n ≥ 5, we have:

β(fn) = β(fn−1 fn−2)        (by definition of fn)
      = fn−1 β(fn−2)        (since |fn−2| = Fn−2 ≥ 2)
      = fn−1 fn−4 fn−3      (by recurrence hypothesis)
      = fn−2 fn−3 fn−4 fn−3 (by definition of fn−1)
      = fn−2 fn−2 fn−3      (by definition of fn−2)
      = fn−2 fn−1           (by definition of fn−1).

For every natural number n ≥ 3, we define the string gn as the prefix of length Fn − 2 of fn, that is, fn with its last two letters chopped off.

Lemma 1.8 For every natural number n ≥ 6, gn = fn−2^2 gn−3.

Proof We have:

fn = fn−1 fn−2            (by definition of fn)
   = fn−2 fn−3 fn−2       (by definition of fn−1)
   = fn−2 β(fn−1)         (from Lemma 1.7)
   = fn−2 β(fn−2 fn−3)    (by definition of fn−1)
   = fn−2^2 β(fn−3)       (since |fn−3| = Fn−3 ≥ 2).

The stated result immediately follows.

Lemma 1.9 For every natural number n ≥ 3, gn ≼pref fn−1^2 and gn ≼pref fn−2^3.

Proof We have:

gn ≼pref fn fn−3          (since gn ≼pref fn)
       = fn−1 fn−2 fn−3   (by definition of fn)
       = fn−1^2           (by definition of fn−1).

The second relation is valid when 3 ≤ n ≤ 5. When n ≥ 6, we have:

gn = fn−2^2 gn−3           (from Lemma 1.8)
   ≼pref fn−2^2 fn−3 fn−4  (since gn−3 ≼pref fn−3)
   = fn−2^3                (by definition of fn−2).

Now, let n be a natural number, n ≥ 5, so that the string gn is both defined and of length greater than 2. It follows then:

|gn| = Fn − 2             (by definition of gn)
     = Fn−1 + Fn−2 − 2    (by definition of Fn)
     ≥ Fn−1               (since Fn−2 ≥ 2).

It results from this inequality, from Lemma 1.9 and from Proposition 1.4 that Fn−1 and Fn−2 are two periods of gn. In addition, note that since gcd(Fn−1, Fn−2) = 1, we also have:

Fn−1 + Fn−2 − gcd(Fn−1, Fn−2) = Fn − 1 = |gn| + 1.

Thus, if the conclusion of the periodicity lemma applied to the string gn and its two periods Fn−1 and Fn−2, gn would be the power of a string of length 1. But the first two letters of gn are distinct. This indicates that the condition of the periodicity lemma is, in some sense, optimal.
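The optimality example can also be checked numerically. The Python sketch below (ours) verifies that g7 = abaababaaba, of length F7 − 2 = 11, has periods F6 = 8 and F5 = 5 but not their greatest common divisor 1.

def has_period(x, p):
    return all(x[i] == x[i + p] for i in range(len(x) - p))

g7 = "abaababaabaab"[:-2]        # f7 deprived of its last two letters
print(g7, len(g7))               # abaababaaba 11
print(has_period(g7, 5), has_period(g7, 8), has_period(g7, 1))
# True True False: 5 + 8 - gcd(5, 8) = 12 = |g7| + 1 just misses the lemma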
Powers, primitivity and conjugacy

Lemma 1.10 Let x and y be two strings. If there exist two non-null natural numbers m and n such that x^m = y^n, then x and y are powers of some string z.

Proof It is enough to show the result in the non-trivial case where neither x nor y is empty. Two sub-cases can then be distinguished, depending on whether min{m, n} is equal to 1 or not. If min{m, n} = 1, it is sufficient to consider the string z = y if m = 1, and z = x if n = 1. Otherwise min{m, n} ≥ 2. We then note that |x| and |y| are periods of the string t = x^m = y^n that satisfy the condition of the periodicity lemma: |x| + |y| − gcd(|x|, |y|) ≤ |x| + |y| − 1 < |t|. Thus it is sufficient to consider the string z defined as the prefix of t of length gcd(|x|, |y|) to get the stated result.

A non-empty string is primitive if it is not the power of any other string. In other words, a string x ∈ A+ is primitive if and only if every decomposition of the form x = u^n with u ∈ A∗ and n ∈ N implies n = 1, and thus u = x. For instance, the string abaab is primitive, while the strings ε and bababa = (ba)^3 are not.

Lemma 1.11 (Primitivity lemma) A non-empty string is primitive if and only if it is a factor of its square only as a prefix and as a suffix. In other words, for every non-empty string x:

x is primitive if and only if yx ≼pref x^2 implies y = ε or y = x.

An illustration of this result is proposed in Figure 1.6.

Figure 1.6 Application of the Primitivity lemma. (a) String x = abbaba does not possess any “non-trivial” occurrence in its square x^2 – i.e., an occurrence that is neither a prefix nor a suffix of x^2 – since x is primitive. (b) String x = ababab possesses a “non-trivial” occurrence in its square x^2 since x is not primitive: x = (ab)^3.

Proof If x is a non-empty non-primitive string, there exist z ∈ A+ and n ≥ 2 such that x = z^n. Since x^2 can be decomposed as z · z^n · z^(n−1), the string x occurs at position |z| on x^2. This shows that every non-empty non-primitive string is a factor of its square without being only a prefix and a suffix of it.
Conversely, let x be a non-empty string whose square x^2 can be written as yxz with y, z ∈ A+. Due to the length condition, it first follows that |y| < |x|. Then, since x ≼pref yx, we obtain from Proposition 1.4 that |y| is a period of x. Thus |x| and |y| are periods of yx. From the periodicity lemma, we deduce that p = gcd(|x|, |y|) is also a period of yx. Now, as p ≤ |y| < |x|, p is also a period of x. And as p divides |x|, we deduce that x is of the form t^n with |t| = p and n ≥ 2. This shows that the string x is not primitive.

Another way of stating the previous lemma is that the primitivity of x is equivalent to saying that per(x^2) = |x|.
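The Primitivity lemma yields an immediate test of primitivity: it suffices to search for x in x^2 from position 1 and to check that the first occurrence found is at position |x|. A Python sketch (ours):

def is_primitive(x):
    """Primitivity test based on the Primitivity lemma."""
    return len(x) > 0 and (x + x).find(x, 1) == len(x)

print(is_primitive("abaab"))    # True
print(is_primitive("ababab"))   # False: ababab = (ab)^3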
Proposition 1.12 For every non-empty string, there exists one and only one primitive string of which it is a power.

Proof The existence comes from a trivial recurrence on the length of strings. It remains to show the uniqueness. Let x be a non-empty string. If we assume that x = u^m = v^n for two primitive strings u and v and two non-null natural numbers m and n, then u and v are necessarily powers of a string z ∈ A+ by Lemma 1.10. But their primitivity implies z = u = v, which shows the uniqueness and ends the proof.

If x is a non-empty string, we say of the primitive string z of which x is a power that it is the root of x, and of the natural number n such that x = z^n that it is the exponent⁴ of x.

⁴ More generally, the exponent of x is the quantity |x|/per(x), which is not necessarily an integer (see Exercise 9.2).

Two strings x and y are conjugate if there exist two strings u and v such that x = uv and y = vu. For instance, the strings abaab and ababa are conjugate. It is clear that conjugacy is an equivalence relation. It is, however, not compatible with the product.

Proposition 1.13 Two non-empty strings are conjugate if and only if their roots are.

Proof The converse implication is immediate. For the proof of the direct implication, we consider two non-empty conjugate strings x and y, and we denote by z and t their roots and by m and n their exponents, respectively. Since x and y are conjugate, there exist z′, z″ ∈ A+ and p, q ∈ N such that z = z′z″, x = z^p z′ · z″z^q, y = z″z^q · z^p z′ and m = p + q + 1. We deduce that y = (z″z′)^m. Now, as t is primitive, Lemma 1.10 implies that z″z′ is a power of t. This shows the existence of a non-null natural number k such that |z| = k|t|. Symmetrically, there exists a non-null natural number ℓ such that |t| = ℓ|z|. It follows that k = ℓ = 1, hence that |t| = |z|, and then that t = z″z′. This shows that the roots z and t are conjugate.

Proposition 1.14 Two non-empty strings x and y are conjugate if and only if there exists a string z such that xz = zy.

Proof ⇒: x and y can be decomposed as x = uv and y = vu with u, v ∈ A∗; then the string z = u suits, since xz = uvu = zy.
⇐: in the non-trivial case where z ∈ A+, we obtain by an immediate recurrence that x^k z = zy^k for every k ∈ N. Let n be the (non-null) natural number such that (n − 1)|x| ≤ |z| < n|x|. There exist then u, v ∈ A∗ such that x = uv, z = x^(n−1)u and vz = y^n. It follows that y^n = vx^(n−1)u = (vu)^n. Finally, since |y| = |x|, we have y = vu, which shows that x and y are conjugate.
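Conjugacy also admits a one-line test, which follows directly from the definition: if x = uv and y = vu, then y occurs in x^2 at position |u| and, conversely, every factor of length |x| of x^2 is a conjugate of x. A Python sketch (ours):

def are_conjugate(x, y):
    """Conjugacy test: y must be a factor of x^2 of the same length as x."""
    return len(x) == len(y) and y in x + x

print(are_conjugate("abaab", "ababa"))   # True, as in the example above
print(are_conjugate("abaab", "aabba"))   # False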
1.3 Algorithms and complexity

In this section, we present the algorithmic elements used in the rest of the book. They include the writing conventions, the evaluation of the complexity of algorithms, and some standard objects.

Writing conventions of algorithms

The style of the algorithmic language used here is relatively close to real programming languages, but at a higher abstraction level. We adopt the following conventions:

• Indentation marks the structure of blocks inherent to compound instructions.
• Lines of code are numbered in order to be referenced in the text.
• The symbol ⊲ introduces a comment.
• The access to a specific attribute of an object is signified by the name of the attribute followed by the identifier associated with the object between brackets.
• A variable that represents a given object (table, queue, tree, string, automaton) is a pointer to this object.
• The arguments given to procedures or to functions are managed by the “call by value” rule.
• Variables of procedures and of functions are local to them unless otherwise mentioned.
• The evaluation of boolean expressions is performed from left to right, in a lazy way.

We consider, following the example of a language like the C language, the iterative instruction do-while – used instead of the traditional instruction repeat-until – and the instruction break, which produces the termination of the innermost loop in which it is located.
Well adapted to the sequential processing of strings, we use, for every string u, the formulation:

1  for each letter a of u, sequentially do
2      processing of a

It means that the letters u[i], i = 0, 1, . . . , |u| − 1, composing u are processed one after the other in the body of the loop: first u[0], then u[1], and so on. Since the length of the string u is not necessarily known in advance, the end of the loop can be detected by a marker that ends the string. In the case where the length of the string u is known, this formulation is equivalent to a formulation of the type:

1  for i ← 0 to |u| − 1 do
2      a ← u[i]
3      processing of a

where the integer variable i is free (its use does not produce any conflict with the environment).

Pattern matching algorithms

A pattern represents a non-empty language not containing the empty string. It can be described by a string, by a finite set of strings, or by other means. The pattern matching problem is to search for occurrences of strings of the language in other strings – or in texts, to be less formal. The notions of occurrence, of appearance and of position on strings are extended to patterns.
According to the specified problem, the input of a pattern matching algorithm is a string x or a language X and a text y, together or not with their lengths. The output can take several forms. Here are some of them:

• Boolean values: to implement an algorithm that tests whether the pattern occurs in the text or not, without specifying the positions of the possible occurrences, the output is simply the boolean value true in the first situation and false in the second.
• A string: during a sequential search, it is appropriate to produce a string ȳ on the alphabet {0, 1} that encodes the right positions of occurrences. The string ȳ is such that |ȳ| = |y| and ȳ[i] = 1 if and only if i is the right position of an occurrence of the pattern on y.
• A set of positions: the output can also take the form of the set P of left – or right – positions of occurrences of the pattern on y.

Let e be a predicate having value true if and only if an occurrence has just been detected. A function corresponding to the first form, and ending as soon as an occurrence is detected, should integrate in its code an instruction

1  if e then
2      return true

in the heart of its searching process, and return the value false at the termination of this process. The second form needs to initialize the variable ȳ with ε, the empty string, then to modify its value by an instruction

1  if e then
2      ȳ ← ȳ · 1
3  else ȳ ← ȳ · 0

and to return it at the termination. It is identical for the third form, where the set P is initially emptied, then augmented by an instruction

1  if e then
2      P ← P ∪ {the current position on y}

and finally returned. In order to present only one variant of the code for an algorithm, we consider the following special instruction:

Output-if(e)

It means that, at the location where it appears, an occurrence of the pattern at the current position on the text is detected when the predicate e has value true.

Expression of complexity

The model of computation for the evaluation of the complexity of algorithms is the standard random access machine model.
In a general way, the complexity of an algorithm is an expression in the input size. This includes the length of the language represented by the pattern, the length of the string in which the search is performed, and the size of the alphabet. We assume that the letters of the alphabet are of size comparable to the machine word size and, consequently, that the comparison between two letters is an elementary operation that is performed in constant time. We assume that every instruction Output-if(e) is executed in constant time⁵ once the predicate e has been evaluated.

⁵ Actually, we can always come down to this, even when the language represented by the pattern is not reduced to a single string. For that, it is sufficient to produce only a descriptor – previously computed – of the set of strings that occur at the current position (instead, for instance, of producing explicitly the set of strings). It then remains to use a tool that develops this information if necessary.

We use the notation recommended by Knuth [72] to express orders of magnitude. Let f and g be two functions from N to N. We write “f(n) is O(g(n))” to mean that there exist a constant K and a natural number n0 such that f(n) ≤ Kg(n) for every n ≥ n0. In a dual way, we write “f(n) is Ω(g(n))” if there exist a constant K and a natural number n0 such that f(n) ≥ Kg(n) for every n ≥ n0. We finally write “f(n) is Θ(g(n))” to mean that f and g are of the same order, that is to say, that f(n) is both O(g(n)) and Ω(g(n)).
The function f: N → N is linear if f(n) is Θ(n), quadratic if f(n) is Θ(n^2), cubic if f(n) is Θ(n^3), logarithmic if f(n) is Θ(log n), and exponential if there exists a > 0 for which f(n) is Θ(a^n). We say that a function with two parameters f: N × N → N is linear when f(m, n) is Θ(m + n), and quadratic when f(m, n) is Θ(m × n).
Some standard objects

Queues, states and automata are objects often used in the rest of the book. Without telling what their true implementations are – they can actually differ from one algorithm to the other – we indicate the minimal attributes and operations defined on these objects.
For queues, we only describe the basic operations:

Empty-Queue() creates then returns an empty queue.
Queue-is-empty(F) returns true if the queue F is empty, and false otherwise.
Enqueue(F, x) adds the element x to the tail of the queue F.
Head(F) returns the element located at the head of the queue F.
Dequeue(F) deletes the element located at the head of the queue F.
Dequeued(F) deletes the element located at the head of the queue F, then returns it.
Length(F) returns the length of the queue F.

States are objects that possess at least the two attributes terminal and Succ. The first attribute indicates whether the state is terminal or not, and the second is an implementation of the set of labeled successors of the state. The attribute corresponding to an output of a state is denoted by output. The two standard operations on states are the functions New-state and Target. While the first creates then returns a non-terminal state with an empty set of labeled successors, the second returns the target of an arc given the source and the label of the arc, or the special value nil if such an arc does not exist. The code for these two functions can be written in a few lines:

New-state()
1  allocate an object p of type state
2  terminal[p] ← false
3  Succ[p] ← ∅
4  return p

Target(p, a)
1  if there exists a state q such that (a, q) ∈ Succ[p] then
2      return q
3  else return nil

The objects of type automaton possess at least the attribute initial, which specifies the initial state of the automaton. The function New-automaton creates then returns an automaton with a single state; this state constitutes its initial state and has an empty set of labeled successors. The corresponding code is the following:

New-automaton()
1  allocate an object M of type automaton
2  q0 ← New-state()
3  initial[M] ← q0
4  return M
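To fix the ideas, here is a possible Python transcription of these objects – ours, the book's code above being pseudocode. The set Succ of labeled successors is kept as a dictionary from letters to target states.

class State:
    def __init__(self):
        self.terminal = False   # as in New-state: non-terminal by default
        self.succ = {}          # labeled successors: letter -> state

def new_state():
    return State()

def target(p, a):
    """Target of the arc (p, a, .), or None when the arc does not exist."""
    return p.succ.get(a)

class Automaton:
    def __init__(self):
        self.initial = new_state()   # a single, initial, non-terminal state

Adding an arc (p, a, q) then simply amounts to the assignment p.succ[a] = q.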
1.4 Implementation of automata

Some pattern matching algorithms rely on specific implementations of the deterministic automata they consider. This section details several methods, including the data structures and the algorithms, that can be used to implement these objects in memory.
Implementing a deterministic automaton (Q, q0, T, F) consists of setting in memory either the set F of its arcs, or the sets of the labeled successors of its states, or its transition function δ. These are equivalent problems that fit in the general framework of representing partial functions (Exercise 1.15). We distinguish two families of implementations:

• the family of full implementations, in which all the transitions are represented;
• the family of reduced implementations, which use more or less elaborate compression techniques in order to reduce the memory space of the representation.

The choice of the implementation influences the time necessary to compute a transition, that is, to execute Target(p, a) for a state p ∈ Q and a letter a ∈ A. This computation time is called the delay, since it also measures the time necessary for going from the current letter of the input to the next letter. Typically, two models can be opposed:

• The branching model, in which δ is implemented with a Q × A matrix and where the delay is constant (in the random access model).
• The comparisons model, in which the elementary operation is the comparison of letters and where the delay is typically O(log card A) when any two letters can be compared in one unit of time (general assumption formulated in Section 1.3).

We also consider in the next section an elementary technique known as the “bit-vector model”, whose application scope is restricted: it is especially interesting when the size of the automaton is very small. For each of the implementation families, we specify the orders of magnitude of the necessary memory space and of the delay. There is always a trade-off to be found between these two quantities.

Full implementations

The simplest method for implementing the function δ is to store its values in a Q × A matrix, known as the transition matrix of the automaton (an illustration is given in Figure 1.7). It is a method of choice for a deterministic complete automaton on an alphabet of relatively small size, when the letters can be identified with indices on a table. Computing a transition then reduces to a mere table look-up.

       a  b  c
  0    1  0  0
  1    2  3  0
  2    2  3  0
  3    4  0  0
  4    2  3  0

Figure 1.7 The transition matrix of the automaton of Figure 1.2.

Proposition 1.15 In an implementation by transition matrix, the necessary memory space is O(card Q × card A) and the delay is O(1).

In the case where the automaton is not complete, the representation remains correct, except that the execution of the automaton on the input text can now stop on an undefined transition. The matrix can be initialized in time O(card F) only, if we implement partial functions as proposed in Exercise 1.15. The complexities stated above, for the memory space as well as for the delay, remain valid.
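In Python, the transition matrix of Figure 1.7 and the corresponding constant-delay transition computation can be written as follows (a sketch, ours):

RANK = {"a": 0, "b": 1, "c": 2}    # letters identified with indices

DELTA = [        #  a  b  c
    [1, 0, 0],   # state 0
    [2, 3, 0],   # state 1
    [2, 3, 0],   # state 2
    [4, 0, 0],   # state 3
    [2, 3, 0],   # state 4
]

def target(p, a):
    return DELTA[p][RANK[a]]   # a mere table look-up: delay O(1)

print(target(3, "a"))   # 4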
An automaton can also be implemented by means of an adjacency matrix, as is classical for graphs. We then associate with each letter of the alphabet a boolean Q × Q matrix. This representation is in general not adapted to the applications developed in this book. It is, however, related to the method that follows.
The method by list of transitions consists in implementing a list of the triplets (p, a, q) that are the arcs of the automaton. The required space is only O(card F). Having done this, we assume that this list is stored in a hash table, in order to allow a fast computation of the transitions. The corresponding hash function is defined on the pairs (p, a) ∈ Q × A. Given a pair (p, a), the access to the transition (p, a, q), if it is defined, is done in average constant time, with the usual assumptions specific to this type of technique.
These first types of representations implicitly assume that the alphabet is fixed and known in advance, which opposes them to the representations in the comparisons model considered by the method described below.
The method by sets of labeled successors consists in using a table t indexed on Q, in which each element t[p] gives access to an implementation of the set of the labeled successors of the state p. The required space is O(card Q + card F). This method is valuable even when the only authorized operation on letters is the comparison. Denoting by s the maximum outgoing degree of the states, the delay is O(log s) if we use an efficient implementation of the sets of labeled successors.

Proposition 1.16 In an implementation by sets of labeled successors, the space requirement is O(card Q + card F) and the delay is O(log s), where s is the maximum outgoing degree of the states.

Note that the delay is also O(log card A) in this case: indeed, since the automaton is assumed to be deterministic, the outgoing degree of each of the states is no more than card A, thus s ≤ card A with the notation used above.

Reduced implementations

When the automaton is complete, the space complexity can be reduced by considering a successor by default for the computation of the transitions from any given state – the state occurring the most often in a set of labeled successors is the best possible candidate for being the successor by default. The delay can also be reduced, since the size of the sets of labeled successors becomes smaller. For pattern matching problems, the choice of the initial state as successor by default suits perfectly. Figure 1.8 shows an example where short gray arrows mean that the state possesses the initial state as successor by default.

Figure 1.8 Reduced implementation by addition of successors by default. We consider the automaton of Figure 1.2 and we choose the initial state as unique successor by default (this choice perfectly suits pattern matching problems). States that admit the initial state as successor by default (here, actually all of them) are indicated by a short gray arrow. For example, the target of the transition from state 3 by the letter a is state 4, and by every other letter, here b or c, it is the initial state 0.

Another method to reduce the implementation space consists in using a failure function. The idea is here to reduce the space necessary for implementing the automaton by redirecting, in most cases, the computation of the transition from the current state to the computation of the transition from another state, but by the same letter. This technique serves to implement deterministic automata in the comparisons model. Its principal advantage is – generally – to provide linear-size representations and to simultaneously obtain a linear-time computation of series of transitions, even when the computation of a single transition cannot be done in constant time.
Formally, let γ: Q × A → Q and f: Q → Q be two functions. We say that the pair (γ, f) represents the transition function δ of a complete automaton if and only if γ is a sub-function of δ, f defines an order on the elements of Q, and, for every pair (p, a) ∈ Q × A,

δ(p, a) = γ(p, a)      if γ(p, a) is defined,
δ(p, a) = δ(f(p), a)   otherwise.

When it is defined, we say of the state f(p) that it is the failure state of the state p. We say of the functions γ and f that they are respectively, and jointly, a sub-transition and a failure function of δ.
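The computation of δ from a sub-transition γ and a failure function f is a simple loop. In the Python sketch below (ours), the concrete values of gamma and fail are our reconstruction of the representation of Figure 1.9; they are consistent with the caption that follows and with the automaton of Figure 1.2.

gamma = {(0, "a"): 1, (0, "b"): 0, (0, "c"): 0,
         (1, "a"): 2, (1, "b"): 3, (3, "a"): 4}
fail = {1: 0, 2: 1, 3: 0, 4: 1}    # state -> failure state

def delta(p, a):
    while (p, a) not in gamma:
        p = fail[p]        # fail defines an order on Q, so the loop terminates
    return gamma[(p, a)]

print(delta(4, "c"))   # 0: computed via states 1 then 0, as in the caption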
Figure 1.9 Reduced implementation by addition of a failure function, on the example of the automaton of Figure 1.2. (a) A failure function, given under the form of a directed graph. As this graph does not possess any cycle, the function defines an order on the set of states. (b) The corresponding reduced automaton. Each state–failure state link is indicated by a dashed arrow. The computation of the transition from state 4 by the letter c is referred to state 1, then to state 0. State 0 is indeed the first among states 4, 1 and 0, in this order, to possess a transition defined by c. Finally, the target of the transition from state 4 by c is state 0.

We indicate the state–failure state link by a directed dashed arrow in figures (see the example in Figure 1.9). The space needed to represent the function δ by the functions γ and f is O(card Q + card F′) in the case of an implementation by sets of labeled successors, where

F′ = {(p, a, q) ∈ F : γ(p, a) is defined}.

Note that γ is the transition function of the automaton (Q, q0, T, F′).

A complete example

The method presented here is a combination of the previous ones, together with a fast computation of transitions and a compact representation of transitions due to the joint use of tables and of a failure function. It is known as “compression of the transition table”. Two extra attributes, fail and base, are added to states; the first has values in Q and the second in N. We also consider two tables indexed by N and with values in Q: target and control. For each pair (p, a) ∈ Q × A, base[p] + rank[a] is an index on both target and control, where rank denotes the function that associates with every letter of A its rank in a fixed ordered sequence of the letters of A. The computation of the successor of a state p ∈ Q by a letter a ∈ A proceeds as follows:

1. If control[base[p] + rank[a]] = p, then target[base[p] + rank[a]] is the target of the arc of source p labeled by a.
2. Otherwise the process is repeated recursively on the state fail[p] and the letter a (assuming that fail is a failure function).

The (non-recursive) code of the corresponding function follows:

Target-by-compression(p, a)
1  while control[base[p] + rank[a]] ≠ p do
2      p ← fail[p]
3  return target[base[p] + rank[a]]

In the worst case, the space required by this implementation is O(card Q × card A) and the delay is O(card Q). In the best case, the method reduces the space to O(card Q + card A) with a constant delay.
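As an illustration of the principle – with hypothetical data of our own, not taken from the book – the following Python sketch packs a two-state automaton in which δ(p, a) = 1 and δ(p, b) = 0 for both states; state 1 owns no slot of its own and fails to state 0.

RANK = {"a": 0, "b": 1}
base = [0, 0]          # base[p] + RANK[a] indexes control and target
control = [0, 0]       # control[i] tells which state owns slot i
target = [1, 0]
fail = [None, 0]       # failure function; state 0 never fails

def target_by_compression(p, a):
    while control[base[p] + RANK[a]] != p:
        p = fail[p]
    return target[base[p] + RANK[a]]

print(target_by_compression(1, "a"))   # 1, found after one failure link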
During this phase of test, the algorithm acquires some information on the text, which can be exploited in two ways:

• to set up the length of the next shift of the window, according to rules that are specific to the algorithm;
• to avoid comparisons during the next attempts, by memorizing a part of the collected information.

When the shift slides the window from the position j to the position j + d (d ≥ 1), we say that the shift is of length d. To correctly solve the given problem, a shift of length d for an attempt at the position j must be valid, that is, it must ensure that, when d ≥ 2, there is no occurrence of the searched string x at positions j + 1 to j + d − 1 on the text y.

The naive algorithm

The simplest implementation of the sliding window mechanism is the so-called naive algorithm. The strategy consists here in considering a window of length m and in sliding it one position to the right after each attempt. This leads, if the comparison of the content of the window and of the string is correctly implemented, to an obviously correct algorithm. We give below the code of the algorithm. The variable j corresponds to the left position of the window on the text. The comparison of the strings at Line 2 is supposed to be performed letter by letter according to a pre-established order.

Naive-search(x, m, y, n)
1 for j ← 0 to n − m do
2     Output-if(y[j . . j + m − 1] = x)

In the worst case, the algorithm Naive-search executes in time Θ(m × n), as for instance when x and y are powers of the same letter. In the average case, its behavior is rather good, as the following proposition indicates. (Even when the patterns and the texts considered in practice have no reason to be random, the average case expresses what one can expect of a given pattern matching algorithm.)

Proposition 1.17
With the double assumption of an alphabet non-reduced to a single letter and of a uniform and independent distribution of the letters of the alphabet, the average number of comparisons of letters performed by the operation Naive-search(x, m, y, n) is Θ(n − m).

Proof
Let c be the size of the alphabet. The number of comparisons of letters necessary to determine whether two strings u and v of length m are identical is on average 1 + 1/c + · · · + 1/c^{m−1}, independently of the order in which the positions of the compared letters are considered. When c ≥ 2, this quantity is less than 1/(1 − 1/c), which is itself no more than 2. It follows that the average number of comparisons of letters counted during the execution of the operation is less than 2(n − m + 1). The result follows since, moreover, at least n − m + 1 comparisons are performed.

Heuristics

Some elementary processes noticeably improve the global behavior of pattern matching algorithms. We detail here some of the most significant ones. They are described in connection with the naive algorithm, but most of the other algorithms can include them in their code, the adaptation being more or less easy. We speak of heuristics since we are not able to formally measure their contribution to the complexity of the algorithms.

When locating all the occurrences of the string x in the text y by the naive method, we can start by locating the occurrences of its first letter, x[0], in the prefix y[0 . . n − m] of y. It then remains to test, for each occurrence of x[0] at a position j on y, the possible identity between the two strings x[1 . . m − 1] and y[j + 1 . . j + m − 1].
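For instance, a sketch in C of this first-letter heuristic, delegating the search for x[0] to the library function memchr (the code and its names are ours):

    #include <stdio.h>
    #include <string.h>

    void first_letter_search(const char *x, size_t m,
                             const char *y, size_t n)
    {
        if (m == 0 || m > n)
            return;
        const char *p    = y;
        const char *last = y + (n - m);      /* last possible position  */
        while (p <= last
               && (p = memchr(p, x[0], (size_t)(last - p) + 1)) != NULL) {
            if (memcmp(p + 1, x + 1, m - 1) == 0)
                printf("occurrence at position %td\n", p - y);
            p++;                             /* next candidate position */
        }
    }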
As the search for the occurrences of a single letter is generally realized by a low-level operation of the operating system, the reduction of the computation time is often noticeable in practice. This elementary search can still be improved in two ways:

• by positioning x[0] as a sentinel at the end of the text y, in order to test the end of the text less frequently;
• by searching, not necessarily for x[0], but for the letter of x that has the smallest frequency of appearance in the texts of the family of y.

It should be noted that the first technique assumes that such an alteration of the memory is possible and that it can be performed in constant time. For the second, besides the necessity of knowing the frequencies of the letters, the choice of the position of the distinguished letter requires a precomputation on x.

A different process consists in applying a shift that takes into account only the value of the rightmost letter of the window. Let j be the right position of the window. Two antagonistic cases can be envisaged, whether or not the letter y[j] occurs in x[0 . . m − 2]:

• in the case where y[j] does not occur in x[0 . . m − 2], the string x cannot occur at right positions j + 1 to j + m − 1 on y;
• in the other case, if k is the maximal position of an occurrence of the letter y[j] on x[0 . . m − 2], the string x cannot occur at right positions j + 1 to j + m − 2 − k on y.

Thus the valid shifts to apply in the two cases have lengths m for the first, and m − 1 − k for the second. Note that they depend only on the letter y[j] and in no way on its position j on y. To formalize the previous observation, we introduce the table

    last-occ: A → {1, 2, . . . , m}

defined for every letter a ∈ A by

    last-occ[a] = min({m} ∪ {m − 1 − k : 0 ≤ k ≤ m − 2 and x[k] = a}).

We call last-occ the table of the last occurrence. It expresses a valid shift, last-occ[y[j]], to apply after the attempt at the right position j on y. An illustration is proposed in Figure 1.11.

Figure 1.11 Shift of the sliding window with the table of the last occurrence, last-occ, when x = bbcaac. (a) The values of the table last-occ on the alphabet A = {a, b, c, d}: last-occ[a] = 1, last-occ[b] = 4, last-occ[c] = 3 and last-occ[d] = 6. (b) The window on the text y is at right position 8. The letter at this position, y[8] = b, occurs at the maximal position k = 1 on x[0 . . |x| − 2]. A valid shift consists in sliding the window by |x| − 1 − k = 4 = last-occ[b] positions to the right.

The code for the computation of last-occ follows. It executes in time Θ(m + card A).

Last-occurrence(x, m)
1 for each letter a ∈ A do
2     last-occ[a] ← m
3 for k ← 0 to m − 2 do
4     last-occ[x[k]] ← m − 1 − k
5 return last-occ

We give now the complete code of the algorithm Fast-search obtained from the naive algorithm by adding the table last-occ.

Fast-search(x, m, y, n)
1 last-occ ← Last-occurrence(x, m)
2 j ← m − 1
3 while j < n do
4     Output-if(y[j − m + 1 . . j] = x)
5     j ← j + last-occ[y[j]]

If the comparison of the strings at Line 4 starts at position m − 1, the searching phase of the algorithm Fast-search executes in time Θ(n/m) in the best case, as for instance when no letter of y at a position congruent to m − 1 modulo m occurs in x: in this case, a single comparison between letters is performed during each attempt and the shift is always equal to m. (Note that this is the best possible case for an algorithm detecting a string of length m in a text of length n: at least ⌊n/m⌋ letters of the text must be inspected before the non-appearance of the searched string can be determined.) The behavior of the algorithm on natural language texts is very good.
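A possible C transcription of Last-occurrence and Fast-search, taking the 256 byte values as the alphabet (an assumption of ours; the comparison is done here with memcmp, thus from left to right, which does not affect correctness):

    #include <stdio.h>
    #include <string.h>
    #include <limits.h>

    void last_occurrence(const unsigned char *x, size_t m,
                         size_t last_occ[UCHAR_MAX + 1])
    {
        for (int a = 0; a <= UCHAR_MAX; a++)
            last_occ[a] = m;                    /* default shift: m     */
        for (size_t k = 0; k + 2 <= m; k++)     /* k = 0, 1, ..., m - 2 */
            last_occ[x[k]] = m - 1 - k;
    }

    void fast_search(const unsigned char *x, size_t m,
                     const unsigned char *y, size_t n)
    {
        size_t last_occ[UCHAR_MAX + 1];
        if (m == 0)
            return;
        last_occurrence(x, m, last_occ);
        /* j is the right position of the window; the shift after an
           attempt is last_occ[y[j]], a valid shift as shown above. */
        for (size_t j = m - 1; j < n; j += last_occ[y[j]])
            if (memcmp(y + j - m + 1, x, m) == 0)
                printf("occurrence at position %zu\n", j - m + 1);
    }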
One can show however that, in the average case (with the double assumption of Proposition 1.17 and for a set of strings having the same length), the number of comparisons per text letter is asymptotically lower bounded by 1/card A, a bound which is independent of the length of the pattern.

Search engine

Some automata can serve as a search engine for the online processing of texts. We describe in this part two algorithms based on an automaton for locating patterns. We assume that the automata are given; Chapter 2 presents the construction of some of these automata.

Let us consider a pattern X ⊆ A∗ and a deterministic automaton M that recognizes the language A∗X (Figure 1.12(a) displays an example). The automaton M recognizes the strings that have a string of X as a suffix. For locating the strings of X that occur in a text y, it is sufficient to run the automaton M on the text y. When the current state is terminal, this means that the current prefix of y – the part of y already parsed by the automaton – belongs to A∗X, or, in other words, that the current position on y is the right position of an occurrence of a string of X. This remark leads to the algorithm whose code follows. An illustration of how the algorithm works is presented in Figure 1.12(b).

Det-search(M, y)
1 r ← initial[M]
2 for each letter a of y, sequentially do
3     r ← Target(r, a)
4     Output-if(terminal[r])

Proposition 1.18
When M is a deterministic automaton that recognizes the language A∗X for a pattern X ⊆ A∗, the operation Det-search(M, y) locates all the occurrences of strings of X in the text y ∈ A∗.

Proof
Let δ be the transition function of the automaton M. As the automaton is deterministic, it follows immediately that

    r = δ(initial[M], u),    (1.3)

where u is the current prefix of y, is satisfied after the execution of each of the instructions of the algorithm.

Figure 1.12 Search for the occurrences of a pattern with a deterministic automaton (see also Figure 1.13). (a) With alphabet A = {a, b, c} and pattern X = {ab, babb, bb}, the deterministic automaton represented here recognizes the language A∗X. The gray arrows exiting each state stand for arcs having for source these same states, for target the initial state 0, and for labels the letters that are not already present. To locate the occurrences of strings of X in a text y, it is sufficient to operate the automaton on y and to indicate each time that a terminal state is reached. (b) Parsing example with y = cbabba: starting from the initial state 0, the current state r takes successively the values 0, 3, 4, 5 and 6 after the letters c, b, a, b and b, then 4 after the final a; terminal states are reached at positions 3 (occurrence of ab) and 4 (occurrences of babb and bb). From the utilization of the automaton, it follows that there is at least one occurrence of a string of X at positions 3 and 4 on y, and none at the other positions.

If an occurrence of a string of X ends at the current position, the current prefix u belongs to A∗X. Thus, by definition of M and after Property (1.3), the current state r is terminal. As the initial state is not terminal (since ε ∉ X), it follows that the operation signals this occurrence. Conversely, assume that an occurrence has just been signalled. The current state r is thus terminal, which, after Property (1.3) and by definition of M, implies that the current prefix u belongs to A∗X.
An occurrence of a string of X thus ends at the current position, which ends the proof.

The execution time and the extra space needed for running the algorithm Det-search depend only on the implementation of the automaton M. For example, in an implementation by transition matrix, the time to parse the text is Θ(|y|), since the delay is constant, and the extra space, in addition to the matrix, is also constant (see Proposition 1.15).

The second algorithm of this part applies when we have at our disposal an automaton N that recognizes the language X itself, and no longer A∗X. By adding to the automaton an arc from its initial state to itself labeled by a, for each letter a ∈ A, we simply get an automaton N′ that recognizes the language A∗X. But the automaton N′ is not deterministic, and therefore the previous algorithm cannot be applied. Figure 1.13(a) presents an example of automaton N′ for the same pattern X as the one of Figure 1.12(a).

Figure 1.13 Search for the occurrences of a pattern with a non-deterministic automaton (see also Figure 1.12). (a) The non-deterministic automaton recognizes the language A∗X, with alphabet A = {a, b, c} and pattern X = {ab, babb, bb}. To locate the occurrences of strings of X in a text y, it is sufficient to operate the automaton on y and to signal an occurrence each time that a terminal state is reached. (b) Example with y = cbabba. The computation amounts to simultaneously following all the possible paths: starting from R = {−1}, the set of states R takes successively the values {−1}, {−1, 2, 6}, {−1, 0, 3}, {−1, 1, 2, 4, 6} (occurrence of ab), {−1, 2, 5, 6, 7} (occurrences of babb and bb) and {−1, 0, 3} after the letters c, b, a, b, b and a. It results that the pattern occurs at right positions 3 and 4 on y and nowhere else.

In such a situation, the retained solution usually consists in simulating the automaton obtained by determinization of N′, following in parallel all the possible paths having a given label. Since only the states reached at the ends of these paths matter for the occurrence test, it suffices to keep the set R of the reached states. This is what the algorithm Non-det-search below realizes. Actually, it is not even necessary to modify the automaton N, since the loops on its initial state can also be simulated. This is realized at Line 4 of the algorithm by systematically adding the initial state to the set of states. During the execution of the automaton on the input y, the automaton is not in a single state, but in a set of states R. This subset of the set of states is recomputed after the analysis of each letter of y. The algorithm calls the function Targets that performs a transition on a set of states, a function that is an immediate extension of the function Target.

Non-det-search(N, y)
1 q0 ← initial[N]
2 R ← {q0}
3 for each letter a of y, sequentially do
4     R ← Targets(R, a) ∪ {q0}
5     t ← false
6     for each state p ∈ R do
7         if terminal[p] then
8             t ← true
9     Output-if(t)

Targets(R, a)
1 S ← ∅
2 for each state p ∈ R do
3     for each state q such that (a, q) ∈ Succ[p] do
4         S ← S ∪ {q}
5 return S

Lines 5–8 of the algorithm Non-det-search give the value true to the boolean variable t when the intersection between the set of states R and the set of terminal states is non-empty. An occurrence is then signalled at Line 9, if the case arises. Figure 1.13(b) illustrates how the algorithm works.
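When the sets of states are realized by boolean vectors indexed by the states – the realization considered in the complexity analysis that follows – the simulation can be sketched in C as below (the representation of the automaton by successor lists, the sizes and all names are assumptions of ours):

    #include <stdio.h>
    #include <string.h>
    #include <stdbool.h>

    enum { NSTATES = 9, SIGMA = 3 };            /* hypothetical sizes   */

    typedef struct {
        int  nsucc[NSTATES][SIGMA];             /* number of successors */
        int  succ[NSTATES][SIGMA][NSTATES];     /* lists of successors  */
        bool terminal[NSTATES];
        int  initial;
    } NFA;

    /* y is given as a sequence of letter ranks in {0, ..., SIGMA-1}. */
    void non_det_search(const NFA *N, const int *y, size_t n)
    {
        bool r[NSTATES] = { false }, s[NSTATES];
        r[N->initial] = true;
        for (size_t j = 0; j < n; j++) {
            memset(s, 0, sizeof s);             /* S <- Targets(R, y[j]) */
            for (int p = 0; p < NSTATES; p++)
                if (r[p])
                    for (int k = 0; k < N->nsucc[p][y[j]]; k++)
                        s[N->succ[p][y[j]][k]] = true;
            memcpy(r, s, sizeof r);
            r[N->initial] = true;               /* simulated loop on q0  */
            for (int p = 0; p < NSTATES; p++)
                if (r[p] && N->terminal[p]) {
                    printf("occurrence at position %zu\n", j);
                    break;
                }
        }
    }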
Proposition 1.19
When N is an automaton that recognizes the language X for a pattern X ⊆ A∗, the operation Non-det-search(N, y) locates all the occurrences of strings of X in the text y ∈ A∗.

Proof
Let us denote by q0 the initial state of the automaton N and, for every string v ∈ A∗, by Rv the set of states defined by

    Rv = {q : q is the end of a path of origin q0 and of label v}.

One can verify, by induction on the length of the prefixes of y, that the assertion

    R = ∪ {Rv : v suff u},    (1.4)

where u is the current prefix of y, is satisfied after the execution of each of the instructions of the algorithm, except at Line 1.

If an occurrence of a string of X ends at the current position, one of the suffixes v of the current prefix u belongs to X. Therefore, by the definition of N, one of the states q ∈ Rv is terminal, and, by Property (1.4), one of the states of R is terminal. It follows that the operation signals this occurrence, since no string of X is empty. Conversely, if an occurrence has just been signalled, it means that one of the states q ∈ R is terminal. Property (1.4) and the definition of N imply the existence of a suffix v of the current prefix u that belongs to X. It follows that an occurrence of a string of X ends at the current position. This ends the proof of the proposition.

The complexity of the algorithm Non-det-search depends both on the implementation retained for the automaton N and on the realization chosen for manipulating the sets of states. If, for instance, the automaton is deterministic, its transition function is implemented by a transition matrix, and the sets of states are implemented by boolean vectors whose indices are states, the function Targets executes in time and space O(card Q), where Q is the set of states. In this case, the analysis of the text y runs in time O(|y| × card Q) and utilizes O(card Q) extra space.

In the following paragraphs, we consider an example of realization of the above simulation adapted to the case of a very small automaton that possesses a tree structure.

Bit-vector model

The bit-vector model refers to the possibility of using machine words for encoding the states of automata. When the length of the language associated with the pattern is not larger than the size of a machine word counted in bits, this technique gives algorithms that are efficient and easy to implement. The technique is also used in Section 8.4. Here, the principle is applied to the simulation method of a deterministic automaton described in the previous paragraphs. It encodes the set of reached states into a bit vector, and executes a transition by a simple shift controlled by a mask associated with the considered letter.

Let us start by specifying the notation used in the rest for bit vectors. We identify a bit vector with a string on the alphabet {0, 1}. We denote respectively by ∨ and ∧ the “or” and “and” bitwise operators. These are binary operations internal to the sets of bit vectors of identical lengths. The first operation, ∨, sets a bit of the result to 1 if at least one of the two bits at the same position in the two operands is equal to 1, and to 0 otherwise. The second operation, ∧, sets a bit of the result to 1 if the two bits at the same position in the two operands are both equal to 1, and to 0 otherwise.
We denote by ⊣ the shift operation defined as follows: applied to a natural number k and a bit vector, it yields the bit vector of the same length obtained by shifting the bits of the vector by k positions to the right and completing it to the left with k 0's. Thus,

    1001 ∨ 0011 = 1011,  1001 ∧ 0011 = 0001  and  2 ⊣ 1101 = 0011.

Let us consider a finite non-empty set X of non-empty strings. Let N be the automaton obtained from the card X elementary deterministic automata that recognize the strings of X by merging their initial states into a single one, say q0. Let N′ be the automaton built on N by adding the arcs of the form (q0, a, q0), for each letter a ∈ A. The automaton N′ recognizes the language A∗X. The search for the occurrences of strings of X in a text y is realized here, as in the above paragraphs, by simulating the determinized automaton of N′ by means of N (see Figure 1.13(a)).

Let us set m = |X| and let us number the states of N from −1 to m − 1 using a depth-first traversal of the structure from the initial state q0 – it is the numbering used in the example of Figure 1.13(a). Let us now encode each set of states R \ {−1} by a vector r of m bits with the following convention:

    p ∈ R \ {−1} if and only if r[p] = 1.

Let r be the vector of m bits that encodes the current state of the search, a ∈ A be the current letter of y, and s be the vector of m bits that encodes the next state. It is clear that the computation of s from r and a obeys the following rule: s[p] = 1 if and only if there exists an arc of label a, either from the state −1 to the state p, or from the state p − 1 to the state p with r[p − 1] = 1.

Let us consider the vector init of m bits defined by: init[p] = 1 if and only if there exists an arc with state −1 as its source and state p as its target. Let us consider also the table masq, indexed on A and with values in the set of vectors of m bits, defined for every letter b ∈ A by: masq[b][p] = 1 if and only if there exists an arc of label b and of target the state p. Then r, a and s satisfy the identity:

    s = (init ∨ (1 ⊣ r)) ∧ masq[a].

This expression translates the transition performed at Line 4 of the algorithm Non-det-search in terms of bitwise operations, except for the initial state. The bit vector init encodes the potential transitions from the initial state, and the one-bit right shift the transitions from the reached states. The table masq validates the transitions labeled by the current letter.

It only remains to indicate how to test whether one of the states represented by a vector r of m bits that encodes the current state of the search is terminal or not. To this goal, let term be the vector of m bits defined by: term[p] = 1 if and only if the state p is terminal. Then one of the states represented by r is terminal if and only if:

    r ∧ term ≠ 0^m.

The code of the function Small-automaton that computes the vectors init and term and the table masq follows; then the code of the pattern matching algorithm is given.

Small-automaton(X, m)
 1 init ← 0^m
 2 term ← 0^m
 3 for each letter a ∈ A do
 4     masq[a] ← 0^m
 5 p ← −1
 6 for each string x ∈ X do
 7     init[p + 1] ← 1
 8     for each letter a of x, sequentially do
 9         p ← p + 1
10         masq[a][p] ← 1
11     term[p] ← 1
12 return (init, term, masq)

Short-strings-search(X, m, y)
1 (init, term, masq) ← Small-automaton(X, m)
2 r ← 0^m
3 for each letter a of y, sequentially do
4     r ← (init ∨ (1 ⊣ r)) ∧ masq[a]
5     Output-if(r ∧ term ≠ 0^m)

An example of computation is treated in Figure 1.14.
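Anticipating Exercise 1.20, here is a sketch in C of Small-automaton and Short-strings-search when m = |X| does not exceed the number of bits of an unsigned long; mapping the position p of a vector onto bit p of the word turns the operation 1 ⊣ r into the left shift r << 1 (the names and the representation of X as an array of strings are our assumptions):

    #include <stdio.h>

    enum { SIGMA = 256 };                       /* bytes as the alphabet   */

    typedef struct {
        unsigned long init, term, masq[SIGMA];
    } Small;

    /* The strings of X are non-empty and their total length m is assumed
       not to exceed the number of bits of an unsigned long. */
    void small_automaton(const char *const X[], int nx, Small *A)
    {
        A->init = A->term = 0;
        for (int c = 0; c < SIGMA; c++)
            A->masq[c] = 0;
        int p = -1;                             /* states are -1, ..., m-1 */
        for (int i = 0; i < nx; i++) {
            A->init |= 1UL << (p + 1);          /* arc from the state -1   */
            for (const char *a = X[i]; *a != '\0'; a++) {
                p++;
                A->masq[(unsigned char)*a] |= 1UL << p;
            }
            A->term |= 1UL << p;                /* last state of the string */
        }
    }

    void short_strings_search(const Small *A, const char *y)
    {
        unsigned long r = 0;
        for (size_t j = 0; y[j] != '\0'; j++) {
            r = (A->init | (r << 1)) & A->masq[(unsigned char)y[j]];
            if (r & A->term)
                printf("occurrence ending at position %zu\n", j);
        }
    }

With X = {ab, babb, bb} and y = cbabba, occurrences are signalled at the right positions 3 and 4, in accordance with Figure 1.14.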
Proposition 1.20
The operation Short-strings-search(X, m, y) runs in time Θ(m × card A + m × |y|). The required extra memory space is Θ(m × card A).

Proof
The time necessary for initializing the bit vectors init, term and masq[a], for a ∈ A, is linear in their size, thus Θ(m × card A). The instructions at Lines 4 and 5 execute in time Θ(m) each. The stated complexities follow.

This being established, when the length m is at most the number of bits of a machine word, every bit vector of m bits can be implemented with the help of a machine word whose first m bits only are significant. This gives the following result.

Corollary 1.21
When m = |X| is at most the length of a machine word, the operation Short-strings-search(X, m, y) executes in time Θ(|y| + card A) with an extra memory space Θ(card A).

Figure 1.14 Using bit vectors to search for the occurrences of the pattern X = {ab, babb, bb} (see Figure 1.13). (a) Vectors init and term, and table of vectors masq on the alphabet A = {a, b, c}. These vectors are of length 8 since |X| = 8. The first vector encodes the potential transitions from the initial state. The second encodes the terminal states. The vectors of the table masq encode the occurrences of the letters of the alphabet in the strings of X.

    k           0  1  2  3  4  5  6  7
    init[k]     1  0  1  0  0  0  1  0
    term[k]     0  1  0  0  0  1  0  1
    masq[a][k]  1  0  0  1  0  0  0  0
    masq[b][k]  0  1  1  0  1  1  1  1
    masq[c][k]  0  0  0  0  0  0  0  0

(b) Successive values of the vector r that encodes the current state of the search for the strings of X in the text y = cbabba. A gray area marking some bits would indicate that a terminal state has been reached.

    j        y[j]   bit vector r
    (start)         00000000
    0        c      00000000
    1        b      00100010
    2        a      10010000
    3        b      01101010   occurrence of ab
    4        b      00100111   occurrences of babb and bb
    5        a      10010000

1.6 Borders and prefixes tables

We present in this section two fundamental methods for efficiently locating patterns or for searching for regularities in strings. There are two tables, the table of borders and the table of prefixes, that both store occurrences of the prefixes of a string occurring inside the string itself. The tables can be computed in linear time. The computation algorithms also provide methods for locating strings that are studied in detail in Chapters 2 and 3 (a prelude is proposed in Exercise 1.24).

Table of borders

Let x be a string of length m ≥ 1. We define the table

    border: {0, 1, . . . , m − 1} → {0, 1, . . . , m − 1}

by

    border[k] = |Border(x[0 . . k])|

for k = 0, 1, . . . , m − 1. The table border is called the table of borders for the string x: it stores the lengths of the borders of the non-empty prefixes of the string. Here is the table of borders for the string x = abbabaabbabaaaabbabbaa:

    k          0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21
    x[k]       a  b  b  a  b  a  a  b  b  a  b  a  a  a  a  b  b  a  b  b  a  a
    border[k]  0  0  0  1  2  1  1  2  3  4  5  6  7  1  1  2  3  4  5  3  4  1

Figure 1.15 Schema showing the correspondence between the variables i and j considered at Line 3 of the function Borders and Lemma 1.22.

The following lemma provides the recurrence relation used by the function Borders, given thereafter, for computing the table border.

Lemma 1.22
For every (u, a) ∈ A+ × A, we have

    Border(ua) = Border(u)a              if Border(u)a pref u,
    Border(ua) = Border(Border(u)a)      otherwise.

Proof
We first note that if Border(ua) is a non-empty string, it is of the form wa, where w is a border of u.
If Border(u)a pref u, the string Border(u)a is then a border of ua, and the previous remark shows that it is the longest string of this kind. It follows that Border(ua) = Border(u)a in this case. Otherwise, Border(ua) is both a prefix of Border(u) and a suffix of Border(u)a. As it is of maximal length with this property, it is indeed the string Border(Border(u)a).

Figure 1.15 schematizes the correspondence between the variables i and j of the function Borders, whose code follows, and the statement of Lemma 1.22.

Borders(x, m)
 1 i ← 0
 2 for j ← 1 to m − 1 do
 3     border[j − 1] ← i
 4     while i ≥ 0 and x[j] ≠ x[i] do
 5         if i = 0 then
 6             i ← −1
 7         else i ← border[i − 1]
 8     i ← i + 1
 9 border[m − 1] ← i
10 return border

Proposition 1.23
The function Borders applied to a string x and its length m produces the table of borders for x.

Proof
The table border is computed by the function Borders sequentially: it runs from the prefix of x of length 1 to x itself. During the execution of the while loop at Lines 4–7, the sequence of the borders of x[0 . . j − 1] is inspected, following Proposition 1.5. When exiting this loop, we have |Border(x[0 . . j])| = |x[0 . . i]| = i + 1, in accordance with Lemma 1.22. The correctness of the code follows.

Proposition 1.24
The operation Borders(x, m) executes in time Θ(m). The number of comparisons between letters of the string x is between m − 1 and 2m − 3 when m ≥ 2, and these bounds are tight.

In the rest, we say that the comparison between two given letters is positive when the two letters are identical, and negative otherwise.

Proof
Let us note that the execution time is linear in the number of comparisons performed between the letters of x. It is thus sufficient to establish the bounds on the number of comparisons. The quantity 2j − i increases by at least one unit after each comparison of letters: the variables i and j are both incremented after a positive comparison; the value of i decreases by at least one and the value of j remains unchanged after a negative comparison. When m ≥ 2, this quantity is equal to 2 for the first comparison (i = 0 and j = 1) and is at most 2m − 2 for the last one (i ≥ 0 and j = m − 1). The overall number of comparisons is thus bounded by 2m − 3 as stated. The lower bound of m − 1 is tight: it is reached for x = ab^{m−1}. The upper bound of 2m − 3 comparisons is tight as well: it is reached for every string x of the form a^{m−1}b with a, b ∈ A and a ≠ b. This ends the proof.

Another proof of the bound 2m − 3 is proposed in Exercise 1.22.

Table of prefixes

Let x be a string of length m ≥ 1. We define the table

    pref: {0, 1, . . . , m − 1} → {0, 1, . . . , m}

by

    pref[k] = |lcp(x, x[k . . m − 1])|

for k = 0, 1, . . . , m − 1, where lcp(u, v) denotes the longest common prefix of the strings u and v. The table pref is called the table of prefixes for the string x. It memorizes the prefixes of x that occur inside the string itself. We note that pref[0] = |x|. The following example shows the table of prefixes for the string x = abbabaabbabaaaabbabbaa:

    k        0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21
    x[k]     a  b  b  a  b  a  a  b  b  a  b  a  a  a  a  b  b  a  b  b  a  a
    pref[k] 22  0  0  2  0  1  7  0  0  2  0  1  1  1  5  0  0  4  0  0  1  1

Some string matching algorithms (see Chapter 3) use the table suff, which is nothing but the analogue of the table of prefixes obtained by considering the reverse of the string x. The method for computing pref that is presented below proceeds by determining pref[i] by increasing values of the position i on x.
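Here is a possible C transcription of the function Borders (a sketch; the caller is assumed to supply the output array border of size m):

    /* Table of borders: border[k] = |Border(x[0 .. k])|, for a string x
       of length m >= 1 (cf. the function Borders above). */
    void borders(const char *x, int m, int border[])
    {
        int i = 0;                       /* length of the current border */
        for (int j = 1; j < m; j++) {
            border[j - 1] = i;
            while (i >= 0 && x[j] != x[i])
                i = (i == 0) ? -1 : border[i - 1];
            i++;
        }
        border[m - 1] = i;
    }

On x = abbabaabbabaaaabbabbaa it produces the table given above. We now return to the table pref.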
For its computation, a naive method would consist in evaluating each value pref[i] independently of the previous values, by direct comparisons; but this would lead to a quadratic-time computation, in the case where x is the power of a single letter for example. The utilization of the already computed values yields a linear-time algorithm. For that, the index i being fixed, we introduce two values g and f that constitute the key elements of the method. They satisfy the relations

    g = max{j + pref[j] : 0 < j < i}    (1.5)

and

    f ∈ {j : 0 < j < i and j + pref[j] = g}.    (1.6)

We note that g and f are defined when i > 1. The string x[f . . g − 1] is then a prefix of x, thus also a border of x[0 . . g − 1]; it is the empty string when f = g. We can note, moreover, that if g < i we then have g = i − 1, and that on the contrary, by definition of f, we have f < i ≤ g. The following lemma provides the justification for the correctness of the function Prefixes.

Lemma 1.25
If i < g, we have the relation

    pref[i] = pref[i − f]    if pref[i − f] < g − i,
    pref[i] = g − i          if pref[i − f] > g − i,
    pref[i] = g − i + ℓ      otherwise,

where ℓ = |lcp(x[g − i . . m − 1], x[g . . m − 1])|.

Figure 1.16 Illustration of the function Prefixes. The framed factors x[6 . . 12] and x[14 . . 18] and the gray factors x[9 . . 10] and x[17 . . 20] are prefixes of the string x = abbabaabbabaaaabbabbaa. For i = 9, we have f = 6 and g = 13. The situation at this position is the same as at position 3 = 9 − 6. We have pref[9] = pref[3] = 2, which means that ab, of length 2, is the longest factor at position 9 that is a prefix of x. For i = 17, we have f = 14 and g = 19. As pref[17 − 14] = 2 ≥ 19 − 17, we deduce that the string ab = x[i . . g − 1] is a prefix of x. Letters of x and of x[i . . m − 1] have to be compared from the respective positions 2 and g on, for determining pref[17] = 4.

Proof
Let us set u = x[f . . g − 1]. The string u is a prefix of x by the definition of f and g. Let us also set k = pref[i − f]. By the definition of pref, the string x[i − f . . i − f + k − 1] is a prefix of x but x[i − f . . i − f + k] is not.

In the case where pref[i − f] < g − i, an occurrence of x[i − f . . i − f + k] starts at the position i − f on u – thus also at the position i on x – which shows that x[i − f . . i − f + k − 1] is the longest prefix of x starting at position i. Therefore we get pref[i] = k = pref[i − f].

In the case where pref[i − f] > g − i, we have x[0 . . g − i − 1] = x[i − f . . g − f − 1] = x[i . . g − 1] and x[g − i] = x[g − f] ≠ x[g]. We thus have pref[i] = g − i.

In the case where pref[i − f] = g − i, we have x[g − i] ≠ x[g − f] and x[g − f] ≠ x[g], therefore we cannot decide on the result of the comparison between x[g − i] and x[g]. Extra letter comparisons are necessary, and we conclude that pref[i] = g − i + ℓ.

In the computation of pref, we initialize the variable g to 0 to simplify the writing of the code of the function Prefixes, and we leave f initially undefined. The first step of the computation thus consists in determining pref[1] by letter comparisons; the above statement is then used for computing the next values. An illustration of how the function works is given in Figure 1.16. A schema showing the correspondence between the variables of the function and the notation used in the statement of Lemma 1.25 and its proof is given in Figure 1.17.

Figure 1.17 Variables i, f and g of the function Prefixes.
The main loop has the invariants u = lcp(x, x[f . . m − 1]) – thus a ≠ b with a, b ∈ A – and f < i when f is defined. The schema corresponds to the situation in which i < g.

Prefixes(x, m)
 1 pref[0] ← m
 2 g ← 0
 3 for i ← 1 to m − 1 do
 4     if i < g and pref[i − f] ≠ g − i then
 5         pref[i] ← min{pref[i − f], g − i}
 6     else (g, f) ← (max{g, i}, i)
 7         while g < m and x[g] = x[g − f] do
 8             g ← g + 1
 9         pref[i] ← g − f
10 return pref

Proposition 1.26
The function Prefixes applied to a string x and to its length m produces the table of prefixes for x.

Proof
We can verify that the variables f and g satisfy the relations (1.5) and (1.6) at each step of the execution of the loop. We then note that, for a fixed i satisfying the condition i < g, the function applies the relation stated in Lemma 1.25, which produces a correct computation. It thus remains to check that the computation is correct when i ≥ g. But in this situation, Lines 6–8 compute |lcp(x, x[i . . m − 1])| = |x[f . . g − 1]| = g − f, which is, by definition, the value of pref[i]. Therefore the function produces the table pref.

Proposition 1.27
The execution of the operation Prefixes(x, m) runs in time Θ(m). Less than 2m comparisons between letters of the string x are performed.

Proof
Comparisons between letters are performed at Line 7. Every positive comparison increments the variable g. As the value of g never decreases and varies from 0 to at most m, there are at most m positive comparisons. Each negative comparison leads to the next step of the loop; there are thus at most m − 1 of them. This gives less than 2m comparisons overall. The previous argument also shows that the total time of all the executions of the loop at Lines 7–8 is Θ(m). The other instructions of the loop at Lines 3–9 take a constant time for each value of i, giving again a global time Θ(m) for their execution and that of the function.

The bound of 2m on the number of comparisons performed by the function Prefixes is relatively tight: we get 2m − 3 comparisons for a string of the form a^{m−1}b with m ≥ 2, a, b ∈ A and a ≠ b. Indeed, it takes m − 1 comparisons to compute pref[1], then one comparison for each of the m − 2 values pref[i] with 1 < i < m.

Relation between borders and prefixes

The tables border and pref, whose computation is described above, both memorize occurrences of the prefixes of x. We make explicit here a relation between these two tables. The relation is not immediate, for the following reason, illustrated in Figure 1.18. When pref[i] = ℓ, the factor u = x[i . . i + ℓ − 1] is a prefix of x, but it is not necessarily the border of x[0 . . i + ℓ − 1], because this border can be longer than u. In the same way, when border[j] = ℓ, the factor v = x[j − ℓ + 1 . . j] is a prefix of x, but it is not necessarily the longest prefix of x occurring at position j − ℓ + 1.

Figure 1.18 Relations between borders and prefixes in the string x = abbabaabbabaaaabbabbaa: pref[9] = 2 while border[9 + 2 − 1] = 5 ≠ 2; we also have border[15] = 2 while pref[15 − 2 + 1] = 5 ≠ 2.

The proposition that follows shows how the table border can be expressed using the table pref. One can deduce from the statement an algorithm for computing the table border knowing the table pref.

Proposition 1.28
Let x ∈ A+ and j be a position on x. Then

    border[j] = 0                 if I = ∅,
    border[j] = j − min I + 1     otherwise,

where I = {i : 0 < i ≤ j and i + pref[i] − 1 ≥ j}.
Proof
We first note that, for 0 < i ≤ j, i ∈ I if and only if x[i . . j] pref x. Indeed, if i ∈ I, we have x[i . . j] pref x[i . . i + pref[i] − 1] pref x, thus x[i . . j] pref x. Conversely, if x[i . . j] pref x, we deduce, by definition of pref[i], that pref[i] ≥ j − i + 1, thus i + pref[i] − 1 ≥ j, which shows that i ∈ I. We also note that border[j] = 0 if and only if I = ∅.

It follows that if border[j] ≠ 0 (thus border[j] > 0) and k = j − border[j] + 1, we have k ≤ j and x[k . . j] pref x. No factor x[i . . j], i < k, satisfies the relation x[i . . j] pref x, by definition of border[j]. Thus k = min I by the first remark, and border[j] = j − k + 1 as stated.

The computation of the table pref from the table border can lead to an iteration, and does not seem to admit a simple expression comparable to the one of the previous statement (see Exercise 1.23).

Notes

The chapter contains the basic elements for a precise study of algorithms on strings. Most of the notions that are introduced here are dispersed in different books. We cite here those that are often considered as references in their domains.

The combinatorial aspects of strings are dealt with in the collective books of Lothaire [73, 74, 75]. One can refer to the book of Aho, Hopcroft and Ullman [63] for algorithmic questions: expression of algorithms, data structures and complexity evaluation. We were inspired by the book of Cormen, Leiserson and Rivest [69] for the general presentation and the style of the algorithms. Concerning automata and languages, one can refer to the book of Berstel [67] or to the one of Pin [76]. The books of Berstel and Perrin [68] and of Béal [65] contain elements of the theory of codes (Exercises 1.10 and 1.11). Finally, the book of Aho, Sethi and Ullman [64] describes methods for the implementation of automata.

Section 1.5 on basic techniques contains elements frequently selected for the final development of software using algorithms that process strings. They are, more specifically, heuristics and the utilization of machine words. This last technique is also tackled in Chapter 8 for approximate pattern matching. This type of technique was initiated by Baeza-Yates and Gonnet [89] and by Wu and Manber [185]. The algorithm Fast-search is from Horspool [130]. The search for a string by means of a hash function is analyzed by Karp and Rabin [137].

The treatment of the notions in Section 1.6 is original. The computation of the table of borders is classical. It is inspired by an algorithm of Morris and Pratt of 1970 (see [9]) that is at the origin of the first string matching algorithm running in linear time. The table of prefixes synthesizes differently the same information on a string as the previous table. The dual notion of table of suffixes is used in Chapter 3. Gusfield [5] makes it a fundamental element of string matching methods (his Z algorithm corresponds to the algorithm Suffixes of Chapter 3).

Exercises

1.1 (Computation)
What is the number of prefixes, suffixes, factors and subsequences of a given string? Discuss if necessary.

1.2 (Fibonacci morphism)
A morphism f on A∗ is a mapping from A∗ into itself that satisfies the rules:

    f(ε) = ε,
    f(x · y) = f(x) · f(y)    for x, y ∈ A∗.

For every natural number n and every string x ∈ A∗, we denote by f^n(x) the string defined by f^0(x) = x and f^k(x) = f^{k−1}(f(x)) for k = 1, 2, . . . , n. Let us consider the alphabet A = {a, b}, and let ϕ be the morphism on A∗ defined by ϕ(a) = ab and ϕ(b) = a.
Show that the string ϕ^n(a) is identical to f_{n+2}, the Fibonacci string of index n + 2.

1.3 (Permutation)
We call a permutation on the alphabet A a string u that satisfies the condition card alph(u) = |u| = card A; this is thus a string in which every letter of the alphabet occurs exactly once. For k = card A, show that there exists a string of length less than k^2 − 2k + 4 that contains as subsequences all the permutations on A. Design a construction algorithm for such a string. [Hint: see Mohanty [157].]

1.4 (Period)
Show that condition 3 of Proposition 1.4 can be replaced by the following condition: there exist a string t and an integer k > 0 such that x fact t^k and |t| = p.

1.5 (Limit case)
Show that the string (ab)^k a(ab)^k a, with k ≥ 1, is a limit case for the Periodicity lemma.

1.6 (Three periods)
On the triplets of sorted positive integers (p1, p2, p3), p1 ≤ p2 ≤ p3, we define the derivation by: the derivative of (p1, p2, p3) is the triplet made of the integers p1, p2 − p1 and p3 − p1. Let (q1, q2, q3) be the first triplet obtained by iterating the derivation from (p1, p2, p3) and such that q1 = 0. Show that if the string x ∈ A∗ has p1, p2 and p3 as periods and

    |x| ≥ (p1 + p2 + p3 − 2 gcd(p1, p2, p3) + q2 + q3) / 2,

then it also has gcd(p1, p2, p3) as period. [Hint: see Mignosi and Restivo [74].]

1.7 (Three squares)
Let u, v and w be three non-empty strings. Show that we have 2|u| < |w| if we assume that u is primitive and that u^2 ≺pref v^2 ≺pref w^2 (see Proposition 9.17 for a more precise consequence).

1.8 (Conjugates)
Show that two non-empty conjugate strings have the same exponent and conjugate roots. Show that the conjugacy class of every non-empty string x contains |x|/k elements, where k is the exponent of x.

1.9 (Periods)
Let p be a period of x that is not a multiple of per(x). Show that p > |x| − per(x).
Let p and q be two periods of x such that p < q. Show that:
• q − p is a period of first_{|x|−p}(x) and of (first_p(x))^{−1} x;
• p and q + p are periods of first_q(x)x.
(The definition of first_k is given in Section 4.4.)
Show that if x = uvw, uv and vw have period p and |v| ≥ p, then x has period p.
Assume now that x has period p and contains a factor v of length at least p and of period r, with r a divisor of p. Show that r is also a period of x.

1.10 (Code)
A language X ⊆ A∗ is a code if every string of X^+ has a unique decomposition in strings of X. Show that the ASCII codes of characters on the alphabet {0, 1} form a code according to this definition. Show that the languages {a, b}, ab∗, {aa, ba, b}, {aa, baa, ba} and {a, ba, bb} are codes. Show that this is not the case for the languages {a, ab, ba} and {a, abbba, babab, bb}. A language X ⊆ A∗ is prefix if the condition u pref v implies u = v for all strings u, v ∈ X. The notion of a suffix language is defined in a dual way. Show that every prefix language is a code. Do the same for suffix languages.

1.11 (Default theorem)
Let X ⊆ A∗ be a finite set that is not a code. Let Y ⊆ A∗ be a code for which Y∗ is the smallest set of this form that contains X∗. Show that card Y < card X. [Hint: every string x ∈ X can be written in the form y1 y2 . . . yk with yi ∈ Y for i = 1, 2, . . . , k; show that the function α: X → Y defined by α(x) = y1 is surjective but not injective; see [73].]

1.12 (Commutation)
Show, by the Default theorem (see Exercise 1.11), then by the Periodicity lemma, that if uv = vu for two strings u, v ∈ A∗, then u and v are powers of a same string.
1.13 (nlogn)
Let f: N → N be the function defined by:

    f(1) = a,
    f(n) = f(⌊n/2⌋) + f(⌈n/2⌉) + bn    for n ≥ 2,

with a ∈ N and b ∈ N \ {0}. Show that f(n) is Θ(n log n).

1.14 (Filter)
We consider a code in which characters are encoded on 8 bits. We want to develop a pattern matching algorithm using an automaton for strings written on the alphabet {A, C, G, T}. Describe data structures for realizing the automaton with the help of a transition matrix of size 4 × m (and not 256 × m), where m is the number of states of the automaton, possibly using an amount of extra space that is independent of m.

1.15 (Implementation of partial functions)
Let f: E → F be a partial function, where E is a finite set. Describe an implementation of f able to perform each of the four following operations in constant time:
• initialize f, such that f(x) is undefined for every x ∈ E;
• set the value of f(x) to y ∈ F, for x ∈ E;
• test whether f(x) is defined or not, for x ∈ E;
• produce the value of f(x), for x ∈ E.
One can use O(card E) space. [Hint: simultaneously use a table indexed by E and a list of the elements x for which f(x) is defined, with cross-references between the table and the list.] Deduce that the implementation of such a function can be done in time linear in the number of elements of E whose images by f are defined.

1.16 (Not so naive)
We consider here a slightly more elaborate implementation of the sliding window mechanism than the one described for the naive algorithm. Among the strings x of length m ≥ 2, it distinguishes two classes: the one for which the first two letters are identical (thus x[0] = x[1]), and the antagonistic class (thus x[0] ≠ x[1]). This elementary distinction allows us to shift the window by two positions to the right in the following cases: the string x belongs to the first class and y[j + 1] ≠ x[1]; the string x belongs to the second class and y[j + 1] = x[1]. On the other hand, if the comparison of the string x with the content of the window is always performed letter by letter, it considers the positions on x in the order 1, 2, . . . , m − 1, 0. Give the code of an algorithm that applies this method. Show that the average number of comparisons per text letter is strictly less than 1, when the average is evaluated on the set of strings of the same length, this length is more than 2, and the alphabet contains at least four letters. [Hint: see Hancart [122].]

1.17 (End of window)
Let us consider the method that, in the same way as the algorithm Fast-search uses the rightmost letter in the window for performing a shift, uses the two rightmost letters in the window (assuming that the string is of length greater than 2). Give the code of an algorithm that applies this method. In which cases does it seem efficient? [Hint: see Zhu and Takaoka [186] or Baeza-Yates [88].]

1.18 (After the window)
Same statement as the one of Exercise 1.17, but using the letter located immediately to the right of the window (beware of the overflow at the right extremity of the text). [Hint: see Sunday [178].]

1.19 (Sentinel)
We come back again to the string matching problem: locating the occurrences of a string x of length m in a text y of length n. The sentinel technique can be used for searching for the letter x[m − 1], performing the shifts with the help of the table last-occ. Since the shifts can be of length m, we set y[n . . n + m − 1] to x[m − 1]^m. Give a code for this sentinel method.
In order to speed up the process and decrease the number of tests on letters, it is possible to chain several shifts without testing the letters of the text. For that, we back up the value of last-occ[x[m − 1]] in a variable, say d, then we set the value of last-occ[x[m − 1]] to 0. We can then chain shifts until one of them is of length 0. We then test the other letters of the window, signalling an occurrence when it arises, and we apply a shift of length d. Give a code for this method. [Hint: see Hume and Sunday [131].]

1.20 (In C)
Give an implementation in the C language of the algorithm Short-strings-search. The operators ∨, ∧ and ⊣ are encoded by |, & and <<. Extend the implementation so that it accepts any parameter m (possibly strictly greater than the number of bits of a machine word). Compare the code obtained with the source of the Unix command agrep.

1.21 (Short strings)
Describe a pattern matching algorithm for short strings similar to the algorithm Short-strings-search, but in which the binary values 0 and 1 are swapped.

1.22 (Bound)
Show that the number of positive comparisons and the number of negative comparisons performed during the operation Borders(x, m) are each at most m − 1. Prove again the bound 2m − 3 of Proposition 1.24.

1.23 (Table of prefixes)
Describe a linear-time algorithm for the computation of the table pref, given the table border for the string x.

1.24 (Localisation by the borders or the prefixes)
Show that the table of borders for the string x$y can be directly used in order to locate all the occurrences of the string x in the string y, where $ ∉ alph(xy). Same question with the table of prefixes for the string xy.

1.25 (Cover)
A string u is a cover of a string x if, for every position i on x, there exists a position j on x for which 0 ≤ j ≤ i < j + |u| ≤ |x| and u = x[j . . j + |u| − 1]. Design an algorithm for the computation of the shortest cover of a string, and state its complexity.

1.26 (Long border)
Let u be a non-empty border of the string x ∈ A∗. Let v ∈ A∗ be such that |v| < |u|. Show that v is a border of u if and only if it is a border of x. Show that x has another non-empty border if u satisfies the inequality |x| < 2|u|. Show that x has no other border satisfying the same inequality if per(x) > |x|/4.

1.27 (Border free)
We say that a non-empty string u is border free if Border(u) = ε or, equivalently, if per(u) = |u|. Let x ∈ A∗. Show that C = {u : u pref x and u is border free} is a suffix code (see Exercise 1.10). Show that x uniquely factorizes into x_k x_{k−1} . . . x_1 according to the strings of C (x_i ∈ C for i = 1, 2, . . . , k). Show that x_1 is the shortest string of C that is a suffix of x, and that x_k is the longest string of C that is a prefix of x. Design a linear-time algorithm for computing the factorization.

1.28 (Maximal suffix)
We denote by SM(≤, u) the maximal suffix of u ∈ A+ for the lexicographic order where, in this notation, ≤ denotes the order on the alphabet. Let x ∈ A+. Show that |x| − |SM(≤, x)| < per(x). We assume that SM(≤, x) = x and we denote by w_1, w_2, . . . , w_k the borders of x in decreasing order of length (we have k > 0 and w_k = ε). Let a_1, a_2, . . . , a_k ∈ A and z_1, z_2, . . . , z_k ∈ A∗ be such that x = w_1 a_1 z_1 = w_2 a_2 z_2 = · · · = w_k a_k z_k. Show that a_1 ≤ a_2 ≤ · · · ≤ a_k. Design a linear-time algorithm that computes the maximal suffix (for the lexicographic order) of a string x ∈ A+.
[Hint: use the border computation algorithm of Section 1.6 or see Booth [93]; see also [3].]

1.29 (Critical factorisation)
Let x ∈ A+. For each position i on x, we denote by

    rep(i) = min{|u| : u ∈ A+, A∗u ∩ A∗x[0 . . i − 1] ≠ ∅ and uA∗ ∩ x[i . . |x| − 1]A∗ ≠ ∅}

the local period of x at position i. Let w = SM(≤, x) (SM is defined in Exercise 1.28) and assume that |w| ≤ |SM(≤^{−1}, x)|; show that rep(|x| − |w|) = per(x). [Hint: note that the intersection of the two orderings on strings induced by ≤ and ≤^{−1} is the prefix ordering, and use Proposition 1.4; see Crochemore and Perrin [108, 3].]