Academia.eduAcademia.edu

Extending de Bruijn sequences to larger alphabets

2019, arXiv (Cornell University)

Extending de Bruijn sequences to larger alphabets Verónica Becher Lucas Cortés [email protected] [email protected] Departamento de Computación, Facultad de Ciencias Exactas y Naturales & ICC Universidad de Buenos Aires & CONICET, Argentina arXiv:1907.00056v2 [cs.DM] 19 Nov 2020 November 23, 2020 Abstract A de Bruijn sequence of order n over a k-symbol alphabet is a circular sequence where each length-n sequence occurs exactly once. We present a way of extending de Bruijn sequences by adding a new symbol to the alphabet: the extension is performed by embedding a given de Bruijn sequence into another one of the same order, but over the alphabet with one more symbol, while ensuring that there are no long runs without the new symbol. Our solution is based on auxiliary graphs derived from the de Bruijn graph and solving a problem of maximum flow. Keywords: de Bruijn sequences, Eulerian cycle, maximum flow, combinatorics on words. Contents 1 Introduction and statement of results 2 Proof of Theorem 1 2.1 Graph of circular words . . . . . . 2.2 Fair distribution of the new symbol 2.3 Partition of the augmenting graph 2.4 Actual proof of Theorem 1 . . . . . 2.5 Proof of Proposition 1 . . . . . . . 1 . . . . . 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 3 5 6 7 7 Introduction and statement of results A circular sequence is the equivalence class of a sequence under rotations. A de Bruijn sequence of order n over a k-symbol alphabet is a circular sequence of length k n in which every length-n sequence occurs exactly once [6, 11], see [4] for a fine presentation and history. For example, writing [abc] to denote the circular sequence formed by the rotations of abc, [0011] is de Bruijn of order 2 over the alphabet {0, 1}. A subsequence of a sequence a1 a2 . . . an is a sequence b1 b2 . . . bm defined by bi = ani for i = 1, 2, . . . , m, where n1 ≤ n2 ≤ . . . ≤ nm . The same applies to circular sequences, assuming any starting position. For example, for the alphabet of digits from 0 to 9, [123], [246] and [5612] are subsequences of [123456]. Clearly, for any given de Bruijn sequence over a k-symbol alphabet there is another one over the alphabet enlarged with one new symbol, such that the two sequences have the same order, and the first is a subsequence of the second. This is immediate from the characterization of de Bruijn sequences as Eulerian cycles on de Bruijn graphs: the de Bruijn graph for the original alphabet is a sugbgraph of the de Bruijn graph for the enlarged alphabet, and any cycle in an Eulerian graph can be embedded into a full Eulerian cycle. For instance, such an extension can be constructed with Hierholzer’s algorithm for joining cycles together to create an Eulerian cycle of a graph. However, this gives no guarantee that the new symbol is fairly distributed along the resulting de Bruijn sequence. 1 In this note we consider the problem of extending a de Bruijn sequence over a k-symbol alphabet to another one of the same order over the alphabet enlarged with a new symbol, such that the first is a subsequence of the second and there are no long runs without the new symbol. If in between every two successive occurrences of the new symbol there were fewer than n symbols, it would be impossible to accommodate all words of length n lacking the new symbol. If there were exactly n symbols, to accommodate all words of length n lacking the new symbol we would need (n + 1)k n symbols. But this would be impossible because for all sufficiently large values of n this quantity exceeds (k + 1)n , the length of a de Bruijn sequence of order n over a (k + 1)-symbol alphabet. Theorem 1 proves that there is an extension that in between any two successive occurrences of the new symbol there can be at most n + 2k − 2 other symbols. Theorem 1. For any de Bruijn sequence v over a k-symbol alphabet of order n there is another one w over that alphabet enlarged with a new symbol, of the same order n, such that v is a subsequence of w and for any n + 2k − 1 consecutive symbols in w there is at least one occurrence of the new symbol. For example, for this de Bruijn sequence of order 3 over the alphabet {0, 1}, v = [11000101] the following de Bruijn sequence of order 3 over the alphabet {0, 1, 2} satisfies the conditions of the theorem: w = [122212111002202000120102101] because v is a subsequence of w and given any n + 2k − 1 = 6 consecutive symbols in w there is at least one occurrence of the symbol 2. To prove Theorem 1, in addition to classical elements from graph theory such as de Bruijn graphs, Eulerian cycles and graph transformations, we pose the fairness condition on the new symbol as a problem of maximum flow and solve it with Edmonds-Karp algorithm [7, 5]. The following is a crude upper bound of the complexity of the construction. Proposition 1. For order n and every k-symbol alphabet there is a construction that proves Theorem 1 in O(k 3n−2 ) mathemtical operations. It is possible to conceive this extension problem in variants of de Bruijn sequences defined in terms of Eulerian cycles in approrpriate graphs. For instance, the semi-perfect de Bruijn sequences of Repke and Rytter [10] which satisfy that each of the prefixes (large enough) has the largest possible number of distinct words. Or the perfect sequences [1] which, for order n, contain each word of length n exactly n times but each one starting at different positions modulo n. Or the subtler nested perfect sequences [3] originated in Mordachay Levin’s [9, Theorem 2]. The extension to a larger alphabet without the fairness condition on the new symbol is particularly simple for the lexicographically greatest de Bruijn sequence: the one over the original alphabet is the suffix of the one of the enlarged alphabet [13], assuming the new symbo is the lexicographically greatest. The extension can be done with an efficient greedy algorithm, see [12]. The extension problem to a larger alphabet that we consider in the present note is dual to the extension problem studied by Becher and Heiber in [2], where they considered extending a de Bruijn sequence of order n over a k-symbol alphabet to another one of order n + 1 over the same alphabet such that the first is a prefix of the second. 2 Proof of Theorem 1 In the sequel we use the terms word and sequence interchangeably. A de Bruijn graph G(k, n) is a directed graph whose vertices are the words of length n over a k-symbol alphabet and whose edges are the pairs (v, w) where v = au and w = ub, for some word u of length n − 1 and possibly two different symbols a, b. Thus, the graph G(k, n) has k n vertices and k n+1 edges, it is strongly connected and every vertex has the same in-degree and out-degree. Each de Bruijn sequence of order n over a k-symbol alphabet can be constructed by taking a Hamiltonian cycle in G(k, n). Since the line graph of G(k, n) is G(k, n+1), each de Bruijn sequence of order n+1 over a k-symbol alphabet can be constructed as an Eulerian cycle in G(k, n). 2 Figure 1: For alphabet {0, 1} there are 4 circular words of length 3: [000], [100], [110] and [111], each corresponds to a simple cycle in the de Bruijn graph G(2, 2). Figure 2: On the left G(2, 3). On the right graph C(2, 4). Figure 3: The de Bruijn graph G(2, 2) is given by the solid lines. The Augmenting graph A(3, 2) consists of all the vertices and just the dashed lines. 2.1 Graph of circular words Our main tool is the factorization of the set of edges in G(k, n) in convenient sets of pairwise disjoint cycles. We say that two cycles are disjoint if they have no common edges. Proposition 2. For every, k and n, the set of edges in G(k, n) can be partitioned into a disjoint set of cycles identified by the circular words of length n + 1. Proof. As usual, we identify an edge in G(k, n) by concatenating the starting vertex label with the edge label. Thus, each edge in G(k, n) is identified with a word of length n + 1. The set of all rotations of a word of length n + 1 identifies consecutive edges that form a simple cycle in G(k, n). And each circular word of length n + 1 corresponds exactly to one simple cycle in G(k, n). The partition of the set of words of length n + 1 in the equivalence classes given by their rotations determines a partition of the set of edges in G(k, n) into disjoint simple cycles, see Figure 1. 3 Figure 4: Petals for the vertices in G(2, 2). Figure 5: On the left, the petal for the vertex 01, which is just [012]. On the right, the petal for the vertex 10 which consists of the path [222], [202], [021]. We define the graph of circular words. Figure 2 shows it for word length 3 over {0, 1}. Definition 1 (Graph of circular words). For every k and n, C(k, n) is the graph whose vertices are the circular words of length n over the k-symbol alphabet and two vertices [v] and [w] are connected if there is a word u of length n − 1 and symbols a, b such that [au] = [v], [ub] = [w]. The fact that G(k, n) is a subgraph of G(k + 1, n) motivates the following definition. Definition 2 (Augmenting graph). The augmenting graph A(k + 1, n) is the directed graph (V, E) where V is the set of length-n words over the alphabet enlarged by a new symbol s, and E is the set of pairs (v, w) such that v = au, w = ub for some word u of length n − 1 and symbols a, b, and either v or w have at least one occurrence of the symbol s. Figure 3 illustrates A(3, 2). Observe that in A(k + 1, n) each of the vertices in G(k, n) has exactly one incoming edge and exactly one outgoing edge. This outcoming edge is always labelled with the new symbol s. To prove Theorem 1 we plan to construct an Eulerian cycle in G(k+1, n) by joining the given Eulerian cycle in G(k, n) with disjoint cycles of the augmenting graph A(k + 1, n) that we call petals. Since the edges in A(k + 1, n) are exactly the edges in G(k + 1, n) minus those in G(k, n), the edges in A(k + 1, n) can also be partitioned into a disjoint set of cycles which are identified by the circular words of length n + 1 that have at least one occurrence of the new symbol s. To define petals we consider the restriction of C(k + 1, n + 1) to the simple cycles in A(k + 1, n). e + 1, n + 1) be the subgraph of C(k + 1, n + 1) Definition 3 (Petal for a vertex in G(k, n)). Let C(k whose set of vertices are the circular words of length n + 1 with at least one occurrence of symbol s. e + 1, n + 1) that seen as a cycle in A(k + 1, n), A petal for a vertex v in G(k, n) is a subgraph of C(k traverses exactly one vertex in G(k, n), the vertex v. There is exactly one petal for each vertex v in G(k, n) and this petal starts at the circular word [vs], where s is the new symbol. Now there are two difficulties. One is to determine where to insert the petals so that we obtain a fair distribution of the new symbol s. The other difficulty is that petals must exhaust the augmenting graph A(k + 1, n). Figures 4 and 5 illustrates petals for vertices in G(2, 2). 4 Figure 6: The Eulerian cycle in G(2, 2) given by [11000101] started at vertex 11 has 4 sections, section 0 is (11, 11) , section 1 is (10, 00), section 2 is (00, 01) and section 3 is (10, 01). Figure 7: At the left, a Distribution graph D(2, 2). At the center, a possible perfect matching. At the right, the flow network for D(2, 2) where each edge has capacity 1. 2.2 Fair distribution of the new symbol A pointed cycle is a cycle with a specified starting edge. Definition 4 (Section of a cycle). For a pointed Eulerian cycle in G(k, n) given by the sequence of edges e1 , . . . ekn+1 and a non-negative integer j such that 0 ≤ j < k n , the sequence of vertices vjk , ..., vjk+k−1 , where each vi is the head of ei , is a section j of the cycle. Figure 6 exemplifies the four sections of an Eulerian cycle in G(2, 2). The de Bruijn graph G(k, n) has k n vertices and k n+1 edges. An Eulerian cycle in G(k, n) has k n sections with k vertices each section. Since there are the same number of vertices as sections we would like to choose one vertex from each section to place a petal. The problem is each vertex occurs k times in the Eulerian cycle but not necessarily at k different sections. We pose it as a matching problem. Definition 5 (Distribution graph). Given pointed Eulerian cycle in G(k, n) the Distribution graph D(k, n) is a k-regular bipartite graph where the two vertex classes are the vertices in G(k, n) and the sections of the Eulerian cycle and there is an edge (v, j) if v belongs to the section j. A matching in a graph D is a set of edges such that no two edges share a common vertex. A vertex is matched if it is an endpoint of one of the edges in the matching. A perfect matching is a matching that matches all vertices in the graph. Lemma 1. For every Distribution graph D(k, n) there is a perfect matching. Proof. Let D be a finite bipartite graph consisting of are two disjoint sets of vertices X and Y with edges that connect a vertex in X to a vertex in Y . For a subset W of X, let N (W ) be the set of all vertices in Y adjacent to some element in W . Hall’s marriage theorem [8] states that there is a matching that entirely covers X if and only if for every subset W in X, |W | ≤ |N (W )|. Consider a Distribution graph D(k, n) and call X to the set of vertices G(k, n) and Y to the set of sections. For any W ⊆ X such that |W | = r, the sum of the out-degree of these r vertices is rk. Given that the in-degree for any vertex in Y is k, we have that |N (W )| ≥ r. Then, there is a matching that entirely covers X. Furthermore, since the number of vertices is equal to the number of sections, |X| = |Y | and the matching is perfect. 5 Figure 8: A petals tree with four petals, one for each vertex of G(2, 2). To obtain a perfect matching in a Distribution graph we can use any method to compute the maximum flow in a network. We define the flow network by adding adding two vertices to the Distribution graph, the source and the sink. Add an edge from the source to each vertex in X and add an edge from each vertex in Y to the sink. Assign capacity 1 to each of the edges of the flow network. The maximum flow of the network is |X|. This flow has the edges of a perfect match. Figure 7 shows a Distribution graph D(2, 2), a possible perfect matching, and the flow network used to obtain it. 2.3 Partition of the augmenting graph We must partition the set of edges in A(k + 1, n) into petals. We define a Petals tree as a root that e + 1, n + 1). It has height n + 1, the vertices at distance d to branches out in a subgraph of C(k the root have exactly d occurrences of the new symbol s, for d = 1, . . . , n + 1. Definition 6 (Petals tree). Let [r] be a circular word corresponding to an Eulerian cycle in G(k, n). e We define the Petals tree given by the root [r] and all the vertices in C(k+1, n+1). Every vertex [v] where v has exactly one occurrence of the symbol s is a child of the root [r]. And for every pair of vertices [v], [w] there is an edge between them exactly when there is an edge between them e + 1, n + 1) and w has one more occurrence of the new symbol s than v. in C(k Figure 8 shows a petals tree. The root branches our in the petal for vertex 00, which has the circular word [002]; the petal for vertex 01, which has [012]; the petal for vertex 10, which has [021], [022], [122] ,[222]; and the petal for 11 which has [112]. Given Eulerian cycle in G(k, n) and a starting vertex, divide it in k n sections. Choose one vertex in each section according to a perfect matching. Fix a Petals tree as a subgraph of A(k + 1, n). The construction considers all the sections, one after the other, starting at section 0. At each section the construction inserts the petal for a chosen vertex, guided by the Petals tree. Each traversed edge is added to the construction. The construction starts at the vertex that is the head of the first edge of section 0. Let w be the current vertex. Case w is a vertex in G(k, n): If w is a chosen vertex in the current section and the petal for w has not been inserted yet then traverse the edge labelled with symbol s and continue traversing the petal for w (which starts with [ws]). If the petal for w has already been traversed or w is not a chosen vertex then continue with the traversal of edges in the current section. Case w is not a vertex in G(k, n): If the edge labelled with s has not been traversed yet, [ws] is a child of the current node the tree and [ws] has not been traversed yet, then traverse it. Otherwise continue with the traversal of the petal that w was already part of. For example, consider this [11000101] de Bruijn sequence of order 3 over alphabet {0, 1} and the corresponding Eulerian cycle in G(2, 2). Suppose we start this cycle at vertex 11 and consider the four consecutive sections (11, 10), (00, 00), (01, 10) and (01, 11). Assume a perfect matching yields for section 0 the vertex 10 and for section 1 the second instance of the vertex 00. Figure 9 illustrates part the construction of the extended Eulerian cycle for a given Petals tress, inserting the petal for the vertex 10 and the petal for the vertex 00. 6 Figure 9: Some steps of the construction of the extension of the de Bruiijn sequence [11000101] 2.4 Actual proof of Theorem 1 Proof of Theorem 1. Let e1 , ..., ekn be the list of edges visited by the Eulerian cycle determined by v in G(k, n − 1), and let v1 , ..., vkn be the list of the respective head vertices. Divide these vertices in k n−1 consecutive sections, each has k vertices. We use the Edmonds-Karp algorithm determine a vertex from each section. Consider the petals for G(k, n − 1). If we place one petal in each section, two consecutive petals can be at most 2k − 1 edges away. Consider now the petals as pointed cycles in A(k + 1, n − 1). A petal for vertex v = a1 ...an−1 starts with the outgoing edge labelled s. Inside the petal, for any n − 1 consecutive edges there is one edge labelled with s. The last edge of the petal is (p, q) where p = sa1 ...an−2 and q = a1 ...an−1 . Thus, for any 2k − 1 + n consecutive edges there is at least be one labelled with s. 2.5 Proof of Proposition 1 Proof of Proposition 1. We must consider the Eulerian cycle in G(k, n − 1) and the extended Eulerian cycle in G(k + 1, n − 1). The search of the maximum flow is the most expensive part of the construction. Edmonds-Karp algorithm has running time O(|V |2 |E|), see [7, 5] for the flow graph (V, E). In our case V has a source, a sink, k n−1 vertices of the original de Bruijn graph G(k, n − 1) and k n−1 vertices for the sections. So |V | = 2k n−1 + 2. There is an edge from the source to each vertex in G(k, n), there are k outgoing edges from each vertex in G(k, n) to sections, and there is one outgoing edge to each section the sink. So, |E| = (k + 2) k n−1 . Then the time complexity of the Edmonds-Karp algorithm in our graph is O((2k n−1 + 2)2 (k + 2)k n−1 ) = O(k 3n−2 ). This completes the proof. 7 References [1] Nicolás Álvarez, Verónica Becher, Pablo Ferrari, and Sergio Yuhjtman. Perfect necklaces. Advances in Applied Mathematics, 80:48 – 61, 2016. [2] Verónica Becher and Pablo Ariel Heiber. On extending de Bruijn sequences. Information Processing Letters, 111(18):930–932, 2011. [3] Verónica Becher and Olivier Carton. Normal numbers and nested perfect necklaces. Journal of Complexity, 54:101403, 2019. [4] Jean Berstel and Dominique Perrin. The origins of combinatorics on words. European Journal of Combinatorics, 28(3):996–1022, 2007. [5] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms. MIT Press, 2009. [6] Nicolaas G. de Bruijn. A combinatorial problem. Nederl. Akad. Wetensch., Proc., 49:758–764 = Indagationes Math. 8, 461–467 (1946), 1946. [7] Jack Edmonds and Richard M. Karp. Theoretical improvements in algorithmic efficiency for network flow problems. Journal of the ACM, 19(2):248–264, 1972. [8] Philip Hall. On representatives of subsets. Journal of the London Mathematical Society, 10, 1935. [9] Mordechay B. Levin. On the discrepancy estimate of normal numbers. Acta Arithmetica, 88(2):99–111, 1999. [10] Damian Repke and Wojciech Rytter. On semi-perfect de Bruijn words. Theoretical Computer Science, 720:55 – 63, 2018. [11] Camille Flye Sainte-Marie. Question 48. L’interm. des math., 1:107–110, 1894. [12] Moshe Schwartz, Yotam Svoray, and Gera Weiss. On embedding de Bruijn sequences by increasing the alphabet size. arXiv:1906.06157, 2019. [13] Gabriel Thibeault. Lexicographically maximum de bruijn sequences in larger alphabets, July 29, 2019. Tesis de Licenciatura en Ciencias de la Computación, Facultad de Ciencias Exactas y Naturales, Universidad de Buenos Aires. Director: Verónica Becher. 8