
Greedy Construction of DNA Codes and New Bounds

2015, arXiv (Cornell University)

In this paper, we construct linear codes over $\mathbb{Z}_4$ with bounded GC-content. The codes are obtained using a greedy algorithm over $\mathbb{Z}_4$. Further, upper and lower bounds are derived for the maximum size of DNA codes of length $n$ with constant GC-content $w$ and edit distance $d$.

arXiv:1505.06262v1 [cs.IT] 23 May 2015

Nabil Bennenni, Kenza Guenda∗, T. Aaron Gulliver

∗ N. Bennenni and K. Guenda are with the Faculty of Mathematics USTHB, University of Science and Technology of Algiers, Algeria.

Keywords: DNA codes, GC-content, edit distance, upper and lower bounds.

1 Introduction

Deoxyribonucleic acid (DNA) contains the genetic program for the biological development of life. DNA is formed by strands linked together and twisted in the shape of a double helix. Each strand is a sequence of four possible nucleotides: two purines, adenine A and guanine G, and two pyrimidines, thymine T and cytosine C. The ends of a DNA strand are chemically polar with 5′ and 3′ ends, which implies that the strands are oriented. Hybridization, known as base pairing, occurs when a strand binds to another strand, forming a double strand of DNA. The strands are linked following the Watson-Crick model. Every A is linked with a T, and every C with a G, and vice versa. We denote the complement of $X$ by $\hat{X}$, i.e., $\hat{A} = T$, $\hat{T} = A$, $\hat{G} = C$ and $\hat{C} = G$. The pairing is done in the opposite direction and the reverse order. For instance, the Watson-Crick complementary (WCC) strand of 3′-ACTTAGA-5′ is the strand 5′-TCTAAGT-3′.

The WCC property of DNA strands is used in DNA computing. In this case the data is encoded using DNA strands, and molecular biology techniques are used to simulate arithmetic and logical operations. The main advantages of this approach are huge memory capacity, massive parallelism, and low power molecular hardware and software. Other applications make use of the properties of DNA [9].

In this paper, we construct linear codes over $\mathbb{Z}_4$ with bounded GC-content. The codes are obtained using a greedy algorithm over $\mathbb{Z}_4$. Further, upper and lower bounds on the maximum size of DNA codes of length $n$ with constant GC-content $w$ and edit distance $d$ are given.

The choice of the ring $\mathbb{Z}_4$ comes from the fact that the bounded GC-content and bounded edit distance properties are multiplicative over $\mathbb{Z}_4$. This is not the case over $\mathbb{F}_4$. The bounded GC-content constraint ensures that all codewords have thermodynamic characteristics below some threshold. This is an important criterion for DNA sequences as it reduces the probability of erroneous cross-hybridization. In [2], Chee and Ling gave an algorithm to construct DNA codes with large GC-content which are optimal only up to $n = 12$. Bishop et al. [1] considered the construction of random codes with fixed GC-content using a probabilistic model. King [5] and Condon et al. [6] gave several upper and lower bounds on the maximum size of DNA codes of length $n$ with constant GC-content $w$ and Hamming distance $d$. It is well known that the Hamming distance does not capture the thermodynamic and combinatorial properties of DNA strands. In fact, the edit distance is a much more appropriate metric for designing codes for DNA computing. Thus, in the second part of this paper upper and lower bounds are derived for the maximum size of DNA codes of length $n$ with constant GC-content $w$ and edit distance $d$.

The remainder of this paper is organized as follows. In Section 2, some preliminary results are presented. Section 3 employs a greedy algorithm to obtain DNA codes with bounded GC-content, and in Section 4 DNA lexicodes are constructed with bounded edit distance. Upper and lower bounds on the maximum size of DNA codes with a given edit distance are also presented. In addition, examples of DNA codes with bounded GC-content and edit distance are given.

2 Preliminaries

The ring $\mathbb{Z}_4$ with elements $\{0, 1, 2, 3\}$ is considered here with addition and multiplication modulo 4. It is a finite chain ring with maximal ideal $\langle 2 \rangle$ and nilpotency index 2. The Hamming weight of a codeword $x$ in $\mathbb{Z}_4^n$ is defined as $w_H(x) = n_1(x) + n_2(x) + n_3(x)$, where $n_i(x)$ is the number of entries of $x$ equal to $i$, and the Hamming distance $d_H(x, y)$ between two codewords $x$ and $y$ is $w_H(x - y)$. We define the reverse of $x = (x_0 x_1 \cdots x_{n-1})$ to be $x^R = (x_{n-1} x_{n-2} \cdots x_1 x_0)$.

The elements $\{0, 1, 2, 3\}$ of $\mathbb{Z}_4$ are in one-to-one correspondence with the nucleotide DNA bases $\{A, T, C, G\}$ by the map $\phi$ such that $0 \to G$, $1 \to A$, $2 \to C$ and $3 \to T$. The complement of the codeword $x = (x_0 x_1 \cdots x_{n-1})$ is the vector $x^C = (\hat{x}_0 \hat{x}_1 \cdots \hat{x}_{n-1})$. The reverse complement (also called the Watson-Crick complement) is $x^{RC} = (\hat{x}_{n-1} \hat{x}_{n-2} \cdots \hat{x}_1 \hat{x}_0)$. For $x \in \mathbb{Z}_4$, $\hat{x}$ is defined to be $\widehat{\phi(x)}$. A linear code $C$ is said to satisfy the reverse constraint, respectively the reverse-complement constraint, if for all $x \in C$ we have $x^R \in C$, respectively $x^{RC} \in C$.

2.1 Construction of Lexicodes over $\mathbb{Z}_4$

The construction of lexicodes over $\mathbb{Z}_4$ given in [4] is now reviewed. A linear code $C$ of length $n$ over $\mathbb{Z}_4$ is an additive code over $\mathbb{Z}_4^n$. Thus $\mathbb{Z}_4^n$ is a linear code over $\mathbb{Z}_4$ with basis $B = \{b_1, \ldots, b_n\}$. With respect to this basis, we recursively define a lexicographically ordered list $V_i = x_1, x_2, \ldots, x_{4^i}$ as follows:

$$V_0 := 0,$$
$$V_i := V_{i-1},\ b_i + V_{i-1},\ 2b_i + V_{i-1},\ 3b_i + V_{i-1}, \quad 1 \le i \le n.$$

In this way $|V_i| = 4^i$, and $\mathbb{Z}_4^n$ can be associated with $V_n$. Assume now that we have a property $P$ which can test if a vector $c \in \mathbb{Z}_4^n$ is selected or not. The selection property $P$ on $V$ can be seen as a boolean valued function $P : V \to \{\text{True}, \text{False}\}$ that depends on one variable. Over $\mathbb{Z}_4$, the property $P$ is called a multiplicative property if $P[x]$ true implies $P[3x]$ true. The following greedy algorithm provides lexicodes over $\mathbb{Z}_4^n$ [4].

Algorithm 1
1. $C_0 := 0$; $i := 1$;
2. select the first vector $a_i \in V_i \setminus V_{i-1}$ such that $P[2a_i + c]$ holds for all $c \in C_{i-1}$;
3. if such an $a_i$ exists, then $C_i := C_{i-1},\ a_i + C_{i-1},\ 2a_i + C_{i-1},\ 3a_i + C_{i-1}$; otherwise $C_i := C_{i-1}$;
4. $i := i + 1$; return to 2.

For $0 < i \le n$, the code $C_i$ is forced to be linear because all linear combinations of the selected vectors $a_{i_1}, \ldots, a_{i_l}$, $l \le i$, are taken. The code $C_i$ has a 'basis' formed from $a_{i_1}, \ldots, a_{i_l}$, so we have a nested sequence of linear codes

$$0 = C_0 \subseteq C_1 \subseteq \cdots \subseteq C_n.$$

$C_n$ is the lexicode and is denoted $C_n = C(B, P)$, where $B$ is the ordering and $P$ is the selection property. We have the following result.

Theorem 1. ([4, Theorem 4]) For any basis $B$ of $R^n$ (here $R = \mathbb{Z}_4$) and any multiplicative selection criterion $P$, the lexicode $C(B, P)$ is linear and $P[x]$ holds for each codeword $x \ne 0$.

3 A Greedy Algorithm for Bounded GC-content DNA Codes

In this section we construct DNA codes with bounded GC-content using Algorithm 1. We begin with the following definition.

Definition 2. Let $C$ be a linear code over $\mathbb{Z}_4^n$. The GC-content of a codeword $x \in C$, denoted by $GC(\phi(x))$, is the number of occurrences of G and C in $\phi(x)$:

$$GC(\phi(x)) = |\{1 \le i \le n : \phi(x)_i \in \{G, C\}\}| = w_{GC}(\phi(x)).$$

We say that a subset $C$ of $\mathbb{Z}_4^n$ satisfies the bounded GC-content constraint if there exists a positive integer $w$ such that $GC(\phi(x)) \ge w$ for all $x \in C$.

Remark 3. Definition 2 differs from the conventional definition [4, 3]. The bounded GC-content constraint ensures that all codewords have a hybridization energy below some threshold, which results in stable DNA strands.

Proposition 4. The property $P_1$, where $P_1[x]$ is true if and only if $w_{GC}(\phi(x)) \ge w$, is a multiplicative property over $\mathbb{Z}_4$.

Proof 5. Let $x \in \mathbb{Z}_4^n$ be such that $w_{GC}(\phi(x)) \ge w$. Multiplying the vector $x$ by 3 does not change the number of 0's and 2's. This gives $w_{GC}(\phi(3x)) = w_{GC}(\phi(x)) \ge w$, and the result follows.

3.1 Construction Results

In this section, construction results are presented for linear codes over $\mathbb{Z}_4$ with bounded GC-content.
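As an illustration of greedy selection with the property $P_1$, the following is a minimal Python sketch, not the authors' implementation. It assumes the canonical basis, the ordering of $V_i$ from Section 2.1 and the map $\phi$ from Section 2. To stay independent of the exact form of the test in step 2 of Algorithm 1, it checks $P_1$ directly on every codeword that would be added, which is a brute-force variant. The names `gc_weight`, `add`, `scale` and `greedy_lexicode` are hypothetical.

```python
def gc_weight(x):
    """Count entries equal to 0 or 2, i.e. the symbols that phi maps to G and C."""
    return sum(1 for s in x if s in (0, 2))


def add(x, y):
    """Componentwise sum modulo 4."""
    return tuple((a + b) % 4 for a, b in zip(x, y))


def scale(k, x):
    """Scalar multiple k*x modulo 4."""
    return tuple((k * a) % 4 for a in x)


def greedy_lexicode(n, w):
    """Greedy linear code in Z_4^n whose codewords all have GC-weight >= w.

    Assumption: a candidate a is accepted when every word k*a + c
    (k = 1, 2, 3 and c already in the code) passes the GC test.
    """
    zero = (0,) * n
    ordered = [zero]      # V_{i-1} in lexicographic order
    code = {zero}         # C_{i-1}
    generators = []
    for i in range(n):
        b = tuple(1 if j == i else 0 for j in range(n))  # canonical b_i
        # V_i \ V_{i-1}: b_i + V_{i-1}, then 2b_i + V_{i-1}, then 3b_i + V_{i-1}
        layer = [add(scale(k, b), v) for k in (1, 2, 3) for v in ordered]
        for a in layer:
            new_words = {add(scale(k, a), c) for k in (1, 2, 3) for c in code}
            if all(gc_weight(y) >= w for y in new_words):
                generators.append(a)
                code |= new_words
                break
        ordered += layer
    return generators, code
```

Checking every new codeword costs more than a test that exploits the multiplicative property, but it makes the guarantee explicit: the returned set is closed under addition modulo 4 and each of its codewords has GC-weight at least `w`.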
For the selection property $P_1$, the verification step for $w_{GC}(\phi(2x)) \ge w$ in Algorithm 1 can be eliminated. This is because every entry of $2x$ lies in $\{0, 2\}$, so $\phi(2x)$ consists only of G's and C's; hence for $x \in \mathbb{Z}_4^n$, $w_{GC}(\phi(x)) \ge w$ implies that $w_{GC}(\phi(2x)) \ge w$, and this improves the speed of the algorithm. Some of these codes attain the upper bound (5) given in [5, Proposition 1]. Furthermore, the codes obtained are linear as opposed to those in [8]. Table 1 gives DNA lexicodes over $\mathbb{Z}_4^n$ obtained using the selection property $P_1[x]$ ($w_{GC}(\phi(x)) \ge w$). The DNA code strands corresponding to the first and second codes in Table 1 are given in Tables 2 and 3, respectively.

4 DNA Codes and Edit Distance

The edit distance has been used for biological computation, in particular for two types of genetic mutation. The first is the substitution of nucleotides and consists of two possible mutations:
• Transition: a purine is replaced by a purine (A ↔ G) or a pyrimidine is replaced by a pyrimidine (T ↔ C).
• Transversion: a purine is replaced by a pyrimidine or the reverse (e.g., A ↔ C).
The second is modification using insertions and deletions.

In this section, we consider the edit distance in the greedy algorithm in order to find large sets of DNA codewords of length $n$ with given $w_{GC}$ and minimum edit distance $d$. We begin by providing a definition of edit distance which follows the presentation in [7].

Let $A$ and $B$ be finite sets of distinct symbols and let $x^t \in A^t$ denote an arbitrary string of length $t$ over $A$. The string edit distance is characterized by a triple $\langle A, B, c \rangle$ consisting of the finite sets $A$ and $B$, and the primitive cost function $c : E \to \mathbb{R}_+$, where $\mathbb{R}_+$ is the set of nonnegative reals, $E = E_s \cup E_d \cup E_i$ is the set of primitive edit operations, $E_s = A \times B$ is the set of substitutions, $E_d = A \times \{\epsilon\}$ is the set of deletions, and $E_i = \{\epsilon\} \times B$ is the set of insertions. Each triple $\langle A, B, c \rangle$ induces a distance function $d_c : A^* \times B^* \to \mathbb{R}_+$ that maps a pair of strings to a nonnegative value [7].

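As a concrete instance of such a triple, one may take $A = B = \{A, C, G, T\}$ with cost 1 for every substitution of distinct symbols, every insertion and every deletion. This unit-cost choice is an assumption made here for illustration; the framework of [7] allows arbitrary nonnegative costs. The induced distance is then the Levenshtein distance, which the standard dynamic program sketched below computes. The function name and its parameters are hypothetical.

```python
def edit_distance(x, y, sub_cost=1, indel_cost=1):
    """Edit distance between strings x and y by dynamic programming.

    d[i][j] holds the distance between the prefixes x[:i] and y[:j].
    Matching symbols cost 0; sub_cost and indel_cost are assumed constants.
    """
    t, v = len(x), len(y)
    d = [[0] * (v + 1) for _ in range(t + 1)]
    for i in range(1, t + 1):
        d[i][0] = i * indel_cost          # delete all of x[:i]
    for j in range(1, v + 1):
        d[0][j] = j * indel_cost          # insert all of y[:j]
    for i in range(1, t + 1):
        for j in range(1, v + 1):
            sub = 0 if x[i - 1] == y[j - 1] else sub_cost
            d[i][j] = min(
                d[i - 1][j - 1] + sub,        # substitution or match
                d[i - 1][j] + indel_cost,     # deletion from x
                d[i][j - 1] + indel_cost,     # insertion into x
            )
    return d[t][v]
```

The table has $(t + 1)(v + 1)$ entries, so the running time is proportional to the product of the two string lengths.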
Table 1: DNA Lexicodes over $\mathbb{Z}_4^n$ Obtained using the Selection Property $P_1[x]$ ($w_{GC}(\phi(x)) \ge w$)

| $n$ | $w$ | $d_H$ | Basis of $\mathbb{Z}_4^n$ | Basis of $C(B, P)$ |
|---|---|---|---|---|
| 8 | 4 | 4 | Canonical basis | 21111000, 13210100, 32310010 |
| 10 | 6 | 4 | Canonical basis | 2111100000, 1321010000, 3231001000 |
| 10 | 10 | 1 | Canonical basis | 2000000000, 0200000000, 0020000000, 0002000000, 0000200000, 0000020000, 0000002000, 0000000200, 0000000020, 0000000002 |
| 12 | 12 | 1 | Canonical basis | 200000000000, 020000000000, 002000000000, 000200000000, 000020000000, 000002000000, 000000200000, 000000020000, 000000002000, 000000000200, 000000000020, 000000000002 |

Table 2: DNA Code Strands Corresponding to the Linear Code in the First Row of Table 1

GGGGGGGG GGCCGGCC GGGCCCGC CAAAAGGG TGGGAAAC GGAAACTG AAACTGGG TTGGGAAC
CTTGGGAA GGAACTTG AAGGGCTT GCTTAAGG ATTGGGCA TTTGGGAC CTTTGGGA CATTTGGG
GGGGCCCC CCGGCCGG GGGCCCCG AAAAGGGC CTGGGAAA ACTGGGAA AACTGGGA ACTTGGGA
AACTTGGG GAACTTGG TTAAGGGC TAAGGGCT TTGGGCAA TTGGGACT TTTACGGG GGACTTTG
CCCCGGGG CGCGCGCG GGGAAAAC GGGAAACT CTAAAGGG GAAACTGG GGGAACTT TGGGAACT
GGGCTTAA AGGGCTTA CTTAAGGG GGCTTAAG GGGACTTT TGGGACTT GACTTTGG GGGCTTTT
GAAAACCC GATTTCCC AACCCGTT TTAACCCG CCCAAAGT TGGGCTTT CCCGATTT TTCCCGAA
AATTCCCG TAAACCCG TTGGGCTT GCCCTTTT GACCCTTT CAATTCCG GAACCCTT GAAACCCT

Table 3: DNA Code Strands Corresponding to the Linear Code in the Second Row of Table 1

GGGGGGGGGG ATCAGAGGGG CCGCGCGGGG TACTGTGGGG CAAAAGGGGG TGTGAAGGGG GTATACGGGG ACTGATGGGG
GCCCCGGGGG AAGTCAGGGG CGCGCCGGGG TTGACTGGGG CTTTTGGGGG TCAGTAGGGG GATATCGGGG AGACTTGGGG
TCTAGGAGGG GAACGAAGGG AGTTGCAGGG CTAGGTAGGG ATGCAGAGGG CCCAAAAGGG TAGGACAGGG GGCAATAGGG
TGATCGAGGG GTTGCAAGGG ACAACCAGGG CATCCTAGGG AACGTGAGGG CGGATAAGGG TTCCTCAGGG GCGTTTAGGG
GGCCGGCGGG TTGTGACGGG GCCGGCCGGG AAGCGTCGGG GATTAGCGGG AGACAACGGG CTTAACCGGG TCACATCGGG
CCGGCGCGGG TACACACGGG CGGCCCCGGG ATCTCTCGGG GTAATGCGGG ACTCTACGGG CAATTCCGGG TGTGTTCGGG
ACATGGTGGG CATGGATGGG TGAAGCTGGG GTTCGTTGGG TTCGAGTGGG GCGTAATGGG AACCACTGGG CGGTATTGGG
AGTACGTGGG CTACCATGGG TCTTCCTGGG GAAGCTTGGG TAGCGGTGGG GGCTTATGGG ATGGTCTGGG CCCATTTGGG

Table 4: DNA Lexicodes over $\mathbb{Z}_4^n$ Obtained using the Selection Property $P_2[x]$ ($d_c(\phi(x), \phi(y)) \le m$)

| $n$ | $\phi(x)$ | $m$ | $w_{GC}$ | Basis of $\mathbb{Z}_4^n$ |
|---|---|---|---|---|
| 4 | GGGG | 1 | 4 | Canonical basis |
| 4 | GCGC | 2 | 4 | Canonical basis |

Basis of $C(B, P)$: 2222 2202 2220 2022 2020 0022 0220 2222

Definition 6. The edit distance $d_c(x^t, y^v)$ between two strings $x^t \in A^t$ and $y^v \in B^v$ is defined recursively as

$$d_c(x^t, y^v) = \min \left\{ c(x_t, y_v) + d_c(x^{t-1}, y^{v-1}),\ c(x_t, \epsilon) + d_c(x^{t-1}, y^v),\ c(\epsilon, y_v) + d_c(x^t, y^{v-1}) \right\},$$

where $d_c(\epsilon, \epsilon) = 0$ and $\epsilon$ denotes the empty string.

The edit distance constraint for a DNA code $C$ is $d_c(x, y) \ge d$ for all $x, y \in C$, $x \ne y$, for some prescribed minimum edit distance $d$. The edit distance constraint can reduce non-specific hybridization between distinct codewords, as well as allow for the correction of insertion, deletion and substitution errors in codewords.

Proposition 7. The property $P_2$, where $P_2[x]$ is true only if $d_c(\phi(x), \phi(y)) \le w$, is a multiplicative property over $\mathbb{Z}_4$.

Proof 8. Let $x \in \mathbb{Z}_4^n$ and $y \in \mathbb{Z}_4^n$. Multiplying $x$ by 3 and $y$ by 3 does not change the number of 0's and 2's. Therefore the number of 1's and 3's also does not change, so

$$n_1(x) + n_0(x) + n_2(x) + n_3(x) = n_1(3x) + n_0(3x) + n_2(3x) + n_3(3x).$$

This also holds for $y$ and thus $d_c(x, y) = d_c(3x, 3y)$.

Now we use Algorithm 1 to construct linear codes over $\mathbb{Z}_4$ with GC-content bounded by $w$ and edit distance $d_c(\phi(x), \phi(y))$ such that $x \in \mathbb{Z}_4^*$ and $y \in \mathbb{Z}_4^*$. The results are given in Table 4.

4.1 Upper and Lower Bounds

Let $A_4(n, d)$ be the maximum size of a code over $\mathbb{Z}_4$ with length $n$ and minimum edit distance $d$. Let $A_4^{GC}(n, d, w)$ be the maximum size of a DNA code with length $n$, minimum edit distance $d$, and fixed GC weight $w$. Further, let $A_4^{R,GC}(n, d, w)$, respectively $A_4^{RC,GC}(n, d, w)$, be the maximum size of a DNA code with length $n$, minimum edit distance $d$, and fixed GC weight $w$ that satisfies the reverse constraint, respectively the reverse-complement constraint.
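For very small parameters, a lower bound on $A_4^{GC}(n, d, w)$ can be obtained by exhaustive greedy search. The sketch below is an illustration under stated assumptions, not a method from the paper: it uses the unit-cost edit distance and scans all words of length $n$ with exactly $w$ symbols in $\{G, C\}$ in lexicographic order, keeping a word whenever it is at distance at least $d$ from every word kept so far. The size of the resulting set is only a lower bound and need not be optimal. The names `edit_distance` and `greedy_gc_code` are hypothetical.

```python
from itertools import product


def edit_distance(x, y):
    """Unit-cost edit distance, keeping only one row of the table at a time."""
    prev = list(range(len(y) + 1))
    for i, a in enumerate(x, start=1):
        cur = [i]
        for j, b in enumerate(y, start=1):
            cur.append(min(prev[j - 1] + (a != b),   # substitution or match
                           prev[j] + 1,              # deletion
                           cur[j - 1] + 1))          # insertion
        prev = cur
    return prev[-1]


def greedy_gc_code(n, d, w):
    """Greedily collect length-n DNA words with GC-weight exactly w
    and pairwise edit distance at least d."""
    code = []
    for letters in product("ACGT", repeat=n):
        word = "".join(letters)
        if word.count("G") + word.count("C") != w:
            continue
        if all(edit_distance(word, c) >= d for c in code):
            code.append(word)
    return code
```

The search visits all $4^n$ words and compares each candidate with every kept word, so it is practical only for short lengths.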
The purpose of this section is to give upper and lower bounds on these quantities. We have the following theorem.

Theorem 9. For $n > 0$ with $0 \le d \le n$ and $0 \le w \le n$, the following results hold:

$$A_4^{GC}(n, d, 0) = A_2(n, d), \qquad (1)$$
$$A_4^{GC}(n, d, w) = A_4^{GC}(n, d, n - w), \qquad (2)$$

and if $w = n/2$ then

$$A_4^{GC}(n, d, w) = 4. \qquad (3)$$

Proof 10. The analogous result for DNA codes with GC-content and Hamming distance was given in [5]. The corresponding proof is employed here for the edit distance.
(1): Let $C$ be a linear code over $\mathbb{Z}_4^n$ with $w_{GC}(\phi(C)) = 0$. Then the codewords of $C$ contain only 1's and 3's (that is, only A's and T's), so $C$ can be considered as a binary code, which gives $A_4^{GC}(n, d, 0) = A_2(n, d)$.
(2): Since $w_{GC}(\phi(C)) = n - w_{AT}(\phi(C))$, interchanging the A's with C's and T's with G's gives $w_{GC}(\phi(C)) = n - w$, so that $A_4^{GC}(n, d, w) = A_4^{GC}(n, d, n - w)$.
(3): Since $A_4^{RC,GC}(n, d, w) \le A_4^{GC}(n, d, w)$, by [10, Theorem 5] we have that $A_4^{RC,GC}(n, d, w) = 2$. Then $4 \le A_4^{GC}(n, d, w)$, and by the pigeonhole principle $A_4^{GC}(n, d, w) \ge 4$, so that $A_4^{GC}(n, d, w) = 4$.

We have the following relationship between the GC-content of a code and the code size over the alphabet $\{A, T, C, G\}$.

Proposition 11.

$$A_4^{GC}(n, d, w) \ge A_4^{GC}(n + 1, d + 1, w). \qquad (4)$$
$$A_4^{GC}(n, d, w) \ge A_4^{GC}(n + 1, d, w)/4. \qquad (5)$$

Proof 12. The analogous result for DNA codes with unrestricted GC-content and Hamming distance was given in [6]. The corresponding proof is employed here for the edit distance.
(4): An $(n, A_4^{GC}(n + 1, d + 1, w), d, w)$ code can be obtained from an $(n + 1, A_4^{GC}(n + 1, d + 1, w), d + 1, w)$ code by removing a symbol from each codeword such that their GC-content is preserved.
(5): If all the codewords in an $(n + 1, A_4^{GC}(n + 1, d, w), d, w)$ code are partitioned into four subsets according to the first symbol, one of the subsets will have size at least $A_4^{GC}(n + 1, d, w)/4$ and thus is an $(n + 1, A_4^{GC+}(n + 1, d, w)/4, d, w)$ code.
By removing the (common) symbol from all codewords in the largest subset, an $(n, A_4^{GC+}(n + 1, d, w)/4, d, w)$ code is obtained.

We have the following relationship between the GC-content of a reverse code and the code size over the alphabet $\{A, T, C, G\}$.

Proposition 13.

$$A_4^{GC,R}(n - 1, d, w) \le A_4^{GC,R}(n, d, w) \le A_4^{GC,R}(n, d - 1, w). \qquad (6)$$
$$A_4^{GC,R}(n - 1, d, w) \ge A_4^{GC,R}(n, d, w)/4. \qquad (7)$$

Proof 14. The analogous result for DNA codes with unrestricted GC-content and Hamming distance was given in [6]. The corresponding proof is used here for the edit distance.
(6): By the construction of codes over $\mathbb{Z}_4$, we obtain $4^n$ codewords of length $n$ and $4^{n-1}$ codewords of length $n - 1$, and the result follows.
(7): The codewords of a $C(n, A_4^{GC,R}(n, d, w), d)$-code over $\mathbb{Z}_4$ can be partitioned into four subsets denoted $C_1, C_2, C_3, C_4$ such that the size of subset $C_1$ is at least $A_4^{GC,R}(n, d, w)/4$ and $C_1$ is an $(n, A_4^{GC,R}(n, d, w)/4, d)$ code. Removing a symbol from the codewords of $C_1$ such that the distance $d$ and weight $w$ are maintained, we obtain an $(n - 1, A_4^{GC,R}(n, d, w)/4, d)$ code, and the result follows.

Proposition 15. For $0 \le d \le n$ and $0 \le w \le n$,

$$A_4^{GC,RC}(n, d, w) = A_4^{GC,R}(n, d, w)$$

if $n$ is even, and

$$A_4^{GC,R}(n, d + 1, w) \le A_4^{GC,RC}(n, d, w) \le A_4^{GC,R}(n, d - 1, w)$$

if $n$ is odd.

Proof 16. The analogous result for DNA codes with unrestricted GC-content and Hamming distance was given in [5]. The corresponding proof is employed here for the edit distance. Given a set of codewords of length $n$, if we replace all entries in any subset of the positions by their complement, the GC-content of these codewords is preserved, as well as the edit distance between any pair of codewords.
The edit distance between a codeword and the reverse or reverse-complement of the other codewords is not in general preserved, but if $n$ is even and the first $n/2$ coordinates of each codeword $x_i$ are replaced by their complements to form a new codeword $y_i$, then $d_c(x_i, x_j^R) = d_c(y_i, y_j^{RC})$ for all codewords $x_i$ and $x_j$. Similarly, if $n$ is odd and the first $(n - 1)/2$ coordinates of each codeword $x_i$ are replaced by their complements to form $y_i$, then $|d_c(x_i, x_j^R) - d_c(y_i, y_j^{RC})| \le 1$.

References

[1] M.A. Bishop, A.G. D'Yachkov, A.J. Macula, T.E. Renz and V.V. Rykov. Free energy gap and statistical thermodynamic fidelity of DNA codes. J. Comp. Biol. 14(8), 1088–1104 (2007).
[2] Y.M. Chee and S. Ling. Improved lower bounds for constant GC-content DNA codes. IEEE Trans. Inform. Theory. 54(1), 391–394 (2008).
[3] K. Guenda, T.A. Gulliver and P. Solé. On cyclic DNA codes. Istanbul, Proc. IEEE Int. Symp. Inform. Theory, 121–125 (2013).
[4] K. Guenda, T.A. Gulliver and S.A. Sheikholeslam. Lexicodes over rings. Des. Codes Cryptogr. 72(3), 749–763 (2014).
[5] O.D. King. Bounds for DNA codes with constant GC-content. Electron. J. Combin. 10, R33 (2003).
[6] A. Marathe, A.E. Condon and R.M. Corn. On combinatorial DNA word design. J. Comp. Biol. 8(3), 201–219 (2001).
[7] E.S. Ristad and P.N. Yianilos. Learning string-edit distance. IEEE Trans. Pattern Anal. Mach. Intell. 20(5), 522–532 (1998).
[8] D.H. Smith, N. Aboluion, H. Montemanni and S. Perkins. Linear and nonlinear constructions of DNA codes with Hamming distance d and constant GC-content. Discr. Math. 311(13), 1207–1219 (2011).
[9] D.D. Shoemaker, D.A. Lashkari, D. Morris, M. Mittman and R.W. Davis. Quantitative phenotypic analysis of yeast deletion mutant using a highly parallel molecular bar-coding strategy. Nat. Genet. 14, 450–456 (1996).
[10] J. Sun. Bounds on edit metric codes with combinatorial DNA constraints. Master's Thesis, Brock University (2009).