arXiv:1505.06262v1 [cs.IT] 23 May 2015
Greedy Construction of DNA Codes and New
Bounds
Nabil Bennenni, Kenza Guenda ∗, T. Aaron Gulliver
Abstract
In this paper, we construct linear codes over Z4 with bounded GC-content. The codes are obtained using a greedy algorithm over Z4. Further, upper and lower bounds are derived for the maximum size of DNA
codes of length n with constant GC-content w and edit distance d.
keywords: DNA codes, GC-content, edit distance, upper and lower
bounds.
1 Introduction
Deoxyribonucleic acid (DNA) contains the genetic program for the biological development of life. DNA is formed by strands linked together
and twisted in the shape of a double helix. Each strand is a sequence
of four possible nucleotides, two purines, adenine A and guanine G, and
two pyrimidines, thymine T and cytosine C. The ends of a DNA strand
are chemically polar with 5′ and 3′ ends, which implies that the strands
are oriented. Hybridization, known as base pairing, occurs when a strand
binds to another strand, forming a double strand of DNA. The strands
are linked following the Watson-Crick model. Every A is linked with a
T, and every C with a G, and vice versa. We denote the complement of
X by X̂, i.e., Â = T, T̂ = A, Ĝ = C and Ĉ = G. The pairing is done in
the opposite direction and the reverse order. For instance, the Watson-Crick
complementary (WCC) strand of 3′-ACTTAGA-5′ is the strand 5′-TCTAAGT-3′.
The WCC property of DNA strands is used in DNA computing. In
this case the data is encoded using DNA strands, and molecular biology techniques are used to simulate arithmetic and logical operations.
The main advantages of this approach are huge memory capacity, massive parallelism, and low power molecular hardware and software. Other
applications make use of the properties of DNA [9].
In this paper, we construct linear codes over Z4 with bounded GC-content. The codes are obtained using a greedy algorithm over Z4. Further, upper and lower bounds on the maximum size of DNA codes of
length n with constant GC-content w and edit distance d are given.
∗ N. Bennenni and K. Guenda are with the Faculty of Mathematics USTHB, University of Science and Technology of Algiers, Algeria.
The choice of the ring Z4 comes from the fact that the bounded GC-content
and bounded edit distance properties are multiplicative over Z4.
This is not the case over F4. The bounded GC-content constraint ensures that
all codewords have thermodynamic characteristics below some threshold.
This is an important criterion for DNA sequences as it reduces the probability of erroneous cross-hybridization.
In [2], Chee and Ling gave an algorithm to construct DNA codes with
large GC-content which are optimal only up to n = 12. Bishop et al. [1]
considered the construction of random codes with fixed GC-content using
a probabilistic model. King [5] and Condon et al. [6] gave several upper
and lower bounds on the maximum size of DNA codes of length n with
constant GC-content w and Hamming distance d. It is well known that
the Hamming distance does not capture the thermodynamic and combinatorial properties of DNA strands. In fact, the edit distance is a much
more appropriate metric for designing codes for DNA computing. Thus,
in the second part of this paper upper and lower bounds are derived for
the maximum size of DNA codes of length n with constant GC-content w
and edit distance d.
The remainder of this paper is organized as follows. In Section 2, some
preliminary results are presented. Section 3 employs a greedy algorithm
to obtain DNA codes with bounded GC-content, and in Section 4 DNA
lexicodes are constructed with bounded edit distance. Upper and lower
bounds on the edit distance are also presented. In addition, examples of
DNA codes with bounded GC-content and edit distance are given.
2 Preliminaries
The ring Z4 with elements {0, 1, 2, 3} is considered here with addition and
multiplication modulo 4. It is a finite chain ring with maximal ideal ⟨2⟩
and nilpotency index 2. The Hamming weight of a codeword x in $\mathbb{Z}_4^n$ is
defined as $w_H(x) = n_1(x) + n_2(x) + n_3(x)$, where $n_i(x)$ denotes the number
of coordinates of x equal to i, and the Hamming distance $d_H(x, y)$ between two
codewords x and y as $w_H(x - y)$. We define the reverse of
$x = (x_0 x_1 \cdots x_{n-1})$ to be $x^R = (x_{n-1} x_{n-2} \cdots x_1 x_0)$.
The elements {0, 1, 2, 3} of Z4 are in one-to-one correspondence with
the nucleotide DNA bases {A, T, C, G} by the map φ such that 0 → G,
1 → A, 2 → C and 3 → T.
The complement of the codeword $x = (x_0 x_1 \cdots x_{n-1})$ is the vector
$x^C = (\hat{x}_0 \hat{x}_1 \cdots \hat{x}_{n-1})$. The reverse complement (also called the
Watson-Crick complement) is $x^{RC} = (\hat{x}_{n-1} \hat{x}_{n-2} \cdots \hat{x}_1 \hat{x}_0)$.
For $x \in \mathbb{Z}_4$, $\hat{x}$ is defined to be $\widehat{\phi(x)}$.
A linear code C is said to satisfy the reverse constraint,
respectively the reverse-complement constraint, if for all x ∈ C we have
$x^R \in C$, respectively $x^{RC} \in C$.
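The map φ and the reverse, complement and reverse-complement operations can be sketched in a few lines of Python. This is only an illustration under the definitions above; the function names are hypothetical and do not come from the paper. Under φ, complementing a symbol of Z4 amounts to adding 2 modulo 4.

```python
# Map phi from Z4 to nucleotides, as defined above: 0->G, 1->A, 2->C, 3->T.
PHI = {0: "G", 1: "A", 2: "C", 3: "T"}
PHI_INV = {base: sym for sym, base in PHI.items()}
# Watson-Crick pairing on nucleotides.
WC = {"A": "T", "T": "A", "G": "C", "C": "G"}

def phi(x):
    """DNA strand corresponding to a vector x over Z4."""
    return "".join(PHI[a] for a in x)

def complement_symbol(a):
    """Element of Z4 whose image under phi is the complement of phi(a)."""
    return PHI_INV[WC[PHI[a]]]

def reverse(x):
    return tuple(reversed(x))

def complement(x):
    return tuple(complement_symbol(a) for a in x)

def reverse_complement(x):
    return complement(reverse(x))

x = (1, 2, 3, 3, 1, 0, 1)             # phi(x) = ACTTAGA
print(phi(x), phi(reverse_complement(x)))
```

For the strand ACTTAGA used in the Introduction, the sketch returns TCTAAGT as the reverse complement.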
2.1 Construction of Lexicodes over Z4
The construction of lexicodes over Z4 given in [4] is now reviewed. A
linear code C of length n over Z4 is an additive code over $\mathbb{Z}_4^n$. Thus
$\mathbb{Z}_4^n$ is a linear code over Z4 with basis $B = \{b_1, \ldots, b_n\}$. With respect
to this basis, we recursively define a lexicographically ordered list
$V_i = x_1, x_2, \ldots, x_{4^i}$ as follows
$$V_0 := 0,$$
$$V_i := V_{i-1},\ b_i + V_{i-1},\ 2b_i + V_{i-1},\ 3b_i + V_{i-1}, \quad 1 \le i \le n.$$
In this way $|V_i| = 4^i$, and $\mathbb{Z}_4^n$ can be associated with $V_n$. Assume now
that we have a property P which can test if a vector $c \in \mathbb{Z}_4^n$ is selected
or not. The selection property P on V can be seen as a boolean valued
function
$$P : V \to \{\mathrm{True}, \mathrm{False}\},$$
that depends on one variable. Over Z4, the property P is called a multiplicative property if P[x] is true implies P[3x] is true. The following
greedy algorithm provides lexicodes over $\mathbb{Z}_4^n$ [4].
Algorithm 1
1. C0 := 0; i := 1;
2. select the first vector ai ∈ Vi \Vi−1 such that P [2ai + c] holds for all c ∈
Ci−1 ;
3. if such an ai exists, then Ci := Ci−1 , ai + Ci−1 , 2ai + Ci−1 , 3ai + Ci−1 ;
otherwise Ci := Ci−1 ;
4. i := i + 1; return to 2.
For 0 < i ≤ n, the code Ci is forced to be linear because all linear
combinations of the selected vectors ai1 , · · · , ail , l ≤ i, are taken. The
code Ci has a ‘basis’ formed from ai1 , · · · , ail , so we have a nested sequence
of linear codes
0 = C0 ⊆ C1 ⊆ · · · ⊆ Cn .
Cn is the lexicode and is denoted Cn = C(B, P ) where B is the ordering
and P is the selection property. We have the following result.
Theorem 1. ([4, Theorem 4]) For any basis B of $\mathbb{Z}_4^n$ and any multiplicative selection criterion P, the lexicode C(B, P) is linear and P[x] holds for
each codeword x ≠ 0.
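The greedy construction can be sketched in Python as follows. This is a minimal illustration, not the implementation used in [4] or in this paper. Two assumptions are made and marked in the code: the candidate $a_i$ is tested on both $a_i + c$ and $2a_i + c$ (step 2 as printed mentions only $2a_i + c$), so that a multiplicative property P then also holds on $3a_i + c = 3(a_i + 3c)$; and the example property, which counts entries in {0, 2}, is chosen only for illustration. All function names are hypothetical.

```python
def add(u, v):
    """Componentwise sum in Z4^n."""
    return tuple((a + b) % 4 for a, b in zip(u, v))

def scale(k, u):
    """Scalar multiple k*u in Z4^n."""
    return tuple((k * a) % 4 for a in u)

def lexicode_z4(basis, prop):
    """Greedy lexicode sketch: returns (selected vectors, list of codewords)."""
    zero = (0,) * len(basis[0])
    V = [zero]          # ordered list V_{i-1}
    C = [zero]          # current code C_{i-1}
    selected = []
    for b in basis:
        # V_i \ V_{i-1}, in the order b + V, 2b + V, 3b + V
        new = [add(scale(k, b), v) for k in (1, 2, 3) for v in V]
        for a in new:
            # Assumption: test both a + c and 2a + c for every c in C.
            if all(prop(add(scale(k, a), c)) for k in (1, 2) for c in C):
                selected.append(a)
                cosets = [add(scale(k, a), c) for k in (1, 2, 3) for c in C]
                C = list(dict.fromkeys(C + cosets))   # drop repeats, keep order
                break
        V = V + new
    return selected, C

def even_entries_at_least(w):
    """Illustrative multiplicative property: at least w entries in {0, 2}."""
    return lambda x: sum(1 for a in x if a in (0, 2)) >= w

canonical = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]
gens, code = lexicode_z4(canonical, even_entries_at_least(2))
print(gens, len(code))
```

The list V grows to $4^n$ vectors, so the sketch is only practical for short lengths.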
3 A Greedy Algorithm for Bounded GC-content DNA Codes
In this section we construct DNA codes with bounded GC-content using
Algorithm 1. We begin with the following definition.
Definition 2. Let C be a linear code over $\mathbb{Z}_4^n$. The GC-content of a
codeword x ∈ C, denoted by GC(φ(x)), is the number of occurrences of G
and C in φ(x):
$$GC(\phi(x)) = |\{1 \le i \le n : \phi(x)_i \in \{G, C\}\}| = w_{GC}(\phi(x)).$$
We say that a subset C of $\mathbb{Z}_4^n$ satisfies the bounded GC-content constraint
if there exists a positive integer w such that $GC(\phi(x)) \ge w$ for all x ∈ C.
Remark 3. Definition 2 differs from the conventional definition [4, 3].
The bounded GC-content constraint ensures that all codewords have a
hybridization energy below some threshold, which results in stable DNA
strands.
Proposition 4. The property P1, defined by P1[x] is true if and only if
wGC(φ(x)) ≥ w, is a multiplicative property over Z4.
Proof 5. Let $x \in \mathbb{Z}_4^n$ be such that wGC(φ(x)) ≥ w. Multiplying the vector x by 3 does not change the number of 0’s and 2’s, since 3 · 0 = 0 and 3 · 2 = 2 modulo 4, while the 1’s and 3’s are interchanged. This gives that
wGC(φ(3x)) = wGC(φ(x)) ≥ w, and the result follows.
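The statement of Proposition 4 can be checked by brute force for short lengths. The following Python sketch, with hypothetical function names, counts the entries mapped by φ to G or C (the 0’s and 2’s) and tests that the property survives multiplication by 3.

```python
from itertools import product

def gc_content(x):
    """Number of entries of x that phi maps to G or C, i.e. the 0's and 2's."""
    return sum(1 for a in x if a in (0, 2))

def times3(x):
    """The vector 3x over Z4."""
    return tuple((3 * a) % 4 for a in x)

def is_multiplicative_p1(n, w):
    """Check that P1[x] true implies P1[3x] true for every x in Z4^n."""
    return all(gc_content(times3(x)) >= w
               for x in product(range(4), repeat=n)
               if gc_content(x) >= w)

print(all(is_multiplicative_p1(4, w) for w in range(5)))
```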
3.1 Construction Results
In this section, construction results are presented for linear codes over
Z4 with bounded GC-content. In this case, the verification step for
wGC(φ(2x)) ≥ w in Algorithm 1 can be eliminated. This is because every
entry of 2x is 0 or 2, so that wGC(φ(2x)) = n; hence for $x \in \mathbb{Z}_4^n$,
wGC(φ(x)) ≥ w implies that wGC(φ(2x)) ≥ w, and this improves
the speed of the algorithm. Some of these codes attain upper bound (5)
given in [5, Proposition 1]. Furthermore, the codes obtained are linear as
opposed to those in [8]. Table 1 gives DNA lexicodes over $\mathbb{Z}_4^n$ obtained using
the selection property P1[x] (wGC(φ(x)) ≥ w). The DNA code strands
corresponding to the first and second codes in Table 1 are given in Tables
2 and 3, respectively.
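The parameters in the first row of Table 1 can be checked directly from the listed basis. The following Python sketch, with hypothetical helper names, spans the code generated by 21111000, 13210100 and 32310010 over Z4, and computes its size, its minimum GC-content and its minimum nonzero Hamming weight. It also checks that every doubled codeword 2x has GC-content n.

```python
from itertools import product

BASIS = ["21111000", "13210100", "32310010"]   # first row of Table 1

def span_z4(basis):
    """All Z4-linear combinations of the given basis strings."""
    vecs = [tuple(int(ch) for ch in b) for b in basis]
    n = len(vecs[0])
    code = set()
    for coeffs in product(range(4), repeat=len(vecs)):
        word = tuple(sum(k * v[j] for k, v in zip(coeffs, vecs)) % 4
                     for j in range(n))
        code.add(word)
    return code

def gc_content(x):
    """Entries mapped by phi to G or C (the 0's and 2's)."""
    return sum(1 for a in x if a in (0, 2))

def hamming_weight(x):
    return sum(1 for a in x if a != 0)

code = span_z4(BASIS)
min_gc = min(gc_content(x) for x in code)
min_wt = min(hamming_weight(x) for x in code if any(x))
doubling_ok = all(gc_content(tuple((2 * a) % 4 for a in x)) == len(x)
                  for x in code)
print(len(code), min_gc, min_wt, doubling_ok)
```

For this basis the sketch reports 64 codewords with minimum GC-content 4 and minimum Hamming weight 4, in agreement with the values n = 8, w = 4 and dH = 4 of the first row.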
4 DNA Codes and Edit Distance
The edit distance has been used for biological computation, in particular for two types of genetic mutation. The first is the substitution of
nucleotides, which consists of two possible mutations:
• Transition: a purine is replaced by a purine (A ↔ G) or a pyrimidine
is replaced by a pyrimidine (T ↔ C).
• Transversion: a purine is replaced by a pyrimidine or the reverse
(e.g., A ↔ C).
The second is modification using insertions and deletions.
In this section, we consider the edit distance in the greedy algorithm in
order to find large sets of DNA codewords of length n with given wGC
and minimum edit distance d. We begin by providing a definition of edit
distance which follows the presentation in [7].
Let A and B be finite sets of distinct symbols and let $x^t \in A^t$ denote
an arbitrary string of length t over A. The string edit distance is characterized by a triple ⟨A, B, c⟩ consisting of the finite sets A and B, and the
primitive cost function $c : E \to \mathbb{R}_+$, where $\mathbb{R}_+$ is the set of nonnegative reals,
$E = E_s \cup E_d \cup E_i$ is the set of primitive edit operations, $E_s = A \times B$ is the
set of substitutions, $E_d = A \times \{\epsilon\}$ is the set of deletions, and $E_i = \{\epsilon\} \times B$ is
the set of insertions. Each triple ⟨A, B, c⟩ induces a distance function
$d_c : A^* \times B^* \to \mathbb{R}_+$ that maps a pair of strings to a nonnegative value [7].
Table 1: DNA Lexicodes over $\mathbb{Z}_4^n$ Obtained using the Selection Property P1[x]
(wGC(φ(x)) ≥ w)

n     w     dH    Basis of $\mathbb{Z}_4^n$    Basis of C(B, P)
8     4     4     Canonical basis              21111000
                                               13210100
                                               32310010
10    6     4     Canonical basis              2111100000
                                               1321010000
                                               3231001000
10    10    1     Canonical basis              2000000000
                                               0200000000
                                               0020000000
                                               0002000000
                                               0000200000
                                               0000020000
                                               0000002000
                                               0000000200
                                               0000000020
                                               0000000002
12    12    1     Canonical basis              200000000000
                                               020000000000
                                               002000000000
                                               000200000000
                                               000020000000
                                               000002000000
                                               000000200000
                                               000000020000
                                               000000002000
                                               000000000200
                                               000000000020
                                               000000000002
Table 2: DNA Code Strands Corresponding to the Linear Code in the First
Row of Table 1
GGGGGGGG
GGCCGGCC
GGGCCCGC
CAAAAGGG
TGGGAAAC
GGAAACTG
AAACTGGG
TTGGGAAC
CTTGGGAA
GGAACTTG
AAGGGCTT
GCTTAAGG
ATTGGGCA
TTTGGGAC
CTTTGGGA
CATTTGGG
GGGGCCCC
CCGGCCGG
GGGCCCCG
AAAAGGGC
CTGGGAAA
ACTGGGAA
AACTGGGA
ACTTGGGA
AACTTGGG
GAACTTGG
TTAAGGGC
TAAGGGCT
TTGGGCAA
TTGGGACT
TTTACGGG
GGACTTTG
CCCCGGGG
CGCGCGCG
GGGAAAAC
GGGAAACT
CTAAAGGG
GAAACTGG
GGGAACTT
TGGGAACT
GGGCTTAA
AGGGCTTA
CTTAAGGG
GGCTTAAG
GGGACTTT
TGGGACTT
GACTTTGG
GGGCTTTT
GAAAACCC
GATTTCCC
AACCCGTT
TTAACCCG
CCCAAAGT
TGGGCTTT
CCCGATTT
TTCCCGAA
AATTCCCG
TAAACCCG
TTGGGCTT
GCCCTTTT
GACCCTTT
CAATTCCG
GAACCCTT
GAAACCCT
Table 3: DNA Code Strands Corresponding to the Linear Code in the Second
Row of Table 1
GGGGGGGGGG
ATCAGAGGGG
CCGCGCGGGG
TACTGTGGGG
CAAAAGGGGG
TGTGAAGGGG
GTATACGGGG
ACTGATGGGG
GCCCCGGGGG
AAGTCAGGGG
CGCGCCGGGG
TTGACTGGGG
CTTTTGGGGG
TCAGTAGGGG
GATATCGGGG
AGACTTGGGG
TCTAGGAGGG
GAACGAAGGG
AGTTGCAGGG
CTAGGTAGGG
ATGCAGAGGG
CCCAAAAGGG
TAGGACAGGG
GGCAATAGGG
TGATCGAGGG
GTTGCAAGGG
ACAACCAGGG
CATCCTAGGG
AACGTGAGGG
CGGATAAGGG
TTCCTCAGGG
GCGTTTAGGG
GGCCGGCGGG
TTGTGACGGG
GCCGGCCGGG
AAGCGTCGGG
GATTAGCGGG
AGACAACGGG
CTTAACCGGG
TCACATCGGG
CCGGCGCGGG
TACACACGGG
CGGCCCCGGG
ATCTCTCGGG
GTAATGCGGG
ACTCTACGGG
CAATTCCGGG
TGTGTTCGGG
ACATGGTGGG
CATGGATGGG
TGAAGCTGGG
GTTCGTTGGG
TTCGAGTGGG
GCGTAATGGG
AACCACTGGG
CGGTATTGGG
AGTACGTGGG
CTACCATGGG
TCTTCCTGGG
GAAGCTTGGG
TAGCGGTGGG
GGCTTATGGG
ATGGTCTGGG
CCCATTTGGG
Table 4: DNA Lexicodes over $\mathbb{Z}_4^n$ Obtained using the Selection Property P2[x]
(dc(φ(x), φ(y)) ≤ m)

n     φ(x)    m     wGC    Basis of $\mathbb{Z}_4^n$
4     GGGG    1     4      Canonical basis
4     GCGC    2     4      Canonical basis

Basis of C(B, P)
2222
2202
2220
2022
2020
0022
0220
2222
Definition 6. The edit distance $d_c(x^t, y^v)$ between two strings $x^t \in A^t$
and $y^v \in B^v$ is defined recursively as
$$d_c(x^t, y^v) = \min \begin{cases} c(x_t, y_v) + d_c(x^{t-1}, y^{v-1}), \\ c(x_t, \epsilon) + d_c(x^{t-1}, y^v), \\ c(\epsilon, y_v) + d_c(x^t, y^{v-1}), \end{cases}$$
where $d_c(\epsilon, \epsilon) = 0$ and $\epsilon$ denotes the empty string.
The edit distance constraint for a DNA code C is $d_c(x, y) \ge d$ for all x, y ∈ C,
x ≠ y, for some prescribed minimum edit distance d. The edit distance
constraint can reduce non-specific hybridization between distinct codewords, as well as allow for the correction of insertion, deletion and substitution errors in codewords.
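The recursion in Definition 6 can be evaluated with a standard dynamic programming table. The Python sketch below assumes constant costs for substitution, deletion and insertion (unit costs by default, which gives the Levenshtein distance) and zero cost for matching symbols. These cost choices and the function names are assumptions made for illustration, not the cost function used in the paper.

```python
def edit_distance(x, y, sub=1, dele=1, ins=1):
    """Edit distance between strings x and y with constant operation costs."""
    t, v = len(x), len(y)
    # d[i][j] is the distance between the prefixes x[:i] and y[:j].
    d = [[0] * (v + 1) for _ in range(t + 1)]
    for i in range(1, t + 1):
        d[i][0] = d[i - 1][0] + dele          # delete all of x[:i]
    for j in range(1, v + 1):
        d[0][j] = d[0][j - 1] + ins           # insert all of y[:j]
    for i in range(1, t + 1):
        for j in range(1, v + 1):
            c = 0 if x[i - 1] == y[j - 1] else sub
            d[i][j] = min(c + d[i - 1][j - 1],   # substitution or match
                          dele + d[i - 1][j],    # deletion
                          ins + d[i][j - 1])     # insertion
    return d[t][v]

def min_edit_distance(code):
    """Smallest pairwise edit distance in a list of distinct codewords."""
    return min(edit_distance(x, y)
               for i, x in enumerate(code) for y in code[i + 1:])

print(edit_distance("GGGG", "GCGC"), edit_distance("ACGT", "CGTA"))
```

The strands ACGT and CGTA differ in all four positions, yet their edit distance is 2 (delete the leading A, insert A at the end), which illustrates how the edit distance differs from the Hamming distance.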
Proposition 7. The property P2, where P2[x] is true only if dc(φ(x), φ(y)) ≤ w, is
a multiplicative property over Z4.
Proof 8. Let $x \in \mathbb{Z}_4^n$ and $y \in \mathbb{Z}_4^n$. Multiplying x by 3 and y by 3 does
not change the number of 0’s and 2’s. Therefore the number of 1’s and
3’s also does not change, so
$$n_1(x) + n_0(x) + n_2(x) + n_3(x) = n_1(3x) + n_0(3x) + n_2(3x) + n_3(3x).$$
This also holds for y and thus $d_c(x, y) = d_c(3x, 3y)$.
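The identity $d_c(x, y) = d_c(3x, 3y)$ can be checked exhaustively for short lengths. The Python sketch below assumes unit costs for all edit operations, so that $d_c$ is the Levenshtein distance; this cost choice and the function names are assumptions for illustration. Multiplication by 3 fixes 0 and 2 and swaps 1 and 3, so under φ it swaps A and T in both strands.

```python
from functools import lru_cache
from itertools import product

PHI = "GACT"   # PHI[a] is the nucleotide phi(a) for a in Z4

@lru_cache(maxsize=None)
def lev(x, y):
    """Unit-cost edit distance, following the recursion on last symbols."""
    if not x:
        return len(y)
    if not y:
        return len(x)
    return min(lev(x[:-1], y[:-1]) + (x[-1] != y[-1]),   # substitution or match
               lev(x[:-1], y) + 1,                       # deletion
               lev(x, y[:-1]) + 1)                       # insertion

def phi(x):
    return "".join(PHI[a] for a in x)

def times3(x):
    return tuple((3 * a) % 4 for a in x)

def invariant_under_times3(n):
    """Check d(phi(x), phi(y)) == d(phi(3x), phi(3y)) for all x, y in Z4^n."""
    words = list(product(range(4), repeat=n))
    return all(lev(phi(x), phi(y)) == lev(phi(times3(x)), phi(times3(y)))
               for x in words for y in words)

print(invariant_under_times3(3))
```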
Now we use Algorithm 1 to construct linear codes over Z4 with GC-content bounded by w and edit distance dc(φ(x), φ(y)) such that $x \in \mathbb{Z}_4^*$
and $y \in \mathbb{Z}_4^*$. The results are given in Table 4.
4.1 Upper and Lower Bounds
Let $A_4(n, d)$ be the maximum size of a code over Z4 with length n and
minimum edit distance d. Let $A_4^{GC}(n, d, w)$ be the maximum size of a
DNA code with length n, minimum edit distance d, and fixed GC weight w.
Further, let $A_4^{R,GC}(n, d, w)$, respectively $A_4^{RC,GC}(n, d, w)$, be the maximum
size of a DNA code with length n, minimum edit distance d, and fixed GC
weight w, that satisfies the reverse constraint, respectively the reverse-complement constraint. The purpose of this section is to give upper and
lower bounds on these quantities. We have the following theorem.
Theorem 9. For n > 0 with 0 ≤ d ≤ n and 0 ≤ w ≤ n, the following
results hold.
$$A_4^{GC}(n, d, 0) = A_2(n, d), \tag{1}$$
$$A_4^{GC}(n, d, w) = A_4^{GC}(n, d, n - w), \tag{2}$$
and if w = n/2 then
$$A_4^{GC}(n, d, w) = 4. \tag{3}$$
Proof 10. The analogous result for DNA codes with GC-content and
Hamming distance was given in [5]. The corresponding proof is employed
here for the edit distance.
(1): Let C be a linear code over $\mathbb{Z}_4^n$ with $w_{GC}(\phi(C)) = 0$. Then φ(C) contains only A’s and T’s, that is, C contains only 1’s and 3’s, so C can be considered as a binary code, which gives
$A_4^{GC}(n, d, 0) = A_2(n, d)$.
(2): Since $w_{GC}(\phi(C)) = n - w_{AT}(\phi(C))$, interchanging the A’s with C’s
and T’s with G’s gives $w_{GC}(\phi(C)) = n - w$, so that
$A_4^{GC}(n, d, w) = A_4^{GC}(n, d, n - w)$.
(3): Since $A_4^{RC,GC}(n, d, w) \le A_4^{GC}(n, d, w)$, by [10, Theorem 5] we have
that $A_4^{RC,GC}(n, d, w) = 2$. Then $4 \le A_4^{GC}(n, d, w)$, and by the pigeonhole
principle $A_4^{GC}(n, d, w) \ge 4$, so that $A_4^{GC}(n, d, w) = 4$.
We have the following relationship between the GC-content of a code
and the code size over the alphabet {A, T, C, G}.
Proposition 11.
$$A_4^{GC}(n, d, w) \ge A_4^{GC}(n + 1, d + 1, w). \tag{4}$$
$$A_4^{GC}(n, d, w) \ge A_4^{GC}(n + 1, d, w)/4. \tag{5}$$
Proof 12. The analogous result for DNA codes with unrestricted GC-content and Hamming distance was given in [6]. The corresponding proof
is employed here for the edit distance.
(4): An $(n, A_4^{GC}(n + 1, d + 1, w), d, w)$ code can be obtained from an
$(n + 1, A_4^{GC}(n + 1, d + 1, w), d + 1, w)$ code by removing a symbol from each
codeword such that their GC-content is preserved.
(5): If all the codewords in an $(n + 1, A_4^{GC}(n + 1, d, w), d, w)$ code are partitioned into four subsets according to the first symbol, one of the subsets
will have size at least $A_4^{GC}(n + 1, d, w)/4$ and thus is an
$(n + 1, A_4^{GC+}(n + 1, d, w)/4, d, w)$ code. By removing the (common) symbol from all codewords in the largest subset, an $(n, A_4^{GC+}(n + 1, d, w)/4, d, w)$ code is obtained.
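The partition step in the proof of (5) can be sketched in Python as follows. The example code and the function names are hypothetical, and unit edit costs are assumed. With unit costs, removing a common first symbol from two strands does not change their edit distance, so the shortened code has minimum distance at least that of the original code. The sketch does not track the GC-content, which drops by one when the removed symbol is G or C.

```python
from collections import defaultdict

def lev(x, y):
    """Unit-cost edit distance computed row by row."""
    prev = list(range(len(y) + 1))
    for i, a in enumerate(x, start=1):
        cur = [i]
        for j, b in enumerate(y, start=1):
            cur.append(min(prev[j - 1] + (a != b),   # substitution or match
                           prev[j] + 1,              # deletion
                           cur[j - 1] + 1))          # insertion
        prev = cur
    return prev[-1]

def min_dist(code):
    """Smallest pairwise distance, or None for fewer than two codewords."""
    return min((lev(x, y) for i, x in enumerate(code) for y in code[i + 1:]),
               default=None)

def shorten(code):
    """Keep the largest class of codewords sharing a first symbol, drop it."""
    classes = defaultdict(list)
    for word in code:
        classes[word[0]].append(word)
    largest = max(classes.values(), key=len)
    return [word[1:] for word in largest]

example = ["AACC", "AGGT", "ATTA", "CCAA", "GTGT"]   # hypothetical code
short = shorten(example)
print(short, min_dist(example), min_dist(short))
```

By the pigeonhole principle, the shortened code contains at least a quarter of the original codewords.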
We have the following relationship between the GC-content of a reverse
code and the code size over the alphabet {A, T, C, G}.
Proposition 13.
$$A_4^{GC,R}(n - 1, d, w) \le A_4^{GC,R}(n, d, w) \le A_4^{GC,R}(n, d - 1, w). \tag{6}$$
$$A_4^{GC,R}(n - 1, d, w) \ge A_4^{GC,R}(n, d, w)/4. \tag{7}$$
Proof 14. The analogous result for DNA codes with unrestricted GC-content and Hamming distance was given in [6]. The corresponding proof
is used here for the edit distance.
(6): By the construction of codes over Z4, we obtain $4^n$ codewords of
length n and $4^{n-1}$ codewords of length n − 1, and the result follows.
(7): The codewords of an $(n, A_4^{GC,R}(n, d, w), d)$ code over Z4 can be partitioned into four subsets denoted C1, C2, C3, C4 such that the size of subset
C1 is at least $A_4^{GC,R}(n, d, w)/4$ and C1 is an $(n, A_4^{GC,R}(n, d, w)/4, d)$ code.
Removing a symbol from the codewords of C1 such that the distance d and
weight w are maintained, we obtain an $(n - 1, A_4^{GC,R}(n, d, w)/4, d)$ code, and
the result follows.
Proposition 15. For 0 ≤ d ≤ n and 0 ≤ w ≤ n
$$A_4^{GC,RC}(n, d, w) = A_4^{GC,R}(n, d, w),$$
if n is even, and
$$A_4^{GC,R}(n, d + 1, w) \le A_4^{GC,RC}(n, d, w) \le A_4^{GC,R}(n, d - 1, w),$$
if n is odd.
Proof 16. The analogous result for DNA codes with unrestricted GC-content and Hamming distance was given in [5]. The corresponding proof is
employed here for the edit distance. Given a set of codewords of length n,
if we replace all entries in any subset of the positions by their complement,
the GC-content of these codewords is preserved, as well as the edit distance
between any pair of codewords. The edit distance between a codeword and
the reverse or reverse-complement of the other codewords is not in general
preserved, but if n is even and the first n/2 coordinates of each codeword
$x_i$ are replaced by their complements to form a new codeword $y_i$, then
$d_c(x_i, x_j^R) = d_c(y_i, y_j^{RC})$ for all codewords $x_i$ and $x_j$. Similarly, if n is
odd and the first (n − 1)/2 coordinates of each codeword $x_i$ are replaced
by their complements to form $y_i$, then $|d_c(x_i, x_j^R) - d_c(y_i, y_j^{RC})| \le 1$.
References
[1] M.A. Bishop, A.G. D’Yachkov, A.J. Macula, T.E. Renz and V.V.
Rykov. Free energy gap and statistical thermodynamic fidelity of DNA
codes. J. Comp. Biol. 14(8), 1088–1104 (2007).
[2] Y.M. Chee and S. Ling. Improved lower bounds for constant GC-content DNA codes. IEEE Trans. Inform. Theory 54(1), 391–394
(2008).
[3] K. Guenda, T.A. Gulliver and P. Solé. On cyclic DNA codes. Proc. IEEE Int. Symp. Inform. Theory, Istanbul, 121–125 (2013).
[4] K. Guenda, T.A. Gulliver and S.A. Sheikholeslam. Lexicodes over
rings. Des. Codes Cryptogr. 72(3), 749–763 (2014).
[5] O.D. King. Bounds for DNA codes with constant GC-content. Electron. J. Combin. 10, R33 (2003).
[6] A. Marathe, A.E. Condon and R.M. Corn. On combinatorial DNA
word design. J. Comp. Biol. 8(3), 201–219 (2001).
[7] E.S. Ristad and P.N. Yianilos. Learning string-edit distance. IEEE
Trans. Pattern Anal. Mach. Intell. 20(5), 522–532 (1998).
[8] D.H. Smith, N. Aboluion, H. Montemanni and S. Perkins. Linear
and nonlinear constructions of DNA codes with Hamming distance d
and constant GC-content. Discr. Math. 311(13), 1207–1219 (2011).
[9] D.D. Shoemaker, D.A. Lashkari, D. Morris, M. Mittman and R.W.
Davis. Quantitative phenotypic analysis of yeast deletion mutants using a highly parallel molecular bar-coding strategy. Nat. Genet. 14,
450–456 (1996).
[10] J. Sun. Bounds on edit metric codes with combinatorial DNA constraints. Master’s Thesis, Brock University (2009).