The "Wow! signal" of the terrestrial genetic code

Vladimir I. shCherbak (a) and Maxim A. Makukov (b)*

(a) Department of Mathematics, al-Farabi Kazakh National University, Almaty, Republic of Kazakhstan
e-mail: [email protected]
(b) Fesenkov Astrophysical Institute, Almaty, Republic of Kazakhstan
e-mail: [email protected], [email protected]
ARTICLE INFO

This is the authors’ version of the manuscript published in Icarus. Changes resulting from the publishing process may not be reflected in this document. Changes may have been made to this work since it was submitted for publication. For the journal version see http://dx.doi.org/10.1016/j.icarus.2013.02.017

Article history: Submitted 26 June 2012; Revised 3 October 2012; Revised 31 January 2013; Accepted 12 February 2013

Keywords: astrobiology; genetic code; directed panspermia; SETI

* Corresponding author

For additional information see: https://bioseti.info

ABSTRACT

It has been repeatedly proposed to expand the scope for SETI, and one of the suggested alternatives to radio is the biological media. Genomic DNA is already used on Earth to store non-biological information. Though smaller in capacity, but stronger in noise immunity is the genetic code. The code is a flexible mapping between codons and amino acids, and this flexibility allows modifying the code artificially. But once fixed, the code might stay unchanged over cosmological timescales; in fact, it is the most durable construct known. Therefore it represents an exceptionally reliable storage for an intelligent signature, if that conforms to biological and thermodynamic requirements. As the actual scenario for the origin of terrestrial life is far from being settled, the proposal that it might have been seeded intentionally cannot be ruled out. A statistically strong intelligent-like “signal” in the genetic code is then a testable consequence of such scenario. Here we show that the terrestrial code displays a thorough precision-type orderliness matching the criteria to be considered an informational signal. Simple arrangements of the code reveal an ensemble of arithmetical and ideographical patterns of the same symbolic language. Accurate and systematic, these underlying patterns appear as a product of precision logic and nontrivial computing rather than of stochastic processes (the null hypothesis that they are due to chance coupled with presumable evolutionary pathways is rejected with P-value < 10⁻¹³). The patterns are profound to the extent that the code mapping itself is uniquely deduced from their algebraic representation. The signal displays readily recognizable hallmarks of artificiality, among which are the symbol of zero, the privileged decimal syntax and semantical symmetries. Besides, extraction of the signal involves logically straightforward but abstract operations, making the patterns essentially irreducible to any natural origin. Plausible way of embedding the signal into the code and possible interpretation of its content are discussed. Overall, while the code is nearly optimized biologically, its limited capacity is used extremely efficiently to store non-biological information.
Introduction

Recent biotech achievements make it possible to employ genomic DNA as data storage more durable than any media currently used (Bancroft et al., 2001; Yachie et al., 2008; Ailenberg & Rotstein, 2009). Perhaps the most direct application for that was proposed even before the advent of synthetic biology. Considering alternative informational channels for SETI, Marx (1979) noted that genomes of living cells may provide a good instance for that. He also noted that the genetic code is even more durable. Exposed to strong negative selection, the code stays unchanged for billions of years, except for rare cases of minor variations (Knight et al., 2001) and context-dependent expansions (Yuan et al., 2010). And yet, the mapping between codons and amino acids is malleable, as they interact via modifiable molecules of tRNAs and aminoacyl-tRNA synthetases (Giegé et al., 1998; Ibba & Söll, 2000; see also Appendix A). This ability to reassign codons, thought to underlie the evolution of the code to multilevel optimization (Bollenbach et al., 2007), also allows the code to be modified artificially (McClain & Foss, 1988; Budisa, 2006; Chin, 2012). It is possible, at least in principle, to arrange a mapping that both conforms to functional requirements and harbors a small message or a signature, allowed by 384 bits of informational capacity of the code. Once a genome is appropriately rewritten (Gibson et al., 2010), the new code with a signature will stay frozen in the cell and its progeny, which might then be delivered through space and time to putative recipients. Being energy-efficient (Rose & Wright, 2004) and self-replicating, the biological channel is also free from problems peculiar to radio signals: there is no need to rely on time of arrival, frequency and direction. Thus, due to these restrictions the origin of the famous “Wow!” signal received in 1977 remains uncertain (Ehman, 2011). The biological channel has been given serious consideration for its merits in SETI, though with the focus on genomes (Yokoo & Oshima, 1979; Freitas, 1983; Nakamura, 1986; Davies, 2010; Davies, 2012). Meanwhile, it has been proposed to secure terrestrial life by seeding exoplanets with living cells (Mautner, 2000; Tepfer, 2008), and that seems to be a matter of time. The biological channel suggests itself in this enterprise. To avoid anthropocentric bias, it might be admitted that terrestrial life is not the starting point in the series of cosmic colonization (Crick & Orgel, 1973; Crick, 1981). If so, it is natural to expect a statistically strong intelligent-like “signal” in the terrestrial genetic code (Marx, 1979). Such possibility is incited further by the fact that how the code came to be apparently non-random and nearly optimized still remains disputable and highly speculative (for reviews on traditional models of the code evolution see Knight et al., 1999; Gusev & Schulze-Makuch, 2004; Di Giulio, 2005; Koonin & Novozhilov, 2009).

The only way to extract a signal, if any, from the code is to arrange its elements – codons, amino acids and syntactic signs – by their parameters using some straightforward logic. These arrangements are then analyzed for patterns or grammar-like structures of some sort. The choice of arrangements and parameters should exclude arbitrariness. For example, only those parameters should be considered which do not depend on systems of physical units. However, even in this case it is unknown a priori exactly what kind of patterns one might expect. So there is a risk of false positives, as with a data set like the genetic code it is easy to find various patterns of one kind or another.
Nonetheless, the task might be somewhat alleviated. First, it is possible to predict some general aspects of a putative signal and its “language”, especially if one takes advantage of active SETI experience. For example, it is generally accepted that numerical language of arithmetic is the same for the entire universe (Freudenthal, 1960; Minsky, 1985). Besides, symbols and grammar of this language, such as positional numeral systems with zero conception, are hallmarks of intelligence. Thus, interstellar messages sent from the Earth usually began with natural sequence of numbers in binary or decimal notation. To reinforce the artificiality, a symbol of zero was placed in the abstract position preceding the sequence. Those messages also included symbols of arithmetical operations, Egyptian triangle, DNA and other notions of human consciousness (Sagan et al., 1972; The Staff at the NAIC, 1975; Sagan et al., 1978; Dumas & Dutil, 2004).

Second, to minimize the risk of false positives one can impose requirements as restrictive as possible on a putative signal. For example, it is reasonable to expect that a genuinely intelligent message would represent not just a collection of patterns of various sorts, but patterns of the same “linguistic style”. In this case, if a potential pattern is noticed, further search might be narrowed down to the same sort of patterns. Another stringent requirement might be that patterns should involve each element of the code in each arrangement, whereas the entire signal should occupy most, if not all, of the code’s informational capacity. By and large, given the nature of the task, specifics of the strategy are defined en route.

Following these lines, we show that the terrestrial code harbors an ensemble of precision-type patterns matching the requirements mentioned above. Simple systematization of the code reveals a strong informational signal comprising arithmetical and ideographical components. Remarkably, independent patterns of the signal are all expressed in a common symbolic language. We show that the signal is statistically significant, employs informational capacity of the code entirely, and is untraceable to natural origin. The models of emergence of primordial life with original signal-free genetic code are beyond the scope of this paper; whatever it was, the earlier state of the code is erased by palimpsest of the signal.

Background

Should there be a signal in the code, it would likely have manifested itself someway during the half-century history of traditional analysis of the code organization. So it is of use to summarize briefly what has been learned about that up to date. Also, for the sake of simplicity in data presentation, we will mention in advance some a posteriori information concerning the signal to be described, with fuller discussion in due course. We suggest to a reader unfamiliar with molecular mechanisms behind the genetic code first to refer to Appendix A, where it is also explained why the code is amenable to intentional “modulation” (to use the language of radio-oriented SETI) and, at the same time, is highly protected from casual “modulation” (has strong noise immunity).

The code at a glance. As soon as the genetic code was biochemically cracked (Nirenberg et al., 1965), its non-random structure became evident (Woese, 1965; Crick, 1968). The most obvious pattern that emerged in the code was its regular redundancy. The code comprises 16 codon families beginning with the same pair of bases, and these families generally consist of either one or two equal series of codons mapped to one amino acid or to Stop (Fig. 1a). In effect, the standard code is nearly symmetric in redundancy. There are only two families split unequally: those beginning with TG and AT. The minimum action to restore the symmetry is to match TG-family against AT-family by reassigning TGA from Stop to cysteine. Incidentally, this symmetrized version is not just a theoretical guess but is also found in nature as the nuclear code of euplotid ciliates (Meyer et al., 1991). While the standard code stores the arithmetical component of the signal, the symmetrical euplotid version keeps the ideographical one (the interrelation between these two code versions is discussed later). Regular redundancy leads also to the block structure of the genetic code. This makes it possible to depict the code in a contracted form, where each amino acid corresponds to a single block, or a contracted series (Fig. 1b). The three exceptions are Arg, Leu and Ser, which have one IV-series and one II-series each.

Apart from regular redundancy, a wealth of other features were reported afterwards, among which are robustness to errors (Alff-Steinberger, 1969), correlation between thermostability and redundancy of codon families (Lagerkvist, 1978), non-random distribution of amino acids among codons if judged by their polarity and bulkiness (Jungck, 1978), biosynthetic pathways (Taylor & Coates, 1989), reactivity (Siemion & Stefanowicz, 1992), and even taste (Zhuravlev, 2002). The code was also shown to be effective at handling additional information in DNA (Baisnée et al., 2001; Itzkovitz & Alon, 2007). Apparently, these features are related, if anything, to the direct biological function of the code. There are also a number of abstract approaches to the code, such as those based on topology (Karasev & Stefanov, 2001), information science (Alvager et al., 1989), and number theory (Dragovich, 2012). However, the main focus of these approaches is in constructing theoretical model descriptions of known features in the code, rather than dealing with new ones.

Fig. 1. The genetic code. (a) Traditional representation of the standard, or universal, code. Codons coding the same amino acid form synonymic series denoted with opening braces. Number of codons in a series defines its redundancy (degeneracy). Whole codon families consist of one series of redundancy IV. Other families are split. Most split families are halved into two series of redundancy II each, one ending with pyrimidines {T, C} and another with purines {A, G}. Three codons in the standard code are not mapped to any amino acid and are used as Stop in translation. The Start is usually signified by ATG which codes Met. Closing brace shows the only difference between the euplotid and the standard code. (b) Contracted representation of the euplotid version. Synonymous full-size codons are replaced by a single contracted series with combined third base. FASTA designations are used: R and Y stand for purines and pyrimidines, respectively, N stands for all four bases and H stands for {T, C, A}. Series are placed vertically for further convenience. The pictogram on the left helps in figures below. Filled elements denote whole families here.
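The regular redundancy described for Fig. 1 can be checked with a few lines of Python. This is only an illustrative sketch: the one-letter table `AMINO` (the standard code in TCAG order, `*` for Stop) and the helper names `family_redundancies` and `classify` are assumptions introduced here, not part of the original analysis.

```python
from collections import Counter, defaultdict

BASES = "TCAG"
# Standard code, one letter per codon, bases ordered T, C, A, G; '*' marks Stop.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD = {a + b + c: AMINO[16 * i + 4 * j + k]
            for i, a in enumerate(BASES)
            for j, b in enumerate(BASES)
            for k, c in enumerate(BASES)}
EUPLOTID = dict(STANDARD, TGA="C")  # TGA reassigned from Stop to Cys


def family_redundancies(code):
    """Map each two-base prefix to the sorted redundancies of its series."""
    fam = defaultdict(Counter)
    for codon, aa in code.items():
        fam[codon[:2]][aa] += 1
    return {prefix: sorted(counts.values()) for prefix, counts in fam.items()}


def classify(code):
    """Split the 16 families into whole, evenly halved and unequally split ones."""
    fam = family_redundancies(code)
    whole = sorted(p for p, r in fam.items() if r == [4])
    halved = sorted(p for p, r in fam.items() if r == [2, 2])
    unequal = sorted(p for p, r in fam.items() if r not in ([4], [2, 2]))
    return whole, halved, unequal


whole, halved, unequal = classify(STANDARD)
print(len(whole), len(halved), unequal)
print(family_redundancies(STANDARD)["TG"], family_redundancies(EUPLOTID)["TG"])
```

In the standard code only the TG and AT families come out unequally split; after the euplotid reassignment both show the same 3 + 1 split, which is the symmetrization mentioned in the text.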
All in all, only two intrinsic regularities, observed early on in the study of the code, might suggest possible relation to a putative signal due to their conspicuous and unambiguous character. They also suggest two dimensionless integer parameters for signal extraction. These are quantity of codons in a series mapped to one amino acid (redundancy) and quantity of nucleons in amino acid molecules. These parameters might be called “ostensive numerals” by analogy with the quantity of radio beeps in Lingua Cosmica (Freudenthal, 1960).

Rumer’s bisection. Rumer (1966) bisected the code by redundancy – the first “ostensive numeral”. There are 8 whole families and 8 split families in the code (Fig. 2a). Rumer found that codons in these families are mapped to each other in a one-to-one fashion with a simple relation T↔G, C↔A, now known as Rumer’s transformation. There are two more transformations of such type: T↔C, A↔G and T↔A, C↔G. They also appear in Rumer’s bisection and each makes half of what Rumer’s transformation makes alone.

Arbitrary bisection of the code has small chances to produce a transformation, and still less – their ordered set (see Appendix B). Rumer’s finding was rediscovered by Danckwerts & Neubert (1975), who also noted that this set might be described with a structure known in mathematics as the Klein-4 group. That triggered a series of yet other models involving group theory to describe the code (Bertman & Jungck, 1979; Hornos & Hornos, 1993; Bashford et al., 1998), which, admittedly, did not gain decisive insights. Meanwhile, in traditional theories of the code evolution this feature was ignored altogether, though it was repeatedly rediscovered (e.g., see Wilhelm & Nikolajewa, 2004). Noteworthy, this regularity – which turns out to be a small portion of the signal – was first noticed immediately after codon assignments were elucidated. Together with the fact of rediscoveries, this speaks for the anticryptographic nature of the signal inside the code.

Amino acid nucleons. Hasegawa & Miyata (1980) arranged amino acids in order of increasing nucleon number – the second “ostensive numeral” which, unlike other amino acid properties, does not rely on arbitrarily chosen system of units. Such arrangement reveals a rough anticorrelation: the greater the redundancy the smaller the nucleon number (Fig. 2b). This promoted speculations that prevailing small amino acids occupied the series of higher redundancy during the code evolution. As shown below, this anticorrelation is a derivative of the signal. Moreover, exactly this observation suggests simple systematization for both “ostensive numerals”: monotonous arraying of nucleon and redundancy numbers in opposite directions.

On the whole, Hasegawa and Miyata dealt with amino acids whereas Rumer dealt with codons. Combined, these approaches yield assignments between codons and amino acid nucleon numbers convenient for systematization. Stop-codons code for no amino acid; therefore, to include them into the systematization, they are assigned a zero nucleon number.

The activation key. All arithmetical patterns considered further appear with the differentiation between blocks and chains in all 20 amino acids and with the subsequent transfer of one nucleon from side chain to block in proline (Fig. 2b). Proline is the only exception from the general structure of amino acids: it holds its side chain with two bonds and has one hydrogen less in its block. The mentioned transfer in proline “standardizes” its block nucleon number to 73 + 1 and reduces its chain nucleons to 42 – 1. In itself, the distinction between blocks and chains is purely formal: there is no stage in protein synthesis where amino acid side chains are detached from standard blocks. Therefore, there is no natural reason for nucleon transfer in proline; it can be simulated only in the mind of a recipient to achieve the array of amino acids with uniform structure. Such nucleon transfer thus appears artificial. However, exactly this seems to be its destination: it protects the patterns from any natural explanation. Minimizing the chances for appealing to natural origin is a distinct concern in messaging of such kind, and this problem seems to be solved perfectly for the signal in the genetic code. Applied systematically without exceptions, the artificial transfer in proline enables holistic and arithmetically precise order in the code. Thus, it acts as an “activation key”. While nature deals with the actual proline which does not produce the signal in the code, an intelligent recipient easily finds the key and reads messages in arithmetical language (see also Discussion).

Decimalism. The arithmetical patterns to be described hold true in any numeral system. However, as it turned out, expressed in positional decimal system, they all acquire conspicuously distinctive notation. Therefore, here we briefly provide some relevant information.

Fig. 2. Preceding observations. (a) Rumer’s bisection. Whole families are opposed to split ones, thereby bisecting the code. Codons in opposed families are mapped to each other with the ordered set of Rumer’s transformation and two half-transformations. Transformation of third bases is trivial as they are the same in any family; therefore contracted representation is adequate to show this regularity. The regularity is valid both for the standard and the euplotid (shown here) version. (b) Categorization of amino acids by nucleon numbers. Free molecules unmodified by cytoplasmic environment are shown. Each of them is formed of the standard block and a side chain. Blocks are identical in all amino acids except proline. Chains are unique for each amino acid. Numbers of nucleons, i.e. protons and neutrons, are shown for both blocks and chains. To avoid ambiguity, it is judicious to consider only most common and stable isotopes: ¹H, ¹²C, ¹⁴N, ¹⁶O, ³²S. The bar at the bottom shows the redundancy of amino acids in the code. Cross-cut bonds symbolize the distinction between standard blocks and unique side chains of amino acids. The arrow in proline denotes hereafter the “activation key” (see text).
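The two preceding observations summarized in Fig. 2 can be sketched in Python. The side-chain formulas in `SIDE_CHAIN` are standard textbook chemistry supplied here as an assumption (they are not listed in the text), and all names in the block are illustrative.

```python
import re

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD = {a + b + c: AMINO[16 * i + 4 * j + k]
            for i, a in enumerate(BASES)
            for j, b in enumerate(BASES)
            for k, c in enumerate(BASES)}

# Rumer's transformation T<->G, C<->A applied to the first two bases.
RUMER = str.maketrans("TGCA", "GTAC")


def is_whole(prefix):
    """A family is whole when its four codons share one meaning."""
    return len({STANDARD[prefix + b] for b in BASES}) == 1


prefixes = [a + b for a in BASES for b in BASES]
whole = [p for p in prefixes if is_whole(p)]
split = [p for p in prefixes if not is_whole(p)]
images = sorted(p.translate(RUMER) for p in whole)
print(images == sorted(split))  # whole families map one-to-one onto split ones

# Nucleon numbers from the most common isotopes named in the Fig. 2 caption.
ISOTOPE = {"H": 1, "C": 12, "N": 14, "O": 16, "S": 32}


def nucleons(formula):
    """Sum nucleons over a formula such as 'C4H10N3'."""
    return sum(ISOTOPE[el] * int(n or 1)
               for el, n in re.findall(r"([A-Z])(\d*)", formula))


STANDARD_BLOCK = "C2H4NO2"   # assumed formula of the common block (74 nucleons)
PROLINE_BLOCK = "C2H3NO2"    # proline's block has one hydrogen less (73)
SIDE_CHAIN = {               # assumed side-chain formulas of free amino acids
    "Gly": "H", "Ala": "CH3", "Ser": "CH3O", "Pro": "C3H6", "Val": "C3H7",
    "Thr": "C2H5O", "Cys": "CH3S", "Leu": "C4H9", "Ile": "C4H9",
    "Asn": "C2H4NO", "Asp": "C2H3O2", "Gln": "C3H6NO", "Lys": "C4H10N",
    "Glu": "C3H5O2", "Met": "C3H7S", "His": "C4H5N2", "Phe": "C7H7",
    "Arg": "C4H10N3", "Tyr": "C7H7O", "Trp": "C9H8N",
}

chain = {aa: nucleons(f) for aa, f in SIDE_CHAIN.items()}
block = {aa: nucleons(PROLINE_BLOCK if aa == "Pro" else STANDARD_BLOCK)
         for aa in SIDE_CHAIN}

# Activation key: move one nucleon from proline's side chain to its block.
chain["Pro"] -= 1
block["Pro"] += 1
print(sorted(chain.items(), key=lambda item: item[1]))
```

With the key applied every block holds 74 nucleons and proline's chain holds 41, matching the "73 + 1" and "42 – 1" of the text.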
Nature is indifferent to numerical languages contrived by intelligence to represent quantities, including zero. A privileged numeral system is therefore a reliable sign of artificiality. Intentionally embedded in an object, a privileged system might then demonstrate itself through distinctive notation to any recipient dealing with enumerable elements of that object. For example, digital symmetries of numbers divisible by prime 037 exist only in the positional decimal system with zero conception (Fig. 3). Thus, distinctive decimals 111, 222 and 333 look ordinarily 157, 336 and 515 in the octal system. This notational feature was marked by Pacioli (1508) soon after the decimal system came to Europe. Analogous three-digit feature exists in some other systems, including the quaternary one (see Appendix C).

Fig. 3. Digital symmetry of decimals divisible by 037. Leading zero emphasizes its equal participation in the symmetry. All three-digit decimals with identical digits 111, …, 999 are divisible by 037. The sum of three identical digits gives the quotient of the number divided by 037. Analogous sum for numbers with unique digits gives the central quotient in the column. Digits in these numbers are interconnected with cyclic permutations that are mirror symmetrical in neighbor columns. Addition instead of division provides an efficient way to perform checksums (see Appendix C). The scheme extends to decimals with more than three digits, if they are represented as a + 999×n, where n is the quotient of the number divided by 999 and a is the remainder, to which the same symmetry then applies (for three-digit decimals n = 0). Numbers divisible by 037 and larger than 999 will be shown in this way.

Results

The overall structure of the signal is shown in Fig. 4, which might be used as guidance in further description. The signal is composed of arithmetical and ideographical patterns, where arithmetical units are represented by amino acid nucleons, whereas codon bases serve as ideographical entities. The patterns of the signal are displayed in distinct logical arrangements of the code, thereby increasing both the informational content of the signal and its statistical significance. Remarkably, all of the patterns bear the same general style reflected in Fig. 4 with identical symbols in each signal component (represented by boxes). Namely, distinct logical arrangements of the code and activation key produce exact equalities of nucleon sums, which furthermore display decimalism and are accompanied by Rumer’s and/or half-transformations. One of these arrangements furthermore leads to ideography and semantical symmetries. All elements of the code – 64 codons, 20 amino acids, Start and Stop syntactic signs – are involved in each arrangement.

Fig. 4. The structure of the signal. All details are discussed sequentially in the text. The image of scales represents precise nucleon equalities. DEC stands for distinctive decimal notation of nucleon sums. The dotted box denotes the cytoplasmic balance (see Appendix D), the only pattern maintained by actual proline and cellular milieu. All other patterns are enabled by the “activation key” and are valid for free amino acids. K stands for {T, G}, M stands for {A, C}. Though all three types of transformations act in the patterns, only Rumer’s transformation is indicated for simplicity.

Unlike radio signals which unfold in time and thus have sequential structure, the signal in the genetic code has no entry point, similar to the pictorial message of Pioneer plaques (Sagan et al., 1972). However, instead of providing pictograms the signal in the genetic code provides patterns that do not depend on visual symbols chosen to represent them (be it symbols for nucleotide bases or for the notation of “ostensive numerals”). These patterns make up the organic whole, so there is no unique order in presenting them. We will begin with arithmetical component and then move on to ideography.

The arithmetical component

Full-size standard code. One logically plain arrangement of the code was proposed by George Gamow in his attempt to guess the coding assignments theoretically before the code was cracked in vitro (see Hayes, 1998). One of his models, though it did not predict the actual mapping correctly, coincided remarkably with one of the signal components. Gamow arranged codons according to their composition, since 20 combinations of four bases taken three at a time could account for 20 amino acids (Gamow & Yčas, 1955). Bringing nucleon numbers, activation key and few “freezing” conditions into this arrangement reveals total nucleon balancing ornate with decimal syntax.

Codons with identical and unique bases comprise two smaller sets (Fig. 5a). Halved, both sets show the balance of side chains with 703 = 037×019 nucleons in each half as well as the balance of whole molecules with 1665 = 666 + 999×1 nucleons. Importantly, the halving is not arbitrary. Codons are opposed by Rumer’s transformation along with the half-transformation T↔C, A↔G in the first set and T↔A, C↔G in the second set.
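The decimal notation used so far, including the a + 999×n form of sums such as 1665, can be reproduced with a short Python sketch. The helper names `split_999` and `rotations` are introduced here for illustration only; the block merely restates the arithmetic of Fig. 3 and Gamow's count of 20 base compositions.

```python
from itertools import combinations_with_replacement


def split_999(x):
    """Return (a, n) with x == a + 999 * n and 0 <= a < 999."""
    n, a = divmod(x, 999)
    return a, n


def rotations(a):
    """Cyclic permutations of the three decimal digits of a (leading zeros kept)."""
    s = f"{a:03d}"
    return {int(s[i:] + s[:i]) for i in range(3)}


multiples = range(0, 1000, 37)
# 999 = 27 * 037, so rotating the digits of a multiple of 037 gives another multiple.
closed = all(r % 37 == 0 for m in multiples for r in rotations(m))
# For 111, ..., 999 the digit sum equals the quotient by 037.
digit_sums = all(sum(map(int, str(111 * k))) == 111 * k // 37 for k in range(1, 10))
# The same quantities lose their distinctive look in octal notation.
octal = [format(x, "o") for x in (111, 222, 333)]
# Unordered compositions of three bases drawn from four, as in Gamow's model.
gamow_classes = len(list(combinations_with_replacement("TCAG", 3)))

print(closed, digit_sums, octal, gamow_classes)
print(split_999(703), split_999(1665))
```

Because 999 is itself a multiple of 037, the remainder a returned by `split_999` is divisible by 037 whenever the full number is, which is why the three-digit symmetry carries over to larger sums.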
The Spin→Antispin transformation does not affect the first set but finally freezes elements of the second one. There is only one degree of freedom left since there are no reversible transformations that might connect both sets, so one of them is free to swap around the axis. The balance appears in one of the two alternative states.

The third set includes codons with two identical bases. When halved according to whether they are purines or pyrimidines, regardless of the unique base type, this set shows the balance 999 = 999 of side chains (Fig. 5b). Besides, such halving keeps Rumer’s and one of the half-transformations again in place. In its turn, the right half of the set is threefold balanced. Codons with adenine side by side, guanine side by side and palindromic codons make up three equal parts with 333 nucleons each.

In Fig. 5c the same set is halved according to whether unique bases are purines or pyrimidines, this time regardless of the identical bases type. Though not balanced, these halves again show distinctive decimal syntax with 888 and 1110 = 111 + 999×1 nucleons. Decimalism of one of these sums is algebraically dependent, as from the previous case (Fig. 5b) the sum of the whole set is known to be divisible by 037; if a part of this set is decimally distinctive, the other one will be such automatically. Notably, an independent pattern nonetheless stands out here. Namely, a part of the previous threefold balance has an equivalent in one half here, where the same amino acids are represented by synonymous codons (Fig. 5b and c). Whole molecules of this equivalent – 333 side chain and 444 standard block nucleons – are balanced with 777 chain nucleons in the rest of the subset.

Fig. 5. Gamow’s sorting of codons according to their nucleotide base composition. Base combinations (shown on triangular frames) produce three sets: 4 codons with three identical bases, 24 codons with unique bases and 36 codons with two identical bases. (a) The first and the second sets halved by vertical axis with Rumer’s and half-transformations along with Spin→Antispin transformation denoted with circular arrows. Applied to triangular frames, these arrows define the sequence of bases in codons. Note that while any block sum (with the activation key applied) is divisible by 037 as each block has 74 = 2×037 nucleons, chain sums are not restricted in this way. (b) The third set halved according to whether identical bases are purines or pyrimidines. (c) The third set halved with horizontal axis according to whether unique bases are purines or pyrimidines.

Note that all those distinctive notations of nucleon sums appear only in positional decimal system. Positional notation is so customary in our culture that most of its users hardly remember a fairly complex rule behind it that encodes numbers as $a_{n-1} q^{n-1} + \dots + a_1 q^1 + a_0 q^0$, where $q = 10$ in case of the decimal system, $n$ is the quantity of digits in notation, and $a_i$ are the digits 0–9 that are left in the final notation.

Decomposed standard code. Another arrangement of the code is brought about by decomposition of its 64 full-size codons. This yields 192 separate bases and reveals a pattern of the same type as in full-size format. Identical bases make up four sets of 48 bases in each. Each base retains the amino acid or Stop of its original codon (Fig. 6a). Thus, the four sets get their individual chain and block nucleon sums.

In total, there are 222 + 999×10 side chain nucleons in the decomposed code – obviously, thrice as much as the total sum in the previous full-size case (with the activation key still applied). Only one combination of the four sets displays distinctive decimalism of side chain nucleon sums. These are 666 + 999×2 nucleons in the T-set and 555 + 999×7 nucleons in the joint CGA-set (Fig. 6b). Meanwhile, there are exactly 222 + 999×10 block nucleons in the CGA-set (note that the sets have unequal block sums due to different accumulation of Stops). Thus, while chain nucleons are outnumbered by block nucleons over the whole code, they are neatly balanced with their CGA-part.

Fig. 6. The decomposed standard code. (a) Decomposition shown for one family of codons. Three T-bases contribute three Cys molecules into T-set; one A-base contributes one Stop to A-set and so on for the entire code. (b) Identical bases are sorted into four sets regardless of their position in codons. The sets are shown twice for convenience.

Contracted code and the systematization rule. In a sense, contraction of codon series (see Fig. 1b) is an operation logically opposite to decomposition. Besides displaying new arithmetical patterns, contracted code also reveals ideographical component of the signal. The systematization rule leading to the ideography combines findings of Rumer (1966) and of Hasegawa & Miyata (1980) and is symmetric in its nature (shCherbak, 1993). Contracted series are sorted into four sets according to their redundancy; within those sets they are aligned side-by-side in order of monotonously changing (e.g., increasing) nucleon number. The sets themselves are then arranged in antisymmetrical fashion (e.g., in order of decreasing redundancy number). Stop-series is placed at the beginning of its set representing zero in its special position. Finally, Rumer’s bisection opposes the IV-set to III, II, I sets. The resulting arrangement is shown in Fig. 7 for the euplotid code, with ideography of codon bases (see next section) in Fig. 7a and arithmetical patterns of amino acids (shared by both code versions) in Fig. 7b.

Fig. 7. The contracted euplotid code with the systematization rule applied (compare with Fig. 2). (a) The resulting arrangement of contracted codon series forming the ideogram. Side-by-side alignment of vertical series produces three horizontal strings of peer-positioned bases. Gln and Lys have the same nucleon number; ambiguity in their positioning is eliminated by the symmetries considered further. (b) The arithmetical background of the ideogram (valid for the standard version as well, as it contributes another zero to the III, II, I set). For the side chain levels see Discussion.
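The systematization rule can be sketched as a plain sort. The example below is an illustration under stated assumptions: the one-letter euplotid table, the `CHAIN` dictionary of side-chain nucleon numbers (activation key applied, Stop set to zero) and the function names are introduced here and are not the authors' own notation.

```python
from collections import Counter

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD = {a + b + c: AMINO[16 * i + 4 * j + k]
            for i, a in enumerate(BASES)
            for j, b in enumerate(BASES)
            for k, c in enumerate(BASES)}
EUPLOTID = dict(STANDARD, TGA="C")  # TGA reassigned from Stop to Cys

# Side-chain nucleon numbers with the activation key applied (Pro = 41); Stop = 0.
CHAIN = dict(G=1, A=15, S=31, P=41, V=43, T=45, C=47, L=57, I=57, N=58, D=59,
             Q=72, K=72, E=73, M=75, H=81, F=91, R=100, Y=107, W=130)
CHAIN["*"] = 0


def contracted_series(code):
    """One (meaning, redundancy) pair per contracted series of each family."""
    series = []
    for a in BASES:
        for b in BASES:
            counts = Counter(code[a + b + c] for c in BASES)
            series.extend(counts.items())
    return series


def systematize(code):
    """Sets by decreasing redundancy; inside a set, increasing nucleon number.

    Stop has nucleon number 0, so it leads its set automatically. Series with
    equal nucleon numbers (Gln and Lys) keep an arbitrary order here.
    """
    return sorted(contracted_series(code), key=lambda s: (-s[1], CHAIN[s[0]]))


order = systematize(EUPLOTID)
for redundancy in (4, 3, 2, 1):
    print(redundancy, "".join(aa for aa, n in order if n == redundancy))
```

The sort only reproduces the ordering of series; the placement of codon bases in the ideogram and the resolution of the Gln/Lys tie are outside this sketch.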
A new balance is found in the joint III, II, I set. Side chain nucleons of all its amino acids are equalized with their standard blocks: 111 + 999×1 = 111 + 999×1 (Fig. 7b). This pattern manifests as the anticorrelation mentioned by Hasegawa & Miyata (1980). Chain nucleon sum of all series in the code is less than the sum of all blocks. Only a subset of series coding mainly bigger amino acids may equalize its own blocks. Exactly this happens in the joint III, II, I set. As a consequence, smaller amino acids are left in the set of redundancy IV.
333 + 592 = 925 nucleons of whole molecules in the IV-set. Fig. 8. Additional arithmetical patterns of the contracted code
(shared by both code versions). (a) The code is divided according
With 037 cancelled out, this leads to 32 + 42 = 52 – numerical
to whether first bases are purines or pyrimidines. This gives two
representation of the Egyptian triangle, possibly as a symbol of sets with equal numbers of series. The halve with pyrimidines in
two-dimensional space. Incidentally, codon series in the ideo- first positions reveals a new balance of chains and blocks anal o-
gram (Fig. 7a) are arranged in the plane rather than linearly in a gous to that in Fig. 7b. Another halve is algebraically dependent
genomic fashion. except the decimal sum of its β, δ, ζ levels, see Discussion.
Rumer’s bisection is based on redundancy and thus makes use (b) The code is divided according to whether first bases are K or
of third positions in codon series. Divisions of the contracted code M (left) or whether central bases are K or M (center). Both divi-
based on first and center positions also reveal similar patterns sions produce halves with identical chain nucleon sums. As alg e-
(Fig. 8). Another arithmetical phenomenon presumably related to braic consequence of these divisions, series with K in first and
central positions and series with M in first and central positions
the signal – the cytoplasmic balance – is described in Appendix D.
are chain-balanced (right). Each of the three divisions is accomp a-
Thus, the standard code reveals same-style and yet algebrai- nied by half-transformations and, remarkably, also produces equal
cally independent patterns simultaneously in decomposed, full- numbers of series in each half. This pattern is the only one that
size, and contracted representations (see Fig. 4). It is a highly shows no divisibility by 037. However, all three numbers – 654,
nontrivial algebraic task to find the solution that maps amino 789 and 369 – are again specific in decimal notation where digits
acids and syntactic signs to codons in a similar fashion. Normal- in each of them appear as arithmetic progressions.
ly this would require considerable computational power.
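The IV-set arithmetic quoted in the text (333 chain + 592 block = 925, with 037 cancelled out) and the Fig. 8 remark about 654, 789 and 369 are easy to verify. The sketch below assumes side-chain nucleon numbers computed from standard formulas, with proline taken as 41 = 42 − 1 (one nucleon moved to its block by the activation key); the helper name is hypothetical.

```python
# Assumed side-chain nucleon numbers for the eight series of redundancy IV.
# Pro is 41 here, i.e. with the activation key applied (42 - 1).
IV_CHAINS = {"Gly": 1, "Ala": 15, "Ser": 31, "Pro": 41,
             "Val": 43, "Thr": 45, "Leu": 57, "Arg": 100}
BLOCK = 74  # nucleons in the standard block (74 = 2 x 037)

chains = sum(IV_CHAINS.values())   # 333
blocks = BLOCK * len(IV_CHAINS)    # 592
whole = chains + blocks            # 925

# Cancel the common factor 037: 9 + 16 = 25, i.e. 3^2 + 4^2 = 5^2.
reduced = [n // 37 for n in (chains, blocks, whole)]
print(reduced)  # [9, 16, 25]

def digits_in_arithmetic_progression(n):
    """True if the three decimal digits of n form an arithmetic progression."""
    a, b, c = (int(ch) for ch in f"{n:03d}")
    return b - a == c - b

for n in (654, 789, 369):
    # Each prints True for the progression and a non-zero remainder mod 37.
    print(n, digits_in_arithmetic_progression(n), n % 37)
```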
frame displays the semantical mirror symmetry of antonyms
with homogeneous AAA-codon in the center.
The codons of this reading frame are purely abstract symbols,
given that they are read across contracted series. However,
they are regularly crossed with the same codons in the ideogram,
thereby reinforcing the semantical symmetry and making the
current frame unique (Fig. 10c). Besides, the direction of
reading now becomes distinguished, since such a “crossword”
disappears if read in the opposite direction, though the
palindrome itself remains the same.
Remarkably, the triplet string in Fig. 10c is written with the
code symbols within the code itself. This implies that the signal-
harboring mapping had to be projected preliminarily (see Dis-
cussion). Besides, translation of this string with the code itself
reveals the balance 222 = 222 of chains and blocks (Fig. 10d).
Additional palindrome in the frame shifted by one position
(Fig. 10e) reproduces the chain sum of 222, confirming that the
ideogram is properly “tuned in” to the euplotid version: TGA
stands for Cys here, not for Stop of the standard code.
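The 222 = 222 balance can be reproduced by translating the triplet string with the euplotid assignments. The sketch below takes the string as quoted in the Discussion (TAG + TAA + AAA + ATG + ATG, Fig. 10c). The side-chain nucleon numbers for Lys and Met are assumed values computed from standard formulas, Stop is counted as zero on both sides, and the function names are hypothetical. The second check uses the quaternary numbering proposed in the Discussion (T, C, G, A as 0, 1, 2, 3).

```python
BLOCK = 74  # nucleons in the standard block of an amino acid

# Only the euplotid assignments needed for this string.
EUPLOTID = {"TAG": "Stop", "TAA": "Stop", "AAA": "Lys", "ATG": "Met"}
CHAIN = {"Lys": 72, "Met": 75}  # assumed side-chain nucleon numbers

STRING = ["TAG", "TAA", "AAA", "ATG", "ATG"]

def balance(codons):
    """Return (chain sum, block sum) of the translated string."""
    chains = blocks = 0
    for codon in codons:
        meaning = EUPLOTID[codon]
        if meaning != "Stop":  # Stop acts as zero on both sides
            chains += CHAIN[meaning]
            blocks += BLOCK
    return chains, blocks

DIGIT = {"T": "0", "C": "1", "G": "2", "A": "3"}

def quaternary(codon):
    """Read a codon as a three-digit base-4 number."""
    return int("".join(DIGIT[base] for base in codon), 4)

print(balance(STRING))                      # (222, 222)
print(sum(quaternary(c) for c in STRING))   # 192, i.e. 3000 in base 4
```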
Discussion
Artificiality. To be considered unambiguously as an intelligent
signal, any patterns in the code must satisfy the following two
criteria: (1) they must be highly significant statistically and (2) not
only must they possess intelligent-like features (Elliott, 2010), but
they should be inconsistent in principle with any natural process,
be it Darwinian (Freeland, 2002) or Lamarckian (Vetsigian
et al., 2006) evolution, driven by amino acid biosynthesis
(Wong, 2005), genomic changes (Sella & Ardell, 2006),
affinities between (anti)codons and amino acids (Yarus et al.,
2009), selection for the increased diversity of proteins (Higgs,

Fig. 9. Patterns of the short (a) and the long (b) upper strings. The strings are arranged with the same set of symmetries: mirror symmetry
(denoted with the central vertical axis), translation symmetry (denoted with italicized letters and skewed frames) and purine ↔ pyrimidine
inversion (denoted with color gradient, where black and white stand for pyrimidines and purines, respectively). The image of DNA at the top
illustrates possible interpretation of the short string (see Discussion).
The ideographical component
Upper strings. We refer to the product of systematization in
Fig. 7a as the ideogram. The ideogram of the genetic code is
based on symmetries of its strings (shCherbak, 1988). The
strings are read across contracted series.
The upper short string demonstrates mirror, translation and
inversion symmetries (Fig. 9a). Its bases are invariant under
combined operation of the mirror symmetry and inversion of the
type base ↔ complementary base. A minimum pattern of the
translation symmetry is represented by RRYY quadruplet.
The same three symmetries arrange the long upper string
(Fig. 9b). The pair of flanking TATAT sequences is mirror
symmetrical. The pair of central AGC codons forms a minimum
pattern of the translation symmetry. First and third bases in the
set of redundancy II are interconnected in an axisymmetric manner
with purine ↔ pyrimidine inversion and its opposite operation –
the unit transformation producing no exchange.
Center strings. Placed coaxially, the short and the long center
strings appear interconnected with purine ↔ pyrimidine inversion
(Fig. 10a). Both strings exhibit purine-pyrimidine mirror
symmetry. The long string keeps the mirror symmetry even for
ordinary bases.
Codons of the short string CCC and TCT break the mirror
symmetry of ordinary bases, but they share a palindromic fea-
ture, i.e. direction of reading invariance. This feature restores the
mirror symmetry, this time of the semantical type (Fig. 10b). As
in the previous case, two center strings are expected to share the
same set of symmetries. Therefore, the semantical symmetry of
palindromic codons flanked by G-bases may indicate a similar
feature in the long string. Indeed, semantical symmetry is found
there in the triplet reading frame starting after flanking G-base
(Fig. 10c). This reading frame is remarkable with the regular
arrangement of all syntactic signs of the euplotid code – both
Stop-codons and the Start-codon repeated twice. The reading

Fig. 10. Patterns of the short (a, b) and the long (a, c, d, e) center strings. Both strings are arranged with purine-pyrimidine mirror
symmetry, purine ↔ pyrimidine inversion and semantical symmetry. The first two are denoted in the same way as in Fig. 9,  denotes palindrome.
2009), energetics of codon-anticodon interactions (Klump,
2006; Travers, 2006), or various pre-translational mechanisms
(Wolf & Koonin, 2007; Rodin et al., 2011).
The statistical test for the first criterion is outlined in Appendix B,
showing that the described patterns are highly significant.
The second criterion might seem unverifiable, as the patterns
may result from a natural process currently unknown. But this
criterion is equivalent to asking if it is possible at all to embed
informational patterns into the code so that they could be
unequivocally interpreted as an intelligent signature. The answer
seems to be yes, and one way to do so is to make patterns virtual,
not actual. Exactly that is observed in the genetic code. Strict
balances and their decimal syntax appear only with the application
of the “activation key”. Physically, there are no strict balances
in the code (e.g., in Fig. 5b one would have 1002 ≠ 999
instead of 999 = 999). Artificial transfer of a nucleon in proline
turns the arithmetical patterns on and thereby makes them virtual.
This is also the reason why we interpret distinctive notation
as an indication of decimalism, rather than as a physical
requirement (yet unknown) for nucleon sums to be multiples of
037: in general, physically there is no such multiplicity in the
code. In its turn, a notationally preferred numeral system is by
itself a strong sign of artificiality. It is also worth noting that all
three-digit decimals – 111, 222, 333, 444, 555, 666, 777, 888,
999 (as well as zero, see below) – are represented at least once in
the signal, which also looks like an intentional feature.
However, it might be hypothesized that amino acid mass is
driven by selection (or any other natural process) to be distributed
in the code in a particular way leading to approximate mass
equalities and thus making strict nucleon balances just a likely
epiphenomenon. But it is hardly imaginable how a natural process
can drive mass distribution in abstract representations of the
code where codons are decomposed into bases or contracted by
redundancy. Besides, nucleon equalities hold true for free amino
acids, and yet in these free molecules side chains and standard
blocks had to be treated by that process separately. Furthermore,
no natural process can drive mass distribution to produce the
balance in Fig. 10d: amino acids and syntactic signs that make
up this balance are entirely abstract since they are produced by
translation of a string read across codons.
Another way to make patterns irreducible to natural events is
to involve semantics, since no natural process is capable of
interpreting abstract symbols. It should be noted that notions of
symbols and meanings are used sometimes in a natural sense
(Eigen & Winkler, 1983), especially in the context of biosemiotics
(Barbieri, 2008) and molecular codes (Tlusty, 2010). The
genetic code itself is regarded there as a “natural convention”
that relates symbols (codons) to their meanings (amino acids).
However, these approaches make distinction between organic
semantics of molecular codes and interpretive or linguistic
semantics peculiar to intelligence (Barbieri, 2008). Exactly the
latter type of semantics is revealed in the signal of the genetic
code. It is displayed there not only in the symmetry of antonymous
syntactic signs (Fig. 10c), but also in the symbol of zero.
For genetic molecular machinery there is no zero, there are
nucleotide triplets recognized sterically by release factors at the
ribosome. Zero – the supreme abstraction of arithmetic – is the
interpretive meaning assigned to Stop-codons, and its correctness
is confirmed by the fact that, being placed in its proper front
position, zero maintains all ideogram symmetries. Thus, a trivial
summand in balances, zero, however, appears as an ordinal
number in the ideogram. In other words, besides being an integral
part of the decimal system, zero acts also as an individual
symbol in the code.
In total, not only the signal itself reveals intelligent-like features
– strict nucleon equalities, their distinctive decimal notation,
logical transformations accompanying the equalities, the
symbol of zero and semantical symmetries, but the very method
of its extraction involves abstract operations – consideration of
idealized (free and unmodified) molecules, distinction between
their blocks and chains, the activation key, contraction and
decomposition of codons. We find that taken together all these
aspects point at artificial nature of the patterns.
Though the decimal system in the signal might seem a
serendipitous coincidence, there are a few possible explanations,
from 10-digit anatomy as an evolutionary near-optimum for
bilateral beings (Dennett, 1996) to the fact that there are
conveniently 74 = 2×037 nucleons in the standard blocks of
α-amino acids. Besides, the decimal system shares the triplet
digital symmetry with the quaternary one (see Appendix C),
establishing a link to the “native” language of DNA. After all, some
of the messages sent from the Earth included the decimal system
as well (Sagan et al., 1978; Dumas & Dutil, 2004), though
they were not supposed to be received necessarily by 10-digit
extraterrestrials. Whatever the actual reason behind the decimal
system in the code, it appears that it was invented outside the
Solar System already several billion years ago.
Two versions of the code. The nearly symmetric code version
with arithmetical patterns acts as the universal standard code.
With this code at hand it is intuitively easy to infer the symmetric
version with its ideography. Vice versa, if the symmetric
version were the universal one, it would be hardly possible to
infer the nearly symmetric code with all its arithmetical patterns.
Therefore, with the standard version alone it is possible to
“receive” both arithmetical and ideographical components of the
signal, even if the symmetric version was not found in nature.
There are two possible reasons why it is actually found in
euplotid ciliates: either originally when Earth was seeded there
were both versions of the code with one of them remaining
currently in euplotid ciliates, or originally there was only the
standard version, and later casual modification in euplotid lineage
coincided with the symmetric version.
As for other known rare versions of the code, they
seem neither to have profound pattern ensembles, nor to be easily
inferable from the standard code. As commonly accepted, they
represent later casual deviations of the standard code caused by
ambiguous intermediates or codon captures (Moura et al., 2010).
Embedding the signal. To obtain a code with a signature one
might search through all variant mappings and select the “most
interesting” one. However, this method is impractical (at least
with the present-day terrestrial computing facilities), given the
astronomically huge number of variant codes. In a more realistic
alternative, the pattern ensemble of the signal is projected
preliminarily as a system of algebraic expressions which is then
solved relatively easily to deduce the mapping of the code. Thus,
all described patterns might be represented post factum as a
system of Diophantine expressions (i.e. equations and inequalities
allowing only integer solutions), and analysis of this system
shows that it uniquely determines the mapping between codon
series and nucleon numbers, including zeros for Stop-codons
(see Appendix E). Though some amino acids have equal nucleon
numbers, as the case for Leu and Ile, or Lys and Gln, even they
are not interchangeable, as suggested by distinctive notation of
nucleon sums in ,  and other positional levels of side chains in
the contracted code (Figs. 7b and 8a). The activation key applies
here as well (note that - and -carbons in proline are positionally
equivalent). The standard chemical nomenclature of carbon
atoms is extended here to denote positions of other nodal atoms.
Decimalism in different combinations of levels circumvents
algebraic dependence and employs chemical structure of amino
acids more efficiently.
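To give a feel for why an exhaustive search through variant mappings is called impractical, the sketch below counts a crude upper bound on the number of codes. It rests on an assumption of this sketch, not on a figure from the text: every one of the 64 codons is assigned independently to one of 21 meanings (20 amino acids plus Stop), without requiring that each meaning be used. The way variant codes are actually counted for the signal may differ.

```python
CODONS = 64     # number of triplets
MEANINGS = 21   # 20 amino acids + Stop

def upper_bound_on_mappings(codons=CODONS, meanings=MEANINGS):
    """Number of all functions from codons to meanings (a crude upper bound)."""
    return meanings ** codons

count = upper_bound_on_mappings()
print(len(str(count)))  # 85 decimal digits, i.e. more than 10**84 mappings
```

Even at a hypothetical rate of billions of codes checked per second, a number of this size is far out of reach, which is consistent with the alternative described above: fixing the patterns first and solving for the mapping.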
These patterns within side chains go even deeper into chemical
structure. Some of the canonical amino acids – His, Arg and
Trp – might exist in alternative neutral tautomeric forms differing
in the position of one hydrogen atom in their side chains
(Taniguchi & Hino, 1981; Rak et al., 2001; Li & Hong, 2011).
Though some of these tautomers occur very rarely at cytoplasmic
pH (as the case for indolenine tautomer of Trp shown in Fig.
7b), all neutral tautomers are legitimate if idealized free molecules
are considered, and taking only one of them would introduce
arbitrariness. Notably, however, while one Trp tautomer
maintains the patterns in Fig. 7b, another one does the job in
Fig. 8a, whereas any neutral tautomer of His and Arg might be
taken in both cases without affecting the patterns at all (which is
easily checked; to this end, both Arg tautomers are shown in Fig.
8a and both His tautomers are shown in Figs. 7b and 8a).
Importantly, preliminary projecting of a signal admits imposition
of functional requirements as extra formal conditions. The
terrestrial code is known to be conservative with respect to polar
requirement (Freeland & Hurst, 1998), but not to molecular size
(Haig & Hurst, 1991). The signal in the code does not involve
polar requirement as such, so it might be used in a parallel formal
condition to reduce effect of misreadings. However, the
signal does involve nucleon numbers which correlate with molecular
volume. That interferes with an attempt to make the code
conservative with respect to size of amino acids as well.
Possible interpretation. Besides having the function of an intelligent
signature as such, the signal in the genetic code might also
admit sensible interpretations of its content. Without claim to be
correct, here we propose our own version. It is now tempting to
think that the main body of the message might reside in genomes
(Marx, 1979; see also Hoch & Losick, 1997). Though the idea of
genomic SETI (Davies, 2010) might seem naïve in view of random
mutations, things are not so obvious. For example, a locus
with a message might be exposed to purifying selection through
coupling to essential genes, and there is even possible evidence
for that (ibid.). Whatever the case, the ideogram does seem to
provide a reference to genomes. Thus, complementary
mirror-symmetrical bases of the short upper string (Fig. 9a) resemble
Watson-Crick pairs; the four central bases TC|GA and the central
axis therefore possibly represent the symbol of the genomic
DNA itself. Flanking TATAT bases (Fig. 9b) might symbolize
consensus sequence found in promoters of most genes. Coding
sequences of genes are located between Start- and Stop-codons.
Vice versa, nontranslated regions are found between Stop- and
Start-codons of neighbor genes. Therefore the triplet string in
Fig. 10c might symbolize intergenic regions, and may be interpreted
as the address of the genomic message.
The privileged numeral system in the code might also be interpreted
as an indication of a similar feature in genomes. It is
often said that genomes store hereditary information in quaternary
digital format. There are 24 possible numberings of DNA
nucleotides with digits 0, 1, 2, 3. The ideogram seems to suggest
the proper one: T ≡ 0, C ≡ 1, G ≡ 2, A ≡ 3. In this case the
TCGA quadruplet (Fig. 9a), read in the distinguished direction,
represents the natural sequence preceded by zero. Palindromic
codons CCC and TCT (Fig. 10b) become a symbol of the quaternary
digital symmetry 111₄ and the radix of the corresponding
system 010₄ = 4, respectively. Translationally related AGC, or
321₄, codons (Fig. 9b) possibly indicate positions in quaternary
place-value notation, with higher orders coming first. The sum
of digital triplets in the string TAG + TAA + AAA + ATG +
ATG (Fig. 10c) equals the number of nucleotides in the code
3000₄ = 192. Besides, T as zero is opposed to the other three
“digits” in the decomposed code (Fig. 6). Finally, each complementary
base pair in DNA sums to 3, so the double helix looks
numerically as 333…₄, and the central AAA codon in Fig. 10c
becomes the symbol of duplex DNA located between genes.
Whether this particular numbering has any relation to the genomic
message, if any, is a matter of further research.
It is worth mentioning that all genomes, despite their huge
size and diversity, do possess a feature as universal as the genetic
code itself. It is known as the second Chargaff’s rule. In almost
all genomes – from viral to human – the quantities of complementary
nucleotides, dinucleotides and higher oligonucleotides
up to the length of ~9 are balanced to a good precision
within a single DNA strand (Okamura et al., 2007). Unlike the
first Chargaff’s rule which quickly found its physicochemical
basis, the second rule with its total orderliness still has no obvious
explanation.
___________________________________________

Appendix A. Molecular implementation of the genetic code
Here we outline molecular workings behind the genetic code
which explain why it stays unchanged for billions of years and,
at the same time, might be readily modified artificially, e.g., for
embedding a signal. For simplicity, we skip the details such as U
instead of T in RNA, ATP energetics, wobble pairing, etc., that
do not affect understanding of the main point (for details see,
e.g., Alberts et al., 2008).
The first type of molecules behind the genetic code is transfer
RNAs (tRNAs). They deliver amino acids into ribosomes,
where protein synthesis takes place. tRNAs are transcribed as
final products from tRNA genes in genomes by RNA polymerase
(Fig. A1a; for definiteness, the mechanism is shown for amino
acid Ser and its TCC codon). With the length varying around
80 nucleotides, tRNA transcripts fold in a specific spatial
configuration due to base-pairing between different sections of the
same RNA strand, similar to the pairing between two strands of
a DNA helix (Fig. A1b). At its opposite sides the folded tRNA
molecule has an unpaired anticodon and the acceptor end to
which amino acid is to be bound. tRNAs with differing anticodons
specifying the same amino acid (remember the code is redundant)
are identical in their overall configuration. tRNAs specifying
distinct amino acids differ from each other in anticodons as
well as other spots, so they have slightly different overall
configurations. However, acceptor ends are identical in all tRNAs, so
for tRNA itself it makes no difference which amino acid is bound
to it, no matter which anticodon it has at the opposite side. The
process of binding amino acids to tRNAs is performed by protein
enzymes called aminoacyl-tRNA synthetases (aaRSs, Fig. A1b,
bottom). Normally, there are 20 types of aaRSs, one for each
amino acid, and they themselves are translated from appropriate
genes in genome. Each of these enzymes recognizes with great
specificity both its cognate amino acid and all tRNAs specifying
that amino acid; tRNAs are recognized primarily by their overall
configuration, not exclusively by their anticodons (Fig. A1c).
After binding and additional checking, aaRS releases tRNA
charged with amino acid to be delivered to ribosome (Fig. A1d).
In its turn, the ribosome does not care if tRNA carries an amino
acid specified by its anticodon; it only checks if the anticodon of
tRNA matches complementarily the current codon in messenger
RNA (mRNA; Fig. A1e). If so, the amino acid is transferred
from tRNA to the growing peptide chain and tRNA is released to
be recycled. If codon and anticodon do not match, tRNA with its
amino acid is dislodged from the ribosome to be used later until
it matches codon on mRNA (even with this overshoot the bacterial
ribosome manages to add ~20 amino acids per second to a
peptide chain). The described mechanism results in relationships
between mRNA codons and amino acids (Fig. A1f) which, collected
together in any convenient form (one possibility is shown
in Fig. 1a), constitute the genetic code.
Fig. A1. Molecular mechanisms of the genetic code (shown for the case of amino acid serine) and a simple example of its artificial modification. The
contour arrows indicate directionality of DNA and RNA strands as defined by orientation of their subunits (designated in biochemistry as 5′→3′ ori-
entation; replication, transcription and translation occur only in that direction). (a) tRNASer gene (the gene of tRNA that specifies Ser in the standard
code) is transcribed by RNA polymerase from genomic DNA. (b) The folded tRNASer molecule (top), serine molecule (middle) and seryl-tRNA syn-
thetase (SARS, an aaRS cognate for amino acid serine; bottom). (c) SARS recognizes both serine and tRNASer and binds them together. (d) Ser-
tRNASer released from SARS and ready to be delivered to ribosome. (e) The process of peptide synthesis at the ribosome (as an example, the mRNA
with the gene fragment of the SARS itself is shown). (f) The resulting fragment of the genetic code (also shown is Ala group, which will be used in
an example below). (g)-(k). A simple way of genetic code modification. The shaded sequence in (j) corresponds to the region shown in (e).
The key point in terms of changeability of the genetic code is
that there is no direct chemical interaction between mRNA codons
and amino acids at any stage. They interact via molecules
of tRNA and aaRS both of which might be modified so that a
codon is reassigned to another amino acid. As an example, Figures
A1g-k show a simple way of changing the code where two
amino acids – Ser and Ala – interchange two of their codons. It
is known that in most organisms tRNA anticodons are not involved
in recognition by aaRSs cognate for these amino acids
(Giegé et al., 1998; the fact reflected in Fig. A1c with SARS not
touching the anticodon). Therefore, the three nucleotides in
tRNASer gene corresponding to anticodon might be replaced
(Fig. A1g), in particular, to get GGC anticodon corresponding
to GCC codon in mRNA, which normally codes Ala. (To get
anticodon for a codon, or vice versa, one has to apply the
complementarity rule and reverse the resulting triplet, since
complementary DNA/RNA strands have opposite directionalities.)
After that, SARS will still bind Ser to tRNASer, even though it
now has new GGC anticodon (Fig. A1h). If analogous procedure
is performed with tRNAAla genes to produce tRNAAla with
GGA anticodon, the genetic code would be modified: Ser and
Ala would have interchanged some of their codons (actually,
two codons, due to wobble pairing). However, the cell will not
survive such surgery, since all coding genes in genome remain
“written” with the previous code and after translation with the
new code they all produce non- or at best semi-functional proteins,
with Ala occasionally replaced by Ser and vice versa. To
fix the new code in a cell lineage, one also has to change coding
mRNAs appropriately to leave amino acid sequences of coded
proteins unaltered (Fig. A1i). That would be automatically fulfilled
if all coding genes are rewritten all over the genome so
that TCC codons are replaced with GCC and vice versa (Fig.
A1j); such operation is possible when genomes are even rewritten
from scratch (Gibson et al., 2010). Now, amino acid sequences
of proteins stay unaltered and a cell proliferates with
the new genetic code (Fig. A1k).
It must be clear now why the genetic code is highly protected
from casual modifications. If a mutation occurs in tRNA or aaRS
leading to codon reassignment, all genes in genome remain written
with the previous code, and a cell quickly goes off the scene
without progeny. The chances that such mutation in tRNA/aaRS
is accompanied by corresponding mutations in coding genes all
over the genome resulting in unaltered proteins are vanishingly
small, given that there are dozens of such codons in thousands of
genes in a genome. Thus, the machinery of the genetic code
experiences exceptionally strong purifying selection that keeps it
unchanged over billions of years.
It should be noted that in reality the process of intentional
modification of the code is more complicated. For example, details
of tRNA recognition by aaRSs vary depending on tRNA species
and organism, and in some cases anticodon is involved, partially
or entirely, in that process. However, this is avoidable, in principle,
with appropriate methods of molecular engineering. Another
issue is that modifications in the code that leave proteins unaltered
still might affect the level of gene expression (Kudla et al., 2009).
Therefore, additional measures might have to be taken to restore
the expression pattern with the new genetic code. These are
surmountable technical issues; the point is that there are no
restrictions in principle for changing the code artificially in any
desired way. In effect, elaborate methods of modifying the overall
tRNA configuration and/or aaRS recognition sites might allow not
only interchanging two amino acids, but introducing new ones.
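The Ser/Ala example in this appendix can be sketched in a few lines. The code below is a toy illustration only: it uses the DNA alphabet (T instead of U) as the text does, ignores wobble pairing, covers just the two codons involved, and all names (`anticodon`, `rewrite`, the toy `gene`) are hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def anticodon(codon):
    """Apply the complementarity rule and reverse the resulting triplet."""
    return "".join(COMPLEMENT[base] for base in reversed(codon))

print(anticodon("GCC"))  # GGC
print(anticodon("TCC"))  # GGA

# Fragments of the code before and after the Ser/Ala interchange.
OLD_CODE = {"TCC": "Ser", "GCC": "Ala"}
NEW_CODE = {"TCC": "Ala", "GCC": "Ser"}
SWAP = {"TCC": "GCC", "GCC": "TCC"}

def translate(codons, code):
    return [code[c] for c in codons]

def rewrite(codons):
    """Rewrite a gene so that TCC and GCC codons are interchanged."""
    return [SWAP.get(c, c) for c in codons]

gene = ["TCC", "GCC", "GCC", "TCC"]  # toy coding sequence

# Reassignment alone alters the protein ...
print(translate(gene, OLD_CODE) == translate(gene, NEW_CODE))           # False
# ... while reassignment plus genome rewriting leaves it unaltered.
print(translate(gene, OLD_CODE) == translate(rewrite(gene), NEW_CODE))  # True
```

The first comparison corresponds to the cell that does not survive the surgery, the second to the lineage in which the new code is fixed.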
Appendix B. Statistical test
It is appropriate to ask if the presented patterns are merely an
artifact of data fishing. To assess that, one might compare information
volumes of the data set itself (V0) and of the pattern ensemble
within that set (Vp). The artifact of data fishing might
then be defined as the case when Vp << V0. As shown in Appendix E,
the presented ensemble of patterns might be described
with a system of Diophantine equations, where nucleon numbers
of amino acids serve as unknowns. Given the set of canonical
amino acids (the range of possible values for the unknowns), this
system is completely defined: it has a single solution and that
turns out to be the actual mapping of the code (this also implies
that there are no more algebraically independent patterns of the
same sort in the code). Hence, Vp = V0, so the pattern ensemble
employs informational capacity of the code entirely, showing
that it represents a feature inherent to the code itself, rather than
an artifact of data fishing.
One might ask then how likely such pattern ensemble is to
appear in the genetic code by chance. Since this question implies
that the current mapping of the code has been shaped by natural
processes, it is more appropriate to ask how likely such pattern
ensemble is to appear by chance under certain conditions reflecting
presumable evolutionary pathways. We tested both versions
of the null hypothesis (“the patterns are due to chance alone” and
“the patterns are due to chance coupled with presumable evolutionary
pathways”). The results are of the same order of magnitude;
we describe only the version with presumable evolutionary
pathways. Three such pathways reflecting predominant speculations
on the code evolution were imposed on computer-generated
codes in this test:
(1) Redundancy must be on average similar to that of the real
code. This is thought to be due to the specifics of interaction
between the ribosome, mRNA and tRNA (Novozhilov et al.,
2007). Besides, we took into account possible dependence of the
probability for a codon family to stay whole or to be split on the
type of its first two bases. This follows from the difference in
thermostability between codon-anticodon pairs enriched with
strong (G and C) bases and those enriched with weak (A and T)
bases (Lagerkvist, 1978). For that, the probability for a family of
four codons with leading strong doublets to specify a single amino
acid was adopted to be 0.9, for those with weak doublets –
0.1, and for mixed doublets it was 0.5. Each of the 20 amino
acids and Stop is recruited at least once; therefore codes with
less than 21 generated blocks are discarded. After that blocks
were populated randomly with amino acids and Stop.
(2) Reduced effect of mutations/mistranslations due to natural
selection. The cost function for polar requirement was adopted
from Freeland & Hurst (1998), taking into account
transversion-transition and mistranslation biases (see also
Novozhilov et al., 2007). Only those codes were passed further which

The random variable in question is the number of independent
patterns of the same sort in a code. Obviously, the more such
patterns are observed in a code, the less likely such observation
is. Probably, a good approximation here would be a binomial
distribution since, for example, a nucleon balance might be regarded
as a Bernoulli trial: in a given arrangement the balance is
either “on” or “off”, where probability for “on” is much smaller
than for “off”. However, probabilities for balances in distinct
arrangements might differ, especially under conditions imposed.
Situation is even more complex with ideogram symmetries:
symmetry is not just “on” or “off”, it is also characterized by the
length of a string and the number of nucleotide types involved.
Therefore, we do not apply any approximations but use a
brute-force approach to find distributions for appropriately defined
scores for the patterns. Proline was considered with one nucleon
transferred from its side chain to its block (note that since the
activation key is applied universally, the actual code and the
code with the key applied are equivalent statistically).
Nucleon balances. Arithmetical patterns in the standard code are
all of the same style: equality of nucleon sums + their distinctive
decimal notation + at least one of the three transformations (except
the decomposed case). The search for a random code with a
few patterns of this sort turned out to be time-consuming, so the
requirements were greatly simplified. Only nucleon equalities
were considered, without requirement of distinctive notation in
any numeral system. Presence of transformations was required
only in Gamow’s arrangement for codons with identical and
unique bases, since transformations act there in the first place,
not as companions of another sorting logic. Also for simplicity,
only global patterns were considered; “local” features like the
threefold balance in Fig. 5b were not checked.
Alternative codes might have balances in arrangements and
combinations different from those in the real code. Contrary to
what it might seem, there are not so many ways of arranging the
code based on a straightforward logic with minimum arbitrariness.
For example, along with Gamow’s sorting, several other
arrangements were proposed during early attempts to deduce
the code theoretically (see Hayes, 1998). One of them is known
as the “code without commas” (Crick et al., 1957). However,
unlike Gamow’s sorting, this and other proposed arrangements
do not allow “freezing” the code elements completely, leaving
a large degree of arbitrariness. Ultimately, the following
arrangements were considered in the test:
- divisions based on redundancy;
- divisions based on positions in codons (alternating all combinations
such as S or W in the first position, R or Y in the
second position, etc.);
- sortings based on nucleotide composition of codons (alternating
all combinations of “freezing” conditions and division logic);
- arrangements based on decomposition of codons into bases
had cost function value smaller than φ0 + σ, where φ0 is the value (alternating all combinations of the four nucleotide sets).
for the universal code, and σ is the standard deviation for all Besides, the first two types might be arranged with full-size or
random codes filtered through the previous condition. contracted codons. The only possible balance of the peptide rep-
(3) Small departure from the cytoplasmic balance (see Ap- resentation (Appendix D) was also checked. In total, 160 poten-
pendix D). As argued by Downes & Richardson (2002), this tial balances (of both chain-chain and block-chain types) were
balance might reflect evolutionary pathways optimizing the dis- checked in all these arrangements. Precautions were made to
tribution of mass in proteins. With C standing for all side chain ignore arithmetical dependencies, as for certain code versions
nucleons in the code and B for all nucleons in block residues, the some balances are trivially fulfilled if few others occur. A simple
value δ = (C – B)/(C + B) is distributed approximately normally scoring scheme was adopted: the score of a code is the number
with μ = 0.043 and σ = 0.024 (under the first condition described of algebraically independent nucleon equalities it happens to
above). Only those codes were considered which had δ in the possess in all arrangements. In this scheme the simplified ver-
range 0 ± σ, centered on the value of the standard code. As that sion of arithmetical patterns in the standard code has the score 7.
range corresponds to codes with smaller (“early”) amino acids Computer estimation shows that probability for a code to have
predominating, this condition also reflects presumable history of the score not less than 7 by chance under imposed conditions is
the code expansion (Trifonov, 2000; Wong, 2005). p1 = 1.5×10–8 (Fig. B1a).
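Condition (1) of the test can be pictured with a short sketch. It is an illustration under stated assumptions, not the program used for the test: the text does not say how a split family is divided, so each split family is cut here into a pyrimidine-ending and a purine-ending pair, and the way labels are spread over the blocks is likewise a guess. The names `whole_family_probability`, `random_blocks` and `random_code` are hypothetical.

```python
import random

BASES = "TCAG"
STRONG = set("GC")
LABELS = [
    "Ala", "Arg", "Asn", "Asp", "Cys", "Gln", "Glu", "Gly", "His", "Ile",
    "Leu", "Lys", "Met", "Phe", "Pro", "Ser", "Thr", "Trp", "Tyr", "Val",
    "Stop",
]


def whole_family_probability(doublet):
    """Probability that the four codons sharing `doublet` stay one block."""
    strong = sum(base in STRONG for base in doublet)
    # 0.9 for strong doublets, 0.5 for mixed, 0.1 for weak (values from the text).
    return {2: 0.9, 1: 0.5, 0: 0.1}[strong]


def random_blocks(rng):
    """Split the 16 codon families into blocks; None if fewer than 21 blocks."""
    blocks = []
    for first in BASES:
        for second in BASES:
            doublet = first + second
            if rng.random() < whole_family_probability(doublet):
                blocks.append([doublet + third for third in BASES])
            else:
                # Assumption: a split family breaks into a Y pair and an R pair.
                blocks.append([doublet + "T", doublet + "C"])
                blocks.append([doublet + "A", doublet + "G"])
    return blocks if len(blocks) >= len(LABELS) else None


def random_code(rng):
    """Draw block structures until one has at least 21 blocks, then label them."""
    blocks = None
    while blocks is None:
        blocks = random_blocks(rng)
    # Assumption: use every label once, then fill the remaining blocks at random.
    extra = [rng.choice(LABELS) for _ in range(len(blocks) - len(LABELS))]
    labels = LABELS + extra
    rng.shuffle(labels)
    return {codon: label
            for block, label in zip(blocks, labels)
            for codon in block}


code = random_code(random.Random(0))
```

Conditions (2) and (3) would then act as filters on the codes produced this way; they are not reproduced here because the cost function of Freeland & Hurst (1998) is not given in this appendix.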
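The quantity δ from condition (3) can be evaluated for the standard code on the circular peptide of Appendix D. The sketch below assumes the usual side-chain nucleon numbers (sums of the mass numbers of the side-chain atoms, with actual proline at 42), block residues of 56 nucleons (55 for proline) and the ±1 ionization corrections quoted in the caption of Fig. D1. The dictionaries and the function name `cytoplasmic_delta` are illustrative only.

```python
# Side-chain nucleon numbers; supplied for illustration, not listed in this appendix.
SIDE_CHAIN = {
    "Gly": 1, "Ala": 15, "Ser": 31, "Pro": 42, "Val": 43, "Thr": 45,
    "Cys": 47, "Leu": 57, "Ile": 57, "Asn": 58, "Asp": 59, "Gln": 72,
    "Lys": 72, "Glu": 73, "Met": 75, "His": 81, "Phe": 91, "Arg": 100,
    "Tyr": 107, "Trp": 130,
}

# Number of codons per amino acid in the standard code (61 sense codons).
CODON_COUNT = {
    "Gly": 4, "Ala": 4, "Ser": 6, "Pro": 4, "Val": 4, "Thr": 4,
    "Cys": 2, "Leu": 6, "Ile": 3, "Asn": 2, "Asp": 2, "Gln": 2,
    "Lys": 2, "Glu": 2, "Met": 1, "His": 2, "Phe": 2, "Arg": 6,
    "Tyr": 2, "Trp": 1,
}

# Protons lost (-1) or gained (+1) by side chains at cytoplasmic pH.
CHARGE = {"Asp": -1, "Glu": -1, "Arg": +1, "Lys": +1}


def cytoplasmic_delta(side_chain, codon_count, charge):
    """Return (C, B, delta) for a code given as codon counts per amino acid."""
    # C: all side-chain nucleons, corrected for ionization.
    c_total = sum(n * (side_chain[aa] + charge.get(aa, 0))
                  for aa, n in codon_count.items())
    # B: all block-residue nucleons after loss of water in each peptide bond.
    b_total = sum(n * (55 if aa == "Pro" else 56)
                  for aa, n in codon_count.items())
    return c_total, b_total, (c_total - b_total) / (c_total + b_total)


C, B, delta = cytoplasmic_delta(SIDE_CHAIN, CODON_COUNT, CHARGE)
print(C, B, delta)
```

With these inputs both sums come to 3412 nucleons, so δ = 0 for the standard code, which is the value the acceptance window 0 ± σ is centered on.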
Fig. B1. Distribution of variant codes by their scores for (a) nucleon equalities and (b) ideogram symmetries. The size of the sample in both cases is one billion codes.

Ideogram symmetries. An ideogram might be built for each variant code in the same way as shown in Fig. 7 (however, no requirement is imposed for whole and split families to be linked with any transformation). There are a few more conceivable ways to build an ideogram using contracted codon series (ideograms based on full-size codons suffer from ambiguities). For example, nucleon and redundancy numbers might be arranged in the same direction, rather than antisymmetrically. Another way is to divide the code by positions in codons (e.g., R or Y in the first position; though these ideograms are simpler as two of their four upper strings are always binary, whereas in ideograms based on redundancy all strings are, in general, quaternary). In total, 9 ideogram versions were built for each code and checked for symmetries. Namely, each of the four strings was checked for M, M + I, T, T + I, where M and T stand for mirror and translation symmetries and I denotes pair inversions of all three types. For each symmetry a string of length L gets the score L/2 if it contains only two types of bases (or if the symmetry holds only in binary representation RY, SW or KM), and L if it contains three or all four types of bases. Only whole-string symmetries were considered (in this case multiple symmetries organizing different parts of a string such as in Fig. 9b are not detected; the whole string in Fig. 9b, however, is mirror symmetrical in KM representation). For each ambiguous position (two neighboring series with equal nucleon numbers) the penalty L/3 was introduced. Semantical symmetries and balances of translated amino acids were not checked. Finally, if at least one of the four strings has none of the symmetries, the score is divided by 2. The euplotid code has the score 35 in this scheme: 8 for M + I(TA, CG) and 4 for T_RY in the upper short string, 4 for M_RY in the center short string, 8 for M_KM in the upper long string, 16 for M in the center long string, penalty −16/3 ≈ −5 for Lys and Gln (though in this case their interchange affects neither M_KM in the upper string, nor M in the center one). Computer estimation shows that the probability for a code to have a score of at least 35 by chance under the imposed conditions is p₂ = 9.4×10⁻⁵ (Fig. B1b).
We also checked transformations in Rumer’s bisections of generated codes, since these transformations served as the guiding principle for signal extraction in the real code. Under the conditions imposed, the probability for a random code to have equal numbers of whole and split families which are furthermore linked with any of the three possible transformations was found to be 4.6×10⁻². Given that one transformation takes place, the other two might be distributed among codons in the ratios 8:0 (p = 0.125), 4:4 (p = 0.375), or 2:6 (p = 0.5). For the real code this ratio is 4:4 (see Fig. 2a), so finally p₃ = 4.6×10⁻² × 0.375 ≈ 1.7×10⁻².
As suggested by a separate computational study, mutual influence of the three types of patterns is negligible, so the total probability for a (very simplified) signal to occur by chance in a single code under the imposed conditions is p₁p₂p₃ = 2.4×10⁻¹⁴. Since the redundancy-symmetric code does not even need to be found in nature to reveal the ideogram, the final P-value will not differ much from that value.
This result gives probabilities for the specific type of patterns: nucleon equalities, ideogram symmetries and transformations. However, testing the hypothesis of an intelligent signal should take into account patterns of other sorts as well, as long as they meet the requirements outlined in the Introduction. After analysis of the literature on the genetic code our opinion is still that nucleon and redundancy numbers are the best candidates for “ostensive numerals”. We accept though that there could be other possibilities and that the obtained P-value should be regarded as a rough approximation (keep in mind simplifications in the test as well). But admittedly, there are just not enough candidates for “ostensive numerals” and corresponding (algebraically defined) pattern ensembles to compensate for the small P-value obtained and to raise it close to the significance level.

Appendix C. Digital symmetries of positional numeral systems
The digital symmetry described in the main text for the decimal system is related to a divisibility criterion that might be used to effectively perform checksums. Consider the number 27014319417 as an example. A triplet reading frame splits this number into digital triplets 270, 143, 194, 170 (any of the three reading frames might be chosen; zeros are added at flanks to form complete triplets). The sum of these triplets equals 777. Its distinctive notation indicates that the original number is divisible by 037. In four-digit numbers that appear during summations the thousands digits are transferred to the units digits. If the notation of the resulting sum is not distinctive, add or subtract 037 once. Subsequent distinctive notation will confirm the divisibility of the original number by 037 while its absence will disprove it. Thus, the other two frames for the exemplary number yield:

002 + 701 + 431 + 941 + 700 = 2775 → 002 + 775 = 777;
027 + 014 + 319 + 417 = 777.

This criterion applies to numbers of any length and requires a register with only three positions. Moving along a linear notation, such a register adds digital triplets together and transfers thousands digits to units digits.
The same triplet digital symmetry and related divisibility criterion exist in all numeral systems with radix q for which (q − 1)/3 is an integer. The symmetry-related prime number in those systems is found as 111 (read in radix q) divided by 3, i.e. (q² + q + 1)/3. Thus, the feature exists in the quaternary system (q = 4) with prime number 7 (013₄), the septenary system (q = 7) with prime number 19 (025₇), the decimal system (q = 10) with prime number 037, the system with q = 13 and prime number 61 (049₁₃), and so on. The digital symmetry of the quaternary system is shown in Fig. C1.

Fig. C1. Similar to the decimal system, the quaternary system also displays symmetry of digital triplets, where 7 (013₄) acts instead of 037.
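The three-position register described in Appendix C can be sketched in a few lines. This is an illustrative reading of the text, not the authors' code: `frame` is taken to be the number of zeros prepended before the digits are cut into triplets, and the function names are hypothetical. The procedure works because 1000 leaves remainder 1 on division by 999 = 27 × 37, so folding thousands into units preserves the remainder modulo 37, and appending zeros multiplies the number by a power of ten, which shares no factor with 37.

```python
def triplet_sum(digits: str, frame: int = 0) -> int:
    """Add the digit triplets of `digits`, folding thousands into units."""
    # Shift the reading frame by prepending zeros, then complete the last triplet.
    padded = "0" * frame + digits
    padded += "0" * (-len(padded) % 3)
    total = sum(int(padded[i:i + 3]) for i in range(0, len(padded), 3))
    # Emulate the three-position register: carry thousands back to the units.
    while total > 999:
        total = total // 1000 + total % 1000
    return total


def is_distinctive(value: int) -> bool:
    """True for 000, 111, 222, ..., 999 (three identical digits)."""
    return value % 111 == 0


def divisible_by_037(digits: str, frame: int = 0) -> bool:
    """Apply the criterion: distinctive sum, possibly after adding or subtracting 037."""
    s = triplet_sum(digits, frame)
    return any(is_distinctive(s + shift) for shift in (0, 37, -37))


number = "27014319417"
print([triplet_sum(number, frame) for frame in (0, 1, 2)])
print(divisible_by_037(number), int(number) % 37 == 0)
```

For the number used in the text, all three frames fold to 777, in agreement with the sums written out above.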
Appendix D. The cytoplasmic balance
Fig. D1 represents the entire genetic code as a peptide. Each amino acid is inserted into this peptide as many times as it appears in the standard code. Amino acid block residues make up the peptide backbone. The resulting polymer is 61 amino acids long. If its N- and C-termini are eliminated by closing the peptide into a ring, its backbone and side chains appear precisely balanced. Notably, this feature is common to natural proteins: their mass is distributed approximately equally between peptide backbone and side chains (Downes & Richardson, 2002). This also automatically implies that frequencies of amino acids in natural proteins correlate with their abundance in the genetic code (see data in Gilis et al., 2001).
Not only is the activation key discarded in this balance, but amino acid molecules are considered as they appear in the cytoplasmic environment (where side chains of some of them are ionized). For these reasons the balance shown in Fig. D1 is referred to as natural or cytoplasmic. Nevertheless, the unusual peptide form (though circular peptides do occur rarely in nature, see Conlan et al., 2010) and the distinction between amino acid blocks and chains suggest that the cytoplasmic balance and the “virtual” balances shown in the main text are likely to be related phenomena. Possibly, this balance is intended to validate the artificial nature of the activation key, showing that only actual proline can maintain patterns in the natural environment. This balance was found by Downes & Richardson (2002) from the biological aspect. Simultaneously, Kashkarov et al. (2002) found it with a formal arithmetical approach.

Fig. D1. Amino acids of the standard genetic code in the form of a circular peptide (sequence order does not matter). The peptide is formed by aggregating standard blocks of amino acids into a polymer backbone. Formation of each peptide bond releases a water molecule, reducing each amino acid block to 56 nucleons (55 in proline). Asp and Glu lose one proton each from their side chains at cytoplasmic pH, while Arg and Lys gain one proton each (denoted with −1 and +1, respectively). Other amino acids are predominantly neutral in the cytoplasmic environment (Alberts et al., 2008). As a result, the nucleon sum of the peptide backbone is exactly equal to that of all its side chains.

Appendix E. Algebraic representation of the signal
Here we describe a possible way the signal-harboring mapping might have been obtained. As initial data, one has a set of 64 codons and another set of 20 canonical amino acids plus Stop. Suppose the mapping between those two sets is unknown and has to be deduced from the given pattern ensemble of the signal. There are ~10⁸³ possible mappings between the two sets, provided that each element from the second set is represented at least once. Knowing the ideogram (without knowing nucleon numbers mapped to individual codons) is equivalent to knowing the block structure of the code. From this follows the first portion of equations ggt = ggc = gga = ggg = ggn, ttt = ttc = tty, etc., where codons are used to denote variables, namely the unknown nucleon numbers of amino acid side chains. Thus, the number of elements in the first set is essentially reduced from 64 to 24. But there are still ~10³⁰ possible mappings left. Now one might write down the nucleon sums from Figs. 5-8 and 10 (leaving out algebraically dependent parts and standard block sums, as we are provided with the set of canonical amino acids; in case of projecting the patterns Stop might be preliminarily assigned to certain codons to make things easier with the block sums):

ggn + gcn + tcn + ccn + gtn + acn + ctn + cgn = 333 (Fig. 7b);
tgy + tga + ath + tar + agy + ttr + aay + gay + car + aar + gar + cay + tty + agr + tay + atg + tgg = 111 + 999 (Fig. 7b);
tty + ttr + tcn + tay + tar + tgy + tga + tgg + ctn + ccn + cay + car + cgn = 814 (Fig. 8a);
tty + ttr + tcn + tay + tar + tgy + tga + tgg + gtn + gcn + gay + gar + ggn = 654 (Fig. 8b);
tty + ttr + ctn + ath + atg + gtn + tgy + tga + tgg + cgn + agy + agr + ggn = 789 (Fig. 8b);
tty + aar + ath + tcn + cay + 2gcn + ctn + tgy + tga + gay + atg + car + agy = 703 (Fig. 5a);
ggn + ccn + ctn + 2acn + tay + tcn + 2gtn + 2cgn + agy + tar + gay = 703 (Fig. 5a);
tty + 2ttr + 3ccn + 2ctn + ath + gtn + 2tcn + acn + gcn + tay + tgy + cay + cgn = 999 (Fig. 5b);
2aay + aar + tar + car + gar = 333 (Fig. 5b);
3ggn + tgg + cgn + agr = 333 (Fig. 5b);
ath + acn + agr + gtn + gcn + gar = 333 (Fig. 5b);
tty + 2ctn + 2tcn + ccn + 2aay + tar + ath + car + acn + 2ggn + tgg + gtn + cgn + gcn = 888 (Fig. 5c);
5tty + 4ttr + 5ctn + 4ath + atg + 5gtn + 5tcn + ccn + acn + gcn + 3tay + 2tar + cay + aay + gay + 3tgy + tga + tgg + cgn + agy + ggn = 666 + 999×2 (Fig. 6b);
2tar + aar + 2atg = 222 (Fig. 10d);
agy + 2aar + tgh = 222 (Fig. 10e).

The cytoplasmic balance is not accounted for here, as it has no algebraic connection to this system due to the activation key. There are also additional inequalities provided by the ideogram (Fig. 7a):

ggn ≤ gcn ≤ tcn ≤ ccn ≤ gtn ≤ acn ≤ ctn ≤ cgn;
tgh ≤ ath;
tar ≤ agy ≤ ttr ≤ aay ≤ gay ≤ car ≤ aar ≤ gar ≤ cay ≤ tty ≤ agr ≤ tay;
atg ≤ tgg.

Finally, tgh = tgy to account for two code versions. In total, there are 26 unknowns, 16 equations and 20 inequalities. Generally, such systems of Diophantine equations and inequalities have multiple solutions. Since we are interested here in obtaining the mapping of the code given the patterns and the fixed set of canonical amino acids plus Stop, the solution is to be searched over the fragmentary domain of integers and zero {0, 1, 15, 31, 41, 43, 45, 47, 57, 58, 59, 72, 73, 75, 81, 91, 100, 107, 130}. In this case analysis of the system with any computer algebra system capable of dealing with Diophantine expressions shows that this system has a single solution coinciding with the actual mapping of nucleon numbers onto codons: tty = 91, ggn = 1, tga = 0, ath = 57,
etc. That still leaves us with several mappings for amino acids though, since two of the roots (57 and 72) represent two amino acids each. This ambiguity is eliminated when the patterns within side chains (Figs. 7b and 8a) are also taken into account. After that the actual mapping of the code is deduced unambiguously from the algebraic system of the patterns. In fact, analysis shows that an unambiguous solution is achieved even if the restriction of the fragmentary domain is applied only to some of the unknowns. In another approach (shCherbak, 2003) an unambiguous solution is achieved with only a few assumptions about the amino acid set.

Acknowledgments
The study was partially financed by the Ministry of Education and Science of the Republic of Kazakhstan. The research was appreciably promoted by Professor Bakytzhan T. Zhumagulov from the National Engineering Academy of the Republic of Kazakhstan. Part of the research was made during V.I.S.’ stay at Max-Planck-Institut für biophysikalische Chemie (Göttingen, Germany) on kind invitation of Professor Manfred Eigen. V.I.S. expresses special thanks to Ruthild Winkler-Oswatitsch for her valuable help and care. M.A.M. acknowledges the support by the administration of Fesenkov Astrophysical Institute. The authors are grateful to Professor Paul C.W. Davies, Felix P. Filatov, Vladimir V. Kashkarov, Artem S. Novozhilov, Denis V. Tulinov, Artem N. Yermilov and Denis V. Yurin for objective criticism and fruitful discussions of the manuscript. We deeply appreciate Icarus Editorial Staff for organizing the peer review and the three reviewers for their comments which led to the improvement of the manuscript.

Authors’ contributions
V.I.S. conceived of and performed the research, developed graphic arts. V.I.S. and M.A.M. analyzed data, introduced interpretation of the activation key, outlined structure of the paper. M.A.M. performed statistical test and algebraic analysis, wrote the manuscript.

References
Ailenberg M. & Rotstein O.D. (2009) An improved Huffman coding method for archiving text, images, and music characters in DNA. BioTechniques 47, 747-754.
Alberts B., Johnson A., Lewis J., Raff M., Roberts K., Walter P. (2008) Molecular biology of the cell, 5th edition. Garland Science, New York.
Alff-Steinberger C. (1969) The genetic code and error transmission. Proc. Natl. Acad. Sci. USA 64, 584-591.
Alvager T., Graham G., Hilleke R., Hutchison D., Westgard J. (1989) On the information content of the genetic code. BioSystems 22, 189-196.
Baisnée P.-F., Baldi P., Brunak S., Pedersen A.G. (2001) Flexibility of the genetic code with respect to DNA structure. Bioinformatics 17, 237-248.
Bancroft C., Bowler T., Bloom B., Clelland C.T. (2001) Long-term storage of information in DNA. Science 293, 1763-1765.
Barbieri M. (2008) Biosemiotics: a new understanding of life. Naturwissenschaften 95, 577-599.
Bashford J.D., Tsohantjis I., Jarvis P.D. (1998) A supersymmetric model for the evolution of the genetic code. Proc. Natl. Acad. Sci. USA 95, 987-992.
Bertman M.O. & Jungck J.R. (1979) Group graph of the genetic code. J. Hered. 70, 379-384.
Bollenbach T., Vetsigian K., Kishony R. (2007) Evolution and multilevel optimization of the genetic code. Genome Res. 17, 401-404.
Budisa N. (2006) Engineering the Genetic Code: Expanding the Amino Acid Repertoire for the Design of Novel Proteins. Wiley-VCH, Weinheim.
Chin J.W. (2012) Reprogramming the genetic code. Science 336, 428-429.
Conlan B.F., Gillon A.D., Craik D.J., Anderson M.A. (2010) Circular proteins and mechanisms of cyclization. Biopolymers 94, 573-583.
Crick F.H.C. (1968) The origin of the genetic code. J. Mol. Biol. 38, 367-379.
Crick F.H.C. (1981) Life Itself: Its Origin and Nature. Simon and Schuster, New York.
Crick F.H.C., Griffith J.S., Orgel L.E. (1957) Codes without commas. Proc. Natl. Acad. Sci. USA 43, 416-421.
Crick F.H.C. & Orgel L.E. (1973) Directed panspermia. Icarus 19, 341-346.
Danckwerts H.-J. & Neubert D. (1975) Symmetries of genetic code-doublets. J. Mol. Evol. 5, 327-332.
Davies P.C.W. (2010) The Eerie Silence: Are We Alone in the Universe? Penguin, London.
Davies P.C.W. (2012) Footprints of alien technology. Acta Astronaut. 73, 250-257.
Dennett D.C. (1996) Darwin’s Dangerous Idea: Evolution and the Meanings of Life. Penguin, London, p. 131.
Di Giulio M. (2005) The origin of the genetic code: theories and their relationships, a review. BioSystems 80, 175-184.
Downes A.M. & Richardson B.J. (2002) Relationships between genomic base content and distribution of mass in coded proteins. J. Mol. Evol. 55, 476-490.
Dragovich B. (2012) p-Adic structure of the genetic code. arXiv:1202.2353.
Dumas S. & Dutil Y. (2004) The Evpatoria messages. ‘Papers’ section at http://www.activeseti.org.
Ehman J.R. (2011) “Wow!” – a tantalizing candidate. In: Searching for Extraterrestrial Intelligence: SETI Past, Present, and Future (edited by Shuch HP). Springer, Berlin, Heidelberg, pp. 47-63.
Eigen M. & Winkler R. (1983) Laws of the Game: How the Principles of Nature Govern Chance. Princeton Univ. Press, Princeton.
Elliott J.R. (2010) Detecting the signature of intelligent life. Acta Astronaut. 67, 1419-1426.
Freeland S.J. (2002) The Darwinian genetic code: an adaptation for adapting? Genet. Programm. Evolvable Mach. 3, 113-127.
Freeland S.J. & Hurst L.D. (1998) The genetic code is one in a million. J. Mol. Evol. 47, 238-248.
Freitas R.A. (1983) The search for extraterrestrial artifacts (SETA). J. Brit. Interplanet. Soc. 36, 501-506.
Freudenthal H. (1960) LINCOS: Design of a Language for Cosmic Intercourse. North-Holland Publishing Company, Amsterdam.
Gamow G. & Yčas M. (1955) Statistical correlation of protein and ribonucleic acid composition. Proc. Natl. Acad. Sci. USA 41, 1011-1019.
Gibson D.G., Glass J.I., Lartigue C., Noskov V.N., Chuang R.Y., Algire M.A., Benders G.A., Montague M.G., Ma L., Moodie M.M., Merryman C., Vashee S., Krishnakumar R., Assad-Garcia N., Andrews-Pfannkoch C., Denisova E.A., Young L., Qi Z.Q., Segall-Shapiro T.H., Calvey C.H., Parmar P.P., Hutchison C.A. III, Smith H.O., Venter J.C. (2010) Creation of a bacterial cell controlled by a chemically synthesized genome. Science 329, 52-56.
Giegé R., Sissler M., Florentz C. (1998) Universal rules and idiosyncratic features in tRNA identity. Nucleic Acids Res. 26, 5017-5035.
Gilis D., Massar S., Cerf N.J., Rooman M. (2001) Optimality of the genetic code with respect to protein stability and amino-acid frequencies. Genome Biol. 2, 49.1-49.12.
Gusev V.A. & Schulze-Makuch D. (2004) Genetic code: Lucky chance or fundamental law of nature? Phys. Life Rev. 1, 202-229.
Haig D. & Hurst L.D. (1991) A quantitative measure of error minimization in the genetic code. J. Mol. Evol. 33, 412-417.
Hasegawa M. & Miyata T. (1980) On the antisymmetry of the amino acid code table. Orig. Life 10, 265-270.
Hayes B. (1998) The invention of the genetic code. Am. Sci. 86, 8-14.
Higgs P.G. (2009) A four-column theory for the origin of the genetic code: tracing the evolutionary pathways that gave rise to an optimized code. Biol. Dir. 4, 16.
Hoch A.J. & Losick R. (1997) Panspermia, spores and the Bacillus subtilis genome. Nature 390, 237-238.
Hornos J.E.M. & Hornos Y.M.M. (1993) Algebraic model for the evolution of the genetic code. Phys. Rev. Lett. 71, 4401-4404.
Ibba M. & Söll D. (2000) Aminoacyl-tRNA synthesis. Annu. Rev. Biochem. 69, 617-650.
Itzkovitz S. & Alon U. (2007) The genetic code is nearly optimal for allowing additional information within protein-coding sequences. Genome Res. 17, 405-412.
Jungck J.R. (1978) The genetic code as a periodic table. J. Mol. Evol. 11, 211-224.
Karasev V.A. & Stefanov V.E. (2001) Topological nature of the genetic code. J. Theor. Biol. 209, 303-317.
Kashkarov V.V., Krassovitskiy A.M., Mamleev V.S., shCherbak V.I. (2002) Random sequences of proteins are exactly balanced like the canonical base pairs of DNA. 10th ISSOL Meeting and 13th International Conference on the Origin of Life, 121-122 (abstract).
Klump H.H. (2006) Exploring the energy landscape of the genetic code. Arch. Biochem. Biophys. 453, 87-92.
Knight R.D., Freeland S.J., Landweber L.F. (1999) Selection, history and chemistry: the three faces of the genetic code. Trends Biochem. Sci. 24, 241-247.
Knight R.D., Freeland S.J., Landweber L.F. (2001) Rewiring the keyboard: evolvability of the genetic code. Nat. Rev. Genet. 2, 49-58.
Koonin E.V. & Novozhilov A.S. (2009) Origin and evolution of the genetic code: the universal enigma. IUBMB Life 61, 99-111.
Kudla G., Murray A.W., Tollervey D., Plotkin J.B. (2009) Coding-sequence determinants of gene expression in Escherichia coli. Science 324, 255-258.
Lagerkvist U. (1978) “Two out of three”: an alternative method for codon reading. Proc. Natl. Acad. Sci. USA 75, 1759-1762.
Li S. & Hong M. (2011) Protonation, tautomerization, and rotameric structure of histidine: a comprehensive study by magic-angle-spinning solid-state NMR. J. Am. Chem. Soc. 133, 1534-1544.
Marx G. (1979) Message through time. Acta Astronaut. 6, 221-225.
Mautner M.N. (2000) Seeding the Universe with Life: Securing Our Cosmological Future. Legacy Books, Christchurch.
McClain W.H. & Foss K. (1988) Changing the acceptor identity of a transfer RNA by altering nucleotides in a “variable pocket”. Science 241, 1804-1807.
Meyer F., Schmidt H.I., Plümper E., Hasilik A., Mersmann G., Meyer H.E., Engström A., Heckmann K. (1991) UGA is translated as cysteine in pheromone 3 of Euplotes octocarinatus. Proc. Natl. Acad. Sci. USA 88, 3758-3761.
Minsky M. (1985) Why intelligent aliens will be intelligible. In: Extraterrestrials: Science and Alien Intelligence (edited by Regis E). Cambridge Univ. Press, Cambridge, pp. 117-128.
Moura G.R., Paredes J.A., Santos M.A.S. (2010) Development of the genetic code: Insights from a fungal codon reassignment. FEBS Lett. 584, 334-341.
Nakamura H. (1986) SV40 DNA – A message from ε Eri? Acta Astronaut. 13, 573-578.
Nirenberg M., Leder P., Bernfield M., Brimacombe R., Trupin J., Rottman F., O’Neal C. (1965) RNA codewords and protein synthesis, VII. On the general nature of the RNA code. Proc. Natl. Acad. Sci. USA 53, 1161-1168.
Novozhilov A.S., Wolf Y.I., Koonin E.V. (2007) Evolution of the genetic code: partial optimization of a random code for robustness to translation error in a rugged fitness landscape. Biol. Dir. 2, 24.
Okamura K., Wei J., Scherer S.W. (2007) Evolutionary implications of inversions that have caused intra-strand parity in DNA. BMC Genomics 8, 160.
Pacioli L. (1508) De Viribus Quantitatis, manuscript, Library of the University of Bologna, code number 250.
Rak J., Skurski P., Simons J., Gutowski M. (2001) Low-energy tautomers and conformers of neutral and protonated arginine. J. Am. Chem. Soc. 123, 11695-11707.
Rodin A.S., Szathmáry E., Rodin S.N. (2011) On origin of genetic code and tRNA before translation. Biol. Dir. 6, 14.
Rose C. & Wright G. (2004) Inscribed matter as an energy-efficient means of communication with an extraterrestrial civilization. Nature 431, 47-49.
Rumer Yu.B. (1966) Codon systematization in the genetic code. Dokl. Acad. Nauk SSSR 167, 1393-1394 (in Russian).
Sagan C., Drake F.D., Druyan A., Ferris T., Lomberg J., Sagan L.S. (1978) Murmurs of Earth: The Voyager Interstellar Record. Random House, New York.
Sagan C., Sagan L.S., Drake F. (1972) A Message from Earth. Science 175, 881-884.
Sella G. & Ardell D.H. (2006) The coevolution of genes and genetic codes: Crick’s frozen accident revisited. J. Mol. Evol. 63, 297-313.
shCherbak V.I. (1988) The co-operative symmetry of the genetic code. J. Theor. Biol. 132, 121-124.
shCherbak V.I. (1993) The symmetrical architecture of the genetic code systematization principle. J. Theor. Biol. 162, 395-398.
shCherbak V.I. (2003) Arithmetic inside the universal genetic code. BioSystems 70, 187-209.
Siemion I.Z. & Stefanowicz P. (1992) Periodical change of amino acid reactivity within the genetic code. BioSystems 27, 77-84.
Taniguchi M. & Hino T. (1981) Cyclic tautomers of tryptophans and tryptamines – 4. Tetrahedron 37, 1487-1494.
Taylor F.J.R. & Coates D. (1989) The code within the codons. BioSystems 22, 177-187.
Tepfer D. (2008) The origin of life, panspermia and a proposal to seed the Universe. Plant Science 175, 756-760.
The Staff at the National Astronomy and Ionosphere Center (1975) The Arecibo message of November, 1974. Icarus 26, 462-466.
Tlusty T. (2010) A colorful origin for the genetic code: Information theory, statistical mechanics and the emergence of molecular codes. Phys. Life Rev. 7, 362-376.
Travers A. (2006) The evolution of the genetic code revisited. Orig. Life Evol. Biosph. 36, 549-555.
Trifonov E.N. (2000) Consensus temporal order of amino acids and evolution of the triplet code. Gene 261, 139-151.
Vetsigian K., Woese C., Goldenfeld N. (2006) Collective evolution and the genetic code. Proc. Natl. Acad. Sci. USA 103, 10696-10701.
Wilhelm T. & Nikolajewa S. (2004) A new classification scheme of the genetic code. J. Mol. Evol. 59, 598-605.
Woese C.R. (1965) Order in the genetic code. Proc. Natl. Acad. Sci. USA 54, 71-75.
Wolf Y.I. & Koonin E.V. (2007) On the origin of the translation system and the genetic code in the RNA world by means of natural selection, exaptation, and subfunctionalization. Biol. Dir. 2, 14.
Wong J.T.-F. (2005) Coevolution theory of the genetic code at age thirty. BioEssays 27, 416-425.
Yachie N., Ohashi Y., Tomita M. (2008) Stabilizing synthetic data in the DNA of living organisms. Syst. Synth. Biol. 2, 19-25.
Yarus M., Widmann J.J., Knight R. (2009) RNA-amino acid binding: a stereochemical era for the genetic code. J. Mol. Evol. 69, 406-429.
Yokoo H. & Oshima T. (1979) Is bacteriophage φX174 DNA a message from an extraterrestrial intelligence? Icarus 38, 148-153.
Yuan J., O’Donoghue P., Ambrogelly A., Gundllapalli S., Sherrer R.L., Palioura S., Simonović M., Söll D. (2010) Distinct genetic code expansion strategies for selenocysteine and pyrrolysine are reflected in different aminoacyl-tRNA formation systems. FEBS Lett. 584, 342-349.
Zhuravlev Yu.N. (2002) Two rules of distribution of amino acids in the code table indicate chimeric nature of the genetic code. Dokl. Biochem. Biophys. 383, 85-87.
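
As a numerical cross-check of the system in Appendix E, the sketch below substitutes the nucleon numbers of the actual code into the fifteen nucleon sums and the four inequality chains. Only tty = 91, ggn = 1, tga = 0 and ath = 57 are quoted in the text; the remaining values are filled in here from the standard code and the domain listed in Appendix E, and should be read as an assumption. The helper names are hypothetical. The check confirms consistency only; it does not search for other solutions, so it says nothing about uniqueness.

```python
import re

# Assumed nucleon numbers per codon block (proline with the activation key, 41).
VALUES = {
    "tty": 91, "ttr": 57, "ctn": 57, "ath": 57, "atg": 75, "gtn": 43,
    "tcn": 31, "ccn": 41, "acn": 45, "gcn": 15, "tay": 107, "tar": 0,
    "cay": 81, "car": 72, "aay": 58, "aar": 72, "gay": 59, "gar": 73,
    "tgy": 47, "tga": 0, "tgg": 130, "cgn": 100, "agy": 31, "agr": 100,
    "ggn": 1, "tgh": 47,
}

# Left-hand sides and right-hand sides as written in Appendix E.
EQUATIONS = [
    ("ggn + gcn + tcn + ccn + gtn + acn + ctn + cgn", 333),
    ("tgy + tga + ath + tar + agy + ttr + aay + gay + car + aar + gar"
     " + cay + tty + agr + tay + atg + tgg", 111 + 999),
    ("tty + ttr + tcn + tay + tar + tgy + tga + tgg + ctn + ccn + cay"
     " + car + cgn", 814),
    ("tty + ttr + tcn + tay + tar + tgy + tga + tgg + gtn + gcn + gay"
     " + gar + ggn", 654),
    ("tty + ttr + ctn + ath + atg + gtn + tgy + tga + tgg + cgn + agy"
     " + agr + ggn", 789),
    ("tty + aar + ath + tcn + cay + 2gcn + ctn + tgy + tga + gay + atg"
     " + car + agy", 703),
    ("ggn + ccn + ctn + 2acn + tay + tcn + 2gtn + 2cgn + agy + tar"
     " + gay", 703),
    ("tty + 2ttr + 3ccn + 2ctn + ath + gtn + 2tcn + acn + gcn + tay"
     " + tgy + cay + cgn", 999),
    ("2aay + aar + tar + car + gar", 333),
    ("3ggn + tgg + cgn + agr", 333),
    ("ath + acn + agr + gtn + gcn + gar", 333),
    ("tty + 2ctn + 2tcn + ccn + 2aay + tar + ath + car + acn + 2ggn"
     " + tgg + gtn + cgn + gcn", 888),
    ("5tty + 4ttr + 5ctn + 4ath + atg + 5gtn + 5tcn + ccn + acn + gcn"
     " + 3tay + 2tar + cay + aay + gay + 3tgy + tga + tgg + cgn"
     " + agy + ggn", 666 + 999 * 2),
    ("2tar + aar + 2atg", 222),
    ("agy + 2aar + tgh", 222),
]

# Each chain lists variables that must appear in non-decreasing order.
CHAINS = [
    "ggn gcn tcn ccn gtn acn ctn cgn".split(),
    "tgh ath".split(),
    "tar agy ttr aay gay car aar gar cay tty agr tay".split(),
    "atg tgg".split(),
]


def evaluate(expression, values):
    """Evaluate a sum of terms such as '2gcn' or 'tty' with the given values."""
    total = 0
    for term in expression.split("+"):
        coefficient, name = re.fullmatch(r"(\d*)([a-z]{3})", term.strip()).groups()
        total += int(coefficient or 1) * values[name]
    return total


def chain_holds(chain, values):
    """True if the values along the chain never decrease."""
    return all(values[a] <= values[b] for a, b in zip(chain, chain[1:]))


equations_ok = all(evaluate(lhs, VALUES) == rhs for lhs, rhs in EQUATIONS)
chains_ok = all(chain_holds(chain, VALUES) for chain in CHAINS)
print(equations_ok, chains_ok, VALUES["tgh"] == VALUES["tgy"])
```

Together with tgh = tgy, the fifteen sums make up the 16 equations counted in Appendix E, and the four chains contain 7 + 1 + 11 + 1 = 20 pairwise inequalities.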