The Rafiki Map

Introduction to Rafiki's quest to break the genetic code.

Quotes to set the rebel tone


Although I am fully convinced of the truth of the views given in this volume, I by no means expect to convince experienced naturalists whose minds are stocked with a multitude of facts all viewed, during a long course of years, from a point of view directly opposite to mine. (B)ut I look with confidence to the future, to young and rising naturalists, who will be able to view both sides of the question with impartiality. Charles Darwin, Origin of Species

How an individual invents a new way of giving order to data now all assembled must here remain inscrutable and may be permanently so... Almost always the men who achieve these fundamental inventions of a new paradigm have been either very young or very new to the field whose paradigm they change... (They) are particularly likely to see that those rules no longer define a playable game and to conceive another set that can replace them. Thomas Kuhn, The Structure of Scientific Revolutions

The most exciting phrase to hear in science, the one that heralds new discoveries, is not 'Eureka!' but 'That's funny...' Isaac Asimov

The reasonable man adapts himself to the world; the unreasonable one persists in trying to adapt the world to himself. Therefore all progress depends on the unreasonable man. George Bernard Shaw

If we were still listening to scientists we'd still be on the ground. The Wright Brothers

One might imagine from these quotes that Rafiki intends to champion new ideas, and indeed this is the case. To be frank, our mission started simply enough: we wanted to sell toys. Mea culpa. The idea that a geometric puzzle could be useful in the study of genetic translation was pure marketing - just this side of a hoax. Then the Rafiki map was discovered, and the attitude quickly changed. Rafiki became intensely determined, obsessed if you will, with the task of learning the fundamentals, and then the gestalt, of the genetic code.

Science today is without a viable paradigm to explain the origin, meaning and general function of genetic information. Our view of genetic information and translation has become little more than faith-based philosophical doctrine. The perplexities and counter-instances to the cherished dogma are multiplying like Fibonacci's famous rabbits. This is therefore an area of science desperately in need of a paradigm shift, and come hell or high water, Rafiki intends to see that shift take place. Rafiki has come full circle: instead of using the genetic code to sell toys, we intend to use toys to sell a new paradigm of the genetic code.

Hundreds of pages of text and twice as many illustrations have been generated in a determined effort to explore and communicate these eccentric, heretical ideas. I have no illusions about my abilities as a writer, scientist or persuader of scientific paradigms. In fact, I know that I am marginal in all of these areas. My language is a tragically imprecise hodge-podge of non-scientific colloquialisms. Much of my thinking is under-developed and confused, and some of the writing, especially the earliest writing, veers dangerously toward gibberish, but there's got to be a pony in here somewhere. Besides, all of it is just good, clean fun, and making it available might provide someone somewhere the insightful spark they need to pick up the ball and run.

The bottom line: the dogma doesn't make any sense to me, and I've yet to find any evidence to confirm the cherished and ritualistically protected dogma. What started as a lark is picking up momentum and credibility, and it is on its way toward full-scale scientific jihad. The Rafiki notions, although blatantly heretical, make good sense from most angles, yet these ideas consistently draw immediate disdain. Granted, a big part of this reaction is due to the eccentricities of the messenger as well as the outrageousness of the message, but the evidence just isn't there to reject these wacky ideas out of hand. In fact, existing evidence appears to support the heresy. So the beatings will continue until morale improves, and the crusade will march forward until the empiric death knell is sounded. Science is being put on notice: the dogma is being put back into play, and definitive empiric evidence is now finally being requested.

This is a quest, a search rather than a destination. It's a search worth making, and it's time that somebody formally recognized the need to make it. The genetic code is an unsolved puzzle, and a puzzle worth solving. It's fascinating and fun to boot!

I am an inventor, and I am finding out that inventors and scientists are two different kinds of animal. They are typically at odds - understandably - but they each depend on the other. Inventors piggyback on the good work of scientists, and scientists advance through the fruits of invention. There is a dichotomy in the standard mindsets, however. To paraphrase a popular phrase, inventors are from Mars and scientists are from Venus.
Scientists spend entire careers adroitly learning the nuances of dogma, and then they struggle in the good fight to cleverly apply the existing dogma to outstanding problems - primarily in an attempt to bolster the dogma. In short, they are believers. (This view of workaday scientists is more deftly espoused by Thomas Kuhn.) Inventors, on the other hand, never met a dogma they didn't hate. What happens when a cherished scientific dogma fails? The dogma behind the genetic code has failed. In the words of Thomas Kuhn, the theory is in crisis, or should be, and we must look forward to the next scientific revolution. The empiric evidence is greatly at odds with the paradigm, and this humble ;-) inventor intends to contribute his two cents to the new paradigm. Here are the broad strokes. The devil, as always, is in the details.

Genetic information has more than one degree of freedom in translation from nucleotides to proteins. The function that translates genetic information and converts primary sequences into proteins is not linear or one-dimensional. Codons are inherently ambiguous with anticodons, and this ambiguity is informative during translation. In other words, there is more to making a protein than just sequencing amino acids. In short, synonymous codons aren't entirely synonymous, and any complete view of genetic translation must include the information added to the process by tRNA.

Just as molecules and molecular information assume an ideal form, so too do the rules of molecular translation. Information has structure, and the translation of information has structure as well. The ideal form of genetic translation is a dodecahedron.

Life must constantly find new and diverse sets of protein morphologies on the landscape of all possible protein morphologies. The genetic code serves as a search engine in that task. The search cannot become stuck in isolated regions of the protein landscape, so it is optimized to perform a robust and efficient search for new protein forms. Genomes and the genetic code are both full of symmetries. These symmetries are co-adapted, like hand in glove, to execute an optimized search of protein morphologies. Symmetry lays the foundation for all of life. It is the starting point for all molecular processes, and organic molecules are no different. Symmetry lays the foundation, but it is symmetry breaking that actually defines organic information.

These web pages are evolving along with the ideas, but originally they were mostly cut and pasted together from the thinking, investigating and writing that has been done since opening Pandora's box on the genetic code. I intend to continue editing and adding to these pages through time, but I especially look forward to posting the continuing contributions of others.

The original Rafiki goal was to publish a general interest book to help circulate the eccentric ideas about genetic information and the processes of life. On that score I have so far failed. Writing in general is really time-consuming, hard work, and this project continues to expand and evolve, so web publishing is the best option at this time. Two books were actually written and illustrated, but they are woefully unedited, and no serious steps toward printing were taken. Rather than letting them rot in the digital ether of a single computer, they have been made available for completeness, for entertainment and as historical documentation of this fascinating journey - a peek inside the creative process, so to speak. Why not? The web makes anything possible. The first book carries the cryptic working title 'Rafiki - At the Edge.' The second was given the working title 'Organic Computers and Genetic Information - the Rafiki Code.' The first was done in color, and is quite colorful, but the second was done more soberly and cost-consciously in anticipation of self-publication. It is tastefully done in black and white, and formatted for printing at 6" x 9". Perhaps I will return to the book project at a later time; for now a 'real book' remains part of the twisted Rafiki dream.

At some point after writing the first book, I started writing shorter pieces in an effort to get some ideas published in an academic journal. Some of these pieces will surely be posted, but in general the idea of scientific publication is a rat hole for my time. Again, I failed.
As I admitted earlier, I am no scientist, and there seems to be no place for Rafiki in big-league science today. Rafiki is the science of tomorrow, I suppose. I think it was Hunter Thompson who said, 'when the going gets weird, the weird turn pro.'

Some spit and vinegar (another typical internet rant of the heretic): I am frustrated by the institutional arrogance that I routinely encounter, the lack of curiosity or sense of adventure, and mostly by my inability to pick a good fight. Clearly, these folks are too busy chasing grants and making a living to spend any time on their own ideas, let alone the ideas of cyberspace heretics. Their disdain is comprehensible; however, the close-mindedness is stunning. 'Not my area' is the scientific euphemism for 'piss off, piker.' I was actually told by one of the big boys at the get-go, "you're wrong, and so what if you're right?" Can't really argue with that, but it remains quite vivid in my memory nonetheless.

Here is my rebel yell: proteins are not functionally or logically equivalent to sequences of amino acids, and the path to understanding this must pass through a dodecahedron! The rally cry: Symmetry!

Much of the earlier hard work as a writer was cut apart, recombined and augmented in these web pages. I have made the originals available free of charge in the books section of this site, but they represent dated material. They are useful, colorful and entertaining, but they are not being updated as Rafiki learns and grows.

If you have wandered into the Rafiki circus as an unwitting novice, attracted by the glitz and pretty lights, please do not be frightened away just because you might not have a rudimentary familiarity with cell biology, biopolymers, DNA and proteins. There are many good web sites for these basics. Here are some links:

Great intro to proteins
Folding@Home
North Harris College
Genetic Science Learning Center
Freeland Lab
DNA Sciences
Hypermedia Glossary Of Genetic Terms
The Dictionary of Cell Biology
The Human Genome Project
Unraveling the Mystery of Protein Folding

The basics are not that hard, and it doesn't take too long to get up to speed - not as long as you might think. A contemporary 9th grade biology textbook is all you should need to get started and join our search for a better understanding of genetic information and translation, the gestalt of life's processes. In fact, the less indoctrination the better. Smart folks with computer, math, art, cryptology or just puzzle-solver backgrounds will thrive in this area. Whatever you do, don't drink the Kool-Aid. Look at the situation with fresh eyes, and make your own critical assessment. Don't accept the 'everybody knows' argument from authority without being shown the goods.

The thrust of these pages is slightly more abstract than any biology text treatment of the subject. That is the nature of the beast, and the preferred realm of this author. I do not intend at this time to create primer pages for the basics, but you never know. Please share with us some of the web pages that you find particularly good or useful in these areas. We are interested in views and studies that either confirm or contradict the ideas put forth here, and we are just getting started.

This should be controversial


The idea that primary sequence alone determines tertiary structure in protein folding should be controversial.

"a beautiful example of how an entirely acceptable conclusion can be reached that is entirely wrong because of the paucity of knowledge at that particular time. I spent the following 15 years or so completely disproving the conclusions reached in this communication." Christian B. Anfinsen, in a 1989 comment on his earlier work on the structure of RNase

There is no doubt that Christian Anfinsen was a great contributor to the body of scientific knowledge. His main contribution was in the field of protein folding, and within that field one particular conclusion stands out: primary sequence determines tertiary structure. For this conclusion, first drawn in 1954, Anfinsen shared a Nobel Prize in 1972. The origin of the idea, 'the thermodynamic hypothesis of protein folding,' can be glimpsed at the end of the famous 1954 paper.

This hypothesis should still be controversial. Where is the evidence to support the radical conclusions that were subsequently drawn? Anfinsen did not provide the evidence to support all of his conclusions. It is unquestioned - even by him - that he did not have the ability to determine the shape of even a single protein. He certainly had no way to confirm a theory regarding the shapes of all proteins. This is in fact what he was looking for, and empiric confirmation is still lacking. It will be difficult to obtain, because his conclusions are fundamentally flawed.

The confusion (and I got caught by this initially as well) is due to the fact that Anfinsen actually proposed two radical ideas simultaneously. They are almost always taken to be the same idea, but they do not necessarily go together. I agree that the first idea was all but proven, but the other idea is vastly more radical, and it was not proven by his experiments. Here is the breakdown of the two ideas:

1. Due to thermodynamic molecular forces, polypeptides automatically assume unique, stable conformational ensembles. This might also be termed the auto-assembly hypothesis of protein synthesis.

2. For every sequence of amino acids there is a unique, defining conformational ensemble to which it must auto-assemble.

Anfinsen did a fabulous job of validating point #1, but didn't even scratch the surface on point #2. Proving the second idea requires proof of a negative: specifically, that polypeptides in physiologic conditions cannot consistently fold in more than one way. Conversely, it should be simple to disprove, and in fact it might already be disproved: accepting that polypeptides can and do fold consistently in more than one way requires a mere handful of examples where this is known to happen. (For a nice overview of protein folding go here: Unraveling the Mystery of Protein Folding.)

The easiest proof of this 'multi-target' view of folding is a prion. Prions are proteins involved in bizarre infectious diseases, such as mad cow disease, where normal proteins are forced to assume shapes different from their 'native state'. Regardless of the mechanism, a prion is an example where the same sequence defines at least two different proteins, and in all probability many different proteins. This argument extends to other diseases as well, diseases generally described as amyloidosis. The same process is believed to be behind Alzheimer's disease and several forms of cancer to boot.

When this is pointed out to the Kool-Aid drinkers, the protests, excuses and apologies fly. For some reason prions don't count, but the stark reality is that it is thermodynamically possible to make two distinct proteins from the same sequence of amino acids in physiologic conditions. They say, "there are exceptions to every rule." OK, show me a case where the rule actually holds. It should be simple to take a protein, pepper its nucleotides with copious (not just a few) 'silent mutations' and then thoroughly demonstrate that the protein remains completely unchanged. If this study exists, I have yet to find it, and in fact we can find the converse. The single-target model flies in the face of common sense, and actually seems rather absurd, so shouldn't we require at least some indisputable empiric demonstrations of this cherished model, rather than handfuls of 'everyone knows' anecdotes? Today, with a vastly larger amount of more sophisticated evidence, the accepted hypothesis fails.
The whole issue must be placed back in the context of information theory. What Anfinsen essentially proposed was that the only information that must be extracted from nucleotide sequences in translation is residue sequence; the rest is autopilot. This theory is compelling not because it is consistent with the data, but because, in the words of Anfinsen, it is a 'considerable simplification.' In fact, it is an over-simplification: it effectively collapses the information content of a protein to residue configuration alone. Preposterous. Dogma recognizes two protein states: 1. random coil, and 2. native state. This makes a tautology of 'the protein folding problem'. If the investigation of folding begins with the stipulation of a single uniform, high-energy state - random coil - and it is assumed that the result of folding will inevitably be a single, stable, low-energy shape, what is left to decide?

Sequence and structure are not equivalent!


The sequence of amino acids in a polypeptide string is a major component of the information in a protein structure, but it is clearly not the only component. The issue now should be to identify the other components and elucidate the information mechanisms that deliver them to the final structure. This must begin by questioning the unproven assumptions behind the two states of protein folding. The three basic issues are:

1. How many target structural ensembles are thermodynamically available to an amino acid sequence when it folds?

2. How many distinct conformations emerge post-translation - during or just prior to definitive folding?

3. What is the correlation, or what are the folding pathways, between the two sets - the sets of possible initial and final conformations of amino acid sequences?

If there is only a single possible state for either the initial or the final conformation of every sequence, then we can happily go about our business as usual. However, if there are in fact multiple initial conformations coming out of translation, multiple final conformations, and a correlation between the two, then investigations of protein folding must take a completely different tack.

The implications of accepting the dogmatic sequence = single structure viewpoint are profound for our view of the genetic code. If it is correct that primary sequence and only primary sequence determines a single tertiary structure, then the model can be rehabilitated, but if it is false then today's one-dimensional model of the genetic code is beyond repair. If the paradigm were actually correct we should expect to see certain things - irrefutable evidence to support it - but there is nothing at present to justify our blind faith in it. Anfinsen did not justify a belief in the linear paradigm of the genetic code. There are boundless accounts of investigators' experience suggesting that the linear paradigm is secure, but conspicuously there is no disciplined proof. There are no well-designed studies to confirm the single-target hypothesis, whereas there absolutely should be a famous study easily pointed to, reassuring us that the axiom sits on a rock-solid foundation. Anfinsen did not provide it - could not provide it - so where is the subsequent definitive study to fill this important void?

Common sense and the empiric evidence point in the other direction. There is in fact more than sequence information determining the native conformation of proteins, and ultimately the physiologic behavior of entire protein populations. The logical view is that the genetic code is more subtle, more powerful, and more complex than the beloved, over-simplified paradigm has led us to believe. There is a tremendous amount of work to be done before we can claim that this code has in fact been cracked!

Good questions with links to a few empiric answers:


Besides folded conformations, what are some of the accepted ways that protein populations are known to change with 'silent mutations'?
http://nar.oupjournals.org/cgi/content/abstract/26/20/4778
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&list_uids=11049749&dopt=Abstract

Is there any proof that synonymous substitutions are associated with structural differences?
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&list_uids=9680473&dopt=Abstract
http://nar.oupjournals.org/cgi/content/full/27/1/268
Quote from the paper: "These results support the view that structure-related synonymous codon bias is a general phenomenon found in all major taxonomic groups of organisms."
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&list_uids=8897597&dopt=Abstract

Does a 'silent mutation' actually change protein folding?
Silent mutations affect in vivo protein folding in Escherichia coli.

Does a synonymous substitution actually have an evolutionary impact?
http://www.american.edu/cas/bio/faculty_media/carlini/Carlini&Stephan2003.pdf
http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=140546

More important than the overwhelming evidence against the idea of sequence-only folding is the lack of evidence for it. This theory that is so cherished should be proven before we continue to cherish it. Why do we think that translation and subsequent folding should be perceived as a sequence-only process? The genetic code has the structure in place to go well beyond sequence during translation. Why do we think it couldn't or doesn't? There is more involved in making a protein than translating a sequence of amino acids, and there is more genetic information than residue identity stored in the nucleotide sequence. This information must exist in some form and somehow get translated to the native conformation. How?

Where is the intellectual curiosity? Where is the skepticism of dogma that frankly seems absurd? Where is the controversy and debate? Why do people get so mad when these questions are raised? If the proof of whether or not sequence really is the only determinant of structure is so certain, then it should be readily available. 'Everybody knows' is not an adequate defense of this position, because it is too easily rebutted by 'liar, liar, pants on fire'. More likely, science has fallen asleep at the switch, and we all do get a bit surly upon abrupt awakening. My ears are open - send in the proof.

What can Rafiki contribute?

The most significant contribution from Rafiki is in recognizing the deficiencies in the current model, and then going on to doggedly make a stink about it. Rafiki is nothing if not irritatingly persistent, but the emperor is naked on this one, and Rafiki is child-like enough to point it out. We've all been told that the genetic code has been cracked, yet many important questions remain unanswered - in fact, some really important questions have remarkably never been asked. Consequently, the genetic code remains unbroken, and our gestalt of the fundamental organization of life's clever molecular sets is found wanting. In essence we have only partially solved the puzzle of how proteins are made from information in DNA, but it is clearly a puzzle worth solving.

Rafiki has contributed four basic ideas:

1. The genetic code is not one-dimensional. There are other dimensions of information, besides primary sequence, that are part of the genetic code. Another way to put this is that the thermodynamic hypothesis of protein folding is false. Thermodynamics is important to folding, but it is not the only factor involved.

2. There are many ways to map the correlations between nucleic acids and amino acids, but the best way to map them is on the surface of a sphere.

3. The genetic code is optimized not just for making proteins, but for finding new proteins as well. The processes and mechanisms of new morphology creation were given short shrift in past considerations of the genetic code. The symmetry of codon assignments combines with genomic symmetry to accelerate the search for new protein morphologies.

4. Our language, metaphors and conceptual tools for studying genetic information are outdated and woefully inadequate. In addition to pointing this out, Rafiki has proposed a few modifications.

I'm frequently asked what I'm going to do next. Well, you're looking at it. What do you want me to do, build a biochemistry lab? All I can do is keep shouting, because I'm not going to become a biochemist. I've actually got a life. In the meantime, Rafiki has provided some excellent tools to advance the cause.

Code World is a nifty contribution toward understanding the genetic code. Code World and the genetic code are not literally one and the same, but their informative structures are. In this regard it is a fabulous tool to help us think about the problem of genetic translation at its very core. In other words, it shows us how two shapes can communicate information - in this case, how information stored in a dodecahedron can be communicated to a tetrahedron. Please think about these concepts in the simplest possible terms, because that is what it is ultimately going to take to get the job done: What is information? What are language and communication? How could the molecules of life organize and execute a language and achieve such sublime communication of information? This is the appropriate starting point in understanding a molecular code for genetic translation. This is an organizing principle for molecular languages, and it is missing from the dogma today. Information must have a structure. Molecules must have access to this structure in some form when they execute methods related to that information. On what structures could these methods be implemented? Molecular information is related to spatial relationships, because this is fundamentally how molecular behavior is determined.

The Rafiki Map takes the process one step further and shows us how the actual code is perfectly arranged within a dodecahedron. It effectively demonstrates the context of each molecular set in the code. It provides the most compressed, symmetric and objective view of this vital data. It perfectly demonstrates nucleotide triplets within the overall framework of the code, and it emphasizes the importance of unordered triplets in the form and function of the code. The Rafiki map highlights how the symmetry and coordination of these triplets can work in concert with a genome loaded with the symmetry of sequence transformations to efficiently generate novel protein morphologies. Because of the dodecahedral arrangement, the information involved in the genetic code can finally take shape, and the spatial relationships help define the information. The Rafiki map is simply a superior way to view the data.
In all likelihood, Rafiki will not break the genetic code alone (somebody really smart probably will), but by raising long overdue questions, and by providing useful tools - Code World and the Rafiki map - we are contributing to scientific advancement. Perhaps this website will become an icon and a forum for that advancement, or perhaps that forum will develop elsewhere. If you are like-minded, curious or even just amused, please link with us. The more the merrier.

The primary theme here is symmetry, and within that theme the dodecahedron takes center stage. The three basic areas where the dodecahedron applies to life and the genetic code are:

Translation. What is the fundamental nature of genetic information; how much of it gets translated, and by what molecular mechanism? If codons are truly synonymous and mutations are truly silent, why do they empirically make an impact on all areas of translation? The Rafiki doctrine of 'symmetry first', when applied to actual genetic translation, is the most persistent and heretical position taken here, but it is also the most important one to resolve.

Teleology (the use of ultimate purpose or design as a means of explaining natural phenomena). What is the origin, history and degree of adaptation in the genetic code, and how has symmetry played its role?

Geometric and numeric. Can there be an optimum form to the structure of a code such as the genetic code, and would the numbers involved actually have an impact on that code's optimized behavior? In this case, the basic numbers are 3, 4, 20 and 64, which suggest the introduction of a dodecahedron. Is this just numerology, or is the code actually a combinatorial optimization of molecular sets, as the numbers suggest? (A quick sketch of the combinatorial coincidence follows below.)

All three of these areas will require years of effort from biochemists, computer scientists, physicists, chemists, cryptographers, philosophers and mathematicians - in general, puzzle solvers. Let's have at it!
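For readers who want to check the arithmetic behind these numbers, here is a minimal sketch in Python. It only counts ordered and unordered nucleotide triplets - it makes no claim about the Rafiki map itself - but it shows the combinatorial coincidence the text points to: 4 bases yield 64 ordered codons, while the number of unordered triplets is exactly 20, the same as the number of standard amino acids (and the number of vertices of a dodecahedron).

```python
from itertools import product, combinations_with_replacement

BASES = "ACGU"

# Ordered triplets: the standard codons. 4^3 = 64.
ordered = list(product(BASES, repeat=3))

# Unordered triplets: multisets of 3 bases drawn from 4.
# The multiset coefficient C(4+3-1, 3) = C(6, 3) = 20.
unordered = list(combinations_with_replacement(BASES, 3))

print(len(ordered))    # 64
print(len(unordered))  # 20

# For reference, a dodecahedron has 12 faces, 20 vertices and 30 edges.
```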

What is Information?
While working at Bell Labs in 1949, Claude Shannon formally marked the launch of information theory with his book titled The Mathematical Theory of Communication. It came at a time when biochemistry was blazing trails into genetics, so it was natural that information theory played a huge role in forming the discourse around the genetic code. Information is a conceptual entity of the universe that is gaining status with scientists in many areas. Physicists and computer scientists are understandably enamored with the idea that information is a physical reality. Information as a universal entity is a mostly mathematical construct, but information in our universe really does seem to have a degree of independence from its temporal physical embodiment. In this way it appears to possess features like mass or energy. Just as potential energy is easily translated into kinetic energy, information can be translated from one form to another. We practice this principle when we read and write, or speak to each other. In these cases information is encoded for travel in vehicles that we call languages, and languages are at the heart of information translation of all sorts. Languages are codes of communication.

In a general sense, information travels in any system that has a finite number of discrete choices, and it is quantified by a metric known as a bit. This is a bit confusing (pun intended) because the word bit is also used to describe a binary digit. Binary digits are merely symbols; they are symbolic vessels that carry different amounts of information. The actual amount of information carried by one binary digit will depend on a number of things. Consider a thermometer. It is an instrument used to measure temperature, and binary digits are like the markings on the thermometer. The value announced by the markings stands for a quantity of heat, but there is another measure to be applied to the information provided by the instrument: the value of knowing the measurement, and its precision, is distinct from the markings of the scale. Heat is heat, but we can have various amounts of information about it. For instance, just knowing whether something is hot or cold provides one bit of information. Knowing the temperature in a system with four possible temperatures provides two bits of information.
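In general, for N equally likely alternatives, the value of learning the outcome is H = log2(N): log2(2) = 1 bit for hot-versus-cold, and log2(4) = 2 bits for the four-temperature thermometer.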

It is true that a random binary digit contains one bit of information, but it is not true that a binary digit and one bit of information are synonymous. Anyone familiar with digital compression will recognize this difference. For instance, a digital image stored in one million bits of computer memory might easily yield to compression, and the information required to visually describe the exact same image might be compressed into, say, one thousand bits of data. This means that each bit in the original image contains only about 1/1000 of a bit of visual information. Of course, realizing this compression windfall requires knowledge of an encoding schema. It also requires that we find patterns in the data. Most significantly, it requires awareness of the system making and displaying the various potential forms of information contained by the original bits. (A toy demonstration appears in the sketch below.)

There is a big catch. There might be additional dimensions of information beyond just visual data contained in a file storing a digital image. This requires creative exercises in defining the quantities, forms and dimensions in various information systems. For instance, a secret text message, 'attack at dawn!', might be cleverly encrypted in the hypothetical digital image just described. In this case, there are two dimensions of information contained in the image, and one of them is in danger of being missed. The actual information content of each bit might decrease significantly by overlooking informative dimensions and using a careless compression scheme. Our ignorance of the encrypted text message within the image data very well might cause us to unknowingly throw out valuable information when we compress the original file.

There are several useful techniques we can use to identify the content and form that a quantity of information might take. It is bittersweet that the information identified by many of these techniques has been given the name entropy (H), because, of course, entropy has another meaning in physics as well. The two are very similar statistical concepts, and perhaps at a profound level they represent the same concept. My intuition is that they do, but there is a real danger of confusing the entropy of information with the entropy of thermodynamics. Nonetheless, entropy is the name we shall use here, and it will be a useful concept in our examination of genetic information systems. To limit the potential for confusion, I will use the term entropy here only in the context of information entropy.
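Returning to the compression example above: here is a toy demonstration using Python's standard zlib module (a sketch, nothing specific to images). Compressing a highly patterned string shows that each original byte carried far less than 8 bits of information, while random bytes barely compress at all.

```python
import os
import zlib

patterned = b"ABCD" * 25_000          # 100,000 bytes with an obvious pattern
random_bytes = os.urandom(100_000)    # 100,000 bytes of noise

for label, data in [("patterned", patterned), ("random", random_bytes)]:
    compressed = zlib.compress(data, level=9)
    bits_per_byte = 8 * len(compressed) / len(data)
    print(f"{label:9s}: {bits_per_byte:.3f} bits of information per stored byte")

# The patterned data compresses to a tiny fraction of a bit per byte;
# the random data stays close to 8 bits per byte.
```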

In broad strokes, information entropy is a measure of the value of knowledge. Specifically, it is the value of knowing the precise choice made by a system given a discrete number of choices. For instance, the entropy of knowing the outcome of an honest coin toss is one bit, because an honest coin toss is the epitome of one random bit of information. The coin might land heads or tails with equal probability. Knowledge of the actual outcome is worth one bit of information.

[Figure: 6 bits of information.]

However, the uncertainty of an honest coin toss is at a maximum. Conversely, a two-headed coin lands heads every time, so the uncertainty, and therefore the entropy, of any number of these absolutely rigged coin tosses is reduced to zero. We know the results without even tossing the coin, so the value of knowing them is nil.

[Figure: Zero bits of information from a two-headed coin.]

Similarly, if the coin is rigged somehow with a probability of 75% heads and 25% tails, then the entropy of knowing outcomes from this coin is calculated to be 0.811 bits per toss. This curious value is derived from the following formula, provided to us by Shannon, where P(x) stands for the probability that x will occur.
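H = - Σ P(x) · log2 P(x), summed over all possible outcomes x.

For the 75/25 coin this gives H = -(0.75 · log2 0.75 + 0.25 · log2 0.25) ≈ 0.311 + 0.500 = 0.811 bits per toss.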

Therefore, entropy is embedded in the concept of uncertainty. As previously described, it is no accident that this formula resembles the formula for thermodynamic entropy. Information entropy is the sum of uncertainties of a finite-state system over every potential state of the system. As uncertainty changes, or the number of potential states changes, so too changes the entropy of the system. The challenge of measuring this in any system lies in our ability to identify the number of potential states and their probabilities. This is generally how we shall approach our efforts to quantify genetic information, and in order to do this we will rely on some combinatoric properties of discrete mathematics. The following definitions are quoted from the website of Wolfram Research:

Discrete Mathematics - The branch of mathematics dealing with objects which can assume only certain discrete values. Discrete objects can be characterized by integers, whereas continuous objects require real numbers. The study of how discrete objects combine with one another and the probabilities of various outcomes is known as combinatorics.

Combinatorics - The branch of mathematics studying the enumeration, combination, and permutation of sets of elements and the mathematical relations which characterize these properties.

Just like the four-temperature thermometer, or two coin tosses, there are two bits of information in each position of a perfectly random sequence of nucleotides, which can be thought of as a genetic information channel or signal.
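For readers who want to experiment, here is a minimal Python sketch of Shannon's formula (nothing genetic-specific is assumed), applied to the examples discussed so far: the honest coin, the rigged coins, and a uniformly random nucleotide.

```python
import math

def shannon_entropy(probs):
    """Information entropy H = -sum(P(x) * log2 P(x)), in bits.
    Zero-probability outcomes contribute nothing to the sum."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

print(shannon_entropy([0.5, 0.5]))    # honest coin: 1.0 bit
print(shannon_entropy([0.75, 0.25]))  # rigged coin: ~0.811 bits
print(shannon_entropy([1.0]))         # two-headed coin: 0.0 bits
print(shannon_entropy([0.25] * 4))    # random nucleotide: 2.0 bits
```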

If we follow the information through translation a little further we find a curious thing: the information content appears to go up and then down.

The metric is in codon units, or 'triplet equivalents' (TE). The surprise for most people is that information content goes up through the tRNA phase of translation. This is because of wobble, ironically, since it introduces new choices at the third nucleotide position. There are 160 anticodons, whereas there are only 64 codons. Of course, there are typically only 20 amino acids, so the information content appears to fall - but does it really? The actual information content at each of these stages depends on the actual number of each molecular type present during translation, and the probability of each being used. We have yet to get a good handle on tRNA in this regard. In fact, there might be hundreds of tRNA molecules with slight variations in a given cell. How might these tRNA variations affect downstream information?

The key to genetic information at the amino acid stage is not only what is being used but also how it is being used. Sure, leucine is always leucine, but are there different ways to use it in translation? What about a tRNA that puts leucine into a peptide chain rapidly vs. slowly? This distinction, as with the thermometer example above, delivers one bit of information to translation. More importantly, the genetic code is now working in a second dimension - what is known as an additional degree of freedom. In sciencespeak we are talking about 'translation kinetics.' It is an experimentally proven reality that 'codon usage' impacts translation kinetics and thereby translation outcomes, and the consequences of this are not trivial. This is the mechanism that theoretically drives conformation changes in protein folding. The important thing to note is that translation has been experimentally shown to operate with more than one degree of freedom. In other words, knowing the amino acid sequence is not enough to allow us to determine the outcome of protein folding! We must have additional dimensions of information; therefore, it cannot be a one-dimensional code.
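To make the up-then-down pattern concrete, here is a small sketch of the upper bound on information content at each stage, assuming (unrealistically, as noted above) that every molecular species is equally probable. The ceiling is log2 of the number of distinct choices, expressed per codon, i.e., in triplet equivalents; the 160-anticodon figure is the author's count.

```python
import math

# Maximum entropy per triplet at each stage of translation, assuming
# all choices are equally likely (a deliberate idealization; real
# usage frequencies would lower each figure).
stages = {
    "codons (mRNA)":     64,   # 4^3 ordered triplets
    "anticodons (tRNA)": 160,  # the author's count, with wobble
    "amino acids":       20,   # the standard set
}

for name, n in stages.items():
    print(f"{name:18s} log2({n:3d}) = {math.log2(n):.2f} bits per TE")

# codons: 6.00 bits, anticodons: ~7.32 bits, amino acids: ~4.32 bits.
# The ceiling rises through the tRNA stage, then appears to fall.
```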

Beyond kinetics, what other degrees of freedom might there be? The next most obvious candidate is fidelity. Not every tRNA will be as reliable as the others. As we have just seen, when probabilities change, information changes. So, if two tRNAs deliver leucine at translation, but one does it more reliably than the other, then the two deliver different amounts of information. The most interesting prospect for how one leucine residue might differ from another during translation is related to spatial orientation. Since tRNAs vary in size by up to 40%, it is not unreasonable to expect that they behave differently in a spatial sense at the point where a peptide bond is actually made. If the spatial differences in tRNA impact the nature of the bond that is made, then spatial information is being delivered during translation. The simplest but most significant case would be a distinction between the formation of a cis-peptide bond versus a trans-peptide bond. Beyond that, it is not obvious how many choices might be made with respect to bond angle rotations. This is absolutely a plausible hypothesis about the mechanisms of genetic translation, and there is no empiric evidence that it doesn't work this way. In fact, common sense predicts that it does, and the observed results of translation support common sense. What is the most plausible structure for efficiently handling this spatial information?

I am told that the technology is not yet at a level where peptide bonds can be measured at the point of translation, but that time is coming. When it is finally proven that all bonds are not created equal, there will be a renewed interest in the genetic code. The downstream effects are real, and their study is the next great frontier. You heard it here first.

What is the genetic code?

This is the $64,000 question: what is the genetic code? The thrust of this rambling page is that we don't really know the answer to this exceptionally important question. The genetic code clearly must involve the process of making protein based on information in DNA, but precisely how is this done? Here are some equally important and related questions: What is life? What is DNA? What is a protein?

Life, as described by Erwin Schrödinger, is an aperiodic crystal. DNA is a crystal, and so is protein. Note the pattern. The cleanest view of the nature of the genetic code, then, is that it is a complex algorithm through which one crystal is formed according to specific properties of another, sovereign crystal. This requires a language in which two crystal types must somehow communicate information. Tough job, don't you think? It's hard to believe there's a way to consistently do it! Science works by simplifying complexity, but in the case of the genetic code we have oversimplified. The chest-beating certainty of the scientific community has obfuscated the fact that we still do not understand nature's nifty crystal-forming algorithm, affectionately referred to as the genetic code. We have mastered the well-worn correlation data between sequential components of each crystal type, but this is not the same as the language that communicates genetic information into a fully formed crystal. Science has dogmatically insisted that they are the same thing, but where's the proof? I doubt that we'll ever see it. News flash: the earth revolves around the sun. Note to astronomers: lose the epicycles.

All life on this planet is based on a genetic code. It is a system that somehow defines the construction of living things by directing the processes of molecular synthesis and replication. In 1953 James Watson and Francis Crick described a double helix as the structure of a huge molecule called DNA, which was known to reside in the cell nucleus and store the secrets of the genetic code. Excitement grew, and by 1960 leaders in science were predicting that nature would be laid bare within a year, creating justifiable fears. If man actually controlled the genetic code, what would happen to life on earth? Salvador Dalí seemed to anticipate man's dominion over nature and its relationship to a higher truth, as shown in his painting The Temptation of Saint Anthony.

The predictions and accompanying fears proved unfounded, however, since the code wasn't completely broken for another ten years. Entirely synthetic life has yet to be created, and today, despite tremendous strides in genetic engineering, there is a general disaffection with the code. It appears that the code alone was not enough to allow man dominion over nature. The full glory of protein synthesis remains a mystery, so we have now moved past the code and on to proteins themselves. According to conventional thinking, the genetic code is so simple and buttoned down that its logical foundation appears remarkably trivial. Instead, today's glamour boy is the protein - the idol of proteomics. It is the study of proteins and their many eccentric habits of folding that dominates the search. Proteins are so devilishly complex that breaking the protein makes breaking the code look like child's play. Fortunately, we have a tremendous amount of technology to help with the task compared to 1960, and some of the greatest scientific minds are focused on a solution.

Surprise! The genetic code is child's play. Enter the child. A funny thing happened on our way to dominion: somebody - everybody - forgot to break the other half of the code. A central premise of these pages is that the genetic code is far different from our conventional view of it. The genetic code somehow makes proteins, not just sequences of amino acids. (No, they are not the same thing.) I attempt through these writings to illustrate this obvious fact, and the implications of having missed it. I also intend to swing a machete in the general direction of any sacred cow that ambles into view. That's how children are childish.

Starting with a modified version of the standard Watson-Crick table we see the set of codons and amino acids that are a part of the genetic code. From the conventional perspective, the genetic code starts and ends with pairing codons and amino acids.

This table, in its various forms, has today come to represent the entire logic of the genetic code. I arranged this table on a somewhat eccentric scheme. The important thing to note, however, is that we can arrange this table any ol' way we like. There is no correct way to arrange and display this data according to our conventional view of the genetic code, and many different ways are in use. Since we can't say for sure where nature got the data to begin with, and we believe there is no absolute meaning in its arrangement, we are free to view the organization of this data as arbitrary. This is strongly related to the premise that the genetic code is one-dimensional, meaning that it contains only one degree of freedom with respect to the information it handles. The one dimension is the pairing of codons and amino acids. These two concepts are self-supporting to the point of forming a tautology: if assignments are arbitrary then the code is one-dimensional, and if the code is one-dimensional then assignments are arbitrary. Regardless, the paradigm of a one-dimensional code leaves no room for any absolute foundational logic. Adding, subtracting or shuffling assignments will change the output of the code, but leaves the foundational logic of the code unchanged.

Acceptance of this view is not merited by empiric data, however, and it is extraordinarily detrimental to our study and use of the genetic code. The accepted doctrine has prevented the asking of important and fascinating questions, many of which shall be addressed in these pages. I find the one-dimensional view of things absurd and untenable, especially in light of the discoveries of the past five years. Warning bells should be going off all over the place, but they have not. A view of a one-dimensional code is virtually impossible to rehabilitate in even the broadest of terms. There simply is more than one degree of freedom in the translation of genetic information, and all of the information must be embodied in our model of the genetic code. Some might quibble with the precise language of my description, but the conventional approach is yet unchallenged, and I therefore intend to aggressively challenge it.

From a Rafiki perspective the nature of the data in the above table is the furthest thing from arbitrary, and there is indeed a best way to arrange and view it. Genetic translation is an objective, molecular process. Genetic information must therefore be founded on objective molecular structures. There are at least two dimensions of information in the genetic code, and surely many more. Open to debate are the natures, forms and actual mechanisms of translation for these additional dimensions of information. Like the periodic table of chemical elements, there is a sublime logic to the assignment of amino acids to codons. Without this insight we are blind, and the genetic code goes from a periodic table of elements to a table of periodic elements, as viewed by Michael Teague.

Table of Periodic Elements Michael Teague

The conventional view of assignment data becomes particularly dysfunctional when we return to the premise of having a genetic code in the first place. We intuitively know that cryptic information is contained in one set of molecules and communicated to another set of molecules. We know this because we can witness the process and its results (protein synthesis). The key questions are: what information is in there, and how does it get communicated? If one accepts conventional wisdom, the answers are 'not much' and 'with a single, simple set of linear correlations.' These answers are incorrect, and the insistence that we cherish them as we have for so long has led to a truly comical view of the genetic code. More comical is the defense of it, as history will record. All indoctrinees are in the trance of a more than forty-year post-hypnotic suggestion, causing obvious anomalies of the paradigm to go unnoticed. This is most unfortunate, so it is our job to correct it. We will start with some basic questions.

What is the origin of the language - how did life get started?

With so many alpha-amino acids to choose from, and room for 64 in the code, why does the standard set contain only 20?

What is the logic behind the arrangement of nucleotides, codons and amino acids?

Since the mirror forms of alpha-amino acids (L and D) are equally stable and exist in equal proportions within the abiotic areas of the universe, why are all of the standard amino acids in the L form?

In such a beautifully rapid, accurate and efficient information system, why is there such an ugly redundancy? (The degeneracy in question is tabulated in the sketch below.)

With few exceptions, the above system appears to be used in all species, and presumably back through time. Given the ravages of evolution - changing the properties of organisms rapidly and constantly - one might expect some branching into competing dialects of the genetic language. At least the redundancy of the language should be subject to widespread change, since it supposedly has no absolute meaning. How could this exact system exhibit such dogged durability across time and throughout species?

We now know the shape of DNA - a double helix - and we know the functional significance of this shape, but this is only genetic storage. After all, it is a complex 3D information system, and shape imparts structure, function and meaning. What is the fundamental shape and meaning of the genetic code when it performs its magical role during protein synthesis?

Answers

To pick up a good biochemistry text today, one might imagine that either there are generally accepted, plausible answers to these questions, or the questions are too unimportant to merit any attention or real answers. To wit:

Why only 20? "The fact that all living organisms use the same standard amino acids in protein synthesis is evidence that all species on Earth are descended from a common ancestor."

Why all L-amino acids? "Like modern organisms, the last common ancestor (LCA) must have used L-amino acids and not D-amino acids."

And this is from an otherwise excellent, up-to-date, advanced-level biochemistry college textbook! That's it? That's the best we can do? At least say, 'we don't know and we don't care.' We mustn't pretend to know, or imply it's unimportant that we don't know. These aren't answers; they are fables, known as 'just so' stories. They equate to 'they are because they are, and they must need to be because they are.' However, not knowing these answers is a very unsettling concept for a lot of very intelligent people, creating more than a small component of denial.
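The redundancy in question is easy to tabulate. The following sketch counts how many amino acids are encoded by 1, 2, 3, 4 or 6 codons in the standard code; the degeneracy classes are standard textbook facts, hard-coded here rather than derived.

```python
# Degeneracy of the standard genetic code: how many codons encode
# each amino acid. These classes are standard textbook facts.
degeneracy = {
    1: ["Met", "Trp"],
    2: ["Phe", "Tyr", "His", "Gln", "Asn", "Lys", "Asp", "Glu", "Cys"],
    3: ["Ile"],
    4: ["Val", "Pro", "Thr", "Ala", "Gly"],
    6: ["Leu", "Ser", "Arg"],
}

sense_codons = sum(n * len(aas) for n, aas in degeneracy.items())
print(sense_codons)                              # 61 sense codons
print(sense_codons + 3)                          # + 3 stop codons = 64
print(sum(len(a) for a in degeneracy.values()))  # 20 amino acids
```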
There is another anomaly, a gaping hole so to speak, in the same texts. Leaf through them and what do you see? Information, lots of it, presented beautifully. They illustrate an abundance of knowledge representing some of the greatest achievements of human thought and investigation. The trend is toward shape, fit, three dimensions. There is a tip of the cap to the idea that the meaning of the covered subjects lies in their shapes and their space-occupying attributes. Some even provide 3D glasses, and most offer links to animated web sites to facilitate the spatial effects. The double helix is celebrated and dissected. Proteins are unfolded, folded and fit together. Electron-prowling domains are drawn and speculated upon. Yet at the point where the rubber hits the road, where the genetic code performs its magic, the descriptions revert to 1950s flatness, presented in living black and white.

Toto, I have a feeling that we're not in Kansas anymore. Dorothy, The Wizard of Oz

How can it be that the double helix has this wonderful relationship between form and function, yet the nucleic acid complexes have no form-function relationship during protein synthesis? Certainly the form-function of our genetic information storage is important, but what about the genetic processor? It is likely that it has a distinct shape as well, and that the shape of the processor is somehow logically related to genetic information storage and retrieval.

The foundation of the central dogma is that the information is co-linear. In other words, there is a line of information in DNA that is communicated, somehow, to a line of results in proteins. This is taken to mean that the code is one-dimensional. There is believed to be only one dimension of information passed from line to line - the one dimension being the identity of amino acids, or links in the protein chain. Due to faith in co-linearity, and due to the nascent digital information industry of the 1950s, the code itself came to be seen as linear. As I've already stated, this is a big mistake. There is nothing really linear about the code, unless you believe that a swarm of bees is in some way linear. Nature ignores lines; it's all about shapes. The existence and translation of information in matter is not mystical. It is a nitty-gritty process of quantizing and selecting possibilities from a defined set of possibilities. The mysticism lies in the process by which the universe methodically bootstraps information in an evermore-complex cascade of emergence. The correct term, I believe, is sequential. I will grant that the genetic code is to an extent co-sequential, but I will not concede that it is co-linear, because these illusions of linear paradigms are clouding our eyes and our brains. The linear indoctrination process is intense; I know, because I've been through it. However, we can loosen the reins on our senses and find some sense in the madness. With the help of some recent discoveries, some clear, rational thinking, and some bodacious art, we can see the order in the chaos.

M.C. Escher

Order and Chaos

In the middle of a table of periodic elements sits logic. As with any pattern there is an organizing force, something that drives the formation of complexity and order. The laws of nature are in play, and the patterns are there for us to see, if only we have the light and courage to look.

Furthermore

Although it is commonly accepted that the genetic code is presently known, and that therefore the fundamental rules for some type of universal translation are also known, this is false. In fact, nothing could be further from the truth. Indeed, we know of a somewhat predictive correlation between levels of translation - between a few limited sets of small organic molecules in selected organisms. We do not even know everything there is to know about this level, despite hyperbole to the contrary. Moreover, we have so far failed to identify the relationships between, and the multi-dimensional meanings in, vastly larger and more informative molecular sets. We are far from able to recognize all of the genetic information stored in the double helix - undeniably. We are also unable to interpret or explain the detailed patterns of genetic information spread across past, current and future life in the universe.

Much of this confusion is caused by a failure to define terms. Although key terms were initially defined in biology, our knowledge has now outstripped our definitions. There are many usages of the terms 'genetic' and 'code', and there are many casually accepted ways to combine the two. Although there is no official definition of anything that can reasonably be called the genetic code, this label has over time been applied to a variety of general and specific concepts, causing a huge element of confusion. Terms absolutely must be defined, especially central terms, and they no longer are. Most of the key terms today are bastardized, squishy, amoeboid and evolving. The most widely accepted usage of 'the genetic code' relates to the translation of nucleotides into amino acids; however, even an accurate definition of this function is lacking from the dictionaries, textbooks and literature. Understandably, the minds of investigators are out of sync on this issue, and many other issues suffer as a consequence.

I mostly get my terms from Webster, but science has a way of mutilating common parlance. As theories and ideas evolve, vestigial terms, like the shells of hermit crabs, become inhabited by the animus of new meaning. The term genetic is used here to mean: of, relating to, or influenced by the origin or development of something. The term code is used here to mean: a systematically arranged and comprehensive collection of rules. The genetic code could therefore reasonably apply to anything that fairly describes: a systematically arranged and comprehensive collection of rules relating to the origin and development of living organisms. I have let the term somewhat out of its classical cage. For it to be of any use, other than nostalgia, the genetic code must somehow operate on genetic information to animate living things. We have been told explicitly for decades that the genetic code has been broken, and our expectations have quite understandably soared. Yet the term 'genetic code' remains poorly defined by science, much less broken. Theoretically, we should at least be able to make proteins de novo with the genetic code that we have supposedly broken, and yet we cannot.
The unreasonably high expectations caused by the hyperbole of what we actually know have apparently not been met in our quest to learn nature's expanded secrets behind her codes of organic translation. Our thinking about the genetic code has, over the decades, become more detailed but progressively more confused. Models of the code are woefully inadequate, and the languages used to describe them are extraordinarily self-contradictory. I am now convinced that science actually has no working paradigm for a consistent exploration of genetic information. I am equally convinced that most people have failed to realize this. Heck, why should they? It was announced long ago that the genetic code had been captured, and this untruth has been repeated so many times that people truly believe the genetic code is leading a peaceful existence in captivity, inside tiny spreadsheets, stored in millions of textbooks around the world.

It is embarrassing, but the most honest answer to the question of what the genetic code really is, is that we really don't know. Anything short of this would be drinking the Kool-Aid. We might call something The Genetic Code, but that does not make it real. Our label and perception of it are necessarily artifacts of the human mind. They are working tools made of the material of the mind, for use by the material of the mind. Most people tend to forget this, and they confuse the issue further when they associate DNA with the genetic code. From there it is easy to mistake progress in studying the former for an understanding of the latter. DNA is a component of the code, but it is not the code. Sequencing genomes and understanding codes are decidedly different activities. One is data collection; the other is data interpretation. A broader, more robust, multi-disciplinary approach is called for. Biologists must team with physicists, mathematicians, computer jocks and philosophers to advance the cause.

The emperor today is quite naked, so I propose, throughout these pages, bits and pieces of cloth: a more general paradigm, a model of the genetic code that I call the Rafiki code. I have no illusions of breaking the code myself, and in fact I do not pretend to propose a new comprehensive genetic code here. This, I think, is impossible. I am proposing something far more useful, but far more dangerous. I am proposing that we shift our paradigms of genetic information and the codes used to translate it. A paradigm shift in science, like religion, is usually an ugly, protracted, name-calling endeavor. Therefore, short of that, I hope to accomplish several things:

1. Examine the terms currently in use and ask if they are well defined, appropriate and useful.

2. Explore a general framework, or paradigm, for genetic information that can tie together seemingly disparate observations from separate fields.

3. Introduce new concepts, models and investigative tools that can serve as platforms for speculation and further model development.

Mostly, I am issuing an open invitation to all interested parties: Join the fun. There is a gigantic, important and fascinating puzzle of nature yet to be solved. It will likely take a collective imagination to solve it, and you never know who might be the one to add a new piece.

Discovery of the double helix coincided with the introduction of information theory and the dawn of the information age. With them came an explosion of digital computing technology. Broad parallels between organic and digital computers are striking, but the two systems of computation quickly diverge on their details. Computer metaphors of genetic information are useful, but care must be taken in using them. It is helpful to remember that human native languages inform our thinking, but more often they are apt to misinform our ideas of nature's complex phenomena. Life is a complex adaptive system rivaled by no artificial system of digital computation. Although it is valid to draw parallels between the two, it is also essential to meticulously highlight the differences. Most genetic metaphors suffer on this score from a sloppy application of native languages.

For instance, all digital systems might be called one-dimensional in the sense that they process information in symbolic binary strings. Digital computations require that information be reduced to these symbolic strings of binary digits. Storage and processing can be carried out in this single format because the digital computer is a linear finite state device. It is dependent on the precise state of the device at each step in the process, and every possible state of the device can be reduced to and completely represented by a string of zeros and ones. I see digital computers in this way as sequential, one-dimensional, dynamic mappings of logical relationships. Furthermore, computations are done in digital systems at discrete points by things known as data processors. Strings and data processors of sorts exist in organic computers too, but they have a vast number of informative dimensions, not just one.

The genetic code has famously come to be labeled one-dimensional as well, but this is a sloppy use of native language that misinforms our thinking. Genetic information is never reduced to a single format or processed in a single dimension. Nothing about genetic information comes close to one-dimensional. The all-too-familiar linear model of the genetic code was proposed many decades ago, but its validity remains to be demonstrated, and many of its original premises have already been disproved. The term linear with respect to the genetic code could be interpreted in several different ways. The nucleotides in DNA tend to line up, but in this sense it is more appropriate to use the word sequential. DNA and all other molecular forms of information should not be considered well-behaved sequences of points that form lines, but rather mischievous sequences of shapes that form other shapes. Alternately, a mathematical relationship is considered linear when every input produces one and only one output. Here again, this description cannot be applied to the genetic code. It is not clear to me what people mean when they describe the genetic code as linear and one-dimensional; nonetheless, this is how it is often described.
Our unfortunate insistence on calling the genetic code one-dimensional owes a good deal to the fact that codon and amino acid correlations were discovered early in investigations of genetic information. Each nucleotide triplet, or codon, is mostly linked in protein translation to one and only one amino acid in a peptide chain. It was erroneously claimed that this was the only sort of information a codon could carry in translation. Subsequent investigations have proven this to be completely false, and the idea of a one-dimensional codon is now untenable, yet the linear philosophical doctrine marches on with nary a question to its validity.

The most detrimental effect of using the terms linear and one-dimensional in reference to the genetic code is that an additional term - simple - frequently gets tacked on. I invite anyone to demonstrate or explain the simplicity of genetic information and genetic translation, at any level. Attempts to do so suggest a failure to recognize that tiny portions of a computer program do not make sense of the entire program. The portion of the genetic code with which we are vaguely familiar is but a tiny subset of the logic behind translation of genetic information. Such portions cannot be removed from the entire code and retain their meaning. Any expectation of simplicity in this, nature's zenith of complexity, will have a low probability of being met in reality. Unfortunately, there are few if any competing theories with which to compare the linear model. It is a sparse, comfortable, yet highly imperfect model, but for all practical purposes it is the only model presently considered. Here I attempt to define features that must be addressed by any computational model of genetic information, and with them we can question, examine, support or invalidate general theories.

The most obvious difference between a digital computer and an organic computer is that the former can be called one-dimensional, and the latter clearly cannot. The digital system deals exclusively with binary strings, but the genetic system deals with, among other things, molecular strings. Molecular strings are sequential polymers - macromolecules - that represent some of the many dimensions of genetic information. Each can be seen as a type of animated floppy disk of genetic data, and the information contained in the string exists in many dimensions as well. The beauty of molecular strings is that they provide logical, molecular, spatial sequences that can be leveraged in the time dimension. Organic computation always has a vital time element. With molecular strings, information can undergo a stepwise process of translation from one string to another. Each string is a discrete form of information requiring its own finite set of rules for translation. Humans can easily convert a molecular string into a binary string, but a molecule cannot, nor is it clear why it ever would. Moreover, genetic systems expand information beyond strings, into dimensions that cannot be as easily quantified. Stepwise genetic translations visit many states in many molecular forms, and none of them can be removed from time or space, or from the total system. At the very least, no single form of genetic information or level of translation can be separated from the dimensions of space and time in which it exists, whereas a binary string easily can.
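To make that last contrast concrete, here is a minimal sketch in Python of the reduction a human can trivially perform but a molecule never does: flattening a nucleotide string into a binary string at two bits per base. The encoding is an arbitrary illustrative choice, not anything found in a cell.

```python
# A human can trivially flatten a molecular string into a binary string;
# two bits per nucleotide suffice for the four DNA bases.
BITS = {"A": "00", "C": "01", "G": "10", "T": "11"}

def to_binary(dna: str) -> str:
    """Reduce a nucleotide sequence to a string of zeros and ones."""
    return "".join(BITS[base] for base in dna.upper())

print(to_binary("GATTACA"))  # -> 10001111000100
```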
Some less intuitive molecular dimensions of genetic information are contained in quantities, populations, interactions, co-dynamics, proportions and concentrations of various organic and inorganic molecules. With genetic information, time is of the essence, and space is of the essence as well. We can distinguish these sorts of epi-string information from the more obvious string information, but we must take care not to consider them epigenetic information. Organic data that is not strictly carried in a molecular string can nonetheless be considered genetic. We will return to this issue shortly.

A particular molecule, hemoglobin for instance, might exist in one of many potential states. Even in its standard form it is an animated molecule that depends on shape and mobility to perform its functions. But as any doctor can tell you, there is vital information in the number of possible states, the probability of each state, and the actual state of the entire molecular population at any given time. Additionally, and more complex still, there is entropy in the temporal fluctuations of these molecular states. The ontogeny of hemoglobin populations within an organism is contained in genetic information, and the rules of changing populations within changing environments through time are somehow there as well.
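The information carried by such a population is readily quantified. Below is a hedged sketch: Shannon entropy over a distribution of molecular states, with the hemoglobin state labels and their probabilities invented purely for illustration.

```python
import math

def shannon_entropy(probabilities):
    """Shannon entropy, in bits, of a distribution over molecular states."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# Hypothetical population snapshot: fractions of hemoglobin molecules in
# each of four assumed states at one instant in one organism.
hemoglobin_states = {"deoxy": 0.25, "oxy": 0.60, "carboxy": 0.10, "met": 0.05}
print(shannon_entropy(hemoglobin_states.values()))  # ~1.49 bits per molecule
```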

Unfortunately, because of the staggering breadth and complexity of these epi-string dimensions of genetic information, we are precluded from examining the details of these forms and dimensions here. However, we are able to identify some numerical fundamentals of genetic systems, and the dimensions of information contained within many forms of molecular strings are presently not that far from our grasp. We begin with a thumbnail overview of a cell, and an outline of the master loop in a genetic translation program.

Master Loop

These are necessarily over-simplified diagrams to help us build a structure for tracking information as a genetic program processes it. Science must simplify to enlighten. A more accurate map of the genetic code would resemble a map of the worldwide web, with a riot of interconnecting reversible arrows and overlapping relationships; in mathematics such structures are the subject of graph theory. Each level in this master loop has an algorithm, or set of rules, that dictates which translations are done to which portions of the data. Each new dimension of data is passed to the molecular forms at the next level. If we consider that the program is given a certain amount of information, INITIAL(Info), then we can see that the information is processed in many forms before a value is returned by the function SURVIVE(Info). If the value returned by this function is TRUE, then the loop continues. If the value is FALSE, then the loop terminates. We know that every living thing on the planet has a record of perfect, unbroken strings of ancestral TRUE values returned by the function SURVIVE(Info). Therefore, the data can be grouped on this criterion into a common set. But beyond that it is difficult to imagine what the exact nature, origin, or history of the data might be.

This is where we run into a sticky problem with our definitions of two words: genetic and epigenetic. Epigenesis is a process of successive differentiation, a form of growth or development like adding layers to an onion. Aristotle first described it, and it remains an important concept today. The master loop of translation is well described by the term epigenesis, and this process is truly at the heart of genetic translation. However, this observation is insufficient to merit global replacement of the term genetic with epigenetic. We would otherwise be forced to discard the term genetic altogether in this discussion. Note that the first division of a zygote begins a progressive differentiation in an organism. Taken this literally, all subsequent translation - the second cellular division and everything after it, including mRNA transcription - must be considered epigenetic. At the very least we would need an arbitrary distinction, a judgment call, drawing the line on the meaning of successive differentiation. This is a false choice, and making it has negative consequences. These semantics do not provide ample reason to abandon the definition of genetic, in my opinion.

The prefix epi means next to, apart from, outside of, or surrounding. I will not confuse it with the concept of multiple iterations, or recursions, because these progressive translations are required in decompressing any genetic message. From this standpoint, I will use the term epigenetic to stand for anything that is apart from or outside of something genetic. The irony is not lost on me that although the hallmark of organic computation is epigenesis, the term epigenetic will stand for that which is outside genetic. Otherwise, we would need one rather impotent genetic code and thousands of tiny epigenetic codes. The word epigenetic here is just innately ambiguous. The following diagram will help us further simplify and visualize the metaphor.

In a broad sense, DNA is data, the genetic code is a processor, and the output is a value returned by the function, SURVIVE(Info). Each piece of data can contribute to and be evaluated by its impact on the return value of SURVIVE, but this is the highest level of processing within an organism, and there are countless billions of computations and translations for virtually all of the parts of genetic information. Because there are vastly more translation paths than possible return values, the genetic code cannot even be close to linear at the highest interpretive level. A changing and unknowable environment obligates us to acknowledge this reality of translation. In other words, despite our desire to believe otherwise, there is no direct cause and effect or one-to-one mapping at higher levels when shuffling any isolated part of genetic information at lower levels. Specific changes in information can change survival and many levels of information in between, but they cannot do it when removed from the context of the whole. Genetic information and its translation are context dependent. In the most general possible sense, we could expand the metaphor back through all time and across all living things. However, this is not the common interpretation of a genetic code metaphor. Science started its investigation with a more extreme reduction and therefore an oversimplified view of the process. Now a restricted slice is taken from the program to stand for genetic translation. Typically, several levels are compressed into one metaphorical level, and they comprise a single dimension of information for a single act of translation that we commonly call The Genetic Code.
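The master loop outline reads almost directly as code. Here is a minimal runnable sketch, assuming placeholder functions for each level's algorithm and for SURVIVE(Info); the bodies are invented, and only the control flow comes from the outline above.

```python
def run_program(info, levels, survive, max_generations=1000):
    """Skeleton of the master loop: each level's algorithm translates the
    data handed to it, and SURVIVE(Info) decides whether the loop continues."""
    for generation in range(max_generations):
        for translate in levels:      # e.g. transcription, tRNA charging, ...
            info = translate(info)    # each level has its own finite rule set
        if not survive(info):         # FALSE terminates the lineage
            return generation         # length of the unbroken TRUE record
    return max_generations

# Toy usage: identity levels and a survival test that always returns True.
print(run_program("Info", levels=[lambda x: x], survive=lambda x: True))  # 1000
```

The compressed paradigm discussed next collapses the inner loop's several levels into a single table lookup.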

This paradigm has the undeniable advantage of simplicity. All of the rules of translation from DNA to protein can be considered in one small spreadsheet - a simple, linear, one-dimensional genetic look-up table.

Genetic Look-up Table
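Taken literally, the spreadsheet paradigm amounts to nothing more than the following sketch: a Python dictionary holding a fragment of the standard codon table, and a lookup loop that consults nothing else. This is offered as an illustration of the paradigm, not an endorsement of it.

```python
# The "one small spreadsheet" taken literally: a fragment of the standard
# codon table as a lookup (only a few of the 64 rows shown).
CODON_TABLE = {
    "AUG": "Met",
    "UUA": "Leu", "UUG": "Leu", "CUU": "Leu",
    "CUC": "Leu", "CUA": "Leu", "CUG": "Leu",
    "GGU": "Gly", "GGC": "Gly", "GGA": "Gly", "GGG": "Gly",
    "UAA": "STOP", "UAG": "STOP", "UGA": "STOP",
}

def translate(mrna: str) -> list[str]:
    """One-dimensional translation: read codons left to right, look each
    one up, stop at a STOP codon. Nothing else is consulted."""
    peptide = []
    for i in range(0, len(mrna) - 2, 3):
        residue = CODON_TABLE[mrna[i:i + 3]]
        if residue == "STOP":
            break
        peptide.append(residue)
    return peptide

print(translate("AUGCUGGGUUAA"))  # ['Met', 'Leu', 'Gly']
```

Everything the one-dimensional model permits translation to know is in that dictionary; that is precisely the complaint developed below.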

It might turn out that all of the intermediate translation functions cancel out in a single dimension to leave us with this nifty approximation (in selected organisms), but it is unlikely. In fact, I argue that it is already disproved from many angles. Besides, the low algorithmic complexity of this one-dimensional shortcut ignores entire levels and the dimensions of information they carry. It necessarily compresses all of the data from DNA into amino acid sequences, and the data was presumably already compressed to a fair degree before finding its way into DNA. Compression of this sort is always intellectually dangerous, because baby and bathwater can sometimes resemble each other. Some microorganisms are known to use the same nucleotide sequence in making different proteins - talk about compression! There is no obvious need for a genetic translation program in the real world to compress information much further through these levels - unless nature somehow provides no alternatives. Philosophically, it seems that a genetic program would benefit from preservation, or even expansion, of information in one form or another during these translation steps.

Furthermore, just by comparing the shape of DNA to the shapes of proteins, there appears to be a need for rules of translation pertaining to spatial information, and sadly no clues are given by this spreadsheet. The exact rules must then be located somewhere in the compressed levels of the shortcut algorithm, but prolonged searches have yet to find them. They never will. In fact, the first guess made by scientists was that we don't need any spatial rules at all, that all the information required during folding is somehow contained in this table. The task of finding it here is a fool's errand; it was wishful thinking in the first place. The spreadsheet could turn out to be a valid simplification in restricted situations for some dimensions of information, but it seems recklessly narrow as the basis for a universal proclamation as The Genetic Code. Regardless, it is clearly inadequate as a basis for a more robust understanding of genetic translation. Conversely, it might turn out that this facile metaphorical compression is actually harming our understanding of genetic translation. Remarkably, high-level investigators have shown little interest in even asking this question. The dogma is tenaciously thick and definitely hardened around this one.

Granted, the familiar, one-dimensional metaphor is widely cherished, but the terms commonly used to describe and apply it are imprecise at best, and negligently incorrect at worst. At the very least we should change its label from The Genetic Code to a compressive shortcut map of codon and amino acid correlations in selected organisms. It isn't such a glorious title, but completely misleading is the article The in the current one. It indicates that the one and only genetic code is completely contained in the spreadsheet, but we have discovered many non-canonical spreadsheets, and we should perhaps either number them (TGC1, TGC2 ... TGCx) or at least show courtesy to schoolchildren and change The to A Genetic Code. More importantly, we already know that valuable genetic information is missing from these spreadsheets, and it is missing in the exact dimension that they are meant to stand for, as well as many others. Consider the following.

This diagram is a shortcut symbolic representation of the translation steps for two nucleotide sequences, A and B, into two distinctly different folded proteins. Both translations pass through a level of polypeptide known as the primary sequence of amino acids. The spreadsheet provides a fairly reliable sequence cipher, and so the one-dimensional paradigm takes the bold step of saying that primary sequence is synonymous with primary structure. How do we know this? We take the next bold step and say that since primary structure determines tertiary structure, which is the ultimate shape of a folded protein, all we need is the primary sequence. Proteins are all about shape, so these are fantastic steps.

If nucleotide sequences A and B are identical, then one should expect that both folded proteins will be identical. However, this is rarely true in nature. More interestingly, if a nucleotide sequence change is made in a single organism from A to B in which one codon, say for leucine, is substituted with another codon also for leucine, then the primary sequence will not change. These are called synonymous codons, and this sequence mutation is called a silent mutation. It is considered silent because the change presumably cannot be heard within the language. Several studies have shown that this premise is absolutely false. Identical primary sequences after silent mutations can consistently lead to different folded proteins. These folded proteins have different enzymatic properties, and enzymes are high-level components of organic computers - they process information at levels well beyond string formation. The explanation for this must be found in the spreadsheet if it is to serve as the code for turning DNA into protein.

Apart from folded proteins, silent mutations have demonstrated profound influence on translation of genetic information in many ways, including rate of translation, translation fidelity, signal peptide function, and the rules for splicing and amplification. Therefore, regardless of the mechanisms, the information contained in a single codon goes well beyond the primary sequence of amino acids. Silent mutations have been shown to significantly affect the competitive dynamics of natural selection, and they have even been implicated in human diseases, such as Hirschsprung disease and medullary thyroid carcinoma. In other words, a silent mutation can certainly change the value returned at the highest level of translation, SURVIVE(Info). If this is silence, I'd hate to hear a really loud noise.

None of these findings are anticipated by, or explainable within, a truly one-dimensional model. The theory of a genetic code with a single dimension of information translated across many levels of organic molecules should be in severe crisis from these discoveries, yet we choose to cling tightly to it. Why? More to the point, it was an overly optimistic, grandiose rush to judgment when the title The Genetic Code was first applied to the codon map. We might choose to call it such, but it is a gross misnomer nonetheless. By now we should clearly recognize that no human is familiar with any such code of translation on virtually any level. Perhaps man has glimpsed a few faulty lines of code, but despite great scientific strides, the logic behind the entire genetic program remains unarguably hidden from our view. Genetic information is far more complex than our present ability to study it.
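The look-up sketch shown earlier makes the problem mechanical. Reusing that hypothetical CODON_TABLE and translate function, a silent mutation is literally invisible to the table, so nothing in the table can account for the downstream differences the studies report:

```python
# Two mRNAs differing only by a synonymous Leu codon swap (CUU -> CUG).
a = "AUGCUUGGUUAA"
b = "AUGCUGGGUUAA"
assert translate(a) == translate(b) == ["Met", "Leu", "Gly"]
# Identical outputs: within the spreadsheet paradigm the mutation is
# inaudible, so any measured difference between the folded proteins
# must ride on information the table simply does not represent.
```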
We certainly should not claim to have cracked the genetic code, and we will surely benefit in the future by speaking in terms of multi-dimensional translation, across entire programs, and throughout time. These are not trivial semantics, because our historically loose choice of terms, and our inability to even partially define them, has had a dramatic, counterproductive impact on our thinking. We have a map of codon information in one dimension, and we should consider it as such. Nothing more. Even at that, it is flawed.

I have asked dozens of working scientists, scoured numerous references, and I have yet to be given an appropriate, useful definition of the genetic code. Definitions, by definition, are whatever we say they are. It's like the price of gold: it's not the inherent value of the metal, it's whatever we're willing to accept as a value. The value of a definition should track with its value in describing something in a useful way. On that score, I will share what is, in my opinion, the best definition that I have found. It comes from Horton's Principles of Biochemistry, Third Edition:

genetic code. The correspondence between a particular three nucleotide codon and the amino acid it specifies. The standard genetic code of 64 codons is used by almost all organisms. The genetic code is used to translate the sequence of nucleotides in mRNA into protein.

This is a most artful state-of-the-art definition. It has one of the required appropriate qualifiers, and it avoids the classic pitfalls of including words like universal, linear, one-dimensional and simple that are thrown around with abandon elsewhere. But what does it really tell us? It tells us that the genetic code is a map with only one dimension of information, that dimension being the correlation between codons and amino acids. It says that there are always sixty-four codons, which is probably wrong, and it should stop there, because from there it becomes even more misleading.

The final sentence should be replaced with another qualifier: The genetic code is part of an unknown algorithm that living things somehow use to make proteins. The standing definition strongly implies that the genetic code and nucleotide sequence information - and nothing else - are all that is required in making proteins. This is the stated rule, and of course there are exceptions to every rule, but in this case the exception is the rule. We have yet to find even a single case where the rule applies.

The reason the definition is invalid is that Da Vinci was right. Organic molecules are composites of shapes, not lines of identities. Knowing the primary sequence of amino acids has proven to be entirely different from knowing the shape of a protein. Information in addition to primary sequence must be brought to bear during translation. We presently do not have these extra dimensions of information, and we certainly do not have the rules of their application during translation. Predicting the primary sequence of a protein is relatively easy, but predicting the shape of a folded protein requires tremendous effort. We must compile gigantic piles of data, which must then be thoroughly and creatively massaged, and even then we have less than sterling results. More curiously, the best data for massage is nucleotide sequence data, not amino acid sequence data. This implies that the mysteriously missing information might also be found somewhere in nucleotide sequences, more so than in amino acids. Did we really expect it to be that simple?

Consider the problem on its face. DNA, for all intents and purposes, is stored in a single changeless shape, whereas proteins are defined by diverse and dynamic shapes. Whatever the process, the translation of information from DNA to protein involves a shift from monotony to diversity. Molecular identities expand only from four to twenty, but bond identities expand from sixteen to millions. The magic of translation and the expansion of organic information therefore lie in the bonds between the molecules of molecular strings. This should be the focus of any investigation, definition, or map of the genetic code.

There are numerous precedents where man maps nature, but then later discovers new information that must be added to the map. This does not mean that the original map was wrong, or useless; it just means that it is outdated and incomplete. Moreover, maps can never be complete, because in some way they must be an encryption, a compressed view of the universe. The universe itself is perhaps incompressible. We can, however, map slices of the universe and be informed by them - but context remains all-important. I am sure that the 8th century Spanish priest Beatus made a significant and useful contribution to human achievement when he drafted his map of the known world.

World Map, circa 776

His main motivation was religious, and so the symbols are related to each other according to religious as well as geographic meaning. This map might well have been labeled Map of the World at the time it was created, because at that time it represented the whole known world. However, Beatus surely labored under serious misperceptions, and he quite obviously lacked some key information. He believed that he was mapping a stationary flat object at the center of the universe. He also erroneously concluded that humans with whom he could speak had surveyed all available landmass in the world. Actually, from that perspective it's really not a bad map. Incompleteness does not diminish the utility or profundity of the insight behind the map. Counterintuitively, and unfortunately for Beatus's position in mapmaking history, the earth has since been found to be bigger and rounder than expected, and it was incidentally found to be orbiting the sun. In retrospect, Beatus might have improved his map somewhat by wrapping it around a ball, but this would not have been viewed as an improvement at that time. The addition of an extraneous dimension to 8th century reality could not possibly have been seen as an improvement. Beatus could have argued in favor of such a thing, but they surely would have killed him for it.

Similarly, the initial efforts to map a universal genetic code now appear increasingly parochial. We now know for certain that the scale, scope and number of dimensions for the genetic code in our current map are considerably off, and that a substantial quantity of critical information is missing from it. Compared to the Beatus map, the genetic spreadsheet is a far less accurate compression of the universe. Suggesting that extra dimensions be added to the map won't get you killed today, but it won't make you very popular either. Our present mapping of amino acid correlations does not even visually provide an accurate structure for a code of genetic information, let alone a comprehensive mapping of its logic. It is curiously difficult to find anyone today willing to make a stiff defense of it. The codon map is a familiar icon, but it quickly fails on many important details. It is now difficult to define, and no useful definition of it should include anything but a historical reference to the term the genetic code. It would be as if, because of Beatus and his map, we decided to call one hemisphere of earth the world and the other hemisphere the epi-world. Of course the first genetic spreadsheet was made at a time when man was mostly ignorant of the things he was attempting to map. Concepts such as complex object-oriented programming, global computer networking, and the human genome databank were but a twinkle in the cartographer's eye. Nonetheless, symbols in our current map have assumed elements of a religious doctrine.

The Genetic Code, circa 2003

Maps are eminently useful, so it is common to forget what they really are. They are representations of things, but they are not the things themselves. It is easy to get carried away with maps, and to let languages cloud our definitions and thinking about the actual things. The word synonymous is likely to appear in conjunction with the codon map. Synonymous is a powder keg of a word, especially with maps. Technically it means the same as or similar to, but this is itself misleading. First of all, people hear the same as and forget about or similar to. Secondly, nothing in the universe can truly be the same as something else. This has proven especially true on an atomic level, and it is this very fact that gives the universe its informative nature.

By way of illustration, a tenet of theoretical physics called the Pauli exclusion principle was the final piece in the quantum puzzle leading to a more accurate mapping of the atom. Wolfgang Pauli contributed the spin quantum number to numerically describe electrons within an atom. The Pauli exclusion principle says that no two electrons within an atom can share the same four quantum numbers. By adding one binary digit to the existing three quantum numbers, Pauli made the ultimate statement of dualism, and matter was forever theoretically transformed into a purely informative perception. With the stilted parlance of a logician, and the poetic license of, well, a poet, I will paraphrase and expand the Pauli exclusion principle: Existence is dualistic. In everyday language this means that no two things can be the same. Two things that are the same thing are not two things; they are one thing. Interestingly, this does not say that two things that don't exist cannot be the same thing. There are many ways to exist, but perhaps only one way to not exist. Pauli must have noticed something to this effect in an atom, where we can usually count discrete things that we know as electrons, whatever they might be in reality out there. Pauli essentially said that irrespective of how similar these things appear, no two of them are the same thing. Pauli stipulated that every electron must somehow be different from every other electron. We will find it quite useful to invoke this principle as we begin to re-examine our perception of the genetic code.

The precise definitions for any one-dimensional genetic metaphor have now scattered in the wind, but we cling to their memory, and we continue to muddle our thoughts with their nostalgic usage. Attempts are no longer made to even partially define terms in a consistent fashion. It is now freakishly possible to speak of a non-synonymous synonymous codon. In what scene shall the Mad Hatter appear? We might at least try to pin down self-consistent definitions. In the meantime, the following diagram will illustrate the linear paradigm of genetic translation as classically described.

It is instructive to re-examine the outline of the master loop, and instead of compressing the four levels of translation into one level that stands for the genetic code, we can visualize a small computer metaphor between each level in the loop.

Rather than one set of data that is translated in one dimension from nucleotide sequence to folded protein, we can imagine that genetic information is translated through several forms before folding into proteins. Translation at one level produces data for the next level. The information in the DNA string is preserved as it passes through several different strings, changing molecular form as it goes. Each translation level has its own set of informative rules that determines, or processes, the information for the next level. These algorithms are not responsible for making the subunits of the string in each step, only for aligning them in a sequence. In other words, fully formed, individual molecules, such as tRNA, are available to this process during this phase of translation. The accounting runs: 3 DNA nucleotides → 3 mRNA nucleotides → 1 tRNA → 1 amino acid.

The first level of the process inputs Info from DNA into a function that translates it into mRNA. There is a finite set of rules, an algorithm, to describe this function. The Info from mRNA is fed into another function with another finite set of rules that translates the Info into tRNA. Perhaps investigators have yet to give us enough experimental data at this level to track the string Info accurately through a string made of tRNA molecules, but we can establish parameters at each level and take broad measurements of information entropy. These entropy-tracking protocols can then be used to query the process and broadly investigate the form and flow of information within and between the various organic programs provided by nature.

Having said all that (actually, having cut and pasted all that from various writings), I will now propose some choices. We can define the genetic code in a broad sense or very simply in a restricted sense. Traditionally, we have chosen an extremely restricted definition, but even here there are choices. Either the genetic code embodies the rules for sequencing - and only sequencing - amino acids, or it embodies the rules for making entire proteins. Although we were deluded for quite some time into believing that these choices are identical, they clearly are not. Our definition of the genetic code should at least be taken to include the information in proteins, not just their sequences. Agreeing on a definition is a good first step toward actually understanding the code and how it works.
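A minimal sketch of such an entropy-tracking protocol, assuming invented toy sequences and hypothetical tRNA names: each level's string is treated as a sequence of symbols, and its empirical Shannon entropy is measured in bits per symbol.

```python
import math
from collections import Counter

def string_entropy(symbols) -> float:
    """Empirical Shannon entropy (bits per symbol) of a molecular string,
    treating each subunit identity as one symbol."""
    counts, n = Counter(symbols), len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Track one toy message through the levels of the master loop.
dna = "ATGCTTGGT"
mrna = dna.replace("T", "U")                  # coding-strand transcription
trnas = ["tRNA-Met", "tRNA-Leu", "tRNA-Gly"]  # hypothetical tRNA string
peptide = ["Met", "Leu", "Gly"]

for level, s in [("DNA", dna), ("mRNA", mrna), ("tRNA", trnas), ("peptide", peptide)]:
    print(f"{level:8s} {string_entropy(s):.2f} bits/symbol")
```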

How much genetic information gets into proteins?


Previous sections have described the nature of information in general, and discussed a broad perspective on the genetic code. Now we are curious about the amount of genetic information that finds its way into protein. Knowing this may help pin down how it gets there. Traditionally, this issue gets framed with respect to the number of possible combinations of nucleotide sequences, which has come to stand for the range of possible genetic information that must somehow propagate forward in translation. Here we will attack the question in reverse, starting with the number of combinations in protein space and working backward to the nucleotide sequences.

We will consider a protein to be defined as an actual conformation of a string of any number of amino acids. The space of all possible proteins includes all lengths, sequences and space-filling patterns of such biopolymers. If we assume, for purposes of discussion, that every protein in the space of all possible proteins is equally probable, then the information entropy of each protein is easy to calculate: it will be log base 2 of the number of all possible proteins. Furthermore, each nucleotide sequence that maps to, or makes, each of these proteins must somehow contain the proper amount of genetic information. In this over-simplified circumstance we can view the two sets of biopolymers as two distinct alphabets whose symbol usage is evenly distributed. We will call the alphabet of all possible proteins P and the alphabet of all possible nucleotide sequences N. In the simplest possible case there would be one character in N for every character in P. In reality this probably could never be the case, and empiric data suggests that it is in fact not the case. So now we must begin considering the absolute and relative sizes of the two alphabets, and the logic that maps one to the other.

Consider the processes in nature that restrict the sizes of these two alphabets. Any restrictive process will decrease the amount of genetic information that finds its way into proteins. For instance, let's say that we put restrictions on the number and type of residues that can be included in a protein. Consider an extreme example where the process is limited to strings of only leucine, and the strings can only be between two and nine residues long. How much genetic information can be put into these proteins? This hypothetical case of severe restriction is merely a subset of the traditional picture, much in the same way that reality must be a subset of theoretical P and N. We will merely stipulate that only twenty amino acids and four nucleotides (actually five) are available in standard genetic processes, and we will take the restrictions of this standard set to be synonymous with the term all possible. This stipulation means that P and N are in fact vanishingly small subsets of their theoretical counterparts, but empirically we know that nature has found a way to impose the same set of restrictions in almost every organism.

In our hypothetical case of leucine-only proteins, without considering dimensions of information beyond sequence, there are now only eight ways to make a protein. But what about the two obvious conformations of di-leucine, cis and trans di-leucine? If we string four leucines together, then based solely on the cis and trans states of the three peptide bonds there are eight possible conformations. If we string eight leucines together, the seven bonds allow 128 distinct cis-trans conformations.
Considering only cis and trans bonds, there are 510 conformations in this hypothetically super-restricted protein space, but what is the actual space of all possible poly-leucines given all parameters? I should think that it is much higher than 510, and certainly higher than 8. What are the circumstances in which one or more of these poly-leucines might consistently appear within a cell, or can they appear at all? More importantly, if more than eight appear, where could the information to create these restricted proteins get into the system?

So generally the question becomes: what are the parameters available in a cell that determine P? Once we impose any restriction on P, beyond the standard stipulation above, we create a subset of P; call it P′. How are these restrictions imposed, at what cost, and to what benefit? Restricting P to P′, the discrete amino acid sequences, is an extremely severe restriction, so one must wonder how such a thing could ever be accomplished. For every sequence there is an incredibly large number of ways that the sequence might fill space. Dogma claims that one of these ways is so much more probable that it will occur in all instances, but this is statistically implausible, and made less believable by the absence of any evidence that it actually happens. Looking at it from this reverse angle gives one a completely different perspective. Now we are contemplating a search for an active process of placing restrictions. In reality it is a process of eliminating information, and the erasure of information requires energy. From an engineering standpoint, it is difficult to imagine how life could achieve this monumental task, let alone why it would be beneficial to do so. Is it actually so desirable to think that this information must get eliminated? The dogma, however, assumes there need be no mechanism or cost involved in eliminating it. Furthermore, dogma fails to recognize the severity of the restriction being proposed. There must be an equally severe mechanism to impose it. No such restriction has been proven, and no such mechanism has been found.

Consider a single string of 100 amino acids. Let's suppose that each adjacent pair of amino acids is known to exist in one of 10 possible conformations. There are then 10^99 different conformations for this protein that share the same sequence. Call this number H, for huge, and call the one conformation that is most likely to occur H′. Think of the pre-folded sequence as a ball bearing at the top of a hill. Each potential conformation is like a hole or pocket that might catch the ball bearing as it rolls down the hill. Consider all the possible variables that might affect the ball bearing as it rolls: spin, humidity, temperature, wind, the gravitational pull of the moon. What are the odds of finding H′ from within H with each roll down the hill? How do we deal with the circumstances when any of these variables, or perhaps all of them, change independently? The classic paradigm superficially addresses this problem, and from it we have apparently built the equivalent of a tube for the ball to roll in - a straight shot to H′. The tube somehow protects the rolling ball from all variables that might deflect its course down the hill. This tube cannot be free of cost, and it cannot be invisible to observation. Where is this magical tube, and where is the bill for its construction?
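The counting in this thought experiment is easy to verify. Here is a short sketch; the per-bond state count of 10 is the text's assumption, not a measured value.

```python
# Poly-leucine space from the text: lengths 2..9, sequence identity only
# versus cis/trans conformations of the n-1 peptide bonds.
lengths = range(2, 10)
sequences = len(lengths)                         # 8 proteins by sequence alone
conformers = sum(2 ** (n - 1) for n in lengths)  # 2 states per bond -> 510
print(sequences, conformers)                     # 8 510

# The 100-residue version: 99 bonds, an assumed 10 states per bond.
H = 10 ** 99                                     # conformations sharing one sequence
print(f"H = 10^99 has {len(str(H))} digits")     # 100 digits
```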

It is perhaps easy to shrug off this portion of the thought experiment, as has been done, but keep in mind that we are only halfway finished. P′ sits between P and N, and we have merely restricted P to P′. Now we must consider restricting N to P′. There are approximately three synonymous codons for every amino acid in our string of 100 amino acids, so there are roughly 3^100 members of N that point to H′ - another astronomically large number. In reality, each member of N that points to H′ represents a different set of starting variables for the ball bearing's journey down the hill. Now we really do need that protective tube if we are to find H′, against all odds, every time.

Not all synonymous codons are created equal. We like to see them as such, but it is absurd to do so. The reason is simple: codons specify anti-codons, not amino acids. While two codons might be assigned the same amino acid, they also might be assigned different tRNAs to deliver that amino acid. More interestingly, the same codon can be assigned different tRNAs. Because of wobble, there are two and one-half times as many anti-codons as codons, and in a cell there might be more than one tRNA for every anti-codon. Why the expansion?

There is now a new alphabet to consider; call it T, for translation. T represents the alphabet of all possible strings of tRNA molecules. It sits between N and P, and translates one into the other. As we just saw, T is much larger than N. T is a more accurate representation than N of all the possible variables at the start of the journey down the hill of protein conformations. What this means is that T makes it much harder, not easier, for N to find H′, the needle in a universal haystack. One should suspect that the role of T is to ensure that a single H′ does not exist for every sequence of amino acids, because T is a lousy mechanism for restricting N to P′, and life would never select a lousy mechanism for this critical function. Therefore, we should begin to suspect that the primary role of a robust population of tRNA molecules is to expand P′ beyond the severe restriction of sequence-only translation. At some point in translation the importance of sequence gives way to the importance of conformation. It is the role of tRNA to see that this happens logically and consistently. T stands for ticket out of sequence space.

It doesn't really matter when or how actual protein conformations break from their restricted sequence space, just whether or not they do, and to what extent. If more than one conformation somehow consistently emerges from any given sequence, then the information content of all proteins must go up. The mechanism for storing and delivering this information must then be addressed. These are the kinds of basic questions that must be answered first, and then the process must be worked backward to determine the amount of genetic information that impacts mechanisms of translation. If we return to the idea of a space of all possible proteins, we can see that the space of all possible amino acid sequences represents a tiny, tiny, tiny subset of that space. The same questions apply. What possible information-handling methods could restrict genetic translation to this extremely restricted subspace? How could the one-dimensional model possibly account for the known instances where translation strays from this restricted subspace? Why would we expect life to choose this as the optimum method for genetic translation? Life in general emerges from a search for solutions.
In the case of translating genetic information, the search begins with the space of all possible proteins. We should expect life to search this space in a practical and clever fashion. Restricting the search to sequence only is sub-optimal and impractical. Why do we expect it - hell, why do we virtually insist on believing - that life does? There is no longer any practical utility in calling the genetic code one-dimensional. Originally it gave us license to throw away nucleotide sequence data and concentrate on amino acid sequences in studies of proteins. Curiously, the boys in the field are now scrambling to bring the nucleotide sequences back into those studies. Why? If the information they contain maps so conveniently to the subspace of all possible amino acid sequences, what use are they? The obvious answer is that nucleotide sequences contain more genetic information than can be contained by the subspace of all possible amino acid sequences. These additional dimensions of genetic information are being translated - somehow - into proteins.

To finally answer the question of how much genetic information finds its way into proteins: it depends. It depends on T, because T has the ability to expand the information that travels from N to P. The population of tRNA in a cell can expand the information in nucleotide sequences, and more robust tRNA populations can create more diverse sets of proteins. We need to recognize this reality so that we can begin to examine the mechanisms, and ultimately the code, behind this translation. What variables can tRNA control, and how does the genetic code specify these variables?
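Before moving on, here is the expansion claim in sketch form. Everything below is a hypothetical illustration of the Rafiki proposal - the tRNA names, the pool composition, and the idea that the winning tRNA biases bond conformation are all assumptions, not established biochemistry.

```python
import random

# Hypothetical pool: each codon selects from a population of tRNAs, and the
# tRNA that actually decodes it biases the conformation of the new bond.
TRNA_POOL = {
    "CUG": [("tRNA-Leu-1", "trans"), ("tRNA-Leu-2", "cis")],
    "CUU": [("tRNA-Leu-3", "trans")],
}

def decode(codon: str):
    """Return (amino acid, bond conformation) via whichever tRNA wins."""
    trna, bond = random.choice(TRNA_POOL[codon])  # crude population dynamics
    return "Leu", bond

# Synonymous codons, potentially non-synonymous outcomes: CUU can only yield
# a trans bond here, while CUG can yield either - information beyond sequence.
print(decode("CUG"), decode("CUU"))
```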

The role of stereochemistry and tRNA


Stereochemistry refers to the spatial arrangements of atoms in molecules and complexes. The linear model can account for no stereochemistry of a polypeptide beyond the sequential arrangement of amino acids.

Primary sequence - the sequence of amino acids in a polypeptide.

The linear model holds that nucleotide information is translated into primary sequences of amino acids. However, there is almost surely stereochemical information in some form as well. So I reject the widely accepted notion - the definition, actually - that primary sequence is synonymous with primary structure. I advocate separate definitions for these obviously different terms.

Primary structure - the structure of the polypeptide backbone before folding.

This is the insanity of Rafiki. Of all the ideas presented in these web pages, the notion that genetic information can be fundamentally stereochemical is the most unconventional. Nucleotides are translated into primary structure instead of just primary sequence - the heretical conclusions of a child unsupervised. A brief sketch of the cellular genetic mechanisms will serve as a base for more specifics of this concept.

Genetic information is stored in the cell (in the nucleus of eukaryotes) by DNA. This information is transcribed into messenger RNA (mRNA) and transported out of the nucleus into the cytoplasm. There the information is translated into polypeptides by a mechanism that involves mRNA, transfer RNA (tRNA), ribosomes (rRNA) and amino acids.

Determining genetic information at the level of the polypeptide is a simple matter of counting. Consider the case of a hypothetical sequence of 100 amino acids. Given all the possible synonymous nucleotide sequences that can produce the same sequence of 100 amino acids, how many discrete polypeptide backbones can emerge from translation before the protein begins to fold? The linear model says that at this point we are certain of primary sequence, and that all primary sequences are created equal. It says that the genetic code has no mechanism to impart more information into the molecule beyond the identity of its subunits, so the count of polypeptide backbones is unimportant. The Rafiki model says that primary structure is the real name of the game before the translated polypeptide begins to fold. The code has the ability to translate many primary structures that all share the same primary sequence, and each discrete structure has the potential to fold quite differently and therefore become a functionally different protein. Rafiki sees a much greater amount of genetic information getting through this step in translation, stored in the conformations of each peptide bond. This idea is clearly nuts, but is it true? Common sense says it should be, and the weight of existing evidence seems to support it. Why is it so hard to accept? Why not test it?

A codon gives only part of the information in the genetic code. Triplets are seemingly capable of translating only enough information to define the sequence of amino acids. However, of equal importance to the ultimate shape and function of the protein is the critical information regarding the nature of each peptide bond. For proper translation of stereochemical information, the genetic code looks beyond the individual codon. All of the genetic information taken together during translation defines the entire peptide bond, information which includes:

1. Amino acid identities.
2. Major bond configuration.
3. Bond rotations.

For this reason, the salient unit of information in the genetic code goes beyond the codon to specify the complete peptide bond. I will call this unit the pepton. Mapping this information is discussed elsewhere. The following illustrations are designed to demonstrate the three components of the peptide bond.

The peptide bond combines two amino acids, or tetrahedrons, and it is formed by a group of atoms from each side of the bond called the peptide group, illustrated here by a rectangle. The first amino acid contributes to the peptide group its carbonyl carbon (C1), its α-carbon (alpha 1) and the carbonyl oxygen (Oxygen). The second amino acid contributes its amide nitrogen (N), its α-carbon (alpha 2), and the amide hydrogen (Hydrogen). These molecular schematics show relative position, not scale. Although there is some wiggle to the group as a whole, for the most part the bond is planar, and the group can take one of two major configurations: cis or trans. The above illustration demonstrates the trans configuration. Once formed, the bond will not spontaneously switch configurations; however, there are enzymes capable of catalyzing the switch, which involves a 180° rotation around the axis of the peptide bond.

As we can see here, the cis configuration of the peptide group has a much greater steric hindrance between the two amino acids. This means that the atoms get in the way of each other, and they do not like this. For this reason nearly all of the peptide bonds in naturally occurring proteins are in the trans configuration - at the very least, 95% of them are trans peptide bonds. However, the cis configuration can and does naturally occur. Apart from thermodynamic folding, is there genetic information, or a translation mechanism, to dictate when a cis bond should occur? When a cis bond does occur, it almost always involves an amino acid called proline. Proline is like a latch that locks the backbone in place. Cis bonds and proline are prominent in protein structures called turns, where the peptide backbone makes a dramatic change of direction. Proline and cis bonds therefore have a significant impact on the final shape and function of a protein. It logically follows that the genetic code should somehow specify bond configuration during translation. There are a variety of mechanisms that might do this, and whether or not it happens is a key to the information content of the protein.

Despite a decided predilection for the trans configuration (perhaps all bonds are created trans), there is another opportunity for the genetic code to specify the shape of the polypeptide backbone. Each peptide bond has a measure of rotational freedom around the bond between the α-carbon and its peptide group member on either side. The rotation between the α-carbon and the amide nitrogen has been named phi. The rotation between the α-carbon and the carbonyl carbon has been named psi.

Together the two rotations determine the bond angle between adjacent peptide bonds. The planes of the peptide groups are not typically parallel in an actual protein, and the location of the R-groups can vary relative to the polypeptide backbone. This is a critical factor in determining the ultimate folding of the protein, and therefore its function. It is not unrealistic to expect a system that determines the ultimate function of the overall organism to somehow determine this essential element of organism building - before folding. The overall peptide bond angles are determined by the combination of these two rotational angles, phi and psi. These combinations can be good or bad from the standpoint of steric hindrance, and therefore of energy stability. A plot of these combinations was made by Ramachandran and co-workers, and was later confirmed by empiric data. A simplified view is shown below: (Original URL for this picture: http://www.cryst.bbk.ac.uk/PPS95/course/3_geometry/rama.html)

The range of actual bond angles will in large part be dictated by the actual amino acids participating in a particular bond. The spread of a Ramachandran plot involving glycine, for instance, will be much larger than one involving proline, as will be described shortly. Information involves quantizing and specifying, and it seems that the peptide bond, like everything in the universe, can be neatly quantized. We can describe the bond with a few parameters. We can simply quantize each parameter - digitize them, so to speak - and then combine them to create many possible bonds. A language capable of executing this process would go a long way toward folding a protein. In honor of Ramachandran's work, I have named the quantity of information contained in the genetic code regarding peptide bond rotation the R-number. The information regarding the major bond configuration is the C-number. The C-number tells us the bend, and the R-number tells us the twist, in each peptide bond.

There is one other property of the bond that merits attention: the laxity of the bond. Every bond has some wiggle, or play. The degree to which this is true depends on the specific functional groups of the amino acids involved in the bond. For instance, proline has no play, but glycine is a virtual swivel. A glycine-glycine bond is sloppy, but a proline-proline bond is tight. A glycine-proline bond will combine features of each. In this way the key contribution to the backbone made by each amino acid is its impact on the flexibility of the peptide bond it forms with a given partner at a certain angle and major configuration. Residue identities play other roles with respect to folding a protein, of course, but Rafiki sees this as the key role of each functional group in determining the overall shape of a protein. The amino acid identities can determine just how locked in, and at what angle, a peptide bond might be at the time it is formed.
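These parameters quantize naturally into a data structure. The sketch below is one hypothetical rendering of a pepton - the bin width, the laxity scores, and the example values are all invented for illustration; it also tallies how far bond identities can expand once configuration and binned rotation are counted alongside residue pairs.

```python
from dataclasses import dataclass

# Assumed laxity scores, purely illustrative: proline locks, glycine swivels.
LAXITY = {"Pro": 0.0, "Gly": 1.0, "Leu": 0.4}

@dataclass(frozen=True)
class Pepton:
    """One quantized peptide bond: the proposed unit of genetic information."""
    res1: str        # amino acid on the carbonyl side
    res2: str        # amino acid on the amide side
    c_number: str    # major configuration: "cis" or "trans" (the bend)
    r_number: tuple  # (phi_bin, psi_bin) in assumed 10-degree bins (the twist)

    def laxity(self) -> float:
        """Crude wiggle score for the bond, averaged over its two residues."""
        return (LAXITY[self.res1] + LAXITY[self.res2]) / 2

bond = Pepton("Gly", "Pro", "trans", (-6, -5))  # phi ~ -60, psi ~ -45 degrees
print(bond.laxity())                            # 0.5: a swivel tempered by a latch

# Bond identities expand quickly: 20*20 residue pairs, 2 configurations,
# 36*36 rotation bins -> about a million, versus 16 dinucleotide steps.
print(20 * 20 * 2 * 36 * 36)                    # 1036800
```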

Before we can begin to quantify actual bond information in the genetic code, we must first attempt to identify the mechanisms and components that can carry and translate this information. In so doing, we begin to see that a major reformation of our view of the genetic code is required. The first element pertains to what we define as the actual code, or more appropriately, where the code actually resides. The central dogma holds that information flows DNA → RNA → Protein. Transcription is the process of moving information: DNA → RNA. Translation is the process of moving information: RNA → Protein. The genetic code is about translation. Information is moved from nucleic acids to amino acids. DNA serves a storage function only and conventionally plays no direct role in translation. Messenger RNA is like a modified transcript of information stored in DNA - important, but not everything. It is far from the only nucleic acid molecule that participates in translation. Also integral to translation are transfer RNA (tRNA) and ribosomal RNA (rRNA), both of which must be considered active participants in translation. They are not merely mute cogs; they are active components of the genetic code, providing an additional voice in the translation process. Genetic information is directly communicated by these components during translation. This concept is easily lost when one focuses entirely on primary sequence and neglects primary structure. To get a feel for what's really happening on the shop floor, the actual action at the point of translation, one must imagine being an amino acid; we must visit the shop floor. If you were an amino acid in a living cell, what would you see at the moment of translation? Let's use a colored chessboard to make the point visually. The board is a graphical representation of nucleotides, codons, amino acids and their relative water affinities. (Red = hydrophobic, blue = hydrophilic)

This graphic is equivalent to the standard Watson-Crick codon table. All the information in the table is in this graphic, but the table relies on letters and words whereas this graphic uses shapes and colors. There is something radically wrong with both - they are each misleading. Why?

Codons do not specify amino acids!


Codons specify tRNA molecules. tRNA molecules specify amino acids. Let's take a look at some of the features of a tRNA molecule and imagine that we are an amino acid - Leucine, for instance - encountering our very own tRNA.

Each tRNA molecule is a polynucleotide, as are mRNA and DNA, and each tRNA folds into a three-dimensional shape that determines its function, much like a protein. This does not stop tRNA from being and behaving like DNA and mRNA with respect to nucleotide-nucleotide interaction. Each tRNA has an anticodon region which gives the molecule its raison d'être - complementary pairing with codons. However, there are other nucleotides in the molecule that play a key role in the translation process, many of which can be schematically found in the variable region or body as labeled above. This is a very stylized look at tRNA, but it will help us make the central point about the role of tRNA in the genetic code. The actual structure of tRNA is fairly well documented, but this simplified view is copacetic with what is known. We can revise our assignment table, substituting the tRNA-amino acid complex where we previously only had amino acids.

Nothing seems to have fundamentally changed. This is because we have not done enough thinking about our new partner, tRNA. A much better description can be found elsewhere on the internet, and here is a link: http://www.mun.ca/biochem/courses/3107/Lectures/Topics/tRNA.html

Here is some interesting information about tRNA that can be found at that link: In 1966, Francis Crick proposed the Wobble hypothesis. He suggested that while the interaction between the codon in the mRNA and the anticodon in the tRNA needed to be exact in two of the three nucleotide positions, this did not have to be so in the third position. He proposed that non-standard base-pairing might occur between the nucleotide base in the 5' position of the anticodon and the 3' position of the codon. This hypothesis not only accounts for the number of tRNAs that are observed, it also accounts for the degeneracy that is observed in the Genetic Code. The degenerate base is that in the wobble position. These wobble rules are not followed exactly. If they were, only 31 tRNAs would be needed to pair with all possible codons. There are, however, more than 31 tRNAs. The sequence of Escherichia coli K12 contains 84 tRNA reading frames. The only amino acids with a single tRNA are histidine, tryptophan and selenocysteine. There are 7 tRNAs for arginine and valine, and there are 8 tRNAs for leucine.

Let's incorporate this newfound info into our assignment table. Since we don't have the specifics on an organism's distribution of tRNA, again we will wing it in a stylized way. This will help us visualize how a hypothetical population of tRNA, say E. coli, might be distributed across the genetic code.

The above description of the population and wobble appears to me to be a direct contradiction. The wobble hypothesis was presented to explain redundancy, not create it. Furthermore, I can't possibly imagine what definition of linear this concept might fit into. Cause and effect are now clearly disproportionate. Our conventional view of a one-codon-one-amino acid system can at least be plotted on a graph, and it actually produces a line.

But with this new realization - that codons do not specify amino acids, they specify tRNAs - we must revisit the graph. If we plot every tRNA against its codon, and there are more tRNAs than codons, then we cannot produce a line. Depending on the exact proportions, we might get a plot that looks more like an electron cloud.

There is a difference between reading frames and actual tRNA in a cell. But we also haven't gotten all the potential forms of transcriptional splicing and post-transcriptional modifications of tRNA into the mix. The bottom line is that we presently do not know the precise population of discrete tRNA molecules in any given cell. There could be fewer than expected, or there could be more - a lot more. This will become an important part of sorting out the actual genetic code. We now have a problem when we assemble proteins on the shop floor. Before, in the linear model, when an order for an amino acid came down from nucleotides, all we had to do was refer to our linear graph and assemble the amino acid - the only amino acid - into the chain as dictated by the code. We now must find a way to decide which of these damn tRNA to use. If two different tRNA will work for a single codon, how do we decide which one to use? We pine for the simpler world of yesteryear. Why do we need to confuse things, why not stick to a one-to-one system? If wobble is a beneficial phenomenon credited with reducing redundancy, then it should actually in some way reduce redundancy. A force that allows a property of a system to exceed its predicted maximum should never be proposed as a reducing force. Codons are now ambiguous with respect to the tRNA they specify, and there is a likelihood of finding more than 64 tRNA in a cell. Strictly from the standpoint of information, wobble rules alone increase the information content of translation. The rules allow for an expansion of molecular choices at the third position.

Wobble is touted as an information reducer, but in fact it increases the information two-and-a-half-fold, converting 64 codons into 160 anticodons. Consequently, the anticodon table is more robust than the codon table.
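In information terms the arithmetic is simple. Taking the 160-anticodon figure at face value (it is the count claimed here, not an established census), each symbol of a 160-letter alphabet carries more bits than a symbol of a 64-letter one:

    import math

    # Bits per symbol for a 64-codon alphabet versus the claimed
    # 160-anticodon alphabet. The 160 figure is taken from the text.
    print(math.log2(64))   # 6.0 bits per codon
    print(math.log2(160))  # ~7.32 bits per anticodon
    print(160 / 64)        # 2.5-fold more distinct symbols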

Translation mechanisms to handle the newfound ambiguity can be leveraged to increase the information even further. In other words, anticodons are just the start. I am uncertain whether there is an upper boundary to the number of distinct tRNA that might be found in the cell. Reliable sources tell me there are ~1000 known reading frames for tRNA so far located in the human genome. Who knows what this translates into in terms of recombination and post-transcriptional modifications. The actual number of human tRNA might be surprisingly large. We must try a little harder to come up with a model that actually explains these numeric contradictions with dogmatic expectations, rather than ignores them. Take Leucine, for instance. We know that Leucine is highly redundant with respect to codon assignments (it has six). The wobble hypothesis as a reducer predicts that Leucine might only require two tRNA molecules to be fully translated in the code. This is because all Leucine codons start with one of two dyads: UU and CU. Leucine codons: UUA, UUG, CUU, CUC, CUG, CUA. So Leucine would only need anticodons of AAI and GAI. However, now we are told that Leucine has perhaps eight tRNA molecules in E. coli, which makes little sense. It makes even less sense when one considers that without wobble one might expect six (one for each codon), and the wobble hypothesis was ostensibly proposed to reduce this number even further. There is now ambiguity in the translation of at least one codon.
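The numbers are worth pausing over. With eight leucine tRNAs against only six leucine codons, the ambiguity compounds along a sequence. Here is a quick sketch of the counting, using the eight-tRNA figure quoted above:

    from itertools import product

    # Six leucine codons versus eight hypothetical E. coli leucine tRNAs.
    leu_codons = ["UUA", "UUG", "CUU", "CUC", "CUG", "CUA"]
    n_trnas = 8

    # Distinct codon sequences that spell tetra-leucine:
    print(len(list(product(leu_codons, repeat=4))))  # 1296 = 6^4

    # Distinct tRNA "scaffolds" that could build the same tetra-peptide:
    print(n_trnas ** 4)  # 4096 = 8^4

If the scaffold matters, there are more ways to build tetra-leucine than there are messages that spell it.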

Every tRNA molecule carries an identical Leucine, but there are differences in the nucleotide quantities and sequences of each tRNA. These differences are schematically demonstrated above by changing the colors of the anticodons at the bottom of the molecule and the variable regions in the middle of the molecules. I do not have specifics of the actual variations or the pattern made by the group, but we can guess that at least two sets of anticodons probably share sequences. We also know that this phenomenon grows much larger in eukaryotes than in the prokaryote example of E. coli that we are addressing. Let's put this concept visually into the assignment table.

Note that these molecules are fairly large compared to the size of a codon. Each one has between 73 and 93 nucleotides, which means that the anticodon region accounts for 4% of the molecule at most. The amino acid is truly tiny compared to the tRNA molecule. The whole function of these molecules is to bring these tiny amino acids into proximity so they can be bonded together in a polypeptide chain. The steric influence of these molecules must be significant. To get a better view of this I will show a stylized version of translation of four amino acids using these tRNA. We will stick with Leucine for this illustration, but we will use a different tRNA for the addition of each residue.

This would be a truly remarkable feat of molecular manners - to find anything approaching this type of well-behaved stereochemistry in nature. For starters, it is unlikely that all four tRNA leucines will ever simultaneously line up on mRNA as pictured; they generally work two at a time, but all four will play a role in the tetra-leucine at some time. Think of the chain of tRNA as a virtual scaffolding that forms through time. A polypeptide with 100 residues will never visit a contemporaneous scaffold of 100 tRNA, but we can think of it that way. The scaffold is positively huge compared to the polypeptide for which it forms a template; this diagram does not do justice to the relative size differences. It is not implausible to think that the topology of the scaffold will impact the primary structure of a polypeptide. The question now becomes, how many different scaffolds can be built in a given cell that create the same amino acid sequence? How do these discrete scaffolds differ, and how will the different scaffolds affect the ultimate shape of the proteins they create?

The schematic demonstrates another critical point: the genetic code is operating in at least two distinct zones of interaction. There is nucleotide interaction obviously occurring between codons and anticodons, but there also is nucleotide interaction occurring between nucleotides in adjacent tRNA molecules. More significantly, there is a necessary spatial interaction between adjacent tRNA. It is this dual level of interaction that makes the code truly multi-dimensional, going beyond sequence. It is also this extra dimension of information, or interaction, that translates the stereochemical information in the genetic code. A revised view of this situation would lead to the conclusion that eight does in fact represent a dramatically reduced tRNA count for Leucine, due in large part to wobble. There could potentially be many more tRNA in a cell, and each one might contribute to a tRNA scaffold in a topologically different way.

tRNAs now compete for space, or jockey for position. There must be logical pairings between tRNA in order for a peptide bond to be made successfully. Consequently, the tiny amino acid and its peptide bond will be jostled, or sterically affected, in the process. Small alterations in the properties of tRNA molecules could have large stereochemical consequences for translation. The tRNA act as angle-compounding dies in the expansion of spatial information. They are perfectly suited geometrically for this task. The common thread - the encryption key - involves the golden triangle. The shape first shows up in the cross-sectional decagon of the double helix.

This is the seed of spatial information that begins in nucleotides and is unpacked into proteins. The shape of tRNA preserves these logical proportions.

More fun: if we place a golden triangle solid spike on a dodecahedron to represent each anticodon on the Rafiki map, we find an impressive fit for an expanded icosidodecahedron.

I find it absolutely remarkable that any regular shape can be found to symbolically represent tRNA molecules on a comprehensive mapping of all codons. It is one thing to realize that the real symbolic constraints of codons are an exact match with the symmetry elements of a dodecahedron, but it is another thing to discover that tRNA shapes will extend the symmetry much further. It gives us pause, during which we should consider the question: Did codons make this shape, or did this shape make codons?

If we let our imaginations flex their considerable muscles, we can see a clever, self-contained erector set for molecules, a platonic building system of blueprints, building blocks and angle-compounding dies. From an engineering standpoint, it is perplexing that tRNA molecules should be so large while codons and amino acids are so small. When the first inkling of the existence of a functional molecule like tRNA was imagined, they were given the name adapters. It was guessed that these molecules were mirrors of codons, and that they were only three nucleotides big. Although it turns out that tRNA seem to be inordinately large for this role, their actual size in and of itself is proof of nothing. However, it is irresistibly fun to speculate about collections of really large golden triangles mingling with sets of much smaller golden triangles in some cryptic engineering project, optimized for precision alignment of tiny tetrahedrons. It would be fabulous if they actually turn out to be the spatial amplifiers that they pretend to be.

Beyond this level of translation, downstream into folded proteins, structures are highly conserved in a surprisingly small number of shapes. The laws of form are precisely what drove D'Arcy Thompson at the turn of the century, and it has since been claimed that these conserved protein structures can be broadly classified by the symmetries they share with regular solids. Proteins love to form regular solids when they auto-assemble in larger numbers, so again we might find the golden mean at this level. It is not inconceivable then that the golden mean is the shared element through these various levels of genetic translation. At least it could be used as an intellectual tool to analyze information as it gets passed along and expanded in growing, organic systems. Self-assembly at the smaller scales, such as proteins and virus particles, is a platonic game, but even larger structures, such as radiolarian shells, have long been known to have a fondness for platonic shapes.

Returning to and further complicating the study of translation mechanisms, there is a third level of nucleotide involvement in peptide bond formation. The catalyst for the bond is not a protein but RNA - the 23S rRNA of the large ribosomal subunit. The catalytic property of this structure is referred to as peptidyl transferase, and it interacts with tRNA during bond formation and has a direct impact on the final state of the bond between every pair of adjacent amino acids in every polypeptide.

Still another level of interaction is required. The tRNA must be loaded with amino acids. This job is performed by a set of proteins - enzymes known as aminoacyl-tRNA synthetases. There are twenty (one for each amino acid) and they are evenly divided into two classes. The lines of division are clearly organized around proline.

Many interpretations of this distribution of two enzyme classes have been made, but within the context of the Rafiki map the symmetry around proline is compelling. More interestingly, every tRNA accepts its amino acid onto the same three nucleotides, CCA, so as far as amino acids are concerned there is really only one codon. There is a separate code that operates in the world of synthetases to identify anticodons, amino acids, and something else. Perhaps these musings - thought experiments - do not prove a thing, but they do bring to light the pragmatic difficulties of taking a contrary position. There appear to be a huge number of nucleotides intervening between the codon and the amino acid, all of which seem to owe their invitation to the process to the need for a peptide bond. The consistency of this large and diverse collection of nucleotides in the process of making these peptide bonds has been proven, but uniformity of bonds has not. Therefore, the mechanism would seem to consistently form non-uniform bonds. Rafiki says that this bond irregularity is a pre-determined part of the program. It is the essence of genetic information. Nucleotides speak a stereochemical language in building the backbones of polypeptides. In order for this view to be completely invalid, the following circumstances would have to be proven. For every potential sequence of our hypothetical tetra-leucine, identical tetra-peptides would have to be produced. Otherwise, the difference between tetra-peptides must be attributed to some macromolecular nucleotide mechanism acting on peptide bond formation.

If we start with the assumption that there are 8 forms of tRNA-leucine available in a cell, then there are perhaps a minimum of 4096 code sequences that might specify tetra-leucine (8^4). Perhaps some tRNA-leucines are not sterically compatible with one or more of their brother tRNA molecules. In this case, when one tRNA follows an incompatible tRNA in the sequence, the translation apparatus has a tRNA option (or perhaps a requirement). When this option does not exist, translation is terminated. There are tremendously more tRNA required by this system, and wobble can play an effective role in keeping this number smaller than it might otherwise have been. We can now fully appreciate what it means to be an amino acid. We are leucine, after all - remember? The thought experiment asks one question: what do we see, as amino acids, at the point of contact with the genetic code? This is, in fact, what the genetic code is. It is all about nucleic acids talking to amino acids and getting them to behave in a way that leaves no doubt that information is crossing molecular boundaries. The money shot of this whole thread, so to speak, is in trying to decide what the nucleic acids are telling the amino acids. We now have two basic choices, and we are going to use tetra-leucine to decide which one is more logical. If all conceivable nucleic acid sequences for four consecutive leucines look the same, or homogeneous, to us as leucine, then the code probably can carry no stereochemical information.

However, if any valid scaffold has a different look to us, one that results in even a slightly different polypeptide backbone, and this structural difference leads to discrete folding options, then we must conclude that the code is truly multi-dimensional and it carries stereochemical information. A different look to the above sequence could take on any one of roughly 4000 appearances to us, assuming that we are leucine in E. coli as described.
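For what it is worth, the compatibility idea is easy to model. The sketch below invents a small set of forbidden neighbor pairs purely for illustration - the matrix is hypothetical, not data - and counts how many of the 8^4 scaffolds survive:

    from itertools import product

    # Toy model: count tetra-leucine tRNA scaffolds that avoid
    # hypothetical sterically incompatible neighbors. The forbidden
    # pairs below are invented for illustration only.
    incompatible = {(0, 3), (3, 0), (5, 6)}  # (preceding, following) tRNA indices

    def valid_scaffold(seq):
        """A scaffold is valid if no adjacent pair is forbidden."""
        return all((a, b) not in incompatible for a, b in zip(seq, seq[1:]))

    scaffolds = [s for s in product(range(8), repeat=4) if valid_scaffold(s)]
    print(len(scaffolds), "of", 8 ** 4, "toy scaffolds survive")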

This is the view of a simple case, tetra-leucine. What is the view of our hypothetical 100-residue backbone? How many different topologies can the tRNA scaffolds generate? How do these differences translate into the information content of proteins? These are all good questions that to my eye haven't even been asked, let alone answered. This institutional absence of curiosity is ludicrous, because after all, this is the genetic code. Dogma be damned.

A New Method of Mapping Genetic Information


Genetic information can be mapped in a simple schematic of the molecular sets during protein synthesis. The state-of-the-art method used today is quite simple. Nucleotide sequences provide information that is translated directly into amino acid sequences, and nothing else.

The diagram shows that three nucleotides translate into one amino acid. This model teaches that the codon is the only salient unit of information contained in polynucleotides during translation, where a codon is a grouping of three nucleotides specifying one and only one amino acid in a polypeptide chain. If this is taken as the complete foundation of genetic information, then the schematic is fairly simple, and the method of mapping seems trivial:

1. Map all possible codons to amino acids.

Of course this has already been done, and this is the conventional view of genetic translation today. Unfortunately, this view is technically incomplete, and there are many observed phenomena that cannot be easily handled by this model. No attention is given in this view to the specific parameters of a peptide bond when it is formed. No role is given to any of the RNA sets - mRNA, tRNA, rRNA - in forming the specific parameters of each peptide bond, beyond determining the two residues that will bond together. The specific conformation of these two residues as an immediate consequence of their formation is being completely ignored. Do we know for certain that this is a wise and accurate view of translation? Do all nucleotide sequences and apparatuses that bond the same two amino acids lead to identical peptide bonds? In other words, given the 36 codon pairs (147,456 possible genons) that can create a leucine-arginine peptide bond, does this bond emerge from translation in identical conformations in all cases? If not, this constitutes real genetic information, regardless of the impact, if any, it might have post-translation. The empiric evidence is accumulating ever more rapidly that silent mutations can change peptide bond conformations in the folded protein. Now the really important question is whether or not silent mutations can change bond conformations at the moment of translation.

I say they can.

Rafiki's method of mapping holds that the peptide bond is the primary output of translation. Amino acid sequences are a necessary consequence of peptide bonding, as opposed to the commonly held reverse viewpoint that peptide bonds are a trivial, yet necessary consequence of amino acid sequencing. Look at the basic picture of a peptide group:

This is the fundamental output of translation. It includes all of the informative parameters of the peptide bond: residue identities, major configuration - cis or trans - and phi-psi angles. The key question for genetic translation is, How many distinct peptide groups can be formed during translation? If all peptide bonds are identical when they form, then the answer is 400 (20 amino acids X 20 amino acids). If some residue pairs form distinct peptide conformations within different sequence contexts, then the answer is >400, and the genetic code can no longer have only one degree of freedom in translation. I prefer to view the primary structure of a polypeptide - the immediate output of translation - as an overlapping sequence of peptide groups, not a sequence of amino acids. The sequential identities of amino acids are but components of the distinct identities of sequential peptide groups. Some new terms will be added to the old ones and applied to the new method of mapping genetic information:

Genon - the unit of mRNA involved in fully describing a single peptide bond.
Pepton - the ensemble of tRNA involved in fully describing a peptide bond.

Whereas the linear model is flat, requiring knowledge of isolated codon correlations alone to determine primary sequence, the new model is hierarchical, requiring a number of elements for determination of the primary structure of a polypeptide backbone. The concepts of genon and pepton are valid even in the classic model, but they would have little use in that one-codon-one-amino-acid world. The new method of mapping is still conceptually simple, but new steps must be added.

1. Map all expressed tRNA to amino acids in an organism.
2. Map all possible genons to peptons.
3. Map all peptons to peptide bonds.

The above schematic is modified to reflect this new method.

C = Codon, T = tRNA, PB = Peptide Bond, AA = Amino Acid

The most significant result is that the peptide bond becomes the focal point of the genetic code. The genetic code is composed of at least two forms of RNA, mRNA and tRNA, and it is an overlapping code; therefore, no unit of information can be removed from its appropriate level of context and still retain its complete meaning. Each tRNA, and therefore the amino acid it carries, plays a role in two peptide bonds, so a single substitution might impact the output in more than one way. It is entirely plausible that the state of leading and following bonds will affect the ultimate conformation of any given bond. The schematic illustrates this by including the surrounding tRNA in the concept of a pepton. This method should be universal, but the results are not necessarily universal. In other words, it must be applied to each organism under study, and it is expected to yield different results for each. Every organism expresses a different population of tRNA molecules, and each codon has the ability to specify more than one tRNA. This is in stark contrast to the one-codon-one-amino-acid view of the genetic code. Therefore, it is only within a genon context that codon-tRNA mappings are made. It is not guaranteed that every organism will share the same view of genons and peptons, just as they are known to differ on approaches to codons. It has been shown that organisms have the ability to change their codon to amino acid correlations, and they even have the ability to eliminate codons entirely from their code (codon capture). An organism with a missing codon cannot have the same code as one without a missing codon. So even the number of codons is not a universal part of the code. Likewise, it should be expected that an organism might take advantage of flexibility in genon and pepton mechanisms. Number and type of each can be expected to vary. It may even be found that genons and peptons vary in size within and between organisms. For this reason it is not possible to precisely define sizes and varieties of these genetic components without data.

There is a transition from the monotony of bonds in DNA to the diversity of bonds in protein. This is literally a translation of genetic information. The $64,000 question is, how is this transition governed? The process of making a total bond identity has traditionally been believed to occur post-translation, but this is a dangerous, unproven assumption. Our new method of mapping brings to light the need to track bond conformations all the way to the mechanisms of bond formation, if only for the sake of intellectual accuracy and completeness.
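To show how the three mapping steps might hang together, here is a minimal data-structure sketch. The field choices are my own illustrative assumptions; the text deliberately leaves the sizes of genons and peptons open until there is data.

    from dataclasses import dataclass
    from typing import Tuple

    @dataclass(frozen=True)
    class Pepton:
        """The ensemble of tRNA involved in fully describing one peptide bond."""
        trnas: Tuple[str, ...]  # e.g. the bonding pair plus leading/following tRNA

    @dataclass(frozen=True)
    class PeptideBond:
        residues: Tuple[str, str]  # the two amino acid identities
        configuration: str         # "cis" or "trans" - the C-number
        rotation_state: int        # discretized phi/psi - the R-number

    # The three mapping steps, as tables waiting to be filled with data:
    trna_to_aa = {}       # step 1: expressed tRNA -> amino acid
    genon_to_pepton = {}  # step 2: mRNA genon -> pepton
    pepton_to_bond = {}   # step 3: pepton -> PeptideBond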

Maximum Symmetry in the Genetic Code - The Rafiki Map


The genetic code, as it is presently defined, is merely a set of data. I wish to propose a novel structure for the data to maximize its symmetry. The physical structure of DNA, by analogy, is central to nature's methods of storing and replicating genetic information. The double helix has an ideal form to dictate its function, which shows that relationships between molecules - not the molecules themselves - define the essence of biological information. These relationships are described by structure and shape. What then is the shape of the genetic code? We conventionally view the genetic code as a two-dimensional spreadsheet, where nucleotide symbols standing for the first and last codon positions are arranged in rows, and the symbols for the middle nucleotides are arranged in columns.

This data structure is certainly familiar enough to obviate further description here. Of course all data structures must adopt conventions that are predicated on some degree of subjectivity, so we must accept this in our symbolic treatments of the code. However, the genetic code itself must be based on a purely objective structure in nature. Imagine a set of organic molecules with the task of obeying the dictates of the genetic code. What informative, objective parameter could possibly exist in the universe to serve as a logical foundation for this if not shape? Any spreadsheet standing for the genetic code must employ a reading algorithm that permutes single nucleotides into triplets. The data cells spatially created by the permutation are then mapped with the twenty amino acids contained in the standard set. There are unfortunate, subtle biases to this particular schema, and they bring with them seldom-recognized yet nasty epistemic side effects. First, the inherent symmetry of nucleotide relationships is destroyed by unbalanced treatment of the three positions within permuted triplets. Second, a completely subjective hierarchy of nucleotides must be imposed on the data structure (commonly U, C, G, A is the one chosen) and this relative positioning of symbols impacts the visible patterns commonly recognized in the code.

Many alternate structures have been proposed; they include hypercubes, circles, spirals and various funky geometric schemas, but they all share similar drawbacks due to a lack of maximum symmetry in the combined components of the code. At one time in the late 1960s, the framework of a 5,000-year-old Chinese spiritual philosophy known as the I Ching was formally and studiously fitted with the genetic code. No kidding. This demonstrates the importance of, and the extent to which we are willing to investigate and expand our perspective on, possible structures for this particular set of data. Regardless of the actual structure used in viewing the genetic code, an unmistakable element of symmetry usually emerges from the data. Symmetry in this case can be defined in many ways, but most of them relate to a transformation of data that leaves fundamental properties unchanged. For example, a wheel is transformed by rotation, but symmetry in the perimeter and spokes leaves the appearance of the wheel unchanged. In the case of the genetic code, symmetries have been found throughout the data using a variety of creative techniques. Appreciating many of these symmetries in the past has required high-level, abstract mathematics, as well as taking account of several properties of the molecules in the genetic code. The most obvious symmetry is found in the third position of codons, where the same amino acid is often assigned independent of the nucleotide in that position. This symmetry in the data is associated with concepts known as degeneracy of codon assignments and wobble in cognate anticodons.

The goal of this paper is to describe a symbolic configuration of nucleotides that is maximally symmetric. I call this the Rafiki map. Since the data is based upon nucleotide permutations, it is logical that symmetry in the data should be as well. This means that all nucleotides must be treated independently of position or relationship to other nucleotides within the structure. Remarkably, this can be achieved by using the twelve faces of a dodecahedron, a schema that magically generates all sixty-four codons. This treatment of the data results in the most symmetric - and therefore, not surprisingly, the most compressed - possible arrangement of nucleotides. Through a spherical permutation network of single nucleotide symbols we also are able to find the most symmetric arrangement of nucleotide doublets and triplets. Preserving all of these natural symmetry elements in the overall data structure is informative to a proper view of the genetic code.

Constructing the Rafiki Map.


Constructing a codon map from a dodecahedron is conceptually quite simple. Begin with a tetrahedron, and label each of its four vertices with one of the nucleotides in mRNA.

These tetrahedral vertices serve as major poles in the codon map. An equilateral triangle labeled with three congruent symbols is then affixed at proper angles to each pole.

These four symbolic triads comprise the twelve faces of a dodecahedron, as well as the twelve vertices of an icosahedron. The tetrahedron, dodecahedron and icosahedron are three of only five possible regular solids, also known as platonic, or perfect solids. The other two perfect solids, cube and octahedron, can be applied to subsets of the data as well. The dodecahedron and icosahedron are dual to each other, a relationship where vertices in one solid match face centers in the other. The cube and octahedron are duals also, but the tetrahedron is dual only with itself.

There are 120 distinct rotational permutations of three adjacent faces on a dodecahedron. Each permutation defines one of twenty vertices, or twenty distinct symbolic triangles. The codon reading conventions within face triplets are as follows. Begin with any face on the dodecahedron and move to any of the five adjacent faces. There are now only two remaining faces that are adjacent to both of the first two, and choosing one of them defines a vertex. Selecting three ordered faces in this way creates one of six permutations from this set of three faces. By convention, the corresponding codon is mapped inside the triangle on the first dodecahedral face, and in the reading direction of the second face.

From here we can develop more terminology to help understand the data within this structure. There are three different classes of codons in the genetic code depending on the heterogeneity of the nucleotides in a codon. Each class generates one or more types of codon based on the order of nucleotides. The major poles create four primary triplets that have homogeneous nucleotides; UUU, for instance. There are twelve semi-homogeneous secondary triplets, such as UUG, and four completely heterogeneous tertiary triplets, such as ACG. By joining mirrored permutations we can enhance the appearance and readability of the map.
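The class counts, and the way twenty unordered triplets unfold into exactly sixty-four codons, can be checked with a few lines (the class names follow the text; the rest is standard combinatorics):

    from itertools import combinations_with_replacement, permutations

    bases = "UCAG"
    triplets = list(combinations_with_replacement(bases, 3))  # unordered sets

    classes = {1: 0, 2: 0, 3: 0}  # distinct nucleotides per triplet
    codons = set()
    for t in triplets:
        classes[len(set(t))] += 1
        codons.update(permutations(t))

    print(len(triplets))  # 20 distinct triplets
    print(classes)        # {1: 4, 2: 12, 3: 4} - primary/secondary/tertiary
    print(len(codons))    # 64 distinct permutations

The homogeneous triplets contribute one distinct permutation each, the semi-homogeneous three, and the heterogeneous six: 4 + 36 + 24 = 64.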

Twelve nucleotides oriented in twenty triangles on a three-dimensional surface create a symbolic network that preserves inherent symmetry in the data of the genetic code.

The network produces a variety of units of nucleotide symmetry that can inform our view of the genetic code. Doublet and triplet are terms to stand for unordered sets of two and three nucleotides respectively. A codon is a specific permutation of a triplet. The term multiplet will mean a group of four codons that all share the same B1B2 nucleotides. Four major poles. Twelve singlets. Sixty doublet permutations. Sixteen multiplets. Twenty distinct triplets with six possible permutations each. Sixty-four distinct triplet permutations representing all codons. In addition to the reading conventions already described, there is one other significant convention adopted by the Rafiki map at the level of the four major poles. There are only two nucleotide configurations that lead to a comprehensive mapping of codons, and they are stereoisomers, or perversions of each other. They are made interchangeable by swapping any two of the major poles, because the only distinct arrangements of a tetrahedron are mirrors of each other.

Data patterns form in both versions of the map, but given the data provided by nature the one on the left is visibly superior.

I will call the superior configuration of nucleotides the L-form, and use it exclusively to stand for the Rafiki map. I use the term as if it is a single map, but note there is a D-form as well. Compared to other structures for the data, this degree of subjectivity is small, and it might even have pedagogic value with respect to the fact that all amino acids in the standard set are L-isomers. Beyond this, nucleotides in a dodecahedron must be objectively configured to achieve a comprehensive map of codons. Symmetry in the Rafiki map expands from the poles, creating a hierarchy of elements culminating in logical Hamiltonian circuits of entire nucleotide sequences. The hierarchy of symmetric elements in the structure is as follows:

Pole : Singlet : Doublet : Triplet : Codon : Circuit

Symmetry in the poles has already been addressed within the context of a tetrahedron, and symmetry in nucleotide singlets derives from the faces of a dodecahedron. However, the relationships between individual nucleotide symbols become more informative within the context of the four poles. They offer a convenient method for distinguishing single nucleotides relative to the context of the entire network, because every face on a dodecahedron is parallel to exactly one other face. All four poles of the Rafiki map include three faces, and each of these three faces is parallel to a face from a different pole. We can use subscripts (McNeil subscripts) to denote parallel faces, effectively creating twelve distinct nucleotide symbols.

These subscripts help distinguish symmetries in the hierarchy beyond singlets, including the sixty distinct nucleotide doublet permutations in the map. Without unique identification of twelve nucleotides there can be only sixteen distinct doublets. B1B2 doublets (base one and base two of a codon) aggregate into sixteen multiplets consisting of four codons each. These multiplets correspond to the B1B2 portion of a codon, and they have long been recognized as significant within the assignments of the genetic code. There are two types of multiplets, correlating with homogeneous and heterogeneous doublets. For instance, the doublet UU generates the homogeneous multiplet of UUU, UUA, UUC and UUG, whereas the UG doublet leads to a heterogeneous multiplet made of UGC, UGG, UGU and UGA. The latter outnumber the former three to one, and together they form a quartet around each major pole.
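Again the bookkeeping is easy to verify; grouping the sixty-four codons by their first two bases gives the sixteen multiplets, four of them homogeneous:

    from itertools import product

    multiplets = {}
    for b1, b2, b3 in product("UCAG", repeat=3):
        multiplets.setdefault(b1 + b2, []).append(b1 + b2 + b3)

    homogeneous = [d for d in multiplets if d[0] == d[1]]
    print(len(multiplets))           # 16 multiplets of 4 codons each
    print(homogeneous)               # ['UU', 'CC', 'AA', 'GG']
    print(sorted(multiplets["UG"]))  # ['UGA', 'UGC', 'UGG', 'UGU']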

The quartet of multiplets centered on each major pole is a completely symmetric interweaving of multiplets within the overall structure.

Of course, this data is best viewed in three dimensions, but we usually have only two dimensions available on a printed page. In that case, it is convenient to organize the data along either the faces of a polyhedron, or within the structure of the four major poles. To illustrate, I shall use amino acid water affinities to assign colors within the structure. The red portion of the spectrum colors the most hydrophobic amino acids, and hydrophilic residues are given the blue end of the spectrum. Yellow and green color the middle affinities. Isoleucine and arginine are red-purple and blue-purple respectively to connect the two ends of the color wheel. Colors for the nucleotide faces and their McNeil subscripts are added as well (A=blue, C=green, G=yellow, U=red).

Summary
This short paper is intended merely as an introduction to some of the properties and terminology of the Rafiki Map. It is not meant as a comprehensive interpretation or review of the applications for this data structure. However, the pure symmetry and objectivity of the structure make it ideal for studying patterns in the genetic code. Its advantages include the following.

1. It is the most symmetric treatment of the data.
2. It is the most compressed arrangement of nucleotides.
3. It is the most objective arrangement of nucleotides.
4. It is based on the symmetry of real, three-dimensional objects, and the geometry of these objects is congruent with the geometry of molecules comprising the genetic code.

Due to these qualities, the Rafiki map is useful for viewing and normalizing patterns across the entire data set, or for comparisons of patterns based on a specific triplet, class or codon type. Because the map is based on real polyhedra, as opposed to the imaginary structure of hypercubes and spreadsheets, it can augment studies of various relationships between real molecules of genetic translation. The genetic code is part of a language where a sequence of dodecahedrons (DNA) communicates information to a sequence of tetrahedrons (proteins). It is no accident then that this language fits so perfectly into the structure of a dodecahedron. The map's perfect symmetry is useful in visualizing the effect that codon assignment patterns have on translation, especially in the case of frameshifts. Beyond all that, the harmony of spatial relationships within the structure, and the aesthetic appeal of its simplicity, are likely to prove informative to our views on the origin, logic and evolution of the genetic code. More importantly, this structure might shed light on functions of information handling that take place during genetic translation. This is a process with which we are still remarkably unfamiliar, despite all of the past hyperbole. I must confess that the data, when studied in this structure, has completely infected me with the heretical conviction that information translated by the genetic code is primarily about protein structure, not sequence. After all, the code is directly responsible for the peptide bonds that make up a native protein, so it is logical to expect the code to be optimized for that function. The genetic code builds proteins, not sequences. Fear not - that argument will not be made here. In summary, the Rafiki map is a useful general tool for studying the data set known as the genetic code, because it brings maximum symmetry to its structure. The beautiful symmetry in the genetic code need no longer be imagined; it can now be held in the hands and seen with the eyes.

Seeing Codons
It is a challenge to see something abstract. How does one see information? This is the fundamental task in shifting our paradigm of genetic information, so I have spent a good deal of effort looking for alternative ways that codons can be seen. This is part art and part science, and I have come up with dozens of different strategies for visualizing the information contained in a codon. This lengthy page will be a general tour and review of the thought processes and the visual representations they have spawned. Efforts to visualize the data of the genetic code have convinced me of the following:

1. Historically, we have not had a functional way to see genetic information.
2. There must be an objective context within which we can begin mapping the components of a code, and thereby create the essential context of component relationships.
3. Symmetry plays a major role in seeing genetic information correctly.
4. Cyclic permutations of nucleotide triplets are the basic informative structure within the molecular system of protein synthesis.

The Rafiki map is useful because it unifies all visualization techniques through a mathematically unbiased contextual network of cyclic permutations.

To begin a process of visualizing information, consider a simple street map of your hometown. It is laid out on flat paper, but our razor-sharp human intelligence quickly understands that the shapes on the paper translate into a three-dimensional town of much greater size. Mapping is a process of correlating two versions of informative reality. It is a form of code. It is only slightly tougher to imagine that the orthogonal, two-dimensional information on a map is merely a subset of a vastly larger set of information on the surface of a sphere. Our street map is but a tiny window into a surface that wraps around a gigantic sphere - the earth. The fact that this information is scaled down, cut into a square and flattened onto a plane is of no concern whatsoever. The useful information the map contains is hardly affected by such transformations. However, the same cannot be said of the genetic code and protein synthesis. Transformations here are all important, if only to recognize that they have occurred in our textbooks and our thinking. Like symbols for roads, buildings and lakes on the map of your hometown, we will need symbols to construct a map of the genetic code. We will start with the following three typographic symbols:

A = Nucleic Acid - set member
B = Assignment - set
C = Codon - set permutation

Codons are permutations of three nucleic acids. They are specific configurations of a set, which in this case has three members (triplets, or triads). Permutations are actual arrangements of symbols. For instance, the sequence of symbols XYZ is a permutation of the symbols X, Y and Z. Other permutations of the same symbols are YZX, or ZYX. Taking the number of possible symbols, in this case 4 nucleic acids, and raising it to a power of the number of symbols in the permutation, in this case 3, determines the total set of permutations. Therefore, there are sixty-four codons (4^3 = 4 X 4 X 4 = 64). An assignment is a set of symbols that will be tied together somehow in an informative mapping. Nature made assignments of sets of molecules with other sets of molecules in making the genetic code. The meaning of a code is derived from its assignments.

The classic paradigm of the genetic code is called linear and one-dimensional because it is seen as an unambiguous one-to-one mapping of components with no further need for context. In the linear model, there are an equal number of codons and assignments, so there seem to be no ambiguous assignments. All nucleotides and the codons they form are assigned once and only once. There are, however, only twenty amino acids in the standard set, so looking at the process from the perspective of amino acids, there appear to be only 20 required assignments, forty-four fewer than the codons available. Most curious, n'est-ce pas? This means that there is redundancy - more vulgar, degeneracy. The term degeneracy sprang from mathematics and engineering. It is a technical term to describe a formula that has more than one valid solution. More than one codon is usually assigned to each amino acid, so most assignments appear degenerate. The only two non-degenerate assignments of amino acids are Methionine and Tryptophan. The rest are total degenerates. From this standpoint the linear model appears quite wasteful. It is, as I have been trying to tell you, a degenerate model. The genetic code has the capacity to carry more information in the form of more amino acids. Why does it not aggressively take advantage of this opportunity somewhere, in some organism?
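The degeneracy census is easy to reproduce from the standard codon table. The sketch below encodes the table as the usual one-letter string in U, C, A, G order (these are the standard assignments; '*' marks the stop codons) and counts codons per amino acid:

    from itertools import product
    from collections import Counter

    # Standard genetic code, codons ordered U,C,A,G at each position;
    # '*' marks the three stop codons.
    AA = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
    codons = ["".join(c) for c in product("UCAG", repeat=3)]
    table = dict(zip(codons, AA))

    degeneracy = Counter(aa for aa in table.values() if aa != "*")
    print(len(codons), len(degeneracy))      # 64 codons, 20 amino acids
    print(degeneracy["M"], degeneracy["W"])  # 1 1 - the only non-degenerates
    print(degeneracy["L"], degeneracy["R"], degeneracy["S"])  # 6 6 6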
There has been a lot of speculation about the meaning or the use for this redundancy, but we are far from a consensus about it, and using 64 codons to assign only 20 amino acids is an undeniably wasteful information system. Is there something we are missing? Regardless, if we begin to construct a map of the genetic code based on the tenets of the linear model, we know the exact structure and shape of the map with which we must start: a line. This is the proper context for seeing the genetic code from the perspective of the classic paradigm. Using the symbols A, B and C as described above creates the following brutish physical appearance of the linear model.

The breadth of symbols discourages their presentation in an actual linear format - otherwise they would be illegibly small. One must use the scissors of imagination to carve this page into a series of strips that are attached end-to-end, creating a single genetic code tape. Only then can we appreciate the literal shape of the linear code. A two-dimensional grid is a more familiar presentation of assignments. I, for one, have never actually seen the linear model presented as a line. This is probably because it is not particularly useful, except to demonstrate how useless it is. However, for the sake of rigor I will do just such a thing here. (below is the tape of the above segments placed end-to-end.)

See how useless it is? Of course the use of this format is explicit from the model. A linear code contains only one dimension of information in assignments, and finding additional meaning in those assignments is therefore verboten; otherwise they are no longer one-dimensional. We are explicitly told to view the assignments as one-dimensional, arbitrary and meaningless. Any pattern that might appear in the tape must be rejected as a chimera. For if it is linear and arbitrary then it can be broken apart, reassembled and presented in any alternate fashion. There is absolutely no criterion on which to judge the correctness of one linear presentation of assignments over another; therefore, none can be best. The above tape could just as easily be rearranged and start with the following sequence, so long as all the numbers add up at the end.

The number of potential arrangements is extraordinarily large, and the basic problem is that we need to find a way to order, or weight the information so that we can find a pragmatic structure into which we can place and view it. This structure, whatever it is, will provide the context for seeing the information as we study the code and its functions. Science has chosen a grid, and it is most curious that the conventional two-dimensional codon table has somehow drawn a pass on the constraint of linearity and arbitrariness. It is so common to see the genetic code presented in a grid that the rigorous academic requirement for a linear disclaimer has somehow evolved out of the system. I never see the assignments accompanied by the warning: This table is manmade - do not recognize patterns.

Pattern recognition in the genetic code data grid should be dogmatically forbidden fruit. After all, any pattern in the assignment of amino acids to codons must be rejected as meaningless and arbitrary. Certainly no good scientist would take a pattern recognized in this correlation table and use it as the basis of further work. That is tacit approval of a non-linear model. After all, a grid is in fact a non-linear model. A pattern in the grid would be a sign that there is more than one dimension of meaning in the assignments of codons to amino acids. This would be a second degree of freedom in the mechanism that assigned amino acids within the code. It would be a sign of at least a second dimension of information, and perhaps more. Assignments either have one dimension of meaning or they don't. Which is it? If there is truly only one dimension of information in the code then patterns in the data are worthless. We know that there are patterns in the code, and they are not worthless, so how many levels of assignment, degrees of freedom or dimensions of meaning are in the code? I am going to start with an excellent textbook presentation of the genetic correlation table and make it more excellent. I will start here and apply my skills acquired in a wasted youth as a graphic artist. The premise of the exercise is to imagine how we see the genetic code as it is presently taught, and then explore various ways to see it differently.

This two-dimensional grid clarifies yet confuses. I will tease out some of the assumptions and information that are contained in it, starting with the mathematical tape that is the linear model. The tape shows a need for 192 nucleic acids (64 codons X 3 nucleic acids per codon). A quick count of the above nucleic acids shows me only 24. Where are the other 168 nucleic acids? I am not a total idiot - I am just being coy. I know that they are stacked on top of each other. It would be a grievous no-no to suggest that we could have somehow jettisoned 168 required nucleic acids from the sacred code. This would imply double duty for at least some nucleic acids, but these identical molecules would be doing their duties in an imbalanced way, so then we must determine which nucleotides in a codon would be assigned to which duties. These apparent missing nucleotides are merely illusions created by the conveniences of our written notation - a sacrifice to typographic efficiency. In fact, the table is nothing but a convenient presentation of data. It was more or less accidentally presented this way, and since we find it somehow useful, it has become frozen in our system of seeing the code. Our use of this table is a rare case of an actual frozen accident in the genetic code! But it was our conscious decision to make it such, not a natural process that requires it.

Another feature of the above table is the obvious grouping of assignments. It seems that there is some mystical force in our arbitrary universe that is compelling assignments to gravitate to regions of an arbitrary (and forbidden) space. Assignments of amino acids appear to be clumping within sections of the table. A man of lesser discipline would be tempted to succumb to the temptation of inferring meaning from these groupings, which would amount to a second dimension of meaning in a one-dimensional process. Fortunately, I am a man of lesser discipline, and I'm not only willing to assign meaning, I'm willing to make more groupings. (See how much fun science can be if you're willing to ignore dogma.) Let's start by rearranging the above grid into another grid. Like potato chips, you can never have just one. The grid shows obvious relationships between codons and amino acids, but what are the relationships between amino acids, nucleic acids and their codons? In fact, it would be nice to know the inter- and intra-relationships between all components of the system. Starting with amino acids, the most likely relationship between them will somehow involve water, because water is so much a part of living systems. How well a molecule interacts with water is called its water affinity, and it can be measured in a number of ways. I created my own stylized grid, and since color is king, let's add some color. In my world of visualizing abstract notions, color is a better choice for symbols than are typographic glyphs, like A, B and C. Now we can really start to see some patterns.

Just as the street map is a tiny window into a sphere, this table is a tiny window into another informative surface. From this table we can begin to develop the necessary symbols to illuminate that surface. In other words, we can begin doing some real work with patterns now by saying it with color, saying it strictly with color.

Here is the tape, or table, in living color. And while we're at it, we might as well take advantage of all the physical dimensions that God has given us. Two dimensions are nice, but three are better.

Before we lose our alphabet entirely, and with it English, becoming mired in the insanity of Rafiki-speak, let's bring back an old friend, the textbook table, to combine what it shows us with what we have just created. This is exciting, isn't it?

We are finally in a position to go looking for those 168 missing nucleic acids.

There they are, all 192 of them, but why are they all treated differently? Why is the third position chopped up into 4 stacks? DUH! The page only has two dimensions, Homer, and we need three. But we now have three!

Look at all the pretty colors - they must mean something. How could a one-dimensional, arbitrary and meaningless process produce such a beautiful three-dimensional pattern? Remember, those colors have meaning to those amino acids - they represent relative water affinity. It's a shame that we're technically not allowed to use it for anything. Things just get curiouser and curiouser all the time.

What is a pattern anyway? In this case a pattern is an arrangement of symbols and colors. Through combinations amongst and between each group of symbols a total pattern emerges. This pattern turns on the concept of neighbor, but what is a neighbor? For our study of the codon table, a neighbor is technically a next-to, as in this is next-to that. In the above pattern there are hundreds of different types of actual and potential next-tos. There are colors next-to each other in a hierarchy of color. There are shapes next-to each other. There are categories of shapes next-to each other and next-to other categories. There are rows, columns and layers of next-tos. But the one-dimensional dogma destroys the concept of next-to, doesn't it? I mean anything next-to anything else is just random. There is no real next-to in the genetic code, is there? All of the next-tos above were created by mathematical weighting, or observer bias. There should be no meaning to the pretty patterns that we have just created, unless there is meaning in the observer bias upon which we've stumbled. Demons exist. Only malicious demons would create this beautiful collection of accidental, meaningless next-tos to tempt us into forbidden territory. We won't go... we can't... screw it. Let's go - we can make better next-tos than this.

The problem with the above next-tos is precisely that they are biased by us, the observer, and these biases are clearly unequal. It is politically incorrect to allow unequal anything, let alone unequal next-tos. Notice that corner amino acid assignments in the pattern have only three next-tos, edges and sides have four, and middles have an embarrassingly capitalistic six next-tos. I will fight for the corners, edges and sides, and tax the middles. Sorry middles, get over it. This brings up a very interesting problem: what determines the next-tos? Take the case of amino acids: what determines which amino acid falls on a corner, edge or middle, and therefore determines the quantity and type of an amino acid's next-tos? Unfortunately - or fortunately - it is another set of next-tos. Specifically, it is the nucleotide next-tos. The specter of recursion looms large, but fear not, fearless traveler. This grid - the textbook table - explicitly demands that nucleotides must be next to other nucleotides. This is the symbolic shorthand that allows us to generate permutations. By taking one symbol from the left we are given a choice of four symbols from the top, which in turn opens up another choice of four from the sixteen symbols on the right. There is a channeling process at work here, but it is never addressed by the dogma. It is actually a thinly veiled systematic, mathematical weighting of codons. By picking any nucleotide we are limiting our choices of nucleotide next-tos. The nucleotides in the grid are differentiated and stratified according to position in the codon. For this textbook presentation, we only need four literal symbols for each of the first two positions, but we need sixteen for the third position. Ultimately, this channeling process will allow us to arrange the entire genetic code along the linear genetic map.
In other words, weighting data in this way means that there is a first, second, third and last codon in the table. This table actually weights codons from first to last. Trivial? Not from the standpoint of the presentation and the pattern it demonstrates. If patterns are the goal, it is far from trivial, because we have no way to know if a pattern is natural or man-made. Changing the weighting will create an entirely different pattern. Furthermore, it appears that the 2nd position dominates the color assignments. What's up there? I suppose we might tiptoe past this graveyard with the caveat that the channeling is an illusion, but let's have a closer look at the weighting process before we decide. The table is merely a specific instance of the tape in the linear model of the genetic code. Somehow the codons are assigned values around which the table is arranged. It turns out that the source of the values in this table is a simple formula that requires two biases. The first bias is a weighting of the nucleotides, which we will call the nucleotide values. The second bias is a weighting of the positions of the nucleotides in the codon, which we will call the position values. The assigned value of the codon then becomes the sum of the nucleotide values times their position values, which can be written as follows:

We can easily see that this formula will produce the textbook grid from the genetic tape. If we recreate the grid and display all of the values we will finally see the bias at work.
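Because the formula is so simple, it can be sketched directly. Here is a minimal sketch in Python; the nucleotide values (U=0, C=1, A=2, G=3) and position values (16, 4, 1) are my assumptions, chosen because they reproduce the familiar top-to-bottom, left-to-right order of the textbook table:

from itertools import product

NUCLEOTIDE_VALUE = {'U': 0, 'C': 1, 'A': 2, 'G': 3}   # assumed nucleotide values
POSITION_VALUE = [16, 4, 1]                            # assumed position values

def codon_value(codon):
    # The formula: the sum of each nucleotide value times its position value.
    return sum(NUCLEOTIDE_VALUE[base] * POSITION_VALUE[i]
               for i, base in enumerate(codon))

codons = [''.join(t) for t in product('UCAG', repeat=3)]
print(sorted(codons, key=codon_value)[:4])   # ['UUU', 'UUC', 'UUA', 'UUG']

Sorting all sixty-four codons by this value walks straight through the textbook grid, which is the bias at work: the grid is just this weighting, laid out on paper.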

But since there is no meaning to the assignments, we are free to reassign these weights however we like. It is easy to change the position values and produce an entirely different grid.

There are still some patterns here, but they have taken a considerable hit, and this is just one of twenty-four permutations that can use these arbitrary values. Furthermore, there is a very bold unspoken assumption in these tables, namely that position is more important than identity when considering the value of a nucleotide in a codon. But since this table business is meaningless, there is absolutely nothing stopping us from coming up with any values that strike our fancy. Our one-dimensional system allows for no second dimension on which to evaluate any weighting scheme. In fact, if we take the linear crowd at their word, the most appropriate arrangement would be one that is completely arbitrary, such as this:

Now the patterns have disappeared completely. Of course this is what we expect from an arbitrary process - arbitrary results. However, if we don't find arbitrary results, how can we conclude it was an arbitrary process? Any grid that we contrive will have contact with this non-linear heresy, so we cannot prohibit something - multi-dimensional pattern recognition - then ignore the prohibition ourselves, and then fail to examine the unspoken assumptions inherent in the process. Clearly the nucleotide next-to-ness is influencing the table and somehow forming pretty patterns in the assignments. Furthermore, good scientists - ironically some of the same ones that forbade pattern recognition by insisting on a linear code - have used these patterns to form conventional theory, specifically the theory of wobble. The wobble hypothesis is nothing but a pattern recognition theory and therefore contradicts the premise of a one-dimensional model. The implication is that wobble somehow acted as a force in shaping the genetic code. This is necessarily a second dimension in an otherwise one-dimensional process. Each codon in the wobble model has amino acids and wobble partners assigned to it. It is the wobble partner assignment that theoretically shows up in the two-dimensional grid of data in the form of amino acid assignment clumping. Wrong.

Our task at hand then is to remove weighting from the presentation of the data, so any pattern popping out must be natural instead of man-made. How can we un-weight a table? We can't completely, but we can start with some rules.

Rule: All nucleic acids will be treated equally.

This must be so if weighting is to be removed, and there are important consequences of this rule. There was never a contrary rule to my knowledge, but somehow that's how it worked itself out in the grid. I'm not sure anyone ever cared enough about unequal treatment of nucleic acids to even notice. Nonetheless, if all nucleotides are equal then we don't need whole stacks of 16 equal nucleotides. All we need is one from each stack, and we can discard the rest. That leaves us with twelve. Each of the 192 nucleic acids on the tape is a symbol that can have one of four values. Now there are only twelve symbols, and they must account for at least twenty assignments. If we spend two symbols per assignment we end up with only 16 codons (4²), but this clearly is not enough codons. If we decide to spend 3 symbols per assignment we generate a wasteful 64 codons. Since we had unlimited nucleic acids in the past we took that bargain. The times they are a-changin'. What if the bargain was without waste instead of wasteful - how would it work? Start by assuming that nucleotide assignment is precious. We now only have 12 nucleotides to spend, and we must achieve at least 20 assignments. This means that each nucleotide must be spent 5 times ((20 × 3) / 12). This is an awful lot to ask of a nucleotide.

A1 = (B1, B2, B3, B4, B5)
A2 = (B1, B2, B6, B7, B8)
A3 = (B2, B3, B8, B9, B10)
A4 = (B3, B4, B10, B11, B12)
A5 = (B4, B5, B12, B13, B14)
A6 = (B1, B5, B6, B14, B15)
A7 = (B9, B10, B11, B16, B17)
A8 = (B7, B8, B9, B17, B18)
A9 = (B6, B7, B15, B18, B19)
A10 = (B13, B14, B15, B19, B20)
A11 = (B11, B12, B13, B16, B20)
A12 = (B16, B17, B18, B19, B20)
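Before going further, it is worth checking the arithmetic of this set. A small sketch (the numbers simply transcribe the A-list above) confirms that each of the twelve symbols is spent exactly five times and each of the twenty assignments is touched by exactly three symbols:

from collections import Counter

A_SETS = {
    'A1':  [1, 2, 3, 4, 5],     'A2':  [1, 2, 6, 7, 8],
    'A3':  [2, 3, 8, 9, 10],    'A4':  [3, 4, 10, 11, 12],
    'A5':  [4, 5, 12, 13, 14],  'A6':  [1, 5, 6, 14, 15],
    'A7':  [9, 10, 11, 16, 17], 'A8':  [7, 8, 9, 17, 18],
    'A9':  [6, 7, 15, 18, 19],  'A10': [13, 14, 15, 19, 20],
    'A11': [11, 12, 13, 16, 20],'A12': [16, 17, 18, 19, 20],
}

counts = Counter(b for bs in A_SETS.values() for b in bs)
assert all(len(bs) == 5 for bs in A_SETS.values())  # each symbol spent 5 times
assert all(counts[b] == 3 for b in range(1, 21))    # each assignment uses 3 symbols
print(sum(counts.values()))                         # 60 = 12 x 5 = 20 x 3

Twelve symbols of five, twenty assignments of three, sixty incidences either way - the bookkeeping balances.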

The inversion of thinking here is that each nucleic acid participates in multiple assignments. The conventional thinking is that each amino acid should participate in multiple codon assignments, but that approach requires that selection of nucleotides and codons from all possible configurations is somehow pre-ordained. The mystical assignment process must have involved both nucleotides and amino acids simultaneously converging on codons, because codons could not have otherwise existed in any meaningful way before this assignment lottery occurred. It must have been a universal high-wire act of balancing molecular forces - all molecular forces. The classical perspective seems to assume some safe haven for the numbers in the system before the assignments were made. What is the origin and nature of these relationships? When nucleotide identities and positions within a codon are considered together, the Rafiki model is covered by 12 symbols. Of course, each assignment must also be associated with three nucleotides, and a careful analysis of the above set of relationships shows that this is true. This means that if a nucleotide, adenine for instance, can be plugged into A1 it can be plugged into any or all of the other 11 symbols as well.

Nucleic acid triplets have 6 permutations as follows:
Permutation #1, P1 = 1, 2, 3
Permutation #2, P2 = 2, 3, 1
Permutation #3, P3 = 3, 1, 2
Permutation #4, P4 = 1, 3, 2
Permutation #5, P5 = 3, 2, 1
Permutation #6, P6 = 2, 1, 3

This is a cyclic permutation set of three members. It implies that in our new model we must accept that there are 6 permutations of all possible nucleic acid triplets, including seemingly trivial cases such as (Adenine, Adenine, Adenine). Each assignment represents a collection of all permutations of the three nucleic acids that are related to it. We have no way of knowing which triplets to discard in cases of redundancy within the cyclic permutation. The potential codon count seemingly goes to 120.

B1 = P(A1, A6, A2)
B2 = P(A1, A2, A3)
B3 = P(A1, A3, A4)
B4 = P(A1, A4, A5)
B5 = P(A1, A5, A6)
B6 = P(A2, A6, A9)
B7 = P(A2, A9, A8)
B8 = P(A2, A8, A3)
B9 = P(A3, A8, A7)
B10 = P(A3, A7, A4)
B11 = P(A4, A7, A11)
B12 = P(A4, A11, A5)
B13 = P(A5, A11, A10)
B14 = P(A5, A10, A6)
B15 = P(A6, A10, A9)
B16 = P(A7, A12, A11)
B17 = P(A7, A8, A12)
B18 = P(A8, A9, A12)
B19 = P(A9, A10, A12)
B20 = P(A10, A11, A12)
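The count of 120 can be checked the same way. A short sketch, reusing the twenty P(...) triples above, expands each into its six orderings:

from itertools import permutations

B_TRIPLES = [
    ('A1','A6','A2'),  ('A1','A2','A3'),  ('A1','A3','A4'),  ('A1','A4','A5'),
    ('A1','A5','A6'),  ('A2','A6','A9'),  ('A2','A9','A8'),  ('A2','A8','A3'),
    ('A3','A8','A7'),  ('A3','A7','A4'),  ('A4','A7','A11'), ('A4','A11','A5'),
    ('A5','A11','A10'),('A5','A10','A6'), ('A6','A10','A9'), ('A7','A12','A11'),
    ('A7','A8','A12'), ('A8','A9','A12'), ('A9','A10','A12'),('A10','A11','A12'),
]

all_perms = [p for triple in B_TRIPLES for p in permutations(triple)]
print(len(all_perms))   # 120: six orderings of each of twenty triples

Twenty assignment sets times six permutations apiece gives the 120 potential codons.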

The danger here is in failing to recognize the meaning of any assignment within this system. We started with only 20 required assignments, because that is what the empiric evidence suggested that we do. But our assignment process immediately yielded multiple potential meanings for each triplet depending on its context within the model. Notice that the set member represented by A1 participates in five of the assignments, and for each of these A1 is the initial base in the assignment permutation exactly twice. It also is the second and third base exactly twice. In this way every member plays a balanced role in the system.

A1 = (B1, B2, B3, B4, B5)
B1 = (A1, A6, A2), (A6, A2, A1), (A2, A1, A6), (A1, A2, A6), (A2, A6, A1), (A6, A1, A2)

This holds true for all of the 12 nucleotides and their related assignments, so each base is a primary initiator of five codons and a secondary initiator of five codons. Therefore, there are sixty primary initiators and sixty secondary initiators. We will assign each permutation a label so that we can demonstrate each symbol's role as initiator, for example C1 = (A1, A6, A2) and C61 = (A1, A2, A6):

Primary initiators
A1 = (C1, C2, C3, C4, C5)
A2 = (C6, C7, C8, C9, C10)
A3 = (C11, C12, C13, C14, C15)
A4 = (C16, C17, C18, C19, C20)
A5 = (C21, C22, C23, C24, C25)
A6 = (C26, C27, C28, C29, C30)
A7 = (C31, C32, C33, C34, C35)
A8 = (C36, C37, C38, C39, C40)
A9 = (C41, C42, C43, C44, C45)
A10 = (C46, C47, C48, C49, C50)
A11 = (C51, C52, C53, C54, C55)
A12 = (C56, C57, C58, C59, C60)

Secondary initiators
A1 = (C61, C62, C63, C64, C65)
A2 = (C66, C67, C68, C69, C70)
A3 = (C71, C72, C73, C74, C75)
A4 = (C76, C77, C78, C79, C80)
A5 = (C81, C82, C83, C84, C85)
A6 = (C86, C87, C88, C89, C90)
A7 = (C91, C92, C93, C94, C95)
A8 = (C96, C97, C98, C99, C100)
A9 = (C101, C102, C103, C104, C105)
A10 = (C106, C107, C108, C109, C110)
A11 = (C111, C112, C113, C114, C115)
A12 = (C116, C117, C118, C119, C120)

Each assignment set has six permutations:

B1 = (C1, C28, C9, C61, C88, C69)
B2 = (C2, C8, C14, C62, C68, C74)
B3 = (C3, C13, C19, C63, C73, C79)
B4 = (C4, C18, C24, C64, C78, C84)
B5 = (C5, C23, C29, C65, C83, C89)
B6 = (C10, C27, C41, C70, C87, C101)
B7 = (C6, C45, C37, C66, C105, C97)
B8 = (C7, C36, C15, C67, C96, C75)
B9 = (C11, C40, C32, C71, C100, C92)
B10 = (C12, C31, C20, C72, C91, C80)
B11 = (C16, C35, C52, C76, C95, C112)
B12 = (C17, C51, C25, C77, C111, C85)
B13 = (C21, C55, C47, C81, C115, C107)
B14 = (C22, C46, C30, C82, C106, C90)
B15 = (C26, C50, C42, C86, C110, C102)
B16 = (C34, C56, C53, C94, C116, C113)
B17 = (C33, C39, C57, C93, C99, C117)
B18 = (C38, C44, C58, C98, C104, C118)
B19 = (C43, C49, C59, C103, C109, C119)
B20 = (C48, C54, C60, C108, C114, C120)

Although we achieved an absolute reduction from 192 to 12 nucleotides, we also note a peculiar increase in the number of required permutations from 64 to 120. This is due to the model's inability to distinguish between seemingly trivial permutations. However, this new model is not a two-dimensional, one-to-one, sequestering grid; it is a multidimensional inter-relation network, which we will call an identity network. It is not unreasonable to suspect that within a network the seemingly trivial permutations actually could have unique meanings depending on their context. We have a network capable of presenting any and all of the required permutations. It differs from the conventional grid on the important issue of bias; specifically, it can present the data without weighting the nucleotides. One glaring drawback: unlike a grid, the identity network does not lend itself to two-dimensional schematic representation. However, what it lacks in 2D it makes up for in 3D. We can easily use these relationships to generate a dodecahedron or an icosahedron, as they are dual to each other. In fact, the concept should be interpreted as a sphere, but polyhedrons are more effective when given a flat starting medium such as paper.

Diagram of the symbolic relationships in the Rafiki model

A full appreciation of the relationships in this identity network requires that the diagram be cut and folded. Mike McNeil, a Rafiki sympathizer, brings up an interesting point about the above symbol identifiers. Mike is a bright guy, and a patent attorney, skilled in distilling an idea. He points out that since we are dealing with a known system of symbols with only four identities, the symbol identifiers should reflect this. In other words, perhaps A1-A12 is sub-optimal. What about 4 sets of 3 inter-related identifiers: A1₂,₃,₄, A2₁,₃,₄, A3₁,₂,₄ and A4₁,₂,₃? Thanks for the additional burden, Mike. Fortunately, we can deal with this quite nicely, and both systems are informative in different circumstances. I feel that he has identified an actual truth in the new universe of yet underdeveloped combinatorial mathematics and the genetic code. In honor of this I will refer to these subscripts as McNeil subscripts. For now we will stick with our symbols of the non-McNeil variety, which is to say no subscripts at all.

We are finally able to return to the task at hand - making pretty patterns. We now have a weightless presentation format for our data. It is an unbiased permutation grid in three dimensions, and we can use it to see what kind of patterns nature has given us. We start by examining some of the unexpected curiosities in the new model. Although nucleotides have become equal, triplets have become decidedly unequal. There are now three classes of triplets: primary, secondary and tertiary. Since color is king, let's assign some colors to these classes. Unfortunately, there is only one rainbow, so we are going to have to re-use amino acid and nucleotide colors. Please try not to get confused by this.

If we add the nucleotide initials to each permutation we can see how each triplet generates six permutations, but the triplets are not homogeneous in their behavior.

When all possible permutations are present, the structure contains 4 primary, 12 secondary and 4 tertiary triplets. If we stylize these - and we should because we can - they look like this:

Triplets can now assume one of three classes, and we notice that within each class there are different types of permutations based on the class of the triplet. The Rafiki model contains the following distinguishable permutations:

If we combine the model with the color-is-king style, we produce the following two-dimensional map of the identity network of permutations. This is merely an un-weighted mathematical treatment of a 4³ permutation set.

We could do the same with a dodecahedron, but let's go bold; let's go 3D. In 3D the dodecahedron-icosahedron debate can be mooted by mapping to a sphere.

This is merely a structure for holding data. It is a receptacle into which we can place any appropriate permutation data, such as the assignment data of codons and amino acids. (The I Ching fits well also, by coincidence.) This is essentially an unbiased view of data in our search for nature's patterns. Apparently, the demons were in full malicious mode when they capriciously scattered their arbitrary and meaningless assignments across our unanticipated new model. Look at our grid in terms of permutation class and type.

To my mind, this is the first visual glimpse of the idea that cyclic permutations of nucleotide triplets - codon types - can provide informative units of genetic translation. Patterns in codon types are everywhere, I suspect, because these units carry genetic information - a signal with six separate channels, so to speak. We started with the genetic code, or I should say the codon table, a linear phenomenon that has been deemed arbitrary and meaningless. We arranged the line in a two-dimensional grid, theoretically a no-no, and started to notice some patterns. On the strength of this, we further arranged the grid into three dimensions and saw - guess what - more patterns. This opened a whole new space for investigation, the network space. In the network space several curious things happened. Nucleotides equalized and we jettisoned 180 that were no longer required. Triplets became combinatorial, and codons became differentiated based on their generative triplet and their location within that triplet.

The formal recognition of a differentiated codon should have absolutely no meaning, and certainly no predictive value in the real world of codon assignment, right? There should be no pattern whatsoever based on such a ludicrous stratification, certainly no meaningful pattern. The whole thing is arbitrary and meaningless, so no pattern within it can be strategic or meaningful to the genetic code. Really? Actually, the opposite is true - this is the only pattern that is not suspect. It is the only way to present the data in an unbiased, mathematically unweighted format. The patterns seen in this presentation are the only ones that we can really trust. If we see a pattern here, some force of nature must have put it there. The reason that networking the assignment table generates patterns that correlate across seemingly hokey parameters, such as codon differentiation, is that the assignment table is not linear, not one-dimensional, and not arbitrary, as dogma has insisted. The assignment of codons to amino acids is only a part of a larger system that is in fact a network - a system heretofore widely studied and cherished but poorly understood. We casually refer to codon-amino acid assignments as the genetic code, but this is incorrect. The code is more robust than the narrow dogmatic view. Trying to look through the dark lenses of a linear model will destroy our ability to appreciate pretty colors making beautiful patterns. The Rafiki model treats the genetic code as a balancing network of inter-related components. Amino acids, nucleotides, and triplets are all inter-related. Amino acids cooperate with each other by, among other things, logically distributing themselves uniformly across the network of nucleotides.

Ironically, now that we have spent all this time un-weighting the data, the challenge is to re-weight it. It is useful to have a hierarchy of some type across all 64 codons so that we can see how they all relate to each other. Regardless of the weighting strategy, it seems logical that the data should be viewed within the structure of triplet permutations. Let's see what we can come up with. The most logical first effort is to use the weighting formula of the standard table, but plug in different values, ones that are consistent with the empiric observations of assignment data.

This is the weighting of the data used to generate the classic table. What if we change the weighting? Plugging new nucleotide values into the formula gives us an entirely new table.

Merely by juggling the weighting we have created a rainbow of water affinities, but we can do even more by re-proportioning the weights. We will use the following values relative to the classic table:

This data hardly looks more interesting than the data we had before, but it is. When we present it as a spectrum, according to the relative weight of every codon, we clearly see the pattern of a rainbow.

I did not make this rainbow. The genetic code assignment process made the color progressions and packaged them into the white light we normally see. This is a water affinity rainbow within the genetic code. The conventional textbook grid is a similar, but less effective mathematical presentation of assignments. It is a partial filter that produces a stippled pattern of color. I merely acted as a prism to spread out the white light into its full spectrum.

It is true that humans will always see what they want to see, and I am human. Actually, we see what we must see under any given set of circumstances. Scientists are no different, and most models are constructed to show an anticipated result. This rainbow was created by a mathematical formula, but let's be clear about the origin of the pattern and the mathematics behind it. Both the conventional table and the rainbow above were generated by the same, simple mathematical formula. The differences are due to the values inserted into the formula, and the final presentation format. What good are they? This rainbow is lots and lots and lots of fun. The first thing we can do is break it apart and see how it was put together. We can examine the components that make up the weighting and, in turn, the rainbow.

See any patterns here?

What about now? With this formula we can stretch out the classic table and see that each nucleotide provides a different channel to the overall signal that is a rainbow. There is an adenine channel, as well as cytosine, guanine and uracil channels in the data rainbow.

Four tiny rainbows are interwoven to make one large rainbow. It is almost impossible for me to believe that this level of assignment carries no meaning in the genetic code. Water affinity is a global force forming the pattern of the code, but it must be only one of many. See folding a rainbow. The bigger issue is in seeing multiple meanings in every nucleotide of every codon. This is probably what the code is doing, and probably how the code was formed. The weighting of the above rainbow is based on water affinities, primarily that of the second nucleotide in each codon. Is it possible that there are multiple dimensions of genetic information in this system? Did the genetic code actually assign several amino acids to each of the four nucleotide types relative to all possible contexts within a nucleotide sequence?

We can see from this table that it is actually more informative to know the nucleotide missing from an amino acid assignment pattern than it is to know the third nucleotide in a single codon. The third position is the final two bits of context, but what if we subtract it out of our weighting?

We have now created a hierarchy of sixteen multiplets that correlate to the sixteen multiplets of the Rafiki map.
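Using the same assumed values as the earlier weighting sketch, zeroing the third-position weight collapses the sixty-four codons into exactly these sixteen multiplets:

from itertools import product
from collections import defaultdict

NUCLEOTIDE_VALUE = {'U': 0, 'C': 1, 'A': 2, 'G': 3}   # assumed values
POSITION_VALUE = [16, 4, 0]   # third-position weight subtracted out

multiplets = defaultdict(list)
for t in product('UCAG', repeat=3):
    w = sum(NUCLEOTIDE_VALUE[b] * POSITION_VALUE[i] for i, b in enumerate(t))
    multiplets[w].append(''.join(t))

print(len(multiplets))   # 16 multiplets of four codons each
print(multiplets[0])     # ['UUU', 'UUC', 'UUA', 'UUG']

With the third position silenced, a codon's weight depends only on its first two bases, so four codons at a time share one rank.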

Each multiplet groups with three others to form a major axis around the four nucleotides in the code. All of the codon types are equally distributed in each major nucleotide axis.

With this weighting formula and by eliminating the weight of the third nucleotide in each codon we have assigned a numerical value to each multiplet in the code. We can apply a color to each multiplet to see how these multiplets are distributed within the range of assignments.

This table is an alternate view of the Rafiki map. We have merely recognized the multiplets and found a way to rank them. When we use the colors created by this ranking of multiplets to view the actual codon data we see the following pattern in the assignments.

The strength of this pattern is compelling, but it is merely another way of viewing the clumping of assignments we are accustomed to seeing in the conventional table. The rainbow appears to have been diced up by this procedure, but this is merely an illusion caused by several complex factors. The rainbow returns when we arrange the major axes of the Rafiki map according to the rank of their homogeneous multiplets.

The really interesting thing now is that we start to see some method to the madness of START, STOP, water affinity and the placement of the eight perfect multiplets in the very center of this rainbow. Methionine can be seen at the far left, STOP is at the far right, and the greatest symmetry of assignments is in the very middle, surrounding proline and glycine. Make no mistake about what we are looking at - I did not artificially create this pattern. I am not clever enough to make this pattern, but nature is. The most clever thing that nature did with this particular rainbow was to use it to anticipate frameshifts in nucleotide sequences. A forward shift is well behaved, collapsing into a single multiplet. A backward shift is spread across four multiplets. We can visualize how a frameshift is anticipated by the code by rotating the reading convention on the Rafiki map (a huge advantage of the Rafiki map). This is how the multiplet rainbow appears in anticipation of a backward frameshift.

What this means is that the code is laid out so that backward frameshifting anticipates water affinity to a remarkable degree. Any lingering doubts about the importance of water affinity in forming the structure of the code should be erased by this demonstration. This is also a perfect illustration of what I mean by symmetry in the code. Only symmetry will allow such a mathematical trick of assignments. Twenty amino acids is an optimized number from the standpoint of frameshifting. Another fun thing to do with the weighting formula is to isolate the 20 assignment sets from which we generate all cyclic permutations. This is easy to do by plugging in equal values for all position values in the triplet, as in the sketch below.
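A quick sketch shows the collapse. Equal position values make a codon's weight blind to the order of its bases; the power-of-four nucleotide values here are my own assumption, chosen only so that no two different base combinations can produce the same sum:

from itertools import product
from collections import defaultdict

NUCLEOTIDE_VALUE = {'U': 1, 'C': 4, 'A': 16, 'G': 64}   # assumed values

groups = defaultdict(list)
for triplet in product('UCAG', repeat=3):
    weight = sum(NUCLEOTIDE_VALUE[b] for b in triplet)  # position-blind sum
    groups[weight].append(''.join(triplet))

print(len(groups))                      # 20: the twenty assignment sets
print(max(groups.values(), key=len))    # a tertiary set of six permutations

The sixty-four codons fall into exactly twenty groups: four primary sets of one, twelve secondary sets of three, and four tertiary sets of six.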

Again, there doesn't appear to be any pattern of interest here, but we can put them back into our modified assignment table. Since no amino acid is assigned more than six codons, we can distribute all of the codons into six channels within the table.

I have absolutely no idea what this means, but it seems that some force in nature has tried to organize this data in some meaningful way. Perhaps an alternate view will make this more apparent. Consider a signal represented by a three-axis graph where each axis can carry two channels.

We can place the weighted data from the genetic code into this graph and perhaps begin to get a feel for the pattern in the code. Again, we are doing nothing more than looking for a way to rearrange the data in the classic table by accepting the formula behind the table, but juggling the bias.

Having played with this codon weighting formula - too much - I searched for a more logical way to weight the codons. I want to try to see a codon in the same way that the code can see differences between molecules. The genetic code is nothing more than a balancing act of relative forces. What are the forces that are balancing, and how does their appearance change from one molecule to the next? What are the physical toeholds available to make assignments in the code to begin with? The disproportional nature of nucleotide position suggested an alternate way of weighting codons for visualization. We can move a step closer to seeing this by recognizing that there is a physical difference between a nucleotide as a molecule, and three nucleotides as a codon molecule. We cannot merely string three nucleotides together in our minds; we must logically combine them somehow to form a codon. The three different positions in the codon contribute disproportionately to the assignment of a codon: B2 > B1 > B3, so a type of codon weighting formula is needed to reflect this difference between nucleotide positions. We can find a nifty mathematical way to do this by interpreting every codon as a continued fraction. The weight of every codon is calculated as a continued fraction (CF) with the following formula:

As a continued fraction each codon receives three values: a numerator (N), a denominator (D) and a decimal value (Dec). By using a weighting scheme of continued fractions we get an absolute and relative size, and each codon takes on internal relative proportions. Rectangular icons that graphically depict a continued fraction can now illustrate these mathematical relationships between all codons and their parts. I have chosen a convention where the portion of the rectangle contributed by the first part of the fraction is yellow, the second part is blue and the third part is red. These icons provide a quick visualization of all sixty-four codons and all of their parts relative to each other. One last observation that ties codons back into the un-weighted Rafiki arrangement of them: the Rafiki map places a network of twelve nucleotides in a dodecahedron. Therefore, all of the components of the map can be related to each other by powers of the golden mean, which is an infinitely repeating continued fraction where all values in the fraction equal one, i.e. [1; 1, 1, 1, ...]. So the logic of representing code components as continued fractions has more than one potential application.

The actual assignments are remarkable for their consistency with respect to water affinity, and for the similarity of patterns across all six codon classes and types. One must wonder whether there are distinguishable physical differences between triplet classes and codon types of which the code takes advantage. It appears as if the code logically assigned at least one amino acid from each part of the hydropathy spectrum to all six codon types. It certainly gives the appearance of a harmonic music scale played out in water affinities and codon classes. This is another glimpse into how symmetry should appear in the code. From a coding perspective, each codon appears to have a meaning based on all of its properties: size, shape, and consistency of parts, which makes sense in a setting where a molecular code must act upon all available parameters. The code in some sense has the ability to say the same thing in six different ways, at least with respect to relative water affinities. As if by deftly parsing the fine physical details of codons, six discrete meanings become available.
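To make the N, D, Dec triple concrete: the exact CF formula lives in the figure above and is not reproduced here, so the following is only a hedged sketch of the idea, assuming nucleotide values of 1 through 4 and the ordering [B2; B1, B3], chosen purely to reflect the stated influence ranking B2 > B1 > B3:

from fractions import Fraction

NUCLEOTIDE_VALUE = {'U': 1, 'C': 2, 'A': 3, 'G': 4}   # assumed values

def codon_cf(codon):
    # Read the codon as the continued fraction B2 + 1/(B1 + 1/B3),
    # so the dominant second position contributes the integer part.
    b1, b2, b3 = (NUCLEOTIDE_VALUE[base] for base in codon)
    value = b2 + 1 / (b1 + Fraction(1, b3))
    return value.numerator, value.denominator, float(value)

print(codon_cf('UAC'))   # (11, 3, 3.666...): numerator, denominator, decimal

Whatever the precise convention, any such scheme yields the three values described above and gives each codon internal proportions that the rectangular icons can depict.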

I love these continued fractions as representations of codons. They are so fabulous that I intuitively feel the patterns they create should be a part of how we see the code. Of course the first step is to convince people that we are not seeing the code properly in the first place. That message, I think, is getting lost in all the anger that I generate by suggesting that the classic paradigm is flawed and stereochemistry should play a role. Clearly our traditional view of the code is incomplete, with or without stereochemistry, and new perspectives are warranted. The most literal view of stereochemical visualization of the code that I have attempted involves a discrete symbolic approach to universal spatial parameters. I have called this system quantum geometry (Why? Why not.) It is a poor man's super symmetry, if you will. (If you haven't had enough of this by now then you should link with quantum geometry wackiness here.) Rigorous mathematical treatments have been applied to the data elsewhere - super-symmetry, hypercubes, etc. - but I will not visit them here. In virtually all treatments, strong patterns emerge. The key questions are how should we see these patterns and what possible meaning could they have within the code? The objection that I keep getting to these approaches is that the patterns they generate seem invalid because they involve a vigorous, subjective massage of the data. I agree, but people who lodge these complaints are missing the point entirely. The data is already being thoroughly massaged when we peek at it in the conventional view of the genetic code from the Watson-Crick codon table. The Rafiki map is the only pristine, un-weighted view of the data that I can imagine. Many of the subsequent visualization schemes here are efforts to explore how we should intelligently weight the data - if it needs to be weighted at all. I'm just calling into question whether we might legitimately conclude anything from any patterns generated by any weighting.

Symmetry's role in expanding the paradigm


The classic paradigm of the genetic code embodied in the codon table is that of a simplified computer function. The code is arranged as a look-up table, and therefore it operates as a function that accepts three integer arguments and returns an integer result.

The classic paradigm: AA = GeneticCode (B1, B2, B3)

This says that the function of the genetic code is to return an amino acid identity given any sequence of three nucleotides. B1, B2, and B3 are three variables, each representing any of the four possible nucleotide bases found in messenger RNA (A, C, G, U), and AA is a variable to store one of twenty amino acids in the standard set, plus a null value representing termination codons. So the function of the genetic code under this paradigm is one that translates sixty-four possible inputs (4³) into twenty-one possible outputs (twenty amino acids plus termination), and it must therefore impose a compression algorithm. This simple function is the essence of the genetic code as we currently perceive it, regardless of how we might structure the data. This paradigm has taken us a long way in investigations of biological information, but the classic paradigm has now clearly failed. There are many phenomena demanding explanation, but no plausible explanations exist within the simple, classic perspective; plus several harmful epistemic consequences come from adopting it so completely. To gain better understanding moving forward we must examine every detail and question every assumption that led us to this classic viewpoint in the first place. It is also useful to consider an alternative, expanded paradigm, one that views the genetic code as a function accepting an entire nucleotide sequence (NS) for input and returning a protein structure as output.

The expanded paradigm: Protein = GeneticCode (NS)

This has been the mantra throughout these pages: the genetic code builds proteins, not sequences of amino acids. The classic paradigm was historically touted as the equivalent of this expanded paradigm. In other words, the classic paradigm, when repeated on every codon in a nucleotide sequence, would produce the function of the expanded paradigm. However, in order for this to work an axiom must be added - the axiom that primary sequence (PS) determines protein structure. And so it was.

Sequential iteration of the classic paradigm:
NS = codons
PS = GeneticCode (NS)
Protein = PS
Protein = GeneticCode (NS)

In the words of Anfinsen, "It would be a considerable simplification if this statement of equivalence was true." Unfortunately, it has proven false in several ways. The most conclusive disproof is the experimental evidence that silent mutations can lead to different proteins in vivo. A NS can change but still leave PS the same, and when it does, the genetic code can actually somehow produce a different protein. It is difficult to maintain the classic paradigm in light of this evidence, because the output of the genetic code should now be seen as a discrete protein structure independent of its PS. The classic paradigm has failed because it is based entirely on codon-amino acid correlation data from the familiar spreadsheet, and this spreadsheet is missing a large amount of critical information toward making proteins. The biological information in a protein is not equivalent to that in a primary sequence, despite longstanding dogma.
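Rendered literally, the classic paradigm really is just a look-up function. A minimal sketch, with only a few rows of the standard table filled in for brevity:

CODON_TABLE = {
    'UUU': 'Phe', 'UUC': 'Phe', 'UUA': 'Leu', 'UUG': 'Leu',
    'UAU': 'Tyr', 'UAC': 'Tyr', 'UAA': None,  'UAG': None,   # None = termination
    'AUG': 'Met', 'UGG': 'Trp', 'UGA': None,
    # ...the remaining codons of the standard table would follow
}

def genetic_code(b1, b2, b3):
    # AA = GeneticCode(B1, B2, B3): sixty-four inputs compressed
    # to twenty amino acids plus a null for termination.
    return CODON_TABLE[b1 + b2 + b3]

print(genetic_code('U', 'A', 'C'))   # Tyr

Everything the classic paradigm claims about the code is contained in a function of this shape; everything the expanded paradigm adds lies outside it.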
There is far more information in a protein than in a sequence of amino acids, and the genetic code has some mechanism for translating at least a part of that additional information from NS. The simplest reason why the spreadsheet is misleading in this regard is that codons, in reality, do not select amino acids; rather they select tRNA molecules with the corresponding amino acids attached. There are far more tRNA than amino acids, and for that matter there are two and one half times more anticodons than codons. We all know what the codon table looks like, but what about tRNA and the anticodon table?

Furthermore, there is more than one way for each amino acid to be oriented within a sequence. All of these quantities go into the biological information system that comprises the function of the genetic code. Information theory is not a function of subjective value judgments; it is strictly a process of counting finite states of a system. Although there are sixty-four valid codons, there are also 160 valid anticodons. This significant step toward information expansion during translation is due to a relaxation of the Watson-Crick base pairing rules at B3, the variable standing for the third nucleotide of the codon sequence. Codons map to tRNA, and tRNA map to amino acids. Evidence strongly suggests that a change in codons leads to a change in tRNA, and a subsequent change in the final protein, despite the fact that there is no change in the corresponding amino acid. Therefore, tRNA are a legitimate part of the genetic code. The code functions primarily as a whole protein generator, not as an isolated amino acid sequencer. There are vastly more possible protein conformations than there are possible primary sequences.

Classic paradigm, where fGC(x) stands for the function of the genetic code.

If the genetic code is to be the function responsible for protein synthesis it must have a method of selecting one protein from many possible, not just consecutive amino acids from a limited group of amino acids. Expanded paradigm.

There is a fundamental choice to be made that extends well beyond semantics. Is the genetic code charged with sequencing amino acids or is it charged with making whole proteins from whole sequences of nucleotides? Regardless of the philosophical position taken, we can at least now clearly recognize that there is vital information involved in the process of making a protein that is not addressed by the classic paradigm, and the code must somehow go beyond specifying a sequence and into specifying a protein. Our thinking on this matter has been the victim of reductionistic absurdity. Mountains of evidence that disprove the classic paradigm are blithely ignored, while apologists for the reductionistic failure are many. Common sense says that nature has adapted the genetic code to perform a function of protein synthesis, not amino acid sequencing. There is a real difference, and the function of the code is more accurately viewed under the expanded paradigm.

There are clues in the pattern of codon assignments to affirm this common sense. To help appreciate this evidence, let us consider a hypothetical sequence of 101 nucleotides. Under the classic paradigm this sequence contains enough information to form 33 consecutive codons, but this limited view is inaccurate. The sequence is read in triplets, so there are actually 99 codons present in three sequences shifted by a single nucleotide relative to each other. The genetic code must assign structure to all three simultaneously. It has adapted for this, and it is optimized precisely for this task.

We will refer to the reference frame as F1 and use the middle base of a codon, B2, as the reference base. Relative to F1 there is a frame shifted forward so that B2 becomes B1. This will be F2. There is a third relative frame where B2 shifts backward into B3, and we will label this F3.

There are no absolute distinctions in nature in an entire sequence of nucleotides between any group of three nucleotides and a codon; this is a relative distinction made on the basis of a reference reading frame at the time of translation. A sequence of 101 nucleotides has the information content of three proteins - or more accurately three protein segments - containing 33 amino acids apiece. In most cases there is a different amino acid assigned at every position in each of the three segments. The expanded paradigm illustrates a more accurate notion that every nucleotide in a sequence is simultaneously assigned to three different amino acids, not just one. I am going to reiterate as a point of emphasis:

Every nucleotide in a sequence is part of three codons simultaneously.
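A small sketch makes the triple-reading concrete, using the 101-nucleotide example from above (the toy sequence itself is arbitrary):

def reading_frames(sequence):
    # Return the codon lists for the three relative frames F1, F2, F3.
    frames = []
    for offset in range(3):
        frame = sequence[offset:]
        frames.append([frame[i:i + 3] for i in range(0, len(frame) - 2, 3)])
    return frames

seq = ('UACUCA' * 17)[:101]          # an arbitrary 101-nucleotide sequence
f1, f2, f3 = reading_frames(seq)
print(len(f1), len(f2), len(f3))     # 33 33 33: the 99 codons in the text

Every nucleotide away from the ends of the sequence appears in exactly one codon of each frame - three codons in all.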


The situation is actually more complex than this, because a single nucleotide sequence is stored in a double helix, so it actually represents two nucleotide sequences - the sense strand and the anti-sense strand. The antisense strand is the complement of each nucleotide in the sense strand. There is increasing evidence that antisense reading frames of genes show up more than expected - greater than random - in reading frames on the sense strand. This actually is logical from a symmetry perspective, and it means that each nucleotide sequence really represents six potential frames: F1, F2, F3, -F1, -F2, -F3. To make it even more interesting, each sequence can be inverted, and any one of these transformations can be compounded with any other. However, for the purposes of discussion here we will focus only on the three frames of the sense strand. The arguments that apply to the sense strand can then be expanded to include the complement frames on the anti-sense strand as well. The genetic code must assign an amino acid to every codon, so it must assign each nucleotide in every sequence three times simultaneously. The superficial task of the genetic code was to distribute twenty amino acids across four nucleotides. However, the actual amino acids assigned to any single nucleotide are context dependent on the four nucleotides that surround each nucleotide - two on each side. The genetic code should be seen as a network of assignments involving all possible contexts of individual nucleotides. In a random, or near random, sequence of nucleotides there is seemingly no way to account for this amount of context, but the genetic code has managed to do so. It uses the middle base, B2, as the primary axis of assignment, and the first base, B1, is the second most influential.

The classic data is so familiar that it may be difficult to see in a new light, but ask yourself whether this is a case of assigning one amino acid to every codon, or if it is better described as assigning several amino acids to every nucleotide. This looks to me like a pattern made to cover all contexts for each nucleotide. Each of the four nucleotides stands for some general amino acid characteristics. Adenine means hydrophilic and uracil means hydrophobic. Cytosine means tighten the backbone, while guanine will generally loosen it. The amazing thing about the code is that the general characteristics of protein segments coming out of F1, F2, and F3 share far more than random elements. This is only possible through assignment symmetry. The genetic code is perfectly assigned to achieve this feat. Coincidence? One amino acid can be assigned to every codon in an infinite number of ways, but several amino acids assigned to an individual nucleotide in 592 possible contexts is tricky as hell, and the code did a remarkably good job. It is optimized for this task, but the classic paradigm makes an optimized view of the assignment pattern difficult to see. We are rightly impressed by the knowledge that the genetic code more-or-less achieved the goal of assigning one amino acid to every codon, which then deceives us into assuming that it is the ultimate function of the code. Far more impressive is the fact that the genetic code almost achieved the goal of assigning three and only three amino acids to every nucleotide in every sequence context. This is more impressive after we realize that it is numerically impossible to do this, but the code came about as close as is possible. Therefore, we might reexamine some of the basic perceptions about parameters of the genetic code.

1. There are four bases in the genetic code. False. There are five bases in the genetic code, because wobble introduces a new base and new pairing rules at B3. There are four bases upstream in mRNA, but there are five bases downstream in tRNA, so information actually expands downstream. The genetic code is an information system, and the fifth base is a vital component of that information.

2. The code operates in nucleotide triplets. True. The behavior of the code is compelling toward the notion that it is all about triplets, but we must take a good hard look at what a triplet is and what a triplet does. Triplets are especially informative in light of the cyclic permutations - codon types - that they generate. However, triplets alone cannot form peptide bonds; it takes two to tango, and they must be combined to do this.

3. Because of a functional imperative, the code cannot change. False. The code is able to change, was able to change, and will always be able to change. We must view the code not as a workable kluge but as the most highly adapted structure on the planet.

4. The code must contain exactly twenty amino acids. False. There is more than one valid anticodon for every valid codon. Therefore, no logical upper or lower boundary exists for the number of amino acids in the genetic code. The actual number might have been anything between one and infinity. The classic reasoning says that it started low and went up, but I think it more logically started high and went down. It is easier to include than restrict, so I think that precision selectivity of the code is a late evolutionary development. The explanation for twenty must be one of numerical optimization.
There is a risk and reward associated with every possible number and combination of amino acids, so there must be something really special about the number and combination in the standard set.

This expanded notion is that every nucleotide sequence gives the genetic code three (or six, or twelve) bites at the apple, or chances to make a useful protein. It can theoretically produce multiple unique proteins, or protein segments, from any nucleotide sequence, so it would logically try to somehow maximize this opportunity. Working from this platform, let's examine the evidence for how the genetic code went about its task of assigning three amino acids to every nucleotide in a sequence. In an ideal world, each nucleotide would be assigned three and only three amino acids for all instances of F1, F2 and F3. This is not mathematically possible, because although there is only one instance of F1, there are four instances of both F2 and F3 that must be assigned. A seemingly possible solution is to assign the same amino acid to all four instances of F2, and another amino acid to all four instances of F3. This solution would produce three amino acids consistently assigned across nine instances, but this too is not mathematically possible. The most viable alternative is to assign a single amino acid to the only instance in F1, a single amino acid to all four instances of either F2 or F3, and create a group of amino acids that somehow approximate each other, and assign them to the remaining four instances. This is the strategy that the code has taken.

The code operates on the input of mRNA nucleotide triplets as the fundamental unit of information. Because there are only four bases available to mRNA, there are only twenty triplets on which to operate. There are sixty-four ways to arrange these triplets, but we are going to focus on the triplets themselves. If the code were able to assign a single amino acid to all instances of a nucleotide in each frame, then we would only need the assignment data from the permutations of one triplet to determine the assignment of nine codons.

AA1 = B1, B2, B3
AA2 = B2, B3, B1
AA3 = B3, B1, B2

Because it is not mathematically possible to create a data structure in this way, this formula cannot be completely accurate for nine codons in any assignment schema where the number of amino acids is greater than one. It can be accurate for six of the nine codons if there are sixteen amino acids in the schema, but there must always be a level of approximation for some property of the amino acids in at least three of the codons. Again, it is numerically impossible to assign three and only three amino acids to all instances of a specific nucleotide for all sequences of nucleotides. The goal now is to assign as many as possible, and somehow approximate the rest for all circumstances - as well as possible. The assignment of a nucleotide in one frame cannot dictate the assignment of that nucleotide in another frame. The assignment in F1 is dictated by the random context of the two surrounding nucleotides. The same context of three nucleotides in F1 is only partial context in F2 and F3, and the rest of the context in those frames is randomly supplied by the sequence. Therefore, it would seem impossible to use the context of F1 to anticipate the random context of F2 and F3. The fact that the genetic code has managed to do this makes it worth a closer look. Given the assignment data of any permuted triplet we can say the following about the genetic code: AA1 is always assigned to F1. AA2 anticipates the properties of F2 for all possible random eventualities (B2, B3, A); (B2, B3, C); (B2, B3, G); and (B2, B3, U).
AA3 anticipates the properties of F3 for all possible random eventualities (A, B1, B2); (C, B1, B2); (G, B1, B2); and (U, B1, B2). An example is given for the nucleotide triplet UAC. This is a tertiary triplet, so it can generate six distinct permutations, and each of these is assigned in the code to a different amino acid (an anti-Gamow pattern).

UAC = Tyrosine
ACU = Threonine
CUA = Leucine
UCA = Serine
CAU = Histidine
AUC = Isoleucine

F1, Reference: UAC = Tyrosine
F2, Anticipated: ACU = Threonine
F3, Anticipated: CUA = Leucine
F2, Possible: ACA = Threonine, ACC = Threonine, ACG = Threonine, ACU = Threonine
F3, Possible: AUA = Isoleucine, CUA = Leucine, GUA = Valine, UUA = Leucine

Another example from the same triplet:

F1, Reference: UCA = Serine
F2, Anticipated: CAU = Histidine
F3, Anticipated: AUC = Isoleucine
F2, Possible: CAA = Glutamine, CAC = Histidine, CAG = Glutamine, CAU = Histidine
F3, Possible: AUC = Isoleucine, CUC = Leucine, GUC = Valine, UUC = Phenylalanine
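The anticipation in these examples follows mechanically from the shift geometry, which a short sketch can reproduce. The partial table here transcribes only the codons the UAC example needs:

CODON_TABLE = {
    'UAC': 'Tyr', 'ACU': 'Thr', 'CUA': 'Leu',
    'ACA': 'Thr', 'ACC': 'Thr', 'ACG': 'Thr',
    'AUA': 'Ile', 'GUA': 'Val', 'UUA': 'Leu',
}

def anticipated_outcomes(codon):
    # Forward shift: the four F2 codons share the prefix (B2, B3).
    # Backward shift: the four F3 codons share the suffix (B1, B2).
    b1, b2, b3 = codon
    f2 = [b2 + b3 + x for x in 'ACGU']
    f3 = [x + b1 + b2 for x in 'ACGU']
    return f2, f3

f2, f3 = anticipated_outcomes('UAC')
print([CODON_TABLE[c] for c in f2])   # ['Thr', 'Thr', 'Thr', 'Thr']
print([CODON_TABLE[c] for c in f3])   # ['Ile', 'Leu', 'Val', 'Leu']

AA2 (threonine) anticipates all four F2 eventualities exactly; the F3 eventualities scatter, but only across closely related hydrophobic amino acids.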

What this demonstrates is that small portions of data contain patterns informative to the global pattern of the entire data set. It is somewhat like a holographic data set. It is an ingenious way for nature to have structured the data so that all three frames are logically related. Although the reference amino acid, AA1, is seemingly unrelated to either AA2 or AA3, the possible assignments of both anticipated sets are closely related, if not identical. The net effect is that triplets are assigned in such a way that the assignments themselves anticipate the effect that random nucleotide context will have on the information output by the genetic code. This is virtually impossible by chance alone. Although the outcome of a shift on a single codon in a random sequence should be random, the outcome across the entire nucleotide sequence is pre-determined. In contrast to a point mutation, all of the information regarding any possible frameshift is already contained by the sequence before the shift. By permuting the assignments of any given triplet, one can anticipate some property of the amino acid already assigned to an unknown outcome, despite the fact that eight possible random outcomes must be accounted for. More importantly, this function must hold to some degree across sixty-four distinct codons, and the simple formula is in fact remarkably consistent throughout the entire code. Even partial effectiveness of this formula is no simple trick to pull off for any data set. It is like trying to fill out a complex three-dimensional magic square, in the dark, with virtually no starting knowledge or constraints. In any case, the ability of a codon assignment pattern to achieve any level of frameshift anticipation, let alone a high level, must be the mother of all combinatorial optimizations, and it is stunning when the components required to get to that level are properly understood. To help visualize the concept we must appreciate the task presented to nature by the sets of numbers involved. Each F1 codon permutation has four possible outcomes in F2, and four more possible outcomes in F3. But each triplet class and codon type has a different fingerprint, or statistical scatter pattern, in the three frames, as illustrated in these Rafiki maps.

Primary, secondary and tertiary triplets all have different fingerprints when shifting into either F2 or F3. Homogeneous and heterogeneous multiplets make good reference units toward understanding shift distribution patterns across the entire code. The above diagrams illustrate some of the numeric realities in the structure of the code, but the observed shift patterns are confusing to describe, so they are much easier to spot in the diagrams. F1 codons will always map into the four codons of a single F2 multiplet. If the original codon is part of a homogeneous multiplet then it will stay in the original major pole. If the original codon is part of a heterogeneous multiplet then it will shift into the adjacent pole. F1 codons will all map into a codon from four different F3 multiplets, but all four codons from an F1 multiplet will map into the same four codons of four different F3 multiplets. The scatter of a single F1 codon is tight into F2, but spreads into F3. Conversely, the scatter from an F1 multiplet is spread into F2, but is unitary into F3. These are the features that the genetic code took advantage of in putting reliability into the simple anticipatory formula. More importantly, these are the universal numeric patterns that primarily selected for the patterns in codon assignments seen in the genetic code.

Note that these diagrams graphically demonstrate that two completely different codon assignment strategies are required in order for the anticipatory formula to work at all. First, the assignments must obey multiplet boundaries to anticipate F2. Second, all of the multiplets must be coordinated in such a way to send four possible codons to four different multiplets in F3. Further note that the reliability of the formula will be degraded by every amino acid that is added to the set, and the degradation should accelerate rapidly above twenty. This is because there are only twenty anticipatory data sets available to the formula, corresponding to the twenty distinct triplets allowed within the structure of the code. From this purely numeric perspective, twenty is an optimized value within the genetic code, and the arrangement of this specific set of twenty is optimized for the mutual anticipation of the properties of three different frames of reference. This might be the basis of the super symmetry in codon assignments that has been identified by several mathematicians and nuclear physicists. For the code to be considered as an optimization we must consider all amino acid parameters, under all circumstances, and when applied to all possible nucleotide sequences. It is not realistic to illustrate such a broad concept in a single diagram, but we can select one important parameter to move the illustration forward. Water affinity of each amino acid plays a critical role in determining the final conformation of every protein. Therefore, it is logical that the genetic code should take this into account in forming an optimal codon assignment pattern. To help visualize how the code actually did this I will use the color wheel of relative hydropathies. A plot of the actual codon assignment data using these colors shows the unmistakable fingerprint of F2 in the genetic code.

This pattern is not surprising, because it is recognizable in virtually any treatment of the genetic code. However, it means that AA2 will always reliably anticipate F2. This oft-recognized multiplet pattern is the consequence of the B3 symmetry in multiplet assignments noted for decades. The pattern was attributed primarily to wobble, and also proposed as a buffer to point mutations. But these are probably not the forces that drove it initially. If wobble were the optimizing force, we should see the fruits of wobble, which would show up as an optimum number of reduced tRNA. No organism demonstrates anything near this number, and it is very likely that organisms actually have more than sixty-four distinct tRNA molecules swimming in the soup of their cells. Wobble helps keep tRNA populations down, but we don't see anything approaching thirty-one tRNA, which is the optimized minimum. Point mutations cannot be the driving force either, because they create no real pattern at all. It is unlikely that this relatively trivial and virtually non-existent pattern drove the pattern of codon assignments.

Remember, the multiplet assignments are merely step one of a two-step process. Point mutations in the third position are in fact covered, but this alone probably isn't significant enough to drive that pattern. The second step is to interweave these multiplet assignments so that they anticipate F3. This is much tougher to do and much tougher to visualize, but the Rafiki map is unmatched for this purpose. By applying the anticipation formula to all twenty triplets in the Rafiki map we achieve an F3 color rotation on the map in the following way.

This is simply a mathematical manipulation of the data as directed by the F3 anticipation formula described above. I have applied this formula to all 64 codons and used the same color scale for water affinity. When all twenty triplets are rotated by the simple anticipatory formula, an F3 Rafiki map appears as follows.

The multiplets are now assigned in a notably consistent manner with respect to water affinity. The UUU pole is now seen as completely assigned to the most hydrophobic amino acids in anticipation of F3, and the AAA pole anticipates amino acids from the hydrophilic spectrum.

This pattern is difficult to detect in classic spreadsheets, but it is undeniable in this plot. However, this color assignment isolates only a single amino acid parameter, and it is convincing mostly at these two poles. The remaining two poles have a decent but somewhat subtler result for hydropathy in F3, but those two poles are dominated by a completely different peptide parameter. The CCC pole houses proline, and the GGG pole houses glycine. These are the two most significant amino acids with respect to the steric properties of peptide backbones. The axis between these two poles optimizes primarily on steric parameters, and secondarily on hydropathy, whereas the AAA-UUU axis is organized primarily by hydropathy. It's really hard to know what arginine is all about in any context. It is the rebellious teenager in the code family.

The four primary triplets (AAA, CCC, GGG, UUU) play a unique and significant role in codon assignments. They are mathematically different from the other sixteen triplets based on their pattern from F1 into F2 and F3. Primary triplets have a one in four chance of remaining unchanged in each frame, but all other triplets must always create new permutations. In the case where a homogeneous multiplet is symmetrically assigned to a single amino acid, that amino acid will remain the same in six of nine possible instances. Therefore, primary triplets in conjunction with homogeneous multiplets can provide backbone benchmarks for correlating the three protein structures F1, F2 and F3. This correlation is made along two axes: the first is relative hydropathy, and the second is the degree of steric freedom in peptide bonds. Proline is incomparable in its degree of bond rigidity, and glycine is unmatched for steric freedom. By having solid homogeneous multiplet assignments, these two amino acids display the strongest possible symmetry within the framework of the code. As a consequence, they are exceptional in their resistance to shift replacement, and therefore they are the primary benchmarks for correlating protein structures across F1, F2, and F3. Hydropathy is not far behind in terms of symmetry of assignments and importance as a protein folding benchmark between frames. As we have seen, hydropathy dominates the other axis of correlation across the three frames. In this context, the four major poles provide a structural template into which at least three distinct protein descriptions can be predefined for any nucleotide sequence. The genetic code has discovered this numeric curiosity and taken advantage of it.

However, the code has one more clever trick up its multi-frame sleeve. Note that F2 can be anticipated by individual codons mapping into a single multiplet, but an entire multiplet can only be anticipated as one of four codons mapped into four different multiplets. These numeric relationships will not allow specific amino acid substitutions to be anticipated. However, when one codon is used preferentially from a multiplet, a specific amino acid substitution becomes guaranteed. Codon bias is a long-recognized pattern in nucleotide sequences, but less familiar are its specific numeric consequences given three relative frames in the genetic code. Because of codon bias, F2 reflects a much higher degree of F1 structure. A history of repeated use of frameshifts to generate novel protein structures may actually explain severe bias in various organisms. A biased genome will produce more highly related structures in F1 and F2 than will an unbiased genome. Therefore, codon bias and frameshifting are interrelated genetic phenomena.
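The frameshift arithmetic behind these claims can be seen in a toy translator. Here is a minimal Python sketch using the standard codon table; the sequence is hypothetical, chosen only to contrast a homogeneous run with a mixed tail:

    from itertools import product

    BASES = "UCAG"
    AMINO = ("FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRR"
             "IIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG")
    CODON_TABLE = {"".join(c): aa
                   for c, aa in zip(product(BASES, repeat=3), AMINO)}

    def translate(rna, frame):
        """Translate an RNA string starting in the given frame (0, 1 or 2)."""
        return "".join(CODON_TABLE[rna[i:i + 3]]
                       for i in range(frame, len(rna) - 2, 3))

    seq = "UUUUUUUUUCCAGCU"  # hypothetical: a poly-U run plus a mixed tail
    for f in range(3):
        print(f"F{f + 1}:", translate(seq, f))

The poly-U run reads as phenylalanine in all three frames, while the mixed tail is re-read into new amino acids. A codon inside a homogeneous run survives a one-base shift exactly when the incoming base happens to match, which for a random neighbor is the one-in-four chance noted above.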

What's symmetry got to do with this?


The pattern of assignments as unique relationships between three reading frames is a product of symmetry. Symmetry is a well-recognized part of codon assignments, but the origin and function of this symmetry have been obscure. Consider that a completely random codon assignment pattern would produce, for any given nucleotide sequence, three randomly spaced, logically unrelated proteins on the landscape of potential proteins. Because of symmetry, however, the three structures are logically spaced and interrelated on this landscape. Is this just a happy coincidence, or is it an optimized function of the genetic code? The odds that any random code would have this ability are infinitesimally small, so we can comfortably wager that the genetic code actually adapted for this as a primary function. Consider the chain of coincidences required to grant the code this rare ability.

1. It selected a set of completely interchangeable geometries in its amino acids. All amino acids are the same type of isomer, the L-type, and spatial sameness is a requirement if assignment symmetry is to be structurally effective.
2. It selected twenty amino acids, which provides the optimized combination of symmetry and diversity in assignment schemas.
3. It assigned amino acids strictly to multiplets.
4. It symmetrically interwove the multiplet assignments.
5. It assigned two important binary axes of peptide properties to the four major poles and their primary triplets.
6. It frequently employs codon bias, guaranteeing a high percentage of directed shift substitution.

All of these observed facts of the code have independent consequences and explanations, but they must all be present to achieve the remarkable relationship of assignments between F1, F2 and F3. It is therefore plausible to suspect that these features have adapted as an optimized set. Codon assignments were not made for absolute meanings; rather, each one has a meaning relative to all the others. No codon-amino acid correlation can be considered a good correlation outside the context of all sixty-four taken together. It is a network of assignments made in concert to achieve a particular optimized result.

Consider the virtually infinite number of possible genetic codes. Assuming that no code is better than any other would lead us to expect few patterns, and no way to explain the near universality of a changeable code. Since there are remarkable patterns and near universality in the codes we see across diverse organisms, it is valid to infer that this code is exceptionally well suited to its task. The fact that only twenty amino acids occur in the standard code, and that all of them are the same isomer, is a longstanding riddle. Why not more amino acids, or fewer, and why not a mixture of isomers? These curiosities are required within the network of assignments to ensure that the many simultaneously defined protein structures maintain a logical distance of shared features on the landscape of all possible proteins. To find the infinitesimally small subset of potential codes that have this bizarre ability by mere coincidence boggles the mind. There must be an advantage to the code in having this exceptionally rare ability. Note that we have not even mentioned stereochemical genetic information in this scenario (way too heretical at this point to cloud the separate discussion of symmetry and frameshifting). We are basing the conclusions here only on the notion of amino acid identities, not spatial orientations. The impact of symmetry is strong on identities alone, but it becomes much stronger if stereochemistry is a real part of the information translated by the code. The two novel arguments regarding the nature of the code and its structure are independent but mutually supportive.

The classic paradigm has clouded our view of the genetic code's achievements. Adopting it early in investigations also had us accept that there may be no rhyme or reason to the particular patterns we might detect in the data. The linear paradigm suggests that one pattern might be every bit as good as another. This is a huge drawback of thinking in terms of 'one dimension' of information, or one degree of freedom in assignments. We are then apt to miss the logic in the assignment patterns toward producing not merely primary sequences of amino acids but entire protein structures. The what, how and why behind the genetic code have all been partial to total enigmas in the classic paradigm. They can be better understood within an expanded paradigm. What does the genetic code do? It simultaneously defines multiple overlapping structures for any nucleotide sequence. How does it do this? By employing symmetry in its pattern of assignments to select many logically related structures from all possible structures. Finally, the big picture comes into better focus when we consider why it should do these things. To the question of why nature should have a genetic code in the first place, the classic paradigm answers: to make sequences of amino acids from the information in sequences of nucleotides.
Beyond the distinction between sequence and structure there is another vital division between this and the expanded paradigm. The genetic code is also charged with the task of finding proteins within sequences of nucleotides, because any useful protein must first be found in a nucleotide sequence before it can be reliably and repeatedly made from it. The genetic code is an important part of the search algorithm nature uses to find these structures, and some patterns of codon assignment will be undeniably better at finding useful protein structures than others. There are many ways to search a random nucleotide sequence, and the genetic code has found the best way to search.

Here's a metaphor that might help. The genetic code is a die around which a string of data is wrapped. The string can wrap in many ways, but the die is cast for all wrappings in anticipation of all possible data. There are good and bad ways of precasting dies, and the genetic code has found the best of all possible dies. Any string found to be good in one wrapping has a much higher than random chance of being good in the other wrappings. In a situation where many good strings must be found as solutions to the diverse and changing problems of the environment, why not use good strings in as many ways as possible? If a solution works well forward, why not backward? Why not shifted or complemented? This is the essence of symmetry. Use, re-use, re-process, invert, complement, combine and re-combine - in general, transform but leave the elements untouched. Symmetry is transformation without change, and the code is founded on symmetry. Think symmetry, think multi-tasking.

Spacing molecular properties in a random sequence is one thing, but spacing many related sequences is quite another. Imagine the task of purchasing three lottery tickets. All three have an equally dismal chance of anticipating a small collection of random events. Is there a logically superior way to buy three lottery tickets? No - unless some aspect of the first ticket is known before purchasing the second, and likewise the third. Imagine you are told that the digits on the first ticket are correct but in the wrong order; the chances of having subsequent tickets pay off will skyrocket. In essence, this is what the genetic code has done. It knows vital things about one sequence that guide it in trying another, seemingly random sequence.
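To put a rough number on the "virtually infinite number of possible genetic codes" mentioned above: one crude way to count candidate codes - an assumption of this sketch, since the real combinatorics depend on what one holds fixed - is to allow any assignment of twenty amino acids plus a stop signal to each of the sixty-four codons.

    # Any map from 64 codons onto 21 meanings (20 amino acids + stop) is a
    # candidate assignment table; chemistry is ignored entirely, so this is
    # only a loose upper bound on the space being searched.
    n_codes = 21 ** 64
    print(f"{n_codes:.2e}")  # roughly 4.2e+84

Even heavily constrained counts leave a space so large that stumbling on a symmetric, frame-anticipating table by chance is implausible, which is the point of the wager above.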

Imagine a computer program written so that it is processed three bits at a time - let's say a check-writing program. Now imagine that program fed into the computer shifted one bit to the left. Would you expect the program to still write checks? Most probably not. The genetic code is set up in such a way that when this program is shifted forward or backward one bit it produces amazingly 'program-like' qualities, perhaps for unrelated tasks, like an address book or a paint program. Also, the program could be fed through backwards, or the XOR equivalent of the program could be fed through the computer, and it has a much better than random chance of performing some function. This would come in real handy when entirely new programs are required. This property of the computer system would be an efficient way to set up a search for new programs: start with a functioning program and transform it (a toy version of this move set is sketched at the end of this passage). Each nucleotide sequence is like a program for making a protein. Because of the compatible symmetries of genomes and the genetic code, new proteins are more easily found in seemingly random material.

When looking at the genetic code there has traditionally been a bias toward considering its function in making proteins. I am saying that more emphasis should be put on its role in finding new proteins. This function of the genetic code is probably very important and has probably played a major role in shaping its appearance. The genetic code not only needs a way to make the same protein over and over again, it also needs to constantly find new useful proteins, so it must relentlessly search the landscape of all possible proteins. The genetic code is the vehicle by which life can travel from one place to another upon the protein landscape. The 'junk' DNA in a genome probably represents the most fertile search ground for new structures; it is like a taffy pull of existing sequences with proven utility. By finding a harmonic network of assignments the genetic code has optimized the speed and efficiency of a search for new protein components within an existing genome.

DNA is a complex crystal. Protein is a more complex crystal. The genetic code translates the first crystal into the second. Life emerges as the most complex crystal. A genome represents a vibrant crystal factory. It devotes a small percentage of resources to production, a larger percentage to management, and the greatest percentage to R&D. Life is nature's most aggressive search algorithm. In contrast to life, a salt crystal rarely needs to solve problems posed by its environment, or to find a new way to make a novel salt crystal. Life has an insatiable need to find new ways of doing things. Organic environments are constantly changing, so life is constantly finding new ways to solve the problems they pose. Morphologic 'churn' drives the process.

Consider the analogy of a whole human organism in the time of the caveman. Being the best possible hunter solved the problem of nourishment for our caveman friend. Sexual reproduction is nature's preferred way of finding the best possible hunter from the set of all possible hunters. Rather than finding one really good hunter and then replicating it, nature chooses to combine components from two humans for the chance of creating useful novelty. This is a strategy predicated on an anticipation of extreme and rapid environmental change. The attributes of being the best possible hunter change through time.
The solution of changeless replication will quickly fail in any complex environment. Tinkering with the hunter by making random point mutations will have a low likelihood of success as well. Sexual reproduction is the symmetric solution to finding new types of hunter. Offspring are complete transformations of two parents, yet everything is the same in some sense, because the offspring is still human. This system of sexual reproduction ensures a high chance of functionality with a guarantee of novelty. Life shuffles and sorts to solve new problems, in this way addressing the pressing demands of a constantly changing environment.

Proteins are found in much the same way that sexual reproduction finds whole organisms. First, proteins are generally made of parts called exons. Second, these exons are free to recombine in novel ways to form entirely new proteins. Third, the genetic code has an effective way to find new exons through frameshifting, complementing and inverting. Once a useful exon is located in F1, it is like knowing some of the numbers on a lottery ticket. If F1 is functional, then keeping the general template while creating an entirely new potential exon in F2, F3, inversions and complements has a significantly higher potential for a successful search than does sampling a random sequence. Useful sequences are cobbled together from the frameworks of known useful sequences rather than found by random searches from scratch. The genetic code is a required part of the scheme.
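Returning to the shifted-program metaphor, here is the toy move set promised above: a minimal Python sketch of the transformations the text treats as cheap search moves - frameshifts, reversal, and complementation - applied to a made-up RNA string:

    COMPLEMENT = str.maketrans("ACGU", "UGCA")

    def search_moves(rna):
        """Cheap transformations of one working sequence, each a new
        candidate 'program' in the text's sense."""
        return {
            "F1": rna,
            "F2 (shift one)": rna[1:],
            "F3 (shift two)": rna[2:],
            "reverse": rna[::-1],
            "complement": rna.translate(COMPLEMENT),
            "reverse complement": rna.translate(COMPLEMENT)[::-1],
        }

    for name, s in search_moves("AUGGCUUUCGAA").items():  # hypothetical
        print(f"{name:>18}: {s}")

Each move keeps the proven raw material while guaranteeing a different reading - transformation without change, in the section's phrase - so a search that starts from a working sequence samples a far more promising neighborhood than a random draw does.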

The more sparse the landscape of useful proteins, the greater the advantage of logically spacing the frames relative to each other, and therefore the greater the advantage of having symmetry in the assignments. A genetic code optimized for diversity would assign sixty-four or more amino acids across all of the available codons. A genetic code optimized for consistency would assign one amino acid to all codons. The standard genetic code is optimized for diversity and consistency by symmetrically assigning a network of twenty amino acids across a network of nucleotides. Rather than a blue-collar protein building recipe, the genetic code can more accurately be seen as an elegant method for searching nucleotide sequences to find useful proteins. Life itself is a search algorithm, and the genetic code is the lynchpin in the search.

Folding a Rainbow
There are so many outstanding challenges regarding the genetic code that it is hard to know where to begin or what to do. What is the genetic code? What does it do, and how? And to me the most fascinating and mystical question: where did it come from? Since we do not have all that great a handle on the first few questions - the ones we can actually observe in today's world - there is no way to have any real confidence in the last one. The base of any viable scenario, I think, must be founded in emergence. Emergence is a very difficult concept to define precisely, but it is the companion or dual concept of complexity. Each leads to the other. I will let someone far more qualified than I weigh in on the definition of emergence. (Italics are the author's.)

Emergence is above all a product of coupled, context-dependent interactions. Technically these interactions, and the resulting systems, are nonlinear: The behavior of the overall system cannot be obtained by summing the behaviors of its constituent parts. We can no more truly understand strategies in a board game by compiling statistics of the movements of its pieces than we can understand the behavior of an ant colony in terms of averages. Under these conditions, the whole is indeed more than the sum of its parts. However, we can reduce the behavior of the whole to the lawful behavior of its parts, if we take the nonlinear interactions into account. John H. Holland Emergence: From Chaos to Order

The most stunning aspect of protein synthesis, in my mind, is that it is a sequential process. I think this aspect is almost universally misinterpreted as linear because of the superficial similarities between sequential nucleotides and lines. To me, the first big hurdle that life had to clear was coming up with a sequential molecular mechanism so that significant amounts of molecular information could be handled algorithmically. Cairns-Smith makes two indispensable points in thinking about the origins of life. First, the original molecules of life could not have been organic molecules by today's standards. Second, layers of inorganic molecules are capable of storing sequential molecular information. Given this mix of ideas, I see value in understanding how today's code could be visualized in terms of inorganic forces.

The primary force that shapes life is water, and the ideal form driving the structure of all organic processes is a dodecahedron. Actually, I think of life as an inherently spherical system, because I think that life represents a perfect dynamic balance of physical molecular forces, and the dodecahedron is the largest possible form of perfectly balanced forces. Twenty is the largest number of points that can be evenly spaced on a sphere, and if you connect all of these points they form a dodecahedron. So I see, somewhere in the mix of molecular history, a need to fold up layers, or a sequence of molecules, within a hierarchy of water affinities. The original organic material could have been no more than some form of carbon sludge, and I see the process as the emergence of gazillions of instances of progressively more complex dynamic crystalline structures. This calls to mind an iterative process of concentrating structures by repeated cycles of hydration and dehydration of little sludge globules. If the freezing or crystallization occurs within pockets of pockets of sludge, the dehydration process can partition itself into sections of sludge, and gradients are formed.
The grand orchestration can split between the easy access of tetrahedrons and the more complex circumstances of dodecahedrons. With separation and gradients come interfaces and transitions. This is where the real drama rages: at the interfaces - the interface of chemicals, shapes and order. This drama is a process, not an event. The process involves large amounts of repetition sprinkled with the essential spice of constant change. We mark the milestones in the process historically with the crowning of champions, symbolic kings of survival. But these inaugurations are chimeras existing only in our mental models of complex, repetitious events - historical events with no special features during the times of their own existence. They carry the titles of last common ancestor and origin of this or origin of that. They are the Pyrrhic victories of posthumous coronations for detached, historical champions. The drama emerges not from imagining something improbable and isolated, but from imagining a process ubiquitous and inevitable. Enough goth; let's splash some color into our scene. Here's a sludge globule.

Pretty unimpressive, but we can imagine some simple sludge properties for this globule to set it up for some sludge tricks. If we consider that there are natural properties of sludge, such as its ability to get along with water, then sludge placed in a dynamic setting might entertain us. If sludge is not perfectly copacetic with water, it will segregate itself into globules. If these globules are not homogeneous sludge, they will create gradients according to water affinity. The water-loving sludge will hang around the perimeter, and the water-hating sludge will squirrel itself away as far from water as it can.

Of course no sludge is an island, so water will continue to be a part of sludge's life to some degree or another. It is only through dehydration, or the exclusion of water, that sludge can crystallize into a solid of its own. This process will follow a path as diverse as the number of sludge globules one can imagine. In general, the crystallization process will make sludge smaller, so we can imagine sludge columns forming within the water matrix. This is simpler to visualize if we start with a sludge layer on top of water.

Now we have a new sludge-water interface as the stratified sludge begins to crystallize, which is a happy circumstance because new gradients can form, and we love gradients.

The most concentrated water-hating sludge should be found at the core of the sludge crystals. This process is just as true of sludge globules as of sludge layers. Perhaps it is a combination of the two, where sludge spicules form and aggregate into sludge clusters, much as lipid layers are known to do.

It is at the core of these crystals and aggregations of crystals that I imagine environments for carbon crystallization, much like a freezing pond. The specifics are beyond me by many orders of magnitude, but I generally sense that the process of carbon jostling for crystalline position will be multileveled, complex, and interesting.

Sludge at the core will be presented with different environments than sludge at the watered periphery, it will choose different strategies, and there will be an interface between strategies. We have now imagined a carbon sludge crystal that is stratified with a strategy predicated on water affinity. We can imagine this beast in bazillions of instances with bazillions of different strategies. It is at this rudimentary level that the true dynamics of the drama can kick in. They invite new players called combination and aggregation. Complexity, network science, universality, chaos, emergence - these are all playgrounds for the new players. Simple things combine to form complex things. Things aggregate and accumulate, but these combinations do not become big versions of little things; they become different versions of new things. More is not more, more is different.

I view a living cell as a colony of molecules. They are not sentient beings, but they interact in a molecular dance no less complex than a modern city. More perplexing, these molecules have a language and can communicate information. On what physical principle could such information and such a language be founded? In this new world of social mingling and sludge spicule accumulation it would be a useful trick for sludge to learn a measure of independence. If talented spicules figured out how to ball up, they could travel from aggregate to aggregate and form new connections; a crystal network could evolve and complexity would emerge. Buckminster Fuller teaches us the simplest way to ball up a spicule or rod: it's called a tetrahedron. This is the lowest-energy folding strategy for a column - practically just two folds and a splice.

In the real world of sludge spicules the splice is a bitch, and the folds are no picnic either. If these sludge regions actually wanted to get together, they would have gotten together in the first place. They are apart because they prefer to be that way, so they will need special sludge chaperones, or glue, to keep them together. The region of most desperate conflict is the space between water-loving and water-hating, blue and red, so we will put in a firewall as a splice.

The other regions between the hydrophobic and hydrophilic poles of the folded structure will require creative arrangements to maintain a happy structure. We can represent the finessing of the balance at these junctions with some special sludge icons.

This is a quite fanciful representation of a mythical sludge spicule strategically built to form and travel in water. Such a beast could probably never exist on this planet, but it is starting to look like a pattern we've seen before.

The spicule is mythical, but the map, or globe, of the genetic code is not. The pattern demonstrated by that structure is self-generated. Even if you reject the notion that there is meaning embedded in the genetic code, you cannot deny the pedagogic value of studying this object. It is exponentially more powerful to comprehend codons on this globe than on anything approaching a line. Assignments and properties, such as water affinities, and the similarities and differences of all parameters in the system, can be illustrated relative to one another. It is the natural habitat of nature's data. These study tricks are founded on many of the principles we used to imagine the folded spicule. It is possibly a coincidence that these parallels are even possible, but the intellectual utility of this genetic globe is striking. Just as our fanciful crystal spicules can aggregate, so too can crystal particles. These rough, imprecise seedlings will be polar, and they will cluster in water. This form of spontaneous aggregation, or assembly, is particularly interesting in light of the many modern homologs in biochemistry, viruses being a major example. In cases where aggregation of subunits is a major growth strategy, dodecahedral and icosahedral symmetries are common.

As seed crystals, these large mythical beasts are now in a position to fold a diverse collection of smaller beasts - proteins - based on any and all platonic forms. Empiric evidence suggests that this is exactly how proteins fold. Is this the clue to a fundamental law of folding? But are crystal spicules common, and what relationship would they have to dodecahedrons? Crystal spicules are a fairly common phenomenon, but dodecahedrons are not. Dodecahedrons are very rare in natural crystal systems because that symmetry is an awkward way to make an interconnected lattice. Consequently, inorganic crystals demonstrating the fivefold symmetries of dodecahedrons are unusual, yet five-fold symmetry is the rule rather than the exception in living systems. What would a dodecahedral spicule look like?

Again, this is merely a fanciful illustration of a dodecahedral crystal spicule, intentionally designed for demonstration. I am not aware of any natural crystal that spontaneously forms in this type of dodecahedral structure. I have aligned dodecahedrons sequentially and proportioned them so that their complementary faces can be colored and the cores removed:

I am not aware of any structure in nature like this, except... well... DNA, but that's not really a crystal, is it? Another interesting foray can be found in tiling a plane with a dodecahedron. There is an artist named Mark Curtis in Britain who has been bitten by the bug of geometric principles behind life's little crystalline tricks. You should visit his website and see how he feels about the geometry of DNA. This is the exact proportion of the double helix in DNA, which also demonstrates dodecahedral symmetry. It has a major and minor groove, complementarity, the same pitch, and the same number of faces per rotation. It is not meaningful in any sense other than as a fanciful parallel between DNA and a sequence of dodecahedrons. This is not a typically useful crystallization strategy for carbon, which will generally take one of two pure forms in nature: graphite and diamond. The precise form taken, and therefore the symmetry employed, is a function of the environment in which it crystallizes. The following diagrams are from my college mineralogy textbook, Dr. Klein's. I'm sure he won't mind.

In low-pressure environments, such as at earth's surface, carbon will crystallize into the six-fold sheeted symmetry of graphite. At higher pressures carbon crystallizes into the tetrahedral symmetry of diamond.

Graphite

Diamond

The differences in physical properties between these two structures are dramatic. Color, luster, cleavage and hardness are quite disparate, and they form the basis of widely differing commercial value. The environment produced the structure, and the structure produces the utility. An environment where regular crystallization is not an option will favor strategies that embrace irregularity. These are the types of strategies that demonstrate emergence. They are agent-based strategies, where simple agents act on simple laws in huge numbers. It is only through context and combination that more complex behavior begins to emerge from the system.

Grandma
The theory of Babbage accounts with great probability for the rise of ground in the vicinity of volcanos, and Herschel's theory accounts, perhaps, for the subsidence of deltas and other places where great accumulation of sediment occurs; and this latter theory has the additional advantage of accounting for metamorphism, and perhaps, also, for volcanic phenomena. But it is evident that some other and more general theory is necessary to account for those great inequalities of the earth's crust which form land and sea-bottom. Joseph Le Conte Elements of Geology, 1898

Life is a cascade of networked precursors. Flesh, bone, talent, ideas, culture - it all moves along with time, propagating from one network to the next. We like to see linear relationships in the process, but I doubt that the linearity is more than an illusion. My last living grandparent, Naomi, just died. I learned a lot of things about her only after she died. She was a special lady on many counts, but she had horrendous handwriting - of course I knew that before she died. What I didn't know was how beautiful her husband's handwriting was - Grandpa's. She always wrote everything for the pair, but you could never read a damn word she wrote. Grandpa could write like nobody's business, but I never saw a thing he wrote until after they both died. My Dad got his mother's handwriting, which he didn't give to me, because I have no handwriting at all, so I print, or I use my computer. I can't really decide which of my four grandparents is to blame for that lexic defect.

Naomi attended the University of Chicago before girls generally went off to college, and she took geology classes. My brother was a geology student as well; neither of us knew that Grandma had preceded us in the study of geology. After she died, we found some of her projects and textbooks. Based on that material, she was clearly a better student and geologist than I ever was or ever will be. The quote above was from one of her books. It is remarkable for two reasons. First, they had no clue whatsoever in 1898 about plate tectonics. We now know that the crust of the earth is wandering around the surface, but without that little tidbit of information there are a lot of facts that are absolute head-scratchers. For instance, an ancient shoreline can somehow be located at the peak of a contemporary mountain. They struggled valiantly to make sense of that, but in the end they were debating the dance patterns of fairies on the head of a pin. Second, the statement is remarkable for the following line: But it is evident that some other and more general theory is necessary to account for those great inequalities of the earth's crust which form land and sea-bottom. They knew what they didn't know, and they were willing to withhold final judgment on the theory in anticipation of better ideas to come. In other words, knowing something about something isn't the same as knowing everything about something. We seem to have forgotten this. Conversely, they did not vex their cultural descendants with dogmatic insistence on one theory or another. Apparently the excitement of bigger ideas outweighed the fear of no ideas at all. To them, the interpretation of the earth's crust without plate tectonics was like the interpretation of the genetic code without a dodecahedron: it shouldn't be done. Differential settling was eventually replaced with a model of tectonics, but it was not an unreasonable guess, because heavy things sink.
The really heavy stuff in the earth is toward the center, and it gets hotter from pressure as we descend to the core. Toward the periphery are left the lighter elements, the most interesting of which are carbon, oxygen, nitrogen, and of course hydrogen. The most interesting interaction of these elements at the surface of the earth - in the universe - is the interaction of hydrogen and oxygen to form water. The properties of water are oddball in innumerable ways, but without them Life is improbable. Seventy percent of the surface of the earth is water. The atmosphere is full of it. H2O exists on the periphery of earth in many forms - gas, liquid, ice, and that most interesting of all forms, the semi-solid carbon slushy form of Life. About seventy percent of Life is water. Life is like a continuous crystalline outer crust on the surface of the earth. What possible mechanism could get this crystal started all those billions of years ago?

Current Dogma as the Antithesis to Chaos and Complexity


The current dogma regarding the genetic code represents the antithesis of the concepts and foundations of chaos and complexity theories. The idea that proteins actually fold in a simple and linear fashion, as described by current dogma, should be an outrageous proposition to those who understand complexity. Let's briefly outline the accepted theory:

1. Genetic information is stored in sequences of nucleotides.
2. The genetic code translates all of this information only into sequences of amino acids.
3. The sequence of amino acids alone determines the shape of a folded protein.

The lynchpin of this scenario is called the thermodynamic hypothesis of protein folding. This hypothesis holds that for any sequence of amino acids there is a single folded conformation that represents a global free energy minimum relative to all possible conformations of that sequence, and that this conformation will be found in all cases of physiologic folding. Therefore, the universe has predetermined one native form for every possible sequence before the process of folding begins. If this were true, we should expect to see a single conformation for any protein, independent of its nucleotide sequence or folding history. Indeed, early studies of denatured and refolded proteins seemed to support this hypothesis, but they have misled us. Translation has already occurred, and denaturing a protein does not reverse the informative processes of translation. Confirmation of this accepted hypothesis is still conspicuously absent. A rigorous proof should at least require prospective studies of folding a single primary sequence from multiple synonymous sequences. This proof is missing from the literature, and in fact contradictory evidence is available. Silent mutations do affect protein folding. In addition, curiously, codon usage is correlated with secondary structure.

The issue is actually much simpler than it has been made out to be. Given a sequence of amino acids, how many ways might we expect it to fold? If the answer is one, then we can continue to happily pursue the tenets of existing dogma. If a protein might fold many ways, then we must abandon the failed dogma before it does more harm - and it is doing considerable harm. This is where chaos and complexity should rise up and be heard, because protein folding will be properly understood within that context. The process of folding a protein is complex, and it is sensitive to initial conditions. It is much more like the weather, or an economy, than it is like a ball on an inclined plane. We can easily come up with a long list of conditions that certainly divert a protein from folding along its unalterable path of thermodynamic destiny. The most crucial of these conditions will involve the molecular apparatus of sequence translation, which is the heart of the genetic code.

Here is a metaphor to make the point. Consider a river that represents a single sequence of amino acids. One side of the river stands for all synonymous sequences of nucleotides that translate into those amino acids, and the other side of the river stands for all possible folded conformations of that sequence. The genetic code is a bridge between the two sides of the river. Our current paradigm is the simplest one possible: all roads lead to one destination. But this view completely ignores the engineering of the bridge. Each crossing of the river has a timeline and a terrain.
Each sequence is built in real time with a diversity of components, and surely this alone leads to the probability of multiple destinations. This does not mean that every point on the nucleotide side leads to a distinct conformation, just that there are multiple potential destinations. Even if there are only a few, the fact that sequence does not inevitably lead to structure means that we are ignoring a vital part of translation. Genetic information defines the destination, and the genetic code gets us there. This is more than just a metaphor; it tracks with reality. The genetic code lays down a physical bridge between nucleotides and protein conformations that is built in sections of tRNA. For every point of departure we must imagine the timeline and the physical contour of the bridge built by the genetic code. The sequence of amino acids lies not on codons but on a sequence of tRNA. The shape and state of initial conditions for every polypeptide as it enters the process of folding is directly determined by the sequence of tRNA, not codons. This is a non-trivial, informative physical reality with a very real opportunity to impact the path of a protein as it undergoes complex folds. These molecular scaffoldings in time and space represent genetic information at that point in the process. It is a system predicated on the non-linear outcomes of the complex dynamics of protein folding. If translation is in reality a very simple process, then current dogma will prevail. If the process of folding a protein is complex and chaotic, like the weather, then initial conditions will be all-important, and our view of the genetic code must then be expanded.
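As a closing aside, the "multiple synonymous sequences" experiment called for above is at least easy to size. Here is a small Python sketch, using the degeneracies of the standard table, that counts the distinct coding sequences for one peptide (the peptide itself is arbitrary):

    from collections import Counter
    from math import prod

    # Standard-code amino acids in U/C/A/G codon order; '*' marks stops.
    AMINO = ("FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRR"
             "IIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG")
    DEGENERACY = Counter(aa for aa in AMINO if aa != "*")  # codons per amino acid

    def synonymous_count(peptide):
        """Number of distinct nucleotide sequences encoding this peptide."""
        return prod(DEGENERACY[aa] for aa in peptide)

    print(synonymous_count("MKVLF"))  # 1 * 2 * 4 * 6 * 2 = 96

Even a five-residue peptide has ninety-six synonymous encodings, and the count grows multiplicatively with length; the thermodynamic hypothesis claims that every one of them folds identically, which is exactly the prospective test the literature is missing.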
