Dna Structure: A, B and Z Dna Helix Families Dnastructure: Sequence Effects
0003147
Abstract
DNA sequencing is the determination of base order in a DNA molecule. Methods for
determining base order involve either chemical degradation or, more commonly, enzymatic
synthesis of the region that is being sequenced. Automation of the DNA sequencing
process is accelerating the progress of the Human Genome Project.
Introduction
The development of methods that allow one to quickly and reliably determine the order of
bases, or the ‘sequence’, in a fragment of DNA is a key technical advance, the importance
of which cannot be overstated. Knowledge of DNA sequence enables a greater
understanding of the molecular basis of life and is critical to understanding a wide range
of biological processes. The order of bases in DNA specifies the order of bases in RNA,
the molecule within the cell that directly encodes the informational content of proteins.
Scientists routinely use DNA sequence information to deduce protein sequences. Base order dictates DNA structure and
its function, and provides a molecular programme that can specify normal development,
manifestation of a genetic disease, or cancer. See also DNA Structure: A‐, B‐ and Z‐DNA
Helix Families, DNA Structure: Sequence Effects, and Cancer Genetics (human)
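The deduction of protein sequence from DNA sequence can be illustrated with a minimal Python sketch. It assumes the standard genetic code and reads codons directly from the coding strand (with T standing in for the U of RNA); the names `CODON_TABLE` and `translate` are illustrative only and do not come from any particular software package.

```python
# Standard genetic code, built from the conventional T-C-A-G ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(coding_strand: str) -> str:
    """Translate a 5'-to-3' coding strand, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(coding_strand) - 2, 3):
        amino_acid = CODON_TABLE[coding_strand[i:i + 3].upper()]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ATGGCCAAATAA"))  # MAK
```

The sketch ignores reading-frame selection, introns and alternative genetic codes, all of which matter when real sequences are interpreted.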
Knowledge of DNA sequence and the ability to manipulate these sequences has accelerated
the development of biotechnology and has led to the development of molecular techniques
that provide the tools for asking and answering important scientific questions. The
polymerase chain reaction (PCR), an important biotechnique that facilitates sequence‐
specific detection of nucleic acid, relies on sequence information. DNA sequencing
methods allow scientists to determine whether a change has been introduced into the DNA,
and to assay the effect of the change on the biology of the organism, regardless of the type
of organism that is being studied. Ultimately, DNA sequence information may provide a
way to identify individuals uniquely. See also Polymerase Chain Reaction (PCR)
DNA sequencing has become so commonplace that the technique itself is often taken for
granted. However, this has not always been the case. Scientists were once all but required
to publish or present DNA sequence data before a sequence was considered reliable.
Furthermore, the length of sequence that can be obtained and the number
of sequences that can be analysed on a single gel have increased by an order of magnitude.
This article provides an overview of DNA sequencing development. See also History of
Biotechnology
To understand the DNA sequencing process, one must recall several facts about DNA.
First, a DNA molecule is composed of four bases, adenine (A), guanine (G), cytosine (C)
and thymine (T). These bases interact with each other in very specific ways through
hydrogen bonds, such that A interacts with T, and G interacts with C. These specific
interactions between the bases are referred to as base pairings. The two strands of a DNA
molecule occur in an antiparallel orientation in which one strand is positioned in the 5′ to 3′
direction and the other strand is positioned in the 3′ to 5′ direction. The terms 5′ and 3′ refer
to the directionality of the DNA backbone, and are critical to describing the order of the
bases. The convention for describing base order in a DNA sequence uses the 5′ to 3′
direction, and is written from left to right. Thus, if one knows the sequence of one DNA
strand, the complementary sequence can be deduced. See also Watson–Crick Base Pairs
and Nucleic Acids Stability, and Nucleic Acids: General Properties
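Deducing one strand from the other follows directly from the pairing rules and the antiparallel orientation. The short Python sketch below is one way to express this; the function name `reverse_complement` is an illustrative choice, and the input is assumed to contain only the four unmodified bases.

```python
# Watson-Crick pairing: A with T, G with C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(strand: str) -> str:
    """Return the complementary strand, written 5' to 3'.

    Because the strands are antiparallel, the complement must be
    reversed to keep the left-to-right 5'-to-3' convention.
    """
    return "".join(COMPLEMENT[base] for base in reversed(strand.upper()))


print(reverse_complement("ATGGCA"))  # TGCCAT
```

Applying the function twice returns the original sequence, which is a convenient sanity check.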
There are two methods that are typically used to determine DNA sequence; the
development of each method resulted in the award of a Nobel Prize. The first uses
chemicals to specifically degrade the DNA strand, and is referred to as Maxam–Gilbert
DNA sequencing in honour of the inventors, A. Maxam and W. Gilbert. The second
method involves specific inhibition of enzymatic DNA synthesis and is referred to as
Sanger sequencing in honour of its inventor, F. Sanger. These two sequencing methods are
described in more detail below. See also Sanger, Frederick, and Gilbert, Walter
Both methods require that the reaction products share a common endpoint. This
requirement stems from the separation method used to visualize the reaction products.
These reaction products are size‐separated by applying an electric current through a gel
matrix (electrophoresis), and a common end is necessary to keep the reaction products in
register with respect to size mobility, so that the smaller products migrate more rapidly on
the gel relative to the larger products. More specifically, either the 5′ or the 3′ end can
define the fragment endpoint in a Maxam–Gilbert sequencing reaction, while only the 5′
end defines the fragment endpoint in a Sanger sequencing reaction. The reason for this
difference is clarified below. See also Gel Electrophoresis: One‐dimensional
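Why a common endpoint matters can be seen in a small simulation. The Python sketch below assumes a Sanger-style experiment with four reactions, one per base, in which every product starts at the same primer 5′ end and stops at one occurrence of that base. The function names and the primer length of 20 are hypothetical choices made for illustration.

```python
def run_reactions(synthesized: str, primer_len: int = 20) -> dict[str, list[int]]:
    """Return, for each base-specific reaction, the lengths of its products.

    Every product shares the primer's 5' end, so its length is the
    primer length plus the position of the terminating base.
    """
    lanes = {"A": [], "C": [], "G": [], "T": []}
    for offset, base in enumerate(synthesized, start=1):
        lanes[base].append(primer_len + offset)
    return lanes


def read_ladder(lanes: dict[str, list[int]]) -> str:
    """Read the sequence from the shortest (fastest) band to the longest."""
    bands = sorted(
        (length, base) for base, lengths in lanes.items() for length in lengths
    )
    return "".join(base for _, base in bands)


lanes = run_reactions("GATTACA")
print(lanes["A"])          # [22, 25, 27]
print(read_ladder(lanes))  # GATTACA
```

If the products did not share an endpoint, fragment length would no longer correspond to base position and the ladder could not be read in order.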
Maxam–Gilbert sequencing is not routinely used by most investigators for several reasons.
First, data produced in chemical sequencing reactions are typically more ambiguous than
data produced in enzymatic sequencing reactions. One reason for this is that the chemical
reactivity of the bases is influenced by reaction impurities. Therefore, when one reads the
sequence from this type of reaction, the relative intensities of the reaction products must be
analysed for proper interpretation of base identity. Additionally, this procedure uses
hazardous chemicals and high levels of radioactivity. When compared with enzymatic
DNA sequencing, Maxam–Gilbert sequencing produces relatively shorter sequence
information and the procedures required to generate this information are more labour‐
intensive.
As described for Sanger‐type sequencing reactions using (primarily) isotopes to detect the
extension products, some automated sequencers use four lanes to collect the data from the
reactions. However, some machines use differently coloured fluorescent tags to indicate
base identity (Figure 3). This approach enables a single lane to contain the data for a DNA
template and increases fourfold the amount of data contained on a gel. This single‐lane
approach is made possible by the development of fluorescent tags that can be attached
either to the DNA primer or to the ddNTP. Since four‐colour chemistry is used by more
researchers, it is discussed in more detail below.
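In a single-lane, four-colour run, base identity is inferred from which fluorescent signal dominates at each band position. The following Python sketch shows the idea in its simplest form; it is not the algorithm used by any commercial instrument. The intensity values, the `min_ratio` threshold and the use of "N" for an ambiguous call are all assumptions made for illustration.

```python
def call_bases(peaks: list[dict[str, float]], min_ratio: float = 2.0) -> str:
    """Call one base per peak from four dye-channel intensities.

    Each peak maps a base (its dye channel) to a signal intensity.
    The strongest channel is called only if it exceeds the runner-up
    by at least min_ratio; otherwise the position is reported as 'N'.
    """
    calls = []
    for peak in peaks:
        ranked = sorted(peak.items(), key=lambda item: item[1], reverse=True)
        (best_base, best), (_, second) = ranked[0], ranked[1]
        calls.append(best_base if best >= min_ratio * second else "N")
    return "".join(calls)


peaks = [
    {"A": 900, "C": 40, "G": 30, "T": 20},
    {"A": 50, "C": 60, "G": 800, "T": 30},
    {"A": 400, "C": 350, "G": 20, "T": 10},  # two channels too close to call
]
print(call_bases(peaks))  # AGN
```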
Figure 3
(a) Raw sequence data collected on an automated DNA sequencer (Perkin‐Elmer ABI
PRISM Model 377). The four colours indicate the relative position of the bases in the DNA
fragment. Each four‐colour vertical line corresponds to a different sequence reaction. The
smaller fragments (nearer the anode) identify bases that are closer to the primer (5′ end of
the sequence information). (b) Portions of a representative, analysed sequence determined
by the automated sequencer.
Genome Sequencing
Very often, a researcher needs to determine the sequence of a DNA fragment that is larger
than the 500–1000 base average sequencing read length. Not surprisingly, strategies to
accomplish this have been developed. These strategies are divided into two major classes:
random and directed. Strategy choice is influenced by the size of the fragment to be
sequenced.
In random, or shotgun, DNA sequencing, a large DNA fragment (typically one larger than
20 000 base pairs) is broken into smaller fragments that are inserted into a cloning vector. It
is assumed that the sum of information contained within these smaller clones is equivalent
to that contained within the original DNA fragment. Numerous smaller clones are randomly
selected, DNA templates are prepared for sequencing reactions, and fluorescently‐labelled
primers that will base‐pair with the vector DNA sequence bordering the insert are used to
begin the sequencing reaction. Subsequently, the sequence of the original DNA fragment is
reconstructed by computer assembly of the sequences obtained from the smaller DNA
fragments. This strategy is being used extensively to determine the sequence of ordered
fragments that represent the entire human genome
[http://www.nhgri.nih.gov/HGP/]. However, this random
approach is typically not sufficient to complete sequence determination, since gaps in the
sequence often remain after computer assembly. A directed strategy (described below) is
usually used to complete the sequence project.
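Computer assembly of shotgun reads can be sketched as a greedy merge of the best-overlapping pair, repeated until no overlaps remain. The Python example below is a toy illustration under strong assumptions: reads are error-free, all come from the same strand, and the minimum overlap (`min_len`) is arbitrary. Production assemblers are considerably more elaborate, and the function names here are hypothetical.

```python
def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of a matching a prefix of b, or 0."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of contigs with the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicate reads
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return contigs  # nothing left to merge; gaps may remain
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [
            c for k, c in enumerate(contigs) if k not in (best_i, best_j)
        ] + [merged]


reads = ["ATGGCGTA", "GCGTACGT", "ACGTTAGC"]
print(greedy_assemble(reads))  # ['ATGGCGTACGTTAGC']

# With the middle read missing, the two remaining reads do not overlap,
# so two separate contigs are returned and a gap remains.
print(len(greedy_assemble(["ATGGCGTA", "TTAGCCAT"])))  # 2
```

The second call mirrors the situation described above: when random reads fail to cover a region, assembly leaves a gap between contigs.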
A directed, or primer‐walking, sequencing strategy can be used to fill gaps remaining after
the random phase of large‐fragment sequencing, and as an efficient approach for
sequencing smaller DNA fragments. This strategy uses DNA primers that anneal to the
template at a single site and act as a start site for chain elongation. This approach requires
knowledge of some sequence information to design the primer. The sequence obtained
from the first reaction is used to design the primer for the next reaction and these steps are
repeated until the complete sequence is determined. Thus, a primer‐based strategy involves
repeated sequencing steps from known into unknown DNA regions; this process minimizes
redundancy, and it does not require additional cloning steps. However, this strategy
requires the synthesis of a new primer for each round of sequencing.
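The iterative nature of primer walking can be sketched in Python as follows. This is a simplified model: the "template" is a string the simulation can read, each reaction returns a fixed number of bases downstream of the primer, and the primer is treated as identical to the end of the known sequence rather than as a separately synthesized oligonucleotide annealing to the opposite strand. The function names, read length and primer length are illustrative assumptions.

```python
def sequence_from(template: str, primer: str, read_len: int) -> str:
    """Simulate one reaction: return the bases just downstream of the primer."""
    start = template.find(primer)
    # The primer must anneal at exactly one site on the template.
    if start < 0 or template.find(primer, start + 1) >= 0:
        raise ValueError("primer must match the template at a single site")
    begin = start + len(primer)
    return template[begin:begin + read_len]


def primer_walk(template: str, first_primer: str,
                read_len: int = 8, primer_len: int = 5) -> tuple[str, int]:
    """Walk from known into unknown sequence, one primer per round."""
    known = first_primer
    primer = first_primer
    rounds = 0
    while True:
        read = sequence_from(template, primer, read_len)
        if not read:  # reached the end of the template
            return known, rounds
        known += read
        rounds += 1
        # Design the next primer from the end of the newly read sequence.
        primer = known[-primer_len:]


template = "GATTACAGGCTTAACCGGTATCGA"
sequence, rounds = primer_walk(template, "GATTA")
print(sequence == template, rounds)  # True 3
```

Each pass through the loop corresponds to one round of primer design and sequencing, which is why the number of custom primers grows with the length of the fragment.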
The necessity of designing and synthesizing new primers, coupled with the expense and the
time required for their synthesis, has limited the routine application of primer‐walking for
sequencing large DNA fragments. Researchers have proposed using a library of short
primers to eliminate the requirement for custom primer synthesis. The availability of a
primer library would minimize waste of primer, since each primer could be used to prime
multiple reactions, and would allow immediate access to the next sequencing primer. See
also Genome Mapping, and Genome Sequence Analysis
Prospects
One of the original goals of the Human Genome Project was to complete sequence
determination of the entire human genome by 2005. However, the project is ahead of
schedule and it is expected to produce a ‘working draft’ of the human genome by 2001. The
completed genome sequence is expected by 2003, at least two years ahead of schedule.
Technological advances are responsible for the rapid progress of this ambitious project.
These include progress in all aspects of DNA manipulation (especially manipulation and
propagation of large DNA fragments), the evolution of faster and better DNA sequencing
methods, the development of computer hardware and software capable of manipulating and
analysing the data (bioinformatics), and the automation of procedures associated with
generating and analysing DNA sequences. See
also Human Genome Project, and The Promise of Whole Genome Sequencing