Evolution of Genome
The field of genomics focuses on studying entire sets of genes and their
interactions within an organism, providing a comprehensive
understanding of its biological functions. However, the vast amount of
data generated by sequencing efforts requires sophisticated computational
methods for storage and analysis, leading to the emergence of
bioinformatics as a crucial discipline in modern biology. Bioinformatics
involves the application of computational techniques to manage and
analyze biological data efficiently.
Scientists involved in the Human Genome Project recognized this need early on
and incorporated goals related to establishing centralized databases and refining
analytical software. These efforts were aimed at ensuring that the vast amounts
of genomic data could be efficiently managed, shared, and analyzed by
researchers worldwide.
Overall, bioinformatics serves as a vital bridge between raw genomic data and
biological knowledge, providing the tools and methods necessary to extract
meaningful insights from genomic sequences. In the context of large-scale
projects like the Human Genome Project, bioinformatics plays a central role in
achieving the project's goals of understanding the structure, function, and
evolution of the human genome.
Protein Data Bank (PDB): The Protein Data Bank, maintained by institutions
like Rutgers University and the University of California, San Diego, houses
three-dimensional structures of proteins determined through experimental
methods. Researchers can access PDB to study protein structures, analyze their
functions, and visualize protein interactions. The ability to view protein
structures from different angles enhances the understanding of protein biology.
Other Tools: In addition to BLAST and PDB, the NCBI website offers various
other bioinformatics tools and software for sequence analysis, protein
comparison, domain identification, and evolutionary tree construction. These
tools empower researchers to address diverse biological questions related to
genomics, proteomics, and evolutionary biology.
Comparative Analysis: Once potential genes are identified, their sequences are
compared with those of known genes from other organisms. By comparing
DNA and protein sequences, researchers can infer the likely function of a gene
based on similarities to genes with known functions. This comparative approach
is especially useful for predicting the function of newly identified genes.
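As a rough illustration of the comparative approach, the sketch below scores a
newly identified sequence against a few known ones and reports the closest
match. It assumes the sequences are already aligned and of equal length; the
gene names and sequences are hypothetical toy data, and real searches (for
example with BLAST) use far more sophisticated alignment and scoring.

```python
def percent_identity(a: str, b: str) -> float:
    """Percentage of positions that match in two pre-aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    if not a:
        raise ValueError("sequences must not be empty")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def best_match(query: str, known: dict) -> tuple:
    """Return (name, identity) for the known sequence most similar to the query."""
    scored = ((name, percent_identity(query, seq)) for name, seq in known.items())
    return max(scored, key=lambda pair: pair[1])


# Hypothetical toy sequences, invented for illustration only.
known_genes = {
    "kinase_like": "ATGGCCAAGT",
    "globin_like": "ATGGTGCACC",
}
query = "ATGGTGCATC"

name, identity = best_match(query, known_genes)
print(f"closest known gene: {name} ({identity:.0f}% identical)")
```

A high identity to a gene of known function is only a clue to the new gene's
function, not proof; it suggests which biochemical tests to run next.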
In some cases, newly identified genes may match sequences of known genes
with well-characterized functions in other species. This similarity provides
valuable clues about the function of the newly identified gene. However, there
are instances where the sequence of a gene is entirely novel, presenting a
challenge for predicting its function. In such cases, a combination of
biochemical and functional studies is necessary to deduce the protein's function.
The significance of the ENCODE project lies in its vast dataset, which
comprises over 1,600 large data sets generated by more than 440 scientists in 32
research groups. One of the most striking findings of ENCODE is that a
substantial portion of the human genome—about 75%—is transcribed at some
point in at least one cell type, despite less than 2% coding for proteins.
Moreover, functional roles have been assigned to DNA elements constituting at
least 80% of the genome. These findings challenge previous notions of the
genome's organization and function.
Proteins, as the primary effectors of cellular activities, play a crucial role in the
functioning of cells and organisms. Therefore, understanding when and where
proteins are produced, as well as how they interact within networks, is essential
for unraveling the complexities of biological systems.
In the initial phase of TCGA, a pilot project focused on three types of cancer—
lung cancer, ovarian cancer, and glioblastoma of the brain. By comparing gene
sequences and patterns of gene expression in cancer cells with those in normal
cells, researchers identified common mutations and aberrant gene expression
patterns associated with these cancers. This approach not only confirmed the
roles of suspected genes but also uncovered previously unknown ones,
suggesting potential targets for novel therapies. The success of this pilot led to
the extension of TCGA to ten additional types of cancer, chosen based on their
prevalence and lethality.
Genome Size:
Within Taxonomic Groups: Even within taxonomic groups like insects, there
can be significant variation in genome size. For instance, the cricket genome
(Anabrus simplex) is much larger than that of the fruit fly (Drosophila
melanogaster), despite both being insects.
Eukaryotes:
The number of genes in eukaryotes varies widely. Unicellular fungi, such as
yeasts, may have around 5,000 genes, while some multicellular eukaryotes can
have over 40,000 genes.
Within eukaryotes, the number of genes in a species is not always directly
correlated with the size of its genome.
For example:
The genome of the nematode C. elegans is around 100 Mb and contains
approximately 20,100 genes.
In contrast, the genome of Drosophila melanogaster is larger (165 Mb) but has
fewer genes, approximately 14,000.
The human genome is substantially larger (3,000 Mb) than that of D.
melanogaster or C. elegans. Initially, it was expected that the human genome
would contain between 50,000 and 100,000 genes, based on the number of
known human proteins. However, the actual number of genes identified in the
completed human genome sequence turned out to be fewer than 21,000, a
surprising finding for biologists.
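The mismatch between genome size and gene number is easier to see as gene
density (genes per Mb). The short sketch below uses only the figures quoted
above; the human gene count is taken as the upper bound of "fewer than 21,000",
so the human density shown is a ceiling rather than an exact value.

```python
# (genome size in Mb, approximate number of genes), as quoted in the text.
genomes = {
    "C. elegans": (100, 20_100),
    "D. melanogaster": (165, 14_000),
    "H. sapiens": (3_000, 21_000),  # upper bound: "fewer than 21,000"
}


def genes_per_mb(size_mb: float, gene_count: int) -> float:
    """Gene density: number of genes per million base pairs."""
    return gene_count / size_mb


density = {name: genes_per_mb(size, genes) for name, (size, genes) in genomes.items()}

for name, value in density.items():
    print(f"{name}: {value:.1f} genes per Mb")
```

With these numbers, C. elegans comes out at about 201 genes per Mb, D.
melanogaster at about 85, and humans at no more than 7.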
This relatively low number of genes in the human genome compared to initial
expectations has led to further investigation, with projects like ENCODE
aiming to elucidate the functional elements and regulatory mechanisms within
the genome. Overall, the diversity in gene numbers across different organisms
underscores the complexity of genome organization and gene regulation in
living systems.
For instance, in the human genome, approximately 1.5% of the DNA codes for
proteins or is transcribed into functional RNAs like rRNAs or tRNAs. Another
5% of the genome comprises gene-related regulatory sequences, while
approximately 20% is made up of introns, which are noncoding regions
interspersed within protein-coding genes. The remaining DNA, roughly 73.5% of
the genome, consists of various elements, including unique noncoding DNA
fragments, pseudogenes (former genes that have accumulated mutations and lost
their protein-coding function), and repetitive DNA sequences. In total, about
98.5% of the genome neither codes for proteins nor is transcribed into rRNAs
or tRNAs.
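The percentages quoted above can be checked with a little arithmetic. The
sketch below uses only the three approximate figures from the text (1.5%, 5%,
and 20%) and derives two different remainders: the share left once exons,
regulatory sequences, and introns are set aside, and the share that is not
exons at all. The variable names are illustrative.

```python
# Approximate shares of the human genome, in percent, as quoted in the text.
exons = 1.5        # protein-coding or transcribed into rRNA/tRNA
regulatory = 5.0   # gene-related regulatory sequences
introns = 20.0     # noncoding regions within genes

gene_related = exons + regulatory + introns  # everything tied directly to genes
other_dna = 100.0 - gene_related             # unique noncoding DNA, pseudogenes, repeats
not_exons = 100.0 - exons                    # everything except exons

print(f"gene-related DNA: {gene_related}%")
print(f"other DNA:        {other_dna}%")
print(f"not exons:        {not_exons}%")
```

So the 98.5% figure describes all DNA that is not exons, while the DNA outside
genes and their regulatory sequences is closer to 73.5%.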
The findings from projects like ENCODE have shed light on the functional
importance of much of this noncoding DNA. Understanding how genes and
noncoding DNA sequences are organized within genomes provides valuable
insights into genome evolution and ongoing genetic processes in multicellular
eukaryotes.
Some of these sequences retain the ability to move within the genome,
facilitated by enzymes encoded by transposable elements. However, others are
related sequences that have lost their mobility. Together, transposable elements
and related sequences can constitute a substantial portion, ranging from 25% to
50%, of most mammalian genomes. In certain organisms like amphibians and
many plants, this proportion can be even higher, with transposable elements
accounting for a significant fraction of the genome size. For instance,
transposable elements make up a remarkable 85% of the corn genome.
In some multigene families, the genes consist of identical DNA sequences that
are clustered tandemly. An example is the family of genes encoding the three
largest rRNA molecules, which are essential components of ribosomes involved
in protein synthesis. These genes are repeated hundreds to thousands of times in
one or several clusters in the genome of multicellular eukaryotes, allowing cells
to produce the large number of ribosomes required for protein synthesis
efficiently.
Mutation: Mutations are alterations in the DNA sequence that can arise
spontaneously or be induced by various factors such as radiation, chemicals, or
errors during DNA replication. Mutations can occur at the nucleotide level,
leading to single nucleotide substitutions, insertions, or deletions, or they can
involve larger scale changes such as chromosomal rearrangements or
duplications. Mutations provide the raw material for evolutionary change by
introducing genetic variation upon which natural selection can act.
Unequal Crossing Over: During meiosis, unequal crossing over can occur
when nonsister chromatids of homologous chromosomes align improperly. This
misalignment can lead to unequal exchange of genetic material, resulting in one
chromosome gaining an extra copy of a particular gene while the other loses it.
This mechanism can lead to gene duplication or deletion.
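A toy model can make the outcome of unequal crossing over concrete. In the
sketch below each chromatid is a list of gene labels, and a crossover swaps
everything after a break point. The function and the labels A, B, C are
hypothetical; the only point is that breaking two homologous chromatids at
different positions produces one product with a duplicated gene and one with
that gene deleted.

```python
def crossover(chromatid_a, chromatid_b, break_a, break_b):
    """Exchange the segments after break_a (on a) and break_b (on b).

    Equal break points model normal crossing over; unequal break points
    model a misaligned exchange.
    """
    recombinant_1 = chromatid_a[:break_a] + chromatid_b[break_b:]
    recombinant_2 = chromatid_b[:break_b] + chromatid_a[break_a:]
    return recombinant_1, recombinant_2


# Two homologous chromatids carrying the same genes (hypothetical labels).
maternal = ["A", "B", "C"]
paternal = ["A", "B", "C"]

# Properly aligned exchange: both products keep one copy of each gene.
normal = crossover(maternal, paternal, 2, 2)

# Misaligned exchange: the break falls after B on one chromatid, before B on the other.
unequal = crossover(maternal, paternal, 2, 1)

print(normal)   # (['A', 'B', 'C'], ['A', 'B', 'C'])
print(unequal)  # (['A', 'B', 'B', 'C'], ['A', 'C'])
```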
Slippage during DNA Replication: Errors during DNA replication can also
lead to gene duplication. Slippage occurs when the replication machinery
encounters repetitive sequences and "slips" or misaligns, resulting in the
duplication of a segment of DNA. This process is particularly common in
regions with repetitive sequences, such as simple sequence DNA.
Multigene Families: Evidence for gene duplication can be observed in
multigene families, where multiple copies of similar or identical genes are
present within the genome. The globin family, which includes genes encoding
various forms of hemoglobin subunits, is an example of a multigene family
resulting from gene duplication events.
Exon Duplication: Unequal crossing over during meiosis can result in the
duplication of a particular exon within a gene on one chromosome and its loss
from the homologous chromosome. As a result, one copy of the gene contains a
duplicated exon, leading to the production of a protein with two copies of the
encoded domain. This duplication may enhance the protein's function by
increasing stability, ligand-binding capacity, or other properties. Many protein-
coding genes, including those encoding structural proteins like collagen, exhibit
multiple copies of related exons due to duplication and subsequent divergence.
Exon Shuffling: Exon shuffling involves the mixing and matching of different
exons within a gene or between nonallelic genes, often due to errors in meiotic
recombination. This process can lead to the generation of new proteins with
novel combinations of functions. For example, the gene for tissue plasminogen
activator (TPA), which helps control blood clotting, has four domains encoded
by different exons. Through exon shuffling during meiotic recombination and
subsequent duplication events, the current version of the TPA gene likely arose,
incorporating exons from other genes to create a protein with unique functions
(see Figure 20.16).
Think:
Assembly: The sequences of the fragments are then ordered relative to each
other using computer software. This assembly process involves aligning
overlapping sequences and merging them into one overall sequence of the
chromosome.
The depiction of scattered DNA fragments in step 2 of the figure reflects the
random nature of the whole-genome shotgun approach. Instead of sequencing
the chromosome in a linear fashion, this method involves breaking the
chromosome into random fragments, sequencing them, and then piecing them
back together based on overlapping sequences. This approach allows for rapid
sequencing of entire genomes without the need for prior knowledge of the
genome's organization.
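The idea of piecing fragments together by their overlaps can be sketched with a
very small greedy assembler. This is a simplified illustration, not how
production assemblers work: it assumes error-free reads from a single strand,
no repeated regions, and no fragment contained entirely inside another. The
function names and the example fragments are hypothetical.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b (0 if below min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(fragments, min_len: int = 3):
    """Repeatedly merge the pair of fragments with the largest overlap."""
    reads = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # no remaining overlaps; leave the pieces as separate contigs
        merged = reads[best_i] + reads[best_j][best_n:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


# Hypothetical fragments, given in random order as in a shotgun experiment.
fragments = ["GTTAGC", "ATGCGTAC", "GTACGTTA"]
contigs = greedy_assemble(fragments)
print(contigs)  # ['ATGCGTACGTTAGC']
```

Note that the fragments are supplied out of order and the overlaps alone
determine how they are joined, which is why no prior map of the chromosome is
needed.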
Bioinformatics tools, such as those available through the National Center for
Biotechnology Information (NCBI), play a crucial role in analyzing and
interpreting DNA sequence data. These tools allow scientists to access DNA and
protein sequences, search for similar sequences in databases, and perform
various types of sequence analysis, such as identifying protein domains and
predicting protein structures.
Exons: These are the regions of genes that code for proteins or are transcribed
into rRNA or tRNA molecules. Exons make up about 1.5% of the human
genome.
Introns: These are non-coding regions within genes that are transcribed into
pre-mRNA but are removed during RNA processing. Introns, along with regulatory
sequences associated with genes, constitute about a quarter of the human
genome.
Regulatory sequences: These are DNA sequences that regulate the expression
of genes, including promoters, enhancers, and silencers. They play crucial roles
in controlling when and where genes are expressed.
Alu elements: These are short, interspersed nuclear elements that are about 300
base pairs long and are found in large numbers throughout the human genome,
making up approximately 10% of the genome.
L1 sequences: Long interspersed nuclear elements (LINE-1 or L1 elements) are
a type of transposable element that can move around the genome. They make up
about 17% of the human genome.
Large-segment duplications: These are duplications of long stretches of DNA
that together account for about 5-6% of the human genome. They often include
functional genes and may have been copied from one chromosomal location to
another.
The mechanism described in Figures 20.8 and 20.9 that results in a copy
remaining at the original site as well as a copy appearing in a new
location is the copy-and-paste mechanism for transposons (Figure 20.8)
and retrotransposons (Figure 20.9). In both cases, a new copy of the
transposon or retrotransposon is inserted into the genome at a new
location, while the original copy remains in place.
The organization of the rRNA gene family involves multiple copies of
rRNA transcription units clustered together, with each unit producing
transcripts for the three main types of ribosomal RNA (18S, 5.8S, and
28S). This clustered organization allows for efficient production of
ribosomal RNA, which is essential for protein synthesis. In contrast, the
globin gene families consist of multiple copies of genes encoding alpha
and beta globin polypeptide subunits. These gene families provide
redundancy and flexibility in hemoglobin production, allowing for
variations in globin expression during different developmental stages or
in response to changing physiological conditions.
How crossing over occurs during meiosis and how it leads to the
duplication of a gene due to unequal crossing over:
Chromatid with gene duplicated: When the chromatids misalign before exchanging
genetic material, one chromatid gains an extra copy of the gene (or DNA
segment) while the other loses it. The first chromatid now contains a
duplicated segment, including the gene.
Single Nucleotide Polymorphisms (SNPs): SNPs are the most common type
of genetic variation in the human genome, occurring at single base-pair sites
where genetic variation is found in at least 1% of the population. By analyzing
SNPs, researchers can identify genetic markers associated with traits, diseases,
and population differences. SNPs serve as valuable tools for studying human
evolution, population genetics, and personalized medicine.
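The 1% criterion can be expressed as a simple frequency check. The sketch below
takes the alleles observed at one site across a sample and tests whether the
second most common allele reaches the threshold. The function names and the
sample counts are hypothetical, and the threshold is applied to sampled allele
copies as a stand-in for the population.

```python
from collections import Counter


def minor_allele_frequency(alleles) -> float:
    """Frequency of the second most common allele at a site (0.0 if no variation)."""
    counts = Counter(alleles)
    if len(counts) < 2:
        return 0.0
    second_most_common = counts.most_common()[1][1]
    return second_most_common / len(alleles)


def is_snp(alleles, threshold: float = 0.01) -> bool:
    """True if the site varies in at least `threshold` of the sampled alleles."""
    return minor_allele_frequency(alleles) >= threshold


# Hypothetical samples of 100 and 1,000 allele copies at two different sites.
common_variant = ["A"] * 97 + ["G"] * 3
rare_variant = ["A"] * 999 + ["G"] * 1

print(is_snp(common_variant))  # True: G is found in 3% of the sample
print(is_snp(rare_variant))    # False: G is found in only 0.1% of the sample
```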
Think:
In the ancestral EGF and fibronectin genes, multiple exons might have arisen
through the following mechanisms:
Exon Shuffling: Exon shuffling refers to the process where exons from
different genes are brought together through recombination events, leading to
the creation of new genes with novel functions. In the case of EGF and
fibronectin genes, exon shuffling events may have occurred, bringing together
exons from different ancestral genes to form the current gene structure.
The extra Alu elements in the human genome likely arose through a process of
Alu element proliferation via retrotransposition. Alu elements are a type of
transposable element that can copy themselves and insert into new genomic
locations via an RNA intermediate. Over evolutionary time, these elements have
been duplicated and accumulated in the human genome.
One possible role of these extra Alu elements in the divergence of humans and
chimpanzees could be their contribution to genome evolution through the
generation of genetic diversity. Alu elements can insert into genes or regulatory
regions, potentially leading to mutations or alterations in gene expression. These
changes may have contributed to phenotypic differences between humans and
chimpanzees, thereby playing a role in their evolutionary divergence.
Additionally, Alu elements may have contributed to genomic rearrangements
and structural variations that distinguish the two species.
The provided sequences represent short segments of the FOXP2 protein from
six species: chimpanzee (C), orangutan (O), gorilla (G), rhesus macaque (R),
mouse (M), and human (H).
First, identify the sequences that are identical among the chimpanzee (C),
gorilla (G), and rhesus macaque (R) species. These sequences are "ATETI,"
"PKSSD," "TSSTT," and "NARRD."
Next, identify the sequence for the human (H) species, which differs from the
chimpanzee (C), gorilla (G), and rhesus macaque (R) sequences at two amino
acids. Underline these two differences in the human sequence.
The orangutan (O) sequence differs from the chimpanzee (C), gorilla (G), and
rhesus macaque (R) sequences at one amino acid (having V instead of A) and
from the human (H) sequence at three amino acids. Identify the orangutan
sequence.
In the mouse (M) sequence, circle the amino acid(s) that differ from the
chimpanzee (C), gorilla (G), and rhesus macaque (R) sequences, and draw a
square around those that differ from the human (H) sequence.
Comparing Mouse-Primate Amino Acid Differences with Human-Primate
Differences:
Compare the amino acid differences between the mouse (M) sequence and the
chimpanzee (C), gorilla (G), and rhesus macaque (R) sequences with those
between the human (H) sequence and the chimpanzee (C), gorilla (G), and
rhesus macaque (R) sequences.
Count and compare the number of amino acid differences between the mouse
(M) and primate sequences versus those between the human (H) and primate
sequences.
Consider the evolutionary implications of these differences in terms of the
divergence between rodents and primates compared to that between humans and
other primates.
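Counting the differences by hand works for short segments, but the same
comparison can be sketched in code. The sequences below are placeholders
invented for illustration and are NOT the actual FOXP2 segments from the
exercise; substitute the real aligned sequences to reproduce the counts. The
function name is also hypothetical.

```python
def differences(reference: str, other: str):
    """List (position, reference_residue, other_residue) where two aligned sequences differ.

    Positions are 1-based, as is usual when numbering amino acids.
    """
    if len(reference) != len(other):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (pos, r, o)
        for pos, (r, o) in enumerate(zip(reference, other), start=1)
        if r != o
    ]


# Placeholder sequences (not real FOXP2 data).
primate_consensus = "MKTAVLS"  # stands in for the identical C, G, and R sequences
human_like = "MKNAVLN"
mouse_like = "MKTAVIS"

human_diffs = differences(primate_consensus, human_like)
mouse_diffs = differences(primate_consensus, mouse_like)

print(len(human_diffs), human_diffs)  # 2 [(3, 'T', 'N'), (7, 'S', 'N')]
print(len(mouse_diffs), mouse_diffs)  # 1 [(6, 'L', 'I')]
```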
Discuss how comparing the sequences of the FOXP2 protein across primates
can provide insights into the evolutionary changes associated with speech.
Explain the significance of identifying specific amino acid changes in FOXP2
across different primate species and how these changes may be linked to the
development of speech-related traits in humans.
Consider the evolutionary context and implications of FOXP2 sequence
variation in the context of primate evolution and the emergence of speech and
language abilities.
Discuss the concept of gene regulation and its role in controlling developmental
processes and morphological traits.
Explain how changes in gene regulation, such as mutations or alterations in
regulatory sequences, can lead to the evolution of novel structures and
phenotypic traits.
Use the example of the treehopper's thorns to illustrate how changes in gene
regulation may have influenced the evolution of this unique structure.
Consider the adaptive significance of changes in gene regulation and their
contribution to organismal diversity and evolution.
Comparing the sequences of the FOXP2 protein across primates can provide
insights into the evolutionary changes associated with speech. By examining the
differences and similarities in FOXP2 sequences among different primate
species, researchers can identify specific amino acid changes that may have
contributed to the development of speech-related traits. Understanding how
FOXP2 has evolved across primate lineages can help elucidate the genetic basis
of speech and language evolution in humans.