Perspectives: Overcoming Challenges and Dogmas To Understand The Functions of Pseudogenes

At a glance
The document discusses the history and evolution of the definition of pseudogenes. It also discusses different classes of pseudogenes and how they arise. Current genome annotation approaches are discussed alongside limitations that have inhibited research on pseudogene regions.

There are four main classes of pseudogenes: processed, unprocessed, unitary, and polymorphic pseudogenes. Processed pseudogenes arise from retrotransposition of mRNA, while unprocessed result from genomic duplication. Unitary pseudogenes form without duplication through gene inactivation.

Initial genome annotations often misannotate pseudogenes as protein-coding. Projects like GENCODE aim to accurately annotate all gene features, including identifying over 10,000 processed pseudogenes in the human genome. However, exact pseudogene numbers can vary depending on annotation methodology.

PERSPECTIVES

Overcoming challenges and dogmas to understand the functions of pseudogenes

Seth W. Cheetham, Geoffrey J. Faulkner and Marcel E. Dinger

Abstract | Pseudogenes are defined as regions of the genome that contain defective copies of genes. They exist across almost all forms of life, and in mammalian genomes are annotated in similar numbers to recognized protein-coding genes. Although often presumed to lack function, growing numbers of pseudogenes are being found to play important biological roles. In consideration of their evolutionary origins and inherent limitations in genome annotation practices, we posit that pseudogenes have been classified on a scientifically unsubstantiated basis. We reflect that a broad misunderstanding of pseudogenes, perpetuated in part by the pejorative inference of the ‘pseudogene’ label, has led to their frequent dismissal from functional assessment and exclusion from genomic analyses. With the advent of technologies that simplify the study of pseudogenes, we propose that an objective reassessment of these genomic elements will reveal valuable insights into genome function and evolution.

The term ‘pseudogene’ was first used in 1977 by Jacq et al.1 to describe a truncated 5S ribosomal RNA gene in Xenopus laevis, and similarly truncated gene copies have since been found to be a common feature in 5S RNA gene repeat regions in metazoans2. In the absence of evidence that the 5S pseudogenes were transcribed, Jacq et al. concluded that the most probable explanation for the existence of the pseudogenes is that they are a relic of evolution and are functionless1.

Since the coining of the term pseudogene, its definition has broadened and is now widely accepted to define any genomic sequence that is similar to another gene and is defective3. Pseudogenes are classified by their mechanisms of origin. The two major classes are processed pseudogenes, derived from retrotransposition of processed mRNAs4 (Fig. 1a), and unprocessed pseudogenes, derived from segmental duplication (Fig. 1b). Unitary pseudogenes are a minor third class that are formed without duplication, when a single original gene is inactivated through mutation such that no functional copy of the gene remains (Fig. 1c). A fourth class of rare pseudogenes are those that have disabling mutations in the reference genome but are intact in some individuals5,6 (Fig. 1d).

GENCODE (v31), a large-scale project that aims to annotate all gene features in the human genome with high accuracy, identifies 10,668 processed pseudogenes in the human genome, accounting for 72% of all human pseudogenes5. However, exact numbers differ, with frequencies ranging from ~8,000 (ref. 7) to ~13,000 (refs 8,9) depending on the annotation methodology. As a critical mechanism underpinning biological evolution10,11, processed pseudogene formation is ongoing in human evolution with at least 48 retrotransposed gene copies (retrocopies) that are polymorphic in the human population12–15.

Annotations of metazoan genomes typically describe between 10,000 and 20,000 regions as pseudogenic16. The binary distinction between genes and pseudogenes forms a central theme in genome annotation and, ultimately, informs the reference list of genes of an organism. Consequently, the annotation of genomic regions as pseudogenes constitutes an etymological signifier that an element has no function and is not a gene. As a result, pseudogene-annotated regions are largely excluded from functional screens and genomic analyses17–19. Therefore, the process of pseudogene annotation is paramount in the consideration of which genomic elements are assessed for biological impact. However, with a growing number of instances of pseudogene-annotated regions later found to exhibit biological function, there is an emerging risk that these regions of the genome are prematurely dismissed as pseudogenic and therefore regarded as void of function. Due to the recent maturation of several enabling technologies, we propose that the time is opportune for a re-evaluation of the functions of pseudogene-annotated regions. The advent of long-read transcriptomics enables the identification of dynamically active pseudogenes, and whole-genome sequencing of large cohorts enables the identification of disease-associated pseudogenes. Additionally, the CRISPR–Cas9 revolution allows straightforward interrogation of pseudogene functions.

In this Perspective article, we systematically assess the process of pseudogene annotation and highlight the assumptions and limitations that are implicit in classification algorithms. We provide a critical analysis of the pitfalls of current approaches to annotating pseudogenes and describe how methodological limitations and largely arbitrarily defined assumptions have inhibited research in pseudogene-annotated regions of the genome. Finally, we consider the scientific utility of the pseudogene concept in the context of evolutionary biology and the dogma established through virtue of the term itself. We propose that the availability of new technologies and approaches to study gene and genome function, together with the recalibration of how pseudogenes are perceived, will invigorate study into the biology of these regions of the genome.

Identifying pseudogenes in Eukarya
During initial annotations of the genic content of the human genome by ab initio gene annotation software, many pseudogenes are erroneously annotated as protein coding, and their removal from gene annotations is considered a priority20,21.

Nature Reviews | Genetics



[Figure 1: schematic with four panels. a, Processed pseudogenes (transcription of the parent gene, then reverse transcription and integration of the mRNA). b, Unprocessed pseudogenes (duplication and mutation). c, Unitary pseudogenes (mutation and loss of the functional ancestor gene). d, Polymorphic pseudogenes (the reference genome sequence is a pseudogene; the non-reference sequence is a gene).]

Fig. 1 | Major classes of eukaryotic pseudogenes. a | Processed pseudogenes arise from the reverse transcription and integration of a processed mRNA.
b | Unprocessed pseudogenes originate from gene duplications that accumulate mutations, preventing their translation. c | Unitary pseudogenes are
derived without duplication from an ancestral protein-coding gene that has lost protein-coding potential. d | Polymorphic pseudogenes are sequences
that have disabling mutations in the reference genome, but are intact in other non-reference genomes.
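
The four classes in Fig. 1 amount to a small decision procedure based on mechanism of origin, which can be sketched in code. This is only an illustrative sketch: the record type, its field names and the order in which the checks are applied are assumptions made here for clarity, not part of GENCODE or any other annotation pipeline.

```python
from dataclasses import dataclass


@dataclass
class PseudogeneCandidate:
    """Hypothetical record; field names are illustrative assumptions."""
    arose_by_retrotransposition: bool  # derived from a processed mRNA (Fig. 1a)
    arose_by_duplication: bool         # derived from a duplicated gene copy (Fig. 1b)
    intact_in_some_individuals: bool   # disabled only in the reference genome (Fig. 1d)


def classify(candidate: PseudogeneCandidate) -> str:
    """Assign one of the four classes described in Fig. 1.

    Checking the polymorphic case first is an assumption of this sketch.
    """
    if candidate.intact_in_some_individuals:
        return "polymorphic"
    if candidate.arose_by_retrotransposition:
        return "processed"
    if candidate.arose_by_duplication:
        return "unprocessed"
    # No duplication of any kind: the single original gene was inactivated (Fig. 1c).
    return "unitary"


print(classify(PseudogeneCandidate(True, False, False)))
print(classify(PseudogeneCandidate(False, False, False)))
```

The sketch deliberately says nothing about function: every class is defined by how the sequence arose, which is the point the main text goes on to examine.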

Annotation of a region as pseudogenic is largely sufficient for it to be excluded from functional consideration. Thus, the correct differentiation of gene from pseudogene is of crucial importance in genome biology.

Pseudogenes in eukaryotic genomes are identified by computational pipelines and manual annotation5. Pseudogenes are first identified by searches for sequences similar to known genes22,23. The absence of introns, the occurrence of truncations and disruptions to the open reading frame relative to the parent gene are the primary characteristics used to identify pseudogenes20. Different laboratory groups use various combinations of characteristics to identify putative pseudogenes20,24. Importantly, the absence of introns and the absence of strong evidence of transcription are sufficient to identify a processed pseudogene. Additionally, some processed pseudogenes do not harbour truncating mutations and have the same protein-coding capacity as their parent genes8,25. Indeed, 8.9% of recognized human protein-coding genes do not contain introns and are likely to be derived from retrotransposition26. Thus, computational differentiation of pseudogenes from genes on a purely rule-based system is unlikely to be feasible as it will inherently conflict with many protein-coding genes derived through gene duplication and retrotransposition. Accordingly, untruncated pseudogenes are often manually reannotated as protein-coding genes if they have demonstrated function (for example, PGK2, POU5F1B and NANOGP8)5.

Determination of the ratio of rates of non-synonymous substitutions to synonymous substitutions (Ka/Ks)27 for pseudogenes is a proposed mechanism for assessing the coding potential of putative pseudogenes20. Ratios substantially divergent from 1 can be an indication of purifying selection (Ka/Ks << 1) or positive selection (Ka/Ks >> 1). However, this approach is uninformative for recently duplicated pseudogenes, which have not had sufficient time to diverge from their parental genes, and for pseudogenes that are translated in different reading frames than their parent genes due to frameshifting insertions or deletions (indels) or 5′ truncations. Evidence of transcription could be a useful metric to identify protein-coding genes with intact open reading frames incorrectly annotated as processed pseudogenes. However, determining the transcriptional state of pseudogenes is technically challenging, as will be discussed later. Therefore, we suggest that it may be useful to consider the annotation of pseudogenes in genomes as a prediction or a hypothesis rather than a classification. As discussed further below, the inherent semantic contradiction that arises when a pseudogene is found to have biological function raises the notion that the term pseudogene should be reserved for gene copies that have been empirically demonstrated to be defective rather than indicated by algorithmic prediction alone.
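
The Ka/Ks reasoning described in this section can be illustrated with a minimal sketch. It assumes that Ka and Ks have already been estimated elsewhere; the function name and the `tolerance` cut-off are arbitrary choices made for illustration, since the text says only that ratios "substantially divergent from 1" are informative and names no threshold.

```python
def interpret_ka_ks(ka: float, ks: float, tolerance: float = 0.5) -> str:
    """Interpret a Ka/Ks ratio for a putative pseudogene.

    ka, ks: non-synonymous and synonymous substitution rates (pre-computed).
    tolerance: illustrative, arbitrary distance from 1 that counts as
               "substantially divergent"; not a published cut-off.
    """
    if ks <= 0:
        # No synonymous divergence, e.g. a recently duplicated copy that has
        # not had time to diverge from its parent: the ratio is undefined.
        return "uninformative"
    ratio = ka / ks
    if ratio < 1 - tolerance:
        return "purifying selection"
    if ratio > 1 + tolerance:
        return "positive selection"
    return "consistent with neutral evolution"


print(interpret_ka_ks(0.01, 0.20))  # ratio well below 1
print(interpret_ka_ks(0.20, 0.20))  # ratio of 1
print(interpret_ka_ks(0.00, 0.00))  # no divergence yet
```

The `uninformative` branch mirrors the caveat in the text for recently duplicated copies. The frameshift caveat is not modelled: a copy translated in a different reading frame would need Ka and Ks recomputed in that frame before any such interpretation applies.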
Glossary

Expressed sequence tags (ESTs). Short fragmented sequences of cDNAs. Mapping ESTs identifies transcribed genes.

Non-synonymous substitutions. Nucleotide substitutions that change the encoded amino acid sequence.

Positive selection. Selection for alleles that increase fitness. Positive selection results in shifts of the allele frequency.

Purifying selection. Selection against alleles that are deleterious to fitness. Purifying selection maintains the amino acid sequence.

Retrotransposition. Insertion of a sequence into the genome via the reverse transcription and integration of an RNA intermediate.

Synonymous substitutions. Nucleotide substitutions that do not change the encoded amino acid sequence.

Functional pseudogenes
Where pseudogenes have been studied directly they are often found to have quantifiable biological roles (Table 1). Poignantly, a 5S ribosomal RNA pseudogene, similar to the first description of a pseudogene discussed above1, was recently demonstrated in humans to induce an innate immune response

www.nature.com/nrg

Table 1 | Functional pseudogenes

Pseudogene | Organism | Parent gene function | Biological impact of pseudogene | Regulatory mechanism | Refs
PGK2 | Human | Phosphoglycerate kinase | Testis-specific enzyme that catalyses the conversion of 1,3-bisphosphoglycerate to 3-phosphoglycerate during glycolysis | Protein-based | 32,33
POU5F1B | Human | Pou-domain transcription factor | Putative transcription factor that promotes tumour growth; amplified in gastric cancer | Protein-based | 36
NANOGP8 | Human | Homeodomain transcription factor | Putative transcription factor that promotes cell proliferation | Protein-based | 126
ΨCX43 (GJA1P1) | Human | Gap junction protein | Putative gap junction protein that inhibits cell growth | Protein-based | 127
NOTCH2NL | Human | Transmembrane receptor | Activates NOTCH signalling by sequestering the inhibitory ligand DELTA; expands cortical progenitor population | Protein-based | 37,38
SRGAP2C | Human | Slit-Robo Rho GTPase activating protein | Dimerizes with SRGAP2, inhibiting its function | Protein-based | 39,40
NOS pseudogene | Lymnaea stagnalis | Nitric oxide synthase | Antisense RNA prevents the translation of NOS by forming an RNA–RNA hybrid | RNA-based | 41
PTENP1 | Human | Phosphatase that converts PtdIns(3,4,5)P3 to PtdIns(4,5)P2, inhibiting PI3K–AKT signalling | Increases PTEN expression by sequestering microRNAs; can act as a tumour suppressor | RNA-based | 45
BRAFP1 | Human | Serine/threonine protein kinase | Increases BRAF expression by sequestering microRNAs; can act as an oncogene | RNA-based | 46
HMGA1-p | Human | High-mobility-group chromatin protein | Inhibits HMGA1 expression by competing for the RNA-stabilizing protein αCP1 | RNA-based | 128
Lethe | Mouse | Ribosomal protein subunit S15A | Directly binds to and inhibits NF-κB, modulating inflammatory responses | RNA-based | 44
RNA5SP141 | Human | 5S ribosomal RNA | Binds to RIG-I during herpesvirus infection, inducing interferon expression | RNA-based | 28
OCT4pg5-as | Mouse | Pou-domain transcription factor | Suppresses OCT4 expression by increasing EZH2 occupancy at the OCT4 promoter | RNA-based | 129
HBBP1 | Human | β-globin | Facilitates switching of fetal to adult globin expression by regulating contacts with the locus control region | DNA-based | 50
Immunoglobulin pseudogenes | Chicken | Immunoglobulin segments | Generate immunoglobulin diversity by gene conversion | DNA-based | 130,131
PRSS3P2 | Human | Cationic trypsinogen | Causes hereditary pancreatitis by gene conversion with PRSS3 | DNA-based | 55
CYP21A2P | Human | 21-Hydroxylase, a cytochrome P450 enzyme | Causes adrenal hyperplasia by gene conversion with CYP21A2 | DNA-based | 56
CYP2A7 | Human | Cytochrome P450 enzyme | Increases CYP2A6 mRNA stability due to a 3′ UTR polymorphism formed by gene conversion | DNA-based | 132

PI3K, phosphoinositide 3-kinase; PtdIns(4,5)P2, phosphatidylinositol 4,5-bisphosphate; PtdIns(3,4,5)P3, phosphatidylinositol 3,4,5-trisphosphate; PTEN, phosphatase and tensin homologue; UTR, untranslated region.

to viral infection28. Many further examples of pseudogenes functioning through diverse mechanisms have since been described. As these have been comprehensively reviewed elsewhere29–31, this section focuses on key illustrative examples to highlight the various means by which pseudogenes can perform biological functions.

Retrocopies were presumed to be transcriptionally silent processed pseudogenes due to the loss of genomic cis-regulatory elements. The study of PGK2, a human retrocopy of the phosphoglycerate kinase gene, revealed that not all human retrocopies are functionless as assumed. Although PGK2 has the hallmarks of a retrotransposed sequence (the absence of introns, a genomic polyadenylate tract and target site duplications), it has no disruptions in its coding sequence and is expressed in the testis32,33 (Fig. 2a). Genome-wide analysis of human retrocopy expressed sequence tags (ESTs) reveals that more than 1,000 are likely to be transcribed, including 272 that have no disruption in their open reading frames25. Intact retrocopies are now widely recognized as important regulators of antiviral defence34, neural function35 and cancer36. Although these individual retrocopies have been reclassified as protein-coding genes, many intact retrocopies remain annotated as pseudogenes.

Truncation relative to the parental open reading frame is not a definitive criterion for pseudogene non-functionality. In a striking recent example, human-specific duplications of NOTCH2 (NOTCH2NL) were discovered to expand cortical neurogenesis37,38. NOTCH2NL is highly truncated, containing less than half of the intact NOTCH2 coding sequence: NOTCH2NL encodes the ligand-binding domain but not the transactivation domain. By binding the NOTCH ligand DELTA but not inducing transactivation, NOTCH2NL


[Figure 2: schematic in three groups. Translation into proteins: a, untruncated pseudogene (e.g. PGK2); b, truncated pseudogene (e.g. NOTCH2NL). Sources of non-coding RNAs: c, antisense pseudogene, with hybridization to parent mRNA, processing into siRNAs and translational inhibition (e.g. NOS); d, pseudogene lncRNA (e.g. Lethe). DNA-mediated regulation: e, contacts with the locus control region (e.g. HBBP1); f, gene conversion between pseudogene and gene.]

Fig. 2 | Examples of pseudogene functions. a | Untruncated pseudogenes can encode full-length proteins with high similarity to their parent genes. b | Truncated proteins encoded by pseudogenes can function through intact domains. c | Pseudogenes transcribed in antisense relative to their parent genes can form hybrids with parental mRNAs, inhibiting translation. Pseudogene–mRNA hybrids can be processed into small interfering RNAs (siRNAs), inhibiting parental gene expression. d | Pseudogenes can encode long non-coding RNAs (lncRNAs) that function through RNA–protein interactions. e | Pseudogenes can function in an RNA-independent manner by facilitating 3D chromatin interactions. f | Pseudogenes can transfer deleterious alleles to their parental genes by non-allelic recombination (gene conversion).

modulates the level of NOTCH activity in the cortex. Similarly, a truncated protein encoded by a human pseudogene of SRGAP2 can inhibit its parental gene39,40. Thus, many pseudogenes may function as protein-coding genes despite truncation relative to their parent gene (Fig. 2b).

Many pseudogenes contain mutations at a frequency that renders them unlikely to be (or incapable of being) translated into proteins. However, such mutations do not necessarily preclude pseudogenes from performing a biological function. The first such function identified was for a pseudogene of neural nitric oxide synthase (NOS) in the snail Lymnaea stagnalis41. This pseudogene is transcribed antisense with respect to the parent gene and can form a stable RNA duplex in vivo, inhibiting translation of the parent NOS mRNA (Fig. 2c). Subsequently, a myriad of RNA-based regulatory mechanisms have been described for pseudogenes, including processing into small interfering RNAs (siRNAs)42,43 (Fig. 2c) that may regulate their parent genes, acting as a decoy for transcription factors44 (Fig. 2d) and, most prominently, as molecular sponges for microRNAs45,46. The expression of the pseudogenes PTENP1, KRASP1 and BRAFP1 was hypothesized to increase the levels of their parent genes by sequestering microRNAs via shared binding sites. This hypothesis was generalized as a theory (‘ceRNA hypothesis’) of competitive endogenous RNA (ceRNA) networks, wherein changes in the levels of an RNA influence the levels of RNAs that share microRNA target sites47. However, recent evidence indicates that such competitive effects would require unphysiological levels of competing transcripts48,49. One interpretation of generally low pseudogene transcription is that these pseudogenes would make particularly unsuitable candidates to exert strong regulatory effects by microRNA competition49.

Another mechanism through which pseudogenes can function is by influencing chromatin or genomic architecture (Fig. 2e). HBBP1, a pseudogene residing within the haemoglobin locus, enables the dynamic chromatin changes that regulate expression of fetal and adult globin genes during development50. Notably, although inhibiting HBBP1 transcription has no effect, deletion of the genomic locus reactivates fetal globin expression. HBBP1 DNA contacts, but not transcription, are required for suppressing the expression of fetal globin genes in adult erythroid cells50. Another example in which pseudogenes may function intrinsically as DNA elements is by influencing chromosome stability. A study of the genomic architecture of the 22q11.2 locus, which is associated with DiGeorge/velocardiofacial syndrome, proposed that deletions of a pseudogene within the low-copy repeat region could increase non-allelic homologous recombination, which in turn results in deletions that underlie the disease51. Pseudogenes can also produce pseudogene–gene fusion transcripts. In prostate cancer, exons of the pseudogene KLKP1 are spliced into the adjacent KLK4 gene, creating a novel fusion protein52,53. Polymorphic retrocopies can also form fusion transcripts if they are located in the intron of a gene. For example, a polymorphic retrocopy of CBX3 located in an intron of CCDC32 results in a chimeric transcript of unknown function in some individuals15.

Pseudogenes can also transfer deleterious alleles to their parental genes by non-allelic recombination (gene conversion)54 (Fig. 2f). Pseudogene-mediated gene conversion underlies cases of hereditary pancreatitis55, adrenal hyperplasia56, polycystic kidney disease57, cataracts58 and a multitude of other diseases54.

Copy number variations in pseudogenes can also contribute to human disease. Increased copy number in the NOTCH2NL region is associated


with autism and macrocephaly, whereas deletions are associated with microcephaly and schizophrenia38. Most strikingly, a deletion in the pseudogene FAAHP1 was recently linked to the complete pain insensitivity of a patient59. The frequency of disease-causing pseudogene variants is undetermined. Analyses of human exomes typically exclude pseudogenes, although their sequences are likely to be captured due to cross-hybridization, which can confound variant detection in some genes60. Similarly, non-coding regions of the genome (including pseudogenes) were rarely considered when linking polymorphisms detected by genome-wide association studies to genes61. Recently, numerous long non-coding RNAs (lncRNAs) were identified overlapping disease-associated loci in apparent ‘gene deserts’62, and the connection between disease-associated variants and non-coding RNA expression is increasingly evident63. It is expected that further links between human pseudogene polymorphisms and complex diseases will be identified in the coming years.

The examples of pseudogene function elaborated on here should not imply that pseudogene functionality is likely to be confined to isolated instances. At least 15% of pseudogenes are transcriptionally active across three phyla, many of which are proximal to conserved regulatory regions16. It is estimated that at least 63 new human-specific protein-coding genes were formed by retrotransposition since the divergence from other primates64. Numerous ‘retrogenes’ continue to be recognized as functional protein-coding genes rather than pseudogenes across species9,65. High-throughput mass spectrometry and ribosomal profiling approaches have identified hundreds of pseudogenes that are translated into peptides66–69. Although the functions of these peptides remain to be experimentally determined, such examples illustrate the challenge in substantiating a gene–pseudogene dichotomy.

Caveats of the ‘pseudogene’ term
In light of the growing number of examples of pseudogenes that exert biological function, it is important to consider whether a pseudogene that has a demonstrated function should still be considered a pseudogene. The original conceptualization of pseudogenes as ‘defective’ is vague3; that is, it does not define whether the term means defective with respect to performing the function of the parent gene or, rather, performing no function at all. Nevertheless, the non-functionality of pseudogenes remains the dominant and default perception. An illustrative example is Lethe, which is a mouse pseudogene of the ribosomal protein RPS15A. RPS15A is a structural component of the ribosome, whereas Lethe is a lncRNA that binds to and inhibits NF-κB signalling in innate immunity44. Lethe is clearly defective with respect to ribosomal function but has an important biological function. Should Lethe be considered a pseudogene or a lncRNA gene, or both? Is NOTCH2NL a pseudogene due to not fulfilling the function of NOTCH2 or a gene as it is translated into a functional protein? Differentiating genes from pseudogenes is further complicated by polymorphic pseudogenes that are defective in the reference genome but are intact in some individuals. Such examples preclude a simple delineation between gene and pseudogene.

The definition of function is a complex and often controversial concept in biology70. Although many pseudogenes may not have a detectable function that currently affects cellular or organismal fitness, their existence may have an evolutionary role. Gene function runs along a continuum from that of essential biological function through to enabling a platform for the long-term fitness of an organism. An example of a long-term evolutionary role is genetic redundancy, where a given biochemical function is redundantly encoded by two or more genes71. The notion of genetic redundancy as an evolutionary mechanism to provide resilience is contrary to the concept of pseudogenization, which presumes gene copies are functionless. Given the extent of genome annotation impacted by these contrasting hypotheses, there is a surprising lack of debate around these opposing perspectives in the scientific literature.

Another point in the spectrum of biological functionality is the potential for a genomic region to evolve new function over time. As processed pseudogenes do not duplicate introns, their propagation leads to the mobilization and dispersal of relatively GC-rich and repeat-poor gene exons throughout the genome. As well as potential protein-coding roles, these functionally novel elements can provide the substrate for the evolution of novel non-coding regulatory elements. An illustrative example is the eutherian dosage compensation RNA Xist. Xist, which is critical for inactivation of one of the two X chromosomes in female somatic cells, evolved from the pseudogenization of the ancestral Lnx3 gene72. It is probable that Lnx3 lost protein-coding capacity prior to the evolution of a dosage compensation function. During this period, the pseudogene was probably dispensable for organismal survival but provided the evolutionary substrate for biological innovation. At least 55 conserved human lncRNAs similarly appear to have evolved from ancestral protein-coding genes73,74. Thus, the lack of contemporary function does not preclude a sequence from acquiring regulatory function in the future under selective pressure. As such acquired functions do not appear to be an especially rare or isolated phenomenon, it would seem remiss to take the default perspective that processed pseudogenes are functionless. Instead, it is probable that pseudogene-containing regions of the genome harbour important biological functions that are yet to be revealed.

Experimental challenges
The extent to which pseudogenes contribute to organismal biology remains largely unclear. In addition to the disincentive to explore pseudogene function created by the a priori assumption that they are functionless, their systematic study has also been hindered by a lack of robust methodologies capable of distinguishing the biological activities of pseudogenes from the functions of the genes from which they are derived. The scenario is reminiscent of, and in many regards analogous to, the challenges that the lncRNA field underwent following the initial observation of their pervasive transcription in mammalian genomes75. lncRNAs were similarly dismissed initially as emanating from ‘junk DNA’ or as transcriptional noise76, largely by virtue of their definition as non-protein-coding, and were challenging to study due to their generally lower and more restricted expression patterns relative to mRNAs77. Following a combination of technology developments78, genome-wide studies79,80 and detailed biochemical studies81,82, lncRNAs are now routinely included in genome-wide analyses, and their functional potential as cellular regulators is widely recognized83. Key among these advances was the shift away from targeted microarrays, which measured expression only of annotated genes, to largely unbiased transcriptome sequencing approaches, which could detect expression of lncRNAs84. In retrospect, it is likely that advancement of both practical and theoretical heuristics was pivotal in the proliferation of research in lncRNAs and their ascent in modern genome biology. By contrast, due in part to the experimental challenge of investigating their function and expression, pseudogenes are typically


excluded from genome-wide functional screens and expression analyses.

Detection of pseudogene expression. An illustrative example of the potential for pseudogene functionality is the unexpected observation that processed pseudogenes are commonly expressed. Processed pseudogenes were presumed to have been rendered transcriptionally silent by the loss of cis-regulatory elements during the retrotransposition of mature mRNAs. It was not until 2006 that this assumption was first tested by an in silico interrogation of human ESTs, which demonstrated that thousands of retrotransposed gene copies are transcribed and are often spliced into known protein-coding transcripts8,24,25. Similarly, up to 10,000 mouse pseudogenes have evidence of transcription, the majority of which are retrocopies24. Although these studies established that pseudogenes are actively transcribed, their cell-type specificity and dynamic expression remain largely unresolved. For mRNAs and lncRNAs, expression patterns were determined by DNA microarray hybridization24,85–88. However, this approach is unsuitable for the detection of pseudogene expression. DNA oligonucleotides are typically unable to differentiate a pseudogene from its parental gene (or other pseudogene copies) due to sequence similarity24,89, precluding unambiguous identification of their transcript levels (Fig. 3a). The expression dynamics and specificity of pseudogenes remain largely unknown.

Whereas the advent of massively parallel RNA sequencing (RNA-seq)90,91 revolutionized the identification of lncRNA transcripts84, the short read lengths implicit in most sequencing technologies limit their application in charting the pseudogene transcriptome. Each short read usually contains insufficient sequence difference between the parental gene and the pseudogene to unambiguously map reads if there are limited differences between the gene and pseudogene sequences (Fig. 3b). Although specialized pipelines have been developed to consider only reads from divergent regions of pseudogenes92, such approaches are restricted to evolutionarily ancient pseudogenes and do not usually form part of standard analyses. As such, typically only RNA-seq reads that overlap reference annotations (which may exclude pseudogenes) are counted. This precludes detection of pseudogene transcripts and may impact the observed levels of other transcripts; that is, pseudogene RNAs may be erroneously assigned to their parent genes.

Long-read transcriptomics may be a solution to accurately characterize pseudogene transcription. Real-time sequencing technologies enable the sequencing of full-length cDNAs (Nanopore and Pacific Biosciences (PacBio) single-molecule real-time sequencing)93,94 or RNAs (Nanopore)95 in single reads. Despite lower per-base accuracy, long read lengths should contain sufficient nucleotide differences to unambiguously detect pseudogene transcription (Fig. 3c). Long-read transcriptomics could enable accurate quantification of pseudogene transcripts and elucidation of pseudogene transcript structures. However, current limited throughputs of Nanopore and, particularly, PacBio sequencing may restrict the quantification of lowly expressed pseudogenes.

Experimental perturbation of pseudogenes. The use of assays ill-suited to analysis of pseudogenes has arguably stymied elucidation of their biological roles. RNA depletion and translational inhibition by the hybridization of short antisense sequences (for example, siRNAs or morpholino oligonucleotides) are routine and scalable approaches for loss-of-function experiments. However, due to the high sequence similarity of pseudogenes to their parent genes, cross-hybridization precludes the confident mapping of functions to genes or pseudogenes (Fig. 3d). Although antisense oligonucleotides targeting splice sites may be able to specifically inhibit parental genes without targeting their processed pseudogenes (which do not contain the exon–intron junctions), targeting pseudogenes remains challenging.

CRISPR-based innovations represent a revolution in our ability to

Fig. 3 | Challenges and solutions to understanding pseudogene functions. a | Microarray analysis cannot distinguish between parent gene and pseudogene expression due to the co-hybridization of the two similar transcripts to the same oligonucleotide probes. b | Short-read cDNA sequencing is unable to confidently distinguish many pseudogene RNAs from their parent mRNAs due to insufficient nucleotide differences per read. c | Long-read cDNA sequencing allows accurate quantification of pseudogene RNAs due to a higher number of specific differences per read. d | RNA interference is poorly suited to analysis of pseudogenes due to off-target hybridization of small interfering RNAs (siRNAs) to the parent gene. RISC, RNA-induced silencing complex.
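The contrast between panels b and c of Fig. 3 can be made concrete with a toy calculation. The following Python sketch is illustrative only and is not taken from any published pipeline: it assumes that the parent and pseudogene transcripts are already aligned, are of equal length and differ only by substitutions, and the function names (`mismatch_positions`, `fraction_assignable`) and the `min_differences` threshold are hypothetical. It reports the fraction of possible read start positions at which a read of a given length overlaps enough diagnostic differences to be assigned unambiguously.

```python
from typing import List


def mismatch_positions(parent: str, pseudogene: str) -> List[int]:
    """Return 0-based positions at which two pre-aligned sequences differ."""
    if len(parent) != len(pseudogene):
        raise ValueError("sequences must be pre-aligned and of equal length")
    return [i for i, (a, b) in enumerate(zip(parent, pseudogene)) if a != b]


def fraction_assignable(parent: str, pseudogene: str, read_length: int,
                        min_differences: int = 1) -> float:
    """Fraction of read start positions whose read covers at least
    `min_differences` diagnostic positions (hypothetical criterion)."""
    diffs = mismatch_positions(parent, pseudogene)
    n_starts = len(parent) - read_length + 1
    if read_length <= 0 or n_starts <= 0:
        raise ValueError("read_length must be between 1 and the sequence length")
    assignable = 0
    for start in range(n_starts):
        # Count diagnostic positions that fall inside this read window.
        covered = sum(1 for p in diffs if start <= p < start + read_length)
        if covered >= min_differences:
            assignable += 1
    return assignable / n_starts


# Toy transcripts (assumed data): 100 nt, two substitutions at positions 10 and 90.
parent = "A" * 100
pseudogene = parent[:10] + "C" + parent[11:90] + "C" + parent[91:]

short_fraction = fraction_assignable(parent, pseudogene, read_length=20)
long_fraction = fraction_assignable(parent, pseudogene, read_length=100)
print(f"20-nt reads assignable: {short_fraction:.2f}")
print(f"100-nt reads assignable: {long_fraction:.2f}")
```

In this toy case, only about a quarter of 20-nucleotide reads overlap a diagnostic position, whereas a full-length read overlaps both. Real data would additionally require handling of indels, sequencing errors and multiple pseudogene copies, none of which is modelled here.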


Fig. 4 | CRISPR–Cas9-based approaches to understanding pseudogene functions. a | Small insertions or deletions (indels) can be induced in putatively translated pseudogenes by targeting sequences that are divergent in the parent gene. b | Indels can be induced in putatively translated pseudogenes by targeting sequences that are interrupted by introns in the parent gene. c | CRISPR–Cas9 genome engineering allows deletion of pseudogenes by targeting unique flanking sequences. d | If pseudogenes have novel 5′ exons, transcriptional terminators can be introduced by homologous recombination to deplete pseudogene transcripts. e | CRISPR-based transcriptional interference (CRISPRi) enables depletion of pseudogene transcription by targeting a dCas9–KRAB fusion protein to unique sequences upstream of the transcriptional start site. f | CRISPR-based transcriptional activation (CRISPRa) enables activation of pseudogene transcription by targeting a dCas9–VP16 fusion protein to unique sequences upstream of the transcription start site. dCas9, catalytically inactive Cas9; gRNA, guide RNA; PAM, protospacer-adjacent motif; poly(A), polyadenylation.
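The targeting logic of Fig. 4a can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a validated guide-design tool: it assumes Streptococcus pyogenes Cas9 with an NGG PAM and a 20-nucleotide spacer, searches only the forward strand, and takes as input a pseudogene sequence and a parent sequence that have been pre-aligned without gaps. The function name `candidate_guides`, the 10-nucleotide PAM-proximal window and the threshold of two mismatches are hypothetical choices made for illustration.

```python
import re
from typing import Dict, List


def candidate_guides(pseudogene: str, parent: str, seed_length: int = 10,
                     min_seed_mismatches: int = 2) -> List[Dict[str, object]]:
    """List forward-strand 20-nt spacers with an NGG PAM in the pseudogene that
    diverge from the parent near the PAM, or whose PAM is absent in the parent.

    Assumes the two sequences are aligned position by position (no gaps).
    """
    pseudogene, parent = pseudogene.upper(), parent.upper()
    if len(pseudogene) != len(parent):
        raise ValueError("sequences must be pre-aligned and of equal length")
    guides = []
    # A lookahead is used so that overlapping candidate sites are all reported.
    for match in re.finditer(r"(?=([ACGT]{20})([ACGT]GG))", pseudogene):
        start = match.start()
        spacer = match.group(1)
        parent_spacer = parent[start:start + 20]
        # The PAM-proximal region is the 3' end of the spacer.
        seed_mismatches = sum(
            a != b
            for a, b in zip(spacer[-seed_length:], parent_spacer[-seed_length:])
        )
        parent_has_pam = parent[start + 21:start + 23] == "GG"
        if seed_mismatches >= min_seed_mismatches or not parent_has_pam:
            guides.append({
                "start": start,
                "spacer": spacer,
                "pam": match.group(2),
                "seed_mismatches": seed_mismatches,
                "parent_has_pam": parent_has_pam,
            })
    return guides


# Toy sequences (assumed data): the parent differs at two PAM-proximal positions.
toy_pseudogene = "ACGTACGTACGTACGTACGT" + "TGG" + "AAAA"
toy_parent = "ACGTACGTACGTACGTAAAT" + "TGG" + "AAAA"
for guide in candidate_guides(toy_pseudogene, toy_parent):
    print(guide)
```

Any guide shortlisted in this way would still require genome-wide off-target assessment and experimental confirmation that the parent gene remains unedited.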

systematically determine the functions of pseudogenes. CRISPR enables the relatively straightforward introduction of mutations or deletions into eukaryotic genomes96–98. For putatively translated pseudogenes, targeting sequences that are divergent from the parent gene may be a feasible approach to introducing frameshifting indels (Fig. 4a). Despite the short spacer sequence length of CRISPR–Cas9 guides, cleavage is largely intolerant of mismatches proximal to the protospacer-adjacent motif (PAM) site99, thus allowing specific targeting of pseudogenes. Off-target editing of the parent gene due to partial spacer sequence matches100 may be further alleviated using dual-targeting nucleases or engineered guide RNA with improved specificity101–103. In either case, validating that no mutations are induced in the corresponding parent gene is crucial. An alternative strategy to ensure specificity in targeting a processed pseudogene is to design guide RNAs that span exon–exon junctions in the parent gene (Fig. 4b). As the two parts of the target site are together in the processed pseudogene, the guide RNA will be able to bind, whereas in the parent gene, these sequences will be interrupted by introns.

Although introducing small indels by targeting a single target site will be efficacious for dissecting the functions of translated pseudogenes, they are unlikely to disrupt pseudogenes that act through DNA-based or RNA-based mechanisms. By targeting unique flanking genomic sequences with pairs of guide RNAs, pseudogenes can be specifically deleted without risk of mutating their parental genes (Fig. 4c). This strategy holds the most promise for deleting processed pseudogenes (as the flanking regions are derived from the insertion site rather than from the parent gene copy), but unprocessed pseudogenes may be confounded by the location of the pseudogene in larger segmental duplications (resulting in non-unique flanking sequences). When performing such deletions, it is important to avoid disrupting any nearby regulatory elements that may confuse results104,105. For pseudogenes that have unique 5′ exons, insertion of polyadenylation signals will allow depletion of pseudogene transcripts (Fig. 4d). An alternative strategy to deplete processed pseudogene transcription is targeting catalytically inactive Cas9 (dCas9) fused to transcriptional inhibitors (for example, KRAB or KRAB–MECP2) to the unique promoter regions immediately upstream of pseudogene promoters106,107 (Fig. 4e). Complementarily, processed pseudogenes can be endogenously overexpressed by targeting dCas9 fused to the VP16 transcriptional activator to the promoter regions upstream of the pseudogene transcriptional start site108,109 (Fig. 4f). However, using dCas9 fusions to modulate pseudogene expression will have limited


applicability to unprocessed pseudogenes due to cross-hybridization at the parent genes’ regulatory elements. Furthermore, the impact of altered pseudogene expression may only be apparent in certain circumstances. Some may act redundantly with their parent genes, necessitating ablation of both gene and pseudogene before a biological impact becomes apparent. Thus, CRISPR-based approaches, carefully applied, have the potential to revolutionize our ability to dissect the functions of pseudogenes.

When performing pseudogene perturbation studies, it is important to confirm that the intended perturbation has been achieved, which can be challenging. Measuring changes in pseudogene transcripts using quantitative real-time PCR is not straightforward due to cross-amplification of parent genes45. Careful primer designs targeting diagnostic sequence differences and extensive validation are necessary to discriminate pseudogene and parent gene RNAs110. Antibodies may be similarly unable to distinguish peptides produced from pseudogenes from their parent proteins, unless they vary considerably in size or sequence. Although changes in RNA and protein level due to pseudogene manipulation could be detected unambiguously by tandem mass spectrometry or long-read RNA-seq, these approaches are currently expensive and overpowered for routine functional experiments. Rather, as is common in the study of pseudogenes, careful experimental design must be used to monitor changes in pseudogene expression due to experimental manipulations.

Alteration of phenotypes due to the ablation of pseudogenes may be subtle, context specific, evident only at an organismal level or manifest only on co-depletion with the parent gene. Therefore, a lack of overt functional consequence of pseudogene perturbation in one experimental setting does not discount that the pseudogene may still have notable functions in other contexts. The identification of functions for pseudogenes has been enabled by technological progress, but so too has elucidation of their mechanisms. Unbiased mapping of RNA–chromatin interactions at the single RNA111–113 or transcriptome114–116 level, global identification of RNA–RNA hybrids117 and maps of chromatin contacts118 will allow the dissection of the mechanisms of action of pseudogenes en masse. Methodological innovations in functional genomics will continue to lower the barriers to understanding pseudogenic regions of the genome.

The importance of terminology
Theoretical terms are central to the development of scientific theories. Unlike physics, which is underpinned by robust foundational axioms that form the basis of new scientific theories, biology lacks a formal framework through which new theoretical terms are defined and adopted in scientific use. Therefore, there is often a lack of consensus about the meaning and application of new terms. A consequence of this is that terms in biology can rapidly become apparently axiomatized, despite not having undergone any formal process to obtain community consensus. This leaves biology vulnerable to developing theoretical constructs on uncertain foundations or obscuring the pursuit of more productive avenues of research.

In addition to the untested hypothesis that evolution has left us with a dichotomy between genes and pseudogenes, the term pseudogene itself asserts a paradigm of non-functionality through its taxonomic construction. Pseudogenes are defined as defective and not genes. This point is highlighted because impartial language in science is known to inherently restrict the neutral investigation between conflicting paradigms119. In the case of pseudogenes, the term itself is constructed to support the dominant paradigm and therefore limit, consciously or unconsciously, scientific objectivity in their investigation.

One of the most well-recognized instances of poorly rationalized axiomatization impacting scientific advance was the introduction of the term ‘prokaryote’ in 1962 (ref.120) to define all non-nucleated life forms as a distinct phylum. Up to this time, microbiologists had struggled to realize an overarching phylogenetic framework to order bacterial life and, accordingly, the convenient assertion that bacteria were a distinct monolithic phylum went by unchallenged, despite a lack of any material evidence that this was the case121. The grouping of bacteria in a single phylum crystallized the unsubstantiated notion that all bacteria arose from a common ancestor. Despite the available technical capability to demonstrate that non-nucleated life was two distinct phyla, it was not until the early 1990s that life on Earth was largely recognized as occurring in three evolutionary domains122. The ‘prokaryote hypothesis’ persisted for at least 30 years and left a generation of microbiologists operating under a false assumption that obscured the understanding of microbial evolution for decades123. In this context, language (and, specifically, the premature specification of a term to provide convenience to a complex and poorly understood domain) led to a profound lack of scientific inquisition in a fundamental domain of biology121.

The blanket assignment of the pseudogene term to a heterogeneous class of elements is also reminiscent of the indiscriminate lncRNA label, which is applied to the tens of thousands of non-protein-coding transcripts longer than ~200 nucleotides originally identified in mammalian transcriptomes75,124. Despite this group of transcripts being vastly heterogeneous in both form and function, lncRNAs are often treated as a cogent class of transcripts in the manner of microRNAs, ribosomal RNAs or mRNAs. This overgeneralization in terminology can confound objective scientific inquiry where functions or characteristics of one lncRNA are spuriously extended to others. This is exemplified by the aforementioned ceRNA hypothesis, which proposed a unifying theory for a role for lncRNAs in a large-scale regulatory network transacted by an RNA language encoded in microRNA response elements47. The theory rested on the assumption that the capacity of a few lncRNAs to act as sponges for microRNAs could be generalized to all lncRNAs, a notion unlikely to be feasible due to the diversity of characteristics inherent in this grouping48,49. By over-hastily classifying pseudogenes as a homogeneous class, we similarly risk trivializing the complexity and diversity of potential functions and conduct unnuanced approaches to their characterization.

The relative infancy of genome biology leaves the domain vulnerable to the adoption of terms that become axiomatic for the sake of simplicity and convenience, rather than through scientific grounding. In the case of pseudogenes, the original use of the term rapidly proliferated to encompass a heterogeneous class of genomic elements. Despite early attempts to provide a systematic and scientifically grounded nomenclature to these genomic elements125, the term ‘pseudogene’ became widely used to describe a potentially defective copy of a gene, and pseudogene identification became a routine process in the annotation of genomes.

As the absence of function or biological impact cannot be determined by bioinformatic pipelines, the automated classification of gene-like sequences as pseudogenes should be avoided. Instead, we propose that descriptive terms that do not make functional inferences should


be used in reference to genomic elements that arose from gene duplication and retrotransposition. We suggest that the previously coined term ‘retrocopy’10,25 should be adopted for the annotation of retrotransposed elements, and that ‘gene copy’ or ‘paralogue’ should be used for gene duplications with appropriate qualification where they are untranscribed and/or untranslated. In cases where such elements are transcribed, but do not have discernible open reading frames, they could simply be referred to as lncRNA with some reference to the ancestral protein-coding gene. Regardless of the nuances of the nomenclature, the overarching principle is that terminology should not impose any unsubstantiated assumption on end users.

Concluding remarks
Gene duplication and retrotransposition are crucial processes that underlie organismal robustness and the evolution of new biological functions and characteristics. These processes sit alongside other evolutionary mechanisms, such as meiosis, that create the organismal diversity upon which natural selection acts. Like other evolutionary processes, gene duplication and retrotransposition leave behind observable ancestral signatures that provide insight into an organism’s history. In the fundamental reductionist approach often assumed in genetics and molecular biology, the perspective is often lost that life as we observe it today is not only the product of billions of years of evolutionary processes but also still subject to these same processes. Although the pseudogene concept arose to describe an individual molecular phenomenon, the term was rapidly adopted to annotate tens of thousands of genomic regions that met only loosely defined criteria and was effectively axiomatized without being subject to any rigorous scientific debate. This lack of consensus-seeking process has left genome biology with a legacy concept that obscures objective investigation of genome function.

The purpose of this article is not to discard the pseudogene concept or to suggest that all pseudogenes are functional. The majority of currently annotated pseudogenes are neither robustly transcribed nor translated. Such regions fit well the original descriptions of pseudogenes as ‘similar, but defective’. Rather, we argue that their labelling as pseudogenes is not constructive for advancement of understanding of genome function and misdirects experimental design. The pseudogene term is often used without any qualification. Despite the term being quite absolute in its meaning (that is, a defective copy of another gene), computational pipelines that annotate pseudogenes have not made any assessment of function. Rather than the inherent limitations of computational annotations being emphasized, computational annotations have become increasingly written into lore through their perpetuation across databases and reference genome annotations that, in turn, inform the development of algorithms and software used for both research and clinical purposes.

The use of a liberal definition of pseudogenes is attractive as it simplifies genomic analyses. This approach, often unknowingly to the researcher, leads to the consolidation of the pseudogene classification, that is, their exclusion by convenience in functional studies. Many regions now considered to be ‘dead genes’ potentially encode cis-regulatory elements, non-coding RNAs and proteins with impacts in human biology and health. Accordingly, determining the functions of putative pseudogenes warrants active pursuit by their inclusion in functional screens and analyses of genomic, transcriptomic and proteomic data. With innovations in long-read sequencing and CRISPR-based methodologies now readily accessible, the technological limitations that formerly motivated the exclusion from functional investigation are largely resolved.

The dominant limitation in advancing the investigation of pseudogenes now lies in the trappings of the prevailing mindset that pseudogenic regions are intrinsically non-functional. Notwithstanding the introduction of a new theoretical framework to consider the evolutionary context and functional role of gene duplication and retrotransposition, the assumptions underlying pseudogene annotations should be carefully questioned in genomic investigations. We propose that the pseudogene term should be used sparingly, preferably only in the context where such a genomic region demonstrably lacks biological activity, and instead the more objective terminology of retrocopy and gene copy (or paralogue in cases where the gene is transcribed and translated) should be adopted. With renewed scientific objectivity, we anticipate that a wealth of discoveries to understand genome function, its role in disease and the development of new treatments is within reach.

Seth W. Cheetham 1*, Geoffrey J. Faulkner 1,2 and Marcel E. Dinger 3*
1 Mater Research Institute, University of Queensland, TRI Building, Woolloongabba, QLD, Australia.
2 Queensland Brain Institute, University of Queensland, Brisbane, QLD, Australia.
3 School of Biotechnology and Biomolecular Sciences, University of New South Wales, Sydney, NSW, Australia.
*e-mail: [email protected]; [email protected]
https://doi.org/10.1038/s41576-019-0196-1
Published online xx xx xxxx

1. Jacq, C., Miller, J. R. & Brownlee, G. G. A pseudogene structure in 5S DNA of Xenopus laevis. Cell 12, 109–120 (1977).
2. Vierna, J., Wehner, S., Höner zu Siederdissen, C., Martínez-Lage, A. & Marz, M. Systematic analysis and evolution of 5S ribosomal DNA in metazoans. Heredity 111, 410–421 (2013).
3. Vanin, E. F. Processed pseudogenes: characteristics and evolution. Annu. Rev. Genet. 19, 253–272 (1985).
4. Esnault, C., Maestre, J. & Heidmann, T. Human LINE retrotransposons generate processed pseudogenes. Nat. Genet. 24, 363–367 (2000).
5. Pei, B. et al. The GENCODE pseudogene resource. Genome Biol. 13, R51 (2012).
6. Frankish, A. et al. GENCODE reference annotation for the human and mouse genomes. Nucleic Acids Res. 47, D766–D773 (2019).
7. Zhang, Z., Harrison, P. M., Liu, Y. & Gerstein, M. Millions of years of evolution preserved: a comprehensive catalog of the processed pseudogenes in the human genome. Genome Res. 13, 2541–2558 (2003).
8. Baertsch, R., Diekhans, M., Kent, W. J., Haussler, D. & Brosius, J. Retrocopy contributions to the evolution of the human genome. BMC Genomics 9, 466 (2008).
9. Navarro, F. C. P. & Galante, P. A. F. RCPedia: a database of retrocopied genes. Bioinformatics 29, 1235–1237 (2013).
10. Kaessmann, H., Vinckenbosch, N. & Long, M. RNA-based gene duplication: mechanistic and evolutionary insights. Nat. Rev. Genet. 10, 19–31 (2009).
11. Kaessmann, H. Origins, evolution, and phenotypic impact of new genes. Genome Res. 20, 1313–1326 (2010).
12. Ewing, A. D. et al. Retrotransposition of gene transcripts leads to structural variation in mammalian genomes. Genome Biol. 14, R22 (2013).
13. Richardson, S. R., Salvador-Palomeque, C. & Faulkner, G. J. Diversity through duplication: whole-genome sequencing reveals novel gene retrocopies in the human population. Bioessays 36, 475–481 (2014).
14. Abyzov, A. et al. Analysis of variable retroduplications in human populations suggests coupling of retrotransposition to cell division. Genome Res. 23, 2042–2052 (2013).
15. Schrider, D. R. et al. Gene copy-number polymorphism caused by retrotransposition in humans. PLOS Genet. 9, e1003242 (2013).
16. Sisu, C. et al. Comparative analysis of pseudogenes across three phyla. Proc. Natl Acad. Sci. USA 111, 13361–13366 (2014).
17. Wang, T. et al. Identification and characterization of essential genes in the human genome. Science 350, 1096–1101 (2015).
18. Tsherniak, A. et al. Defining a cancer dependency map. Cell 170, 564–576.e16 (2017).
19. Ghandi, M. et al. Next-generation characterization of the Cancer Cell Line Encyclopedia. Nature 569, 503–508 (2019).
20. Zhang, Z. & Gerstein, M. Large-scale analysis of pseudogenes in the human genome. Curr. Opin. Genet. Dev. 14, 328–335 (2004).
21. van Baren, M. J. & Brent, M. R. Iterative gene prediction and pseudogene removal improves genome annotation. Genome Res. 16, 678–685 (2006).
22. Torrents, D., Suyama, M., Zdobnov, E. & Bork, P. A genome-wide survey of human pseudogenes. Genome Res. 13, 2559–2567 (2003).
23. Zhang, Z. et al. PseudoPipe: an automated pseudogene identification pipeline. Bioinformatics 22, 1437–1439 (2006).


24. Frith, M. C. et al. Pseudo-messenger RNA: phantoms of the transcriptome. PLOS Genet. 2, e23 (2006).
25. Vinckenbosch, N., Dupanloup, I. & Kaessmann, H. Evolutionary fate of retroposed gene copies in the human genome. Proc. Natl Acad. Sci. USA 103, 3220–3225 (2006).
26. Jorquera, R. et al. SinEx DB: a database for single exon coding sequences in mammalian genomes. Database (Oxford) 2016, baw095 (2016).
27. Hurst, L. D. The Ka/Ks ratio: diagnosing the form of sequence evolution. Trends Genet. 18, 486 (2002).
28. Chiang, J. J. et al. Viral unmasking of cellular 5S rRNA pseudogene transcripts induces RIG-I-mediated immunity. Nat. Immunol. 19, 53–62 (2018).
29. Pink, R. C. et al. Pseudogenes: pseudo-functional or key regulators in health and disease? RNA 17, 792–798 (2011).
30. Pink, R. C. & Carter, D. R. F. Pseudogenes as regulators of biological function. Essays Biochem. 54, 103–112 (2013).
31. Kovalenko, T. F. & Patrushev, L. I. Pseudogenes as functionally significant elements of the genome. Biochemistry 83, 1332–1349 (2018).
32. McCarrey, J. R. & Thomas, K. Human testis-specific PGK gene lacks introns and possesses characteristics of a processed gene. Nature 326, 501–505 (1987).
33. McCarrey, J. R. Nucleotide sequence of the promoter region of a tissue-specific human retroposon: comparison with its housekeeping progenitor. Gene 61, 291–298 (1987).
34. Sayah, D. M., Sokolskaja, E., Berthoux, L. & Luban, J. Cyclophilin A retrotransposition into TRIM5 explains owl monkey resistance to HIV-1. Nature 430, 569–573 (2004).
35. Burki, F. & Kaessmann, H. Birth and adaptive evolution of a hominoid gene that supports high neurotransmitter flux. Nat. Genet. 36, 1061–1063 (2004).
36. Hayashi, H. et al. The OCT4 pseudogene POU5F1B is amplified and promotes an aggressive phenotype in gastric cancer. Oncogene 34, 199–208 (2015).
37. Suzuki, I. K. et al. Human-specific NOTCH2NL genes expand cortical neurogenesis through Delta/Notch regulation. Cell 173, 1370–1384.e16 (2018).
38. Fiddes, I. T. et al. Human-specific NOTCH2NL genes affect notch signaling and cortical neurogenesis. Cell 173, 1356–1369.e22 (2018).
39. Dennis, M. Y. et al. Evolution of human-specific neural SRGAP2 genes by incomplete segmental duplication. Cell 149, 912–922 (2012).
40. Charrier, C. et al. Inhibition of SRGAP2 function by its human-specific paralogs induces neoteny during spine maturation. Cell 149, 923–935 (2012).
41. Korneev, S. A., Park, J. H. & O’Shea, M. Neuronal expression of neural nitric oxide synthase (nNOS) protein is suppressed by an antisense RNA transcribed from an NOS pseudogene. J. Neurosci. 19, 7711–7720 (1999).
42. Tam, O. H. et al. Pseudogene-derived small interfering RNAs regulate gene expression in mouse oocytes. Nature 453, 534–538 (2008).
43. Watanabe, T. et al. Endogenous siRNAs from naturally formed dsRNAs regulate transcripts in mouse oocytes. Nature 453, 539–543 (2008).
44. Rapicavoli, N. A. et al. A mammalian pseudogene lncRNA at the interface of inflammation and anti-inflammatory therapeutics. eLife 2, e00762 (2013).
45. Poliseno, L. et al. A coding-independent function of gene and pseudogene mRNAs regulates tumour biology. Nature 465, 1033–1038 (2010).
46. Karreth, F. A. et al. The BRAF pseudogene functions as a competitive endogenous RNA and induces lymphoma in vivo. Cell 161, 319–332 (2015).
47. Salmena, L., Poliseno, L., Tay, Y., Kats, L. & Pandolfi, P. P. A ceRNA hypothesis: the Rosetta Stone of a hidden RNA language? Cell 146, 353–358 (2011).
48. Denzler, R., Agarwal, V., Stefano, J., Bartel, D. P. & Stoffel, M. Assessing the ceRNA hypothesis with quantitative measurements of miRNA and target abundance. Mol. Cell 54, 766–776 (2014).
49. Thomson, D. W. & Dinger, M. E. Endogenous microRNA sponges: evidence and controversy. Nat. Rev. Genet. 17, 272–283 (2016).
50. Huang, P. et al. Comparative analysis of

52. Lai, J. et al. A variant of the KLK4 gene is expressed as a cis sense–antisense chimeric transcript in prostate cancer cells. RNA 16, 1156–1166 (2010).
53. Chakravarthi, B. V. et al. Pseudogene associated recurrent gene fusion in prostate cancer. Neoplasia 21, 989–1002 (2019).
54. Bischof, J. M. et al. Genome-wide identification of pseudogenes capable of disease-causing gene conversion. Hum. Mutat. 27, 545–552 (2006).
55. Rygiel, A. M. et al. Gene conversion between cationic trypsinogen (PRSS1) and the pseudogene trypsinogen 6 (PRSS3P2) in patients with chronic pancreatitis. Hum. Mutat. 36, 350–356 (2015).
56. Concolino, P. & Costella, A. Congenital adrenal hyperplasia (CAH) due to 21-hydroxylase deficiency: a comprehensive focus on 233 pathogenic variants of CYP21A2 gene. Mol. Diagn. Ther. 22, 261–280 (2018).
57. Watnick, T., Gandolph, M. A., Weber, H., Neumann, H. P. & Germino, G. G. Gene conversion is a likely cause of mutation in PKD1. Hum. Mol. Genet. 7, 1239–1243 (1998).
58. Vanita et al. A unique form of autosomal dominant cataract explained by gene conversion between β-crystallin B2 and its pseudogene. J. Med. Genet. 38, 392–396 (2001).
59. Habib, A. M. et al. Microdeletion in a FAAH pseudogene identified in a patient with high anandamide concentrations and pain insensitivity. Br. J. Anaesth. 123, e249–e253 (2019).
60. Ali, H. et al. PKD1 duplicated regions limit clinical utility of whole exome sequencing for genetic diagnosis of autosomal dominant polycystic kidney disease. Sci. Rep. 9, 4141 (2019).
61. Gallagher, M. D. & Chen-Plotkin, A. S. The post-GWAS era: from association to function. Am. J. Hum. Genet. 102, 717–730 (2018).
62. Bartonicek, N. et al. Intergenic disease-associated regions are abundant in novel transcripts. Genome Biol. 18, 241 (2017).
63. GTEx Consortium et al. Genetic effects on gene expression across human tissues. Nature 550, 204–213 (2017).
64. Marques, A. C., Dupanloup, I., Vinckenbosch, N., Reymond, A. & Kaessmann, H. Emergence of young human genes after a burst of retroposition in primates. PLOS Biol. 3, e357 (2005).
65. Kabza, M., Ciomborowska, J. & Makałowska, I. RetrogeneDB—a database of animal retrogenes. Mol. Biol. Evol. 31, 1646–1648 (2014).
66. van Heesch, S. et al. The translational landscape of the human heart. Cell 178, 242–260.e29 (2019).
67. Kim, M.-S. et al. A draft map of the human proteome. Nature 509, 575–581 (2014).
68. Ji, Z., Song, R., Regev, A. & Struhl, K. Many lncRNAs, 5′UTRs, and pseudogenes are translated and some are likely to express functional proteins. eLife 4, e08890 (2015).
69. Brosch, M. et al. Shotgun proteomics aids discovery of novel protein-coding genes, alternative splicing, and ‘resurrected’ pseudogenes in the mouse genome. Genome Res. 21, 756–767 (2011).
70. Doolittle, W. F. We simply cannot go on being so vague about ‘function’. Genome Biol. 19, 223 (2018).
71. Kafri, R., Springer, M. & Pilpel, Y. Genetic redundancy: new tricks for old genes. Cell 136, 389–392 (2009).
72. Duret, L., Chureau, C., Samain, S., Weissenbach, J. & Avner, P. The Xist RNA gene evolved in eutherians by pseudogenization of a protein-coding gene. Science 312, 1653–1655 (2006).
73. Hezroni, H. et al. A subset of conserved mammalian long non-coding RNAs are fossils of ancestral protein-coding genes. Genome Biol. 18, 162 (2017).
74. Liu, W.-H., Tsai, Z. T.-Y. & Tsai, H.-K. Comparative genomic analyses highlight the contribution of pseudogenized protein-coding genes to human lincRNAs. BMC Genomics 18, 786 (2017).
75. Carninci, P. et al. The transcriptional landscape of the mammalian genome. Science 309, 1559–1563 (2005).
76. Mattick, J. S. Challenging the dogma: the hidden layer of non-protein-coding RNAs in complex organisms. Bioessays 25, 930–939 (2003).
77. Gloss, B. S. & Dinger, M. E. The specificity of long noncoding RNA expression. Biochim. Biophys. Acta

of long noncoding RNAs in the mouse brain. Proc. Natl Acad. Sci. USA 105, 716–721 (2008).
80. Dinger, M. E. et al. Long noncoding RNAs in mouse embryonic stem cell pluripotency and differentiation. Genome Res. 18, 1433–1445 (2008).
81. Martianov, I., Ramadass, A., Serra Barros, A., Chow, N. & Akoulitchev, A. Repression of the human dihydrofolate reductase gene by a non-coding interfering transcript. Nature 445, 666–670 (2007).
82. Rinn, J. L. et al. Functional demarcation of active and silent chromatin domains in human HOX loci by noncoding RNAs. Cell 129, 1311–1323 (2007).
83. Morris, K. V. & Mattick, J. S. The rise of regulatory RNA. Nat. Rev. Genet. 15, 423–437 (2014).
84. Cabili, M. N. et al. Integrative annotation of human large intergenic noncoding RNAs reveals global properties and specific subclasses. Genes Dev. 25, 1915–1927 (2011).
85. Pang, K. C. et al. Genome-wide identification of long noncoding RNAs in CD8+ T cells. J. Immunol. 182, 7738–7748 (2009).
86. Sunwoo, H. et al. MEN ε/β nuclear-retained non-coding RNAs are up-regulated upon muscle differentiation and are essential components of paraspeckles. Genome Res. 19, 347–359 (2009).
87. Mercer, T. R. et al. Long noncoding RNAs in neuronal–glial fate specification and oligodendrocyte lineage maturation. BMC Neurosci. 11, 14 (2010).
88. Lockhart, D. J. et al. Expression monitoring by hybridization to high-density oligonucleotide arrays. Nat. Biotechnol. 14, 1675–1680 (1996).
89. Millson, A. et al. Processed pseudogene confounding deletion/duplication assays for SMAD4. J. Mol. Diagn. 17, 576–582 (2015).
90. Cloonan, N. et al. Stem cell transcriptome profiling via massive-scale mRNA sequencing. Nat. Methods 5, 613–619 (2008).
91. Mortazavi, A., Williams, B. A., McCue, K., Schaeffer, L. & Wold, B. Mapping and quantifying mammalian transcriptomes by RNA-seq. Nat. Methods 5, 621–628 (2008).
92. Kalyana-Sundaram, S. et al. Expressed pseudogenes in the transcriptional landscape of human cancers. Cell 149, 1622–1634 (2012).
93. Oikonomopoulos, S., Wang, Y. C., Djambazian, H., Badescu, D. & Ragoussis, J. Benchmarking of the Oxford Nanopore MinION sequencing for quantitative and qualitative assessment of cDNA populations. Sci. Rep. 6, 31602 (2016).
94. Au, K. F. et al. Characterization of the human ESC transcriptome by hybrid sequencing. Proc. Natl Acad. Sci. USA 110, E4821–E4830 (2013).
95. Garalde, D. R. et al. Highly parallel direct RNA sequencing on an array of nanopores. Nat. Methods 15, 201–206 (2018).
96. Jinek, M. et al. RNA-programmed genome editing in human cells. eLife 2, e00471 (2013).
97. Mali, P. et al. RNA-guided human genome engineering via Cas9. Science 339, 823–826 (2013).
98. Cong, L. et al. Multiplex genome engineering using CRISPR/Cas systems. Science 339, 819–823 (2013).
99. Anderson, E. M. et al. Systematic analysis of CRISPR–Cas9 mismatch tolerance reveals low levels of off-target activity. J. Biotechnol. 211, 56–65 (2015).
100. Zhang, X.-H., Tee, L. Y., Wang, X.-G., Huang, Q.-S. & Yang, S.-H. Off-target effects in CRISPR/Cas9-mediated genome engineering. Mol. Ther. Nucleic Acids 4, e264 (2015).
101. Kim, D. et al. Genome-wide analysis reveals specificities of Cpf1 endonucleases in human cells. Nat. Biotechnol. 34, 863–868 (2016).
102. Kleinstiver, B. P. et al. Genome-wide specificities of CRISPR–Cas Cpf1 nucleases in human cells. Nat. Biotechnol. 34, 869–874 (2016).
103. Kocak, D. D. et al. Increasing the specificity of CRISPR systems with engineered RNA secondary structures. Nat. Biotechnol. 37, 657–666 (2019).
104. Groff, A. F. et al. In vivo characterization of Linc-p21 reveals functional cis-regulatory DNA elements. Cell Rep. 16, 2178–2186 (2016).
105. Bassett, A. R. et al. Considerations when investigating lncRNA function in vivo. eLife 3, e03058 (2014).
106. Qi, L. S. et al. Repurposing CRISPR as an RNA-guided platform for sequence-specific control of gene expression. Cell 152, 1173–1183 (2013).
three-dimensional chromosomal architecture identifies 1859, 16–22 (2016). 107. Yeo, N. C. et al. An enhanced CRISPR repressor for
a novel fetal hemoglobin regulatory element. Genes 78. Clark, M. B. et al. Quantitative gene profiling of long targeted mammalian gene regulation. Nat. Methods
Dev. 31, 1704–1713 (2017). noncoding RNAs with targeted RNA sequencing. 15, 611–616 (2018).
51. Vergés, L. et al. An exploratory study of predisposing Nat. Methods 12, 339–342 (2015). 108. Gilbert, L. A. et al. Genome-scale CRISPR-mediated
genetic factors for DiGeorge/velocardiofacial 79. Mercer, T. R., Dinger, M. E., Sunkin, S. M., control of gene repression and activation. Cell 159,
syndrome. Sci. Rep. 7, 40031 (2017). Mehler, M. F. & Mattick, J. S. Specific expression 647–661 (2014).


109. Cheng, A. W. et al. Multiplexed activation of endogenous genes by CRISPR-on, an RNA-guided transcriptional activator system. Cell Res. 23, 1163–1171 (2013).
110. Endrizzi, K. et al. Discriminative quantification of cytochrome P4502D6 and 2D7/8 pseudogene expression by TaqMan real-time reverse transcriptase polymerase chain reaction. Anal. Biochem. 300, 121–131 (2002).
111. Simon, M. D. et al. The genomic binding sites of a noncoding RNA. Proc. Natl Acad. Sci. USA 108, 20497–20502 (2011).
112. Chu, C., Qu, K., Zhong, F. L., Artandi, S. E. & Chang, H. Y. Genomic maps of long noncoding RNA occupancy reveal principles of RNA–chromatin interactions. Mol. Cell 44, 667–678 (2011).
113. Cheetham, S. W. & Brand, A. H. RNA-DamID reveals cell-type-specific binding of roX RNAs at chromatin-entry sites. Nat. Struct. Mol. Biol. 25, 109–114 (2018).
114. Li, X. et al. GRID-seq reveals the global RNA–chromatin interactome. Nat. Biotechnol. 35, 940–950 (2017).
115. Bell, J. C. et al. Chromatin-associated RNA sequencing (ChAR-seq) maps genome-wide RNA-to-DNA contacts. eLife 7, e27024 (2018).
116. Bonetti, A. et al. RADICL-seq identifies general and cell type-specific principles of genome-wide RNA–chromatin interactions. Preprint at bioRxiv https://doi.org/10.1101/681924 (2019).
117. Lu, Z. et al. RNA duplex map in living cells reveals higher-order transcriptome structure. Cell 165, 1267–1279 (2016).
118. Lieberman-Aiden, E. et al. Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science 326, 289–293 (2009).
119. Kuhn, T. S. The Structure of Scientific Revolutions (Univ. Chicago Press, 1962).
120. Stanier, R. Y. & van Niel, C. B. The concept of a bacterium. Arch. Mikrobiol. 42, 17–35 (1962).
121. Woese, C. R. A new biology for a new century. Microbiol. Mol. Biol. Rev. 68, 173–186 (2004).
122. Woese, C. R., Kandler, O. & Wheelis, M. L. Towards a natural system of organisms: proposal for the domains Archaea, Bacteria, and Eucarya. Proc. Natl Acad. Sci. USA 87, 4576–4579 (1990).
123. Woese, C. R. & Goldenfeld, N. How the microbial world saved evolution from the scylla of molecular biology and the charybdis of the modern synthesis. Microbiol. Mol. Biol. Rev. 73, 14–21 (2009).
124. Mercer, T. R., Dinger, M. E. & Mattick, J. S. Long non-coding RNAs: insights into functions. Nat. Rev. Genet. 10, 155–159 (2009).
125. Brosius, J. & Gould, S. J. On ‘genomenclature’: a comprehensive (and respectful) taxonomy for pseudogenes and other ‘junk DNA’. Proc. Natl Acad. Sci. USA 89, 10706–10710 (1992).
126. Zhang, J. et al. NANOGP8 is a retrogene expressed in cancers. FEBS J. 273, 1723–1730 (2006).
127. Kandouz, M., Bier, A., Carystinos, G. D., Alaoui-Jamali, M. A. & Batist, G. Connexin43 pseudogene is expressed in tumor cells and inhibits growth. Oncogene 23, 4763–4770 (2004).
128. Chiefari, E. et al. Pseudogene-mediated posttranscriptional silencing of HMGA1 can result in insulin resistance and type 2 diabetes. Nat. Commun. 1, 40 (2010).
129. Hawkins, P. G. & Morris, K. V. Transcriptional regulation of Oct4 by a long non-coding RNA antisense to Oct4-pseudogene 5. Transcription 1, 165–175 (2010).
130. Reynaud, C. A., Anquez, V., Grimal, H. & Weill, J. C. A hyperconversion mechanism generates the chicken light chain preimmune repertoire. Cell 48, 379–388 (1987).
131. Reynaud, C. A., Dahan, A., Anquez, V. & Weill, J. C. Somatic hyperconversion diversifies the single Vh gene of the chicken with a high incidence in the D region. Cell 59, 171–183 (1989).
132. Wang, J., Pitarque, M. & Ingelman-Sundberg, M. 3′-UTR polymorphism in the human CYP2A6 gene affects mRNA stability and enzyme expression. Biochem. Biophys. Res. Commun. 340, 491–497 (2006).

Acknowledgements
The authors thank J. Mattick for feedback on the manuscript and A. Ewing for helpful discussion. S.W.C. acknowledges support from a National Health and Medical Research Council (NHMRC) Early Career Fellowship (GNT1161832) and the Mater Foundation. G.J.F. acknowledges support from a CSL Centenary Fellowship and the Mater Foundation.

Author contributions
S.W.C. and M.E.D. contributed to all aspects of the article. G.J.F. revised the manuscript.

Competing interests
The authors declare no competing interests.

Peer review information
Nature Reviews Genetics thanks P. Carninci and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© Springer Nature Limited 2019

