

Write a note on gene annotation.

Ans.
 Definition: DNA annotation or genome annotation is the process of identifying the
locations of genes and all of the coding regions in a genome and determining what
those genes do.
 History: The first genome annotation software was designed in 1995 by Dr. Owen White,
a member of the team that sequenced the bacterium Haemophilus influenzae. Annotation
therefore depends on both sequencers and computers.
 Gene annotation can be further divided into structural annotation (introns, exons, start
and stop codons, etc.) and functional annotation (based on the biological function of
the gene).
 Based on structure:
1. Open reading frames (ORFs) and their localisation
2. Gene structure
3. CDS (coding regions)
4. Location of regulatory motifs
 Based on function:
1. Biochemical function
2. Biological function
3. Regulation and interactions involved
 Genome annotation can be performed manually or automatically:

Automated genome annotation


Automatic annotation relies on a variety of software tools and different data sources to create
gene annotations. Typically, these tools are combined into programs referred to as pipelines.
The process consists of two main phases, the computation phase and the synthesis phase.
The computation phase itself can be broken down into the following steps:

 Repeat identification
 Transcript evidence alignment
 Protein evidence alignment
 Ab initio gene prediction
1. Most genomes of more complex organisms harbor an abundance of repetitive
sequences, which need to be excluded from subsequent steps.
2. These repeats range from simple cases, consisting of a dozen to several hundred
copies of the same 2–6-nucleotide motif (like
TATATATATATATATATATATATATATATATA), to transposable elements,
mobile genes, or gene fragments of viral origin that clutter their host genomes.
3. Repetitive sequence may comprise well over half the total sequence of a
genome, and can lead to spurious annotation results if left unchecked.
4. Identifying and excluding (often referred to as “masking”) repeats is therefore the first
step in automatic gene annotation and is performed by specialized tools like
RepeatMasker.
5. After repeat masking, the next two steps involve aligning external evidence for genes
to the genome assembly.
6. Transcript evidence is usually obtained from RNA-Seq or EST data of the same
organism, and provides the most direct and trustworthy evidence for the location and
structure of a gene. In addition, it gives information about whether the gene is
transcriptionally active (and thus most likely functional, that is, producing a functional
protein) and whether alternative transcripts exist.
7. Protein alignment follows the same principle, but aligns proteins from a set of closely
related organisms to the genome assembly.
8. A frequently used source of protein information, other than closely related
organisms, is SwissProt/UniProt, a database of high-quality, manually curated protein
data. Protein evidence alignment is usually performed with BLAST.
9. Ab initio gene prediction (from Latin, “from the beginning”) attempts to identify genes
using mathematical models. This approach tends to be less accurate, although
some tools like SNAP, Augustus or GeneMark can be trained to adjust their search
parameters to organism-specific traits, thus improving accuracy.
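The simple-repeat case from step 2 can be illustrated with a minimal Python sketch. It is only a toy: it finds tandem copies of a 2–6-nucleotide motif with a regular expression and lower-cases them ("soft-masking", a common convention). It is not how RepeatMasker works, and it does not handle transposable elements. The function names and the `min_copies` threshold are assumptions made for this example.

```python
import re

def find_tandem_repeats(seq, min_copies=6):
    """Return (start, end, motif) for runs of a 2-6 nt motif repeated
    at least `min_copies` times in a row (0-based, half-open coordinates)."""
    # ([ACGT]{2,6}?) captures the shortest possible motif; \1{n,} requires
    # n further consecutive copies of exactly that motif.
    pattern = re.compile(r"([ACGT]{2,6}?)\1{%d,}" % (min_copies - 1))
    return [(m.start(), m.end(), m.group(1)) for m in pattern.finditer(seq)]

def soft_mask(seq, min_copies=6):
    """Lower-case every detected tandem repeat; leave the rest unchanged."""
    chars = list(seq)
    for start, end, _motif in find_tandem_repeats(seq, min_copies):
        chars[start:end] = [c.lower() for c in chars[start:end]]
    return "".join(chars)

# Hypothetical toy sequence: unique flanks around the TA repeat from the text.
example = "GATTACAGC" + "TA" * 16 + "CCGTGA"
print(find_tandem_repeats(example))
print(soft_mask(example))
```

Downstream steps can then skip lower-case positions, which is the sense in which masked repeats are "excluded" from gene prediction.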
Genome browsers:
A variety of genome browsers have been developed, and they can be classified
into two groups:
1. Web-based genome browsers, such as the UCSC Genome Browser, the Ensembl
genome browser, the Generic Genome Browser (GBrowse), and ZENBU (see Figs. 1, 2,
and 3), in which services are hosted on a remote computer and end users access them
with a web browser over the Internet. Typically a variety of genome annotations are
preloaded and configured. End users can display and browse annotations and upload
their own data to compare with them.
2. Desktop genome browsers, such as IGV and the NCBI Genome Workbench, in which
the software has to be installed on a local computer.

Manual gene annotation:


Quality control of automatically derived gene annotations and in-depth analyses of genes of
interest often require manual gene annotation, which is the focus of this tutorial.
The first step in reviewing and editing a gene of interest is identifying its location in the
genome. This can be accomplished by using BLAST and a reference gene – usually a
curated ortholog from a related organism – as a query, either against the genome assembly,
or a set of consensus gene models produced by an automatic gene annotation pipeline.
The genomic region harbouring the gene of interest can then be inspected using a genome
browser and editor like Web Apollo, and the structure of the gene can be reviewed,
including the following features:

 Coding sequence (CDS)
 Intron-exon boundaries (splice sites)
 Untranslated regions (UTR)
 Alternative transcripts

 Once a gene’s structure, and thus its coding sequence, has been established,
its identity can be verified as well. This includes reviewing its homology relations to
putatively orthologous genes, such as the reference gene, and to similar genes in the same
genome (potential gene duplicates or paralogs).
 Finally, assigning a function is the last step in gene annotation.
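Some of the structural checks a curator makes by eye can be sketched in Python. The example below is a minimal illustration, not part of Web Apollo or any annotation pipeline. It assumes a forward-strand gene model given as sorted, 0-based, half-open coding exon coordinates, and it tests only for an ATG start, a terminal stop codon, a CDS length divisible by three, no in-frame internal stops, and canonical GT...AG intron boundaries. The function name `check_gene_model` and the toy sequence are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def check_gene_model(genome, exons):
    """Return a list of problems found in a forward-strand gene model.

    genome: upper-case DNA string.
    exons:  sorted list of (start, end) coding exons, 0-based half-open.
    """
    problems = []
    cds = "".join(genome[start:end] for start, end in exons)

    if len(cds) % 3 != 0:
        problems.append("CDS length is not a multiple of 3")
    if not cds.startswith("ATG"):
        problems.append("CDS does not begin with ATG")
    if cds[-3:] not in STOP_CODONS:
        problems.append("CDS does not end with a stop codon")

    # In-frame stop codons before the final codon would truncate the protein.
    for i in range(0, len(cds) - 3, 3):
        if cds[i:i + 3] in STOP_CODONS:
            problems.append(f"in-frame stop codon at CDS position {i}")

    # Each intron lies between the end of one exon and the start of the next.
    for (_, prev_end), (next_start, _) in zip(exons, exons[1:]):
        intron = genome[prev_end:next_start]
        if not (intron.startswith("GT") and intron.endswith("AG")):
            problems.append(
                f"non-canonical splice sites in intron {prev_end}-{next_start}")
    return problems

# Hypothetical two-exon toy gene: ATGGCC | GT...AG intron | AAATGA
toy_genome = "CC" + "ATGGCC" + "GTAAGTTTTCAG" + "AAATGA" + "GG"
print(check_gene_model(toy_genome, [(2, 8), (20, 26)]))   # well-formed model
print(check_gene_model(toy_genome, [(2, 8), (21, 26)]))   # exon boundary off by one
```

A model that passes these checks is not necessarily correct; non-canonical splice sites do occur, so flagged models are candidates for review against transcript and protein evidence rather than automatic rejection.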

Applications of gene annotation:

• Gene prediction: Prediction of gene location and structure. A simple, long ORF (open
reading frame) in a prokaryotic DNA sequence can be predicted as protein coding.
• Repeat prediction: The genomes of all organisms, particularly eukaryotic organisms,
contain repetitive elements of varying lengths that can occupy a significant fraction of
the total DNA content. For example, the human genome consists of more than 50%
repeated sequences of various types (Lander et al. 2001). Repeats play a role in a
number of regulatory functions and can contribute to genome instability.
• Identification of DNA features associated with gene structures, and translation of
protein-coding genes into protein sequence.
• Prediction of phylogenetic relatedness
• Prediction of the function of an ORF (open reading frame)
• Localisation of regulatory elements used for gene expression
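The "simple, long ORF" heuristic for prokaryotic sequence, together with translation into protein, can be sketched as follows. This is a minimal illustration assuming the standard genetic code and an ATG start. Real gene finders use statistical models and also consider alternative start codons. The names `find_orfs` and `min_codons` are choices made for this example.

```python
from itertools import product

# Standard genetic code, listed in TCAG order for each codon position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def reverse_complement(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def find_orfs(seq, min_codons=30):
    """Yield (strand, start, end, protein) for every ATG...stop ORF with at
    least `min_codons` codons before the stop, in all six reading frames.
    Coordinates are 0-based, half-open, relative to the scanned strand."""
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                      # first ATG opens the ORF
                elif start is not None and CODON_TABLE.get(codon) == "*":
                    if (i - start) // 3 >= min_codons:
                        protein = "".join(CODON_TABLE.get(s[j:j + 3], "X")
                                          for j in range(start, i, 3))
                        yield strand, start, i + 3, protein
                    start = None                   # stop codon closes the ORF

# Hypothetical toy sequence; a realistic threshold would be far higher than 3.
toy = "CC" + "ATG" + "GCC" * 3 + "TAA" + "GG"
print(list(find_orfs(toy, min_codons=3)))
```

The length threshold is what makes the heuristic useful: short ORFs occur by chance in any sequence, whereas long ones are unlikely unless the region is protein coding.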
