CE6068 Lecture 2


CE6068

Bioinformatics and Computational Molecular


Biological Databases and Sequence Analysis Tools
Chia-Ru Chung
Department of Computer Science and Information Engineering
National Central University
2024/3/13
Outline

• Quick Review
• Biotechnological Tools
• Brief History of Bioinformatics
• Biological Databases
• Sequence Analysis Tools

Quick Review
Ref/ https://medium.com/@ashutoshbele/introduction-to-bioinformatics-492d2f39f972

What is Bioinformatics
• Bioinformatics is described as the research, development, or application of computational
tools and approaches for expanding the use of biological, medical, behavioral, or health data.
• This includes acquiring, storing, organizing, archiving, analyzing, or visualizing such data.
• The essence of bioinformatics lies in its emphasis on information technology applications to
manage and analyze biological data.
• This involves creating and maintaining databases for biological information, developing
algorithms and software tools for processing and interpreting biological data, and analyzing
sequence data to identify genes, establish phylogenies, and predict the structure and function
of proteins.
Why Bioinformatics (1/2)
• Managing Large Volumes of Data: The advent of high-throughput technologies,
such as next-generation sequencing, has led to an explosion of biological data.
Bioinformatics provides the tools and methodologies to store, manage, and retrieve
this vast amount of data efficiently.
• Understanding Genetic and Genomic Information: Bioinformatics tools help in
deciphering the information contained within genomes, including the structure and
function of genes, the regulation of gene expression, and the identification of genetic
variations linked to diseases.
Why Bioinformatics (2/2)
• Drug Discovery and Development: Bioinformatics is critical in identifying molecular
targets for drug discovery, understanding the mechanisms of drug action, and predicting
potential drug side effects.
• Personalized Medicine: By analyzing genetic information, bioinformatics enables the
customization of healthcare, with medical decisions, practices, and products being tailored to
the individual patient.
• Evolutionary and Comparative Genomics: Bioinformatics tools allow researchers to
compare the genomes of different species, shedding light on evolutionary relationships, the
function of genes and proteins, and the conservation of sequences across species.
What is Computational Biology
• Computational biology is defined as the development and application of data-
analytical and theoretical methods, mathematical modeling, and computational
simulation techniques to study biological, behavioral, and social systems.
• This field focuses more on the theoretical and experimental modeling of biological
systems, integrating data across different biological scales to understand complex
biological systems.

What is Computational Molecular Biology
• Computational Molecular Biology is a subfield of bioinformatics focused specifically
on the molecular level of biology.
• It involves the use of computational algorithms, models, and tools to understand and
predict the structures, functions, and interactions of molecules such as DNA, RNA,
and proteins.
• The emphasis is on molecular mechanisms and the computational approaches to
unravel them, making it a highly specialized area within computational biology.

Why Computational Biology
• Computational biology involves the use of biological data to develop algorithms or
models to understand biological systems and relationships.
• Unlike bioinformatics, which is more focused on the development of tools and
databases, computational biology is concerned with the theoretical and experimental
modeling of biological systems.

Comparison and Integration
• Tool Development vs. Theory Application: Bioinformatics is primarily concerned
with the development of software tools and databases for biological analysis, while
computational biology focuses on applying computational and mathematical theories
to answer biological questions.
• Data Management vs. Data Analysis: Bioinformatics emphasizes data management,
including storage, retrieval, and primary analysis of biological data. Computational
biology, conversely, is more about using this data to model biological processes and
systems.
Our Body
• Our body consists of a number of organs.
• Each organ is composed of a number of tissues.
• Each tissue is composed of a tremendous number of cells.

Ref. https://www.exploringnature.org/db/view/Levels-of-Organization-in-the-Body-Cells-to-Organisms-Color
Cells
• Cell: the basic functional unit of life
‐ Performs the chemical reactions necessary to maintain life
‐ Passes the information for maintaining life to the next generation
• Actors (molecules):
‐ DNA stores and passes on information
‐ RNA is the intermediate between DNA and proteins
‐ Proteins perform chemical reactions
Ref. https://kknews.cc/science/ja96e86.html
DNA
• Deoxyribonucleic acid (DNA) is the genetic
material in all organisms (with certain viruses being
an exception) and it stores the instructions needed
by the cell to perform daily life functions.
• DNA consists of two strands which are interwoven
together to form a double helix.
• Each strand is a chain of small molecules called
nucleotides.
Ref. https://www.priyamstudycentre.com/2023/08/deoxyribonucleic-acid-dna.html
DNA Nucleotides
• There are 4 different nucleotides: adenine (A), cytosine (C), guanine (G), and thymine (T).
• C and T have 1 ring and are called pyrimidines.
• A and G have 2 rings and are called purines.

Ref. Figure 1.6 in Algorithms in Bioinformatics: A Practical Introduction by Wing-Kin Sung
Orientation of a DNA
• One strand of DNA is generated by chaining together nucleotides.
• It forms a phosphate-sugar backbone.
• It has a direction: from 5′ to 3′ (because DNA always extends from the 3′ end). Example: 5′-ACGTA-3′.
• Upstream: toward the 5′ end
• Downstream: toward the 3′ end

Ref. Figure 1.7 in Algorithms in Bioinformatics: A Practical Introduction by Wing-Kin Sung
Watson-Crick Base Pairing
• Complementary bases:
‐ A with T (two hydrogen-bonds)
‐ C with G (three hydrogen-bonds)
• The distance between the two strands is about 10 Å.
• Due to the weak interaction force, the two strands form a double helix.

Ref. https://www.mun.ca/biology/scarr/Watson-Crick_Model.html
Double Stranded DNA
• Normally, DNA is double-stranded within a cell. The two strands are antiparallel: one strand is the reverse complement of the other.
• The two strands are interwoven and form a double helix.
• One reason DNA is double-stranded is that this eases DNA replication.
Ref. https://www.khanacademy.org/science/ap-biology/gene-expression-and-regulation/replication/a/hs-dna-structure-and-replication-review
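The reverse-complement relationship between the two strands can be sketched in a few lines of Python. This is only an illustration: the function name `reverse_complement` and the example strand are chosen here and are not part of the lecture.

```python
# Watson-Crick partners for the four DNA bases.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(strand: str) -> str:
    """Return the partner strand, written 5' to 3'."""
    # Reverse first (the strands are antiparallel), then complement each base.
    return "".join(COMPLEMENT[base] for base in reversed(strand))


print(reverse_complement("ACGTA"))  # TACGT
```

Applying the function twice returns the original strand, which reflects the fact that each strand fully determines the other.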
Location of DNA in a Cell
• Two types of organisms: prokaryotes and eukaryotes.
• Prokaryotes are single-celled organisms with no nuclei (e.g. bacteria).
‐ DNA floats freely within the cell.
• Eukaryotes are organisms with single or multiple cells. Their cells have nuclei (e.g. plants and animals).
‐ DNA is located within the nucleus.
Ref. https://www.ancestry.com/c/dna-learning-hub/cells
Gene
• A gene is a sequence of DNA that encodes a protein or an RNA molecule.
• The human genome was initially expected to contain 30,000–35,000 genes; current annotations list roughly 20,000 protein-coding genes.
• For a gene that encodes a protein:
‐ In a prokaryotic genome, one gene corresponds to one protein.
‐ In a eukaryotic genome, one gene can correspond to more than one protein because of the process of “alternative splicing”.

First Phase: DNA
• Special nucleotide sequences on DNA define different gene regions:
‐ Where the transcription machinery (RNA polymerase) should be loaded
‐ Where transcription should start
‐ Where transcription should end
‐ Where the on/off switches (regulatory elements) are

Ref. https://www.ck12.org/studyguides/biology/regulation-of-gene-expression-study-guide.html
Second Phase: RNA
• RNA: ribonucleic acid
‐ It has an additional hydroxyl group at the 2′ carbon compared to DNA (that is why DNA is “deoxy…”).
• It also has four commonly found base types (note: U instead of T).

Ref. Figures 1.6 and 1.10 in Algorithms in Bioinformatics: A Practical Introduction by Wing-Kin Sung


Ref. https://stock.adobe.com/hk/images/central-dogma-of-dna-transcription-and-translation/502112289

Ref. https://hwoihann.github.io/farnorth/definition/2017/09/29/basicNGS-centralDogma-transcription.html
DNA Serves as Template
• The RNA sequence is determined by the template strand:

Template DNA    Resulting RNA
A               U (not T)
C               G
G               C
T               A

‐ RNA uses the base U instead of T.
‐ U is also complementary to A.
‐ “Coding” in “coding strand” means protein coding.
• RNA has only one strand.
Ref. https://medicine.en-academic.com/11752/antisense
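The pairing table for transcription translates directly into a lookup. The sketch below is a simplified illustration, assuming the template strand is supplied in 3′ to 5′ order so that the resulting RNA reads 5′ to 3′; the function name and example sequence are made up for this purpose.

```python
# Template DNA base -> RNA base, following the pairing table (A pairs with U, not T).
PAIRING = {"A": "U", "C": "G", "G": "C", "T": "A"}


def transcribe(template_3_to_5: str) -> str:
    """Build the RNA that pairs with a template strand given in 3'->5' order."""
    return "".join(PAIRING[base] for base in template_3_to_5)


print(transcribe("TACGGT"))  # AUGCCA
```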
Gene Structure and RNA Splicing

Ref. https://en.wikipedia.org/wiki/Gene_structure
Transcription – Eukaryotes
• A prokaryotic gene is completely transcribed into an mRNA by RNA polymerase. A eukaryotic gene instead goes through the following steps:
1. RNA polymerase produces a pre-mRNA which
contains both introns and exons.
2. The 5′ cap and poly-A tail are added to the pre-mRNA.
3. The introns are removed and an mRNA is produced.
4. The final mRNA is transported out of the nucleus.
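Step 3 (intron removal) can be illustrated with a small sketch. It assumes the intron coordinates are already known and passed in as half-open `(start, end)` index pairs, which is a simplification: in the cell, splicing is guided by sequence signals rather than supplied coordinates. The sequences and the `splice` function are hypothetical.

```python
def splice(pre_mrna: str, introns: list[tuple[int, int]]) -> str:
    """Remove the given intron intervals and join the remaining exons."""
    exons = []
    position = 0
    for start, end in sorted(introns):
        exons.append(pre_mrna[position:start])  # keep the exon before this intron
        position = end                          # skip over the intron
    exons.append(pre_mrna[position:])           # keep the final exon
    return "".join(exons)


# A made-up pre-mRNA: exon, intron, exon.
exon1, intron, exon2 = "AUGGCC", "GUAAGUUUUCAG", "UUUUAA"
pre_mrna = exon1 + intron + exon2
intron_span = (len(exon1), len(exon1) + len(intron))

print(splice(pre_mrna, [intron_span]))  # AUGGCCUUUUAA
```

Passing different subsets of introns and exons to such a function is one way to picture how alternative splicing lets one gene yield more than one mRNA.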

Ref. Figure 1.12 in Algorithms in Bioinformatics: A Practical Introduction by Wing-Kin Sung


Third Phase: Protein
• Protein: a chain of amino acids, folded into a particular structure
• Amino acid: 20 common types, all with three components attached to Cα (the central carbon):
‐ Amine group
‐ Carboxylic acid group
‐ Side chain
• The 20 types differ only in the side chain.
Ref. https://www.reagent.co.uk/blog/what-are-amino-acids/
20 Common Amino Acids
• There are 20 common amino acids, characterized by different R groups.
• These 20 amino acids can be classified according to their mass, volume, acidity, polarity, and hydrophobicity.

Ref. https://www.reagent.co.uk/blog/what-are-amino-acids/
Proteins and Molecular Functions
• Proteins are the major molecules in cells that carry out a variety of biological functions in cellular processes, such as gene transcriptional regulation, RNA splicing, post-translational modifications, signal transduction, and metabolic pathways.

Genome → Transcriptome → Proteome
Ref. https://www.nature.com/articles/s41570-020-00223-8
Post-Translation Modification (PTM)
• Post-translation modification (PTM) is the chemical modification of a protein after its translation. PTM is an extremely important cellular control mechanism that alters a protein's physical and chemical properties, folding, conformation, stability, and activity, and consequently its function.
‐ Addition of functional groups
E.g. acylation, methylation, phosphorylation
‐ Addition of other peptides
E.g. ubiquitination, the covalent linkage to the protein ubiquitin.
‐ Structural changes
E.g. disulfide bridges, the covalent linkage of two cysteine amino acids.
Post-Translation Modification (PTM)
Figure: common PTM types (disulfide bond, ubiquitination, methylation, glycosylation, phosphorylation, proteolytic cleavage, acetylation)
Ref. https://doi.org/10.1586/14789450.2015.1042867
Omics Sciences
Ref. https://doi.org/10.1177/0022034510383691

Biotechnological Tools
What are Basic Biotechnological Tools
• Basic biotechnological tools refer to a set of techniques and methods developed for
the manipulation, analysis, and characterization of DNA, RNA, proteins, and other
biomolecules.
• These tools have revolutionized biological sciences by allowing scientists to read,
interpret, and modify the genetic material.
• Key tools include Restriction Enzymes, Sonication, Cloning, PCR (Polymerase Chain
Reaction), Gel Electrophoresis, Hybridization, and Next Generation DNA Sequencing.

Basic Biotechnological Tools
• Cutting and breaking DNA
‐ Restriction Enzymes
‐ Sonication
• Copying DNA
‐ Cloning
‐ Polymerase Chain Reaction –PCR
• Measuring length of DNA
‐ Gel Electrophoresis
Restriction Enzymes (1/2)
• A restriction enzyme recognizes certain points in the DNA, called restriction sites, that match a particular pattern, and breaks the DNA there.
• This process is called digestion.
• In nature, bacteria use restriction enzymes to break foreign DNA and avoid infection.
• In the lab, they are used to cut DNA at specific sequences, enabling the study of gene structure and function and the construction of recombinant DNA.
Ref. https://www.sciencelearn.org.nz/resources/2035-restriction-enzymes


Restriction Enzymes (2/2)
• Example:
‐ EcoRI, one of the first restriction enzymes discovered, cuts DNA wherever the sequence GAATTC is found.
‐ As with most other restriction sites, GAATTC is a palindrome, that is, GAATTC is its own reverse complement.
• Currently, more than 300 restriction enzymes are known.
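The palindrome property and the search for restriction sites can both be checked with short functions. This is a minimal sketch; the helper names and the example DNA string are assumptions made for illustration.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq: str) -> str:
    """Complement every base and reverse the result."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def is_palindromic_site(site: str) -> bool:
    """A site is palindromic when it equals its own reverse complement."""
    return site == reverse_complement(site)


def find_sites(dna: str, site: str = "GAATTC") -> list[int]:
    """Return the 0-based start position of every occurrence of the site."""
    positions = []
    start = dna.find(site)
    while start != -1:
        positions.append(start)
        start = dna.find(site, start + 1)
    return positions


print(is_palindromic_site("GAATTC"))       # True
print(find_sites("TTGAATTCAAGGGAATTCTT"))  # [2, 12]
```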

EcoRI
• EcoRI was one of the first restriction enzymes discovered.
• It cuts between G and A, creating sticky ends.
• Note that some restriction enzymes give rise to blunt ends instead of sticky ends.

Ref. https://www.sciencephoto.com/media/1293243/view/ecori-enzyme-restriction-site-illustration
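A digestion by EcoRI can be simulated on one strand by cutting after the G of every GAATTC. The sketch below is a simplification that tracks only the top strand and ignores the overhang on the partner strand; the `digest` function, its `cut_offset` parameter, and the example DNA are illustrative assumptions.

```python
def digest(dna: str, site: str = "GAATTC", cut_offset: int = 1) -> list[str]:
    """Cut the strand cut_offset bases into every occurrence of the site."""
    fragments = []
    previous_cut = 0
    start = dna.find(site)
    while start != -1:
        cut = start + cut_offset            # EcoRI cuts between G and A
        fragments.append(dna[previous_cut:cut])
        previous_cut = cut
        start = dna.find(site, start + 1)
    fragments.append(dna[previous_cut:])    # the piece after the last cut
    return fragments


print(digest("TTGAATTCAAGGGAATTCTT"))
# ['TTG', 'AATTCAAGGG', 'AATTCTT']
```

Every fragment after the first begins with AATTC, which corresponds to the single-stranded overhang that makes the ends sticky.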
Sonication
• Breaks DNA into smaller pieces through ultrasonic vibration; useful for preparing DNA for sequencing or library construction.
• Method:
‐ Start with a solution containing a large amount of purified DNA.
‐ Applying high-frequency vibration breaks each molecule randomly into small fragments.
Ref. https://goldbio.com/articles/article/how-to-fragment-DNA-for-NGS
Why Cutting DNA (1/2)
• Cloning and Genetic Engineering: Cutting DNA at specific sequences allows scientists to isolate
genes of interest or specific DNA segments. These fragments can then be inserted into vectors (like
plasmids) and introduced into host organisms for cloning, producing recombinant DNA, and ultimately
expressing the genes. This is foundational for producing genetically modified organisms (GMOs),
synthesizing proteins, and developing gene therapies.
• Genome Mapping and Sequencing: Fragmenting DNA is a critical step in sequencing genomes and
creating detailed genetic maps. Techniques like shotgun sequencing involve cutting DNA into
numerous small pieces, sequencing these fragments, and then using bioinformatics tools to assemble
the sequences into a complete genome. This approach has been vital for projects like the Human
Genome Project.
Why Cutting DNA (2/2)
• Diagnostic Applications: Cutting DNA allows for the detection of genetic mutations,
variations, and the presence of pathogenic organisms.
• Study of Gene Expression and Regulation: Researchers cut and manipulate DNA to study
how genes are regulated and expressed in different conditions.
• Epigenetics and Chromatin Studies: Cutting DNA is used in studying the structure and function of
chromatin and epigenetics.
• Facilitates Research and Development: In biotechnology and pharmaceutical industries, cutting DNA
is instrumental in developing new drugs, vaccines, and therapies. It allows for the discovery and
production of therapeutic proteins, the development of genetic tests, and the creation of genetically
engineered vectors for gene therapy.
Cloning
• For many experiments, small amounts of DNA are not enough.
• Cloning is one way to replicate DNA.
• Given a piece of DNA X, the cloning process is as follows:
1. Insert X into a plasmid vector that carries an antibiotic-resistance gene; a recombinant DNA molecule is formed.
2. Insert the recombinant into the host cell (usually E. coli).
3. Grow the host cells in the presence of the antibiotic.
Note that only cells with the antibiotic-resistance gene can grow.
When the host cell duplicates, X is also duplicated.
4. Select the cells with the antibiotic-resistance gene.
5. Kill them and extract X.
Ref. https://byjus.com/biology/dna-cloning/
Polymerase Chain Reaction (1/2)
• PCR was invented by Kary B. Mullis in 1984.
• PCR allows rapid replication of a selected region of a DNA molecule without the need for a living cell.
‐ Automated. Time required: a few hours.
• Inputs for PCR:
‐ Two oligonucleotides are synthesized, each complementary to one of the two ends of the region. They are used as primers.
‐ Thermostable DNA polymerase (Taq polymerase). Taq stands for Thermus aquaticus, a bacterium that grows in the hot springs of Yellowstone.
Polymerase Chain Reaction (2/2)
• PCR consists of repeating a three-phase cycle 25–30 times. Each cycle takes about 5 minutes.
Phase 1: Separate the double-stranded DNA by heat.
Phase 2: Cool so that the primers anneal to the single strands.
Phase 3: Taq polymerase catalyzes 5′ to 3′ DNA synthesis from the primers.
• After these cycles, the selected region has been amplified exponentially.
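To see what exponential amplification means in numbers, the following sketch assumes an idealized reaction in which every cycle exactly doubles the number of copies. Real reactions are less efficient, so these figures are upper bounds; the function name is made up for illustration.

```python
def copies_after(cycles: int, initial_copies: int = 1) -> int:
    """Copies of the target region after ideal PCR cycles (doubling each cycle)."""
    return initial_copies * 2 ** cycles


for cycles in (25, 30):
    print(cycles, copies_after(cycles))
# 25 33554432
# 30 1073741824
```

At about 5 minutes per cycle, 30 cycles take roughly 150 minutes, which is consistent with the "few hours" quoted for PCR.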

Ref. https://www.britannica.com/science/polymerase-chain-reaction
Example Applications of PCR
• The PCR method is used to amplify DNA segments to the point where they can be readily isolated for use.
• Example applications:
‐ Cloning DNA fragments from mummies
‐ Detecting viral infections

Gel Electrophoresis
• Gel electrophoresis underlies the DNA sequencing method developed by Frederick Sanger in 1977.
• It is a technique used to separate a mixture of DNA fragments of different lengths.
• We apply an electrical field to the mixture of DNA.
• DNA is negatively charged, so it moves toward the positive end. Due to friction, small molecules travel faster than large molecules.
• The mixture is separated into bands, each containing DNA molecules of the same length.
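The outcome of a gel run can be modeled as grouping fragments by length. The sketch below is a toy model under the assumption that migration depends only on length; the function name `run_gel` and the fragments are hypothetical.

```python
from collections import defaultdict


def run_gel(fragments: list[str]) -> list[tuple[int, list[str]]]:
    """Group fragments into bands by length, shortest (fastest) band first."""
    bands = defaultdict(list)
    for fragment in fragments:
        bands[len(fragment)].append(fragment)
    # Shorter fragments travel farther, so list them first.
    return [(length, bands[length]) for length in sorted(bands)]


mixture = ["AATTCTT", "TTG", "AATTCAAGGG", "CCA", "GGGTTTA"]
for length, band in run_gel(mixture):
    print(length, band)
# 3 ['TTG', 'CCA']
# 7 ['AATTCTT', 'GGGTTTA']
# 10 ['AATTCAAGGG']
```

Note that a band only reveals length: two different sequences of the same length end up in the same band.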

Applications
• Separating DNA sequences from a mixture
For example, after a genome is digested by a restriction enzyme, hundreds or
thousands of DNA fragments are yielded.
By Gel Electrophoresis, the fragments can be separated.

Sequencing by Gel electrophoresis
• An application of gel electrophoresis is to reconstruct a DNA sequence of length 500–800 within a few hours.
• Idea:
‐ Generate all prefixes of the sequence that end with A.
‐ Using gel electrophoresis, the fragments that end with A are separated into different bands. This tells us the positions of the A's in the sequence.
‐ Do the same for C, G, and T.

Read the Sequence
• We have four groups of fragments: A, C, G, and T.
• All fragments are placed at the negative end.
• The fragments move toward the positive end.
• From the relative distances traveled by the fragments, we can reconstruct the sequence.

Ref. Figure 1.17 in Algorithms in Bioinformatics: A Practical Introduction by Wing-Kin Sung
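Reading the sequence from the four lanes amounts to sorting all fragment lengths and noting which lane each length came from. The sketch below is a simplified model in which each lane is just a list of fragment lengths; `run_lanes` and `read_gel` are illustrative names, not part of the lecture.

```python
def run_lanes(sequence: str) -> dict[str, list[int]]:
    """Simulate the four reactions: for each base, the lengths of prefixes ending in it."""
    lanes = {"A": [], "C": [], "G": [], "T": []}
    for length, base in enumerate(sequence, start=1):
        lanes[base].append(length)
    return lanes


def read_gel(lanes: dict[str, list[int]]) -> str:
    """Reconstruct the sequence by visiting the bands from shortest to longest."""
    length_to_base = {
        length: base for base, lengths in lanes.items() for length in lengths
    }
    return "".join(length_to_base[length] for length in sorted(length_to_base))


lanes = run_lanes("ACGTA")
print(lanes)            # {'A': [1, 5], 'C': [2], 'G': [3], 'T': [4]}
print(read_gel(lanes))  # ACGTA
```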


Hybridization
• Among thousands of DNA fragments, biologists routinely need to find a DNA fragment that contains a particular DNA subsequence.
• This can be done based on hybridization.
1. Suppose we need to find a DNA fragment that contains ACCGAT.
2. Create probes that are the reverse complement of ACCGAT.
3. Mix the probes with the DNA fragments.
4. Due to the hybridization rule (A=T, C≡G), DNA fragments that contain ACCGAT will hybridize with the probes.
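The four steps can be mimicked in code by building the probe and testing which fragments it would bind. This sketch assumes perfect, full-length matching, which is stricter than real hybridization; the function names and the fragment list are made up for illustration.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq: str) -> str:
    """Complement every base and reverse the result."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def hybridizes(fragment: str, probe: str) -> bool:
    """A probe binds wherever the fragment contains the probe's reverse complement."""
    return reverse_complement(probe) in fragment


target = "ACCGAT"
probe = reverse_complement(target)  # step 2: design the probe
print(probe)                        # ATCGGT

fragments = ["TTACCGATGG", "GGGTTTAAAC", "ACCGAACCGATT"]
hits = [f for f in fragments if hybridizes(f, probe)]  # steps 3 and 4
print(hits)                         # ['TTACCGATGG', 'ACCGAACCGATT']
```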
DNA Array
• The idea of hybridization leads to the DNA array technology.
• In the past, experiments studied “one gene in one experiment”, which made it hard to get the whole picture.
• A DNA array is a technology that allows researchers to run an experiment on a set of genes or even the whole genome.

Idea of DNA Array
• An orderly arrangement of thousands of spots.
• Each spot contains many copies of the same DNA fragment.
• When the array is exposed to the target solution, DNA fragments in the array and in the target solution pair up according to the hybridization rule: A=T, C≡G (hydrogen bonds).
• This idea allows us to run thousands of hybridization experiments at the same time.
Ref. Figure 1.18 in Algorithms in Bioinformatics: A Practical Introduction by Wing-Kin Sung
Applications of DNA arrays
• Sequencing by hybridization
‐ A promising alternative to sequencing by gel electrophoresis
‐ It may be able to reconstruct longer DNA sequences in shorter time
• Expression profile of a cell
‐ DNA arrays allow us to monitor the activities within a cell
‐ Each spot contains the complement of a particular gene
‐ Due to hybridization, we can measure the concentration of different mRNAs within a cell
• SNP detection
‐ Using probes with different alleles to detect the single nucleotide variation.
Microarray (1/2)
• Microarray technology has become one of the indispensable tools that many
biologists use to monitor genome-wide expression levels of genes (actually the
mRNAs) in a given organism.
Expression level is estimated by measuring the amount of mRNA for a particular gene.
• A microarray is typically a glass slide on to which DNA molecules are fixed in an
orderly manner at specific locations called spots (or features).
• DNA microarrays, also known as DNA chips or biochips, are a technology used to
measure the expression levels of thousands of genes simultaneously or to genotype
multiple regions of a genome.
Microarray (2/2)

Microarray (continued)
• A microarray may contain thousands of spots, and each spot may contain a few million copies of identical DNA molecules that uniquely correspond to a gene.
• The DNA in a spot may be either genomic DNA or short stretches of oligonucleotide strands that correspond to a gene. The spots are printed onto the glass slide by a robot or are synthesized by photolithography.

The Principle of Microarray
• Microarrays can be used to measure
gene expression in many ways, but one
of the most popular applications is to
compare expression of a set of genes
from a cell maintained in a particular
condition (condition A) to the same set
of genes from a reference cell
maintained under normal conditions
(condition B).
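One common way to summarize such a two-condition comparison is the log2 ratio of the two measured intensities per spot. The slides do not specify a formula, so this choice is an assumption, and the gene names and intensity values below are invented.

```python
import math


def log2_ratio(intensity_a: float, intensity_b: float) -> float:
    """Positive: higher in condition A. Negative: higher in reference condition B."""
    return math.log2(intensity_a / intensity_b)


# Hypothetical spot intensities: (condition A, condition B).
spots = {"gene1": (800.0, 200.0), "gene2": (150.0, 150.0), "gene3": (100.0, 400.0)}

for gene, (a, b) in spots.items():
    print(gene, log2_ratio(a, b))
# gene1 2.0
# gene2 0.0
# gene3 -2.0
```

The log scale treats a fourfold increase (+2) and a fourfold decrease (-2) symmetrically, which a plain ratio (4 versus 0.25) does not.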
Types of DNA Microarrays
• cDNA Microarrays: These involve spotting probes of complementary DNA (cDNA)
on the microarray surface. They are often used for gene expression studies, where the
mRNA levels of different genes are compared between samples.
• Oligonucleotide Microarrays: These contain short DNA oligonucleotide probes
designed to hybridize to specific sequences. They can be used for both gene
expression analysis and for genotyping.

DNA Microarray

Ref. https://www.youtube.com/watch?v=NgRfc6atXQ8&ab_channel=Henrik%27sLab
Next Generation Sequencing
• Next Generation Sequencing (NGS), also known as high-throughput sequencing,
represents a collection of modern sequencing technologies that have revolutionized
genomic research.
• Unlike the Sanger sequencing method, which sequences DNA fragments one at a time, NGS technologies allow for the simultaneous sequencing of millions or even billions of DNA fragments.
• This advancement has dramatically reduced the cost and time required for genome
sequencing, making large-scale genomic studies feasible.
Next Generation Sequencing

Ref. https://www.labtestsguide.com/next-generation-sequencing-ngs
Major NGS Platforms
Applications of NGS

Ref. https://www.youtube.com/watch?v=WKAUtJQ69n8&ab_channel=ClevaLab
Applications of NGS
• Whole Genome Sequencing: Comprehensive analysis of entire genomes, aiding in the identification
of genetic variants, mutations, and structural variations associated with diseases.
• Transcriptome Analysis (RNA-seq): Examines the complete set of RNA transcripts at a given
moment, providing insights into gene expression levels and novel gene discovery.
• Epigenomics: Studies modifications on the genetic material of a cell that do not change the DNA
sequence but influence gene expression, such as DNA methylation and histone modification.
• Metagenomics: Analyzes genetic material recovered directly from environmental samples, enabling
the study of microbial communities and their roles in health and disease without the need for culturing.
• Personalized Medicine: Facilitates the development of personalized treatment strategies based on the
genetic makeup of an individual's disease, particularly in oncology.
Brief History of Bioinformatics
Ref. https://link.springer.com/chapter/10.1007/978-3-031-22206-1_4
Ref. https://en.wikipedia.org/wiki/Computational_biology
Biological Databases
Biological Databases
• Biological databases are structured collections of biological information, curated to
support the retrieval, analysis, and understanding of data across various fields of
biological research.
• The rapid advancement in biotechnology and high-throughput sequencing
technologies has led to an explosion of biological data, necessitating the development
of databases to manage this information efficiently.

Significance of Biological Databases
• Data Management: Biological databases provide a systematic way to store, organize, and manage vast
amounts of data generated by research activities worldwide. They ensure data preservation and
accessibility for future research.
• Research Acceleration: By offering immediate access to pre-analyzed and structured data, databases
significantly reduce the time and resources needed for preliminary analyses, accelerating the pace of
scientific discovery.
• Data Integration: Databases facilitate the integration of diverse types of data (genomic, proteomic,
phenotypic, etc.), enabling holistic views of biological questions and fostering interdisciplinary
research.
• Enhanced Reproducibility: They promote transparency and reproducibility in science by providing
access to raw data and metadata, allowing researchers to validate and build upon previous findings.
Overview of the Types
• Sequence Databases: Store DNA, RNA, and protein sequences. Examples include GenBank, EMBL,
DDBJ (nucleotide sequences), and UniProt (protein sequences).
• Structure Databases: Contain information on the three-dimensional structures of biological
macromolecules. Examples are Protein Data Bank (PDB) and CATH.
• Functional and Annotation Databases: Provide annotations for genes and proteins, including
functional descriptions, pathway information, and disease associations. Examples include Gene
Ontology (GO) and KEGG.
• Expression Databases: Hold data from gene expression studies, such as microarray or RNA-Seq
datasets. GEO and ArrayExpress are prime examples.
• Literature Databases: Offer access to scientific publications and curated summaries of research
articles. PubMed is the most widely used literature database.
Sequence Databases
• Sequence Databases are perhaps the most foundational types of biological databases,
including repositories of nucleotide sequences (DNA, RNA) and protein sequences.
• They are crucial for a wide range of analyses, from basic sequence alignment to complex
evolutionary studies.
‐ Nucleotide Sequence Databases: GenBank, EMBL, and DDBJ form the triad of primary nucleotide
sequence databases. These databases are instrumental for gene identification, comparative genomics, and
phylogenetic analyses.
‐ Protein Sequence Databases: UniProt is a central repository for protein sequences and annotations. It
combines information from Swiss-Prot, TrEMBL, and the PIR-PSD database, providing comprehensive
data on protein function, classification, and sequence. Protein databases are vital for understanding
protein function, structure, and evolution.
Nucleotide Databases (1/3)

Nucleotide Databases (2/3)

Figure: The growth of nucleotide sequences on a logarithmic scale in public databases.
Ref. Fig. 2.1 in Bioinformatics and Computational Biology: A Primer for Biologists by Basant K. Tiwary
Nucleotide Databases (3/3)

Figure: The exponential growth in the number of taxa in the available public databases.
Ref. Fig. 2.2 in Bioinformatics and Computational Biology: A Primer for Biologists by Basant K. Tiwary
GenBank
Ref. Fig. 2.3 in Bioinformatics and Computational Biology: A Primer for Biologists by Basant K. Tiwary

GenBank is searched online using Entrez, whereas both the EMBL and DDBJ databases are searched through SRS servers.
Ensembl Genome Database
Ref. Fig. 2.4 in Bioinformatics and Computational Biology: A Primer for Biologists by Basant K. Tiwary

• Ensembl is an integrated platform for genome annotation and distribution of genomic data, with comprehensive annotation of genomic variants, transcript structures, and regulatory regions.
• It provides a valuable resource for evolutionary studies using large-scale comparative genomics data from 227 vertebrate and model species.
UCSC Genome Browser
Ref. Fig. 2.5 in Bioinformatics and Computational Biology: A Primer for Biologists by Basant K. Tiwary

The University of California Santa Cruz (UCSC) Genome Browser currently provides a web-based view of 211 genome assemblies of more than a hundred species.
Protein Databases

UniProt
• The Universal Protein Resource (UniProt) provides comprehensive access to high-quality protein sequences. The UniProt Knowledgebase (UniProtKB) is the primary source of universal protein sequence information, containing more than 189 million sequences obtained from experimental sequencing as well as translated ORF sequences from EMBL.
• This database is endowed with well-annotated protein sequences and a preliminary assignment of motifs present in the sequences.
• It is also cross-referenced with other useful databases. UniProtKB consists of two divisions: Swiss-Prot and TrEMBL. TrEMBL is an automated database requiring minimal human intervention, whereas Swiss-Prot is a well-curated database with manual entry of useful information from the available literature.
Protein Data Bank (PDB)
• The Protein Data Bank (PDB) is an international consortium consisting of four
partners, namely Research Collaboratory for Structural Bioinformatics Protein Data
Bank (RCSB-PDB), Protein Data Bank in Europe (PDBe), Protein Data Bank Japan
(PDBj) and Biological Magnetic Resonance Bank (BMRB).
• It is a global archive of the structures of proteins and other macromolecules, determined using X-ray crystallography and NMR. The archive was launched in 1971 and contains more than 150,000 macromolecular structures, with about 12,000 new structures added each year.
Expression Databases (1/2)
• These databases store data derived from experiments that measure the expression
levels of genes across different conditions, tissues, or developmental stages.
• Gene Expression Omnibus (GEO) and ArrayExpress are leading examples, facilitating
the exploration of gene expression patterns and the discovery of genes associated with
diseases or specific biological processes.

Expression Databases (2/2)

Pathway Databases (1/2)
• Pathway databases, such as KEGG and Reactome, provide curated information on
biochemical pathways, including metabolic pathways, signaling pathways, and gene
regulatory networks.
• These resources are indispensable for systems biology, allowing researchers to
understand complex biological interactions and the effects of alterations within these
pathways.

Pathway Databases (2/2)

Disease Databases (1/2)
• Disease databases focus on linking genetic variations to specific diseases.
• Online Mendelian Inheritance in Man (OMIM) and the Genome-Wide Association
Study (GWAS) Catalog are examples of databases that aggregate information on
genetic predispositions to diseases, supporting research into the genetic bases of
diseases and the development of personalized medicine.

Disease Databases (2/2)

Organism-Specific and Virus Databases (1/2)
• These databases are tailored to specific organisms or groups of organisms, providing
comprehensive resources for species-specific research. Examples include WormBase
for nematodes, FlyBase for Drosophila species, and specific databases for model
organisms like Mouse Genome Informatics (MGI).
• During the COVID-19 pandemic, databases focusing on the SARS-CoV-2 virus, such
as the COVID-19 Data Portal, have become critical for tracking virus mutations and
understanding its biology.

Organism-Specific and Virus Databases (2/2)

Sequence Analysis Tools
BLAST (1/2)
• The Basic Local Alignment Search Tool (BLAST) is the most widely used search tool for sequence databases. It finds regions of local similarity (conserved sequence patterns) between a query DNA or protein sequence and the sequences in a target database.
• The ultimate aim of a BLAST search is to infer an evolutionary and functional relationship between two DNA or protein sequences.
• There are three important aspects of a search process: the input query sequence, the target database, and the choice of a suitable BLAST program.
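To make "local similarity" concrete, the sketch below implements Smith-Waterman local alignment between a query and a single target sequence. This is not how BLAST itself works: BLAST uses a faster heuristic of seeding and extension, while Smith-Waterman is the exact dynamic-programming method it approximates. The scoring values (match 2, mismatch -1, gap -2) and the example sequences are assumptions chosen for illustration.

```python
def smith_waterman(query, target, match=2, mismatch=-1, gap=-2):
    """Return (best score, aligned query, aligned target) for the best local alignment."""
    rows, cols = len(query) + 1, len(target) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    def substitution(i, j):
        return match if query[i - 1] == target[j - 1] else mismatch

    # Fill the table; a cell never drops below 0, which makes the alignment local.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                0,
                score[i - 1][j - 1] + substitution(i, j),  # align two bases
                score[i - 1][j] + gap,                     # gap in target
                score[i][j - 1] + gap,                     # gap in query
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until the score reaches 0.
    i, j = best_pos
    aligned_query, aligned_target = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        if score[i][j] == score[i - 1][j - 1] + substitution(i, j):
            aligned_query.append(query[i - 1])
            aligned_target.append(target[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            aligned_query.append(query[i - 1])
            aligned_target.append("-")
            i -= 1
        else:
            aligned_query.append("-")
            aligned_target.append(target[j - 1])
            j -= 1
    return best, "".join(reversed(aligned_query)), "".join(reversed(aligned_target))


print(smith_waterman("ACCGAT", "TTACCGATGG"))  # (12, 'ACCGAT', 'ACCGAT')
```

Running this against every sequence in a large database would be slow, which is the practical reason heuristic tools such as BLAST exist.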

BLAST (2/2)

Q&A
Thank you!
