CE6068 Lecture 2
• Quick Review
• Biotechnological Tools
• Brief History of Bioinformatics
• Biological Databases
• Sequence Analysis Tools
Quick Review
Ref. https://medium.com/@ashutoshbele/introduction-to-bioinformatics-492d2f39f972
What is Bioinformatics
• Bioinformatics is described as the research, development, or application of computational
tools and approaches for expanding the use of biological, medical, behavioral, or health data.
• This includes acquiring, storing, organizing, archiving, analyzing, or visualizing such data.
• The essence of bioinformatics lies in its emphasis on information technology applications to
manage and analyze biological data.
• This involves creating and maintaining databases for biological information, developing
algorithms and software tools for processing and interpreting biological data, and analyzing
sequence data to identify genes, establish phylogenies, and predict the structure and function
of proteins.
Why Bioinformatics (1/2)
• Managing Large Volumes of Data: The advent of high-throughput technologies,
such as next-generation sequencing, has led to an explosion of biological data.
Bioinformatics provides the tools and methodologies to store, manage, and retrieve
this vast amount of data efficiently.
• Understanding Genetic and Genomic Information: Bioinformatics tools help in
deciphering the information contained within genomes, including the structure and
function of genes, the regulation of gene expression, and the identification of genetic
variations linked to diseases.
Why Bioinformatics (2/2)
• Drug Discovery and Development: Bioinformatics is critical in identifying molecular
targets for drug discovery, understanding the mechanisms of drug action, and predicting
potential drug side effects.
• Personalized Medicine: By analyzing genetic information, bioinformatics enables the
customization of healthcare, with medical decisions, practices, and products being tailored to
the individual patient.
• Evolutionary and Comparative Genomics: Bioinformatics tools allow researchers to
compare the genomes of different species, shedding light on evolutionary relationships, the
function of genes and proteins, and the conservation of sequences across species.
What is Computational Biology
• Computational biology is defined as the development and application of data-
analytical and theoretical methods, mathematical modeling, and computational
simulation techniques to study biological, behavioral, and social systems.
• This field focuses more on the theoretical and experimental modeling of biological
systems, integrating data across different biological scales to understand complex
biological systems.
What is Computational Molecular Biology
• Computational Molecular Biology is a subfield of bioinformatics focused specifically
on the molecular level of biology.
• It involves the use of computational algorithms, models, and tools to understand and
predict the structures, functions, and interactions of molecules such as DNA, RNA,
and proteins.
• The emphasis is on molecular mechanisms and the computational approaches to
unravel them, making it a highly specialized area within computational biology.
Why Computational Biology
• Computational biology involves the use of biological data to develop algorithms or
models to understand biological systems and relationships.
• Unlike bioinformatics, which is more focused on the development of tools and
databases, computational biology is concerned with the theoretical and experimental
modeling of biological systems.
Comparison and Integration
• Tool Development vs. Theory Application: Bioinformatics is primarily concerned
with the development of software tools and databases for biological analysis, while
computational biology focuses on applying computational and mathematical theories
to answer biological questions.
• Data Management vs. Data Analysis: Bioinformatics emphasizes data management,
including storage, retrieval, and primary analysis of biological data. Computational
biology, conversely, is more about using this data to model biological processes and
systems.
Our Body
• Our body consists of a number of organs.
• Each organ is composed of a number of tissues.
• Each tissue is composed of a tremendous number of cells.
Ref. https://www.exploringnature.org/db/view/Levels-of-Organization-in-the-Body-Cells-to-Organisms-Color
Cells
• Cell: Basic functional unit of life
‐ Perform chemical reactions necessary to maintain our
life
‐ Pass the information for maintaining life to the next
generation
• Actors (molecules):
‐ DNA stores and passes information
‐ RNA is the intermediate between DNA and proteins
‐ Protein performs chemical reactions
Ref. https://kknews.cc/science/ja96e86.html
DNA
• Deoxyribonucleic acid (DNA) is the genetic
material in all organisms (with certain viruses being
an exception) and it stores the instructions needed
by the cell to perform daily life functions.
• DNA consists of two strands which are interwoven
together to form a double helix.
• Each strand is a chain of small molecules called
nucleotides.
Ref. https://www.priyamstudycentre.com/2023/08/deoxyribonucleic-acid-dna.html
DNA Nucleotides
• There are 4 different nucleotides: adenine (A), cytosine (C), guanine (G), and thymine (T).
Watson-Crick Base Pairing
• Complementary bases:
‐ A with T (two hydrogen-bonds)
‐ C with G (three hydrogen-bonds)
• The distance between the two strands is about 10 Å.
• Due to the weak interaction force between them, the two strands form a double helix.
Ref. https://www.mun.ca/biology/scarr/Watson-Crick_Model.html
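The pairing rule above (two hydrogen bonds for A with T, three for C with G) can be turned into a small calculation. The following is a minimal sketch, assuming a fully paired duplex with no mismatches; the function name `count_hydrogen_bonds` is made up for illustration.

```python
# Hydrogen bonds contributed by each base when paired with its
# Watson-Crick partner: A-T pairs have 2, C-G pairs have 3.
H_BONDS = {"A": 2, "T": 2, "C": 3, "G": 3}


def count_hydrogen_bonds(strand):
    """Total hydrogen bonds when `strand` is fully paired with its complement."""
    return sum(H_BONDS[base] for base in strand.upper())


# GAATTC: G(3) + A(2) + A(2) + T(2) + T(2) + C(3) = 14
print(count_hydrogen_bonds("GAATTC"))
```

A GC-rich duplex has more hydrogen bonds per base pair than an AT-rich one of the same length, which is one reason GC content is often reported for a sequence.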
Double Stranded DNA
• Normally, DNA is double stranded within a cell. The two strands are antiparallel: one strand is
the reverse complement of the other.
• The two strands are interwoven and form a double helix.
• One reason DNA is double stranded is that this eases DNA replication.
Ref. https://www.khanacademy.org/science/ap-biology/gene-expression-and-regulation/replication/a/hs-dna-structure-and-replication-review
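Because the two strands are antiparallel and complementary, either strand can be computed from the other. The sketch below illustrates this, assuming the input uses only the letters A, C, G, and T and is written 5′ to 3′; the function name is illustrative.

```python
# Translation table for Watson-Crick complements: A<->T, C<->G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(strand):
    """Return the opposite strand, also written 5' to 3'."""
    # Complement every base, then reverse to account for antiparallel strands.
    return strand.upper().translate(COMPLEMENT)[::-1]


print(reverse_complement("ATGC"))    # GCAT
print(reverse_complement("GAATTC"))  # GAATTC (this site reads the same on both strands)
```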
Location of DNA in a Cell
• Two types of organisms: Prokaryotes and Eukaryotes.
• Prokaryotes are single-celled organisms with no nuclei (e.g. bacteria)
‐ DNA floats freely within the cell
• Eukaryotes are organisms with single or multiple cells.
Ref. https://www.ancestry.com/c/dna-learning-hub/cells
First Phase: DNA
• Special nucleotide sequences on DNA define different gene regions:
‐ Where the transcription machinery (RNA polymerase) should be loaded
‐ Where transcription should start
‐ Where transcription should end
‐ Where the on/off switches (regulatory elements) are
Ref. https://www.ck12.org/studyguides/biology/regulation-of-gene-expression-study-guide.html
Second Phase: RNA
• RNA: Ribonucleic acid
‐ Additional hydroxyl group at the 2′ carbon as compared to DNA (that’s why DNA is “deoxy…”)
Ref. Figure 1.10 in Algorithms in Bioinformatics: A Practical Introduction by Wing-Kin Sung
Ref. https://hwoihann.github.io/farnorth/definition/2017/09/29/basicNGS-centralDogma-transcription.html
DNA Serves as Template
• RNA sequence is determined according to the template strand
Template DNA    Resulting RNA
A               U (not T)
C               G
G               C
T               A
Ref. https://medicine.en-academic.com/11752/antisense
Ref. https://en.wikipedia.org/wiki/Gene_structure
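The template-to-RNA mapping (A to U, C to G, G to C, T to A) can be sketched in a few lines. This is a simplified illustration, assuming the template strand is given 5′ to 3′ and the RNA is reported 5′ to 3′, so the template is read backwards; the function name `transcribe` is a placeholder, not a library call.

```python
# Base on the template DNA strand -> base placed in the RNA.
TEMPLATE_TO_RNA = {"A": "U", "C": "G", "G": "C", "T": "A"}


def transcribe(template_5to3):
    """Return the RNA (5' to 3') produced from a template strand (5' to 3')."""
    # RNA polymerase reads the template 3' to 5', so walk the string in reverse.
    return "".join(TEMPLATE_TO_RNA[base] for base in reversed(template_5to3.upper()))


print(transcribe("TACG"))  # CGUA
```

The resulting RNA has the same sequence as the non-template (coding) strand, with U in place of T.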
Transcription – Eukaryotes
• A prokaryotic gene is completely transcribed into an mRNA by RNA polymerase.
• In eukaryotes, additional processing steps are needed:
1. RNA polymerase produces a pre-mRNA which
contains both introns and exons.
2. The 5′ cap and poly-A tail are added to the pre-mRNA.
3. The introns are removed and an mRNA is produced.
4. The final mRNA is transported out of the nucleus.
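The processing steps above can be mimicked on a toy string. The sketch below is only an illustration under stated assumptions: exon coordinates are supplied by hand (hypothetical values), the 5′ cap is represented by the text marker `m7G-`, and the poly-A tail length is arbitrary.

```python
def process_pre_mrna(pre_mrna, exons, poly_a_length=5):
    """Splice out introns, then add a cap marker and a poly-A tail.

    exons: list of (start, end) half-open index pairs into pre_mrna.
    """
    # Step 3: keep only the exon segments, in order (introns are dropped).
    spliced = "".join(pre_mrna[start:end] for start, end in sorted(exons))
    # Step 2: add the 5' cap marker and the poly-A tail.
    return "m7G-" + spliced + "A" * poly_a_length


# Toy pre-mRNA: exon 1, one intron, exon 2 (made-up sequences).
exon1, intron, exon2 = "AUGGC", "GUAAGUAG", "CUUAA"
pre = exon1 + intron + exon2
mrna = process_pre_mrna(pre, [(0, 5), (13, 18)])
print(mrna)  # m7G-AUGGCCUUAAAAAAA
```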
20 Common Amino Acids
• There are 20 common amino
acids, characterized by
different R groups.
• These 20 amino acids can be
classified according to their
mass, volume, acidity, polarity,
and hydrophobicity.
Ref. https://www.reagent.co.uk/blog/what-are-amino-acids/
Proteins and Molecular Functions
• Proteins are the major molecules in cells to
carry out a variety of biological functions
in cellular processes, such as gene
transcriptional regulation, RNA splicing,
post-translational modifications, signal
transduction, metabolic pathways, etc.
Post-Translational Modification (PTM)
• Post-translational modification (PTM) is the chemical modification of a protein after its
translation. PTM is an extremely important cellular control mechanism that alters a
protein's physical and chemical properties, folding, conformation, stability, and activity, and
consequently its function.
‐ Addition of functional groups
E.g. acylation, methylation, phosphorylation
‐ Addition of other peptides
E.g. ubiquitination, the covalent linkage to the protein ubiquitin.
‐ Structural changes
E.g. disulfide bridges, the covalent linkage of two cysteine amino acids.
Biotechnological Tools
What are Basic Biotechnological Tools
• Basic biotechnological tools refer to a set of techniques and methods developed for
the manipulation, analysis, and characterization of DNA, RNA, proteins, and other
biomolecules.
• These tools have revolutionized biological sciences by allowing scientists to read,
interpret, and modify the genetic material.
• Key tools include Restriction Enzymes, Sonication, Cloning, PCR (Polymerase Chain
Reaction), Gel Electrophoresis, Hybridization, and Next Generation DNA Sequencing.
Basic Biotechnological Tools
• Cutting and breaking DNA
‐ Restriction Enzymes
‐ Sonication
• Copying DNA
‐ Cloning
‐ Polymerase Chain Reaction (PCR)
• Measuring length of DNA
‐ Gel Electrophoresis
Restriction Enzymes (1/2)
• A restriction enzyme recognizes a specific point in the DNA with a particular pattern,
called a restriction site, and breaks the DNA there.
• This process is called digestion.
• In nature, bacteria use restriction enzymes to break down
foreign DNA and so avoid infection.
• Used to cut DNA at specific sequences, enabling
the study of gene structure and function, and the
Ref. https://www.sciencelearn.org.nz/resources/2035-restriction-enzymes
EcoRI
• EcoRI was one of the first restriction enzymes to be discovered.
• It recognizes the site GAATTC and cuts between G and A, creating sticky ends.
• Note that some restriction enzymes give
rise to blunt ends instead of sticky ends.
Ref. https://www.sciencephoto.com/media/1293243/view/ecori-enzyme-restriction-site-illustration
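Digestion by EcoRI can be simulated on a single strand. This is a rough sketch, assuming the recognition site GAATTC with the cut placed between G and A; it models only the top strand and ignores the overhang on the complementary strand. The function name is illustrative.

```python
ECORI_SITE = "GAATTC"
CUT_OFFSET = 1  # cut falls after the first base: G^AATTC


def digest_ecori(dna):
    """Split `dna` at every EcoRI site and return the fragments in order."""
    fragments = []
    start = 0
    pos = dna.find(ECORI_SITE)
    while pos != -1:
        cut = pos + CUT_OFFSET
        fragments.append(dna[start:cut])
        start = cut
        pos = dna.find(ECORI_SITE, pos + 1)
    fragments.append(dna[start:])  # remainder after the last cut
    return fragments


print(digest_ecori("AAGAATTCGGGAATTCTT"))
# ['AAG', 'AATTCGGG', 'AATTCTT']
```

Each fragment after the first begins with AATTC, which corresponds to the single-stranded sticky end left by the staggered cut.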
Sonication
• Breaks DNA into smaller pieces through
ultrasonic vibration, useful for preparing
DNA for sequencing or library construction.
• Method:
‐ Start with a solution containing a large amount of purified DNA
‐ By applying high vibration, each molecule is broken into smaller fragments
Ref. https://goldbio.com/articles/article/how-to-fragment-DNA-for-NGS
Cloning (plasmid)
Ref. https://www.britannica.com/science/polymerase-chain-reaction
Example Applications of PCR
• The PCR method is used to amplify DNA segments to the point where they can be readily
isolated for use.
• Example applications:
‐ Cloning DNA fragments from mummies
‐ Detection of viral infections
Gel Electrophoresis
• Used by Frederick Sanger in his 1977 DNA sequencing method; the technique itself is older.
• A technique used to separate a mixture of DNA fragments of different lengths.
• We apply an electrical field to the mixture of DNA.
• Note that DNA is negatively charged, so it moves toward the positive end. Due to friction,
small molecules travel faster than large molecules.
• The mixture is separated into bands, each containing DNA molecules of the same
length.
Applications
• Separating DNA sequences from a mixture
‐ For example, after a genome is digested by a restriction enzyme, hundreds or
thousands of DNA fragments are yielded.
‐ By gel electrophoresis, the fragments can be separated.
Sequencing by Gel electrophoresis
• An application of gel electrophoresis is to reconstruct a DNA sequence of length 500-800
bases within a few hours.
• Idea:
‐ Generate all fragments that start at the same position and end with A.
‐ Using gel electrophoresis, the fragments ending with A are separated into different bands
by length. This tells us the positions of the A’s in the sequence.
‐ Repeat similarly for C, G, and T.
Read the Sequence
• We have four groups of fragments: A, C, G, and T.
• All fragments are placed at the negative end.
• The fragments move toward the positive end.
• From the relative distances of the fragments, we can reconstruct the sequence.
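Reading the gel amounts to sorting the bands of all four lanes by fragment length. The sketch below assumes an idealized gel in which each lane is given as a list of fragment lengths and every position appears in exactly one lane; the names `read_gel` and `lanes` are illustrative.

```python
def read_gel(lanes):
    """Reconstruct a sequence from four lanes of fragment lengths.

    lanes maps each base (A, C, G, T) to the lengths of the fragments
    that end with that base.
    """
    base_at_length = {}
    for base, lengths in lanes.items():
        for length in lengths:
            base_at_length[length] = base
    # The shortest fragment travels farthest, so reading bands from the
    # positive end back to the negative end visits lengths in increasing order.
    return "".join(base_at_length[n] for n in sorted(base_at_length))


lanes = {"A": [2, 5, 7], "C": [6], "G": [1], "T": [3, 4]}
print(read_gel(lanes))  # GATTACA
```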
Idea of DNA Array
• An orderly arrangement of thousands of spots.
• Each spot contains many copies of the same DNA
fragment.
• When the array is exposed to the target solution, DNA fragments on the array and in the
target solution pair up according to the hybridization rule:
‐ A=T, C≡G (hydrogen bonds)
Ref. Figure 1.18 in Algorithms in Bioinformatics: A Practical Introduction by Wing-Kin Sung
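The hybridization rule can be illustrated with a toy array. This sketch assumes perfect matching only (real hybridization tolerates some mismatches), treats a probe as bound when the target contains the probe's reverse complement, and uses made-up spot names and sequences.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def hybridizes(probe, target):
    """True if `target` contains a stretch that pairs perfectly with `probe`."""
    pairing_partner = probe.upper().translate(COMPLEMENT)[::-1]
    return pairing_partner in target.upper()


def matching_spots(array, target):
    """Names of the spots whose probe hybridizes to the target fragment."""
    return [name for name, probe in array.items() if hybridizes(probe, target)]


array = {"spot1": "ATGC", "spot2": "GGGG"}  # hypothetical probes
print(matching_spots(array, "TTGCATTT"))    # ['spot1']
```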
Microarray (3/2)
• A microarray may contain thousands of spots and each spot may
contain a few million copies of identical DNA molecules that
uniquely correspond to a gene.
• The DNA in a spot may be either genomic DNA or a short stretch of
oligonucleotide strands that corresponds to a gene. The spots are
printed onto the glass slide by a robot or synthesized by
photolithography.
The Principle of Microarray
• Microarrays can be used to measure
gene expression in many ways, but one
of the most popular applications is to
compare expression of a set of genes
from a cell maintained in a particular
condition (condition A) to the same set
of genes from a reference cell
maintained under normal conditions
(condition B).
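One common way to summarize such a two-condition comparison is a log2 ratio per gene. The sketch below assumes spot intensities have already been measured and normalized; the gene names and intensity values are invented for illustration.

```python
import math


def log2_ratios(intensity_a, intensity_b):
    """log2(A / B) per gene: positive means higher expression in condition A."""
    return {
        gene: math.log2(intensity_a[gene] / intensity_b[gene])
        for gene in intensity_a
    }


# Hypothetical intensities for condition A and the reference condition B.
condition_a = {"gene1": 800.0, "gene2": 100.0, "gene3": 250.0}
condition_b = {"gene1": 200.0, "gene2": 400.0, "gene3": 250.0}

for gene, ratio in log2_ratios(condition_a, condition_b).items():
    print(gene, ratio)
```

A value of 2 corresponds to a four-fold increase relative to the reference, -2 to a four-fold decrease, and 0 to no change.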
Types of DNA Microarrays
• cDNA Microarrays: These involve spotting probes of complementary DNA (cDNA)
on the microarray surface. They are often used for gene expression studies, where the
mRNA levels of different genes are compared between samples.
• Oligonucleotide Microarrays: These contain short DNA oligonucleotide probes
designed to hybridize to specific sequences. They can be used for both gene
expression analysis and for genotyping.
DNA Microarray
Ref. https://www.youtube.com/watch?v=NgRfc6atXQ8&ab_channel=Henrik%27sLab
Next Generation Sequencing
• Next Generation Sequencing (NGS), also known as high-throughput sequencing,
represents a collection of modern sequencing technologies that have revolutionized
genomic research.
• Unlike the Sanger sequencing method, which sequences DNA fragments one at a time,
NGS technologies allow for the simultaneous sequencing of millions or even billions
of DNA fragments.
• This advancement has dramatically reduced the cost and time required for genome
sequencing, making large-scale genomic studies feasible.
Next Generation Sequencing
Ref. https://www.labtestsguide.com/next-generation-sequencing-ngs
Major NGS Platforms
Applications of NGS
Ref. https://www.youtube.com/watch?v=WKAUtJQ69n8&ab_channel=ClevaLab
Applications of NGS
• Whole Genome Sequencing: Comprehensive analysis of entire genomes, aiding in the identification
of genetic variants, mutations, and structural variations associated with diseases.
• Transcriptome Analysis (RNA-seq): Examines the complete set of RNA transcripts at a given
moment, providing insights into gene expression levels and novel gene discovery.
• Epigenomics: Studies modifications on the genetic material of a cell that do not change the DNA
sequence but influence gene expression, such as DNA methylation and histone modification.
• Metagenomics: Analyzes genetic material recovered directly from environmental samples, enabling
the study of microbial communities and their roles in health and disease without the need for culturing.
• Personalized Medicine: Facilitates the development of personalized treatment strategies based on the
genetic makeup of an individual's disease, particularly in oncology.
Brief History of Bioinformatics
Ref. https://link.springer.com/chapter/10.1007/978-3-031-22206-1_4
Ref. https://en.wikipedia.org/wiki/Computational_biology
Biological Databases
Biological Databases
• Biological databases are structured collections of biological information, curated to
support the retrieval, analysis, and understanding of data across various fields of
biological research.
• The rapid advancement in biotechnology and high-throughput sequencing
technologies has led to an explosion of biological data, necessitating the development
of databases to manage this information efficiently.
Significance of Biological Databases
• Data Management: Biological databases provide a systematic way to store, organize, and manage vast
amounts of data generated by research activities worldwide. They ensure data preservation and
accessibility for future research.
• Research Acceleration: By offering immediate access to pre-analyzed and structured data, databases
significantly reduce the time and resources needed for preliminary analyses, accelerating the pace of
scientific discovery.
• Data Integration: Databases facilitate the integration of diverse types of data (genomic, proteomic,
phenotypic, etc.), enabling holistic views of biological questions and fostering interdisciplinary
research.
• Enhanced Reproducibility: They promote transparency and reproducibility in science by providing
access to raw data and metadata, allowing researchers to validate and build upon previous findings.
Overview of the Types
• Sequence Databases: Store DNA, RNA, and protein sequences. Examples include GenBank, EMBL,
DDBJ (nucleotide sequences), and UniProt (protein sequences).
• Structure Databases: Contain information on the three-dimensional structures of biological
macromolecules. Examples are Protein Data Bank (PDB) and CATH.
• Functional and Annotation Databases: Provide annotations for genes and proteins, including
functional descriptions, pathway information, and disease associations. Examples include Gene
Ontology (GO) and KEGG.
• Expression Databases: Hold data from gene expression studies, such as microarray or RNA-Seq
datasets. GEO and ArrayExpress are prime examples.
• Literature Databases: Offer access to scientific publications and curated summaries of research
articles. PubMed is the most widely used literature database.
Sequence Databases
• Sequence Databases are perhaps the most foundational types of biological databases,
including repositories of nucleotide sequences (DNA, RNA) and protein sequences.
• They are crucial for a wide range of analyses, from basic sequence alignment to complex
evolutionary studies.
‐ Nucleotide Sequence Databases: GenBank, EMBL, and DDBJ form the triad of primary nucleotide
sequence databases. These databases are instrumental for gene identification, comparative genomics, and
phylogenetic analyses.
‐ Protein Sequence Databases: UniProt is a central repository for protein sequences and annotations. It
combines information from Swiss-Prot, TrEMBL, and the PIR-PSD database, providing comprehensive
data on protein function, classification, and sequence. Protein databases are vital for understanding
protein function, structure, and evolution.
Nucleotide Databases (1/3)
Nucleotide Databases (2/3)
Ref. Fig. 2.1 in Bioinformatics and Computational Biology: A Primer for Biologists by Basant K. Tiwary
Nucleotide Databases (3/3)
Ref. Fig. 2.2 in Bioinformatics and Computational Biology: A Primer for Biologists by Basant K. Tiwary
GenBank
Ref. Fig. 2.3 in Bioinformatics and Computational Biology: A Primer for Biologists by Basant K. Tiwary
• GenBank is searched online using Entrez, whereas both the EMBL and DDBJ databases are
searched through SRS servers.
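Entrez can also be queried programmatically through the NCBI E-utilities web service. The sketch below only builds an ESearch request URL and does not contact the server; the search term is a made-up example, and the choice of the `nuccore` database, `retmax`, and `retmode=json` are assumptions about what a typical query might use.

```python
from urllib.parse import urlencode

# Base address of the ESearch utility in NCBI E-utilities.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, db="nuccore", retmax=5):
    """Build (but do not send) an Entrez ESearch request URL."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return EUTILS_ESEARCH + "?" + urlencode(params)


# Hypothetical query: nucleotide records tagged with the gene name BRCA1.
print(build_esearch_url("BRCA1[Gene]"))
```

Opening the printed URL in a browser, or fetching it with any HTTP client, would return the matching record identifiers, subject to NCBI's usage guidelines.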
Ensembl Genome Database
Ref. Fig. 2.4 in Bioinformatics and Computational Biology: A Primer for Biologists by Basant K. Tiwary
• Ensembl is an integrated platform for genome annotation and distribution of genomic data
with comprehensive annotation of genomic variants, transcript structures and regulatory
regions.
• It provides a valuable resource for evolutionary studies using large-scale comparative
genomics data of 227 vertebrate and model species.
UCSC Genome Browser
Ref. Fig. 2.5 in Bioinformatics and Computational Biology: A Primer for Biologists by Basant K. Tiwary
• The University of California Santa Cruz (UCSC) Genome Browser currently provides a
web-based view of 211 genome assemblies of more than a hundred species.
Protein Databases
UniProt
• The Universal Protein Resource (UniProt) provides comprehensive access to high-quality protein
sequences. The UniProt Knowledgebase (UniProtKB) is the primary source of universal protein
sequence information, containing more than 189 million sequences obtained from
experimental sequencing as well as translated ORF sequences from EMBL.
• This database offers well-annotated protein sequences and a preliminary
assignment of motifs present in the sequences.
• It is also cross-referenced with other useful databases. The UniProt database consists of two
divisions: Swiss-Prot and TrEMBL. TrEMBL is an automated database requiring minimal
human intervention, whereas Swiss-Prot is a well-curated database with manual entry of
useful information from the available literature.
Protein Data Bank (PDB)
• The Protein Data Bank (PDB) is maintained by an international consortium consisting of four
partners, namely the Research Collaboratory for Structural Bioinformatics Protein Data
Bank (RCSB-PDB), Protein Data Bank in Europe (PDBe), Protein Data Bank Japan
(PDBj) and the Biological Magnetic Resonance Data Bank (BMRB).
• It is a global archive for structures of proteins and other macromolecules determined
using X-ray crystallography and NMR. This global database was launched in 1971
and contains more than 150,000 macromolecular structures, adding about 12,000 new
structures each year.
Expression Databases (1/2)
• These databases store data derived from experiments that measure the expression
levels of genes across different conditions, tissues, or developmental stages.
• Gene Expression Omnibus (GEO) and ArrayExpress are leading examples, facilitating
the exploration of gene expression patterns and the discovery of genes associated with
diseases or specific biological processes.
Expression Databases (2/2)
Pathway Databases (1/2)
• Pathway databases, such as KEGG and Reactome, provide curated information on
biochemical pathways, including metabolic pathways, signaling pathways, and gene
regulatory networks.
• These resources are indispensable for systems biology, allowing researchers to
understand complex biological interactions and the effects of alterations within these
pathways.
Pathway Databases (2/2)
Disease Databases (1/2)
• Disease databases focus on linking genetic variations to specific diseases.
• Online Mendelian Inheritance in Man (OMIM) and the Genome-Wide Association
Study (GWAS) Catalog are examples of databases that aggregate information on
genetic predispositions to diseases, supporting research into the genetic bases of
diseases and the development of personalized medicine.
Disease Databases (2/2)
Organism-Specific and Virus Databases (1/2)
• These databases are tailored to specific organisms or groups of organisms, providing
comprehensive resources for species-specific research. Examples include WormBase
for nematodes, FlyBase for Drosophila species, and specific databases for model
organisms like Mouse Genome Informatics (MGI).
• During the COVID-19 pandemic, databases focusing on the SARS-CoV-2 virus, such
as the COVID-19 Data Portal, have become critical for tracking virus mutations and
understanding its biology.
Organism-Specific and Virus Databases (2/2)
Sequence Analysis Tools
BLAST (1/2)
• The Basic Local Alignment Search Tool (BLAST) is the most widely used search tool for
sequence databases. It finds regions of local similarity (conserved sequence patterns)
between a query DNA or protein sequence and the sequences in a target database.
• The ultimate aim of a BLAST search is to infer an evolutionary and functional
relationship between two DNA or protein sequences.
• There are three important aspects of a search process: the input query sequence, the target
database, and the choice of an appropriate BLAST program.
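To make the idea of local similarity concrete, the sketch below scores the best local alignment between two sequences with Smith-Waterman dynamic programming. This is not how BLAST works internally (BLAST uses a faster heuristic); it is a swapped-in exact method shown only for illustration, and the scoring values (match 2, mismatch -1, gap -2) are arbitrary assumptions.

```python
def local_alignment_score(query, target, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score between two sequences."""
    best = 0
    prev = [0] * (len(target) + 1)  # previous row of the scoring matrix
    for i in range(1, len(query) + 1):
        curr = [0] * (len(target) + 1)
        for j in range(1, len(target) + 1):
            pair = match if query[i - 1] == target[j - 1] else mismatch
            curr[j] = max(
                0,                  # start a new local alignment here
                prev[j - 1] + pair, # align query[i-1] with target[j-1]
                prev[j] + gap,      # gap in the target
                curr[j - 1] + gap,  # gap in the query
            )
            best = max(best, curr[j])
        prev = curr
    return best


# Toy "database" of made-up sequences; the best hit has the highest score.
database = {"seqA": "TTACGTTT", "seqB": "CCCCCCCC"}
for name, sequence in database.items():
    print(name, local_alignment_score("ACGT", sequence))
```

Here `seqA` contains the query exactly and scores 8 (four matches), while `seqB` shares nothing with the query and scores 0.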
BLAST (2/2)
Q&A
Thank you!