W2 DNA Sequencing - The Human Genome Project - Stu


SIO1003

DNA Sequencing & The Human Genome Project

Semester 1 Session 2023/2024

SIO1003 | DNA Sequencing & The Human Genome Project


Components of the human genome
• Nuclear and mitochondrial (organelle).
• Each of the approximately 10^13 cells in the
adult human body has its own copy or copies
of the genome (exception: red blood cells,
which lack a nucleus in their fully differentiated
state).
• Majority of cells are diploid and so have two
copies of each autosome, plus two sex
chromosomes, XX for females or XY for males
(46 chromosomes). These are called somatic
cells.
• Sex cells or gametes are haploid and
have just 23 chromosomes, comprising one
of each autosome and one sex chromosome.
• Both types of cell have about 8000 copies of
the mitochondrial genome, 10 or so in each
mitochondrion.



DNA – a complete genetic blueprint of life

DNA is a hereditary molecule that contains the complete
instructions an organism needs to develop, grow, function
and reproduce.
DNA is made up of nucleotides, each consisting of a sugar,
a phosphate and a nitrogenous base.
SQI7019 | OMICS TECHNOLOGIES
NUCLEIC ACIDS
Type of nucleic acid   Pentose sugar   Base                                                  Phosphate
DNA                    Deoxyribose     Adenine (A), Guanine (G), Cytosine (C), Thymine (T)   Phosphate
RNA                    Ribose          Adenine (A), Guanine (G), Cytosine (C), Uracil (U)    Phosphate



DNA sequencing



What is the Human Genome Project?
• Determine the complete sequences of the nucleotides of the
haploid human genome.
• Identify all the genes in human DNA.
• Store this information in databases.
• Develop tools for data analysis.
• Address the ethical, legal, and social issues.
• Study and compare the genomes of lab organisms e.g. fruit fly,
mouse, nematode worm, yeast.
• DNA sequence → genes → gene activities
Principles & Goals of The Human Genome Project
Principles:
• Collaborative effort
• High-throughput sequencing technology
• Data sharing

International Human Genome Sequencing Consortium (IHGSC)

• ~200 US-based labs funded by the NIH or the US Department of Energy.
• >18 countries

Goals:
• To create a complete map of the human genome
• To improve our understanding of human biology and disease
• To develop new technologies and methods for genome analysis
• To address ethical, legal, and social issues
Human Genome Project timeline
[Figure: timeline of the Human Genome Project, 1990 to 2000. Milestones shown: NRC recommends HGP; HGP begins; goal for genetic map exceeded; human physical map covers 98% of genome; U.S. Human Gene Map (16,000 genes); Human Gene Map (30,181 genes); pilot human sequencing begins; full-scale human sequencing begins; human draft. Model-organism genomes shown: yeast, E. coli, C. elegans, Drosophila.]


• Early 1980s: mitochondrial genome completed.
• 1984 to 1986: first proposed at US Department of Energy meetings.
• 1988: Endorsed by US National Research Council (Funded by National Institutes of Health and US Department of Energy).
• 1990: Human Genome Project started formally.
• 2001: First draft published, covering 83-84% of the sequence; the remaining 16-17% (including telomeres & centromeres) was still missing.
• 2003: Finished Human Genome sequence published in Nature (up to 85% known at the time; about 15% still consisted of unknown gaps).



Sequencing phase
• 26 June 2000: US President Bill Clinton, Francis Collins (International Human Genome Sequencing Consortium) and J. Craig Venter (Celera Genomics) announce the completion of the human genome working draft (IHGSC, 2001; Venter et al., 2001).
• Clone contig approach (clone-based sequencing, hierarchical shotgun sequencing):
• ~90% of the genome.
• 329 million bp → constitutive heterochromatin (telomeres, centromeres):
• DNA is very tightly packed.
• Few genes, if any.
• Each part has been sequenced at least 4 times; 20% → 8-10 times.
• ~50,000 scaffolds, average size 54.2 kb.
• Whole-genome shotgun sequencing → similar statistics.



IHGSC
Prof. Francis Collins

Phase 1: Physical Mapping
Phase 2: Shotgun Sequencing
Phase 3: Finishing and Annotation


Celera Genomics
Dr. Craig Venter

1st data:
• 27 million DNA seq reads (mean 543 bp)
• 5 individuals

2nd data:
• From IHGSC
• 16 million DNA seq reads
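
The read counts above translate into sequencing depth (fold coverage) with simple arithmetic: total bases sequenced divided by genome length. The sketch below is illustrative only; it assumes the 3.3 Gb nuclear genome size quoted elsewhere in these slides, and the function name is hypothetical.

```python
def fold_coverage(n_reads: int, mean_read_bp: float, genome_bp: float) -> float:
    """Average depth: total sequenced bases divided by genome length."""
    return n_reads * mean_read_bp / genome_bp


celera_reads = 27_000_000   # 1st data set on the slide
mean_read_bp = 543          # mean read length on the slide
genome_bp = 3.3e9           # assumption: nuclear genome size used in these slides

total_bases = celera_reads * mean_read_bp
coverage = fold_coverage(celera_reads, mean_read_bp, genome_bp)

print(f"total bases sequenced: {total_bases:,}")
print(f"approximate coverage: {coverage:.2f}x")
```

A different assumed genome size changes the result proportionally, so the figure is a rough guide rather than the published coverage value.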



(Sanger Sequencing)
How to sequence DNA.
A) DNA polymerase binds to a single-
stranded DNA template (blue) and
synthesizes a complementary strand of
DNA (red). B) When DNA polymerase
randomly incorporates a fluorescently
labeled ddNTP base, synthesis terminates.
This step produces a mixture of newly
synthesized DNA strands that differ in
length by a single nucleotide. Each strand
is labeled at the 3′ end with a fluorescently
labeled ddNTP base. C) The DNA mixture
is separated by electrophoresis. D) The
electropherogram results show peaks
representing the color and signal intensity
of each DNA band. From these data, the
sequence of the newly synthesized DNA
strand is determined, as shown above the
peaks.
© 2003 Macmillan Publishers, Ltd. Dennis, C. & Gallagher, R. (eds) The Human
Genome (Palgrave, Basingstoke, 2001).



Chain-Termination Method - Sanger

Requirements in a reaction tube:
• a single-stranded DNA template (obtained by heat denaturation or formamide treatment)
• DNA primer (initiates DNA synthesis)
• DNA polymerase
• Radioactively/fluorescently labelled nucleotides (dNTPs)
• One ddNTP per tube



Smaller strands migrate
to the bottom, while
larger strands stay up top.
We can read each
molecule in order to find
the DNA sequence.

Key Features
•Uses ddNTPs to terminate DNA synthesis
•DNA synthesis reactions in four separate tubes
•Radioactive dATP is also included in all the tubes so the
DNA products will be radioactive.
•Yielding a series of DNA fragments whose sizes can be
measured by electrophoresis.
•The last base in each of these fragments is known.
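
The four-tube logic can be sketched as a toy simulation: each ddNTP tube yields fragments ending wherever that base occurs, and reading all bands from shortest to longest recovers the sequence. This is a simplified illustration, not a model of real chemistry; the function names and the example strand are made up.

```python
def chain_termination_fragments(new_strand: str) -> dict[str, list[int]]:
    """For each ddNTP tube, list the fragment lengths that end in that base."""
    tubes = {base: [] for base in "ACGT"}
    for length, base in enumerate(new_strand, start=1):
        tubes[base].append(length)
    return tubes


def read_gel(tubes: dict[str, list[int]]) -> str:
    """Read bands from the shortest to the longest fragment."""
    bands = [(length, base) for base, lengths in tubes.items() for length in lengths]
    return "".join(base for _, base in sorted(bands))


new_strand = "GATTACA"  # hypothetical newly synthesized strand
tubes = chain_termination_fragments(new_strand)

print(tubes)
print(read_gel(tubes))
```

Because every fragment length is unique and its last base is known from the tube it came from, sorting by length is enough to read the strand.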

The fluorescently labeled DNA sequences are run through capillary electrophoresis and their order is resolved by color.
DNA fragments contain about 700 nucleotides.
Automated DNA sequencing:
Dye termination sequencing

• Automated
• No radioactive labelling
• Each ddNTP is fluorescently labelled
• Excited by laser, 4 different fluorescence detected
• Fluorescence intensity translated into “peak”
• All 4 chain terminators can be in the same tube = run
one lane on a gel



de novo DNA Sequencing:
Shotgun Sequencing
• de novo seq = sequencing a novel genome where no reference
sequence is available for alignment
• Shotgun seq = sequencing method that can assemble an entire
genome that has not been sequenced before
• Used to analyse sequences from longer than 1 kb up to entire chromosomes

Procedure
• gDNA is fragmented by sonication/hydrodynamic shearing.
• All sticky-end fragments are blunt-ended using T4 DNA polymerase and
its exonuclease activity.
• T4 polynucleotide kinase is added: 5' ends are phosphorylated.
• Fragments are separated by size.
• A library is created for each size class in plasmids and transformed into E. coli cells.
• Vector DNA is purified (and amplified) from each library.
• Each DNA strand is sequenced (a primer can be attached upstream in the vector, then
any sequencing-by-synthesis method can be used).
• A computer program called a base caller filters out poor calls (bioinformatics).
• The assembler finds overlapping segments and generates contigs (long
contiguous stretches of nucleotides).
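
The assembly step can be illustrated with a minimal greedy overlap sketch: repeatedly merge the two reads whose suffix and prefix overlap the most. This is a teaching toy, assuming error-free reads from one strand; real assemblers handle errors, repeats and both strands, and the function names and reads here are hypothetical.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b."""
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_size, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


reads = ["ATGGCGTAC", "GTACGTTAG", "TTAGCCATA"]  # toy reads
print(greedy_assemble(reads))
```

Reads that share no sufficiently long overlap stay as separate contigs, which is why gaps remain in a draft assembly.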



Who produced human genome sequence
information?
1. Major players in the Human Genome Project:
i. Craig Venter (Industry)
ii. William Jefferson Clinton (president of the United States)
iii. Francis Collins (NIH)
2. Publicly funded “International Human Genome Sequencing
Consortium”.
• IHGSC, Nature, volume 409, issue number 6822, 15 February
2001.
3. Privately funded company “Celera” (Rockville, MD, USA).
• Venter et al., Science, volume 291, issue number 5507, 16
February 2001.


Sequencing Strategies
Public effort strategy (clone-contig), top-down:
• Hierarchical shotgun.
• Clone-by-clone, anchored to map.
• Map first, sequence later.

Celera strategy, bottom-up:
• Shotgun with computational assembly.
• Sequence first, map (assemble) later.

Celera's view of the International Consortium (IC):
Unfair competition: IC delivering the same goods but with state funding.

International Consortium's view of Celera:
Unfair competition: Celera delivering the same goods but can use IC data, while IC cannot use Celera data.



Next Gen Sequencing Technologies
• Powerful platform enabling simultaneous sequencing
• High-throughput sequencing
• A product of the high demand for low-cost seq.
• Allows us to sequence DNA/RNA more quickly and more affordably than Sanger Sequencing
• Revolutionized genomics & molecular biology
• Puts bioinformatics into the hot seat
• Overcomes limitations of conventional seq methods

1st Generation (2000)
•Sanger Sequencing
•Maxam-Gilbert Sequencing

2nd Generation (2006-2010)
•Pyrosequencing
•Sequencing by Reversible Terminator Chemistry
•Sequencing by Ligation

3rd Generation (2010-2015)
•Single Molecule Fluorescent Sequencing
•Single Molecule Real Time Sequencing
•Semiconductor Sequencing
•Nanopore Sequencing

4th Generation
Aims at conducting genomic analysis directly in the cell.

Human Genome
Properties of chromosome bands
Dark bands (G bands)               Pale bands (R bands)
AT-rich                            GC-rich
DNase insensitive                  DNase sensitive
Condense early during cell cycle   Condense late during cell cycle
Replicate late                     Replicate early
Gene poor                          Gene rich
Large genes                        Small genes
Tissue-specific                    House-keeping
Few CpG islands                    Many CpG islands
LINE rich                          SINE, Alu repeats rich



DNA content of human chromosomes
Chromosome   DNA (Mb)   Chromosome   DNA (Mb)
1            263        13           114
2            255        14           109
3            214        15           106
4            203        16           98
5            194        17           92
6            183        18           85
7            171        19           67
8            155        20           72
9            145        21           50
10           144        22           56
11           144        X            164
12           143        Y            59
Average 130 (6 Mb per average band)
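
As a quick check, the per-chromosome values in this table can be summed to see how they relate to the overall nuclear genome size. The snippet below only re-uses the table's own numbers; the variable names are illustrative.

```python
# DNA content per chromosome in Mb, copied from the table above.
dna_mb = {
    "1": 263, "2": 255, "3": 214, "4": 203, "5": 194, "6": 183,
    "7": 171, "8": 155, "9": 145, "10": 144, "11": 144, "12": 143,
    "13": 114, "14": 109, "15": 106, "16": 98, "17": 92, "18": 85,
    "19": 67, "20": 72, "21": 50, "22": 56, "X": 164, "Y": 59,
}

total_mb = sum(dna_mb.values())
largest = max(dna_mb, key=dna_mb.get)
smallest = min(dna_mb, key=dna_mb.get)

print(f"total: {total_mb} Mb (about {total_mb / 1000:.1f} Gb)")
print(f"largest: chromosome {largest}, smallest: chromosome {smallest}")
```

The total is in line with the roughly 3.3 Gb quoted for nuclear DNA in this lecture.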



What are important features of the human genome?
• Genome organization:
• nuclear DNA (3.3 Gb):
• 25% genes and related DNA:
• 1-2% actual genes (exons or coding
regions).
• 23-24% non-coding sequences (introns,
UTRs).
• 75% non-genes or intergenic regions:
• 40% (or more) highly repetitive sequences.
• 35% (or less) fairly unique or moderately
repetitive sequences.
• mitochondrial DNA (17 kb)
Human genome structure



Proteins encoded by genes in human genome



What are important features of the
human genome?
• Human Gene Content: Surprisingly Few Genes.
• Gene density varies widely between chromosomes:
• 23 genes/Mb (chromosome 19) versus 5 genes/Mb (chromosome 13 and
Y).
• Most of the genome consists of non-coding and “non-gene related”
sequences:
• lots of repetitive sequences.
• lots of duplications and rearrangements.
• Mutation rate seems to be higher in males than in females. Why?
• Lots of single nucleotide polymorphisms (SNPs): ~1-2 million:
• Random pairs of DNA sequences from two individuals differ by 1 bp per 1,250 bp
(on average), indicating very strong and pronounced genetic
heterogeneity of the human population, i.e. there are lots of different alleles.
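
A short worked calculation shows what a difference of 1 bp per 1,250 bp means at genome scale. It assumes the 3.3 Gb nuclear genome size used elsewhere in these slides and treats differences as evenly spread, which is a simplification.

```python
GENOME_BP = 3.3e9          # assumption: nuclear genome size from these slides
BP_PER_DIFFERENCE = 1250   # average spacing between differences, from the slide

expected_differences = GENOME_BP / BP_PER_DIFFERENCE
percent_identical = (1 - 1 / BP_PER_DIFFERENCE) * 100

print(f"expected single-base differences: {expected_differences:,.0f}")
print(f"sequence identity between two individuals: {percent_identical:.2f}%")
```

So two people are expected to differ at a few million positions while still sharing more than 99.9% of their sequence.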
Gene density in human chromosome
the HLA region & the dystrophin gene(DMD) region



Variation in human gene structure
Gene product                Size of gene (kb)   Number of exons   Average size of exon (bp)   Average size of intron (kb)
tRNA-tyr                    0.1                 2                 50                          0.02
Insulin                     1.4                 3                 155                         0.48
β-globin                    1.6                 3                 150                         0.49
Class I HLA                 3.5                 8                 187                         0.26
Serum albumin               18                  14                137                         1.1
Type VII collagen           31                  114               77                          0.19
Complement C3               41                  29                122                         0.9
Phenylalanine hydroxylase   90                  26                96                          3.5
Factor VIII                 186                 26                375                         7.1
CFTR (cystic fibrosis)      250                 27                227                         9.1
Dystrophin                  2400                79                180                         30.0
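
The columns of this table are linked: a gene with n exons has n - 1 introns, so gene size is roughly (exons x average exon size) + (introns x average intron size). The sketch below applies that estimate to four rows, using only the exon and intron columns; it ignores flanking regions and rounding, and the function name is illustrative.

```python
def estimated_gene_size_kb(n_exons: int, exon_bp: float, intron_kb: float) -> float:
    """Exons plus the (n_exons - 1) introns that separate them, in kb."""
    return n_exons * exon_bp / 1000 + (n_exons - 1) * intron_kb


# (number of exons, average exon size in bp, average intron size in kb)
genes = {
    "Insulin": (3, 155, 0.48),
    "Phenylalanine hydroxylase": (26, 96, 3.5),
    "Factor VIII": (26, 375, 7.1),
    "Dystrophin": (79, 180, 30.0),
}

estimates = {name: estimated_gene_size_kb(*row) for name, row in genes.items()}
for name, size_kb in estimates.items():
    print(f"{name}: ~{size_kb:,.1f} kb")
```

The estimates show that introns, not exons, account for almost all of the length of the larger genes.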



What are important features of the human genome?
• Number of genes is fairly low (compared to organisms that
are much less complex):
• humans: ~25,000 (37 are mitochondrial).
• worms (nematode/Caenorhabditis elegans): 19-20,000.
• flies (fruit fly/Drosophila melanogaster): 13-14,000.
• fungus (baker’s yeast/Saccharomyces cerevisiae):
~6,000.
• plant (thale cress/Arabidopsis thaliana): ~25,000.



Eukaryotic Genomes: How Many Genes Are Protein-Coding
• In humans:
• Estimated 20k-25k protein-coding genes
• Predicted 30k – 100k genes (pre-HGP)
• T. vaginalis (single-celled parasite) has ~60k
genes
• A. thaliana (mustard plant) has ~25.5k genes
– albeit having one of the smallest plant
genomes
• Why do humans, a more complex organism,
have fewer protein-coding genes?

Genome Size and Number of Protein-Coding Genes for a Select Handful of Species. Table
adapted from Van Straalen & Roelofs, 2006



The functions of human genes
• Half of the ~30,000 genes → known
function
• Most code for proteins
• <2500 non-coding rRNA
• 23.2% → expression, replication and
maintenance of genome
• 20% → signal transduction
• 17.5% → biochemical function
• 38.2% → molecular trafficking &
transportation, folding structure,
immune response, structural proteins



Eukaryotic Genomes: How Many Genes Are Protein-Coding
Does DNA encode for proteins only?
• In mouse (FANTOM Consortium et al., 2005):
• >60% of the 2.5 billion bp genome is transcribed → mostly makes RNAs involved in splicing and gene regulation
• only <2% is translated into functional proteins

Technological capacity has provided the platform for counting genes

• Gene-prediction programs: estimate the number of protein-coding genes in a genome, e.g. by sequence alignments
• Gene-location programs: identify genes' genomic loci using gene sequence characteristics, e.g. ORFs within
exons, CpG islands within promoter regions
*only predict gene presence in silico; must be validated/confirmed in vivo/in vitro in the lab, e.g. by microarray hybridisation

• Gene prediction tools have become more accurate over the years: improved precision brought a possible
45k genes down to ~26k protein-coding genes (Human Genome Project) (Venter et al., 2001), then down to ~20k
in 2008.
*computational tools back then had high false positive rates
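
One sequence characteristic used by gene-location programs is the open reading frame (ORF). The sketch below is a deliberately naive ORF scan of the three forward reading frames on a made-up sequence; it is not how any particular prediction program works, and real tools also handle introns, the reverse strand and statistical scoring.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_codons: int = 3) -> list[tuple[int, int, str]]:
    """Return (start, end, orf) for ATG...stop runs in the three forward frames.

    min_codons counts every codon in the ORF, including start and stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, seq[start:i + 3]))
                start = None  # look for the next ORF in this frame
    return sorted(orfs)


toy_sequence = "CCATGAAATTTGGGTAACCATGCCCTGA"  # hypothetical example
for start, end, orf in find_orfs(toy_sequence):
    print(start, end, orf)
```

Short ORFs like these arise by chance very often, which is one reason early in silico gene counts had high false positive rates.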



The human mitochondrial genome
• Sequenced in 1981 (Anderson et al.)
• 16,569 bp
• 37 genes, 13 of which encode respiratory complex
proteins (mitochondrial biochemical function)
• No introns
• The human mitochondrial genome is small
and compact, with little wasted space
• Overlapping genes: ATP6 and ATP8.
• Abbreviations: ATP6, ATP8, genes for ATPase
subunits 6 and 8; COI, COII, COIII, genes for
cytochrome c oxidase subunits I, II and III;
Cytb, gene for apocytochrome b; ND1-ND6,
genes for NADH dehydrogenase subunits 1-6.
• Ribosomal RNA and transfer RNA are two
types of non-coding RNA



The Human Nuclear & Mitochondrial Genomes
                    Nuclear genome                                     Mitochondrial genome
Size                3300 Mb                                            16.6 kb
No. of genes        30000-40000                                        37
Gene density        1/70 kb                                            1/0.45 kb
Repetitive DNA      Large fraction                                     Very little
Transcription       The great bulk of genes transcribed individually   Continuous transcription of multiple genes
Introns             Found in most genes                                Absent
% of coding DNA     ~2%                                                ~93%
Recombination       At least once at meiosis                           Not evident
Inheritance         Mendelian                                          Exclusively maternal



A segment of the human genome

• 50 kb segment of the human β T-cell receptor locus on chromosome 7
• 1 gene
• 2 gene segments
• 52 genome-wide repeat sequences
• 2 microsatellites
• GA, 16 times
• TATT, 6 times
• 50% → stretches of non-genic, non-
repetitive, single copy DNA of no
known function
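
Microsatellites such as (GA)16 and (TATT)6 are short units repeated in tandem, so they can be spotted with a simple pattern search. The sketch below uses a regular expression with a backreference on a made-up sequence that contains the two repeats named above; the thresholds and function name are arbitrary choices for illustration.

```python
import re


def find_microsatellites(seq: str, min_repeats: int = 5, max_unit: int = 4):
    """Return (start, unit, copies) for tandem repeats of 2..max_unit bp units."""
    # Group 1 is the repeat unit; \1 requires further back-to-back copies of it.
    pattern = re.compile(r"([ACGT]{2,%d}?)\1{%d,}" % (max_unit, min_repeats - 1))
    hits = []
    for match in pattern.finditer(seq):
        unit = match.group(1)
        hits.append((match.start(), unit, len(match.group(0)) // len(unit)))
    return hits


# Hypothetical sequence: flanking bases around a (GA)16 and a (TATT)6 repeat.
toy_sequence = "CCTT" + "GA" * 16 + "CCGTCA" + "TATT" * 6 + "GGC"
print(find_microsatellites(toy_sequence))
```

The copy number at such loci often differs between individuals, which is what makes microsatellites useful as genetic markers.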



Future Challenges: What We Still Don’t Know

• Gene number, exact locations, and functions
• Gene regulation
• DNA sequence organization
• Chromosomal structure and organization
• Noncoding DNA types, amount, distribution, information content, and functions
• Coordination of gene expression, protein synthesis, and post-translational events
• Interaction of proteins in complex molecular machines
• Predicted vs experimentally determined gene function
• Evolutionary conservation among organisms



Future Challenges: What We Still Don’t Know
• Protein conservation (structure and function)
• Proteomes (total protein content and function) in organisms
• Correlation of SNPs (single-base DNA variations among individuals) with
health and disease
• Disease-susceptibility prediction based on gene sequence variation
• Genes involved in complex traits and multigene diseases
• Complex systems biology including microbial consortia useful for
environmental restoration
• Developmental genetics, genomics
