CGE Course Johanne

Phylogeny
- based on whole genome data
Johanne Ahrenfeldt
Research Assistant
Overview
• What is Phylogeny and what can it be used for

• Single Nucleotide Polymorphism (SNP) methods
- snpTree and CSI Phylogeny
• Nucleotide Differences
- NDtree
• Controlled Evolution study
• What services for which data
What are phylogenetic trees
• Trees are traditionally made using aligned sequences
of single genes or proteins
• Whole genome data may be used to create trees based on
– SNP calling
– K-mer overlap
What is a SNP
• A Single Nucleotide Polymorphism (SNP) is a DNA
sequence variation occurring commonly within a
population (e.g. 1%) in which a single nucleotide
(A, T, C or G) in the genome (or other shared
sequence) differs between members of a biological
species or paired chromosomes.
What is phylogeny used for
• Classify taxonomy – the classic use

• Outbreak detection – increasingly common
with WGS data
How does it work
Strain A ATTCAGTA
Strain B ATGCAGTC
Strain C ATGCAATC
Strain D ATTCAGTC
Construct distance matrix
    A  B  C  D
A   0  2  3  1
B   2  0  1  1
C   3  1  0  2
D   1  1  2  0
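The matrix above can be reproduced by counting mismatching positions between each pair of aligned sequences. The following is a minimal sketch, not the implementation used by any of the services; the function names `hamming` and `distance_matrix` are hypothetical and chosen for illustration.

```python
from itertools import combinations

# Toy sequences copied from the slide.
strains = {
    "A": "ATTCAGTA",
    "B": "ATGCAGTC",
    "C": "ATGCAATC",
    "D": "ATTCAGTC",
}

def hamming(seq1, seq2):
    """Count positions where two equal-length sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(seq1, seq2))

def distance_matrix(seqs):
    """Return a nested dict of pairwise mismatch counts."""
    names = list(seqs)
    matrix = {n: {m: 0 for m in names} for n in names}
    for n, m in combinations(names, 2):
        d = hamming(seqs[n], seqs[m])
        matrix[n][m] = matrix[m][n] = d
    return matrix

matrix = distance_matrix(strains)
for name in strains:
    # One row of the matrix per strain, in the order A, B, C, D.
    print(name, *(matrix[name][other] for other in strains))
```

Each printed row should match the corresponding row of the slide's matrix.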
Make Tree
[Figure: tree of strains A, B, C and D built from the distance matrix above]
snpTree
• First online webserver for constructing phylogenetic
trees based on whole genome sequencing
snpTree--a web-server to identify and construct SNP trees from whole genome sequence data. Leekitcharoenphon P,
Kaas RS, Thomsen MC, Friis C, Rasmussen S, Aarestrup FM. BMC Genomics. 2012;13 Suppl 7:S6.
snpTree flow
snpTree output
CSI Phylogeny
• Finds SNPs in the same manner as snpTree

• More strict sorting of SNPs
• Requires all SNPs to be significant
– Z-score higher than 1.96 for all SNPs
$$Z = \frac{X - Y}{\sqrt{X + Y}}$$
• X is the number of reads with the most common
nucleotide at that position, and Y is the number of reads
with any other nucleotide.
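As a worked example, assume the Z-score is computed as (X - Y) divided by the square root of (X + Y). The square root is an assumption here: without it the ratio could never exceed 1, so a threshold of 1.96 would be unreachable. The read counts below are hypothetical. With 18 reads supporting the most common nucleotide and 2 reads supporting others, the position passes:

$$Z = \frac{18 - 2}{\sqrt{18 + 2}} = \frac{16}{\sqrt{20}} \approx 3.58 > 1.96$$

With 6 reads against 4, it does not:

$$Z = \frac{6 - 4}{\sqrt{6 + 4}} = \frac{2}{\sqrt{10}} \approx 0.63 < 1.96$$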
NDtree
• A different approach: the main distinction is not
whether a SNP should be called, but whether there is
solid evidence for which nucleotide should be called
at each position.
NDtree
• When all reads had been mapped, the significance of the
base call at each position was evaluated by calculating the
number of reads X having the most common nucleotide at
that position, and the number of reads Y supporting other
nucleotides.
• A Z-score threshold was calculated as:
$$Z = \frac{X - Y}{\sqrt{X + Y}} > 1.96 \text{ (or 3.29)}$$
• >90% of reads supporting the same base
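The two criteria can be combined into a per-position base caller. This is a minimal sketch under stated assumptions, not NDtree's actual code: the function name `call_base`, the use of "N" for an uncalled position, and the Z-score form (X - Y) / sqrt(X + Y) are all assumptions made for illustration.

```python
import math
from collections import Counter

def call_base(bases, z_threshold=1.96, min_fraction=0.9):
    """Return the consensus base at one position, or 'N' if evidence is weak.

    bases: the nucleotides observed at this position in the mapped reads,
           given as a string or list (hypothetical input format).
    """
    counts = Counter(bases)
    if not counts:
        return "N"                         # no reads cover this position
    base, x = counts.most_common(1)[0]     # X: reads with the most common base
    total = sum(counts.values())
    y = total - x                          # Y: reads with any other base
    z = (x - y) / math.sqrt(x + y)         # assumed Z-score formula
    # Both criteria must hold: significant Z and more than 90% agreement.
    if z > z_threshold and x / total > min_fraction:
        return base
    return "N"

print(call_base("A" * 19 + "C"))                  # 19 of 20 reads agree
print(call_base("AAAC"))                          # too few reads to be significant
print(call_base("A" * 10, z_threshold=3.29))      # fails the stricter threshold
```

Raising `z_threshold` to 3.29 demands more read depth before a base is called, which is why ten unanimous reads pass at 1.96 but not at 3.29.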

NDtree
• Count nucleotide differences

– Method 1: Each pair of sequences was compared and
the number of nucleotide differences in positions called
in all sequences was counted.
• More accurate (Z=1.96 is used as threshold)
– Method 2: Each pair of sequences was compared and
the number of nucleotide differences in positions called
in both sequences was counted.
• More robust (Z=3.29 is used as threshold)
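The difference between the two counting methods is which positions are compared. The sketch below assumes uncalled positions are marked with "N"; the function name `count_differences` and the toy sequences are hypothetical.

```python
from itertools import combinations

def count_differences(seqs, method=1, uncalled="N"):
    """Count nucleotide differences for every pair of sequences.

    method 1: compare only positions called in ALL sequences.
    method 2: compare positions called in BOTH sequences of the pair.
    """
    names = list(seqs)
    length = len(next(iter(seqs.values())))
    # Positions with a called base in every sequence (used by method 1).
    called_in_all = [
        i for i in range(length)
        if all(seq[i] != uncalled for seq in seqs.values())
    ]
    result = {}
    for a, b in combinations(names, 2):
        sa, sb = seqs[a], seqs[b]
        if method == 1:
            positions = called_in_all
        else:
            positions = [
                i for i in range(length)
                if sa[i] != uncalled and sb[i] != uncalled
            ]
        result[(a, b)] = sum(sa[i] != sb[i] for i in positions)
    return result

# Toy data: the third position of strain C could not be called.
seqs = {"A": "ATTCAGTA", "B": "ATGCAGTC", "C": "ATNCAATC"}
print(count_differences(seqs, method=1))
print(count_differences(seqs, method=2))
```

In this toy data, A and B differ at the position that is uncalled in C. Method 1 drops that position for every pair, while method 2 still counts it when comparing A with B.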
NDtree
• A matrix with these numbers was given as input to a
UPGMA algorithm implemented in the neighbor
program.
• Simplest method:
– First cluster closest strains
• Merge those strains to one point
– The distance from a cluster to another strain is the average of the
distances from each member of the cluster to that strain
– Then merge second closest
– Repeat until all strains are clustered
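The clustering steps above can be sketched in a few lines. This is an illustrative toy version, not the neighbor program: the function name `upgma` is hypothetical, branch lengths are omitted, and ties between equally close pairs are broken by taking the first pair found, so a different tie-break could give a different but equally valid tree.

```python
def upgma(dist):
    """Cluster strains with UPGMA and return the topology as a string.

    dist: dict of dicts holding symmetric pairwise distances.
    """
    # Work on copies so the caller's matrix is left untouched.
    d = {a: dict(row) for a, row in dist.items()}
    size = {a: 1 for a in d}              # number of strains per cluster
    while len(d) > 1:
        # 1. Find the closest pair of clusters (first minimum wins on ties).
        a, b = min(
            ((x, y) for x in d for y in d if x < y),
            key=lambda p: d[p[0]][p[1]],
        )
        merged = f"({a},{b})"
        # 2. Distance from the merged cluster to each remaining cluster:
        #    the size-weighted mean, i.e. the average over all member strains.
        new_row = {
            c: (size[a] * d[a][c] + size[b] * d[b][c]) / (size[a] + size[b])
            for c in d if c not in (a, b)
        }
        # 3. Replace the two old clusters with the merged one.
        for c in (a, b):
            del d[c]
        for row in d.values():
            row.pop(a, None)
            row.pop(b, None)
        for c, value in new_row.items():
            d[c][merged] = value
        d[merged] = new_row
        size[merged] = size.pop(a) + size.pop(b)
    return next(iter(d))

# Distance matrix for the four toy strains A, B, C and D.
matrix = {
    "A": {"A": 0, "B": 2, "C": 3, "D": 1},
    "B": {"A": 2, "B": 0, "C": 1, "D": 1},
    "C": {"A": 3, "B": 1, "C": 0, "D": 2},
    "D": {"A": 1, "B": 1, "C": 2, "D": 0},
}
tree = upgma(matrix)
print(tree)  # ((A,D),(B,C)) with this tie-breaking rule
```

Three pairs in the toy matrix are at distance 1 (A and D, B and C, B and D), so the result depends on which pair is merged first.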
Controlled Evolution study
This study was performed as part of my Master's thesis
Naming the descendants
Mutations
Phylogenetic tree using NDtree (UPGMA)
Phylogenetic tree using Neighbor Joining
UPGMA vs. Neighbor Joining
• UPGMA works well when samples have been taken
at the same time
• Neighbor joining is better when samples have been
taken at different times
CSI Phylogeny
[Figure: CSI Phylogeny tree of the controlled evolution samples. Tips are labelled with a letter and lineage, e.g. a_1_1_1_1_1 to o_1_2_2_2_1; internal nodes are labelled by lineage (1, 1_1, 1_2, 1_1_1, ...). Scale bar: 0.2]
So… What should I use when?
• CSI Phylogeny is advantageous when you expect the
differences between the samples to be larger than
5-10 mutations
• NDtree, on the other hand, is able to find differences
smaller than that, but may not be strict enough to handle
very large differences.
