CGE Course Johanne


Phylogeny

- based on whole genome data

Johanne Ahrenfeldt
Research Assistant
Overview

• What is Phylogeny and what can it be used for


• Single Nucleotide Polymorphism (SNP) methods
- snpTree and CSI Phylogeny
• Nucleotide Differences
- NDtree
• Controlled Evolution study
• What services for which data
What are phylogenetic trees

• Trees are traditionally made using aligned sequences of single genes or proteins
• Whole genome data may be used to create trees based on
– SNP calling
– K-mer overlap
What is a SNP

• A Single Nucleotide Polymorphism (SNP) is a DNA sequence variation occurring commonly within a population (e.g. 1%), in which a single nucleotide (A, T, C or G) in the genome (or other shared sequence) differs between members of a biological species or between paired chromosomes.
What is phylogeny used for

• Classify taxonomy – the classic use
• Outbreak detection – increasingly common with WGS data
How does it work

Strain A ATTCAGTA
Strain B ATGCAGTC
Strain C ATGCAATC
Strain D ATTCAGTC
Construct distance matrix
   A  B  C  D
A  0  2  3  1
B  2  0  1  1
C  3  1  0  2
D  1  1  2  0
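
The distance matrix for the four toy strains can be reproduced with a short Python sketch. It simply counts the positions at which two aligned sequences differ; the function names `count_differences` and `distance_matrix` are illustrative and not part of any CGE tool.

```python
# Toy sequences copied from the slide.
strains = {
    "A": "ATTCAGTA",
    "B": "ATGCAGTC",
    "C": "ATGCAATC",
    "D": "ATTCAGTC",
}


def count_differences(seq1, seq2):
    """Count the positions at which two aligned sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(seq1, seq2))


def distance_matrix(seqs):
    """Return a square list-of-lists matrix in the order of the dict keys."""
    names = list(seqs)
    return [[count_differences(seqs[r], seqs[c]) for c in names] for r in names]


# Print one row per strain, e.g. the row for strain A reads "A 0 2 3 1".
for name, row in zip(strains, distance_matrix(strains)):
    print(name, *row)
```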
Make Tree

[Figure: tree of strains A, B, C and D built from the distance matrix]
snpTree

• First online webserver for constructing phylogenetic trees based on whole genome sequencing

snpTree--a web-server to identify and construct SNP trees from whole genome sequence data. Leekitcharoenphon P,
Kaas RS, Thomsen MC, Friis C, Rasmussen S, Aarestrup FM. BMC Genomics. 2012;13 Suppl 7:S6.
snpTree flow
snpTree output
CSI Phylogeny

• Finds SNPs in the same manner as snpTree


• More strict sorting of SNPs
• Requires all SNPs to be significant
– Z-score higher than 1.96 for all SNPs

$$Z = \frac{X - Y}{\sqrt{X + Y}}$$
• X is the number of reads with the most common nucleotide at that position, and Y is the number of reads with any other nucleotide.
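
As a rough illustration of the significance test, the sketch below computes the Z-score from the two read counts. It assumes the denominator is the square root of X + Y (otherwise Z could never exceed 1); `z_score` and `is_significant` are hypothetical helper names, not functions from CSI Phylogeny itself.

```python
import math


def z_score(x, y):
    """Z = (X - Y) / sqrt(X + Y).

    x: reads with the most common nucleotide at the position
    y: reads with any other nucleotide
    The square root in the denominator is an assumption (see lead-in).
    """
    if x + y == 0:
        raise ValueError("no reads cover this position")
    return (x - y) / math.sqrt(x + y)


def is_significant(x, y, threshold=1.96):
    """True when the Z-score is higher than the threshold."""
    return z_score(x, y) > threshold


print(is_significant(4, 0))    # Z = 2.0, above 1.96
print(is_significant(3, 0))    # Z is about 1.73, below 1.96
print(is_significant(20, 10))  # Z is about 1.83, below 1.96
```

Note that with this formula a position covered by only three reads can never pass the 1.96 threshold, even if all reads agree.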
NDtree

• A different approach: the main question is not whether a SNP should be called, but whether there is solid evidence for which nucleotide to call at each position.
NDtree
• When all reads had been mapped, the significance of the base call at each position was evaluated by calculating the number of reads X having the most common nucleotide at that position, and the number of reads Y supporting other nucleotides.
• A Z-score was calculated and compared against a threshold:
$$Z = \frac{X - Y}{\sqrt{X + Y}} > 1.96 \text{ (or 3.29)}$$

• >90% of reads must support the same base
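
The two criteria can be combined into a per-position base call, sketched below. This is not the NDtree source code: the function name `call_base`, the use of "N" for an uncalled position, and the square root in the Z-score denominator are assumptions made for illustration.

```python
import math
from collections import Counter


def call_base(read_bases, z_threshold=1.96, min_fraction=0.9):
    """Call the base at one position from the bases seen in the mapped reads.

    Returns the most common base when both criteria hold, otherwise "N"
    (an assumed marker for "no solid evidence").
    """
    counts = Counter(read_bases)
    if not counts:
        return "N"  # no reads cover this position
    base, x = counts.most_common(1)[0]  # X: reads with the most common base
    y = sum(counts.values()) - x        # Y: reads supporting other bases
    z = (x - y) / math.sqrt(x + y)
    if z > z_threshold and x / (x + y) > min_fraction:
        return base
    return "N"


print(call_base("AAAAAAAAAAAT"))                   # 11 of 12 reads agree
print(call_base("AAAAAAAATT"))                     # only 80% agree
print(call_base("AAAAAAAAAA", z_threshold=3.29))   # stricter threshold
```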


NDtree

• Count nucleotide differences


– Method 1: Each pair of sequences was compared and
the number of nucleotide differences in positions called
in all sequences was counted.
• More accurate (Z=1.96 is used as threshold)
– Method 2: Each pair of sequences was compared and
the number of nucleotide differences in positions called
in both sequences was counted.
• More robust (Z=3.29 is used as threshold)
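
The difference between the two counting methods shows up as soon as one sequence has an uncalled position. The sketch below uses made-up sequences and assumes uncalled positions are marked "N"; the function names are illustrative only.

```python
def differences_all(seqs, i, j, uncalled="N"):
    """Method 1: count differences only at positions called in ALL sequences."""
    a, b = seqs[i], seqs[j]
    return sum(
        a[p] != b[p]
        for p in range(len(a))
        if all(s[p] != uncalled for s in seqs)
    )


def differences_pairwise(seqs, i, j, uncalled="N"):
    """Method 2: count differences at positions called in BOTH sequences."""
    return sum(
        x != y
        for x, y in zip(seqs[i], seqs[j])
        if x != uncalled and y != uncalled
    )


# Hypothetical example: the third sequence has no call at its third position.
seqs = ["ATTCAGTA", "ATGCAGTC", "ATNCAATC"]

print(differences_all(seqs, 0, 1))       # third position is ignored for every pair
print(differences_pairwise(seqs, 0, 1))  # third position still counts for this pair
```

With Method 1 a single poorly covered sample removes positions from every comparison, which is one way to read why Method 2 is described as more robust.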
NDtree

• A matrix with these numbers was given as input to a UPGMA algorithm implemented in the neighbor program.
• Simplest method:
– First cluster closest strains
• Merge those strains to one point
– Distance of cluster to another strain is average of distances for each
member in cluster to that strain
– Then merge second closest
– Repeat until all strains are clustered
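
The clustering steps above can be sketched in a few lines of Python. This is a minimal illustration, not the neighbor program: it returns only the tree topology as a nested string (no branch lengths), breaks ties by taking the first closest pair it finds, and reuses the four-strain toy matrix from the earlier slide.

```python
from itertools import combinations


def upgma(labels, matrix):
    """Cluster by repeatedly merging the two closest clusters."""
    # Pairwise distances keyed by an unordered pair of cluster names.
    dist = {
        frozenset((labels[i], labels[j])): matrix[i][j]
        for i, j in combinations(range(len(labels)), 2)
    }
    sizes = {label: 1 for label in labels}  # cluster name -> number of strains
    while len(sizes) > 1:
        # First cluster the closest pair (ties: first pair found).
        a, b = min(combinations(sizes, 2), key=lambda pair: dist[frozenset(pair)])
        size_a, size_b = sizes.pop(a), sizes.pop(b)
        merged = f"({a},{b})"
        # Distance from the merged cluster to every other cluster is the
        # average over all members, i.e. a size-weighted mean.
        for other in sizes:
            dist[frozenset((merged, other))] = (
                size_a * dist[frozenset((a, other))]
                + size_b * dist[frozenset((b, other))]
            ) / (size_a + size_b)
        sizes[merged] = size_a + size_b
    return next(iter(sizes))


labels = ["A", "B", "C", "D"]
matrix = [
    [0, 2, 3, 1],
    [2, 0, 1, 1],
    [3, 1, 0, 2],
    [1, 1, 2, 0],
]
print(upgma(labels, matrix))
```

Because three pairs in the toy matrix are tied at distance 1, a different tie-breaking rule could give a different topology.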
Controlled Evolution study

This study was performed as part of my Master's thesis


Naming the descendants
Mutations
Phylogenetic tree using NDtree (UPGMA)
Phylogenetic tree using Neighbor Joining
UPGMA vs. Neighbor Joining

• UPGMA works well when samples have been taken at the same time
• Neighbor joining is better when samples have been taken at different times
CSI Phylogeny
[Figure: CSI Phylogeny tree with node labels such as 1, 1_1, 1_2 and 1_2_2_1, and tip labels a_1_1_1_1_1 through o_1_2_2_2_1; scale bar 0.2]
So… What should I use when?

• CSI Phylogeny is advantageous when you expect the differences between the samples to be larger than 5-10 mutations
• NDtree, on the other hand, is able to find smaller differences than that, but may not be strict enough to handle very large differences.
