bioRxiv (Cold Spring Harbor Laboratory), Oct 8, 2022
Dating phylogenetic trees to obtain branch lengths in the unit of time is essential for many downstream applications but has remained challenging. Dating requires inferring mutation rates that can change across the tree. While we can assume to have information about a small subset of nodes from the fossil record or sampling times (for fast-evolving organisms), inferring the ages of the other nodes essentially requires extrapolation and interpolation. Assuming a clock model that defines a distribution over rates, we can formulate dating as a constrained maximum likelihood (ML) estimation problem. While ML dating methods exist, their accuracy degrades in the face of model misspecification where the assumed parametric statistical clock model vastly differs from the true distribution. Notably, existing methods tend to assume rigid, often unimodal rate distributions. A second challenge is that the likelihood function involves an integral over the continuous domain of the rates and often leads to difficult non-convex optimization problems. To tackle these two challenges, we propose a new method called Molecular Dating using Categorical-models (MD-Cat). MD-Cat uses a categorical model of rates inspired by nonparametric statistics and can approximate a large family of models by discretizing the rate distribution into k categories. Under this model, we can use the Expectation-Maximization (EM) algorithm to coestimate rate categories and branch lengths in the time unit. Our model has fewer assumptions about the true clock model than parametric models such as Gamma or LogNormal distribution. Our results on two simulated and real datasets of Angiosperms and HIV and a wide selection of rate distributions show that MD-Cat is often more accurate than the alternatives, especially on datasets with nonmodal or multimodal clock models.
Phylogenetic trees inferred from sequence data often have branch lengths measured in the expected number of substitutions and therefore do not have divergence times estimated. These trees give an incomplete view of evolutionary histories since many applications of phylogenies require time trees. Many methods have been developed to convert the inferred branch lengths from substitution unit to time unit using calibration points, but none is universally accepted as they are challenged in both scalability and accuracy under complex models. Here, we introduce a new method that formulates dating as a nonconvex optimization problem where the variance of log-transformed rate multipliers is minimized across the tree. On simulated and real data, we show that our method, wLogDate, is often more accurate than alternatives and is more robust to various model assumptions.
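As a rough illustration of the quantity this abstract describes, the sketch below computes the variance of log-transformed rate multipliers for a candidate set of branch times. It is a minimal sketch under stated assumptions, not the wLogDate implementation: the function name `log_rate_variance`, the global rate `mu`, and the definition of a branch's rate multiplier as `b / (mu * t)` (substitution length over expected length under a strict clock) are all hypothetical choices made here for illustration.

```python
import math
from statistics import pvariance


def log_rate_variance(subst_lengths, time_lengths, mu):
    """Population variance of log rate multipliers across branches.

    subst_lengths: branch lengths in substitutions per site (assumed input)
    time_lengths:  candidate branch lengths in time units (assumed input)
    mu:            assumed global substitution rate per time unit
    """
    # Rate multiplier of a branch: observed length divided by the length
    # expected under a strict clock with rate mu (an illustrative definition).
    logs = [math.log(b / (mu * t)) for b, t in zip(subst_lengths, time_lengths)]
    return pvariance(logs)


# Under a strict clock every multiplier is 1, so the variance is zero.
strict = log_rate_variance([0.02, 0.04], [2.0, 4.0], mu=0.01)

# If one branch evolves twice as fast, the variance becomes positive.
relaxed = log_rate_variance([0.02, 0.08], [2.0, 4.0], mu=0.01)
```

A dating method in this spirit would search over `time_lengths` (and `mu`) subject to the calibration constraints so that a quantity like this is as small as possible; the search itself is not shown here.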
... Model for Team Communication Tracey Wiggins, Daniel Swift, Uyen Mai, and Ray Luechtefeld University of La Verne, [email protected], daniel.swift@laverne.edu, [email protected], [email protected] ...
This repository contains the data and scripts used in the paper 'TreeCluster: clustering biological sequences using phylogenetic trees'.
Supplementary material. Appendix A – Theorem proofs. Appendix B – Supplementary figures and tables. (PDF 399 kb)
Motivation: As genome-wide reconstruction of phylogenetic trees becomes more widespread, limitations of available data are being appreciated more than ever before. One issue is that phylogenomic datasets are riddled with missing data, and gene trees, in particular, almost always lack representatives from some species otherwise available in the dataset. Since many downstream applications of gene trees require or can benefit from access to complete gene trees, it will be beneficial to algorithmically complete gene trees. Also, gene trees are often unrooted, and rooting them is useful for downstream applications. While completing and rooting a gene tree with respect to a given species tree has been studied, those problems are not studied in depth when we lack such a reference species tree. Results: We study completion of gene trees without a need for a reference species tree. We formulate an optimization problem to complete the gene trees while minimizing their quartet distance to the gi...
Summary: Although horizontal gene transfer is recognized as a major evolutionary process in Bacteria and Archaea, its general patterns remain elusive, due to difficulties tracking genes at relevant resolution and scale within complex microbiomes. To circumvent these challenges, we analyzed a randomized sample of >12,000 genomes of individual cells of Bacteria and Archaea in the tropical and subtropical ocean - a well-mixed, global environment. We found that marine microorganisms form gene exchange networks (GENs) within which transfers of both flexible and core genes are frequent, including the rRNA operon that is commonly used as a conservative taxonomic marker. The data revealed efficient gene exchange among genomes with <28% nucleotide difference, indicating that GENs are much broader lineages than the nominal microbial species, which are currently delineated at 4-6% nucleotide difference. The 42 largest GENs accounted for 90% of cells in the tropical ocean microbiome. Freque...
Phylogenetic trees inferred from sequence data often have branch lengths measured in the expected number of substitutions and therefore do not have divergence times estimated. These trees give an incomplete view of evolutionary histories since many applications of phylogenies require time trees. Many methods have been developed to convert the inferred branch lengths from substitution unit to time unit using calibration points, but none is universally accepted as they are challenged in both scalability and accuracy under complex models. Here, we introduce a new method that formulates dating as a non-convex optimization problem where the variance of log-transformed rate multipliers is minimized across the tree. On simulated and real data, we show that our method, wLogDate, is often more accurate than alternatives and is more robust to various model assumptions.
Rapid growth of genome data provides opportunities for updating microbial evolutionary relationships, but this is challenged by the discordant evolution of individual genes. Here we build a reference phylogeny of 10,575 evenly-sampled bacterial and archaeal genomes, based on a comprehensive set of 381 markers, using multiple strategies. Our trees indicate remarkably closer evolutionary proximity between Archaea and Bacteria than previous estimates that were limited to fewer “core” genes, such as the ribosomal proteins. The robustness of the results was tested with respect to several variables, including taxon and site sampling, amino acid substitution heterogeneity and saturation, non-vertical evolution, and the impact of exclusion of candidate phyla radiation (CPR) taxa. Our results provide an updated view of domain-level relationships.
Clustering homologous sequences based on their similarity is a problem that appears in many bioinformatics applications. The fact that sequences cluster is ultimately the result of their phylogenetic relationships. Despite this observation and the natural ways in which a tree can define clusters, most applications of sequence clustering do not use a phylogenetic tree and instead operate on pairwise sequence distances. Due to advances in large-scale phylogenetic inference, we argue that tree-based clustering is under-utilized. We define a family of optimization problems that, given a (not necessarily ultrametric) tree, return the minimum number of clusters such that all clusters adhere to constraints on their heterogeneity. We study three specific constraints that limit the diameter of each cluster, the sum of its branch lengths, or chains of pairwise distances. These three versions of the problem can be solved in time that increases linearly with the size of the tree, a fact that ha...
Sequence data used in reconstructing phylogenetic trees may include various sources of error. Typically errors are detected at the sequence level, but when missed, the erroneous sequences often appear as unexpectedly long branches in the inferred phylogeny. We propose an automatic method to detect such errors. We build a phylogeny including all the data then detect sequences that artificially inflate the tree diameter. We formulate an optimization problem, called the k-shrink problem, that seeks to find k leaves that could be removed to maximally reduce the tree diameter. We present an algorithm to find the exact solution for this problem in polynomial time. We then use several statistical tests to find outlier species that have an unexpectedly high impact on the tree diameter. These tests can use a single tree or a set of related gene trees and can also adjust to species-specific patterns of branch length. The resulting method is called TreeShrink. We test our method on six phyloge...
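To make the k-shrink problem statement concrete, the sketch below solves it by brute force on a toy example. This is only an illustration of the problem definition, not the polynomial-time algorithm the abstract refers to, and it would not scale beyond small inputs. It assumes the tree is summarized by its leaf-to-leaf path distances (so the diameter of the tree restricted to a leaf set is the largest pairwise distance among those leaves); the function names and the toy distances are hypothetical.

```python
from itertools import combinations


def diameter(dist, leaves):
    """Largest pairwise path distance among the given leaves (0.0 if fewer than two)."""
    return max(
        (dist[frozenset((a, b))] for a, b in combinations(leaves, 2)),
        default=0.0,
    )


def k_shrink_brute_force(dist, leaves, k):
    """Try every set of k leaves and keep the removal giving the smallest diameter."""
    best_removed, best_diam = None, float("inf")
    for removed in combinations(leaves, k):
        kept = [leaf for leaf in leaves if leaf not in removed]
        d = diameter(dist, kept)
        if d < best_diam:
            best_removed, best_diam = set(removed), d
    return best_removed, best_diam


# Toy star tree with terminal branch lengths A=1, B=1, C=2, D=9,
# so D sits on an unexpectedly long branch.
leaves = ["A", "B", "C", "D"]
branch = {"A": 1.0, "B": 1.0, "C": 2.0, "D": 9.0}
dist = {
    frozenset((a, b)): branch[a] + branch[b]
    for a, b in combinations(leaves, 2)
}

removed, new_diameter = k_shrink_brute_force(dist, leaves, k=1)
```

In this toy input the full diameter is 11 (between C and D), and removing the single leaf D reduces it to 3, which is the kind of outsized impact on the diameter that the abstract's statistical tests are meant to flag.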
In this section we prove the following propositions, which were used in the main text to support the theory of the MinVar rooting method. Please refer to the main paper for more details. Proposition 1. A point p on tree T is a local MV if and only if it is a balance point. Based on Proposition 1, we refer to local MV and balance point interchangeably. Proposition 2. Any tree has at least one local MV. Proposition 3. The global MV of any tree is one of its local MVs.
2011 Frontiers in Education Conference (FIE), 2011
... edu. Tracey Wiggins, Doctoral Candidate, University of La Verne, [email protected]. Ray Luechtefeld, Ph.D., Associate Professor of Organizational Leadership, University of La Verne, [email protected].
2011 Frontiers in Education Conference (FIE), 2011
... Model for Team Communication Tracey Wiggins, Daniel Swift, Uyen Mai, and Ray Luechtefeld University of La Verne, [email protected], daniel.swift@laverne.edu, [email protected], [email protected] ...
Papers by Uyên Mai