Research Seminars: Probability and Statistics

Spring 2023

Time & Location: All talks are on Wednesdays in Gibson 414 at 4:00 PM unless otherwise noted.
Organizers: Xiang Ji, Michelle Lacey, and Yuwei Bao



September 9

Title: Many-core algorithms for scaling phylogenetic inference
Karthik Gangavarapu | UCLA

Abstract: The rapid growth in genomic pathogen data spurs the need for efficient inference techniques, such as Hamiltonian Monte Carlo (HMC) in a Bayesian framework, to estimate parameters of phylogenetic models where the dimensions of the parameters increase with the number of sequences N. HMC requires repeated calculation of the gradient of the data log-likelihood with respect to (wrt) all branch-length-specific (BLS) parameters, which traditionally takes O(N^2) operations using the standard pruning algorithm. A recent study proposes an approach to calculate this gradient in O(N), enabling researchers to take advantage of gradient-based samplers such as HMC. The CPU implementation of this approach makes the calculation of the gradient computationally tractable for nucleotide-based models but falls short in performance for models with larger state spaces, such as codon models. Here, we describe novel massively parallel algorithms to calculate the gradient of the log-likelihood wrt all BLS parameters that take advantage of graphics processing units (GPUs) and result in many-fold speedups over previous CPU implementations. We benchmark these GPU algorithms on three computing systems using three evolutionary inference examples (carnivores, dengue, and yeast) and observe a greater than 128-fold speedup over the CPU implementation for codon-based models and a greater than 8-fold speedup for nucleotide-based models. As a practical demonstration, we also estimate the timing of the first introduction of West Nile virus into the continental United States under a codon model with a relaxed molecular clock from 104 full viral genomes, an inference task previously intractable. We provide an implementation of our GPU algorithms in BEAGLE v4.0.0, an open source library for statistical phylogenetics that enables parallel calculations on multi-core CPUs and GPUs.

Time: 4:00 pm
Location: Gibson 414


October 25

Title: Multiple Sequence Alignment and Heterotachy
Dr. Ben Redelings - Research Scholar at the Ronin Institute

Abstract:   Inference of evolutionary trees from DNA sequence data is essential for understanding the origin and spread of viral outbreaks, as well as for understanding the relationships and classification of all living organisms.  It has long been recognized that under-parameterized models of DNA sequence evolution can lead to inaccurate estimates of evolutionary trees.  For example, when DNA sequences display among-site rate variation (ASRV), inference under models that only allow a single rate for all sites can lead to biased tree estimates and exaggerated confidence in those estimates.
However, evolutionary trees are not inferred directly from observed DNA sequences.  Instead, DNA sequences are first arranged in a multiple sequence alignment (MSA) that groups DNA nucleotides sharing the same common ancestor into a single "column" of an alignment matrix.  The effect of under-parameterization on MSAs has not yet been explored, probably because MSAs are usually inferred via ad hoc non-statistical methods.  We thus explore the effect of under-parameterization on multiple sequence alignment in a Bayesian framework that jointly infers the alignment of DNA sequences and the evolutionary tree that relates those sequences.
We first examine the effect of ASRV models on inferred alignments and trees. Such models assume that each DNA nucleotide evolves under a fixed rate that does not change over time. However, if DNA nucleotides actually switch from being conserved to being unconserved over evolutionary time scales, ASRV models may also be under-parameterized. Inference under such models may split alignment columns into multiple sub-columns in order to fit a single rate to each column. Thus, we investigate the effect of Markov-modulated substitution models. We also construct a Bayesian test for the presence of covarion-style heterotachy.

Time: 4:00 pm
Location: Gibson 414


November 1

Title: Approaches to understand variation in phylogenomics and molecular evolution
Jeremy M. Brown - Louisiana State University

Abstract: Reconstructing evolutionary history from genomic sequences is increasingly central for answering a huge variety of questions in modern biology. However, as genome sequencing has become cheaper and easier, we have come to realize that both reconstructed evolutionary histories (phylogenies) and patterns of molecular evolution vary enormously across different regions of genomes. Despite recognition of this variation, we lack appropriate tools to fully understand and explore it. In this talk, we will discuss several mathematical and statistical approaches that my lab and collaborators have been developing to better understand such variation, and provide deeper insights into the evolutionary past.

Time: 4:00 pm
Location: Gibson 414


November 15

Title: Latent Models in Discrete Probabilistic Distributions
Dr. Manuel Lladser - University of Colorado Boulder

Abstract: Discrete datasets are central to modern biological research, yet they may diverge significantly from practitioners' assumptions due to unforeseen contamination or technological artifacts. This contrasts with numerical datasets where Gaussian-type errors are often well-justified. To address this issue, we introduce the concept of "latent weight," which measures the largest expected fraction of samples from a discrete probabilistic distribution that conforms to a probabilistic model in a class of idealized ones. We examine various properties of latent weights, which we specialize to the class of discrete exchangeable distributions and binary vectors with independent marginals, and make the case that latent weights could be used as proxies for the approximate correctness of hypotheses without adhering to the rigid true-or-false dichotomy of classical hypothesis testing.

Time: 4:00 pm
Location: Gibson 414


November 29

Title: Model Selection in Contemporary Phylogenetics
Dr. April Wright - Southeastern Louisiana University

Abstract: Bayesian hierarchical models have enabled new frontiers for integrating different data sources in phylogenetic analyses. One example is the fossilized birth-death (FBD) model, in which multiple types of biological data are modeled jointly to estimate a phylogenetic tree. In particular, data from the fossil record are jointly modeled with extant phylogenetic characters (morphology or DNA/amino acid data). While these models show incredible promise for our understanding of deep-time evolution, they also pose new challenges for appropriate model selection.

In this talk, I will review how model selection is commonly performed in phylogenetics and how hierarchical models challenge this status quo. I will present new results from work in my lab that show promise for alleviating our model selection challenges.

Time: 4:00 pm
Location: Gibson 414