{"id":29,"date":"2025-10-10T15:36:03","date_gmt":"2025-10-10T15:36:03","guid":{"rendered":"https:\/\/erik.mvolz.com\/?p=29"},"modified":"2025-10-10T16:26:23","modified_gmt":"2025-10-10T16:26:23","slug":"can-we-predict-evolution-from-a-phylogeny","status":"publish","type":"post","link":"https:\/\/erik.mvolz.com\/?p=29","title":{"rendered":"Can we predict evolution from a phylogeny?"},"content":{"rendered":"\n<h1 class=\"wp-block-heading\">Summary<\/h1>\n\n\n\n<p>Our preprint asks whether you can spot tomorrow\u2019s &#8220;up-and-coming&#8221; lineages by reading patterns already encoded in today\u2019s phylogenetic tree. We introduce a simple statistic, <strong>coalescent odds<\/strong>, that scores each lineage\u2019s tendency to spawn descendants. It\u2019s fast, interpretable, and designed for messy real-world surveillance. <a href=\"https:\/\/www.biorxiv.org\/content\/10.1101\/2025.09.29.679185v1\">Here&#8217;s the preprint<\/a>. And <a href=\"https:\/\/github.com\/emvolz\/cod\">here&#8217;s the R package<\/a>.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"768\" height=\"615\" src=\"https:\/\/erik.mvolz.com\/wp-content\/uploads\/2025\/10\/codaim3fig.png\" alt=\"\" class=\"wp-image-31\" style=\"width:599px;height:auto\" srcset=\"https:\/\/erik.mvolz.com\/wp-content\/uploads\/2025\/10\/codaim3fig.png 768w, https:\/\/erik.mvolz.com\/wp-content\/uploads\/2025\/10\/codaim3fig-300x240.png 300w\" sizes=\"auto, (max-width: 768px) 100vw, 768px\" \/><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">Why this matters<\/h2>\n\n\n\n<p>Epidemiologists sift through pathogen genomes to catch variants early. But turning millions of sequences into an early-warning signal is hard. 
We asked a basic question: <strong>does the shape of the tree itself contain enough signal to flag which lineages are likely to grow?<\/strong><\/p>\n\n\n\n<p>This isn&#8217;t a new problem, and ours isn&#8217;t the first proposed solution. For example, the <a href=\"https:\/\/elifesciences.org\/articles\/03568\">local branching index (LBI)<\/a> is also a simple statistic that can be readily calculated from a tree, and it is the default statistic reported on Nextstrain as a proxy for pathogen fitness, e.g. <a href=\"https:\/\/nextstrain.org\/seasonal-flu\/h3n2\/ha\/2y?c=lbi\">here<\/a>. There&#8217;s also a lot of related work on picking out &#8220;clusters&#8221; or <a href=\"https:\/\/academic.oup.com\/sysbio\/article\/69\/5\/884\/5734655\">cryptic population structure<\/a> in phylogenies, and these can be used as the basis for epidemic <a href=\"https:\/\/doi.org\/10.1038\/s41586-024-08309-9\">early<\/a> <a href=\"https:\/\/doi.org\/10.1016\/j.ebiom.2023.104939\">warning<\/a> signals.<\/p>\n\n\n\n<p>But we had another look at this problem to see if we could come up with a method that checks some boxes:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Can we define a statistic that is readily interpretable and based on a classic population-genetic model?<\/li>\n\n\n\n<li>Can we make it work with messy real-world data that is rife with sampling bias?<\/li>\n\n\n\n<li>Can we make it scalable to very large real-world datasets?<\/li>\n\n\n\n<li>Can we fit the model to data without relying on arbitrary hyperparameters (e.g. clustering thresholds)?<\/li>\n<\/ol>\n\n\n\n<h2 class=\"wp-block-heading\">The core idea: coalescent odds<\/h2>\n\n\n\n<p>Funnily enough, one of the most useful ways to study phylogenies is to read them &#8216;backwards&#8217; in time.<br>Branches <strong>coalesce<\/strong> when two lineages share a recent ancestor. If a lineage tends to sit near lots of recent coalescent events, it may be on a growth trajectory. 
Or it may just be in a region that is highly sampled! It&#8217;s important to have a method that can distinguish between these mechanisms.<\/p>\n\n\n\n<p>We define a <strong>continuous, heritable &#8220;propensity to coalesce&#8221;<\/strong> for each lineage. Intuitively, it\u2019s a score for &#8220;how likely this lineage is to be the parent of many near-future samples.&#8221; From this, we compute <strong>coalescent odds (cod&#8217;s)<\/strong>, a per-lineage number that can be tracked over time.<\/p>\n\n\n\n<p>In principle, this gives us a <em>model-light<\/em> approach, without heavy assumptions about evolution or fitness. Coalescent odds can vary in a flexible way across a phylogenetic tree, and we fit the model in a way that maximises the predictive power of the statistic for future growth.<\/p>\n\n\n\n<p>While the basic model is not terribly scalable (certainly not in comparison to methods that use message-passing algorithms), we developed a number of very efficient and accurate approximations that make this applicable to trees with 000&#8217;s of samples.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">What we found (in brief)<\/h2>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>Prediction.<\/strong> Given a tree built from genomes available <em>today<\/em>, can cod&#8217;s help identify lineages that will be more common <em>tomorrow<\/em>?\n<ul class=\"wp-block-list\">\n<li>Yes! 
Even with realistic noise, cod&#8217;s contain predictive information about short-term lineage growth, enough to improve triage over naive baselines.<\/li>\n\n\n\n<li><strong>Temporal stability.<\/strong> The signal isn\u2019t a one-off artifact when natural selection is at play; it persists across time when updated with new data.<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li><strong>Robustness.<\/strong> Does it still work when sampling is uneven, noisy, or biased (as in real surveillance)?\n<ul class=\"wp-block-list\">\n<li>Because cod&#8217;s are model-light, they degrade gracefully under imperfect sampling. If informative metadata are available, sampling bias can be handled explicitly.<\/li>\n<\/ul>\n<\/li>\n<\/ol>\n\n\n\n<h2 class=\"wp-block-heading\">What\u2019s next<\/h2>\n\n\n\n<p>One of the most exciting things about this approach is that there are many ways to build on it. Here&#8217;s what we&#8217;re working on now:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Since we have essentially transformed estimating lineage fitness into a linear regression problem (read the preprint to see how), it is possible to include covariate data for each lineage. For example, imagine using deep learning, <a href=\"https:\/\/evescape.org\/\">protein language models<\/a>, or <a href=\"https:\/\/github.com\/dms-vep\/dms-vep-pipeline-3\">deep mutational scanning<\/a> to independently estimate fitness. These scores could then be included when estimating cod&#8217;s to produce a combined estimate of fitness that uses <strong>all<\/strong> the data: epidemiological, virological, and computational.<\/li>\n\n\n\n<li>One limitation of this approach is that it depends on the ability to estimate high-quality time-scaled phylogenies. This isn&#8217;t always possible; sometimes we deal with short and incomplete genomic data, and the best we can do is identify clusters. 
The good news is that the basic underlying model for cod&#8217;s could be used in this case as well. Showing how to do this will be very useful for applying these methods to new <a href=\"https:\/\/www.gov.uk\/government\/news\/ukhsa-launches-new-metagenomic-surveillance-for-health-security\">metagenomics surveillance systems<\/a>.<\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>Summary Our preprint asks whether you can spot tomorrow\u2019s &#8220;up-and-coming&#8221; lineages by reading patterns already encoded in today\u2019s phylogenetic tree. We introduce a simple statistic&#8211; coalescent odds&#8212; that scores each lineage\u2019s tendency to spawn descendants. It\u2019s fast, interpretable, and designed for messy real-world surveillance. Here&#8217;s the preprint. And, here&#8217;s the R package. Why this matters [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-29","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/erik.mvolz.com\/index.php?rest_route=\/wp\/v2\/posts\/29","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/erik.mvolz.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/erik.mvolz.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/erik.mvolz.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/erik.mvolz.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=29"}],"version-history":[{"count":3,"href":"https:\/\/erik.mvolz.com\/index.php?rest_route=\/wp\/v2\/posts\/29\/revisions"}],"predecessor-version":[{"id":46,"href":"https:\/\/erik.mvolz.com\/index.php?rest_route=\/wp\/v2\/posts\/29\/revisions\/46"}],"wp:attachment":[{"href":"https:\/\/erik.mvolz.c
om\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=29"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/erik.mvolz.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=29"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/erik.mvolz.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=29"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}