Staff Scientist, National Institutes of Health
Field of Research: Oncology
Copy number alterations (CNAs) arise during cancer evolution when regions of the genome are amplified or deleted, producing heterogeneous collections of cancer cells. Profiling and classifying CNAs plays a vital role in understanding cancer heterogeneity and evolution, and thereby in informing diagnosis and treatment. Several short-read tools profile haplotype-specific CNAs, but short reads provide a limited phasing range. Long reads allow genomic variants to be phased directly into megabase-scale haplotypes, which supports the reconstruction of longer, up to chromosome-scale, CNA profiles. Here we present Wakhan, a tool for analyzing haplotype-specific, chromosome-scale somatic copy number aberrations using long reads. Leveraging high-quality genome assembly coverage profiles, we show that Wakhan significantly outperforms other common short- and long-read CNA callers in achieving chromosome-level CNA consistency. Wakhan takes tumor-normal long-read BAMs and phased germline SNP calls as input. It first extends the input phasing to chromosome scale by exploiting haplotype coverage imbalance: it detects phase-switch regions and corrects them using the changes in haplotype-specific coverage. Next, Severus uses this enhanced phasing to generate phased structural variant (SV) calls. Finally, Wakhan's integrated CNA algorithm uses the SV calls as boundaries and employs a haplotype coverage model to assign integer copy-number states to the resulting CNA regions. Wakhan is available at https://github.com/KolmogorovLab/Wakhan. We compared Wakhan's performance against several state-of-the-art haplotype-specific CNA calling tools: PURPLE, HATCHet, and Battenberg for short-read analysis, and PURPLE and SAVANA for long-read analysis. Benchmarks exist for small-variant and SV calling, but no comparable benchmarks are available for somatic CNA calls.
We therefore designed a CNA calling benchmark based on the CASTLE panel, consisting of six pairs of tumor/normal cell lines sequenced with multiple short- and long-read sequencing technologies. We define segment error (SE) as follows: for each CNA segment, we calculate the haplotype-specific mean squared distance between expected and reference coverage at heterozygous SNPs, then use these values to compute a weighted chromosomal average, normalized by the tumor haplotype's mean coverage. Similarly, chromosome error (CE) compares the phase of the whole chromosome against the reference coverage. In the five CASTLE datasets, Wakhan and PURPLE had the lowest SE50 and SE75, indicating high accuracy in reconstructing individual CNA segments. We also evaluated Wakhan on a tumor-only dataset. Both Wakhan and PURPLE handled the absence of normal samples well and accurately reflected the expected tumor/normal profiles. Tanveer Ahmad, Ayse Keskus, Mikhail Kolmogorov, Sergey Aganezov, Michael C. Dean, Midhat S. Farooqi, S. Cenk Sahinalp, Benedict Paten, Karen H. Miga, Salem Malikić, Yuelin Liu, Byunggil Yoo, Ataberk Donmez, Anton Goretsky. Wakhan: Reconstruction of chromosome-scale copy number profiles of tumor genomes with long-read sequencing [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2026; Part 1 (Regular Abstracts); 2026 Apr 17-22; San Diego, CA. Philadelphia (PA): AACR; Cancer Res 2026;86(7 Suppl):Abstract nr 6900.
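The segment error described in the Wakhan abstract can be illustrated with a minimal Python sketch. This is not Wakhan's or the benchmark's actual code: the `Segment` class, the function names, the weighting of segments by heterozygous SNP count, and the normalization by simple division are all assumptions made for illustration, since the abstract does not specify them.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Segment:
    """One CNA segment on one haplotype (hypothetical structure)."""
    expected: float             # coverage implied by the called copy number
    reference: Sequence[float]  # reference coverage at each heterozygous SNP


def segment_msd(seg: Segment) -> float:
    """Mean squared distance between expected and reference coverage."""
    return sum((seg.expected - r) ** 2 for r in seg.reference) / len(seg.reference)


def chromosome_segment_error(segments: Sequence[Segment],
                             mean_hap_coverage: float) -> float:
    """Average of per-segment errors over a chromosome.

    Assumption: each segment is weighted by its number of heterozygous SNPs,
    and the result is divided by the tumor haplotype's mean coverage.
    """
    total_snps = sum(len(s.reference) for s in segments)
    weighted = sum(segment_msd(s) * len(s.reference) for s in segments) / total_snps
    return weighted / mean_hap_coverage


if __name__ == "__main__":
    segs = [
        Segment(expected=10.0, reference=[10.0, 12.0]),
        Segment(expected=20.0, reference=[20.0, 20.0, 20.0, 20.0]),
    ]
    print(chromosome_segment_error(segs, mean_hap_coverage=10.0))
```

Weighting by SNP count keeps short, noisy segments from dominating the chromosomal average. A length-based weight would be an equally plausible reading of "weighted".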
Tumor evolution is driven by various mutational processes, ranging from single nucleotide variants (SNVs) to large structural variants (SVs) to dynamic shifts in DNA methylation. Current short-read sequencing methods struggle to accurately capture the full spectrum of these genomic and epigenomic alterations, as well as the relationships among them, due to inherent technical limitations. Here we used Nanopore long-read sequencing to profile 23 subclones, each derived from a single cell of a mouse melanoma cell line, for precise detection and evolutionary ordering of SNVs, SVs, copy number alterations (CNAs), and DNA methylation changes at the subclonal level. Through phylogenetic analysis of these subclones, we reconstruct the timing of mutational processes and their contributions to diverse clonal phenotypes. The analysis reveals recurrent amplifications of putative driver genes, generated by independent SVs across different lineages, suggesting parallel evolution. Additionally, we describe lineage-specific methylation changes associated with aggressive tumor subclones, highlighting epigenetic trajectories linked to tumor progression. Overall, we demonstrate that our long-read approach enables a uniquely comprehensive view of melanoma progression, highlighting that SVs and methylation played an important role in initiation, clonal diversification, and development of therapeutic resistance in this tumor, consistent with recent clinical findings. We will release the sequencing data and curated variant calls to encourage the development of new computational methods. Chi-Ping Day, Yuelin Liu, Anton Goretsky, Ayse Keskus, Salem Malikic, Eva Perez-Guijarro, Glenn Merlino, Eytan Ruppin, Suleyman Cenk Sahinalp, Mikhail Kolmogorov. Full-range genomic analysis at single-cell resolution reveals genetic, epigenetic, and parallel evolution of melanoma subclones [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2026; Part 1 (Regular Abstracts); 2026 Apr 17-22; San Diego, CA. Philadelphia (PA): AACR; Cancer Res 2026;86(7 Suppl):Abstract nr 704.
Multi-sample bulk DNA sequencing enables reconstruction of a tumor’s clonal history, but scalable methods often rely on heuristic search and provide no optimality guarantees. We present CITUP2, an integrative combinatorial optimization framework that reconstructs clonal trees from descendant cell fractions (DCFs) of mutational clusters. CITUP2 formulates tree inference as a mixed-integer quadratic program (MIQP) that jointly determines the tree topology and clone prevalences across samples. It minimizes a weighted discrepancy between observed and inferred DCFs, with options to prioritize trees exhibiting consistency in the presence-absence patterns of parent-child clones. Under this formulation, CITUP2 returns provably optimal solutions (with respect to the model) and avoids the combinatorial explosion of exhaustive topology enumeration used by existing methods with optimality guarantees. In addition, CITUP2 can report a user-specified number of best trees. In simulations and analyses of a large, recently published multi-sample TRACERx cohort, CITUP2 scales to trees with tens of clones (approximately 30) and matches or improves on the fit attained by state-of-the-art approaches, while providing clear optimality certificates. Salem Malikic, Hamza Iseric, Chih Hao Wu, Erin Molloy, S. Cenk Sahinalp. Reconstruction of Tumor Clonal Trees with Multi-Sample Bulk Sequencing Data by Integrative Combinatorial Optimization [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2026; Part 1 (Regular Abstracts); 2026 Apr 17-22; San Diego, CA. Philadelphia (PA): AACR; Cancer Res 2026;86(7 Suppl):Abstract nr 6905.
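To make the objective in the CITUP2 abstract concrete, the sketch below evaluates it for a single fixed candidate tree. It does not solve the MIQP or search over topologies. It assumes the usual reading of a descendant cell fraction (the prevalence of a clone plus that of all its descendants) and a weighted squared-error discrepancy. The function names, the per-clone weights, and the toy data are hypothetical.

```python
from typing import Dict, List, Optional


def descendant_fractions(parent: Dict[str, Optional[str]],
                         prevalence: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """DCF of each clone per sample: its own prevalence plus its descendants'."""
    def depth(clone: str) -> int:
        d = 0
        while parent[clone] is not None:
            clone = parent[clone]
            d += 1
        return d

    dcf = {clone: list(values) for clone, values in prevalence.items()}
    # Visit the deepest clones first so each subtree is complete
    # before it is added to its parent.
    for clone in sorted(parent, key=depth, reverse=True):
        p = parent[clone]
        if p is not None:
            dcf[p] = [a + b for a, b in zip(dcf[p], dcf[clone])]
    return dcf


def weighted_discrepancy(observed: Dict[str, List[float]],
                         inferred: Dict[str, List[float]],
                         weights: Dict[str, float]) -> float:
    """Weighted sum of squared differences between observed and inferred DCFs."""
    return sum(
        weights[c] * (o - i) ** 2
        for c in observed
        for o, i in zip(observed[c], inferred[c])
    )


if __name__ == "__main__":
    # Toy tree: A is the root, B and C are children of A, D is a child of B.
    parent = {"A": None, "B": "A", "C": "A", "D": "B"}
    prevalence = {"A": [0.4, 0.5], "B": [0.3, 0.5], "C": [0.2, 0.0], "D": [0.1, 0.0]}
    inferred = descendant_fractions(parent, prevalence)
    observed = {"A": [1.0, 1.0], "B": [0.5, 0.5], "C": [0.2, 0.0], "D": [0.1, 0.0]}
    weights = {clone: 1.0 for clone in parent}
    print(inferred)
    print(weighted_discrepancy(observed, inferred, weights))
```

An optimizer of the kind the abstract describes would treat both the parent assignment and the prevalences as decision variables. Here they are fixed inputs so that the quantity being minimized is easy to inspect.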
Understanding and comparing tumor evolutionary histories is fundamental to cancer genomics, with direct implications for tracking subclonal population dynamics, treatment resistance, and tumor heterogeneity. Clonal trees, widely used to model tumor progression, are rooted, unordered trees in which each node represents a subclone labeled by a set of distinct mutations. Various principled and efficient methods have been developed for inferring clonal trees from either bulk or single-cell sequencing data. However, no existing computational approach is both efficient and principled for fully aligning clonal trees and comparing their subclonal architectures, which limits the robustness of any downstream analysis based on inferred clonal trees. We introduce omlta, the optimal multi-label tree alignment of two clonal trees, which removes the minimum number of mutation labels so that the remaining trees are isomorphic. Computing omlta is NP-hard. Here, we present a fixed-parameter tractable algorithm to compute the omlta, with a running time of O(L^3 log L · 2^k), where L is the number of mutation labels shared between the input trees and k is the minimum possible number of mutation labels that need to be removed for the alignment, which we call omltd, the optimal multi-label tree edit distance. Our approach provides an exponentially better (in k) asymptotic runtime than the state-of-the-art algorithm by Akutsu et al. for computing the classic tree alignment and edit distance, concepts similar to what omlta/omltd optimizes on clonal trees. We applied omlta to 126 multi-sample bulk-sequencing datasets from the TRACERx study on non-small cell lung cancers, comparing clonal trees inferred by CONIPHER and PairTree. Despite the theoretically exponential runtime, we could compute the tree alignment for each tumor quickly, often within seconds.
The omltd between CONIPHER and PairTree clonal trees on the same tumor varies substantially across tumors, and the distances are negatively associated with the mean cancer cell fraction among mutations. For tumors characterized by mutations with low cancer cell fractions, it is thus advisable not to use a single tree, but rather the alignment of multiple alternative trees, so that downstream inferences are informed only by robustly placed mutations. We further evaluated our algorithm on an in-house melanoma sample with clonal trees inferred by PhISCS and ScisTree, highlighting the utility of omlta on trees inferred from single-cell sequencing data. On these datasets, our algorithm completed all analyses in practical wall-clock times and showed that it can identify common evolutionary trajectories among clonal trees representing (i) distinct tumors, (ii) distinct samples from the same tumor, and (iii) distinct sequencing data from the same sample. Additional supplementary results demonstrate the robustness of our approach in comparison to alternatives on simulated data. Jacob Gilbert, Chih Hao Wu, Marina Knittel, Alejandro Schaffer, Salem Malikić, S. Cenk Sahinalp. Identifying robust subclonal structures through tumor progression tree alignment [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2026; Part 1 (Regular Abstracts); 2026 Apr 17-22; San Diego, CA. Philadelphia (PA): AACR; Cancer Res 2026;86(7 Suppl):Abstract nr 6898.
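The alignment problem defined in the omlta abstract (remove as few mutation labels as possible so that the two trees become isomorphic) can be illustrated on tiny inputs with a brute-force Python sketch. This is not the fixed-parameter tractable algorithm from the abstract: it enumerates label subsets and is exponential in the number of labels. The tree encoding as `(labels, children)` tuples and the rule that a non-root node left without labels is contracted into its parent are assumptions made for illustration.

```python
from itertools import combinations

# A tree is a pair (labels, children): a set of mutation labels
# and a list of subtrees.


def prune(tree, removed):
    """Drop removed labels; contract non-root nodes left with no labels."""
    labels, children = tree
    kept = frozenset(labels) - removed
    new_children = []
    for child in children:
        sub = prune(child, removed)
        if sub[0]:
            new_children.append(sub)
        else:
            new_children.extend(sub[1])  # lift grandchildren past the empty node
    return (kept, new_children)


def canonical(tree):
    """Order-independent encoding of a rooted, unordered multi-label tree."""
    labels, children = tree
    return (tuple(sorted(labels)), tuple(sorted(canonical(c) for c in children)))


def all_labels(tree):
    labels, children = tree
    found = set(labels)
    for child in children:
        found |= all_labels(child)
    return found


def brute_force_alignment(t1, t2):
    """Smallest label set whose removal makes t1 and t2 isomorphic."""
    universe = sorted(all_labels(t1) | all_labels(t2))
    for k in range(len(universe) + 1):
        for removed in combinations(universe, k):
            r = frozenset(removed)
            if canonical(prune(t1, r)) == canonical(prune(t2, r)):
                return r
    return frozenset(universe)  # not reached: removing every label always matches


if __name__ == "__main__":
    t1 = ({"a"}, [({"b"}, [({"c"}, [])]), ({"d"}, [])])
    t2 = ({"a"}, [({"b"}, []), ({"c", "d"}, [])])
    removed = brute_force_alignment(t1, t2)
    print(sorted(removed), len(removed))
```

In this toy case, the number of removed labels plays the role of the distance omltd, and the pruned trees play the role of the alignment.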
Clonal evolution of cancer results in intratumor heterogeneity, making treatment and cure challenging. Single-cell sequencing has advanced our understanding of intratumor heterogeneity, but tracing subclonal evolution using mutational profiles of cells is limited by scale and noise. Moreover, available tumor progression tree inference methods usually offer a single tree to explain the progression of a tumor, and do not inform about alternative evolutionary scenarios. We introduce the bi-partition function for a tumor progression tree, to assess the reliability of any proposed subclonal structure in a single-cell sequenced tumor. By using the bi-partition function, we calculate the probability that any given subset R of mutation-profiled single cells from a tumor forms a clade rooted by a specified mutation ρ across all possible tumor progression trees. This provides the means to evaluate whether R forms a subclone with ρ as a possible subclonal driver, which is especially useful if the cells of R are biologically or clinically significant, e.g., have aggressive growth, therapy resistance, or metastatic potential. We also introduce an algorithm to estimate the bi-partition function, which treats the ground truth as a probability distribution derived from mutational profiles of single cells and samples a tumor progression tree from this distribution independently in each iteration. We prove that our algorithm’s estimate of the bi-partition function asymptotically approaches the ground truth and demonstrate its accuracy on simulated data. 
Applying our algorithm to the tumor progression tree inferred from single-cell-derived melanoma sublines revealed that, while major clades and their root mutations are robust, (i) the placement of one clade in the tree is unreliable, which we later found to be a result of loss of heterozygosity, and (ii) some of the mutations identified as false positives in the tree are unreliable, which later turned out to be the result of a doublet, that is, a subline contaminated by another subline. Interestingly, bootstrapping, a technique commonly employed for species trees, failed to point out any of these issues. After correcting the input data for these issues, the reliability of the progression tree improved substantially, demonstrating how our bi-partition function algorithm can aid studies on tumor evolution and intratumor heterogeneity. Farid Rashidi Mehrabadi, Erfan Sadeqi Azer, John D. Bridgers, Teresa M. Przytycka, Salem Malikic, Funda Ergun, Cenk Sahinalp. A bi-partition function algorithm to evaluate inferred subclonal structures in single-cell sequencing data [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2026; Part 1 (Regular Abstracts); 2026 Apr 17-22; San Diego, CA. Philadelphia (PA): AACR; Cancer Res 2026;86(7 Suppl):Abstract nr 6897.
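The sampling idea behind the bi-partition function estimate can be sketched as a plain Monte Carlo average: draw trees independently and count how often the cell subset R forms the clade rooted by mutation ρ. This is a simplification under stated assumptions, not the authors' estimator. The tree encoding (a mapping from each mutation to the cells in its clade), the function names, and the toy sampler with two candidate trees are all hypothetical; a real sampler would draw trees from a distribution derived from the single-cell mutation profiles.

```python
import random
from typing import Callable, Dict, FrozenSet, Iterable

# Hypothetical encoding: a tree maps each mutation to the set of cells
# in the clade rooted by that mutation.
Tree = Dict[str, FrozenSet[str]]


def forms_clade(tree: Tree, cells: Iterable[str], root_mutation: str) -> bool:
    """True if `cells` is exactly the clade rooted by `root_mutation`."""
    return tree.get(root_mutation) == frozenset(cells)


def estimate_bipartition(sample_tree: Callable[[random.Random], Tree],
                         cells: Iterable[str],
                         root_mutation: str,
                         n_iter: int = 10_000,
                         seed: int = 0) -> float:
    """Fraction of independently sampled trees in which the clade appears."""
    rng = random.Random(seed)
    target = frozenset(cells)
    hits = sum(forms_clade(sample_tree(rng), target, root_mutation)
               for _ in range(n_iter))
    return hits / n_iter


def toy_sampler(rng: random.Random) -> Tree:
    """Stand-in sampler: two candidate trees drawn with probability 0.7 / 0.3."""
    tree_a = {"m1": frozenset({"c1", "c2"}), "m2": frozenset({"c3"})}
    tree_b = {"m1": frozenset({"c1", "c2", "c3"}), "m2": frozenset({"c3"})}
    return tree_a if rng.random() < 0.7 else tree_b


if __name__ == "__main__":
    # Expected to be close to 0.7, the probability of tree_a in the toy sampler.
    print(estimate_bipartition(toy_sampler, {"c1", "c2"}, "m1"))
```

A value near 1 would indicate that R with root ρ is supported by almost all sampled trees, whereas a low value flags an unreliable subclone of the kind reported for the melanoma sublines.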
Understanding and comparing tumor evolutionary histories is fundamental to cancer genomics. Clonal trees, used to model tumor progression, are rooted, unordered trees in which each node represents a subclone labeled by a set of distinct mutations. To compare two clonal trees, we introduce omlta, the optimal multi-label tree alignment, which removes the minimum number of mutation labels from the trees so that the remaining trees are isomorphic. Computing omlta is NP-hard. Here, we present an algorithm to compute the omlta whose running time is expressed in terms of L and k, where L ≥ 1 is the total number of mutation labels occurring in the input trees and k is the minimum possible number of mutation labels that need to be removed for the alignment. Our implementation (https://github.com/algo-cancer/omlta) is the first computational tool for determining the optimal alignment between clonal trees. We applied omlta to 126 cases from the TRACERx study on non-small cell lung cancers and to melanoma single-cell data.
In the era of exponential data generation, a fast, consistent, and efficient string processing technique is necessary to represent extensive genomic data. One of the earliest string processing techniques, predating MinHash and minimizer-based sketching, is Locally Consistent Parsing (LCP). This technique partitions an input string and identifies short, exactly occurring substrings called cores, which collectively cover the input string while maintaining Partition and Labeling Consistency. The iterative application of LCP yields progressively longer cores in a compressed format, thereby substantially enhancing the efficiency of genomic sequence representation and downstream analysis. We have previously developed Lcptools as the first iterative implementation of LCP for the DNA alphabet and demonstrated its effectiveness in identifying cores with minimal collisions. Here, we introduce GenCore, a computational method that, for the first time, leverages LCP cores to sketch closely related large genomes and estimate the genomic distances between them, and that successfully reconstructs simulated progression trees. GenCore also successfully recapitulates primate phylogeny using both telomere-to-telomere (T2T) assemblies and PacBio HiFi reads for assembly-free comparisons. Availability: GenCore is available at https://github.com/BilkentCompGen/gencore