Staff Scientist, National Institutes of Health
Research Field: Oncology
In the era of exponential data generation, a fast, consistent, and efficient string processing technique is necessary to represent extensive genomic data. One of the earliest string processing techniques, predating MinHash and minimizer-based sketching, is Locally Consistent Parsing (LCP). This technique partitions an input string and identifies short, exactly occurring substrings called cores, which collectively cover the input string while maintaining Partition and Labeling Consistency. The iterative application of LCP yields progressively longer cores in a compressed format, thereby substantially enhancing the efficiency of genomic sequence representation and subsequent downstream analysis. We have previously developed Lcptools as the first iterative implementation of LCP for the DNA alphabet and demonstrated its effectiveness in identifying cores with minimal collisions. Here, we introduce GenCore, a computational method that leverages LCP cores for the first time to sketch and estimate genomic distances for closely related large genomes, and successfully reconstruct simulated progression trees. GenCore also successfully recapitulates primate phylogeny using both telomere-to-telomere (T2T) assemblies and the PacBio HiFi reads for assembly-free comparisons. Availability: GenCore is available at https://github.com/BilkentCompGen/gencore
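As a rough illustration of how a set-based sketch can yield a genomic distance, the following minimal Python sketch computes a Jaccard distance between two collections of core labels. The abstract does not state which distance GenCore actually uses, so the Jaccard measure, the function names, and the integer labels in the example are all assumptions made for illustration.

```python
from itertools import combinations


def jaccard_distance(cores_a, cores_b):
    """Jaccard distance between two collections of hashable core labels.

    Assumption: each genome is summarized as a set of core labels.
    This is not taken from GenCore; it is a generic set-based distance.
    """
    a, b = set(cores_a), set(cores_b)
    union = a | b
    if not union:
        # Two empty sketches are treated as identical.
        return 0.0
    return 1.0 - len(a & b) / len(union)


def pairwise_distances(sketches):
    """Return {(name1, name2): distance} for every unordered pair of genomes.

    `sketches` maps a (hypothetical) genome name to its collection of labels.
    """
    return {
        (x, y): jaccard_distance(sketches[x], sketches[y])
        for x, y in combinations(sorted(sketches), 2)
    }


if __name__ == "__main__":
    # Toy labels standing in for cores; real labels would come from an LCP run.
    toy = {"genomeA": {1, 2, 3}, "genomeB": {2, 3, 4}, "genomeC": {7, 8}}
    for pair, dist in pairwise_distances(toy).items():
        print(pair, dist)
```

A distance matrix of this kind could then be passed to a standard tree-building method such as neighbor joining, although the abstract does not say which method GenCore relies on.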
Efficient and consistent string processing is critical in the exponentially growing genomic data era. Locally Consistent Parsing (LCP) addresses this need by partitioning an input genome string into short, exactly matching substrings (e.g., "cores"), ensuring consistency across partitions. Labeling the cores of an input string consistently not only provides a compact representation of the input but also enables the reapplication of LCP to refine the cores over multiple iterations, providing a progressively longer and more informative set of substrings for downstream analyses. We present the first iterative implementation of LCP with Lcptools and demonstrate its effectiveness in identifying cores with minimal collisions. Experimental results show that the number of cores at the i^th iteration is O(n/c^i) for c ~ 2.34, while the average length and the average distance between consecutive cores are O(c^i). Compared to the popular sketching techniques, LCP produces significantly fewer cores, enabling a more compact representation and faster analyses. To demonstrate the advantages of LCP in genomic string processing in terms of computation and memory efficiency, we also introduce LCPan, an efficient variation graph constructor. We show that LCPan generates variation graphs >10x faster than vg, while using >13x less memory.
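To make the reported scaling concrete, the short Python sketch below evaluates n/c^i for c = 2.34. It assumes that the constant hidden in the O(n/c^i) bound is 1, which the abstract does not claim, so the numbers are order-of-magnitude guides only. The function names are hypothetical.

```python
def expected_core_count(n, level, c=2.34):
    """Approximate number of cores after `level` LCP iterations.

    Assumption: the constant hidden in O(n / c**level) is taken to be 1.
    """
    if level < 0:
        raise ValueError("level must be non-negative")
    return n / c ** level


def levels_needed(n, target_cores, c=2.34):
    """Smallest number of iterations after which the estimate is <= target_cores."""
    level = 0
    count = float(n)
    while count > target_cores:
        count /= c  # each iteration shrinks the core count by a factor of c
        level += 1
    return level


if __name__ == "__main__":
    genome_length = 3_100_000_000  # illustrative genome size, not from the abstract
    for i in range(1, 9):
        print(i, round(expected_core_count(genome_length, i)))
```

Under the same assumption, each additional iteration reduces the number of cores by a factor of about 2.34 while the average core length grows by the same factor.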
Tumor evolution is driven by various mutational processes, ranging from single-nucleotide variants (SNVs) to large structural variants (SVs) to dynamic shifts in DNA methylation. Current short-read sequencing methods struggle to accurately capture the full spectrum of these genomic and epigenomic alterations due to inherent technical limitations. To overcome this, we introduce an approach for long-read sequencing of single-cell-derived subclones and use it to profile 23 subclones of a mouse melanoma cell line, characterized by distinct growth phenotypes and treatment responses. We develop a computational framework for harmonization and joint analysis of different variant types in the evolutionary context. Uniquely, our framework enables detection of recurrent amplifications of putative driver genes, generated by independent SVs across different lineages, suggesting parallel evolution. In addition, our approach revealed gradual and lineage-specific methylation changes associated with aggressive clonal phenotypes. We also show that our set of phylogeny-constrained variant calls, along with openly released sequencing data, can be a valuable resource for the development of new computational methods.
Most human cancers arise from somatic alterations, ranging from single nucleotide variations to structural variations (SVs) that can alter the genomic organization. Pathogenic SVs are identified in various cancer types and subtypes, and they play a crucial role in diagnosis and patient stratification. However, the studies on structural variations have been limited due to biological and computational challenges, including tumor heterogeneity, aneuploidy, and the diverse spectrum of SVs from simpler deletions and focal amplifications to catastrophic events shuffling large fragments from one or multiple chromosomes. Long-read sequencing provides the advantage of improved mappability and direct haplotype phasing. Yet, no tool currently exists to comprehensively analyze complex rearrangements within the cancer genome using long-read sequencing. Here, we present Severus, a tool for somatic SV calling and complex SV characterization using long reads. Severus first detects individual SV junctions from phased split alignments, then constructs a phased breakpoint graph to cluster junctions into complex rearrangement events. We first benchmarked the somatic SV calling performance using six tumor/normal cell line pairs (HCC1395, H1437, H2009, HCC1937, HCC1954, Hs578T). We sequenced all cell lines with Illumina, ONT, and PacBio HiFi. We then established a set of high-confidence calls supported by multiple technologies and tools. Severus consistently had the highest F1 scores compared to the HiFi, ONT, and Illumina methods against this high-confidence SV call set. We then extended our analysis to complex SVs. Severus accurately detected complex events, i.e., chromothripsis and chromoplexy, and templated insertion cycles/chains (TIC), reported for these cell lines. We then compared Severus’ performance with Jabba and Linx, two widely used tools for complex SV calling in short-read sequencing.
Our comparison revealed that Severus showed higher agreement with Linx, while Jabba failed to detect most of the SV clusters identified by both Severus and Linx. Severus also outperformed the other tools in characterizing complex reciprocal translocations and TICs. Most of the junctions in complex SVs called by either of the tools but not Severus were either simple SVs with a single long-read junction or were not present in long-read sequencing. In contrast, Severus effectively resolved overlapping SVs by utilizing long-read connectivity, allowing for more accurate clustering of smaller genomic segments. We have also applied Severus to seventeen pediatric leukemia cases. Severus identified two chromoplexy events and two cryptic translocations that were missed by FISH and karyotype panels and were incomplete in Illumina SV calls; these were further validated by RNA-seq. This highlights the potential of the long-read whole genome sequencing approach for diagnosing complex cases driven by SVs. Ayse Keskus, Asher Bryant, Tanveer Ahmad, Anton Goretsky, Byunggil Yoo, Sergey Aganezov, Ataberk Donmez, Lisa A. Lansdon, Isabel Rodriguez, Jimin Park, Yuelin Liu, Xiwen Cui, Joshua Gardner, Brandy McNulty, Samuel Sacco, Jyoti Shetty, Yongmei Zhao, Bao Tran, Giuseppe Narzisi, Adrienne Helland, Daniel Cook, Pi-Chuan Chang, Alexey Kolesnikov, Andrew Carroll, Erin Molloy, Chengpeng Bi, Adam Walter, Margaret Gibson, Irina Pushel, Erin Guest, Tomi Pastinen, Kishwar Shafin, Karen Miga, Salem Malikic, Chi-Ping Day, Nicolas Robine, Cenk Sahinalp, Michael Dean, Midhat S. Farooqi, Benedict Paten, Mikhail Kolmogorov. Severus: A tool for detecting and characterizing complex structural variants in cancer using long-read sequencing [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2025; Part 1 (Regular Abstracts); 2025 Apr 25-30; Chicago, IL. Philadelphia (PA): AACR; Cancer Res 2025;85(8_Suppl_1):Abstract nr 2848.
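The idea of clustering SV junctions into larger events through a graph can be illustrated with a small Python sketch. This is not Severus's algorithm: it ignores haplotype phasing entirely, and the linking rule (two junctions are connected when any of their breakpoints lie on the same chromosome within `max_gap` bases) as well as the `max_gap` default are assumptions chosen for illustration.

```python
from itertools import combinations


def cluster_junctions(junctions, max_gap=50_000):
    """Group junctions into connected components with union-find.

    junctions: list of ((chrom, pos), (chrom, pos)) breakpoint pairs.
    max_gap:   hypothetical distance threshold for linking breakpoints.
    Returns a sorted list of clusters, each a list of junction indices.
    """
    parent = list(range(len(junctions)))

    def find(x):
        # Walk to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def close(bp1, bp2):
        return bp1[0] == bp2[0] and abs(bp1[1] - bp2[1]) <= max_gap

    # Quadratic pairwise comparison; adequate for a toy example only.
    for i, j in combinations(range(len(junctions)), 2):
        if any(close(a, b) for a in junctions[i] for b in junctions[j]):
            parent[find(i)] = find(j)

    clusters = {}
    for idx in range(len(junctions)):
        clusters.setdefault(find(idx), []).append(idx)
    return sorted(clusters.values())


if __name__ == "__main__":
    toy = [
        (("chr1", 100), ("chr2", 5_000)),
        (("chr2", 6_000), ("chr3", 900)),
        (("chr5", 10), ("chr5", 2_000_000)),
    ]
    print(cluster_junctions(toy))
```

In this toy input the first two junctions share nearby breakpoints on chr2 and fall into one cluster, while the third stays on its own. A phased breakpoint graph would additionally separate events by haplotype, which this sketch does not attempt.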
Melanoma, a highly heterogeneous cancer, evolves through a complex interplay of genetic alterations, including both single nucleotide variants (SNVs) and structural variants (SVs). To study the evolutionary trajectory of melanoma, we established a model system composed of 24 single-cell-derived clonal sublines (C1-C24) from the M4 melanoma model, developed in a genetically engineered hepatocyte growth factor (HGF)-transgenic mouse. While SNVs have been extensively used to construct phylogenetic trees using Trisicell (Triple-toolkit for single-cell intratumor heterogeneity inference), a tool that analyzes intratumor heterogeneity and single-cell RNA mutations, the role and timing of SVs in melanoma evolution remain less well understood. This study integrates SV data with an SNV-driven phylogeny to investigate whether SV patterns align with SNV-based evolutionary trajectories in the mouse melanoma model, providing insights into the functional impact of SVs during tumor progression. We performed long-read sequencing on the 24 clonal sublines and detected SVs using Severus, a tool optimized for phasing in long-read sequencing. The SVs were mapped to the SNV-driven phylogeny using R and classified as either concordant (aligning with the SNV-based tree) or discordant (deviating from the SNV phylogeny). Gene ontology enrichment analysis revealed that concordant SVs were significantly enriched in genes associated with the hepatocyte growth factor receptor signaling pathway and the negative regulation of peptidyl-threonine phosphorylation, both of which represent core drivers of tumor progression. In contrast, discordant SVs were associated with a broader range of functional pathways, including the positive regulation of antigen receptor-mediated signaling and the regulation of natural killer cell-mediated cytotoxicity, though the exact mechanisms underlying these associations remain unclear. 
By integrating these SVs with an established SNV-driven phylogeny, this study highlights the distinct and critical roles SVs play in melanoma evolution. Concordant SVs appear to drive core oncogenic processes, while discordant SVs may contribute to other aspects of tumor evolution. These findings underscore the importance of considering SVs alongside SNVs to fully capture the complexity of melanoma evolution. Ongoing investigations will continue to explore the functional implications of these SVs and how the gene disruption patterns they cause shape the evolutionary trajectory of melanoma, offering potential targets for future therapeutic strategies. Xiwen Cui, Ayse G. Keskus, Salem Malikic, Yuelin Liu, Anton Goretsky, Chi-Ping Day, Farid R. Mehrabadi, Mikhail Kolmogorov, Glenn Merlino, S. Cenk Sahinalp. Integrating structural variants and single nucleotide variants to uncover evolutionary trajectories in melanoma [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2025; Part 1 (Regular Abstracts); 2025 Apr 25-30; Chicago, IL. Philadelphia (PA): AACR; Cancer Res 2025;85(8_Suppl_1):Abstract nr 3898.
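One simple way to express the concordant versus discordant distinction is sketched below in Python. The abstract does not give the exact criterion (and reports that the mapping was done in R), so this sketch assumes an SV is concordant when the set of clones carrying it equals the leaf set of some clade in the SNV-based tree. The tree topology and function names are hypothetical; only the clone naming scheme (C1 to C24) comes from the text.

```python
def clade_leaf_sets(tree):
    """Collect the leaf set of every node in a nested-tuple tree.

    A leaf is any non-tuple value (here, a clone name); an internal node is
    a tuple of children. Returns a set of frozensets.
    """
    clades = set()

    def visit(node):
        if isinstance(node, tuple):
            leaves = frozenset().union(*(visit(child) for child in node))
        else:
            leaves = frozenset([node])
        clades.add(leaves)
        return leaves

    visit(tree)
    return clades


def classify_sv(carriers, clades):
    """Label an SV by whether its carrier clones form exactly one clade.

    Assumed criterion, not necessarily the one used in the study.
    """
    return "concordant" if frozenset(carriers) in clades else "discordant"


if __name__ == "__main__":
    # Hypothetical four-clone topology for illustration only.
    toy_tree = ((("C1", "C2"), "C3"), "C4")
    clades = clade_leaf_sets(toy_tree)
    print(classify_sv({"C1", "C2"}, clades))
    print(classify_sv({"C1", "C3"}, clades))
```

A stricter or looser rule (for example, tolerating a missed call in one clone) would change which SVs count as discordant, so the threshold for agreement is a design choice rather than a fixed definition.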