Natural killer (NK) cells are essential components of the innate immune system, with their activity significantly regulated by Killer cell Immunoglobulin-like Receptors (KIRs). The diversity and structural complexity of KIR genes present significant challenges for accurate genotyping, essential for understanding NK cell functions and their implications in health and disease. Traditional genotyping methods struggle with the variable nature of KIR genes, leading to inaccuracies that can impede immunogenetic research. These challenges extend to high-quality phased assemblies, which have been recently popularized by the Human Pangenome Consortium. This paper introduces BAKIR (Biologically-informed Annotator for KIR locus), a tailored computational tool designed to overcome the challenges of KIR genotyping and annotation on high-quality, phased genome assemblies. BAKIR aims to enhance the accuracy of KIR gene annotations by structuring its annotation pipeline around identifying key functional mutations, thereby improving the identification and subsequent relevance of gene and allele calls. It uses a multi-stage mapping, alignment, and variant calling process to ensure high-precision gene and allele identification, while also maintaining high recall for sequences that are significantly mutated or truncated relative to the known allele database. BAKIR has been evaluated on a subset of the HPRC assemblies, where BAKIR was able to improve many of the associated annotations and call novel variants. BAKIR is freely available on GitHub, offering ease of access and use through multiple installation methods, including pip, conda, and singularity container, and is equipped with a user-friendly command-line interface, thereby promoting its adoption in the scientific community.
Background Next-generation sequencing (NGS), including whole genome sequencing (WGS) and whole exome sequencing (WES), is increasingly being used for clinical care. While NGS data have the potential to be repurposed to support clinical pharmacogenomics (PGx), current computational approaches have not been widely validated using clinical data. In this study, we assessed the accuracy of the Aldy computational method to extract PGx genotypes from WGS and WES data for 14 and 13 major pharmacogenes, respectively. Methods Germline DNA was isolated from whole blood samples collected for 264 patients seen at our institutional molecular solid tumor board. DNA was used for panel-based genotyping within our institutional Clinical Laboratory Improvement Amendments- (CLIA-) certified PGx laboratory. DNA was also sent to other CLIA-certified commercial laboratories for clinical WGS or WES. Aldy v3.3 and v4.4 were used to extract PGx genotypes from these NGS data, and results were compared to the panel-based genotyping reference standard that contained 45 star allele-defining variants within CYP2B6, CYP2C8, CYP2C9, CYP2C19, CYP2D6, CYP3A4, CYP3A5, CYP4F2, DPYD, G6PD, NUDT15, SLCO1B1, TPMT, and VKORC1. Results Mean WGS read depth was >30x for all variant regions except for G6PD (average read depth was 29 reads), and mean WES read depth was >30x for all variant regions. For 94 patients with WGS, Aldy v3.3 diplotype calls were concordant with those from the genotyping reference standard in 99.5% of cases when excluding diplotypes with additional major star alleles not tested by targeted genotyping, ambiguous phasing, and CYP2D6 hybrid alleles. Aldy v3.3 identified 15 additional clinically actionable star alleles not covered by genotyping within CYP2B6, CYP2C19, DPYD, SLCO1B1, and NUDT15. Within the WGS cohort, Aldy v4.4 diplotype calls were concordant with those from genotyping in 99.7% of cases.
When excluding patients with CYP2D6 copy number variation, all Aldy v4.4 diplotype calls except for one CYP3A4 diplotype call were concordant with genotyping for 161 patients in the WES cohort. Conclusion Aldy v3.3 and v4.4 called diplotypes for major pharmacogenes from clinical WES and WGS data with >99% accuracy. These findings support the use of Aldy to repurpose clinical NGS data to inform clinical PGx.
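The concordance figures reported above compare diplotype calls from two sources after setting aside certain cases. As a rough illustration only, the sketch below shows one way such a comparison could be computed in Python. The function names, the dictionary layout keyed by (sample, gene), and the toy data are all assumptions made for this example; they are not taken from the study or from Aldy.

```python
def normalize(diplotype: str) -> tuple:
    """Treat a diplotype such as '*1/*17' as an unordered pair of star alleles."""
    return tuple(sorted(allele.strip() for allele in diplotype.split("/")))


def concordance(reference: dict, calls: dict, excluded=frozenset()) -> float:
    """Fraction of shared (sample, gene) keys whose diplotypes match.

    Keys listed in `excluded` are skipped, mirroring the idea of leaving out
    cases (e.g. ambiguous phasing) that the reference cannot adjudicate.
    """
    keys = [k for k in reference if k in calls and k not in excluded]
    if not keys:
        raise ValueError("no comparable calls")
    matches = sum(normalize(reference[k]) == normalize(calls[k]) for k in keys)
    return matches / len(keys)


# Hypothetical toy data, not taken from the study.
panel = {
    ("S1", "CYP2C19"): "*1/*17",
    ("S1", "TPMT"): "*1/*1",
    ("S2", "CYP2C19"): "*2/*2",
    ("S2", "TPMT"): "*1/*3A",
}
ngs = {
    ("S1", "CYP2C19"): "*17/*1",  # same diplotype, alleles listed in reverse
    ("S1", "TPMT"): "*1/*1",
    ("S2", "CYP2C19"): "*1/*2",   # discordant call
    ("S2", "TPMT"): "*1/*3A",
}

rate_all = concordance(panel, ngs)
rate_filtered = concordance(panel, ngs, excluded={("S2", "CYP2C19")})
print(rate_all, rate_filtered)
```

Sorting the alleles before comparison means that "*1/*17" and "*17/*1" count as the same call, which is usually the intended behavior when phase order carries no meaning.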
Domain-specific languages (DSLs) are able to provide intuitive high-level abstractions that are easy to work with while attaining better performance than general-purpose languages. Yet, implementing new DSLs is a burdensome task. As a result, new DSLs are usually embedded in general-purpose languages. While low-level languages like C or C++ often provide better performance as a host than high-level languages like Python, high-level languages are becoming more prevalent in many domains due to their ease and flexibility. Here, we present Codon, a domain-extensible compiler and DSL framework for high-performance DSLs with Python's syntax and semantics. Codon builds on previous work on ahead-of-time type checking and compilation of Python programs and leverages a novel intermediate representation to easily incorporate domain-specific optimizations and analyses. We showcase and evaluate several compiler extensions and DSLs for Codon targeting various domains, including bioinformatics, secure multi-party computation, block-based data compression and parallel programming, showing that Codon DSLs can provide benefits of familiar high-level languages and achieve performance typically only seen with low-level languages, thus bridging the gap between performance and usability.
High-throughput sequencing provides sufficient means for determining genotypes of clinically important pharmacogenes that can be used to tailor medical decisions to individual patients. However, pharmacogene genotyping, also known as star-allele calling, is a challenging problem that requires accurate copy number calling, structural variation identification, variant calling, and phasing within each pharmacogene copy present in the sample. Here we introduce Aldy 4, a fast and efficient tool for genotyping pharmacogenes that uses combinatorial optimization for accurate star-allele calling across different sequencing technologies. Aldy 4 adds support for long reads and uses a novel phasing model and improved copy number and variant calling models. We compare Aldy 4 against the current state-of-the-art star-allele callers on a large and diverse set of samples and genes sequenced by various sequencing technologies, such as whole-genome and targeted Illumina sequencing, barcoded 10x Genomics, and Pacific Biosciences (PacBio) HiFi. We show that Aldy 4 is the most accurate star-allele caller with near-perfect accuracy in all evaluated contexts, and hope that Aldy remains an invaluable tool in the clinical toolbox even with the advent of long-read sequencing technologies.
Pharmacogenomics (PGx)-guided drug treatment is one of the cornerstones of personalized medicine. However, the genes involved in drug response are highly complex and known to carry many (rare) variants. Current technologies (short-read sequencing and SNP panels) are limited in their ability to resolve these genes and characterize all variants. Moreover, these technologies cannot always phase variants to their allele of origin. Recent advances in long-read sequencing technologies have shown promise in resolving these problems. Here we present a long-read sequencing panel-based approach for PGx using PacBio HiFi sequencing. A capture-based approach was developed using a custom panel of clinically relevant pharmacogenes including up- and downstream regions. A total of 27 samples were sequenced and panel accuracy was determined using benchmarking variant calls for 3 Genome in a Bottle samples and GeT-RM star(*)-allele calls for 21 samples. The coverage was uniform for all samples with an average of 94% of bases covered at >30×. When compared to benchmarking results, accuracy was high with an average F1 score of 0.89 for INDELs and 0.98 for SNPs. Phasing was good with an average of 68% of the target region phased (compared to ~20% for short reads) and an average phased haploblock size of 6.6 kbp. Using Aldy 4, we compared our variant calls to GeT-RM data for 8 genes (CYP2B6, CYP2C19, CYP2C9, CYP2D6, CYP3A4, CYP3A5, SLCO1B1, TPMT), and observed highly accurate star(*)-allele calling with 98.2% concordance (165/168 calls), with only one discordance in CYP2C9 leading to a different predicted phenotype. We have shown that our long-read panel-based approach results in high accuracy and target phasing for SNVs as well as for clinical star(*)-alleles.
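The two accuracy measures quoted above can be checked with a few lines of arithmetic. The sketch below, in Python, reproduces the 98.2% concordance from the stated 165/168 calls (168 being 21 samples times 8 genes) and shows the standard F1 definition as the harmonic mean of precision and recall. The true-positive, false-positive, and false-negative counts passed to `f1_score` are made up for illustration and do not come from the paper.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Standard F1: harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Figures stated in the abstract: 21 GeT-RM samples, 8 genes, 165 concordant calls.
total_calls = 21 * 8
concordance_pct = round(100 * 165 / total_calls, 1)

# Hypothetical counts, chosen only to illustrate the formula.
example_f1 = f1_score(tp=98, fp=2, fn=2)

print(total_calls, concordance_pct, round(example_f1, 2))
```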
High-throughput sequencing provides sufficient means for determining genotypes of clinically important pharmacogenes that can be used to tailor medical decisions to individual patients. However, pharmacogene genotyping, also known as star-allele calling, is a challenging problem that requires accurate copy number calling, structural variation discovery, variant calling and phasing within each pharmacogene copy present in the sample. Here we introduce Aldy 4, a fast and efficient tool for genotyping pharmacogenes that utilizes combinatorial optimization for accurate star-allele calling across different sequencing technologies. Aldy 4 adds support for long reads and ships with a novel phasing model and improved copy number and variant calling models. We compare Aldy 4 against the current state-of-the-art star-allele callers on a large and diverse set of samples and genes sequenced by various sequencing technologies, such as whole-genome and targeted Illumina sequencing, barcoded 10X Genomics and PacBio HiFi. We show that Aldy 4 is the most accurate star-allele caller with near-perfect accuracy in all evaluated contexts. We hope that Aldy remains an invaluable tool in the clinical toolbox even with the advent of long-read sequencing technologies. Availability Aldy 4 is available at https://github.com/0xTCG/aldy.
Background Pharmacogenomics (PGx) testing can reduce toxicities and improve efficacy of several drugs used to treat cancer and associated symptoms. PGx results can be determined from germline whole-exome sequencing (WES), but somatic mutations may cause discordance between tumor and germline DNA. Since clinical diagnostic sequencing in oncology frequently only includes tumor DNA, there would be clinical value in calling germline PGx genotypes from tumor DNA. Thus, the purpose of this study was to assess the feasibility of using somatic WES data to call germline PGx genotypes. Methods Germline and somatic WES data were obtained as part of the clinical workflow for 64 patients treated at the solid molecular tumor board clinic at Indiana University. Aldy v3.3 was implemented in LifeOmic’s Precision Health Cloud™ to call PGx genotypes from somatic WES. Somatic Aldy calls were compared with previously validated Aldy germline calls for 8 genes: CYP2C9, CYP2C19, CYP2D6, CYP3A4, CYP3A5, CYP4F2, DPYD, and TPMT. Somatic read depth was >100x, except for the intronic CYP3A4*22 variant, which was >30x. Results Somatic and germline Aldy calls were compared for a total of 512 genotypes and 56 (11%) calls were discordant. Discordant calls were most common for CYP2B6 (23.4%), followed by CYP2D6 (14.1%), CYP2C19 (10.9%), CYP2C8 (6.3%), and DPYD (6.3%). In contrast, all Aldy calls were concordant for CYP3A5 and TPMT. 38 out of 64 subjects (59%) had discordant calls for at least one gene. The most common first cancer diagnoses in our cohort were colorectal (9.3%), breast (7.8%), and pancreatic (7.8%), and the rates of discordant Aldy calls did not differ by cancer type (p>0.05 for all cancer types). Based on our analyses of discordant calls, we anticipate that adjusting Aldy’s thresholds for variant calling may allow Aldy to determine genotypes from somatic WES data. 
Conclusion In most cases, genotype calls of drug metabolism genes from tumor DNA reflected the germline genotypes; however, additional work needs to be done to determine if the remaining discordant calls can be corrected by modifying the informatics tools or if they are due to somatic mutations. Citation Format: Wilberforce A. Osei, Tyler Shugg, Reynold C. Ly, Steven M. Bray, Benjamin A. Salisbury, Ryan R. Ratcliff, Victoria M. Pratt, Ibrahim Numanagić, Todd Skaar. Pharmacogenomics genotyping from clinical somatic whole exome sequencing: Aldy, a computational tool [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2022; 2022 Apr 8-13. Philadelphia (PA): AACR; Cancer Res 2022;82(12_Suppl):Abstract nr 1151.
Genomic data leaks are irreversible. Leaked DNA cannot be changed, stays disclosed indefinitely, and affects the owner's family members as well. Recent large-scale genomic data collections [1], [2] render traditional privacy protection mechanisms, such as the Health Insurance Portability and Accountability Act (HIPAA), inadequate against novel security attacks [3]. On the other hand, data access restrictions hinder important clinical research that requires large datasets [4]. These concerns can be naturally addressed by employing privacy-enhancing technologies such as secure multiparty computation (MPC) [5]–[10]. Secure MPC enables computation on data without disclosing the data itself: the data and the computation are divided among multiple computing parties in a distributed manner, so that no individual party can access the raw data. MPC systems are being increasingly adopted in fields that operate on sensitive datasets [11]–[13], such as computational genomics and biomedical research [14]–[22].
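To make the idea of dividing data between computing parties concrete, here is a minimal Python sketch of additive secret sharing, one common building block of MPC protocols. It is offered as a generic illustration, not as the scheme used by any of the cited systems. The modulus, the three-party setup, and all function names are assumptions chosen for this example.

```python
import secrets

MODULUS = 2**61 - 1  # illustrative modulus; real systems choose this per protocol


def share(value: int, n_parties: int = 3) -> list:
    """Split `value` into n additive shares that sum to it modulo MODULUS.

    Each share on its own is a uniformly random number and reveals nothing
    about the value.
    """
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares


def add_shared(a: list, b: list) -> list:
    """Each party adds its own two shares locally; no communication is needed."""
    return [(x + y) % MODULUS for x, y in zip(a, b)]


def reconstruct(shares: list) -> int:
    """Combine all shares to recover the underlying value."""
    return sum(shares) % MODULUS


# Two private inputs, e.g. allele counts held by different data owners.
shares_a = share(12)
shares_b = share(30)
total = reconstruct(add_shared(shares_a, shares_b))
print(total)
```

Addition works share by share because the scheme is linear. Multiplying shared values requires interaction between the parties and extra machinery that this sketch leaves out.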
Exponentially-growing next-generation sequencing data requires high-performance tools and algorithms. Nevertheless, the implementation of high-performance computational genomics software is inaccessible to many scientists because it requires extensive knowledge of low-level software optimization techniques, forcing scientists to resort to high-level software alternatives that are less efficient. Here, we introduce Seq—a Python-based optimization framework that combines the power and usability of high-level languages like Python with the performance of low-level languages like C or C++. Seq allows for shorter, simpler code, is readily usable by a novice programmer, and obtains significant performance improvements over existing languages and frameworks. We showcase and evaluate Seq by implementing seven standard, widely-used applications from all stages of the genomics analysis pipeline, including genome index construction, finding maximal exact matches, long-read alignment and haplotype phasing, and demonstrate its implementations are up to an order of magnitude faster than existing hand-optimized implementations, with just a fraction of the code. By enabling researchers of all backgrounds to easily implement high-performance analysis tools, Seq further opens the door to the democratization and scalability of computational genomics.