Exponentially growing next-generation sequencing data require high-performance tools and algorithms. However, implementing high-performance computational genomics software remains inaccessible to many scientists because it requires extensive knowledge of low-level software optimization techniques, forcing scientists to resort to high-level software alternatives that are less efficient. Here, we introduce Seq, a Python-based optimization framework that combines the power and usability of high-level languages like Python with the performance of low-level languages like C or C++. Seq allows for shorter, simpler code, is readily usable by a novice programmer, and obtains significant performance improvements over existing languages and frameworks. We showcase and evaluate Seq by implementing seven standard, widely used applications from all stages of the genomics analysis pipeline, including genome index construction, finding maximal exact matches, long-read alignment and haplotype phasing, and demonstrate that its implementations are up to an order of magnitude faster than existing hand-optimized implementations, with just a fraction of the code. By enabling researchers of all backgrounds to easily implement high-performance analysis tools, Seq further opens the door to the democratization and scalability of computational genomics.
The scope and scale of biological data are increasing at an exponential rate, as technologies like next-generation sequencing become radically cheaper and more prevalent. Over the last two decades, the cost of sequencing a genome has dropped from $100 million to nearly $100, a factor of over 10⁶, and the amount of data to be analyzed has increased proportionally. Yet, as Moore's Law continues to slow, computational biologists can no longer rely on computing hardware to compensate for the ever-increasing size of biological datasets. In a field where many researchers are primarily focused on biological analysis rather than computational optimization, the unfortunate solution to this problem is often simply to buy larger and faster machines. Here, we introduce Seq, the first language tailored specifically to bioinformatics, which marries the ease and productivity of Python with C-like performance. Seq starts with a subset of Python (and is in many cases a drop-in replacement) yet also incorporates novel bioinformatics- and computational genomics-oriented data types, language constructs and optimizations. Seq enables users to write high-level, Pythonic code without having to worry about low-level or domain-specific optimizations, and allows for the seamless expression of the algorithms, idioms and patterns found in many genomics and bioinformatics applications. We evaluated Seq on several standard computational genomics tasks, such as reverse complementation, k-mer manipulation, sequence pattern matching and large genomic index queries. Compared with equivalent CPython code, Seq attains a performance improvement of up to two orders of magnitude, and a 160× improvement once domain-specific language features and optimizations are used; with parallelism, we demonstrate up to a 650× improvement. Compared to optimized C++ code, which is already difficult for most biologists to produce, Seq frequently attains up to a 2× improvement, with shorter, cleaner code. Thus, Seq opens the door to an age of democratization of highly optimized bioinformatics software.
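To make the comparison concrete, the following plain-Python sketch implements two of the tasks named above, reverse complementation and k-mer counting; the function names and the k value are illustrative rather than taken from the Seq benchmarks. Loops of exactly this shape are what run up to two orders of magnitude faster when compiled by Seq, which additionally provides dedicated sequence and k-mer types in place of plain strings.

```python
from collections import Counter

# Illustrative CPython baseline (assumed example, not benchmark code):
# reverse complementation and k-mer counting over plain strings. Seq
# compiles the same Pythonic style to native code with k-mer types.

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def revcomp(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]

def count_kmers(seq: str, k: int = 21) -> Counter:
    """Count canonical k-mers: each k-mer is folded together with its
    reverse complement, a common genomics idiom."""
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        counts[min(kmer, revcomp(kmer))] += 1
    return counts

print(revcomp("ACGTTG"))                               # CAACGT
print(count_kmers("ACGTACGTACGT", k=4).most_common(2))
```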
Motivation: Segmental duplications (SDs), or low-copy repeats, are segments of DNA >1 kbp with high sequence identity that are copied to other regions of the genome. SDs are among the most important sources of evolution, a common cause of genomic structural variation, and several are associated with diseases of genomic origin, including schizophrenia and autism. Despite their functional importance, SDs present one of the major hurdles for de novo genome assembly due to the ambiguity they cause in building and traversing both state-of-the-art overlap-layout-consensus and de Bruijn graphs. This causes SD regions to be misassembled, collapsed into a unique representation, or completely missing from assembled reference genomes for various organisms. In turn, this missing or incorrect information limits our ability to fully understand the evolution and architecture of genomes. Despite the essential need to accurately characterize SDs in assemblies, only one tool has been developed for this purpose: Whole-Genome Assembly Comparison (WGAC), whose primary goal is SD detection. WGAC comprises several steps that employ different tools and custom scripts, which makes this strategy difficult and time-consuming to use. Thus, there is still a need for algorithms that characterize within-assembly SDs quickly, accurately, and in a user-friendly manner.

Results: Here we introduce the SEgmental Duplication Evaluation Framework (SEDEF), which rapidly detects SDs through sophisticated filtering strategies based on Jaccard similarity and local chaining. We show that SEDEF accurately detects SDs while maintaining a substantial speedup over WGAC that translates into practical run times of minutes instead of weeks. Notably, our algorithm captures up to 25% pairwise error between segments, whereas previous studies focused on only 10%, allowing us to more deeply track the evolutionary history of the genome.

Availability and implementation: SEDEF is available at https://github.com/vpc-ccg/sedef.
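The Jaccard-based filter named in the abstract can be sketched in a few lines: two segments are kept as a candidate SD pair for the more expensive local-chaining stage only if their k-mer sets are sufficiently similar. The sketch below is a simplified illustration; the k-mer length and threshold are assumptions, not SEDEF's actual parameters, and SEDEF's real filter includes further machinery this toy version omits.

```python
def kmer_set(seq: str, k: int = 12) -> set:
    """All k-mers of a segment as a set (presence only, no counts)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two k-mer sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def is_sd_candidate(seg1: str, seg2: str, k: int = 12,
                    threshold: float = 0.5) -> bool:
    """Cheap filter: pass a segment pair to local chaining only if its
    k-mer Jaccard similarity clears the threshold."""
    return jaccard(kmer_set(seg1, k), kmer_set(seg2, k)) >= threshold
```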
The original version of this Article contained errors in the affiliations of the authors Ibrahim Numanagić and Thomas A. Courtade, which were incorrectly given as ‘Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA 94720, USA’ and ‘Computer Science & Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139, USA’, respectively. Also, the hyperlink for the source code in the Data Availability section was incorrectly given as https://github.iu.edu/kzhu/assembltrie, which links to a page that is not publicly accessible. The source code is publicly accessible at https://github.com/kyzhu/assembltrie. Furthermore, in the PDF version of the Article, the right-hand side of Figure 3 was inadvertently cropped. These errors have now been corrected in both the PDF and HTML versions of the Article.
The most effective genomic data compression methods either assemble reads into contigs or replace them with their alignment positions on a reference genome. Such methods require significant computational resources, but faster alternatives that avoid using explicit or de novo-constructed references fail to match their performance. Here, we introduce a new reference-free compressed representation for genomic data based on light de novo assembly of reads, where each read is represented as a node in a (compact) trie. We show how to efficiently build such tries to compactly represent reads and demonstrate that among all methods using this representation (including all de novo assembly-based methods), our method achieves the shortest possible output. We also provide a lower bound on the compression rate achievable on uniformly sampled genomic read data, which our method approximates well. Our method significantly improves on the compression performance of alternatives without compromising speed. The increase in high-throughput sequencing (HTS) data warrants compression methods that facilitate better storage and communication. Here, Ginart et al. introduce Assembltrie, a reference-free compression tool that is guaranteed to achieve optimality for error-free reads.
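The core representation can be pictured with a minimal, uncompacted read trie: each read is inserted base by base, so reads sharing prefixes share nodes, which is where the compression comes from. This is only a sketch under that simplification; Assembltrie's actual structure is a compact trie with additional encoding machinery, and all names below are invented for illustration.

```python
class TrieNode:
    """One node of a read trie; children keyed by nucleotide."""
    __slots__ = ("children", "read_end")

    def __init__(self):
        self.children = {}      # base ('A'/'C'/'G'/'T') -> TrieNode
        self.read_end = 0       # number of reads ending at this node

def insert_read(root: TrieNode, read: str) -> None:
    """Insert one read; shared prefixes reuse existing nodes, so
    overlapping reads are stored compactly."""
    node = root
    for base in read:
        node = node.children.setdefault(base, TrieNode())
    node.read_end += 1

root = TrieNode()
for r in ["ACGTAC", "ACGTTT", "ACGA"]:
    insert_read(root, r)
# The shared prefix "ACG" of all three reads is stored only once.
```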
High-throughput sequencing provides the means to determine the allelic decomposition of any gene of interest: the number of copies and the exact sequence content of each copy of the gene. Although many clinically and functionally important genes are highly polymorphic and have undergone structural alterations, no high-throughput sequencing data analysis tool has yet been designed to effectively solve the full allelic decomposition problem. Here we introduce a combinatorial optimization framework that successfully resolves this challenging problem, including for genes with structural alterations. We provide an associated computational tool, Aldy, which performs allelic decomposition of highly polymorphic, multi-copy genes using whole or targeted genome sequencing data. For a large, diverse sequencing data set, Aldy identifies multiple rare and novel alleles for several important pharmacogenes, significantly improving upon the accuracy and utility of current genotyping assays. As more data sets become available, we expect Aldy to become an essential component of genotyping toolkits. Many genes of functional and clinical significance are highly polymorphic and experience structural alterations. Here, Numanagić et al. develop Aldy, a computational tool for resolving the copy number and the sequence content of each copy of a gene by analyzing whole or targeted genome sequencing data.
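As a toy illustration of the combinatorial optimization at the heart of this problem, the brute-force sketch below chooses, among candidate alleles defined by variant sets, the copy-number-many combination whose combined variants best match the observed ones. Aldy's real formulation operates on sequencing data and is far more involved; every name, allele and scoring rule here is a hypothetical simplification.

```python
from itertools import combinations_with_replacement

def decompose(observed: set, alleles: dict, copy_number: int):
    """Toy allelic decomposition: choose `copy_number` alleles (with
    repetition) whose variant union best explains the observed
    variants, scored by symmetric difference. Illustrative only."""
    best, best_score = None, float("inf")
    for combo in combinations_with_replacement(sorted(alleles), copy_number):
        explained = set().union(*(alleles[a] for a in combo))
        score = len(observed ^ explained)  # unexplained + unsupported
        if score < best_score:
            best, best_score = combo, score
    return best, best_score

# Hypothetical star-alleles, each defined by a set of variants:
alleles = {"*1": set(), "*2": {"v1"}, "*4": {"v1", "v2"}}
print(decompose({"v1", "v2"}, alleles, copy_number=2))  # (('*1', '*4'), 0)
```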
Recent years have seen the emergence of several "third-generation" sequencing platforms, each of which aims to address shortcomings of standard next-generation short-read sequencing by producing data that capture long-range information, thereby allowing us to access regions of the genome that are inaccessible with short reads alone. These technologies either produce physically longer reads, typically with higher error rates, or instead capture long-range information at low error rates by virtue of read "barcodes", as in 10x Genomics' Chromium platform. As with virtually all sequencing data, sequence alignment for third-generation sequencing data is the foundation on which all downstream analyses are based. Here we introduce a latent variable model for improving barcoded read alignment, thereby enabling improved downstream genotyping and phasing. We demonstrate the feasibility of this approach by developing EMerAld (EMA for short) and testing it on the barcoded short reads produced by 10x's sequencing technologies. EMA not only produces more accurate alignments but, unlike other methods, also assigns interpretable probabilities to the alignments it generates. We show that genotypes called from EMA's alignments contain over 30% fewer false positives than those called from the alignments of Lariat (the current 10x alignment tool), with fewer false negatives, on NA12878 and NA24385 datasets as compared with NIST GIAB gold-standard variant calls. Moreover, we demonstrate that EMA is able to effectively resolve alignments in regions containing nearby homologous elements, a particularly challenging problem in read mapping, through the introduction of a novel statistical binning optimization framework, which allows us to find variants in the pharmacogenomically important CYP2D region that go undetected by Lariat or BWA. Lastly, we show that EMA's alignments improve phasing performance compared with Lariat's in both NA12878 and NA24385, producing fewer switch/mismatch errors and larger phase blocks on average. EMA software and the datasets used are available at http://ema.csail.mit.edu.
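The "interpretable probabilities" can be pictured as a posterior over a read's candidate alignment locations, obtained by normalizing alignment scores. The softmax-style sketch below shows only that normalization step and is an assumption-laden simplification: EMA's actual latent variable model also exploits barcode (read-cloud) structure, and the function name and scores here are invented.

```python
import math

def alignment_posteriors(scores):
    """Turn raw alignment scores for one read's candidate locations
    into normalized probabilities (a softmax). A sketch of the
    normalization step only, not EMA's full model."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]  # subtract max for stability
    total = sum(weights)
    return [w / total for w in weights]

# Two candidate loci with scores 60 and 55: the better one gets most mass.
print(alignment_posteriors([60.0, 55.0]))  # ~[0.993, 0.007]
```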