Using MinJoin algorithms for detecting large repeats in genomes
Duration: 1 June 2024 to 31 December 2024
Application period: 23 April 2024 to 10 May 2024
Large duplications, also known as segmental duplications (SDs), are segments of DNA larger than 1 kb that are highly similar to other regions within the genome. SDs are among the most important drivers of evolution and a common cause of genomic structural variation, and several are associated with diseases of genomic origin. Despite their importance, SDs are hard to detect, especially those that occurred in the distant past, because of the size of the genome and the computational complexity of the problem. We have recently proposed two methods, SEDEF and BISER, that use minimizer-based MinHash sketching to quickly identify potential SD regions in a given genome. However, despite their success in uncovering many novel SDs at scale, both methods are probabilistic and rely on heuristics that are not guaranteed to find all SDs within a given genome. In this project, we want to explore the feasibility of using the theoretically sounder MinJoin family of sketching algorithms, which approximate string similarity, for this task, and to compare their performance with that of the MinHash-based algorithms.
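For readers new to the topic, the idea behind minimizer-based MinHash sketching can be illustrated with a small Python sketch. It is a toy illustration only and does not reproduce how SEDEF or BISER are implemented. The function names, the parameters `k`, `w` and `s`, and the choice of BLAKE2b as the hash are all assumptions made for this example.

```python
import hashlib


def kmer_hash(kmer: str) -> int:
    """Deterministic 64-bit hash (Python's built-in hash() is salted per run)."""
    return int.from_bytes(hashlib.blake2b(kmer.encode(), digest_size=8).digest(), "big")


def minimizers(seq: str, k: int = 5, w: int = 4) -> set:
    """Collect the smallest-hash k-mer from every window of w consecutive k-mers."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    # Sequences with fewer than w k-mers are treated as a single window.
    for start in range(max(len(kmers) - w + 1, 1)):
        window = kmers[start:start + w]
        if window:
            picked.add(min(window, key=kmer_hash))
    return picked


def bottom_sketch(items: set, s: int) -> set:
    """Bottom-s sketch: the s smallest hash values of the items."""
    return set(sorted({kmer_hash(x) for x in items})[:s])


def minhash_jaccard(a: set, b: set, s: int = 16) -> float:
    """Estimate the Jaccard similarity of two minimizer sets from bottom-s sketches."""
    sa, sb = bottom_sketch(a, s), bottom_sketch(b, s)
    union_bottom = set(sorted(sa | sb)[:s])
    if not union_bottom:
        return 1.0  # convention chosen here: two empty sets count as identical
    return len(union_bottom & sa & sb) / len(union_bottom)


if __name__ == "__main__":
    x = "ACGTTGCATGCCGATAGCTAGCTAGGATCCA"
    y = "ACGTTGCATGCCGATTGCTAGCTAGGATCCA"  # one substitution relative to x
    print(minhash_jaccard(minimizers(x), minimizers(y)))
```

The estimate is probabilistic: two regions with a high true similarity can still receive a low score if the mutated positions happen to hit the selected minimizers. This is the kind of missed candidate the project description refers to.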
Tasks:
● Read and understand the literature
● Implement and integrate MinJoin++ in the Codon programming language
● Integrate MinJoin within the SEDEF/BISER pipelines
● Compare the results of the MinJoin and MinHash-based implementations
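As a starting point for the implementation tasks, the core idea usually associated with MinJoin, partitioning a string at positions whose q-gram hash is a local minimum, can be sketched in a few lines of Python (whose syntax Codon largely follows). This is a simplified illustration and an assumption about the general approach, not MinJoin++ and not the SEDEF/BISER integration. The names `local_minima_anchors` and `partition`, the parameters `q` and `r`, and the boundary handling are hypothetical choices made for this example.

```python
import hashlib


def qgram_hash(qgram: str) -> int:
    """Deterministic 64-bit hash of a q-gram."""
    return int.from_bytes(hashlib.blake2b(qgram.encode(), digest_size=8).digest(), "big")


def local_minima_anchors(seq: str, q: int = 3, r: int = 2) -> list:
    """Positions whose q-gram hash is strictly smaller than every other hash within radius r."""
    hashes = [qgram_hash(seq[i:i + q]) for i in range(len(seq) - q + 1)]
    anchors = []
    for i, h in enumerate(hashes):
        # Assumption: neighbourhoods are simply truncated at the string ends.
        lo, hi = max(0, i - r), min(len(hashes), i + r + 1)
        if all(h < hashes[j] for j in range(lo, hi) if j != i):
            anchors.append(i)
    return anchors


def partition(seq: str, q: int = 3, r: int = 2) -> list:
    """Cut seq at the anchor positions; similar strings tend to share some pieces."""
    anchors = local_minima_anchors(seq, q, r)
    cuts = [0] + [a for a in anchors if a > 0] + [len(seq)]
    return [seq[s:e] for s, e in zip(cuts, cuts[1:]) if s < e]


if __name__ == "__main__":
    a = "ACGTTGCATGCCGATAGCTAGCTAGGATCCA"
    b = "ACGTTGCATGCCGATTGCTAGCTAGGATCCA"  # one substitution relative to a
    shared = set(partition(a)) & set(partition(b))
    print(partition(a))
    print(partition(b))
    print("shared pieces:", shared)
```

Because the anchors depend only on local content, an edit changes the partitioning only near the edit, so two similar strings are likely to share at least one identical piece. In a real pipeline, shared pieces would be used to generate candidate pairs that are then verified by alignment.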
Contact
Ibrahim Numanagić