Contextual Similarity: Quasilinear-Time Search and Comparison for Sequential Data
This paper proposes a new algorithm for the evaluation of similarity between two sequences in quasilinear time. It describes the theoretical, practical and implementational aspects of the algorithm. The proposed method is a new approach dedicated to the computation of sequential similarity in contrast to other methods like the Jaccard Index which although designed for the computation of similarity of sets have been frequently used on sequences. The method is generalizable and applicable to any form of sequential data of a finite alphabet (binary files, DNA sequences, natural language etc.) One of the problems tackled was the processing of multiple sequences at once, which allowed for quasilinear-time complexity for both search and comparison. The algorithm itself is parallelizable. Although the function is still a pseudo-metric as it doesn't fully satisfy the triangle inequality, it can still compute the distance from one sequence to another. This relation forms as a consequence of the decreasing probability of larger subsequences to form within two sequences without them being correlated. The mathematical model suggests that the contextual similarity of two random sequences approaches zero as their length increases.