
RESEARCH PRODUCT

Some Investigations on Similarity Measures Based on Absent Words

Sabrina Mantaci, Antonio Restivo, Giuseppa Castiglione

subject

sequence comparison; Algebra and Number Theory; Settore INF/01 - Informatica; business.industry; Computer science; Pattern recognition; similarity measures; Minimal absent words; Theoretical Computer Science; Computational Theory and Mathematics; Similarity (network science); Artificial intelligence; business; Information Systems

description

In this paper we investigate similarity measures based on minimal absent words, introduced by Chairungsee and Crochemore in [1]. These measures use a length-weighted index on a sample set corresponding to the symmetric difference M(x)ΔM(y) of the sets of minimal absent words M(x) and M(y) of two sequences x and y, respectively. We first propose a variant of this measure by choosing as a sample set a proper subset (x, y) of M(x)ΔM(y), which appears to be more appropriate for distinguishing x and y. From the algebraic point of view, we prove that (x, y) is the base of the ideal generated by M(x)ΔM(y). We then observe that such measures can recognize whether the sequences x and y share a common structure, but cannot detect a difference in the number of occurrences of that structure in the two sequences. To take this multiplicity into account, we introduce the notion of multifactor and define a new measure that uses both absent words and multifactors. Surprisingly, we prove that this similarity measure coincides with a distance on sequences introduced by Ehrenfeucht and Haussler in [2], in the context of block-move strategies. In this way, our result creates a nontrivial bridge between similarity measures based on absent words and those based on the block-move approach.
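As an illustration of the baseline measure described above, the following brute-force Python sketch computes M(x), M(y), and a length-weighted index over M(x)ΔM(y). It is not taken from the paper: the function names are hypothetical, the alphabet is passed explicitly, and the weight 1/|w|² is an assumption about the index of [1], since the abstract does not state the formula. A word w counts as a minimal absent word of s when w does not occur in s but all of its proper factors do.

```python
def factors(s):
    """Return the set of all factors (substrings) of s, including the empty word."""
    return {s[i:j] for i in range(len(s) + 1) for j in range(i, len(s) + 1)}


def minimal_absent_words(s, alphabet):
    """Brute-force M(s): words absent from s whose proper factors all occur in s."""
    fact = factors(s)
    # A single letter is a minimal absent word when it never occurs in s.
    maws = {a for a in alphabet if a not in fact}
    # A longer word a+u+b is minimal absent when a+u and u+b occur but a+u+b does not.
    for u in fact:
        for a in alphabet:
            if a + u not in fact:
                continue
            for b in alphabet:
                if u + b in fact and a + u + b not in fact:
                    maws.add(a + u + b)
    return maws


def length_weighted_index(x, y, alphabet):
    """Sum 1/|w|^2 over the symmetric difference M(x) ^ M(y).

    The 1/|w|^2 weight is an assumption, not a formula given in the abstract.
    """
    sample = minimal_absent_words(x, alphabet) ^ minimal_absent_words(y, alphabet)
    return sum(1.0 / len(w) ** 2 for w in sample)


if __name__ == "__main__":
    print(sorted(minimal_absent_words("abaab", "ab")))
    print(length_weighted_index("ab", "ba", "ab"))
```

The sketch enumerates all factors and is therefore quadratic in space; it is meant only to make the definitions concrete on short strings, not to reflect the efficient constructions used in practice.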

https://doi.org/10.3233/fi-2020-1874