6533b7dbfe1ef96bd1271497

RESEARCH PRODUCT

Constructing Antidictionaries of Long Texts in Output-Sensitive Space

Solon P. PissisGolnaz BadkobehAlice HéliouGabriele FiciLorraine A.k. Ayad

subject

0301 basic medicineAntidictionarySettore INF/01 - InformaticaOutput sensitive algorithm0102 computer and information sciencesSpace (mathematics)01 natural sciencesTheoretical Computer ScienceString algorithmPrefixSet (abstract data type)Combinatorics03 medical and health sciences030104 developmental biologyComputational Theory and Mathematics010201 computation theory & mathematicsData compressionOutput-sensitive algorithm[INFO]Computer Science [cs]SuffixAlphabetAbsent wordWord (group theory)Mathematics

description

AbstractA wordxthat is absent from a wordyis calledminimalif all its proper factors occur iny. Given a collection ofkwordsy1, … ,ykover an alphabetΣ, we are asked to compute the set$\mathrm {M}^{\ell }_{\{y_1,\ldots ,y_k\}}$M{y1,…,yk}ℓof minimal absent words of length at mostℓof the collection {y1, … ,yk}. The set$\mathrm {M}^{\ell }_{\{y_1,\ldots ,y_k\}}$M{y1,…,yk}ℓcontains all the wordsxsuch thatxis absent from all the words of the collection while there existi,j, such that the maximal proper suffix ofxis a factor ofyiand the maximal proper prefix ofxis a factor ofyj. In data compression, this corresponds to computing the antidictionary ofkdocuments. In bioinformatics, it corresponds to computing words that are absent from a genome ofkchromosomes. Indeed, the set$\mathrm {M}^{\ell }_{y}$Myℓof minimal absent words of a wordyis equal to$\mathrm {M}^{\ell }_{\{y_1,\ldots ,y_k\}}$M{y1,…,yk}ℓfor any decomposition ofyinto a collection of wordsy1, … ,yksuch that there is an overlap of length at leastℓ− 1 between any two consecutive words in the collection. This computation generally requiresΩ(n) space forn= |y| using any of the plenty available$\mathcal {O}(n)$O(n)-time algorithms. This is because anΩ(n)-sized text index is constructed overywhich can be impractical for largen. We do the identical computation incrementally using output-sensitive space. This goal is reasonable when$\| \mathrm {M}^{\ell }_{\{y_1,\ldots ,y_N\}}\| =o(n)$∥M{y1,…,yN}ℓ∥=o(n), for allN∈ [1,k], where ∥S∥ denotes the sum of the lengths of words in setS. For instance, in the human genome,n≈ 3 × 109but$\| \mathrm {M}^{12}_{\{y_1,\ldots ,y_k\}}\| \approx 10^{6}$∥M{y1,…,yk}12∥≈106. We consider a constant-sized alphabet for stating our results. We show thatall$\mathrm {M}^{\ell }_{y_{1}},\ldots ,\mathrm {M}^{\ell }_{\{y_1,\ldots ,y_k\}}$My1ℓ,…,M{y1,…,yk}ℓcan be computed in$\mathcal {O}(kn+{\sum }^{k}_{N=1}\| \mathrm {M}^{\ell }_{\{y_1,\ldots ,y_N\}}\| )$O(kn+∑N=1k∥M{y1,…,yN}ℓ∥)total time using$\mathcal {O}(\textsc {MaxIn}+\textsc {MaxOut})$O(MaxIn+MaxOut)space, where MaxIn is the length of the longest word in {y1, … ,yk} and$\textsc {MaxOut}=\max \limits \{\| \mathrm {M}^{\ell }_{\{y_1,\ldots ,y_N\}}\| :N\in [1,k]\}$MaxOut=max{∥M{y1,…,yN}ℓ∥:N∈[1,k]}. Proof-of-concept experimental results are also provided confirming our theoretical findings and justifying our contribution.

10.1007/s00224-020-10018-5https://research.vu.nl/en/publications/222d2908-30de-40ab-b3b6-7a5ab57bf5c0