6533b825fe1ef96bd1281f1e

RESEARCH PRODUCT

Adaptive learning of compressible strings

Rossano VenturiniGabriele FiciNicola Prezza

subject

FOS: Computer and information sciencesCentroid decompositionGeneral Computer ScienceString compressionAdaptive learningKolmogorov complexityContext (language use)Data_CODINGANDINFORMATIONTHEORYString reconstructionTheoretical Computer ScienceCombinatoricsString reconstruction; String learning; Adaptive learning; Kolmogorov complexity; String compression; Lempel-Ziv; Centroid decomposition; Suffix treeSuffix treeIntegerComputer Science - Data Structures and AlgorithmsOrder (group theory)Data Structures and Algorithms (cs.DS)Adaptive learning; Centroid decomposition; Kolmogorov complexity; Lempel-Ziv; String compression; String learning; String reconstruction; Suffix treeTime complexityComputer Science::DatabasesMathematicsLempel-ZivSettore INF/01 - InformaticaLinear spaceString (computer science)SubstringBounded functionString learning

description

Suppose an oracle knows a string $S$ that is unknown to us and that we want to determine. The oracle can answer queries of the form "Is $s$ a substring of $S$?". In 1995, Skiena and Sundaram showed that, in the worst case, any algorithm needs to ask the oracle $\sigma n/4 -O(n)$ queries in order to be able to reconstruct the hidden string, where $\sigma$ is the size of the alphabet of $S$ and $n$ its length, and gave an algorithm that spends $(\sigma-1)n+O(\sigma \sqrt{n})$ queries to reconstruct $S$. The main contribution of our paper is to improve the above upper-bound in the context where the string is compressible. We first present a universal algorithm that, given a (computable) compressor that compresses the string to $\tau$ bits, performs $q=O(\tau)$ substring queries; this algorithm, however, runs in exponential time. For this reason, the second part of the paper focuses on more time-efficient algorithms whose number of queries is bounded by specific compressibility measures. We first show that any string of length $n$ over an integer alphabet of size $\sigma$ with $rle$ runs can be reconstructed with $q=O(rle (\sigma + \log \frac{n}{rle}))$ substring queries in linear time and space. We then present an algorithm that spends $q \in O(\sigma g\log n)$ substring queries and runs in $O(n(\log n + \log \sigma)+ q)$ time using linear space, where $g$ is the size of a smallest straight-line program generating the string.

https://doi.org/10.1016/j.tcs.2021.10.003