Adaptive learning of compressible strings

RESEARCH PRODUCT

Rossano Venturini, Gabriele Fici, Nicola Prezza

subject

FOS: Computer and information sciences; Centroid decomposition; General Computer Science; String compression; Adaptive learning; Kolmogorov complexity; Context (language use); Data_CODINGANDINFORMATIONTHEORY; String reconstruction; Theoretical Computer Science; Combinatorics; String learning; Lempel-Ziv; Suffix tree; Integer; Computer Science - Data Structures and Algorithms; Order (group theory); Data Structures and Algorithms (cs.DS); Time complexity; Computer Science::Databases; Mathematics; Settore INF/01 - Informatica; Linear space; String (computer science); Substring; Bounded function

description

Suppose an oracle knows a string $S$ that is unknown to us and that we want to determine. The oracle can answer queries of the form "Is $s$ a substring of $S$?". In 1995, Skiena and Sundaram showed that, in the worst case, any algorithm needs to ask the oracle $\sigma n/4 -O(n)$ queries to reconstruct the hidden string, where $\sigma$ is the size of the alphabet of $S$ and $n$ its length, and gave an algorithm that spends $(\sigma-1)n+O(\sigma \sqrt{n})$ queries to reconstruct $S$. The main contribution of our paper is to improve the above upper bound when the string is compressible. We first present a universal algorithm that, given a (computable) compressor that compresses the string to $\tau$ bits, performs $q=O(\tau)$ substring queries; this algorithm, however, runs in exponential time. For this reason, the second part of the paper focuses on more time-efficient algorithms whose number of queries is bounded by specific compressibility measures. We first show that any string of length $n$ over an integer alphabet of size $\sigma$ with $rle$ runs can be reconstructed with $q=O(rle (\sigma + \log \frac{n}{rle}))$ substring queries in linear time and space. We then present an algorithm that spends $q=O(\sigma g\log n)$ substring queries and runs in $O(n(\log n + \log \sigma)+ q)$ time using linear space, where $g$ is the size of a smallest straight-line program generating the string.
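
The query model described above can be illustrated with a minimal Python sketch. It is not the algorithm of Skiena and Sundaram, nor any of the algorithms of this paper: it is a naive baseline that extends a known substring one character at a time, first to the right and then to the left, and spends at most $\sigma(n+2)$ queries. The names `SubstringOracle`, `is_substring` and `reconstruct` are hypothetical and chosen only for illustration, and the sketch assumes the alphabet is known in advance.

```python
class SubstringOracle:
    """Hypothetical oracle holding a hidden string S and counting queries."""

    def __init__(self, hidden):
        self._hidden = hidden
        self.queries = 0

    def is_substring(self, s):
        """Answer the query "Is s a substring of S?"."""
        self.queries += 1
        return s in self._hidden


def reconstruct(oracle, alphabet):
    """Naive baseline (an assumption of this sketch, not the paper's method).

    Phase 1 grows s to the right while some s + c is still a substring.
    When no character fits, s occurs only as a suffix of S.
    Phase 2 grows s to the left in the same way; when no character fits,
    s is the whole hidden string.
    """
    s = ""

    # Phase 1: extend to the right until s is a suffix of S.
    extended = True
    while extended:
        extended = False
        for c in alphabet:
            if oracle.is_substring(s + c):
                s += c
                extended = True
                break

    # Phase 2: extend to the left until s equals S.
    extended = True
    while extended:
        extended = False
        for c in alphabet:
            if oracle.is_substring(c + s):
                s = c + s
                extended = True
                break

    return s


if __name__ == "__main__":
    oracle = SubstringOracle("abracadabra")
    print(reconstruct(oracle, "abcdr"), oracle.queries)
```

Each successful extension costs at most $\sigma$ queries and each of the two phases ends with one failed round of $\sigma$ queries, which gives the $\sigma(n+2)$ bound. This baseline ignores compressibility entirely, which is the gap the results in the description address.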

year	journal
2020-11-13	Theoretical Computer Science

https://doi.org/10.1016/j.tcs.2021.10.003