6533b85afe1ef96bd12b97e0

RESEARCH PRODUCT

Reverse-Safe Text Indexing

Solon P. PissisHuiping ChenGrigorios LoukidesGabriele FiciGiulia Bernardini

subject

Data structuresComputer scienceSuffix treesuffix tree0102 computer and information sciences02 engineering and technologytext indexing01 natural sciencesTheoretical Computer Sciencelaw.inventionSet (abstract data type)law020204 information systems0202 electrical engineering electronic engineering information engineeringPattern matchingdata privacySettore INF/01 - InformaticaSearch engine indexingdata privacy; Data structures; pattern matching; suffix tree; text indexingData structureMatrix multiplicationpattern matching010201 computation theory & mathematicsData structureAlgorithmAdversary modelInteger (computer science)

description

We introduce the notion of reverse-safe data structures. These are data structures that prevent the reconstruction of the data they encode (i.e., they cannot be easily reversed). A data structure D is called z - reverse-safe when there exist at least z datasets with the same set of answers as the ones stored by D . The main challenge is to ensure that D stores as many answers to useful queries as possible, is constructed efficiently, and has size close to the size of the original dataset it encodes. Given a text of length n and an integer z , we propose an algorithm that constructs a z -reverse-safe data structure ( z -RSDS) that has size O(n) and answers decision and counting pattern matching queries of length at most d optimally, where d is maximal for any such z -RSDS. The construction algorithm takes O(nɷ log d) time, where ɷ is the matrix multiplication exponent. We show that, despite the nɷ factor, our engineered implementation takes only a few minutes to finish for million-letter texts. We also show that plugging our method in data analysis applications gives insignificant or no data utility loss. Furthermore, we show how our technique can be extended to support applications under realistic adversary models. Finally, we show a z -RSDS for decision pattern matching queries, whose size can be sublinear in n . A preliminary version of this article appeared in ALENEX 2020.

https://doi.org/10.1145/3461698