Binary jumbled string matching for highly run-length compressible texts

RESEARCH PRODUCT

Gabriele Fici, Zsuzsanna Lipták, Golnaz Badkobeh, Steve Kroon

subject

FOS: Computer and information sciences; String algorithms; Structure (category theory); Binary number; G.2.1; Data_CODINGANDINFORMATIONTHEORY; 0102 computer and information sciences; 02 engineering and technology; String searching algorithm; 01 natural sciences; Computer Science - Information Retrieval; Theoretical Computer Science; Combinatorics; Data structures; Simple (abstract algebra); Computer Science - Data Structures and Algorithms; jumbled pattern matching; prefix normal form; 0202 electrical engineering, electronic engineering, information engineering; Parikh vector; Data Structures and Algorithms (cs.DS); Run-length encoding; Mathematics; 68W32; 68P05; 68P20; String (computer science); Substring; Computer Science Applications; 010201 computation theory & mathematics; Signal Processing; 020201 artificial intelligence & image processing; Constant (mathematics); Information Retrieval (cs.IR); Information Systems

description

The Binary Jumbled String Matching problem is defined as follows: given a string $s$ over $\{a,b\}$ of length $n$ and a query $(x,y)$, with $x,y$ non-negative integers, decide whether $s$ has a substring $t$ with exactly $x$ $a$'s and $y$ $b$'s. Previous solutions created an index of size $O(n)$ in a pre-processing step, which was then used to answer queries in constant time. The fastest algorithms for construction of this index have running time $O(n^2/\log n)$ [Burcsi et al., FUN 2010; Moosa and Rahman, IPL 2010], or $O(n^2/\log^2 n)$ in the word-RAM model [Moosa and Rahman, JDA 2012]. We propose an index constructed directly from the run-length encoding of $s$. The construction time of our index is $O(n+\rho^2\log \rho)$, where $O(n)$ is the time for computing the run-length encoding of $s$ and $\rho$ is the length of this encoding. This is no worse than previous solutions if $\rho = O(n/\log n)$ and better if $\rho = o(n/\log n)$. Our index $L$ can be queried in $O(\log \rho)$ time. While $|L| = O(\min(n, \rho^{2}))$ in the worst case, preliminary investigations have indicated that $|L|$ may often be close to $\rho$. Furthermore, the algorithm for constructing the index is conceptually simple and easy to implement. In an attempt to shed light on the structure and size of our index, we characterize it in terms of the prefix normal forms of $s$ introduced in [Fici and Lipták, DLT 2011].
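
To make the problem statement concrete, the following Python sketch shows a naive quadratic-time baseline; it is not the run-length-based index $L$ proposed in the paper. It relies on the standard interval property of binary strings: sliding a window of fixed length by one position changes the number of $a$'s by at most one, so a query $(x,y)$ has a match exactly when $x$ lies between the minimum and maximum number of $a$'s over all substrings of length $x+y$. The function names (`run_length_encode`, `build_naive_index`, `has_jumbled_match`) are hypothetical and chosen only for illustration.

```python
from itertools import groupby


def run_length_encode(s):
    """Return the run-length encoding of s as (character, run length) pairs."""
    return [(ch, len(list(run))) for ch, run in groupby(s)]


def build_naive_index(s):
    """Build a size-O(n) index in O(n^2) time (illustrative baseline only).

    For every length m, store the minimum and maximum number of 'a's
    among all substrings of s of length m.
    """
    n = len(s)
    # prefix[i] = number of 'a's in s[:i]
    prefix = [0] * (n + 1)
    for i, ch in enumerate(s):
        prefix[i + 1] = prefix[i] + (ch == "a")

    min_a = [0] * (n + 1)
    max_a = [0] * (n + 1)
    for m in range(1, n + 1):
        counts = [prefix[i + m] - prefix[i] for i in range(n - m + 1)]
        min_a[m] = min(counts)
        max_a[m] = max(counts)
    return min_a, max_a


def has_jumbled_match(index, x, y):
    """Answer a query (x, y) in constant time using the interval property."""
    min_a, max_a = index
    m = x + y
    if m >= len(min_a):  # longer than the text itself
        return False
    return min_a[m] <= x <= max_a[m]


if __name__ == "__main__":
    s = "aabbba"
    print(run_length_encode(s))  # three runs, so rho = 3
    index = build_naive_index(s)
    for query in [(2, 1), (3, 0), (1, 4), (3, 3)]:
        print(query, has_jumbled_match(index, *query))
```

For $s = aabbba$ the run-length encoding has $\rho = 3$ runs, while the baseline above still inspects every window of every length; avoiding that work when $\rho$ is small is the motivation for building the index from the run-length encoding instead.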

year	journal	country	edition	language
2012-06-12	Information Processing Letters

https://doi.org/10.1016/j.ipl.2013.05.007