Author: Zsuzsanna Lipták

0000000000122535

AUTHOR

Zsuzsanna Lipták

Binary jumbled string matching for highly run-length compressible texts

The Binary Jumbled String Matching problem is defined as: Given a string $s$ over $\{a,b\}$ of length $n$ and a query $(x,y)$, with $x,y$ non-negative integers, decide whether $s$ has a substring $t$ with exactly $x$ $a$'s and $y$ $b$'s. Previous solutions created an index of size O(n) in a pre-processing step, which was then used to answer queries in constant time. The fastest algorithms for construction of this index have running time $O(n^2/\log n)$ [Burcsi et al., FUN 2010; Moosa and Rahman, IPL 2010], or $O(n^2/\log^2 n)$ in the word-RAM model [Moosa and Rahman, JDA 2012]. We propose an index constructed directly from the run-length encoding of $s$. The construction time of our index i…

research product

Computing the original eBWT faster, simpler, and with less memory

Mantaci et al. [TCS 2007] defined the eBWT to extend the definition of the BWT to a collection of strings, however, since this introduction, it has been used more generally to describe any BWT of a collection of strings and the fundamental property of the original definition (i.e., the independence from the input order) is frequently disregarded. In this paper, we propose a simple linear-time algorithm for the construction of the original eBWT, which does not require the preprocessing of Bannai et al. [CPM 2021]. As a byproduct, we obtain the first linear-time algorithm for computing the BWT of a single string that uses neither an end-of-string symbol nor Lyndon rotations. We combine our ne…

research product

On Combinatorial Generation of Prefix Normal Words

A prefix normal word is a binary word with the property that no substring has more 1s than the prefix of the same length. This class of words is important in the context of binary jumbled pattern matching. In this paper we present an efficient algorithm for exhaustively listing the prefix normal words with a fixed length. The algorithm is based on the fact that the language of prefix normal words is a bubble language, a class of binary languages with the property that, for any word w in the language, exchanging the first occurrence of 01 by 10 in w results in another word in the language. We prove that each prefix normal word is produced in O(n) amortized time, and conjecture, based on expe…

research product

Computing the Original eBWT Faster, Simpler, and with Less Memory

Mantaci et al. [TCS 2007] defined the $\mathrm {eBWT}$ to extend the definition of the $\mathrm {BWT}$ to a collection of strings. However, since this introduction, it has been used more generally to describe any $\mathrm {BWT}$ of a collection of strings, and the fundamental property of the original definition (i.e., the independence from the input order) is frequently disregarded. In this paper, we propose a simple linear-time algorithm for the construction of the original $\mathrm {eBWT}$, which does not require the preprocessing of Bannai et al. [CPM 2021]. As a byproduct, we obtain the first linear-time algorithm for computing the $\mathrm {BWT}$ of a single string that uses …

research product

r-Indexing the eBWT

The extended Burrows Wheeler Transform ($\mathrm {eBWT}$) was introduced by Mantaci et al. [TCS 2007] to extend the definition of the $\mathrm {BWT}$ to a collection of strings. In our prior work [SPIRE 2021], we give a linear-time algorithm for the $\mathrm {eBWT}$ that preserves the fundamental property of the original definition (i.e., the independence from the input order). The algorithm combines a modification of the Suffix Array Induced Sorting (SAIS) algorithm [IEEE Trans Comput 2011] with Prefix Free Parsing [AMB 2019; JCB 2020]. In this paper, we show how this construction algorithm leads to r-indexing the $\mathrm {eBWT}$, i.e., run-length encoded $\mathrm {eBWT}$ and \(…

research product

Novel Results on the Number of Runs of the Burrows-Wheeler-Transform

The Burrows-Wheeler-Transform (BWT), a reversible string transformation, is one of the fundamental components of many current data structures in string processing. It is central in data compression, as well as in efficient query algorithms for sequence data, such as webpages, genomic and other biological sequences, or indeed any textual data. The BWT lends itself well to compression because its number of equal-letter-runs (usually referred to as $r$) is often considerably lower than that of the original string; in particular, it is well suited for strings with many repeated factors. In fact, much attention has been paid to the $r$ parameter as measure of repetitiveness, especially to evalua…

research product

ALGORITHMS FOR JUMBLED PATTERN MATCHING IN STRINGS

The Parikh vector p(s) of a string s is defined as the vector of multiplicities of the characters. Parikh vector q occurs in s if s has a substring t with p(t)=q. We present two novel algorithms for searching for a query q in a text s. One solves the decision problem over a binary text in constant time, using a linear size index of the text. The second algorithm, for a general finite alphabet, finds all occurrences of a given Parikh vector q and has sub-linear expected time complexity; we present two variants, which both use a linear size index of the text.

research product

Generating a Gray code for prefix normal words in amortized polylogarithmic time per word

A prefix normal word is a binary word with the property that no substring has more $1$s than the prefix of the same length. By proving that the set of prefix normal words is a bubble language, we can exhaustively list all prefix normal words of length $n$ as a combinatorial Gray code, where successive strings differ by at most two swaps or bit flips. This Gray code can be generated in $\Oh(\log^2 n)$ amortized time per word, while the best generation algorithm hitherto has $\Oh(n)$ running time per word. We also present a membership tester for prefix normal words, as well as a novel characterization of bubble languages.

research product

On Prefix Normal Words

We present a new class of binary words: the prefix normal words. They are defined by the property that for any given length $k$, no factor of length $k$ has more $a$'s than the prefix of the same length. These words arise in the context of indexing for jumbled pattern matching (a.k.a. permutation matching or Parikh vector matching), where the aim is to decide whether a string has a factor with a given multiplicity of characters, i.e., with a given Parikh vector. Using prefix normal words, we give the first non-trivial characterization of binary words having the same set of Parikh vectors of factors. We prove that the language of prefix normal words is not context-free and is strictly contai…

research product

On prefix normal words and prefix normal forms

A $1$-prefix normal word is a binary word with the property that no factor has more $1$s than the prefix of the same length; a $0$-prefix normal word is defined analogously. These words arise in the context of indexed binary jumbled pattern matching, where the aim is to decide whether a word has a factor with a given number of $1$s and $0$s (a given Parikh vector). Each binary word has an associated set of Parikh vectors of the factors of the word. Using prefix normal words, we provide a characterization of the equivalence class of binary words having the same set of Parikh vectors of their factors. We prove that the language of prefix normal words is not context-free and is strictly contai…

research product

On Table Arrangements, Scrabble Freaks, and Jumbled Pattern Matching

Given a string s, the Parikh vector of s, denoted p(s), counts the multiplicity of each character in s. Searching for a match of Parikh vector q (a “jumbled string”) in the text s requires to find a substring t of s with p(t) = q. The corresponding decision problem is to verify whether at least one such match exists. So, for example for the alphabet Σ = {a, b, c}, the string s = abaccbabaaa has Parikh vector p(s) = (6,3,2), and the Parikh vector q = (2,1,1) appears once in s in position (1,4). Like its more precise counterpart, the renown Exact String Matching, Jumbled Pattern Matching has ubiquitous applications, e.g., string matching with a dyslectic word processor, table rearrangements, …

research product

On Approximate Jumbled Pattern Matching in Strings

Given a string s, the Parikh vector of s, denoted p(s), counts the multiplicity of each character in s. Searching for a match of a Parikh vector q in the text s requires finding a substring t of s with p(t) = q. This can be viewed as the task of finding a jumbled (permuted) version of a query pattern, hence the term Jumbled Pattern Matching. We present several algorithms for the approximate version of the problem: Given a string s and two Parikh vectors u, v (the query bounds), find all maximal occurrences in s of some Parikh vector q such that u <= q <= v. This definition encompasses several natural versions of approximate Parikh vector search. We present an algorithm solving this problem …

research product

On the Number of Closed Factors in a Word

A closed word (a.k.a. periodic-like word or complete first return) is a word whose longest border does not have internal occurrences, or, equivalently, whose longest repeated prefix is not right special. We investigate the structure of closed factors of words. We show that a word of length $n$ contains at least $n+1$ distinct closed factors, and characterize those words having exactly $n+1$ closed factors. Furthermore, we show that a word of length $n$ can contain $\Theta(n^{2})$ many distinct closed factors.

research product

Normal, Abby Normal, Prefix Normal

A prefix normal word is a binary word with the property that no substring has more 1s than the prefix of the same length. This class of words is important in the context of binary jumbled pattern matching. In this paper we present results about the number $\textit{pnw}(n)$ of prefix normal words of length n, showing that $\textit{pnw}(n) =\Omega\left(2^{n - c\sqrt{n\ln n}}\right)$ for some c and $\textit{pnw}(n) = O \left(\frac{2^n (\ln n)^2}{n}\right)$. We introduce efficient algorithms for testing the prefix normal property and a “mechanical algorithm” for computing prefix normal forms. We also include games which can be played with prefix normal words. In these games Alice wishes t…

research product