0000000000043255
AUTHOR
Marinella Sciortino
Suffix Automata and Standard Sturmian Words
Blumer et al. showed (cf. [3,2]) that the suffix automaton of a word w must have at least |w|+1 states and at most 2|w|-1 states. In this paper we characterize the language L of all binary words w whose minimal suffix automaton S(w) has exactly |w| + 1 states; they are precisely all prefixes of standard Sturmian words. In particular, we give an explicit construction of suffix automaton of words that are palindromic prefixes of standard words. Moreover, we establish a necessary and sufficient condition on S(w) which ensures that if w ∈ L and a ∈ {0, 1} then wa ∈ L. By using such a condition, we show how to construct the automaton S(wa) from S(w). More generally, we provide a simple construct…
Burrows–Wheeler transform and Sturmian words
Computing the original eBWT faster, simpler, and with less memory
Mantaci et al. [TCS 2007] defined the eBWT to extend the definition of the BWT to a collection of strings, however, since this introduction, it has been used more generally to describe any BWT of a collection of strings and the fundamental property of the original definition (i.e., the independence from the input order) is frequently disregarded. In this paper, we propose a simple linear-time algorithm for the construction of the original eBWT, which does not require the preprocessing of Bannai et al. [CPM 2021]. As a byproduct, we obtain the first linear-time algorithm for computing the BWT of a single string that uses neither an end-of-string symbol nor Lyndon rotations. We combine our ne…
Forbidden Factors and Fragment Assembly
In this paper methods and results related to the notion of minimal forbidden words are applied to the fragment assembly problem. The fragment assembly problem can be formulated, in its simplest form, as follows: reconstruct a word w from a given set I of substrings (fragments ) of a word w . We introduce an hypothesis involving the set of fragments I and the maximal length m(w) of the minimal forbidden factors of w . Such hypothesis allows us to reconstruct uniquely the word w from the set I in linear time. We prove also that, if w is a word randomly generated by a memoryless source with identical symbol probabilities, m(w) is logarithmic with respect to the size of w . This result shows th…
Logarithmic Equal-Letter Runs for BWT of Purely Morphic Words
In this paper we study the number r(bwt) of equal-letter runs produced by the Burrows-Wheeler transform (BWT) when it is applied to purely morphic finite words, which are words generated by iterating prolongable morphisms. Such a parameter r(bwt) is very significant since it provides a measure of the performances of the BWT, in terms of both compressibility and indexing. In particular, we prove that, when BWT is applied to whichever purely morphic finite word on a binary alphabet, r(bwt) is O(log n), where n is the length of the word. Moreover, we prove that r(bwt) is Theta(log n) for the binary words generated by a large class of prolongable binary morphisms. These bounds are proved by pro…
Languages with mismatches
AbstractIn this paper we study some combinatorial properties of a class of languages that represent sets of words occurring in a text S up to some errors. More precisely, we consider sets of words that occur in a text S with k mismatches in any window of size r. The study of this class of languages mainly focuses both on a parameter, called repetition index, and on the set of the minimal forbidden words of the language of factors of S with errors. The repetition index of a string S is defined as the smallest integer such that all strings of this length occur at most in a unique position of the text S up to errors. We prove that there is a strong relation between the repetition index of S an…
Comparing Sequences by the Burrows-Wheeler Transform
A New Combinatorial Approach to Sequence Comparison
In this paper we introduce a new alignment-free method for comparing sequences which is combinatorial by nature and does not use any compressor nor any information-theoretic notion. Such a method is based on an extension of the Burrows-Wheeler Transform, a transformation widely used in the context of Data Compression. The new extended transformation takes as input a multiset of sequences and produces as output a string obtained by a suitable rearrangement of the characters of all the input sequences. By using such a transformation we give a general method for comparing sequences that takes into account how much the characters coming from the different input sequences are mixed in the output…
Suffixes, Conjugates and Lyndon Words
In this paper we are interested in the study of the combinatorial aspects connecting three important constructions in the field of string algorithms: the suffix array, the Burrows-Wheeler transform (BWT) and the extended Burrows-Wheeler transform (EBWT). Such constructions involve the notions of suffixes and conjugates of words and are based on two different order relations, denoted by $\plex$ and $\pom$, that, even if strictly connected, are quite different from the computational point of view. In this study an important role is played by Lyndon words. In particular, we improve the upper bound on the number of symbol comparisons needed to establish the $\pom$ order between two primitive wo…
From First Principles to the Burrows and Wheeler Transform and Beyond, via Combinatorial Optimization
AbstractWe introduce a combinatorial optimization framework that naturally induces a class of optimal word permutations with respect to a suitably defined cost function taking into account various measures of relatedness between words. The Burrows and Wheeler transform (bwt) (cf. [M. Burrows, D. Wheeler, A block sorting lossless data compression algorithm, Technical Report 124, Digital Equipment Corporation, 1994]), and its analog for labelled trees (cf. [P. Ferragina, F. Luccio, G. Manzini, S. Muthukrishnan, Structuring labeled trees for optimal succinctness, and beyond, in: Proc. of the 45th Annual IEEE Symposium on Foundations of Computer Science, 2005, pp. 198–207]), are special cases i…
Optimal Partitions of Strings: A New Class of Burrows-Wheeler Compression Algorithms
The Burrows-Wheeler transform [1] is one of the mainstays of lossless data compression. In most cases, its output is fed to Move to Front or other variations of symbol ranking compression. One of the main open problems [2] is to establish whether Move to Front, or more in general symbol ranking compression, is an essential part of the compression process. We settle this question positively by providing a new class of Burrows-Wheeler algorithms that use optimal partitions of strings, rather than symbol ranking, for the additional step. Our technique is a quite surprising specialization to strings of partitioning techniques devised by Buchsbaum et al. [3] for two-dimensional table compression…
"Indexing structures for approximate string matching
In this paper we give the first, to our knowledge, structures and corresponding algorithms for approximate indexing, by considering the Hamming distance, having the following properties. i) Their size is linear times a polylog of the size of the text on average. ii) For each pattern x, the time spent by our algorithms for finding the list occ(x) of all occurrences of a pattern x in the text, up to a certain distance, is proportional on average to |x| + |occ(x)|, under an additional but realistic hypothesis.
Block Sorting-Based Transformations on Words: Beyond the Magic BWT
The Burrows-Wheeler Transform (BWT) is a word transformation introduced in 1994 for Data Compression and later results have contributed to make it a fundamental tool for the design of self-indexing compressed data structures. The Alternating Burrows-Wheeler Transform (ABWT) is a more recent transformation, studied in the context of Combinatorics on Words, that works in a similar way, using an alternating lexicographical order instead of the usual one. In this paper we study a more general class of block sorting-based transformations. The transformations in this new class prove to be interesting combinatorial tools that offer new research perspectives. In particular, we show that all the tra…
Computing the Original eBWT Faster, Simpler, and with Less Memory
Mantaci et al. [TCS 2007] defined the \(\mathrm {eBWT}\) to extend the definition of the \(\mathrm {BWT}\) to a collection of strings. However, since this introduction, it has been used more generally to describe any \(\mathrm {BWT}\) of a collection of strings, and the fundamental property of the original definition (i.e., the independence from the input order) is frequently disregarded. In this paper, we propose a simple linear-time algorithm for the construction of the original \(\mathrm {eBWT}\), which does not require the preprocessing of Bannai et al. [CPM 2021]. As a byproduct, we obtain the first linear-time algorithm for computing the \(\mathrm {BWT}\) of a single string that uses …
Nondeterministic Moore automata and Brzozowski's minimization algorithm
AbstractMoore automata represent a model that has many applications. In this paper we define a notion of coherent nondeterministic Moore automaton (NMA) and show that such a model has the same computational power of the classical deterministic Moore automaton. We consider also the problem of constructing the minimal deterministic Moore automaton equivalent to a given NMA. We propose an algorithm that is a variant of Brzozowski’s minimization algorithm in the sense that it is essentially structured as reverse operation and subset construction performed twice. Moreover, we explore more general classes of NMA and analyze the applicability of the algorithm. For some of such classes the algorith…
r-Indexing the eBWT
The extended Burrows Wheeler Transform (\(\mathrm {eBWT}\)) was introduced by Mantaci et al. [TCS 2007] to extend the definition of the \(\mathrm {BWT}\) to a collection of strings. In our prior work [SPIRE 2021], we give a linear-time algorithm for the \(\mathrm {eBWT}\) that preserves the fundamental property of the original definition (i.e., the independence from the input order). The algorithm combines a modification of the Suffix Array Induced Sorting (SAIS) algorithm [IEEE Trans Comput 2011] with Prefix Free Parsing [AMB 2019; JCB 2020]. In this paper, we show how this construction algorithm leads to r-indexing the \(\mathrm {eBWT}\), i.e., run-length encoded \(\mathrm {eBWT}\) and \(…
Distance measures for biological sequences: Some recent approaches
AbstractSequence comparison has become a very essential tool in modern molecular biology. In fact, in biomolecular sequences high similarity usually implies significant functional or structural similarity. Traditional approaches use techniques that are based on sequence alignment able to measure character level differences. However, the recent developments of whole genome sequencing technology give rise to need of similarity measures able to capture the rearrangements involving large segments contained in the sequences. This paper is devoted to illustrate different methods recently introduced for the alignment-free comparison of biological sequences. Goal of the paper is both to highlight t…
On extremal cases of Hopcroft’s algorithm
AbstractIn this paper we consider the problem of minimization of deterministic finite automata (DFA) with reference to Hopcroft’s algorithm. Hopcroft’s algorithm has several degrees of freedom, so there can exist different executions that can lead to different sequences of refinements of the set of the states up to the final partition. We find an infinite family of binary automata for which such a process is unique, whatever strategy is chosen. Some recent papers (cf. Berstel and Carton (2004) [3], Castiglione et al. (2008) [6] and Berstel et al. (2009) [1]) have been devoted to find families of automata for which Hopcroft’s algorithm has its worst execution time. They are unary automata as…
Novel Results on the Number of Runs of the Burrows-Wheeler-Transform
The Burrows-Wheeler-Transform (BWT), a reversible string transformation, is one of the fundamental components of many current data structures in string processing. It is central in data compression, as well as in efficient query algorithms for sequence data, such as webpages, genomic and other biological sequences, or indeed any textual data. The BWT lends itself well to compression because its number of equal-letter-runs (usually referred to as $r$) is often considerably lower than that of the original string; in particular, it is well suited for strings with many repeated factors. In fact, much attention has been paid to the $r$ parameter as measure of repetitiveness, especially to evalua…
Fragment assembly through minimal forbidden words
Epichristoffel Words and Minimization of Moore Automata
This paper is focused on the connection between the combinatorics of words and minimization of automata. The three main ingredients are the epichristoffel words, Moore automata and a variant of Hopcroft’s algorithm for their minimization. Epichristoffel words defined in [14] generalize some properties of circular sturmian words. Here we prove a factorization property and the existence of the reduction tree, that uniquely identifies the structure of the word. Furthermore, in the paper we investigate the problem of the minimization of Moore automata by defining a variant of Hopcroft’s minimization algorithm. The use of this variant makes simpler the computation of the running time and consequ…
Lightweight LCP construction for next-generation sequencing datasets
The advent of "next-generation" DNA sequencing (NGS) technologies has meant that collections of hundreds of millions of DNA sequences are now commonplace in bioinformatics. Knowing the longest common prefix array (LCP) of such a collection would facilitate the rapid computation of maximal exact matches, shortest unique substrings and shortest absent words. CPU-efficient algorithms for computing the LCP of a string have been described in the literature, but require the presence in RAM of large data structures. This prevents such methods from being feasible for NGS datasets. In this paper we propose the first lightweight method that simultaneously computes, via sequential scans, the LCP and B…
Lightweight LCP construction for very large collections of strings
The longest common prefix array is a very advantageous data structure that, combined with the suffix array and the Burrows-Wheeler transform, allows to efficiently compute some combinatorial properties of a string useful in several applications, especially in biological contexts. Nowadays, the input data for many problems are big collections of strings, for instance the data coming from "next-generation" DNA sequencing (NGS) technologies. In this paper we present the first lightweight algorithm (called extLCP) for the simultaneous computation of the longest common prefix array and the Burrows-Wheeler transform of a very large collection of strings having any length. The computation is reali…
Detecting mutations by eBWT
In this paper we develop a theory describing how the extended Burrows-Wheeler Transform (eBWT) of a collection of DNA fragments tends to cluster together the copies of nucleotides sequenced from a genome G. Our theory accurately predicts how many copies of any nucleotide are expected inside each such cluster, and how an elegant and precise LCP array based procedure can locate these clusters in the eBWT. Our findings are very general and can be applied to a wide range of different problems. In this paper, we consider the case of alignment-free and reference-free SNPs discovery in multiple collections of reads. We note that, in accordance with our theoretical results, SNPs are clustered in th…
Preface
Permutations, Partitions and Combinatorial Compression Boosting
Preface
This special issue of Mathematical Structures in Computer Science is devoted to the fourteenth Italian Conference on Theoretical Computer Science (ICTCS) held at University of Palermo, Italy, from 9th to 11th September 2013. ICTCS is the conference of the Italian Chapter of the European Association for Theoretical Computer Science and covers a wide spectrum of topics in Theoretical Computer Science, ranging from computational complexity to logic, from algorithms and data structure to programming languages, from combinatorics on words to distributed computing. For this reason, the contributions here included come from very different areas of Theoretical Computer Science. In fact this special…
Hopcroft's algorithm and tree-like automata
Minimizing a deterministic finite automata (DFA) is a very important problem in theory of automata and formal languages. Hopcroft's algorithm represents the fastest known solution to the such a problem. In this paper we analyze the behavior of this algorithm on a family binary automata, called tree-like automata, associated to binary labeled trees constructed by words. We prove that all the executions of the algorithm on tree-like automata associated to trees, constructed by standard words, have running time with the same asymptotic growth rate. In particular, we provide a lower and upper bound for the running time of the algorithm expressed in terms of combinatorial properties of the trees…
Variable-order reference-free variant discovery with the Burrows-Wheeler Transform
Abstract Background In [Prezza et al., AMB 2019], a new reference-free and alignment-free framework for the detection of SNPs was suggested and tested. The framework, based on the Burrows-Wheeler Transform (BWT), significantly improves sensitivity and precision of previous de Bruijn graphs based tools by overcoming several of their limitations, namely: (i) the need to establish a fixed value, usually small, for the order k, (ii) the loss of important information such as k-mer coverage and adjacency of k-mers within the same read, and (iii) bad performance in repeated regions longer than k bases. The preliminary tool, however, was able to identify only SNPs and it was too slow and memory con…
On Extremal Cases of the Hopcroft's Algorithm
In this paper we consider the problem of minimization of deterministic finite automata (DFA) with reference to Hopcroft’s algorithm. Hopcroft’s algorithm has several degrees of freedom, so there can exist different executions that can lead to different sequences of refinements of the set of the states up to the final partition. We find an infinite family of binary automata for which such a process is unique, whatever strategy is chosen. Some recent papers (cf. Berstel and Carton (2004) [3], Castiglione et al. (2008) [6] and Berstel et al. (2009) [1]) have been devoted to find families of automata for which Hopcroft’s algorithm has its worst execution time. They are unary automata associated…
Hopcroft’s Algorithm and Cyclic Automata
Minimization of deterministic finite automata is a largely studied problem of the Theory of Automata and Formal Languages. It consists in finding the unique (up to isomorphism) minimal deterministic automaton recognizing a set of words. The first approaches to this topic can be traced back to the 1950’s with the works of Huffman and Moore (cf. [12,15]). Over the years several methods to solve this problem have been proposed but the most efficient algorithm in the worst case was given by Hopcroft in [11]. Such an algorithm computes in O(n log n) the minimal automaton equivalent to a given automaton with n states. The Hopcroft’s algorithm has been widely studied, described and implemented by …
Formal Languages and Automata: Models, Methods and Application. In Honour of the 70th Birthday of Antonio Restivo Preface
The Burrows-Wheeler Transform between Data Compression and Combinatorics on Words
The Burrows-Wheeler Transform (BWT) is a tool of fundamental importance in Data Compression and, recently, has found many applications well beyond its original purpose. The main goal of this paper is to highlight the mathematical and combinatorial properties on which the outstanding versatility of the $BWT$ is based, i.e. its reversibility and the clustering effect on the output. Such properties have aroused curiosity and fervent interest in the scientific world both for theoretical aspects and for practical effects. In particular, in this paper we are interested both to survey the theoretical research issues which, by taking their cue from Data Compression, have been developed in the conte…
Forbidden words in symbolic dynamics
AbstractWe introduce an equivalence relation≃between functions from N to N. By describing a symbolic dynamical system in terms of forbidden words, we prove that the≃-equivalence class of the function that counts the minimal forbidden words of a system is a topological invariant of the system. We show that the new invariant is independent from previous ones, but it is not characteristic. In the case of sofic systems, we prove that the≃-equivalence of the corresponding functions is a decidable question. As a more special application, we show, by using the new invariant, that two systems associated to Sturmian words having “different slope” are not conjugate.
An extension of the Burrows-Wheeler Transform to k words
Words and forbidden factors
AbstractGiven a finite or infinite word v, we consider the set M(v) of minimal forbidden factors of v. We show that the set M(v) is of fundamental importance in determining the structure of the word v. In the case of a finite word w we consider two parameters that are related to the size of M(w): the first counts the minimal forbidden factors of w and the second gives the length of the longest minimal forbidden factor of w. We derive sharp upper and lower bounds for both parameters. We prove also that the second parameter is related to the minimal period of the word w. We are further interested to the algorithmic point of view. Indeed, we design linear time algorithm for the following two p…
An extension of the Burrows-Wheeler Transform
AbstractWe describe and highlight a generalization of the Burrows–Wheeler Transform (bwt) to a multiset of words. The extended transformation, denoted by ebwt, is reversible. Moreover, it allows to define a bijection between the words over a finite alphabet A and the finite multisets of conjugacy classes of primitive words in A∗. Besides its mathematical interest, the extended transform can be useful for applications in the context of string processing. In the last part of this paper we illustrate one such application, providing a similarity measure between sequences based on ebwt.
Cyclic Complexity of Words
We introduce and study a complexity function on words $c_x(n),$ called \emph{cyclic complexity}, which counts the number of conjugacy classes of factors of length $n$ of an infinite word $x.$ We extend the well-known Morse-Hedlund theorem to the setting of cyclic complexity by showing that a word is ultimately periodic if and only if it has bounded cyclic complexity. Unlike most complexity functions, cyclic complexity distinguishes between Sturmian words of different slopes. We prove that if $x$ is a Sturmian word and $y$ is a word having the same cyclic complexity of $x,$ then up to renaming letters, $x$ and $y$ have the same set of factors. In particular, $y$ is also Sturmian of slope equ…
The Burrows-Wheeler Transform: from data compression to combinatorics on words
Novel Combinatorial and Information-Theoretic Alignment-Free Distances for Biological Data Mining
Among the plethora of alignment-free methods for comparing biological sequences, there are some that we have perceived as representative of the novel techniques that have been devised in the past few years and as being of a fundamental nature and of broad interest and applicability, ranging from combinatorics to information theory. In this chapter, we review these alignment free methods, by presenting both their mathematical definitions and the experiments in which they are involved in.
Suffix array and Lyndon factorization of a text
Abstract The main goal of this paper is to highlight the relationship between the suffix array of a text and its Lyndon factorization. It is proved in [15] that one can obtain the Lyndon factorization of a text from its suffix array. Conversely, here we show a new method for constructing the suffix array of a text that takes advantage of its Lyndon factorization. The surprising consequence of our results is that, in order to construct the suffix array, the local suffixes inside each Lyndon factor can be separately processed, allowing different implementative scenarios, such as online, external and internal memory, or parallel implementations. Based on our results, the algorithm that we prop…
Inducing the Lyndon Array
In this paper we propose a variant of the induced suffix sorting algorithm by Nong (TOIS, 2013) that computes simultaneously the Lyndon array and the suffix array of a text in $O(n)$ time using $\sigma + O(1)$ words of working space, where $n$ is the length of the text and $\sigma$ is the alphabet size. Our result improves the previous best space requirement for linear time computation of the Lyndon array. In fact, all the known linear algorithms for Lyndon array computation use suffix sorting as a preprocessing step and use $O(n)$ words of working space in addition to the Lyndon array and suffix array. Experimental results with real and synthetic datasets show that our algorithm is not onl…
Standard Sturmian words and automata minimization algorithms
The study of some close connections between the combinatorial properties of words and the performance of the automata minimization process constitutes the main focus of this paper. These relationships have been, in fact, the basis of the study of the tightness and the extremal cases of Hopcroft's algorithm, that is, up to now, the most efficient minimization method for deterministic finite state automata. Recently, increasing attention has been paid to another minimization method that, unlike the approach proposed by Hopcroft, is not based on refinement of the set of states of the automaton, but on automata operations such as determinization and reverse, and is also applicable to non-determ…
On fixed points of the Burrows-Wheeler transform
The Burrows-Wheeler Transform is a well known transformation widely used in Data Compression: important competitive compression software, such as Bzip (cf. [1]) and Szip (cf. [2]) and some indexing software, like the FM-index (cf. [3]), are deeply based on the Burrows Wheeler Transform. The main advantage of using BWT for data compression consists in its feature of "clustering" together equal characters. In this paper we show the existence of fixed points of BWT, i.e., words on which BWT has no effect. We show a characterization of the permutations associated to BWT of fixed points and we give the explicit form of fixed points on a binary ordered alphabet a, b having at most four b's and th…
Boosting Textual Compression in Optimal Linear Time
We provide a general boosting technique for Textual Data Compression. Qualitatively, it takes a good compression algorithm and turns it into an algorithm with a better compression performance guarantee. It displays the following remarkable properties: (a) it can turn any memoryless compressor into a compression algorithm that uses the “best possible” contexts; (b) it is very simple and optimal in terms of time; and (c) it admits a decompression algorithm again optimal in time. To the best of our knowledge, this is the first boosting technique displaying these properties.Technically, our boosting technique builds upon three main ingredients: the Burrows--Wheeler Transform, the Suffix Tree d…
A new sequence distance measure based on the Burrows-Wheeler Transform
Burrows-Wheeler transform and Run-Length Enconding
In this paper we study the clustering effect of the Burrows-Wheeler Transform (BWT) from a combinatorial viewpoint. In particular, given a word w we define the BWT-clustering ratio of w as the ratio between the number of clusters produced by BWT and the number of the clusters of w. The number of clusters of a word is measured by its Run-Length Encoding. We show that the BWT-clustering ratio ranges in ]0, 2]. Moreover, given a rational number \(r\,\in \,]0,2]\), it is possible to find infinitely many words having BWT-clustering ratio equal to r. Finally, we show how the words can be classified according to their BWT-clustering ratio. The behavior of such a parameter is studied for very well-…
On Extremal Cases of Hopcroft’s Algorithm
In this paper we consider the problem of minimization of deterministic finite automata (DFA) with reference to Hopcroft’s algorithm. Hopcroft’s algorithm has several degrees of freedom, so there can exist different sequences of refinements of the set of the states that lead to the final partition. We find an infinite family of binary automata for which such a process is unique. Some recent papers (cf. [3,7,1]) have been devoted to find families of automata for which Hopcroft’s algorithm has its worst execution time. They are unary automata associated to circular words. However, automata minimization can be achieved also in linear time when the alphabet has only one letter (cf. [14]), so in …
Moore Automata and Epichristoffel Words
Word assembly through minimal forbidden words
AbstractWe give a linear-time algorithm to reconstruct a finite word w over a finite alphabet A of constant size starting from a finite set of factors of w verifying a suitable hypothesis. We use combinatorics techniques based on the minimal forbidden words, which have been introduced in previous papers. This improves a previous algorithm which worked under the assumption of stronger hypothesis.
SORTING CONJUGATES AND SUFFIXES OF WORDS IN A MULTISET
In this paper we are interested in the study of the combinatorial aspects related to the extension of the Burrows-Wheeler transform to a multiset of words. Such study involves the notion of suffixes and conjugates of words and is based on two different order relations, denoted by <lex and ≺ω, that, even if strictly connected, are quite different from the computational point of view. In particular, we introduce a method that only uses the <lex sorting among suffixes of a multiset of words in order to sort their conjugates according to ≺ω-order. In this study an important role is played by Lyndon words. This strategy could be used in applications specially in the field of Bioinformatic…
The Burrows-Wheeler Transform: a new tool in Combinatorics on Words
Nondeterministic Moore Automata and Brzozowski’s Algorithm
Moore automata represent a model that has many applications. In this paper we define a notion of coherent nondeterministic Moore automaton (NMA) and show that such a model has the same computational power of the classical deterministic Moore automaton. We consider also the problem of constructing the minimal deterministic Moore automaton equivalent to a given NMA. In this paper we propose an algorithm that is a variant of Brzozowski's algorithm in the sense that it is essentially structured as reverse operation and subset construction performed twice.
A combinatorial view on string attractors
Abstract The notion of string attractor has recently been introduced in [Prezza, 2017] and studied in [Kempa and Prezza, 2018] to provide a unifying framework for known dictionary-based compressors. A string attractor for a word w = w 1 w 2 ⋯ w n is a subset Γ of the positions { 1 , … , n } , such that all distinct factors of w have an occurrence crossing at least one of the elements of Γ. In this paper we explore the notion of string attractor by focusing on its combinatorial properties. In particular, we show how the size of the smallest string attractor of a word varies when combinatorial operations are applied and we deduce that such a measure is not monotone. Moreover, we introduce a c…
Universal Lyndon Words
A word w over an alphabet Σ is a Lyndon word if there exists an order defined on Σ for which w is lexicographically smaller than all of its conjugates (other than itself). We introduce and study universal Lyndon words, which are words over an n-letter alphabet that have length n! and such that all the conjugates are Lyndon words. We show that universal Lyndon words exist for every n and exhibit combinatorial and structural properties of these words. We then define particular prefix codes, which we call Hamiltonian lex-codes, and show that every Hamiltonian lex-code is in bijection with the set of the shortest unrepeated prefixes of the conjugates of a universal Lyndon word. This allows us t…
Forbidden Factors and Fragment Assembly
In this paper we approach the fragment assembly problem by using the notion of minimal forbidden factors introduced in previous paper. Denoting by M(w) the set of minimal forbidden factors of a word w, we first focus on the evaluation of the size of elements in M(w) and on designing of an algorithm to recover the word w from M(w). Actually we prove that for a word w randomly generated by a memoryless source with identical symbol probabilities, the maximal length m(w) of words in M(w) is logarithmic and that the reconstruction algorithm runs in linear time. These results have an interesting application to the fragment assembly problem, i.e. reconstruct a word w from a given set I of substrin…
Circular words and automata minimization
The colored longest common prefix array computed via sequential scans
Due to the increased availability of large datasets of biological sequences, the tools for sequence comparison are now relying on efficient alignment-free approaches to a greater extent. Most of the alignment-free approaches require the computation of statistics of the sequences in the dataset. Such computations become impractical in internal memory when very large collections of long sequences are considered. In this paper, we present a new conceptual data structure, the colored longest common prefix array (cLCP), that allows to efficiently tackle several problems with an alignment-free approach. In fact, we show that such a data structure can be computed via sequential scans in semi-exter…
Measuring the clustering effect of BWT via RLE
Abstract The Burrows–Wheeler Transform (BWT) is a reversible transformation on which are based several text compressors and many other tools used in Bioinformatics and Computational Biology. The BWT is not actually a compressor, but a transformation that performs a context-dependent permutation of the letters of the input text that often create runs of equal letters (clusters) longer than the ones in the original text, usually referred to as the “clustering effect” of BWT. In particular, from a combinatorial point of view, great attention has been given to the case in which the BWT produces the fewest number of clusters (cf. [5] , [16] , [21] , [23] ). In this paper we are concerned about t…
String Attractors and Infinite Words
The notion of string attractor has been introduced by Kempa and Prezza (STOC 2018) in the context of Data Compression and it represents a set of positions of a finite word in which all of its factors can be “attracted”. The smallest size γ∗ of a string attractor for a finite word is a lower bound for several repetitiveness measures associated with the most common compression schemes, including BWT-based and LZ-based compressors. The combinatorial properties of the measure γ∗ have been studied in [Mantaci et al., TCS 2021]. Very recently, a complexity measure, called string attractor profile function, has been introduced for infinite words, by evaluating γ∗ on each prefix. Such a measure has…
Circular sturmian words and Hopcroft’s algorithm
AbstractIn order to analyze some extremal cases of Hopcroft’s algorithm, we investigate the relationships between the combinatorial properties of a circular sturmian word (x) and the run of the algorithm on the cyclic automaton Ax associated to (x). The combinatorial properties of words taken into account make use of sturmian morphisms and give rise to the notion of reduction tree of a circular sturmian word. We prove that the shape of this tree uniquely characterizes the word itself. The properties of the run of Hopcroft’s algorithm are expressed in terms of the derivation tree of the automaton, which is a tree that represents the refinement process that, in the execution of Hopcroft’s alg…
Epichristoffel Words and Minimization of Moore Automata
This paper is focused on the connection between the combinatorics of words and minimization of automata. The three main ingredients are the epichristoffel words, Moore automata and a variant of Hopcroft's algorithm for their minimization. Epichristoffel words defined in [14] generalize some properties of circular sturmian words. Here we prove a factorization property and the existence of the reduction tree, that uniquely identifies the structure of the word. Furthermore, in the paper we investigate the problem of the minimization of Moore automata by defining a variant of Hopcroft's minimization algorithm. The use of this variant makes simpler the computation of the running time and consequ…
SNPs detection by eBWT positional clustering
Sequencing technologies keep on turning cheaper and faster, thus putting a growing pressure for data structures designed to efficiently store raw data, and possibly perform analysis therein. In this view, there is a growing interest in alignment-free and reference-free variants calling methods that only make use of (suitably indexed) raw reads data. We develop the positional clustering theory that (i) describes how the extended Burrows–Wheeler Transform (eBWT) of a collection of reads tends to cluster together bases that cover the same genome position (ii) predicts the size of such clusters, and (iii) exhibits an elegant and precise LCP array based procedure to locate such clusters in the e…
String attractors and combinatorics on words
The notion of \emph{string attractor} has recently been introduced in [Prezza, 2017] and studied in [Kempa and Prezza, 2018] to provide a unifying framework for known dictionary-based compressors. A string attractor for a word $w=w[1]w[2]\cdots w[n]$ is a subset $\Gamma$ of the positions $\{1,\ldots,n\}$, such that all distinct factors of $w$ have an occurrence crossing at least one of the elements of $\Gamma$. While finding the smallest string attractor for a word is a NP-complete problem, it has been proved in [Kempa and Prezza, 2018] that dictionary compressors can be interpreted as algorithms approximating the smallest string attractor for a given word. In this paper we explore the noti…
Words, Trees and Automata Minimization
In this paper we explore some connections between some combinatorial properties of words and the study of extremal cases of the automata minimization process. An intermediate role is played by the notion od word trees for which some properties of words are generalized. In particular, we describe an infinite family of binary automata, called word automata and constructed by using standard sturmian words and more specifically Fibonacci words, that represent the extremal case of some well known automata minimization algorithms, such as Moore’s and Hopcroft’s methods. As well as giving an overview of the main results in this context, the main purpose of this paper is to prove that, even if a re…
The Alternating BWT: an algorithmic perspective
Abstract The Burrows-Wheeler Transform (BWT) is a word transformation introduced in 1994 for Data Compression. It has become a fundamental tool for designing self-indexing data structures, with important applications in several areas in science and engineering. The Alternating Burrows-Wheeler Transform (ABWT) is another transformation recently introduced in Gessel et al. (2012) [21] and studied in the field of Combinatorics on Words. It is analogous to the BWT, except that it uses an alternating lexicographical order instead of the usual one. Building on results in Giancarlo et al. (2018) [23] , where we have shown that BWT and ABWT are part of a larger class of reversible transformations, …
The effects of mental rotation on computational thinking
Abstract Although several investigations of spatial reasoning and mental rotation skills have been conducted in research areas linked to STEM education, to the best of our knowledge, few of these studies have examined the relationship between spatial reasoning and computational thinking. Given this gap in the literature, the present study investigates the role and action of spatial reasoning, and specifically the effects of mental rotation on computational thinking within an embodied and enacted perspective. To achieve this, we carried out a study involving 92 students in five primary-school classes (1st grade - 5th grade). The findings reveal a positive correlation between computational th…
An extension of the Burrows-Wheeler Transform and applications to sequence comparison and data compression
We introduce a generalization of the Burrows-Wheeler Transform (BWT) that can be applied to a multiset of words. The extended transformation, denoted by E, is reversible, but, differently from BWT, it is also surjective. The E transformation allows to give a definition of distance between two sequences, that we apply here to the problem of the whole mitochondrial genome phylogeny. Moreover we give some consideration about compressing a set of words by using the E transformation as preprocessing.
A challenging family of automata for classical minimization algorithms
In this paper a particular family of deterministic automata that was built to reach the worst case complexity of Hopcroft's state minimization algorithm is considered. This family is also challenging for the two other classical minimization algorithms: it achieves the worst case for Moore's algorithm, as a consequence of a result by Berstel et al., and is of at least quadratic complexity for Brzozowski's solution, which is our main contribution. It therefore constitutes an interesting family, which can be useful to measure the efficiency of implementations of well-known or new minimization algorithms.
Hopcroft’s Algorithm and Tree-like Automata
In order to analyze some extremal cases of Hopcroft’s algorithm we deepened the relationship between combinatorial properties of circular words and the ex- ecution of the algorithm on cyclic automata associated to such words.In this paper we highlight the notion of word tree and in particular, we char- acterize the word trees for which Hopcroft’s algorithm on the associated tree-like automata has a unique refinement process. Moreover, we show the relationship between the time complexity of the refinements process of the Hopcroft’s algo- rithm on unary cyclic automata and binary tree-like automata. Such a result allows to exhibit a family of tree-like automata representing the worst case of …