0000000000012787
AUTHOR
Filippo Mignosi
Text Compression Using Antidictionaries
We give a new text compression scheme based on Forbidden Words ("antidictionary"). We prove that our algorithms attain the entropy for balanced binary sources. They run in linear time. Moreover, one of the main advantages of this approach is that it produces very fast decompressors. A second advantage is a synchronization property that is helpful to search compressed data and allows parallel compression. Our algorithms can also be presented as "compilers" that create compressors dedicated to any previously fixed source. The techniques used in this paper are from Information Theory and Finite Automata.
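The forbidden-word idea can be illustrated with a minimal Python sketch. It is not the linear-time, automaton-based algorithm of the paper: it runs in quadratic time and assumes a binary text that avoids every word of a given antidictionary. The function names, the toy text and the toy antidictionary are all hypothetical. A bit is dropped whenever some forbidden word leaves only one possible continuation, and the decoder, given the original length, re-derives the dropped bits.

```python
def predicted_bit(prefix, antidict):
    """Return the forced next bit after `prefix`, or None if it is free.

    If u+b is forbidden and u is a suffix of the text read so far,
    the next bit cannot be b, so it must be the other bit.
    """
    for word in antidict:
        stem, last = word[:-1], word[-1]
        if prefix.endswith(stem):
            return "1" if last == "0" else "0"
    return None


def compress(text, antidict):
    """Keep only the bits that the antidictionary cannot predict."""
    kept = []
    for i, bit in enumerate(text):
        if predicted_bit(text[:i], antidict) is None:
            kept.append(bit)
    return "".join(kept)


def decompress(code, antidict, length):
    """Rebuild the text from the kept bits and the original length."""
    source = iter(code)
    text = ""
    for _ in range(length):
        forced = predicted_bit(text, antidict)
        text += forced if forced is not None else next(source)
    return text


# Toy data (assumed for illustration): the text avoids every listed word.
toy_antidict = ["00", "0110", "1011", "1111"]
toy_text = "01110101"
toy_code = compress(toy_text, toy_antidict)
```

In this sketch the decoder needs the antidictionary and the original length in addition to the kept bits, so those would have to be transmitted as well.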
Periodicity, morphisms, and matrices
In 1965, Fine and Wilf proved the following theorem: if (f_n)_{n≥0} and (g_n)_{n≥0} are periodic sequences of real numbers, of period lengths h and k, respectively, and f_n = g_n for 0 ≤ n < h + k − gcd(h,k), then f_n = g_n for all n ≥ 0. Furthermore, the constant h + k − gcd(h,k) is best possible. In this paper, we consider some variations on this theorem. In particular, we study the case where f_n ≤ g_n, instead of f_n = g_n. We also obtain generalizations to more than two periods. We apply our methods to a previously unsolved conjecture on iterated morphisms, the decreasing length conjecture: if h : Σ* → Σ* is a morphism with |Σ| = n, and w is a word with |w| < |h(w)| < |h^2(w)| < ... < |h^k(w)|, then k ≤ n.
Automata and forbidden words
Let L(M) be the (factorial) language avoiding a given anti-factorial language M. We design an automaton accepting L(M) and built from the language M. The construction is effective if M is finite. If M is the set of minimal forbidden words of a single word ν, the automaton turns out to be the factor automaton of ν (the minimal automaton accepting the set of factors of ν). We also give an algorithm that builds the trie of M from the factor automaton of a single word. It yields a nontrivial upper bound on the number of minimal forbidden words of a word.
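For small inputs, the set of minimal forbidden words of a single word can be computed by brute force. The Python sketch below is only an illustration and does not use the factor automaton of the paper. It relies on the standard definition: a word is minimal forbidden if it is not a factor while its longest proper prefix and longest proper suffix both are. The function names and the sample word are assumptions.

```python
def factors(w):
    """All factors (substrings) of w, including the empty word."""
    n = len(w)
    return {w[i:j] for i in range(n + 1) for j in range(i, n + 1)}


def minimal_forbidden_words(w, alphabet):
    """Brute-force set of minimal forbidden words of w over `alphabet`.

    A candidate x + b (x a factor, b a letter) is minimal forbidden when
    it is not a factor of w but its suffix (x + b)[1:] still is.
    Its prefix x is a factor by construction.
    """
    fact = factors(w)
    result = set()
    for x in fact:
        for b in alphabet:
            u = x + b
            if u not in fact and u[1:] in fact:
                result.add(u)
    return result


sample_mfw = minimal_forbidden_words("01110101", "01")
```

The factor set has quadratic size, so this approach is only suitable for experimenting with short words.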
The rightmost equal-cost position problem.
LZ77-based compression schemes compress the input text by replacing factors in the text with an encoded reference to a previous occurrence formed by the couple (length, offset). For a given factor, the smaller the offset, the smaller the resulting compression ratio. This is optimally achieved by using the rightmost occurrence of a factor in the previous text. Given a cost function, for instance the minimum number of bits used to represent an integer, we define the Rightmost Equal-Cost Position (REP) problem as the problem of finding one of the occurrences of a factor whose cost is equal to the cost of the rightmost one. We present the Multi-Layer Suffix Tree data structure that, for…
Variations on a Theorem of Fine & Wilf
In 1965, Fine & Wilf proved the following theorem: if (f_n)_{n≥0} and (g_n)_{n≥0} are periodic sequences of real numbers, of periods h and k respectively, and f_n = g_n for 0 ≤ n < h + k − gcd(h, k), then f_n = g_n for all n ≥ 0. Furthermore, the constant h + k − gcd(h, k) is best possible. In this paper we consider some variations on this theorem. In particular, we study the case where f_n ≤ g_n instead of f_n = g_n. We also obtain a generalization to more than two periods.
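The theorem is often stated for finite words: a word with periods h and k and length at least h + k − gcd(h, k) also has period gcd(h, k). The Python sketch below checks this word form by exhaustive search over a binary alphabet for small h and k, and also checks that the bound fails one letter earlier. It is a sanity check under these assumptions, not a proof, and the function names are hypothetical.

```python
from itertools import product
from math import gcd


def has_period(w, p):
    """True if w[i] == w[i + p] wherever both positions exist."""
    return all(w[i] == w[i + p] for i in range(len(w) - p))


def words_with_periods(n, h, k, alphabet="ab"):
    """All words of length n over `alphabet` having both periods h and k."""
    candidates = ("".join(t) for t in product(alphabet, repeat=n))
    return [w for w in candidates if has_period(w, h) and has_period(w, k)]


def check_fine_wilf(h, k):
    """Return (holds_at_bound, fails_one_below) for the pair (h, k)."""
    d = gcd(h, k)
    bound = h + k - d
    holds = all(has_period(w, d) for w in words_with_periods(bound, h, k))
    fails_below = any(
        not has_period(w, d) for w in words_with_periods(bound - 1, h, k)
    )
    return holds, fails_below
```

For example, `check_fine_wilf(3, 5)` inspects all binary words of length 7 and of length 6. When one period divides the other, the second component is expected to be false, because the statement is then trivial.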
Forbidden Factors and Fragment Assembly
In this paper methods and results related to the notion of minimal forbidden words are applied to the fragment assembly problem. The fragment assembly problem can be formulated, in its simplest form, as follows: reconstruct a word w from a given set I of substrings (fragments) of w. We introduce a hypothesis involving the set of fragments I and the maximal length m(w) of the minimal forbidden factors of w. This hypothesis allows us to reconstruct the word w uniquely from the set I in linear time. We also prove that, if w is a word randomly generated by a memoryless source with identical symbol probabilities, m(w) is logarithmic with respect to the size of w. This result shows th…
Languages with mismatches
In this paper we study some combinatorial properties of a class of languages that represent sets of words occurring in a text S up to some errors. More precisely, we consider sets of words that occur in a text S with k mismatches in any window of size r. The study of this class of languages mainly focuses both on a parameter, called repetition index, and on the set of the minimal forbidden words of the language of factors of S with errors. The repetition index of a string S is defined as the smallest integer such that all strings of this length occur at most in a unique position of the text S up to errors. We prove that there is a strong relation between the repetition index of S an…
A Note on a Conjecture of Duval and Sturmian Words
We prove a long standing conjecture of Duval in the special case of Sturmian words. Let U be a nonempty word on a finite alphabet A. A nonempty word B different from U is called a border of U if B is both a prefix and suffix of U. We say U is bordered if U admits a border, otherwise U is said to be unbordered. For example, U = 011001011 is bordered by the factor 011, while 00010001001 is unbordered. An integer k with 1 ≤ k ≤ n is a period of a word U = U_1 ... U_n if and only if for all 1 ≤ i ≤ n − k we have U_i = U_{i+k}. It is easy to see that k is a period of U if and only if the prefix B of U of length n − k is a border of U or is empty. Let (U) denote the smallest period …
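The notions of border and period used in this abstract can be checked on its two example words with a short Python sketch. The function names are hypothetical and the code simply applies the definitions: a border is a nonempty proper prefix that is also a suffix, and k is a period when letters at distance k agree.

```python
def borders(u):
    """Nonempty proper prefixes of u that are also suffixes of u."""
    return [u[:m] for m in range(1, len(u)) if u[:m] == u[-m:]]


def is_period(u, k):
    """True if u[i] == u[i + k] wherever both positions exist."""
    return all(u[i] == u[i + k] for i in range(len(u) - k))


def smallest_period(u):
    """Smallest k >= 1 that is a period of the nonempty word u."""
    return next(k for k in range(1, len(u) + 1) if is_period(u, k))


def periods_match_borders(u):
    """Check: k is a period iff the prefix of length n - k is a border or empty."""
    n = len(u)
    bs = set(borders(u))
    return all(
        is_period(u, k) == (k == n or u[: n - k] in bs) for k in range(1, n + 1)
    )
```

An unbordered word of length n has smallest period n, which is what the second example word illustrates.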
Dictionary-symbolwise flexible parsing
Linear-time optimal parsing algorithms are rare in the dictionary-based branch of the data compression theory. A recent result is the Flexible Parsing algorithm of Matias and Sahinalp (1999) that works when the dictionary is prefix closed and the encoding of dictionary pointers has a constant cost. We present the Dictionary-Symbolwise Flexible Parsing algorithm that is optimal for prefix-closed dictionaries and any symbolwise compressor under some natural hypothesis. In the case of LZ78-like algorithms with variable costs and any, linear as usual, symbolwise compressor we show how to implement our parsing algorithm in linear time. In the case of LZ77-like dictionaries and any symbol…
A NEW COMPLEXITY FUNCTION FOR WORDS BASED ON PERIODICITY
Motivated by the extension of the critical factorization theorem to infinite words, we study the (local) periodicity function, i.e. the function that, for any position in a word, gives the size of the shortest square centered in that position. We prove that this function characterizes any binary word up to exchange of letters. We then introduce a new complexity function for words (the periodicity complexity) that, for any position in the word, gives the average value of the periodicity function up to that position. The new complexity function is independent from the other commonly used complexity measures as, for instance, the factor complexity. Indeed, whereas any infinite word with bound…
Minimal forbidden words and factor automata
Let L(M) be the (factorial) language avoiding a given antifactorial language M. We design an automaton accepting L(M) and built from the language M. The construction is effective if M is finite. If M is the set of minimal forbidden words of a single word v, the automaton turns out to be the factor automaton of v (the minimal automaton accepting the set of factors of v). We also give an algorithm that builds the trie of M from the factor automaton of a single word. It yields a non-trivial upper bound on the number of minimal forbidden words of a word.
"Indexing structures for approximate string matching
In this paper we give the first, to our knowledge, structures and corresponding algorithms for approximate indexing, by considering the Hamming distance, having the following properties. i) Their size is linear times a polylog of the size of the text on average. ii) For each pattern x, the time spent by our algorithms for finding the list occ(x) of all occurrences of a pattern x in the text, up to a certain distance, is proportional on average to |x| + |occ(x)|, under an additional but realistic hypothesis.
A trie-based approach for compacting automata
We describe a new technique for reducing the number of nodes and symbols in automata based on tries. The technique stems from some results on anti-dictionaries for data compression and does not need to retain the input string, differently from other methods based on compact automata. The net effect is that of obtaining a lighter automaton than the directed acyclic word graph (DAWG) of Blumer et al., as it uses fewer nodes, still with arcs labeled by single characters.
If a DOL language is k-power free then it is circular
We prove that if a DOL language is k-power free then it is circular. By using this result we are able to give an algorithm which decides whether, for a fixed integer k ≥ 1, a DOL language is k-power free; we are also able to give a new simpler proof of a result, previously obtained by Ehrenfeucht and Rozenberg, that states that it is decidable whether a DOL language is k-power free for some integer k ≥ 1.
The Expressibility of Languages and Relations by Word Equations
Classically, several properties and relations of words, such as being a power of a same word, can be expressed by using word equations. This paper is devoted to study in general the expressive power of word equations. As main results we prove theorems which allow us to show that certain properties of words are not expressible as components of solutions of word equations. In particular, the primitiveness and the equal length are such properties, as well as being any word over a proper subalphabet.
Abelian-Square-Rich Words
An abelian square is the concatenation of two words that are anagrams of one another. A word of length $n$ can contain at most $\Theta(n^2)$ distinct factors, and there exist words of length $n$ containing $\Theta(n^2)$ distinct abelian-square factors, that is, distinct factors that are abelian squares. This motivates us to study infinite words such that the number of distinct abelian-square factors of length $n$ grows quadratically with $n$. More precisely, we say that an infinite word $w$ is {\it abelian-square-rich} if, for every $n$, every factor of $w$ of length $n$ contains, on average, a number of distinct abelian-square factors that is quadratic in $n$; and {\it uniformly abelian-sq…
STURMIAN WORDS AND AMBIGUOUS CONTEXT-FREE LANGUAGES
If x is a rational number, 0 < x ≤ 1, then A(x)^c is a context-free language, where A(x) is the set of factors of the infinite Sturmian words with asymptotic density of 1's smaller than or equal to x. We also prove a "gap" theorem, i.e. A(x) can never be an unambiguous co-context-free language. The "gap" theorem is established by proving that the counting generating function of A(x) is transcendental. We show some links between Sturmian words, combinatorics and number theory.
On lazy representations and Sturmian graphs
In this paper we establish a strong relationship between the set of lazy representations and the set of paths in a Sturmian graph associated with a real number α. We prove that for any non-negative integer i the unique path weighted i in the Sturmian graph associated with α represents the lazy representation of i in the Ostrowski numeration system associated with α. Moreover, we provide several properties of the representations of the natural integers in this numeration system.
Fragment assembly through minimal forbidden words
On the number of factors of Sturmian words
We prove that for m ≥ 1, card(A_m) = 1 + ∑_{i=1}^{m} (m − i + 1) ϕ(i), where A_m is the set of factors of length m of all the Sturmian words and ϕ is the Euler function. This result was conjectured by Dulucq and Gouyou-Beauchamps (1987), who proved that this result implies that the language (∪_{m≥0} A_m)^c is inherently ambiguous. We also give a combinatorial version of the Riemann hypothesis.
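The counting formula can be compared with a brute-force count for small m. The Python sketch below assumes the classical characterization of finite factors of Sturmian words as the balanced binary words (any two factors of equal length differ by at most one in their number of 1's). The function names are hypothetical, and the check is only a numerical illustration.

```python
from itertools import product
from math import gcd


def euler_phi(n):
    """Euler's totient function, computed naively."""
    return sum(1 for k in range(1, n + 1) if gcd(k, n) == 1)


def sturmian_factor_count(m):
    """Value of 1 + sum_{i=1}^{m} (m - i + 1) * phi(i)."""
    return 1 + sum((m - i + 1) * euler_phi(i) for i in range(1, m + 1))


def is_balanced(w):
    """True if equal-length factors of w differ by at most one '1'."""
    n = len(w)
    for length in range(1, n + 1):
        ones = [w[i:i + length].count("1") for i in range(n - length + 1)]
        if max(ones) - min(ones) > 1:
            return False
    return True


def count_balanced(m):
    """Number of balanced binary words of length m, by enumeration."""
    return sum(is_balanced("".join(t)) for t in product("01", repeat=m))
```

The enumeration is exponential in m, so the comparison is practical only for short lengths.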
From Nerode's congruence to Suffix Automata with mismatches
In this paper we focus on the minimal deterministic finite automaton S_k that recognizes the set of suffixes of a word w up to k errors. As first result we give a characterization of Nerode's right-invariant congruence that is associated with S_k. This result generalizes the classical characterization described in [A. Blumer, J. Blumer, D. Haussler, A. Ehrenfeucht, M. Chen, J. Seiferas, The smallest automaton recognizing the subwords of a text, Theoretical Computer Science, 40, 1985, 31–55]. As second result we present an algorithm that makes use of S_k to accept in an efficient way the language of all suffixes of w up to k errors in every window of size r of a text, where r is the…
Linear-size suffix tries
Suffix trees are highly regarded data structures for text indexing and string algorithms [McCreight 76, Weiner 73]. For any given string w of length n = |w|, a suffix tree for w takes O(n) nodes and links. It is often presented as a compacted version of a suffix trie for w, where the latter is the trie (or digital search tree) built on the suffixes of w. Here the compaction process replaces each maximal chain of unary nodes with a single arc. For this, the suffix tree requires that the labels of its arcs are substrings encoded as pointers to w (or equivalent information). On the contrary, the arcs of the suffix trie are labeled by single symbols but there can be Θ(n^2) nodes and lin…
On Sturmian Graphs
In this paper we define Sturmian graphs and we prove that all of them have a certain "counting" property. We show deep connections between this counting property and two conjectures, by Moser and by Zaremba, on the continued fraction expansion of real numbers. These graphs turn out to be the underlying graphs of compact directed acyclic word graphs of central Sturmian words. In order to prove this result, we give a characterization of the maximal repeats of central Sturmian words. We show also that, in analogy with the case of Sturmian words, these graphs converge to infinite ones.
Generalizations of the periodicity Theorem of Fine and Wilf
We provide three generalizations to the two-dimensional case of the well known periodicity theorem by Fine and Wilf [4] for strings (the one-dimensional case). The first and the second generalizations can be further extended to hold in the more general setting of Cayley graphs of groups. Weak forms of two of our results have been developed for the design of efficient algorithms for two-dimensional pattern matching [2, 3, 6].
Sturmian graphs and integer representations over numeration systems
In this paper we consider a numeration system, originally due to Ostrowski, based on the continued fraction expansion of a real number α. We prove that this system has deep connections with the Sturmian graph associated with α. We provide several properties of the representations of the natural integers in this system. In particular, we prove that the set of lazy representations of the natural integers in this numeration system is regular if and only if the continued fraction expansion of α is eventually periodic. The main result of the paper is that for any number i the unique path weighted i in the Sturmian graph associated with α represents the lazy representation of i in the Ost…
On the suffix automaton with mismatches
In this paper we focus on the construction of the minimal deterministic finite automaton S_k that recognizes the set of suffixes of a word w up to k errors. We present an algorithm that makes use of S_k in order to accept in an efficient way the language of all suffixes of w up to k errors in every window of size r, where r is the value of the repetition index of w. Moreover, we give some experimental results on some well-known words, like prefixes of Fibonacci and Thue-Morse words, and we make a conjecture on the size of the suffix automaton with mismatches.
Forbidden words in symbolic dynamics
We introduce an equivalence relation ≃ between functions from N to N. By describing a symbolic dynamical system in terms of forbidden words, we prove that the ≃-equivalence class of the function that counts the minimal forbidden words of a system is a topological invariant of the system. We show that the new invariant is independent from previous ones, but it is not characteristic. In the case of sofic systems, we prove that the ≃-equivalence of the corresponding functions is a decidable question. As a more special application, we show, by using the new invariant, that two systems associated to Sturmian words having "different slope" are not conjugate.
Words and forbidden factors
Given a finite or infinite word v, we consider the set M(v) of minimal forbidden factors of v. We show that the set M(v) is of fundamental importance in determining the structure of the word v. In the case of a finite word w we consider two parameters that are related to the size of M(w): the first counts the minimal forbidden factors of w and the second gives the length of the longest minimal forbidden factor of w. We derive sharp upper and lower bounds for both parameters. We prove also that the second parameter is related to the minimal period of the word w. We are further interested in the algorithmic point of view. Indeed, we design linear-time algorithms for the following two p…
On a Conjecture on Bidimensional Words
We prove that, given a double sequence w over the alphabet A (i.e. a mapping from Z^2 to A), if there exists a pair (n_0, m_0) ∈ Z^2 such that p_w(n_0, m_0) < (1/100) n_0 m_0, then w has a periodicity vector, where p_w is the complexity function in rectangles of w.
Abelian Powers and Repetitions in Sturmian Words
Richomme, Saari and Zamboni (J. Lond. Math. Soc. 83: 79-95, 2011) proved that at every position of a Sturmian word starts an abelian power of exponent $k$ for every $k > 0$. We improve on this result by studying the maximum exponents of abelian powers and abelian repetitions (an abelian repetition is an analogue of a fractional power) in Sturmian words. We give a formula for computing the maximum exponent of an abelian power of abelian period $m$ starting at a given position in any Sturmian word of rotation angle $\alpha$. As an analogue of the critical exponent, we introduce the abelian critical exponent $A(s_\alpha)$ of a Sturmian word $s_\alpha$ of angle $\alpha$ as the quantity $A(s_\a…
Sturmian Graphs and a conjecture of Moser
In this paper we define Sturmian graphs and we prove that all of them have a “counting” property. We show deep connections between this counting property and two conjectures, by Moser and by Zaremba, on the continued fraction expansion of real numbers. These graphs turn out to be the underlying graphs of CDAWGs of central Sturmian words. We show also that, analogously to the case of Sturmian words, these graphs converge to infinite ones.
On Fine and Wilf's theorem for bidimensional words
Generalizations of Fine and Wilf's Periodicity Theorem are obtained for the case of bidimensional words using geometric arguments. The domains considered constitute a large class of convex subsets of R^2 which include most parallelograms. A complete discussion is provided for the parallelogram case.
Automated Synthesis of Application-layer Connectors from Automata-based Specifications
Ubiquitous and Pervasive Computing, and the Internet of Things, promote dynamic interaction among heterogeneous systems. To achieve this vision, interoperability among heterogeneous systems represents a key enabler, and mediators are often built to solve protocol mismatches. Many approaches propose the synthesis of mediators. Unfortunately, a rigorous characterization of the concept of interoperability is still lacking, making it hard to assess their applicability and soundness. In this paper, we provide a framework for the synthesis of mediators that allows us to: (i) characterize the conditions for the mediator existence and correctness; and (ii) establish the applicability bo…
A multidimensional critical factorization theorem
The Critical Factorization Theorem is one of the principal results in combinatorics on words. It relates local periodicities of a word to its global periodicity. In this paper we give a multidimensional extension of it. More precisely, we give a new proof of the Critical Factorization Theorem, but in a weak form, where the weakness is due to the fact that we lose the tightness of the local repetition order. In exchange, we gain the possibility of extending our proof to the multidimensional case. Indeed, this new proof makes use of the Theorem of Fine and Wilf, that has several classical generalizations to the multidimensional case.
Languages with mismatches and an application to approximate indexing
In this paper we describe a factorial language, denoted by L(S, k, r), that contains all words that occur in a string S up to k mismatches every r symbols. Then we give some combinatorial properties of a parameter, called repetition index and denoted by R(S, k, r), defined as the smallest integer h ≥ 1 such that all strings of this length occur at most in a unique position of the text S up to k mismatches every r symbols. We prove that R(S, k, r) is a non-increasing function of r and a non-decreasing function of k and that the equation r = R(S, k, r) admits a unique solution. The repetition index plays an important role in the construction of an indexing data structure based on a trie that rep…
Word assembly through minimal forbidden words
We give a linear-time algorithm to reconstruct a finite word w over a finite alphabet A of constant size starting from a finite set of factors of w verifying a suitable hypothesis. We use combinatorics techniques based on the minimal forbidden words, which have been introduced in previous papers. This improves a previous algorithm which worked under the assumption of a stronger hypothesis.
Minimal forbidden words and symbolic dynamics
We introduce a new complexity measure of a factorial formal language L: the growth rate of the set of minimal forbidden words. We prove some combinatorial properties of minimal forbidden words. As main result we prove that the growth rate of the set of minimal forbidden words for L is a topological invariant of the dynamical system defined by L.
Fine and Wilf's Theorem for Three periods and a Generalization of Sturmian Words
We extend the theorem of Fine and Wilf to words having three periods. We then define the set 3-PER of words of maximal length for which such result does not apply. We prove that the set 3-PER and the sequences of complexity 2n + 1, introduced by Arnoux and Rauzy to generalize Sturmian words, have the same set of factors.
On the longest common factor problem
The Longest Common Factor (LCF) of a set of strings is a well studied problem having a wide range of applications in Bioinformatics: from microarrays to DNA sequences analysis. This problem has been solved by Hui (2000) who uses a famous constant-time solution to the Lowest Common Ancestor (LCA) problem in trees coupled with use of suffix trees. A data structure for the LCA problem, although linear in space and construction time, introduces a multiplicative constant in both space and time that reduces the range of applications in many biological applications. In this article we present a new method for solving the LCF problem using the suffix tree structure with an auxiliary array that take…
Forbidden Factors and Fragment Assembly
In this paper we approach the fragment assembly problem by using the notion of minimal forbidden factors introduced in a previous paper. Denoting by M(w) the set of minimal forbidden factors of a word w, we first focus on the evaluation of the size of elements in M(w) and on designing an algorithm to recover the word w from M(w). Actually we prove that for a word w randomly generated by a memoryless source with identical symbol probabilities, the maximal length m(w) of words in M(w) is logarithmic and that the reconstruction algorithm runs in linear time. These results have an interesting application to the fragment assembly problem, i.e. reconstruct a word w from a given set I of substrin…
Abelian Repetitions in Sturmian Words
We investigate abelian repetitions in Sturmian words. We exploit a bijection between factors of Sturmian words and subintervals of the unitary segment that allows us to study the periods of abelian repetitions by using classical results of elementary Number Theory. We prove that in any Sturmian word the superior limit of the ratio between the maximal exponent of an abelian repetition of period $m$ and $m$ is a number $\geq\sqrt{5}$, and the equality holds for the Fibonacci infinite word. We further prove that the longest prefix of the Fibonacci infinite word that is an abelian repetition of period $F_j$, $j>1$, has length $F_j(F_{j+1}+F_{j-1} +1)-2$ if $j$ is even or $F_j(F_{j+1}+F_{j-1}…
Characteristic Sturmian words are extremal for the Critical Factorization Theorem
We prove that characteristic Sturmian words are extremal for the Critical Factorization Theorem (CFT) in the following sense. If p_x(n) denotes the local period of an infinite word x at point n, we prove that x is a characteristic Sturmian word if and only if p_x(n) is smaller than or equal to n + 1 for all n ≥ 1 and it is equal to n + 1 for infinitely many integers n. This result is extremal with respect to the CFT since a consequence of the CFT is that, for any infinite recurrent word x, either the function p_x is bounded, and in such a case x is periodic, or p_x(n) ≥ n + 1 for infinitely many integers n. As a byproduct of the techniques used in the paper we extend a r…
On the number of Arnoux–Rauzy words
On numeration systems and Sturmian graphs
Minimal forbidden patterns of multi-dimensional shifts
We study whether the entropy (or growth rate) of minimal forbidden patterns of symbolic dynamical shifts of dimension 2 or more, is a conjugacy invariant. We prove that the entropy of minimal forbidden patterns is a conjugacy invariant for uniformly semi-strongly irreducible shifts. We prove a weaker invariant in the general case.
Words with the Maximum Number of Abelian Squares
An abelian square is the concatenation of two words that are anagrams of one another. A word of length $n$ can contain $\Theta(n^2)$ distinct factors that are abelian squares. We study infinite words such that the number of abelian square factors of length $n$ grows quadratically with $n$.
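The definition of an abelian square can be made concrete with a small Python sketch that enumerates the distinct abelian-square factors of a finite word by brute force. The function names are hypothetical, and the cubic-time enumeration is meant only for experimenting with short words.

```python
from collections import Counter


def is_abelian_square(w):
    """True if w is nonempty, has even length, and its halves are anagrams."""
    if not w or len(w) % 2:
        return False
    half = len(w) // 2
    return Counter(w[:half]) == Counter(w[half:])


def distinct_abelian_squares(w):
    """Set of distinct factors of w that are abelian squares."""
    n = len(w)
    return {
        w[i:j]
        for i in range(n)
        for j in range(i + 2, n + 1, 2)  # only even-length factors
        if is_abelian_square(w[i:j])
    }
```

For instance, `distinct_abelian_squares("abba")` contains `"bb"` and `"abba"`, since `"ab"` and `"ba"` are anagrams of one another.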