
AUTHOR

Filippo Mignosi

Text Compression Using Antidictionaries

We give a new text compression scheme based on forbidden words (an "antidictionary"). We prove that our algorithms attain the entropy for balanced binary sources. They run in linear time. Moreover, one of the main advantages of this approach is that it produces very fast decompressors. A second advantage is a synchronization property that is helpful for searching compressed data and allows parallel compression. Our algorithms can also be presented as "compilers" that create compressors dedicated to any previously fixed source. The techniques used in this paper are from Information Theory and Finite Automata.
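The idea of compressing with forbidden words can be illustrated with a minimal sketch. It assumes a binary text and a hand-supplied antidictionary that the text really avoids; the function names `forced_bit`, `compress`, and `decompress` are hypothetical, and the quadratic suffix scan stands in for the automaton that a linear-time implementation would use.

```python
def forced_bit(prefix, antidict):
    """Return the bit forced after `prefix`, or None if both bits are possible.

    If u + c is forbidden and `prefix` ends with u, the next bit cannot be c.
    """
    for w in antidict:
        u, c = w[:-1], w[-1]
        if prefix.endswith(u):
            return "1" if c == "0" else "0"
    return None


def compress(text, antidict):
    """Drop every bit that the antidictionary makes predictable."""
    kept = [b for i, b in enumerate(text)
            if forced_bit(text[:i], antidict) is None]
    # The original length is needed by the decoder to know when to stop.
    return "".join(kept), len(text)


def decompress(code, n, antidict):
    """Rebuild the text by re-inserting the predictable bits."""
    text, bits = "", iter(code)
    while len(text) < n:
        f = forced_bit(text, antidict)
        text += f if f is not None else next(bits)
    return text


if __name__ == "__main__":
    antidict = {"11"}            # assumption: the text never contains "11"
    text = "0100101"
    code, n = compress(text, antidict)
    print(code, n, decompress(code, n, antidict) == text)
```

The decoder only replays the same prediction rule as the encoder, which is consistent with the abstract's remark that decompression is fast.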

research product

Periodicity, morphisms, and matrices

In 1965, Fine and Wilf proved the following theorem: if (f_n)_{n≥0} and (g_n)_{n≥0} are periodic sequences of real numbers, of period lengths h and k, respectively, and f_n = g_n for 0 ≤ n < h + k - gcd(h,k), then f_n = g_n for all n ≥ 0. Furthermore, the constant h + k - gcd(h,k) is best possible. In this paper, we consider some variations on this theorem. In particular, we study the case where f_n ≤ g_n, instead of f_n = g_n. We also obtain generalizations to more than two periods. We apply our methods to a previously unsolved conjecture on iterated morphisms, the decreasing length conjecture: if h : Σ* → Σ* is a morphism with |Σ| = n, and w is a word with |w| < |h(w)| < |h^2(w)| < ... < |h^k(w)|, then k ≤ n.
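The bound h + k - gcd(h, k) can be checked by brute force on small cases. The sketch below is only an illustration: it uses a two-letter alphabet rather than real numbers, and the helper names `agree` and `fine_wilf_holds` are hypothetical.

```python
from itertools import product
from math import gcd


def agree(f, g, n):
    """True if the periodic extensions of patterns f and g match on 0..n-1."""
    return all(f[i % len(f)] == g[i % len(g)] for i in range(n))


def fine_wilf_holds(h, k, alphabet="ab"):
    """Check the theorem for all patterns of period lengths h and k."""
    bound = h + k - gcd(h, k)
    full = h * k  # a common period, so agreement here means agreement everywhere
    for f in product(alphabet, repeat=h):
        for g in product(alphabet, repeat=k):
            if agree(f, g, bound) and not agree(f, g, full):
                return False
    return True


if __name__ == "__main__":
    print(all(fine_wilf_holds(h, k) for h in range(1, 6) for k in range(1, 6)))
    # Tightness for h = 3, k = 5: the patterns agree on 6 positions but not on 7.
    print(agree("aba", "abaab", 6), agree("aba", "abaab", 7))
```

The pair "aba" and "abaab" shows why the constant cannot be lowered for h = 3 and k = 5: agreement on h + k - gcd(h, k) - 1 = 6 positions does not force equality.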

research product

Automata and forbidden words

Let L(M) be the (factorial) language avoiding a given anti-factorial language M. We design an automaton accepting L(M) and built from the language M. The construction is effective if M is finite. If M is the set of minimal forbidden words of a single word ν, the automaton turns out to be the factor automaton of ν (the minimal automaton accepting the set of factors of ν). We also give an algorithm that builds the trie of M from the factor automaton of a single word. It yields a nontrivial upper bound on the number of minimal forbidden words of a word.
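For a concrete feel for minimal forbidden words, here is a brute-force sketch. It is not the factor-automaton algorithm from the abstract: it enumerates all factors explicitly, so it is only suitable for short words, and the names `factors` and `minimal_forbidden_words` are hypothetical. A word w is taken to be a minimal forbidden word of v when w is not a factor of v but both w without its first letter and w without its last letter are.

```python
def factors(v):
    """All factors (substrings) of v, including the empty word."""
    return {v[i:j] for i in range(len(v) + 1) for j in range(i, len(v) + 1)}


def minimal_forbidden_words(v, alphabet):
    """Brute-force set of minimal forbidden words of v over `alphabet`."""
    fac = factors(v)
    mfw = set()
    for u in fac:                  # u = w without its last letter is a factor
        for a in alphabet:
            w = u + a
            if w not in fac and w[1:] in fac:
                mfw.add(w)
    return mfw


if __name__ == "__main__":
    print(sorted(minimal_forbidden_words("aab", "ab")))
```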

research product

The rightmost equal-cost position problem.

LZ77-based compression schemes compress the input text by replacing factors in the text with an encoded reference to a previous occurrence, formed by the pair (length, offset). For a given factor, the smaller the offset, the smaller the resulting compression ratio. This is optimally achieved by using the rightmost occurrence of a factor in the previous text. Given a cost function, for instance the minimum number of bits used to represent an integer, we define the Rightmost Equal-Cost Position (REP) problem as the problem of finding one of the occurrences of a factor whose cost is equal to the cost of the rightmost one. We present the Multi-Layer Suffix Tree data structure that, for…

research product

Variations on a Theorem of Fine & Wilf

In 1965, Fine & Wilf proved the following theorem: if (f_n)_{n≥0} and (g_n)_{n≥0} are periodic sequences of real numbers, of periods h and k respectively, and f_n = g_n for 0 ≤ n < h + k - gcd(h, k), then f_n = g_n for all n ≥ 0. Furthermore, the constant h + k - gcd(h, k) is best possible. In this paper we consider some variations on this theorem. In particular, we study the case where f_n ≤ g_n instead of f_n = g_n. We also obtain a generalization to more than two periods.

research product

Forbidden Factors and Fragment Assembly

In this paper, methods and results related to the notion of minimal forbidden words are applied to the fragment assembly problem. The fragment assembly problem can be formulated, in its simplest form, as follows: reconstruct a word w from a given set I of substrings (fragments) of w. We introduce a hypothesis involving the set of fragments I and the maximal length m(w) of the minimal forbidden factors of w. This hypothesis allows us to reconstruct the word w uniquely from the set I in linear time. We also prove that, if w is a word randomly generated by a memoryless source with identical symbol probabilities, m(w) is logarithmic with respect to the size of w. This result shows th…

research product

Languages with mismatches

In this paper we study some combinatorial properties of a class of languages that represent sets of words occurring in a text S up to some errors. More precisely, we consider sets of words that occur in a text S with k mismatches in any window of size r. The study of this class of languages focuses mainly on a parameter, called the repetition index, and on the set of the minimal forbidden words of the language of factors of S with errors. The repetition index of a string S is defined as the smallest integer such that all strings of this length occur at most in a unique position of the text S up to errors. We prove that there is a strong relation between the repetition index of S an…

research product

A Note on a Conjecture of Duval and Sturmian Words

We prove a long-standing conjecture of Duval in the special case of Sturmian words. Let U be a nonempty word on a finite alphabet A. A nonempty word B different from U is called a border of U if B is both a prefix and a suffix of U. We say U is bordered if U admits a border; otherwise U is said to be unbordered. For example, U = 011001011 is bordered by the factor 011, while 00010001001 is unbordered. An integer 1 ≤ k ≤ n is a period of a word U = U_1 ⋯ U_n if and only if for all 1 ≤ i ≤ n − k we have U_i = U_{i+k}. It is easy to see that k is a period of U if and only if the prefix B of U of length n − k is a border of U or is empty. Let (U) denote the smallest period …
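The definitions of border and period above translate directly into code. The following sketch uses hypothetical helper names `borders` and `periods`, works on 0-indexed Python strings, and checks the two example words from the abstract as well as the stated link between periods and borders.

```python
def borders(u):
    """Nonempty proper prefixes of u that are also suffixes of u."""
    return [u[:l] for l in range(1, len(u)) if u[:l] == u[-l:]]


def periods(u):
    """All k in 1..len(u) with u[i] == u[i + k] wherever both are defined."""
    n = len(u)
    return [k for k in range(1, n + 1)
            if all(u[i] == u[i + k] for i in range(n - k))]


if __name__ == "__main__":
    u = "011001011"
    print(borders(u))               # contains "011"
    print(borders("00010001001"))   # empty: the word is unbordered
    # k is a period iff the prefix of length n - k is a border or is empty.
    print(set(periods(u)) == {len(u) - len(b) for b in borders(u)} | {len(u)})
```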

research product

Dictionary-symbolwise flexible parsing

Linear-time optimal parsing algorithms are rare in the dictionary-based branch of data compression theory. A recent result is the Flexible Parsing algorithm of Matias and Sahinalp (1999) that works when the dictionary is prefix closed and the encoding of dictionary pointers has a constant cost. We present the Dictionary-Symbolwise Flexible Parsing algorithm that is optimal for prefix-closed dictionaries and any symbolwise compressor under some natural hypothesis. In the case of LZ78-like algorithms with variable costs and any, linear as usual, symbolwise compressor we show how to implement our parsing algorithm in linear time. In the case of LZ77-like dictionaries and any symbol…

research product

A NEW COMPLEXITY FUNCTION FOR WORDS BASED ON PERIODICITY

Motivated by the extension of the critical factorization theorem to infinite words, we study the (local) periodicity function, i.e. the function that, for any position in a word, gives the size of the shortest square centered in that position. We prove that this function characterizes any binary word up to exchange of letters. We then introduce a new complexity function for words (the periodicity complexity) that, for any position in the word, gives the average value of the periodicity function up to that position. The new complexity function is independent from the other commonly used complexity measures as, for instance, the factor complexity. Indeed, whereas any infinite word with bound…

research product

Minimal forbidden words and factor automata

Let L(M) be the (factorial) language avoiding a given antifactorial language M. We design an automaton accepting L(M) and built from the language M. The construction is effective if M is finite. If M is the set of minimal forbidden words of a single word v, the automaton turns out to be the factor automaton of v (the minimal automaton accepting the set of factors of v). We also give an algorithm that builds the trie of M from the factor automaton of a single word. It yields a non-trivial upper bound on the number of minimal forbidden words of a word.

research product

Indexing structures for approximate string matching

In this paper we give the first, to our knowledge, structures and corresponding algorithms for approximate indexing, by considering the Hamming distance, having the following properties. i) Their size is linear times a polylog of the size of the text on average. ii) For each pattern x, the time spent by our algorithms for finding the list occ(x) of all occurrences of a pattern x in the text, up to a certain distance, is proportional on average to |x| + |occ(x)|, under an additional but realistic hypothesis.

research product

A trie-based approach for compacting automata

We describe a new technique for reducing the number of nodes and symbols in automata based on tries. The technique stems from some results on anti-dictionaries for data compression and, unlike other methods based on compact automata, does not need to retain the input string. The net effect is a lighter automaton than the directed acyclic word graph (DAWG) of Blumer et al., as it uses fewer nodes while still having arcs labeled by single characters.

research product

If a DOL language is k-power free then it is circular

We prove that if a DOL language is k-power free then it is circular. Using this result, we give an algorithm which decides, for a fixed integer k ≥ 1, whether a DOL language is k-power free; we also give a new, simpler proof of a result previously obtained by Ehrenfeucht and Rozenberg, which states that it is decidable whether a DOL language is k-power free for some integer k ≥ 1.

research product

The Expressibility of Languages and Relations by Word Equations

Classically, several properties and relations of words, such as being a power of the same word, can be expressed by using word equations. This paper is devoted to a general study of the expressive power of word equations. As main results we prove theorems which allow us to show that certain properties of words are not expressible as components of solutions of word equations. In particular, primitiveness and equal length are such properties, as is being any word over a proper subalphabet.

research product

Abelian-Square-Rich Words

An abelian square is the concatenation of two words that are anagrams of one another. A word of length $n$ can contain at most $\Theta(n^2)$ distinct factors, and there exist words of length $n$ containing $\Theta(n^2)$ distinct abelian-square factors, that is, distinct factors that are abelian squares. This motivates us to study infinite words such that the number of distinct abelian-square factors of length $n$ grows quadratically with $n$. More precisely, we say that an infinite word $w$ is abelian-square-rich if, for every $n$, every factor of $w$ of length $n$ contains, on average, a number of distinct abelian-square factors that is quadratic in $n$; and uniformly abelian-sq…
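The definition of an abelian square is easy to test directly. The sketch below, with hypothetical names `is_abelian_square` and `distinct_abelian_squares`, enumerates the distinct abelian-square factors of a finite word by brute force; it is meant only to make the definition concrete, not to be efficient.

```python
from collections import Counter


def is_abelian_square(w):
    """True if w = xy with x and y nonempty anagrams of one another."""
    half, rem = divmod(len(w), 2)
    return rem == 0 and half > 0 and Counter(w[:half]) == Counter(w[half:])


def distinct_abelian_squares(w):
    """Set of distinct factors of w that are abelian squares."""
    return {w[i:j]
            for i in range(len(w))
            for j in range(i + 2, len(w) + 1, 2)   # even lengths only
            if is_abelian_square(w[i:j])}


if __name__ == "__main__":
    print(sorted(distinct_abelian_squares("abba")))
```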

research product

STURMIAN WORDS AND AMBIGUOUS CONTEXT-FREE LANGUAGES

If x is a rational number, 0 < x ≤ 1, then A(x)^c is a context-free language, where A(x) is the set of factors of the infinite Sturmian words with asymptotic density of 1's smaller than or equal to x. We also prove a "gap" theorem, i.e. A(x) can never be an unambiguous co-context-free language. The "gap" theorem is established by proving that the counting generating function of A(x) is transcendental. We show some links between Sturmian words, combinatorics and number theory.

research product

On lazy representations and Sturmian graphs

In this paper we establish a strong relationship between the set of lazy representations and the set of paths in a Sturmian graph associated with a real number α. We prove that for any non-negative integer i the unique path weighted i in the Sturmian graph associated with α represents the lazy representation of i in the Ostrowski numeration system associated with α. Moreover, we provide several properties of the representations of the natural integers in this numeration system.

research product

Fragment assembly through minimal forbidden words

research product

On the number of factors of Sturmian words

We prove that for $m \geq 1$, $\operatorname{card}(A_m) = 1 + \sum_{i=1}^{m} (m - i + 1)\,\varphi(i)$, where $A_m$ is the set of factors of length $m$ of all the Sturmian words and $\varphi$ is the Euler function. This result was conjectured by Dulucq and Gouyou-Beauchamps (1987), who proved that this result implies that the language $(\bigcup_{m \geq 0} A_m)^c$ is inherently ambiguous. We also give a combinatorial version of the Riemann hypothesis.
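The counting formula, card(A_m) = 1 + sum over i from 1 to m of (m - i + 1) times phi(i), is simple to evaluate. The sketch below uses hypothetical names `euler_phi` and `sturmian_factor_count` and a naive totient computation, which is adequate for small m.

```python
from math import gcd


def euler_phi(i):
    """Euler's totient: number of 1 <= j <= i with gcd(i, j) == 1."""
    return sum(1 for j in range(1, i + 1) if gcd(i, j) == 1)


def sturmian_factor_count(m):
    """Evaluate 1 + sum_{i=1}^{m} (m - i + 1) * phi(i)."""
    return 1 + sum((m - i + 1) * euler_phi(i) for i in range(1, m + 1))


if __name__ == "__main__":
    print([sturmian_factor_count(m) for m in range(1, 6)])  # [2, 4, 8, 14, 24]
```

For m = 3 the formula gives 8, so all 2^3 binary words of length 3 occur as factors of some Sturmian word; for m = 4 it gives 14, so two of the 16 binary words are excluded.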

research product

From Nerode's congruence to Suffix Automata with mismatches

In this paper we focus on the minimal deterministic finite automaton S_k that recognizes the set of suffixes of a word w up to k errors. As a first result we give a characterization of Nerode's right-invariant congruence that is associated with S_k. This result generalizes the classical characterization described in [A. Blumer, J. Blumer, D. Haussler, A. Ehrenfeucht, M. Chen, J. Seiferas, The smallest automaton recognizing the subwords of a text, Theoretical Computer Science, 40, 1985, 31–55]. As a second result we present an algorithm that makes use of S_k to accept in an efficient way the language of all suffixes of w up to k errors in every window of size r of a text, where r is the…

research product

Linear-size suffix tries

Suffix trees are highly regarded data structures for text indexing and string algorithms [McCreight 76, Weiner 73]. For any given string w of length n = |w|, a suffix tree for w takes O(n) nodes and links. It is often presented as a compacted version of a suffix trie for w, where the latter is the trie (or digital search tree) built on the suffixes of w. Here the compaction process replaces each maximal chain of unary nodes with a single arc. For this, the suffix tree requires that the labels of its arcs are substrings encoded as pointers to w (or equivalent information). On the contrary, the arcs of the suffix trie are labeled by single symbols but there can be Θ(n^2) nodes and lin…

research product

On Sturmian Graphs

In this paper we define Sturmian graphs and we prove that all of them have a certain "counting" property. We show deep connections between this counting property and two conjectures, by Moser and by Zaremba, on the continued fraction expansion of real numbers. These graphs turn out to be the underlying graphs of compact directed acyclic word graphs of central Sturmian words. In order to prove this result, we give a characterization of the maximal repeats of central Sturmian words. We show also that, in analogy with the case of Sturmian words, these graphs converge to infinite ones.

research product

Generalizations of the periodicity Theorem of Fine and Wilf

We provide three generalizations to the two-dimensional case of the well known periodicity theorem by Fine and Wilf [4] for strings (the one-dimensional case). The first and the second generalizations can be further extended to hold in the more general setting of Cayley graphs of groups. Weak forms of two of our results have been developed for the design of efficient algorithms for two-dimensional pattern matching [2, 3, 6].

research product

Sturmian graphs and integer representations over numeration systems

In this paper we consider a numeration system, originally due to Ostrowski, based on the continued fraction expansion of a real number α. We prove that this system has deep connections with the Sturmian graph associated with α. We provide several properties of the representations of the natural integers in this system. In particular, we prove that the set of lazy representations of the natural integers in this numeration system is regular if and only if the continued fraction expansion of α is eventually periodic. The main result of the paper is that for any number i the unique path weighted i in the Sturmian graph associated with α represents the lazy representation of i in the Ost…

research product

On the suffix automaton with mismatches

In this paper we focus on the construction of the minimal deterministic finite automaton S_k that recognizes the set of suffixes of a word w up to k errors. We present an algorithm that makes use of S_k in order to accept in an efficient way the language of all suffixes of w up to k errors in every window of size r, where r is the value of the repetition index of w. Moreover, we give some experimental results on some well-known words, like prefixes of Fibonacci and Thue-Morse words, and we make a conjecture on the size of the suffix automaton with mismatches.

research product

Forbidden words in symbolic dynamics

We introduce an equivalence relation ≃ between functions from N to N. By describing a symbolic dynamical system in terms of forbidden words, we prove that the ≃-equivalence class of the function that counts the minimal forbidden words of a system is a topological invariant of the system. We show that the new invariant is independent from previous ones, but it is not characteristic. In the case of sofic systems, we prove that the ≃-equivalence of the corresponding functions is a decidable question. As a more special application, we show, by using the new invariant, that two systems associated to Sturmian words having "different slope" are not conjugate.

research product

Words and forbidden factors

Given a finite or infinite word v, we consider the set M(v) of minimal forbidden factors of v. We show that the set M(v) is of fundamental importance in determining the structure of the word v. In the case of a finite word w we consider two parameters that are related to the size of M(w): the first counts the minimal forbidden factors of w and the second gives the length of the longest minimal forbidden factor of w. We derive sharp upper and lower bounds for both parameters. We prove also that the second parameter is related to the minimal period of the word w. We are further interested in the algorithmic point of view. Indeed, we design linear-time algorithms for the following two p…

research product

On a Conjecture on Bidimensional Words

We prove that, given a double sequence w over the alphabet A (i.e. a mapping from Z^2 to A), if there exists a pair (n_0, m_0) ∈ Z^2 such that p_w(n_0, m_0) < n_0 m_0 / 100, then w has a periodicity vector, where p_w is the complexity function in rectangles of w.

research product

Abelian Powers and Repetitions in Sturmian Words

Richomme, Saari and Zamboni (J. Lond. Math. Soc. 83: 79-95, 2011) proved that at every position of a Sturmian word starts an abelian power of exponent $k$ for every $k > 0$. We improve on this result by studying the maximum exponents of abelian powers and abelian repetitions (an abelian repetition is an analogue of a fractional power) in Sturmian words. We give a formula for computing the maximum exponent of an abelian power of abelian period $m$ starting at a given position in any Sturmian word of rotation angle $\alpha$. As an analogue of the critical exponent, we introduce the abelian critical exponent $A(s_\alpha)$ of a Sturmian word $s_\alpha$ of angle $\alpha$ as the quantity $A(s_\a…

research product

Sturmian Graphs and a conjecture of Moser

In this paper we define Sturmian graphs and we prove that all of them have a “counting” property. We show deep connections between this counting property and two conjectures, by Moser and by Zaremba, on the continued fraction expansion of real numbers. These graphs turn out to be the underlying graphs of CDAWGs of central Sturmian words. We show also that, analogously to the case of Sturmian words, these graphs converge to infinite ones.

research product

On Fine and Wilf's theorem for bidimensional words

Generalizations of Fine and Wilf's Periodicity Theorem are obtained for the case of bidimensional words using geometric arguments. The domains considered constitute a large class of convex subsets of R^2 which include most parallelograms. A complete discussion is provided for the parallelogram case.

research product

Automated Synthesis of Application-layer Connectors from Automata-based Specifications

Ubiquitous and Pervasive Computing, and the Internet of Things, promote dynamic interaction among heterogeneous systems. To achieve this vision, interoperability among heterogeneous systems represents a key enabler, and mediators are often built to solve protocol mismatches. Many approaches propose the synthesis of mediators. Unfortunately, a rigorous characterization of the concept of interoperability is still lacking, making it hard to assess their applicability and soundness. In this paper, we provide a framework for the synthesis of mediators that allows us to: (i) characterize the conditions for the mediator existence and correctness; and (ii) establish the applicability bo…

research product

A multidimensional critical factorization theorem

The Critical Factorization Theorem is one of the principal results in combinatorics on words. It relates local periodicities of a word to its global periodicity. In this paper we give a multidimensional extension of it. More precisely, we give a new proof of the Critical Factorization Theorem, but in a weak form, where the weakness is due to the fact that we lose the tightness of the local repetition order. In exchange, we gain the possibility of extending our proof to the multidimensional case. Indeed, this new proof makes use of the Theorem of Fine and Wilf, which has several classical generalizations to the multidimensional case.

research product

Languages with mismatches and an application to approximate indexing

In this paper we describe a factorial language, denoted by L(S, k, r), that contains all words that occur in a string S up to k mismatches every r symbols. Then we give some combinatorial properties of a parameter, called repetition index and denoted by R(S, k, r), defined as the smallest integer h ≥ 1 such that all strings of this length occur at most in a unique position of the text S up to k mismatches every r symbols. We prove that R(S, k, r) is a non-increasing function of r and a non-decreasing function of k and that the equation r = R(S, k, r) admits a unique solution. The repetition index plays an important role in the construction of an indexing data structure based on a trie that rep…

research product

Word assembly through minimal forbidden words

We give a linear-time algorithm to reconstruct a finite word w over a finite alphabet A of constant size starting from a finite set of factors of w verifying a suitable hypothesis. We use combinatorial techniques based on the minimal forbidden words, which have been introduced in previous papers. This improves a previous algorithm which worked under a stronger hypothesis.

research product

Minimal forbidden words and symbolic dynamics

We introduce a new complexity measure of a factorial formal language L: the growth rate of the set of minimal forbidden words. We prove some combinatorial properties of minimal forbidden words. As a main result we prove that the growth rate of the set of minimal forbidden words for L is a topological invariant of the dynamical system defined by L.

research product

Fine and Wilf's Theorem for Three periods and a Generalization of Sturmian Words

We extend the theorem of Fine and Wilf to words having three periods. We then define the set 3-PER of words of maximal length for which such a result does not apply. We prove that the set 3-PER and the sequences of complexity 2n + 1, introduced by Arnoux and Rauzy to generalize Sturmian words, have the same set of factors.

research product

On the longest common factor problem

The Longest Common Factor (LCF) of a set of strings is a well-studied problem with a wide range of applications in Bioinformatics, from microarrays to DNA sequence analysis. This problem has been solved by Hui (2000), who uses a famous constant-time solution to the Lowest Common Ancestor (LCA) problem in trees coupled with suffix trees. A data structure for the LCA problem, although linear in space and construction time, introduces a multiplicative constant in both space and time that narrows its applicability in many biological settings. In this article we present a new method for solving the LCF problem using the suffix tree structure with an auxiliary array that take…
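To make the problem statement concrete, here is a naive baseline for the two-string case. It is a standard quadratic dynamic program, not the suffix-tree method of the abstract and not Hui's LCA-based solution; the function name `longest_common_factor` is hypothetical.

```python
def longest_common_factor(s, t):
    """Longest string occurring as a contiguous factor of both s and t.

    Quadratic-time dynamic programming; ties are resolved in favor of the
    first maximal match found while scanning s from left to right.
    """
    best_len, best_end = 0, 0
    prev = [0] * (len(t) + 1)
    for i in range(1, len(s) + 1):
        cur = [0] * (len(t) + 1)
        for j in range(1, len(t) + 1):
            if s[i - 1] == t[j - 1]:
                # Extend the common factor ending at s[i-2], t[j-2].
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return s[best_end - best_len:best_end]


if __name__ == "__main__":
    print(longest_common_factor("GATTACA", "TTACCA"))
```

The table-based approach uses time proportional to the product of the two lengths, which is the kind of cost that suffix-tree methods are designed to avoid on long biological sequences.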

research product

Forbidden Factors and Fragment Assembly

In this paper we approach the fragment assembly problem by using the notion of minimal forbidden factors introduced in a previous paper. Denoting by M(w) the set of minimal forbidden factors of a word w, we first focus on the evaluation of the size of elements in M(w) and on designing an algorithm to recover the word w from M(w). We prove that for a word w randomly generated by a memoryless source with identical symbol probabilities, the maximal length m(w) of words in M(w) is logarithmic and that the reconstruction algorithm runs in linear time. These results have an interesting application to the fragment assembly problem, i.e. reconstruct a word w from a given set I of substrin…

research product

Abelian Repetitions in Sturmian Words

We investigate abelian repetitions in Sturmian words. We exploit a bijection between factors of Sturmian words and subintervals of the unitary segment that allows us to study the periods of abelian repetitions by using classical results of elementary Number Theory. We prove that in any Sturmian word the superior limit of the ratio between the maximal exponent of an abelian repetition of period $m$ and $m$ is a number $\geq\sqrt{5}$, and the equality holds for the Fibonacci infinite word. We further prove that the longest prefix of the Fibonacci infinite word that is an abelian repetition of period $F_j$, $j>1$, has length $F_j(F_{j+1}+F_{j-1} +1)-2$ if $j$ is even or $F_j(F_{j+1}+F_{j-1}…

research product

Characteristic Sturmian words are extremal for the Critical Factorization Theorem

We prove that characteristic Sturmian words are extremal for the Critical Factorization Theorem (CFT) in the following sense. If p_x(n) denotes the local period of an infinite word x at point n, we prove that x is a characteristic Sturmian word if and only if p_x(n) is smaller than or equal to n + 1 for all n ≥ 1 and it is equal to n + 1 for infinitely many integers n. This result is extremal with respect to the CFT since a consequence of the CFT is that, for any infinite recurrent word x, either the function p_x is bounded, and in such a case x is periodic, or p_x(n) ≥ n + 1 for infinitely many integers n. As a byproduct of the techniques used in the paper we extend a r…

research product

On the number of Arnoux–Rauzy words

research product

On numeration systems and Sturmian graphs

research product

Minimal forbidden patterns of multi-dimensional shifts

We study whether the entropy (or growth rate) of minimal forbidden patterns of symbolic dynamical shifts of dimension 2 or more, is a conjugacy invariant. We prove that the entropy of minimal forbidden patterns is a conjugacy invariant for uniformly semi-strongly irreducible shifts. We prove a weaker invariant in the general case.

research product

Words with the Maximum Number of Abelian Squares

An abelian square is the concatenation of two words that are anagrams of one another. A word of length $n$ can contain $\Theta(n^2)$ distinct factors that are abelian squares. We study infinite words such that the number of abelian square factors of length $n$ grows quadratically with $n$.

research product

Approximate string matching: indexing and the k-mismatch problem

research product