0000000000012786
AUTHOR
Alessandra Gabriele
Languages with mismatches
AbstractIn this paper we study some combinatorial properties of a class of languages that represent sets of words occurring in a text S up to some errors. More precisely, we consider sets of words that occur in a text S with k mismatches in any window of size r. The study of this class of languages mainly focuses both on a parameter, called repetition index, and on the set of the minimal forbidden words of the language of factors of S with errors. The repetition index of a string S is defined as the smallest integer such that all strings of this length occur at most in a unique position of the text S up to errors. We prove that there is a strong relation between the repetition index of S an…
"Indexing structures for approximate string matching
In this paper we give the first, to our knowledge, structures and corresponding algorithms for approximate indexing, by considering the Hamming distance, having the following properties. i) Their size is linear times a polylog of the size of the text on average. ii) For each pattern x, the time spent by our algorithms for finding the list occ(x) of all occurrences of a pattern x in the text, up to a certain distance, is proportional on average to |x| + |occ(x)|, under an additional but realistic hypothesis.
On lazy representations and Sturmian graphs
In this paper we establish a strong relationship between the set of lazy representations and the set of paths in a Sturmian graph associated with a real number α. We prove that for any non-negative integer i the unique path weighted i in the Sturmian graph associated with α represents the lazy representation of i in the Ostrowski numeration system associated with α. Moreover, we provide several properties of the representations of the natural integers in this numeration system.
Functional Information, Biomolecular Messages and Complexity of BioSequences and Structures
In the quest for a mathematical measure able to capture and shed light on the dual notions of information and complexity in biosequences, Hazen et al. have introduced the notion of Functional Information (FI for short). It is also the result of earlier considerations and findings by Szostak and Carothers et al. Based on the experiments by Charoters et al., regarding FI in RNA binding activities, we decided to study the relation existing between FI and classic measures of complexity applied on protein-DNA interactions on a genome-wide scale. Using classic complexity measures, i.e, Shannon entropy and Kolmogorov Complexity as both estimated by data compression, we found that FI applied to pro…
From Nerode's congruence to Suffix Automata with mismatches
AbstractIn this paper we focus on the minimal deterministic finite automaton Sk that recognizes the set of suffixes of a word w up to k errors. As first result we give a characterization of the Nerode’s right-invariant congruence that is associated with Sk. This result generalizes the classical characterization described in [A. Blumer, J. Blumer, D. Haussler, A. Ehrenfeucht, M. Chen, J. Seiferas, The smallest automaton recognizing the subwords of a text, Theoretical Computer Science, 40, 1985, 31–55]. As second result we present an algorithm that makes use of Sk to accept in an efficient way the language of all suffixes of w up to k errors in every window of size r of a text, where r is the…
Sturmian graphs and integer representations over numeration systems
AbstractIn this paper we consider a numeration system, originally due to Ostrowski, based on the continued fraction expansion of a real number α. We prove that this system has deep connections with the Sturmian graph associated with α. We provide several properties of the representations of the natural integers in this system. In particular, we prove that the set of lazy representations of the natural integers in this numeration system is regular if and only if the continued fraction expansion of α is eventually periodic. The main result of the paper is that for any number i the unique path weighted i in the Sturmian graph associated with α represents the lazy representation of i in the Ost…
On the suffix automaton with mismatches
International audience; In this paper we focus on the construction of the minimal deterministic finite automaton S_k that recognizes the set of suffixes of a word w up to k errors. We present an algorithm that makes use of S_k in order to accept in an efficient way the language of all suffixes of w up to k errors in every window of size r, where r is the value of the repetition index of w. Moreover, we give some experimental results on some well-known words, like prefixes of Fibonacci and Thue-Morse words, and we make a conjecture on the size of the suffix automaton with mismatches.
Novel Combinatorial and Information-Theoretic Alignment-Free Distances for Biological Data Mining
Among the plethora of alignment-free methods for comparing biological sequences, there are some that we have perceived as representative of the novel techniques that have been devised in the past few years and as being of a fundamental nature and of broad interest and applicability, ranging from combinatorics to information theory. In this chapter, we review these alignment free methods, by presenting both their mathematical definitions and the experiments in which they are involved in.
Languages with mismatches and an application to approximate indexing
In this paper we describe a factorial language, denoted by L(S, k,r), that contains all words that occur in a string 5 up to k mismatches every r symbols. Then we give some combinatorial properties of a parameter, called repetition index and denoted by R(S,k,r), defined as the smallest integer h ? 1 such that all strings of this length occur at most in a unique position of the text S up to k mismatches every r symbols. We prove that R(S, k, r) is a non-increasing function of r and a non-decreasing function of k and that the equation r = R(S, k, r) admits a unique solution. The repetition index plays an important role in the construction of an indexing data structure based on a trie that rep…
On the longest common factor problem
The Longest Common Factor (LCF) of a set of strings is a well studied problem having a wide range of applications in Bioinformatics: from microarrays to DNA sequences analysis. This problem has been solved by Hui (2000) who uses a famous constant-time solution to the Lowest Common Ancestor (LCA) problem in trees coupled with use of suffix trees. A data structure for the LCA problem, although linear in space and construction time, introduces a multiplicative constant in both space and time that reduces the range of applications in many biological applications. In this article we present a new method for solving the LCF problem using the suffix tree structure with an auxiliary array that take…
Genome-wide characterization of chromatin binding and nucleosome spacing activity of the nucleosome remodelling ATPase ISWI
The evolutionarily conserved ATP-dependent nucleosome remodelling factor ISWI can space nucleosomes affecting a variety of nuclear processes. In Drosophila, loss of ISWI leads to global transcriptional defects and to dramatic alterations in higher-order chromatin structure, especially on the male X chromosome. In order to understand if chromatin condensation and gene expression defects, observed in ISWI mutants, are directly correlated with ISWI nucleosome spacing activity, we conducted a genome-wide survey of ISWI binding and nucleosome positioning in wild-type and ISWI mutant chromatin. Our analysis revealed that ISWI binds both genic and intergenic regions. Remarkably, we found that ISWI…