0000000000675166

AUTHOR

Sabrina Mantaci

Burrows–Wheeler transform and Sturmian words

research product

Isometric Words Based on Swap and Mismatch Distance

An edit distance is a metric between words that quantifies how two words differ by counting the number of edit operations needed to transform one word into the other one. A word f is said isometric with respect to an edit distance if, for any pair of f-free words u and v, there exists a transformation of minimal length from u to v via the related edit operations such that all the intermediate words are also f-free. The adjective 'isometric' comes from the fact that, if the Hamming distance is considered (i.e., only mismatches), then isometric words are connected with definitions of isometric subgraphs of hypercubes. We consider the case of edit distance with swap and mismatch. We compare it…

research product

Comparing Sequences by the Burrows-Wheeler Transform

research product

A New Combinatorial Approach to Sequence Comparison

In this paper we introduce a new alignment-free method for comparing sequences which is combinatorial by nature and does not use any compressor nor any information-theoretic notion. Such a method is based on an extension of the Burrows-Wheeler Transform, a transformation widely used in the context of Data Compression. The new extended transformation takes as input a multiset of sequences and produces as output a string obtained by a suitable rearrangement of the characters of all the input sequences. By using such a transformation we give a general method for comparing sequences that takes into account how much the characters coming from the different input sequences are mixed in the output…

research product

Transducers for the bidirectional decoding of prefix codes

AbstractWe construct a transducer for the bidirectional decoding of words encoded by the method introduced by Girod (1999) in [5] and we prove that it is bideterministic and that it can be used both for the left-to-right and the right-to-left decoding.We also give a similar construction for a transducer that decodes in both directions words encoded by a generalization of Girod’s encoding method. We prove that it has the same properties as those of the previous transducer. In addition we show that it has a single initial/final state and that it is minimal.

research product

An algorithm for the solution of tree equations

We consider the problem of solving equations over k-ary trees. Here an equation is a pair of labeled α-ary trees, where α is a function associating an arity to each label. A solution to an equation is a morphism from α-ary trees to k-ary trees that maps the left and right hand side of the equation to the same k-ary tree.

research product

Suffixes, Conjugates and Lyndon Words

In this paper we are interested in the study of the combinatorial aspects connecting three important constructions in the field of string algorithms: the suffix array, the Burrows-Wheeler transform (BWT) and the extended Burrows-Wheeler transform (EBWT). Such constructions involve the notions of suffixes and conjugates of words and are based on two different order relations, denoted by $\plex$ and $\pom$, that, even if strictly connected, are quite different from the computational point of view. In this study an important role is played by Lyndon words. In particular, we improve the upper bound on the number of symbol comparisons needed to establish the $\pom$ order between two primitive wo…

research product

Distance measures for biological sequences: Some recent approaches

AbstractSequence comparison has become a very essential tool in modern molecular biology. In fact, in biomolecular sequences high similarity usually implies significant functional or structural similarity. Traditional approaches use techniques that are based on sequence alignment able to measure character level differences. However, the recent developments of whole genome sequencing technology give rise to need of similarity measures able to capture the rearrangements involving large segments contained in the sequences. This paper is devoted to illustrate different methods recently introduced for the alignment-free comparison of biological sequences. Goal of the paper is both to highlight t…

research product

DEFECT THEOREMS FOR TREES

We generalize different notions of a rank of a set of words to sets of trees. We prove that almost all of those ranks can be used to formulate a defect theorem. However, as we show, the prefix rank forms an exception.

research product

Preface

This special issue of Mathematical Structures in Computer Science is devoted to the fourteenth Italian Conference on Theoretical Computer Science (ICTCS) held at University of Palermo, Italy, from 9th to 11th September 2013. ICTCS is the conference of the Italian Chapter of the European Association for Theoretical Computer Science and covers a wide spectrum of topics in Theoretical Computer Science, ranging from computational complexity to logic, from algorithms and data structure to programming languages, from combinatorics on words to distributed computing. For this reason, the contributions here included come from very different areas of Theoretical Computer Science. In fact this special…

research product

BALANCE PROPERTIES AND DISTRIBUTION OF SQUARES IN CIRCULAR WORDS

We study balance properties of circular words over alphabets of size greater than two. We give some new characterizations of balanced words connected to the Kawasaki-Ising model and to the notion of derivative of a word. Moreover we consider two different generalizations of the notion of balance, and we find some relations between them. Some of our results can be generalized to non periodic infinite words as well.

research product

Some Investigations on Similarity Measures Based on Absent Words

In this paper we investigate similarity measures based on minimal absent words, introduced by Chairungsee and Crochemore in [1]. They make use of a length-weighted index on a sample set corresponding to the symmetric difference M(x)ΔM(y) of the minimal absent words M(x) and M(y) of two sequences x and y, respectively. We first propose a variant of this measure by choosing as a sample set a proper subset (x, y) of M(x)ΔM(y), which appears to be more appropriate for distinguishing x and y. From the algebraic point of view, we prove that (x, y) is the base of the ideal generated by M(x)ΔM(y). We then remark that such measures are able to recognize whether the sequences x and y share a common s…

research product

Formal Languages and Automata: Models, Methods and Application. In Honour of the 70th Birthday of Antonio Restivo Preface

research product

An extension of the Burrows-Wheeler Transform to k words

research product

A Generalization of Girod’s Bidirectional Decoding Method to Codes with a Finite Deciphering Delay

In this paper we generalize an encoding method due to Girod (cf. [6]) using prefix codes, that allows a bidirectional decoding of the encoded messages. In particular we generalize it to any finite alphabet A, to any operation defined on A, to any code with finite deciphering delay and to any key x ∈ A+ , on a length depending on the deciphering delay. We moreover define, as in [4], a deterministic transducer for such generalized method. We prove that, fixed a code X ∈ A* with finite deciphering delay and a key x ∈ A *, the transducers associated to different operations are isomorphic as unlabelled graphs. We also prove that, for a fixed code X with finite deciphering delay, transducers asso…

research product

An extension of the Burrows-Wheeler Transform

AbstractWe describe and highlight a generalization of the Burrows–Wheeler Transform (bwt) to a multiset of words. The extended transformation, denoted by ebwt, is reversible. Moreover, it allows to define a bijection between the words over a finite alphabet A and the finite multisets of conjugacy classes of primitive words in A∗. Besides its mathematical interest, the extended transform can be useful for applications in the context of string processing. In the last part of this paper we illustrate one such application, providing a similarity measure between sequences based on ebwt.

research product

The Burrows-Wheeler Transform: from data compression to combinatorics on words

research product

Suffix array and Lyndon factorization of a text

Abstract The main goal of this paper is to highlight the relationship between the suffix array of a text and its Lyndon factorization. It is proved in [15] that one can obtain the Lyndon factorization of a text from its suffix array. Conversely, here we show a new method for constructing the suffix array of a text that takes advantage of its Lyndon factorization. The surprising consequence of our results is that, in order to construct the suffix array, the local suffixes inside each Lyndon factor can be separately processed, allowing different implementative scenarios, such as online, external and internal memory, or parallel implementations. Based on our results, the algorithm that we prop…

research product

Inducing the Lyndon Array

In this paper we propose a variant of the induced suffix sorting algorithm by Nong (TOIS, 2013) that computes simultaneously the Lyndon array and the suffix array of a text in $O(n)$ time using $\sigma + O(1)$ words of working space, where $n$ is the length of the text and $\sigma$ is the alphabet size. Our result improves the previous best space requirement for linear time computation of the Lyndon array. In fact, all the known linear algorithms for Lyndon array computation use suffix sorting as a preprocessing step and use $O(n)$ words of working space in addition to the Lyndon array and suffix array. Experimental results with real and synthetic datasets show that our algorithm is not onl…

research product

A Generalization of Girod's Bidirectional Decoding Method to Codes with a Finite Deciphering Delay

Girod’s encoding method has been introduced in order to efficiently decode from both directions messages encoded by using finite prefix codes. In the present paper, we generalize this method to finite codes with a finite deciphering delay. In particular, we show that our decoding algorithm can be realized by a deterministic finite transducer. We also investigate some properties of the underlying unlabeled graph.

research product

On fixed points of the Burrows-Wheeler transform

The Burrows-Wheeler Transform is a well known transformation widely used in Data Compression: important competitive compression software, such as Bzip (cf. [1]) and Szip (cf. [2]) and some indexing software, like the FM-index (cf. [3]), are deeply based on the Burrows Wheeler Transform. The main advantage of using BWT for data compression consists in its feature of "clustering" together equal characters. In this paper we show the existence of fixed points of BWT, i.e., words on which BWT has no effect. We show a characterization of the permutations associated to BWT of fixed points and we give the explicit form of fixed points on a binary ordered alphabet a, b having at most four b's and th…

research product

A new sequence distance measure based on the Burrows-Wheeler Transform

research product

Burrows-Wheeler transform and Run-Length Enconding

In this paper we study the clustering effect of the Burrows-Wheeler Transform (BWT) from a combinatorial viewpoint. In particular, given a word w we define the BWT-clustering ratio of w as the ratio between the number of clusters produced by BWT and the number of the clusters of w. The number of clusters of a word is measured by its Run-Length Encoding. We show that the BWT-clustering ratio ranges in ]0, 2]. Moreover, given a rational number \(r\,\in \,]0,2]\), it is possible to find infinitely many words having BWT-clustering ratio equal to r. Finally, we show how the words can be classified according to their BWT-clustering ratio. The behavior of such a parameter is studied for very well-…

research product

SORTING CONJUGATES AND SUFFIXES OF WORDS IN A MULTISET

In this paper we are interested in the study of the combinatorial aspects related to the extension of the Burrows-Wheeler transform to a multiset of words. Such study involves the notion of suffixes and conjugates of words and is based on two different order relations, denoted by <lex and ≺ω, that, even if strictly connected, are quite different from the computational point of view. In particular, we introduce a method that only uses the <lex sorting among suffixes of a multiset of words in order to sort their conjugates according to ≺ω-order. In this study an important role is played by Lyndon words. This strategy could be used in applications specially in the field of Bioinformatic…

research product

The Burrows-Wheeler Transform: a new tool in Combinatorics on Words

research product

On the decomposition of prefix codes

Abstract In this paper we focus on the decomposition of rational and maximal prefix codes. We present an effective procedure that allows us to decide whether such a code is decomposable. In this case, the procedure also produces the factors of some of its decompositions. We also give partial results on the problem of deciding whether a rational maximal prefix code decomposes over a finite prefix code.

research product

A combinatorial view on string attractors

Abstract The notion of string attractor has recently been introduced in [Prezza, 2017] and studied in [Kempa and Prezza, 2018] to provide a unifying framework for known dictionary-based compressors. A string attractor for a word w = w 1 w 2 ⋯ w n is a subset Γ of the positions { 1 , … , n } , such that all distinct factors of w have an occurrence crossing at least one of the elements of Γ. In this paper we explore the notion of string attractor by focusing on its combinatorial properties. In particular, we show how the size of the smallest string attractor of a word varies when combinatorial operations are applied and we deduce that such a measure is not monotone. Moreover, we introduce a c…

research product

Balance Properties and Distribution of Squares in Circular Words

We study balance properties of circular words over alphabets of size greater than two. We give some new characterizations of balanced words connected to the Kawasaki-Ising model and to the notion of derivative of a word. Moreover we consider two different generalizations of the notion of balance, and we find some relations between them. Some of our results can be generalised to non periodic infinite words as well.

research product

Measuring the clustering effect of BWT via RLE

Abstract The Burrows–Wheeler Transform (BWT) is a reversible transformation on which are based several text compressors and many other tools used in Bioinformatics and Computational Biology. The BWT is not actually a compressor, but a transformation that performs a context-dependent permutation of the letters of the input text that often create runs of equal letters (clusters) longer than the ones in the original text, usually referred to as the “clustering effect” of BWT. In particular, from a combinatorial point of view, great attention has been given to the case in which the BWT produces the fewest number of clusters (cf. [5] , [16] , [21] , [23] ). In this paper we are concerned about t…

research product

String attractors and combinatorics on words

The notion of \emph{string attractor} has recently been introduced in [Prezza, 2017] and studied in [Kempa and Prezza, 2018] to provide a unifying framework for known dictionary-based compressors. A string attractor for a word $w=w[1]w[2]\cdots w[n]$ is a subset $\Gamma$ of the positions $\{1,\ldots,n\}$, such that all distinct factors of $w$ have an occurrence crossing at least one of the elements of $\Gamma$. While finding the smallest string attractor for a word is a NP-complete problem, it has been proved in [Kempa and Prezza, 2018] that dictionary compressors can be interpreted as algorithms approximating the smallest string attractor for a given word. In this paper we explore the noti…

research product

On the size of transducers for bidirectional decoding of prefix codes

In a previous paper [L. Giambruno and S. Mantaci, Theoret. Comput. Sci. 411 (2010) 1785–1792] a bideterministic transducer is defined for the bidirectional deciphering of words by the method introduced by Girod [ IEEE Commun. Lett. 3 (1999) 245–247]. Such a method is defined using prefix codes. Moreover a coding method, inspired by the Girod’s one, is introduced, and a transducer that allows both right-to-left and left-to-right decoding by this method is defined. It is proved also that this transducer is minimal. Here we consider the number of states of such a transducer, related to some features of the considered prefix code X . We find some bounds of such a number of states in relation wi…

research product

Equations on trees

We introduce the notion of equation on trees, generalizing the corresponding notion for words, and we develop the first steps of a theory of tree equations. The main result of the paper states that, if a pair of trees is the solution of a tree equation with two indeterminates, then the two trees are both powers of the same tree. As an application, we show that a tree can be expressed in a unique way as a power of a primitive tree. This extends a basic result of combinatorics on words to trees. Some open problems are finally proposed.

research product

An extension of the Burrows-Wheeler Transform and applications to sequence comparison and data compression

We introduce a generalization of the Burrows-Wheeler Transform (BWT) that can be applied to a multiset of words. The extended transformation, denoted by E, is reversible, but, differently from BWT, it is also surjective. The E transformation allows to give a definition of distance between two sequences, that we apply here to the problem of the whole mitochondrial genome phylogeny. Moreover we give some consideration about compressing a set of words by using the E transformation as preprocessing.

research product