0000000000009901
AUTHOR
Alessio Langiu
ON-LINE CONSTRUCTION OF A SMALL AUTOMATON FOR A FINITE SET OF WORDS
In this paper we describe a "light" algorithm for the on-line construction of a small automaton recognising a finite set of words. The algorithm runs in linear time. We carried out good experimental results on real dictionaries, on biological sequences and on the sets of suffixes (resp. factors) of a set of words that shows how our automaton is near to the minimal one. For the suffixes of a text, we propose a modified construction that leads to an even smaller automaton. We moreover construct linear algorithms for the insertion and deletion of a word in a finite set, directly from the constructed automaton.
On parsing optimality for dictionary-based text compression—the Zip case
Dictionary-based compression schemes are the most commonly used data compression schemes since they appeared in the foundational paper of Ziv and Lempel in 1977, and generally referred to as LZ77. Their work is the base of Zip, gZip, 7-Zip and many other compression software utilities. Some of these compression schemes use variants of the greedy approach to parse the text into dictionary phrases; others have left the greedy approach to improve the compression ratio. Recently, two bit-optimal parsing algorithms have been presented filling the gap between theory and best practice. We present a survey on the parsing problem for dictionary-based text compression, identifying noticeable results …
The rightmost equal-cost position problem.
LZ77-based compression schemes compress the input text by replacing factors in the text with an encoded reference to a previous occurrence formed by the couple (length, offset). For a given factor, the smallest is the offset, the smallest is the resulting compression ratio. This is optimally achieved by using the rightmost occurrence of a factor in the previous text. Given a cost function, for instance the minimum number of bits used to represent an integer, we define the Rightmost Equal-Cost Position (REP) problem as the problem of finding one of the occurrences of a factor whose cost is equal to the cost of the rightmost one. We present the Multi-Layer Suffix Tree data structure that, for…
Dictionary-symbolwise flexible parsing
AbstractLinear-time optimal parsing algorithms are rare in the dictionary-based branch of the data compression theory. A recent result is the Flexible Parsing algorithm of Matias and Sahinalp (1999) that works when the dictionary is prefix closed and the encoding of dictionary pointers has a constant cost. We present the Dictionary-Symbolwise Flexible Parsing algorithm that is optimal for prefix-closed dictionaries and any symbolwise compressor under some natural hypothesis. In the case of LZ78-like algorithms with variable costs and any, linear as usual, symbolwise compressor we show how to implement our parsing algorithm in linear time. In the case of LZ77-like dictionaries and any symbol…
Pelagic species identification by using a PNN neural network and echo-sounder data
For several years, a group of CNR researchers conducted acoustic surveys in the Sicily Channel to estimate the biomass of small pelagic species, their geographical distribution and their variations over time. The instrument used to carry out these surveys is the scientific echo-sounder, set for different frequencies. The processing of the back scattered signals in the volume of water under investigation determines the abundance of the species. These data are then correlated with the biological data of experimental catches, to attribute the composition of the various fish schools investigated. Of course, the recognition of the fish schools helps to produce very good results, that is very clo…
Artificial neural networks for fault tollerance of an air-pressure sensor network
A meteorological tsunami, commonly called Meteotsunami, is a tsunami-like wave originated by rapid changes in barometric pressure that involve the displacement of a body of water. This phenomenon is usually present in the sea cost area of Mazara del Vallo (Sicily, Italy), in particular in the internal part of the seaport canal, sometimes making local population at risk. The Institute for Coastal Marine Environment (IAMC) of the National Research Council in Italy (CNR) have already conducted several studies upon meteotsunami phenomenon. One of the project has regarded the creation of a sensors network composed by micro-barometric sensors, located in 4 different stations close to the seaport …
Abelian Powers and Repetitions in Sturmian Words
Richomme, Saari and Zamboni (J. Lond. Math. Soc. 83: 79-95, 2011) proved that at every position of a Sturmian word starts an abelian power of exponent $k$ for every $k > 0$. We improve on this result by studying the maximum exponents of abelian powers and abelian repetitions (an abelian repetition is an analogue of a fractional power) in Sturmian words. We give a formula for computing the maximum exponent of an abelian power of abelian period $m$ starting at a given position in any Sturmian word of rotation angle $\alpha$. vAs an analogue of the critical exponent, we introduce the abelian critical exponent $A(s_\alpha)$ of a Sturmian word $s_\alpha$ of angle $\alpha$ as the quantity $A(s_\a…
Optimal Parsing for Dictionary-Based Compression
Dictionary-based compression algorithms include a parsing strategy to transform the input text into a sequence of dictionary phrases. Given a text, such process usually is not unique and, for compression purposes, it makes sense to find one of the possible parsing that minimise the final compression ratio. This is the parsing problem. In more than 30 years of history of dictionary-based text compression only few optimal parsing algorithms were presented. Most of the practical dictionary-based compression solutions need or prefer to factorise the input data into a sequence of dictionary-phrases and symbols. Those two output categories are usually encoded via two different encoders producing …
Optimal Parsing for Dictionary Text Compression
Dictionary-based compression algorithms include a parsing strategy to transform the input text into a sequence of dictionary phrases. Given a text, such process usually is not unique and, for compression purpose, it makes sense to find one of the possible parsing that minimize the final compression ratio. This is the parsing problem. An optimal parsing is a parsing strategy or a parsing algorithm that solve the parsing problem taking account of all the constraints of a compression algorithm or of a class of homogeneous compression algorithms. Compression algorithm constrains are, for instance, the dictionary itself, i.e. the dynamic set of available phrases, and how much a phrase weights on…
Indexing a sequence for mapping reads with a single mismatch
Mapping reads against a genome sequence is an interesting and useful problem in computational molecular biology and bioinformatics. In this paper, we focus on the problem of indexing a sequence for mapping reads with a single mismatch. We first focus on a simpler problem where the length of the pattern is given beforehand during the data structure construction. This version of the problem is interesting in its own right in the context of the next generation sequencing. In the sequel, we show how to solve the more general problem. In both cases, our algorithm can construct an efficient data structure in time and space and can answer subsequent queries in time. Here, n is the length of the s…
Metodo e apparato per la compressione e la decompressione dati tramite il quale si possono utilizzare efficientemente molteplici compressori e decompressori tra i quali almeno un compressore e’ di tipo ” a dizionario ” .
Abelian Repetitions in Sturmian Words
We investigate abelian repetitions in Sturmian words. We exploit a bijection between factors of Sturmian words and subintervals of the unitary segment that allows us to study the periods of abelian repetitions by using classical results of elementary Number Theory. We prove that in any Sturmian word the superior limit of the ratio between the maximal exponent of an abelian repetition of period $m$ and $m$ is a number $\geq\sqrt{5}$, and the equality holds for the Fibonacci infinite word. We further prove that the longest prefix of the Fibonacci infinite word that is an abelian repetition of period $F_j$, $j>1$, has length $F_j(F_{j+1}+F_{j-1} +1)-2$ if $j$ is even or $F_j(F_{j+1}+F_{j-1}…
Bacteria classification using minimal absent words
Bacteria classification has been deeply investigated with different tools for many purposes, such as early diagnosis, metagenomics, phylogenetics. Classification methods based on ribosomal DNA sequences are considered a reference in this area. We present a new classificatier for bacteria species based on a dissimilarity measure of purely combinatorial nature. This measure is based on the notion of Minimal Absent Words, a combinatorial definition that recently found applications in bioinformatics. We can therefore incorporate this measure into a probabilistic neural network in order to classify bacteria species. Our approach is motivated by the fact that there is a vast literature on the com…