
AUTHOR

Paolo Ferragina

0000-0003-1353-360X


The Myriad Virtues of Wavelet Trees

2009

Wavelet Trees have been introduced in [Grossi, Gupta and Vitter, SODA '03] and have been rapidly recognized as a very flexible tool for the design of compressed full-text indexes and data compressors. Although several papers have investigated the beauty and usefulness of this data structure in the full-text indexing scenario, its impact on data compression has not been fully explored. In this paper we provide a complete theoretical analysis of a wide class of compression algorithms based on Wavelet Trees. We also show how to improve their asymptotic performance by introducing a novel framework, called Generalized Wavelet Trees, that aims for the best combination of binary compressors (like,…

Binary tree; Weight-balanced tree; Wavelet transform; Cascade algorithm; Data_CODINGANDINFORMATIONTHEORY; Huffman coding; Theoretical Computer Science; Computer Science Applications; Set partitioning in hierarchical trees; Wavelet; Computational Theory and Mathematics; empirical entropy; Burrows-Wheeler Transform; Algorithm; Data compression; Mathematics; Information Systems; Wavelet Trees

Indexed Two-Dimensional String Matching

2016

Settore INF/01 - Informatica; Two-dimensional index data structures; String searching algorithm; 0102 computer and information sciences; 02 engineering and technology; Approximate string matching; 01 natural sciences; Combinatorics; 010201 computation theory & mathematics; Index data structures for matrices or image; Indexing for matrices or image; 0202 electrical engineering electronic engineering information engineering; Two-dimensional indexing for pattern matching; 020201 artificial intelligence & image processing; String metric; Mathematics

Compression-based classification of biological sequences and structures via the Universal Similarity Metric: experimental assessment.

2007

Background: Similarity of sequences is a key mathematical notion for Classification and Phylogenetic studies in Biology. It is currently primarily handled using alignments. However, the alignment methods seem inadequate for post-genomic studies since they do not scale well with data set size and they seem to be confined only to genomic and proteomic sequences. Therefore, alignment-free similarity measures are actively pursued. Among those, USM (Universal Similarity Metric) has gained prominence. It is based on the deep theory of Kolmogorov Complexity and universality is its most novel striking feature. Since it can only be approximated via data compression, USM is a methodology rath…

Computer science; Prediction by partial matching; Compression dissimilarity; Biochemistry; Protein Structure, Secondary; Phylogenetic studies; Structural Biology; Sequence Analysis, Protein; Databases, Protein; lcsh:QH301-705.5; Biological data; NCD; Applied Mathematics; Genomics; Classification; CD; Computer Science Applications; Benchmarking; Computer science::Theoretical computer science [UPC subject areas]; Universal compression dissimilarity; Area Under Curve; Metric (mathematics); lcsh:R858-859.7; Data mining; Algorithms; Data compression; Research Article; Computer science::Computer applications::Bioinformatics [UPC subject areas]; Normalization (statistics); lcsh:Computer applications to medicine. Medical informatics; Bioinformatics Sequence Alignment Algorithms; Set (abstract data type); Similarity (network science); Normalized compression dissimilarity; Data compression (Computer science); Animals; Humans; Amino Acid Sequence; Molecular Biology; Biology; USM; Universal similarity metric; Proteins; UCD; Protein Structure, Tertiary; Data set; Statistical classification; lcsh:Biology (General); ROC Curve; Sequence Alignment; Software; BMC bioinformatics
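
The abstract above notes that USM can only be approximated via data compression, and the keywords mention NCD (normalized compression dissimilarity). As a rough illustration only, the sketch below computes the commonly used NCD formula with Python's standard `zlib` compressor. The choice of `zlib`, the function names `compressed_size` and `ncd`, and the toy inputs are assumptions made for this example; they are not the compressors or the experimental setup assessed in the paper.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (a stand-in for C(x))."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression dissimilarity:
    (C(xy) - min(C(x), C(y))) / max(C(x), C(y)).
    Values near 0 suggest similar inputs, values near 1 dissimilar ones.
    """
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


if __name__ == "__main__":
    rng = random.Random(0)
    repetitive = b"ACGT" * 200                              # highly regular toy sequence
    noise = bytes(rng.randrange(256) for _ in range(800))   # incompressible toy data
    print("self-dissimilarity:", ncd(repetitive, repetitive))
    print("repetitive vs noise:", ncd(repetitive, noise))
```

With a real-world compressor the value for identical inputs is small but usually not exactly zero, which is one reason the quality of the approximation depends on the compressor chosen.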

The Engineering of a Compression Boosting Library: Theory vs Practice in BWT Compression

2006

Data Compression is one of the most challenging arenas both for algorithm design and engineering. This is particularly true for Burrows and Wheeler Compression, a technique that is important in itself and for the design of compressed indexes. There has been considerable debate on how to design and engineer compression algorithms based on the BWT paradigm. In particular, Move-to-Front Encoding is generally believed to be an "inefficient" part of the Burrows-Wheeler compression process. However, only recently two theoretically superior alternatives to Move-to-Front have been proposed, namely Compression Boosting and Wavelet Trees. The main contribution of this paper is to provide the first ex…

Lossless compression; Boosting (machine learning); Computer science; Supervised learning; Compression Boosting Library; Data_CODINGANDINFORMATIONTHEORY; Machine learning; Wavelet; Algorithm design; Artificial intelligence; Algorithms; Data compression
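
For readers unfamiliar with the pipeline the abstract above refers to, here is a minimal sketch of the Burrows-Wheeler Transform followed by Move-to-Front encoding. It is a textbook-style illustration, not the Compression Boosting Library or any code from the paper: the naive rotation sort, the `$` end-of-text sentinel, and the function names `bwt` and `move_to_front` are assumptions chosen for brevity.

```python
from typing import List, Sequence


def bwt(text: str, sentinel: str = "$") -> str:
    """Naive Burrows-Wheeler Transform: sort all rotations of text + sentinel
    and read off the last column. Assumes the sentinel does not occur in text
    and sorts before every other character. Fine for toy inputs only."""
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def move_to_front(text: str, alphabet: Sequence[str]) -> List[int]:
    """Move-to-Front encoding: emit each symbol's current position in the
    table, then move that symbol to the front of the table."""
    table = list(alphabet)
    output = []
    for ch in text:
        idx = table.index(ch)
        output.append(idx)
        table.insert(0, table.pop(idx))
    return output


if __name__ == "__main__":
    transformed = bwt("banana")
    print(transformed)                                   # annb$aa
    print(move_to_front(transformed, sorted(set(transformed))))
```

Runs of equal symbols in the transformed text become zeros after Move-to-Front, which a subsequent entropy coder can exploit; the alternatives named in the abstract replace this step.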

Boosting Textual Compression in Optimal Linear Time

2005

We provide a general boosting technique for Textual Data Compression. Qualitatively, it takes a good compression algorithm and turns it into an algorithm with a better compression performance guarantee. It displays the following remarkable properties: (a) it can turn any memoryless compressor into a compression algorithm that uses the “best possible” contexts; (b) it is very simple and optimal in terms of time; and (c) it admits a decompression algorithm again optimal in time. To the best of our knowledge, this is the first boosting technique displaying these properties. Technically, our boosting technique builds upon three main ingredients: the Burrows-Wheeler Transform, the Suffix Tree d…

Theoretical computer science; Burrows–Wheeler transform; Suffix tree; String (computer science); Data_CODINGANDINFORMATIONTHEORY; Substring; Arithmetic coding; Lempel-Ziv compressors; Artificial Intelligence; Hardware and Architecture; Control and Systems Engineering; text compression; empirical entropy; Greedy algorithm; Time complexity; Algorithm; Software; Information Systems; Mathematics; Data compression