AUTHOR
Raffaele Giancarlo
Forewords
The Myriad Virtues of Wavelet Trees
Wavelet Trees have been introduced in [Grossi, Gupta and Vitter, SODA '03] and have been rapidly recognized as a very flexible tool for the design of compressed full-text indexes and data compressors. Although several papers have investigated the beauty and usefulness of this data structure in the full-text indexing scenario, its impact on data compression has not been fully explored. In this paper we provide a complete theoretical analysis of a wide class of compression algorithms based on Wavelet Trees. We also show how to improve their asymptotic performance by introducing a novel framework, called Generalized Wavelet Trees, that aims for the best combination of binary compressors (like,…
Computational cluster validation for microarray data analysis: experimental assessment of Clest, Consensus Clustering, Figure of Merit, Gap Statistics and Model Explorer
Abstract Background Inferring cluster structure in microarray datasets is a fundamental task for the so-called -omic sciences. It is also a fundamental question in Statistics, Data Analysis and Classification, in particular with regard to the prediction of the number of clusters in a dataset, usually established via internal validation measures. Despite the wealth of internal measures available in the literature, new ones have been recently proposed, some of them specifically for microarray data. Results We consider five such measures: Clest, Consensus (Consensus Clustering), FOM (Figure of Merit), Gap (Gap Statistics) and ME (Model Explorer), in addition to the classic WCSS (Within Cluster…
ValWorkBench: an open source Java library for cluster validation, with applications to microarray data analysis.
Background: Cluster analysis is one of the most well known activities in scientific investigation and the object of research in many disciplines, ranging from statistics to computer science. It is central to the life sciences due to the advent of high throughput technologies, e.g., classification of tumors. In particular, in cluster analysis, it is of relevance to assess cluster quality and to predict the number of clusters in a dataset, if any. This latter task is usually performed via internal validation measures. Despite their potentially important role, both the use of classic internal validation measures and the design of new ones, specific for microarray data, do not seem to have grea…
Basic Statistical Indices for SeqAn
On the Suitability of Neural Networks as Building Blocks for the Design of Efficient Learned Indexes
With the aim of obtaining time/space improvements in classic Data Structures, an emerging trend is to combine Machine Learning techniques with those proper to Data Structures. This new area goes under the name of Learned Data Structures. The motivation for its study is a perceived change of paradigm in Computer Architectures that would favour the use of Graphics Processing Units and Tensor Processing Units over conventional Central Processing Units. In turn, that would favour the use of Neural Networks as building blocks of Classic Data Structures. Indeed, Learned Bloom Filters, which are one of the main pillars of Learned Data Structures, make extensive use of Neural Networks to improve…
Topological ranks reveal functional knowledge encoded in biological networks: a comparative analysis
Abstract Motivation Biological networks topology yields important insights into biological function, occurrence of diseases and drug design. In the last few years, different types of topological measures have been introduced and applied to infer the biological relevance of network components/interactions, according to their position within the network structure. Although comparisons of such measures have been previously proposed, to what extent the topology per se may lead to the extraction of novel biological knowledge has never been critically examined nor formalized in the literature. Results We present a comparative analysis of nine outstanding topological measures, based on compact vie…
Multi-dimensional pattern matching with dimensional wildcards
We introduce a new multi-dimensional pattern matching problem, which is a natural generalization of the on-line search in string matching. We are given a text matrix A[1:n_1, ..., 1:n_d] of size N = n_1 × n_2 × ... × n_d, which we may preprocess. Then, we are given, online, an r-dimensional pattern matrix B[1:m_1, ..., 1:m_r] of size M = m_1 × m_2 × ... × m_r, with 1 ≤ r ≤ d. We would like to know whether B* = B*[*, 1:m_1, *, ..., 1:m_r, *] occurs in A, where * is a dimensional wildcard such that B* is any d-dimensional matrix having size 1 × ... × m_1 × ... 1 × m_r × ... 1 and containing the same elements as B. Notice that there might be C(d, r) ≤ 2^d occurrences of B* for each position of A, where C(d, r) is the binomial coefficient. We give CRCW-PRAM algorithms for pr…
Differential Expression of Proteolytic Enzymes During Epithelia-Mesenchyma Transaction of Endothelial Cells
Indexed Two-Dimensional String Matching
On the determinization of weighted finite automata
We study determinization of weighted finite-state automata (WFAs), which has important applications in automatic speech recognition (ASR). We provide the first polynomial-time algorithm to test for the twins property, which determines if a WFA admits a deterministic equivalent. We also provide a rigorous analysis of a determinization algorithm of Mohri, with tight bounds for acyclic WFAs. Given that WFAs can expand exponentially when determinized, we explore why those used in ASR tend to shrink. The folklore explanation is that ASR WFAs have an acyclic, multi-partite structure. We show, however, that there exist such WFAs that always incur exponential expansion when determinized. We then in…
Compression-based classification of biological sequences and structures via the Universal Similarity Metric: experimental assessment.
Abstract Background Similarity of sequences is a key mathematical notion for Classification and Phylogenetic studies in Biology. It is currently primarily handled using alignments. However, the alignment methods seem inadequate for post-genomic studies since they do not scale well with data set size and they seem to be confined only to genomic and proteomic sequences. Therefore, alignment-free similarity measures are actively pursued. Among those, USM (Universal Similarity Metric) has gained prominence. It is based on the deep theory of Kolmogorov Complexity and universality is its most novel striking feature. Since it can only be approximated via data compression, USM is a methodology rath…
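As an illustration of how a Kolmogorov-complexity-based similarity can be approximated via data compression, the following minimal sketch computes the Normalized Compression Distance, one widely used compression-based approximation. It is not taken from the paper: the choice of `zlib` as the compressor, the helper names and the toy sequences are all assumptions made for the example.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (stand-in compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance: (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Toy data (hypothetical): a highly repetitive sequence and a pseudo-random one.
rng = random.Random(0)
periodic = b"ACGT" * 200
shuffled = bytes(rng.choice(b"ACGT") for _ in range(800))

if __name__ == "__main__":
    print("NCD(periodic, periodic) =", ncd(periodic, periodic))
    print("NCD(periodic, shuffled) =", ncd(periodic, shuffled))
```

A sequence compared with itself should score lower than the same sequence compared with unrelated data. The actual values depend on the compressor, which is why the choice of compressor matters in practice.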
Guest Editors' Introduction to the Special Section on Algorithms in Bioinformatics
From First Principles to the Burrows and Wheeler Transform and Beyond, via Combinatorial Optimization
Abstract We introduce a combinatorial optimization framework that naturally induces a class of optimal word permutations with respect to a suitably defined cost function taking into account various measures of relatedness between words. The Burrows and Wheeler transform (bwt) (cf. [M. Burrows, D. Wheeler, A block sorting lossless data compression algorithm, Technical Report 124, Digital Equipment Corporation, 1994]), and its analog for labelled trees (cf. [P. Ferragina, F. Luccio, G. Manzini, S. Muthukrishnan, Structuring labeled trees for optimal succinctness, and beyond, in: Proc. of the 45th Annual IEEE Symposium on Foundations of Computer Science, 2005, pp. 198–207]), are special cases i…
FASTdoop: A versatile and efficient library for the input of FASTA and FASTQ files for MapReduce Hadoop bioinformatics applications
Abstract Summary MapReduce Hadoop bioinformatics applications require the availability of special-purpose routines to manage the input of sequence files. Unfortunately, the Hadoop framework does not provide any built-in support for the most popular sequence file formats like FASTA or BAM. Moreover, the development of these routines is not easy, both because of the diversity of these formats and the need for managing efficiently sequence datasets that may count up to billions of characters. We present FASTdoop, a generic Hadoop library for the management of FASTA and FASTQ files. We show that, with respect to analogous input management routines that have appeared in the Literature, it offers…
Differential expression of proteolytic enzymes during epithelial-mesenchymal transaction of endothelial cells.
Optimal Partitions of Strings: A New Class of Burrows-Wheeler Compression Algorithms
The Burrows-Wheeler transform [1] is one of the mainstays of lossless data compression. In most cases, its output is fed to Move to Front or other variations of symbol ranking compression. One of the main open problems [2] is to establish whether Move to Front, or more generally symbol ranking compression, is an essential part of the compression process. We settle this question positively by providing a new class of Burrows-Wheeler algorithms that use optimal partitions of strings, rather than symbol ranking, for the additional step. Our technique is a quite surprising specialization to strings of partitioning techniques devised by Buchsbaum et al. [3] for two-dimensional table compression…
Block Sorting-Based Transformations on Words: Beyond the Magic BWT
The Burrows-Wheeler Transform (BWT) is a word transformation introduced in 1994 for Data Compression and later results have contributed to make it a fundamental tool for the design of self-indexing compressed data structures. The Alternating Burrows-Wheeler Transform (ABWT) is a more recent transformation, studied in the context of Combinatorics on Words, that works in a similar way, using an alternating lexicographical order instead of the usual one. In this paper we study a more general class of block sorting-based transformations. The transformations in this new class prove to be interesting combinatorial tools that offer new research perspectives. In particular, we show that all the tra…
Distance Functions, Clustering Algorithms and Microarray Data Analysis
Distance functions are a fundamental ingredient of classification and clustering procedures, and this holds true also in the particular case of microarray data. In the general data mining and classification literature, functions such as Euclidean distance or Pearson correlation have gained their status of de facto standards thanks to a considerable amount of experimental validation. For microarray data, the issue of which distance function works best has been investigated, but no final conclusion has been reached. The aim of this extended abstract is to shed further light on that issue. Indeed, we present an experimental study, involving several distances, assessing (a) their intrinsic sepa…
Parallel Construction and Query of Index Data Structures for Pattern Matching on Square Matrices
Abstract We describe fast parallel algorithms for building index data structures that can be used to gather various statistics on square matrices. The main data structure is the Lsuffix tree, which is a generalization of the classical suffix tree for strings. Given an n×n text matrix A, we build our data structures in O(log n) time with n^2 processors on a CRCW PRAM, so that we can quickly process A in parallel as follows: (i) report some statistical information about A, e.g., find the largest repeated square submatrices that appear at least twice in A or determine, for each position in A, the smallest submatrix that occurs only there; (ii) given, on-line, an m×m pattern matrix PAT, check whether it occurs i…
On finding common neighborhoods in massive graphs
Abstract We consider the problem of finding pairs of vertices that share large common neighborhoods in massive graphs. We prove lower bounds on the resources needed to solve this problem on resource-bounded models of computation. In streaming models, in which algorithms can access the input only a constant number of times and only sequentially, we show that, even with randomization, any algorithm that determines if there exists any pair of vertices with a large common neighborhood must essentially store and process the input graph off line. In sampling models, in which algorithms can only query an oracle for the common neighborhoods of specified vertex pairs, we show that any algorithm must …
Network Centralities and Node Ranking
An important problem in network analysis is understanding how important nodes are for “propagating” information across the input network. To this end, many centrality measures have been proposed in the literature, and our main goal here is to provide an overview of the most important of them. In particular, we distinguish centrality measures based on walks computation from those based on shortest-paths computation. We also provide some examples in order to clarify how these measures can be calculated, with special attention to Degree Centrality, Closeness Centrality and Betweenness Centrality.
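To make two of the measures named above concrete, here is a minimal sketch of Degree Centrality and Closeness Centrality on an undirected, unweighted, connected graph. It uses the common normalized textbook definitions, which may differ in detail from those adopted in the chapter. The adjacency-dictionary representation, the function names and the toy graph `PATH` are assumptions made for illustration.

```python
from collections import deque


def degree_centrality(graph):
    """Degree of each node divided by (n - 1); assumes at least two nodes."""
    n = len(graph)
    return {v: len(neighbours) / (n - 1) for v, neighbours in graph.items()}


def bfs_distances(graph, source):
    """Shortest-path distances (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in graph[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def closeness_centrality(graph):
    """(n - 1) divided by the sum of distances to all other nodes.

    Assumes a connected graph, so every node is reachable from every other.
    """
    n = len(graph)
    result = {}
    for v in graph:
        total = sum(bfs_distances(graph, v).values())
        result[v] = (n - 1) / total if total > 0 else 0.0
    return result


# Hypothetical toy graph: the path a - b - c.
PATH = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}

if __name__ == "__main__":
    print(degree_centrality(PATH))
    print(closeness_centrality(PATH))
```

On the path a - b - c, the middle node b is adjacent to every other node, so it gets the highest score under both measures.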
Preface
Textual data compression in computational biology: Algorithmic techniques
Abstract In a recent review [R. Giancarlo, D. Scaturro, F. Utro, Textual data compression in computational biology: a synopsis, Bioinformatics 25 (2009) 1575–1586] the first systematic organization and presentation of the impact of textual data compression for the analysis of biological data has been given. Its main focus was on a systematic presentation of the key areas of bioinformatics and computational biology where compression has been used together with a technical presentation of how well-known notions from information theory have been adapted to successfully work on biological data. Rather surprisingly, the use of data compression is pervasive in computational biology. Starting from…
The Power of Word-Frequency Based Alignment-Free Functions: a Comprehensive Large-Scale Experimental Analysis
Abstract Motivation Alignment-free (AF) distance/similarity functions are a key tool for sequence analysis. Experimental studies on real datasets abound and, to some extent, there are also studies regarding their control of false positive rate (Type I error). However, assessment of their power, i.e. their ability to identify true similarity, has been limited to some members of the D2 family. The corresponding experimental studies have concentrated on short sequences, a scenario no longer adequate for current applications, where sequence lengths may vary considerably. Such a State of the Art is methodologically problematic, since information regarding a key feature such as power is either mi…
On-line construction of two-dimensional suffix trees
We present a new technique, which we refer to as implicit updates, based on which we obtain: (a) an algorithm for the on-line construction of the Lsuffix tree of an n × n matrix A (this data structure, described in [13], is the two-dimensional analog of the suffix tree of a string); (b) simple algorithms implementing primitive operations for LZ1-type on-line lossless image compression methods. Those methods, recently introduced by Storer [35], are generalizations of LZ1-type compression methods for strings (see also [24, 31]). For the problem in (a), we get nearly an order of magnitude improvement over algorithms that can be derived from known techniques [13]. For the problem in (b), we do …
Speeding up the Consensus Clustering methodology for microarray data analysis
Abstract Background The inference of the number of clusters in a dataset, a fundamental problem in Statistics, Data Analysis and Classification, is usually addressed via internal validation measures. The stated problem is quite difficult, in particular for microarrays, since the inferred prediction must be sensible enough to capture the inherent biological structure in a dataset, e.g., functionally related genes. Despite the rich literature present in that area, the identification of an internal validation measure that is both fast and precise has proved to be elusive. In order to partially fill this gap, we propose a speed-up of Consensus (Consensus Clustering), a methodology whose purpose…
The Engineering of a Compression Booster: Theory Vs Practice in BWT Compression
Multi-Dimensional Pattern Matching with Dimensional Wildcards: Data Structures and Optimal On-Line Search Algorithms
We introduce a new multidimensional pattern matching problem that is a natural generalization of string matching, a well studied problem. The motivation for its algorithmic study is mainly theoretical. Let A[1:n_1, ..., 1:n_d] be a text matrix with N = n_1 × ... × n_d entries and B[1:m_1, ..., 1:m_r] be a pattern matrix with M = m_1 × ... × m_r entries, where d ≥ r ≥ 1 (the matrix entries are taken from an ordered alphabet Σ). We study the problem of checking whether some r-dimensional submatrix of A is equal to B (i.e., a decision query). A can be preprocessed and B is given on-line. We define a new data structure for preprocessing A and propose CRCW-PRAM algorithms that build it in O(log N) time with N^2/n_max processors, where n_max = max(n_1, ..., n_d), such that …
Articles selected from posters presented at the Tenth Annual International Conference on Research in Computational Biology - Preface and Special Issue
Mapreduce in computational biology via hadoop and spark
Bioinformatics has a long history of software solutions developed on multi-core computing systems for solving computationally intensive problems. This option suffers from some issues that can be solved by shifting to Distributed Systems. In particular, the MapReduce computing paradigm, and its implementations, Hadoop and Spark, is becoming increasingly popular in the Bioinformatics field because it allows for virtually unlimited horizontal scalability while being easy to use. Here we provide a qualitative evaluation of some of the most significant MapReduce bioinformatics applications. We also focus on one of these applications to show the importance of correctly engineering an application to fully exploi…
O(n^2 log n) Time On-Line Construction of Two-Dimensional Suffix Trees
The two-dimensional suffix tree of an n × n square matrix A is a compacted trie that represents all square submatrices of A [9]. For the off-line case, i.e., A is given in advance to the algorithm, it is known how to build it in optimal time, for any type of alphabet size [9,15]. Motivated by applications in Image Compression [18], Giancarlo and Guaiana [12] considered the on-line version of the two-dimensional suffix tree and presented an O(n^2 log^2 n)-time algorithm, which we refer to as GG. That algorithm is a non-trivial generalization of Ukkonen's on-line algorithm for standard suffix trees [19]. The main contribution in this paper is an O(log n) factor improvement in the time complex…
A Critical Analysis of Classifier Selection in Learned Bloom Filters
Learned Bloom Filters, i.e., models induced from data via machine learning techniques and solving the approximate set membership problem, have recently been introduced with the aim of enhancing the performance of standard Bloom Filters, with special focus on space occupancy. Unlike in the classical case, the "complexity" of the data used to build the filter might heavily impact on its performance. Therefore, here we propose the first in-depth analysis, to the best of our knowledge, for the performance assessment of a given Learned Bloom Filter, in conjunction with a given classifier, on a dataset of a given classification complexity. Indeed, we propose a novel methodology, supported by soft…
Textual data compression in computational biology: a synopsis.
Abstract Motivation: Textual data compression, and the associated techniques coming from information theory, are often perceived as being of interest for data communication and storage. However, they are also deeply related to classification and data mining and analysis. In recent years, a substantial effort has been made for the application of textual data compression techniques to various computational biology tasks, ranging from storage and indexing of large datasets to comparison and reverse engineering of biological networks. Results: The main focus of this review is on a systematic presentation of the key areas of bioinformatics and computational biology where compression has been use…
Permutations, Partitions and Combinatorial Compression Boosting
Functional Information, Biomolecular Messages and Complexity of BioSequences and Structures
In the quest for a mathematical measure able to capture and shed light on the dual notions of information and complexity in biosequences, Hazen et al. have introduced the notion of Functional Information (FI for short). It is also the result of earlier considerations and findings by Szostak and Carothers et al. Based on the experiments by Carothers et al., regarding FI in RNA binding activities, we decided to study the relation existing between FI and classic measures of complexity applied to protein-DNA interactions on a genome-wide scale. Using classic complexity measures, i.e., Shannon entropy and Kolmogorov Complexity as both estimated by data compression, we found that FI applied to pro…
Epigenomic k-mer dictionaries: shedding light on how sequence composition influences in vivo nucleosome positioning
Abstract Motivation: Information-theoretic and compositional analysis of biological sequences, in terms of k-mer dictionaries, has a well established role in genomic and proteomic studies. Much less so in epigenomics, although the role of k-mers in chromatin organization and nucleosome positioning is particularly relevant. Fundamental questions concerning the informational content and compositional structure of nucleosome favouring and disfavoring sequences with respect to their basic building blocks still remain open. Results: We present the first analysis on the role of k-mers in the composition of nucleosome enriched and depleted genomic regions (NER and NDR for short) that is: (i) exhau…
Learned Sorted Table Search and Static Indexes in Small-Space Data Models
Machine-learning techniques, properly combined with data structures, have resulted in Learned Static Indexes, innovative and powerful tools that speed up Binary Searches with the use of additional space with respect to the table being searched. Such space is devoted to the machine-learning models. Although in their infancy, these are methodologically and practically important, due to the pervasiveness of Sorted Table Search procedures. In modern applications, model space is a key factor, and a major open question concerning this area is to assess to what extent one can enjoy the speeding up of Binary Searches achieved by Learned Indexes while using constant or nearly constant-space mod…
Algorithmic Aspects of Speech Recognition: A Synopsis
Speech recognition is an area with a sizable literature, but there is little discussion of the topic within the computer science algorithms community. Since many of the problems arising in speech recognition are well suited for algorithmic studies, we present them in terms familiar to algorithm designers. Such cross fertilization can breed fresh insights from new perspectives. This material is abstracted from A. L. Buchsbaum and R. Giancarlo, Algorithmic Aspects of Speech Recognition: An Introduction, ACM Journal of Experimental Algorithmics, Vol. 2, 1997, http://www.jea.acm.org.
Algorithmic paradigms for stability-based cluster validity and model selection statistical methods, with applications to microarray data analysis
Abstract The advent of high throughput technologies, in particular microarrays, for biological research has revived interest in clustering, resulting in a plethora of new clustering algorithms. However, model selection, i.e., the identification of the correct number of clusters in a dataset, has received relatively little attention. Indeed, although central for statistics, its difficulty is also well known. Fortunately, a few novel techniques for model selection, representing a sharp departure from previous ones in statistics, have been proposed and gained prominence for microarray data analysis. Among those, the stability-based methods are the most robust and best performing in terms of pre…
On the Construction of Classes of Suffix Trees for Square Matrices: Algorithms and Applications
Abstract We provide a uniform framework for the study of index data structures for a two-dimensional matrix TEXT[1:n, 1:n] whose entries are drawn from an ordered alphabet Σ. An index for TEXT can be informally seen as the two-dimensional analog of the suffix tree for a string. It allows on-line searches and statistics to be performed on TEXT by representing compactly the Θ(n^3) square submatrices of TEXT in optimal O(n^2) space. We identify 4^(n−1) families of indices for TEXT, each containing ∏_{i=1}^{n} (2i−1)! isomorphic data structures. We also develop techniques leading to a single algorithm that efficiently builds any index in any family in O(n^2 log n) time and O(n^2) space. Such an algorithm improves in various …
FASTA/Q data compressors for MapReduce-Hadoop genomics: space and time savings made easy
Abstract Background Storage of genomic data is a major cost for the Life Sciences, effectively addressed via specialized data compression methods. For the same reasons of abundance in data production, the use of Big Data technologies is seen as the future for genomic data storage and processing, with MapReduce-Hadoop as leaders. Somewhat surprisingly, none of the specialized FASTA/Q compressors is available within Hadoop. Indeed, their deployment there is not exactly immediate. Such a State of the Art is problematic. Results We provide major advances in two different directions. Methodologically, we propose two general methods, with the corresponding software, that make it very easy to deploy …
The Engineering of a Compression Boosting Library: Theory vs Practice in BWT Compression
Data Compression is one of the most challenging arenas both for algorithm design and engineering. This is particularly true for Burrows and Wheeler Compression, a technique that is important in itself and for the design of compressed indexes. There has been considerable debate on how to design and engineer compression algorithms based on the BWT paradigm. In particular, Move-to-Front Encoding is generally believed to be an "inefficient" part of the Burrows-Wheeler compression process. However, only recently two theoretically superior alternatives to Move-to-Front have been proposed, namely Compression Boosting and Wavelet Trees. The main contribution of this paper is to provide the first ex…
Standard Vs Uniform Binary Search and Their Variants in Learned Static Indexing: The Case of the Searching on Sorted Data Benchmarking Software Platform
Learned Indexes are a novel approach to search in a sorted table. A model is used to predict an interval in which to search, and a Binary Search routine is used to finalize the search. They are quite effective. For the final stage, usually, the lower_bound routine of the Standard C++ library is used, although this is more of a natural choice than a requirement. However, recent studies that do not use Machine Learning predictions indicate that other implementations of Binary Search or variants, namely k-ary Search, are better suited to take advantage of the features offered by modern computer architectures. With the use of the Searching on Sorted Sets SOSD Learned Indexing bench…
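The two-stage scheme described above (a model predicts an interval, then a Binary Search finalizes the search) can be sketched as follows. This is a toy illustration under stated assumptions, not the code of the paper or of the SOSD platform: the model is a single least-squares line, Python's `bisect_left` stands in for the C++ `lower_bound` routine, and the class name `LinearLearnedIndex` and its attributes are hypothetical.

```python
from bisect import bisect_left


class LinearLearnedIndex:
    """Toy learned index: one linear model plus a bounded binary search."""

    def __init__(self, keys):
        if not keys:
            raise ValueError("keys must be a non-empty sorted list")
        self.keys = keys
        n = len(keys)
        # Least-squares fit of position as a linear function of the key.
        mean_key = sum(keys) / n
        mean_pos = (n - 1) / 2
        var = sum((k - mean_key) ** 2 for k in keys)
        cov = sum((k - mean_key) * (i - mean_pos) for i, k in enumerate(keys))
        self.slope = cov / var if var > 0 else 0.0
        self.intercept = mean_pos - self.slope * mean_key
        # Largest prediction error observed on the stored keys.
        self.eps = max(abs(i - self._predict(k)) for i, k in enumerate(keys))

    def _predict(self, key):
        return round(self.slope * key + self.intercept)

    def lower_bound(self, key):
        """Index of the first element >= key (len(keys) if there is none)."""
        n = len(self.keys)
        guess = self._predict(key)
        # Stage 1: the model narrows the search to a window around its guess.
        lo = min(n, max(0, guess - self.eps - 1))
        hi = min(n, max(lo, guess + self.eps + 2))
        # Stage 2: binary search restricted to that window.
        i = bisect_left(self.keys, key, lo, hi)
        # Safety net: if the window missed the answer, search the whole table.
        left_ok = i == 0 or self.keys[i - 1] < key
        right_ok = i == n or self.keys[i] >= key
        if not (left_ok and right_ok):
            i = bisect_left(self.keys, key)
        return i


if __name__ == "__main__":
    table = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
    index = LinearLearnedIndex(table)
    print(index.lower_bound(12), bisect_left(table, 12))
```

The final validity check keeps the sketch correct even for queries on which the model's error exceeds the bound measured on the stored keys. Swapping the second-stage routine for another Binary Search variant only requires changing the call inside `lower_bound`.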
Generalizations of the periodicity Theorem of Fine and Wilf
We provide three generalizations to the two-dimensional case of the well known periodicity theorem by Fine and Wilf [4] for strings (the one-dimensional case). The first and the second generalizations can be further extended to hold in the more general setting of Cayley graphs of groups. Weak forms of two of our results have been developed for the design of efficient algorithms for two-dimensional pattern matching [2, 3, 6].
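For reference, the classical one-dimensional theorem of Fine and Wilf states that if a string w has periods p and q and |w| ≥ p + q − gcd(p, q), then gcd(p, q) is also a period of w. The sketch below checks this statement by brute force on short strings. It illustrates only the one-dimensional case, not the two-dimensional generalizations of the paper, and the function names are chosen for illustration.

```python
from itertools import product
from math import gcd


def has_period(w, p):
    """True if p is a period of w, i.e. w[i] == w[i + p] wherever both exist."""
    return 0 < p <= len(w) and all(w[i] == w[i + p] for i in range(len(w) - p))


def fine_wilf_applies(w, p, q):
    """True if w has periods p and q and is long enough for the theorem."""
    return has_period(w, p) and has_period(w, q) and len(w) >= p + q - gcd(p, q)


def verify_fine_wilf(alphabet="ab", max_len=8):
    """Exhaustively check the theorem on all strings up to max_len."""
    for n in range(1, max_len + 1):
        for letters in product(alphabet, repeat=n):
            w = "".join(letters)
            for p in range(1, n + 1):
                for q in range(p, n + 1):
                    if fine_wilf_applies(w, p, q) and not has_period(w, gcd(p, q)):
                        return False
    return True


if __name__ == "__main__":
    # "aabaa" has periods 3 and 4 but length 5 < 3 + 4 - 1, so the theorem
    # does not apply, and indeed gcd(3, 4) = 1 is not a period.
    print(has_period("aabaa", 3), has_period("aabaa", 4), has_period("aabaa", 1))
    print(verify_fine_wilf())
```

The string "aabaa" shows that the length bound cannot be lowered for p = 3 and q = 4: it is one symbol short of the bound and fails to have period 1.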
A basic analysis toolkit for biological sequences
This paper presents a software library, nicknamed BATS, for some basic sequence analysis tasks. Namely, local alignments, via approximate string matching, and global alignments, via longest common subsequence and alignments with affine and concave gap cost functions. Moreover, it also supports filtering operations to select strings from a set and establish their statistical significance, via z-score computation. None of the algorithms is new, but although they are generally regarded as fundamental for sequence analysis, they have not been implemented in a single and consistent software package, as we do here. Therefore, our main contribution is to fill this gap between algorithmic theory an…
Pattern Matching Algorithms
Table Compression
Data Compression Techniques for massive tables are described. Related methodological results are also presented.
DNA combinatorial messages and Epigenomics: The case of chromatin organization and nucleosome occupancy in eukaryotic genomes
Abstract Epigenomics is the study of modifications on the genetic material of a cell that do not depend on changes in the DNA sequence, since those latter involve specific proteins around which DNA wraps. The end result is that Epigenomic changes have a fundamental role in the proper working of each cell in Eukaryotic organisms. A particularly important part of Epigenomics concentrates on the study of chromatin, that is, a fiber composed of a DNA-protein complex and highly characteristic of Eukaryotes. Understanding how chromatin is assembled and how it changes is fundamental for Biology. In more than thirty years of research in this area, Mathematics and Theoretical Computer Science have gai…
Forewords-Special Issue Combinatorial Pattern Matching 2011
The Three Steps of Clustering In The Post-Genomic Era
This chapter describes the basic algorithmic components that are involved in clustering, with particular attention to classification of microarray data.
Novel Combinatorial and Information-Theoretic Alignment-Free Distances for Biological Data Mining
Among the plethora of alignment-free methods for comparing biological sequences, there are some that we have perceived as representative of the novel techniques that have been devised in the past few years and as being of a fundamental nature and of broad interest and applicability, ranging from combinatorics to information theory. In this chapter, we review these alignment-free methods, by presenting both their mathematical definitions and the experiments in which they are involved.
A methodology to assess the intrinsic discriminative ability of a distance function and its interplay with clustering algorithms for microarray data analysis
Abstract Background Clustering is one of the most well known activities in scientific investigation and the object of research in many disciplines, ranging from statistics to computer science. Following Handl et al., it can be summarized as a three step process: (1) choice of a distance function; (2) choice of a clustering algorithm; (3) choice of a validation method. Although such a purist approach to clustering is hardly seen in many areas of science, genomic data require that level of attention, if inferences made from cluster analysis have to be of some relevance to biomedical research. Results A procedure is proposed for the assessment of the discriminative ability of a distance functi…
Periodicity and repetitions in parameterized strings
Abstract One of the most beautiful and useful notions in the Mathematical Theory of Strings is that of a Period, i.e., an initial piece of a given string that can generate that string by repeating itself at regular intervals. Periods have an elegant mathematical structure and a wealth of applications [F. Mignosi and A. Restivo, Periodicity, Algebraic Combinatorics on Words, in: M. Lothaire (Ed.), Cambridge University Press, Cambridge, pp. 237–274, 2002]. At the heart of their theory, there are two Periodicity Lemmas: one due to Lyndon and Schutzenberger [The equation a^M = b^N c^P in a free group, Michigan Math. J. 9 (1962) 289–298], referred to as the Weak Version, and the other due to Fine and …
Boosting Textual Compression in Optimal Linear Time
We provide a general boosting technique for Textual Data Compression. Qualitatively, it takes a good compression algorithm and turns it into an algorithm with a better compression performance guarantee. It displays the following remarkable properties: (a) it can turn any memoryless compressor into a compression algorithm that uses the “best possible” contexts; (b) it is very simple and optimal in terms of time; and (c) it admits a decompression algorithm again optimal in time. To the best of our knowledge, this is the first boosting technique displaying these properties. Technically, our boosting technique builds upon three main ingredients: the Burrows-Wheeler Transform, the Suffix Tree d…
Statistical Indexes for Computational and Data Driven Class Discovery in Microarray Data
Alignment-free Genomic Analysis via a Big Data Spark Platform
Motivation: Alignment-free distance and similarity functions (AF functions, for short) are a well-established alternative to pairwise and multiple sequence alignments for many genomic, metagenomic and epigenomic tasks. Due to data-intensive applications, the computation of AF functions is a Big Data problem, with the recent literature indicating that the development of fast and scalable algorithms computing AF functions is a high-priority task. Somewhat surprisingly, despite the increasing popularity of Big Data technologies in computational biology, the development of a Big Data platform for those tasks has not been pursued, possibly due to its complexity. Results: We fill this impo…
Bayesian versus data driven model selection for microarray data
Clustering is one of the most well known activities in scientific investigation and the object of research in many disciplines, ranging from Statistics to Computer Science. In this beautiful area, one of the most difficult challenges is a particular instance of the model selection problem, i.e., the identification of the correct number of clusters in a dataset. In what follows, for ease of reference, we refer to that instance still as model selection. It is an important part of any statistical analysis. The techniques used for solving it are mainly either Bayesian or data-driven, and are both based on internal knowledge. That is, they use information obtained by processing the input data. A…
$O(n^2 \log n)$ Time On-line Construction of Two-Dimensional Suffix Trees
The two-dimensional suffix tree of an n × n square matrix A is a compacted trie that represents all square submatrices of A [11]. For the off-line case, i.e., A is given in advance to the algorithm, it is known how to build it in optimal time, for any type of alphabet size [11], [18]. Motivated by applications in Image Compression [22], Giancarlo and Guaiana [14] considered the on-line version of the two-dimensional suffix tree and presented an $O(n^2 \log^2 n)$-time algorithm, which we refer to as GG. That algorithm is a nontrivial generalization of Ukkonen’s on-line algorithm for standard suffix trees [23]. The main contribution in this paper is an $O(\log n)$ factor improvement in the time comple…
Compressive biological sequence analysis and archival in the era of high-throughput sequencing technologies
High-throughput sequencing technologies produce large collections of data, mainly DNA sequences with additional information, requiring the design of efficient and effective methodologies for both their compression and storage. In this context, we first provide a classification of the main techniques that have been proposed, according to three specific research directions that have emerged from the literature and, for each, we provide an overview of the current techniques. Finally, to make this review useful to researchers and technicians applying the existing software and tools, we include a synopsis of the main characteristics of the described approaches, including details on their impleme…
Learned Sorted Table Search and Static Indexes in Small Model Space
Machine Learning Techniques, properly combined with Data Structures, have resulted in Learned Static Indexes, innovative and powerful tools that speed up Binary Search, with the use of additional space with respect to the table being searched. Such space is devoted to the ML model. Although in their infancy, they are methodologically and practically important, due to the pervasiveness of Sorted Table Search procedures. In modern applications, model space is a key factor and, in fact, a major open question concerning this area is to assess to what extent one can enjoy the speed-up of Learned Indexes while using constant or nearly constant space models. We address it here by (a) introducing…
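As an illustration of the general idea of a learned index with a constant-space model, the following is a minimal sketch, not the models introduced in the paper. It assumes a sorted table of numeric keys, fits a single least-squares line from key to position, records the largest prediction error, and then runs Binary Search only inside the window allowed by that error. The names `fit_linear_model` and `learned_search` are hypothetical.

```python
import bisect


def fit_linear_model(table):
    """Fit position ~ slope * key + intercept on a sorted, non-empty table.

    Returns (slope, intercept, max_err): three numbers, i.e. a constant-space model.
    """
    n = len(table)
    mean_x = sum(table) / n
    mean_y = (n - 1) / 2
    var_x = sum((x - mean_x) ** 2 for x in table)
    if var_x == 0:
        slope = 0.0
    else:
        cov = sum((x - mean_x) * (i - mean_y) for i, x in enumerate(table))
        slope = cov / var_x
    intercept = mean_y - slope * mean_x
    # Largest distance between the true and the predicted position of a key.
    max_err = max(abs(i - round(slope * x + intercept)) for i, x in enumerate(table))
    return slope, intercept, max_err


def learned_search(table, model, key):
    """Return an index i with table[i] == key, or -1 if key is absent."""
    slope, intercept, max_err = model
    n = len(table)
    guess = round(slope * key + intercept)
    # Restrict Binary Search to the window guaranteed by max_err.
    lo = min(max(0, guess - max_err), n)
    hi = max(lo, min(n, guess + max_err + 1))
    pos = bisect.bisect_left(table, key, lo, hi)
    if pos < n and table[pos] == key:
        return pos
    return -1
```

The window has size at most 2 * max_err + 1, so the search is faster than plain Binary Search only when the model fits the key distribution well; that trade-off is an assumption of this sketch, not a result taken from the abstract.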
SAIL: String Algorithms, Information and Learning, Preface and Special Issue
Computational Cluster Validation in the Big Data Era
Data-driven class discovery, i.e., the inference of cluster structure in a dataset, is a fundamental task in Data Analysis, in particular for the Life Sciences. We provide a tutorial on the most common approaches used for that task, focusing on methodologies for the prediction of the number of clusters in a dataset. Although the methods that we present are general in terms of the data for which they can be used, we offer a case study relevant for Microarray Data Analysis.
An effective extension of the applicability of alignment-free biological sequence comparison algorithms with Hadoop
Alignment-free methods are one of the mainstays of biological sequence comparison, i.e., the assessment of how similar two biological sequences are to each other, a fundamental and routine task in computational biology and bioinformatics. They have gained popularity since, even on standard desktop machines, they are faster than methods based on alignments. However, with the advent of Next-Generation Sequencing Technologies, datasets whose size, i.e., number of sequences and their total length, is a challenge to the execution of alignment-free methods on those standard machines are quite common. Here, we propose the first paradigm for the computation of k-mer-based alignment-free methods for…
Stability-Based Model Selection for High Throughput Genomic Data: An Algorithmic Paradigm
Clustering is one of the most well known activities in scientific investigation and the object of research in many disciplines, ranging from Statistics to Computer Science. In this beautiful area, one of the most difficult challenges is the model selection problem, i.e., the identification of the correct number of clusters in a dataset. In the last decade, a few novel techniques for model selection, representing a sharp departure from previous ones in statistics, have been proposed and gained prominence for microarray data analysis. Among those, the stability-based methods are the most robust and best performing in terms of prediction, but the slowest in terms of time. Unfortunately…
Longest Common Subsequence from Fragments via Sparse Dynamic Programming
Sparse Dynamic Programming has emerged as an essential tool for the design of efficient algorithms for optimization problems coming from such diverse areas as Computer Science, Computational Biology and Speech Recognition [7,11,15]. We provide a new Sparse Dynamic Programming technique that extends the Hunt-Szymanski [2,9,8] paradigm for the computation of the Longest Common Subsequence (LCS) and apply it to solve the LCS from Fragments problem: given a pair of strings X and Y (of length n and m, resp.) and a set M of matching substrings of X and Y, find the longest common subsequence based only on the symbol correspondences induced by the substrings. This problem arises in an application t…
Alignment-Free Sequence Comparison over Hadoop for Computational Biology
Sequence comparison, i.e., the assessment of how similar two biological sequences are to each other, is a fundamental and routine task in Computational Biology and Bioinformatics. Classically, alignment methods are the de facto standard for such an assessment. In fact, considerable research efforts for the development of efficient algorithms, both on classic and parallel architectures, have been carried out in the past 50 years. Due to the growing amount of sequence data being produced, a new class of methods has emerged: Alignment-free methods. Research in this area has become very intense in the past few years, stimulated by the advent of Next Generation Sequencing technologies, since those…
Algorithms in Bioinformatics
Algorithmics for the Life Sciences
The life sciences, in particular molecular biology and medicine, have witnessed fundamental progress since the discovery of “the Double Helix”. A relevant part of such an incredible advancement in knowledge has been possible thanks to synergies with the mathematical sciences, on the one hand, and computer science, on the other. Here we review some of the most relevant aspects of this cooperation, focusing on contributions given by the design, analysis and engineering of fast algorithms for the life sciences.
New results for finding common neighborhoods in massive graphs in the data stream model
We consider the problem of finding pairs of vertices that share large common neighborhoods in massive graphs. We give lower bounds for randomized, two-sided error algorithms that solve this problem in the data-stream model of computation. Our results correct and improve those of Buchsbaum, Giancarlo, and Westbrook [On finding common neighborhoods in massive graphs, Theoretical Computer Science, 299 (1–3) 707–718 (2004)].
Foreword: Special issue in honor of the 60th Birthday of Professor Alberto Apostolico
The intrinsic combinatorial organization and information theoretic content of a sequence are correlated to the DNA encoded nucleosome organization of eukaryotic genomes
Motivation: Thanks to research spanning nearly 30 years, two major models have emerged that account for nucleosome organization in chromatin: statistical and sequence specific. The first is based on elegant, easy to compute, closed-form mathematical formulas that make no assumptions about the physical and chemical properties of the underlying DNA sequence. Moreover, they need no training on the data for their computation. The latter is based on some sequence regularities but, as opposed to the statistical model, it lacks the same type of closed-form formulas that, in this case, should be based on the DNA sequence only. Results: We contribute to close this important methodological gap …
Algoritmica Per Le Scienze Della Vita
A chapter illustrating how algorithmics helps to uncover the secrets of the Mona Lisa of modern science.
A Tutorial on Computational Cluster Analysis with Applications to Pattern Discovery in Microarray Data
Background: Inferring cluster structure in microarray datasets is a fundamental task for the so-called -omic sciences. It is also a fundamental question in Statistics, Data Analysis and Classification, in particular with regard to the prediction of the number of clusters in a dataset, usually established via internal validation measures. Despite the wealth of internal measures available in the literature, new ones have been recently proposed, some of them specifically for microarray data. Results: We consider five such measures: Clest, Consensus (Consensus Clustering), FOM (Figure of Merit), Gap (Gap Statistics) and ME (Model Explorer), in addition to the classic WCSS (Within Cluster Sum-of-S…
The Three Steps of Clustering in the Post-Genomic Era: A Synopsis
Clustering is one of the most well known activities in scientific investigation and the object of research in many disciplines, ranging from Statistics to Computer Science. Following Handl et al., it can be summarized as a three-step process: (a) choice of a distance function; (b) choice of a clustering algorithm; (c) choice of a validation method. Although such a purist approach to clustering is hardly seen in many areas of science, genomic data require that level of attention, if inferences made from cluster analysis have to be of some relevance to biomedical research. Unfortunately, the high dimensionality of the data and their noisy nature make cluster analysis of genomic data particul…
Additional file 1 of FASTA/Q data compressors for MapReduce-Hadoop genomics: space and time savings made easy
Additional file 1. Supplementary Material.
Improving table compression with combinatorial optimization
We study the problem of compressing massive tables within the partition-training paradigm introduced by Buchsbaum et al. [SODA'00], in which a table is partitioned by an off-line training procedure into disjoint intervals of columns, each of which is compressed separately by a standard, on-line compressor like gzip. We provide a new theory that unifies previous experimental observations on partitioning and heuristic observations on column permutation, all of which are used to improve compression rates. Based on the theory, we devise the first on-line training algorithms for table compression, which can be applied to individual files, not just continuously operating sources; and also a new, …
Burrows Wheeler Transform on a Large Scale: Algorithms Implemented in Apache Spark
With the rapid growth of Next Generation Sequencing (NGS) technologies, large amounts of "omics" data are collected daily and need to be processed. Indexing and compressing large sequence datasets are some of the most important tasks in this context. Here we propose algorithms for the computation of the Burrows-Wheeler Transform relying on Big Data technologies, i.e., Apache Spark and Hadoop. Our algorithms are the first ones that distribute the index computation and not only the input dataset, making it possible to benefit fully from the available cloud resources.
Grid-K: A Cometa Virtual Organization Service for Compression-Based Classification of Biological Sequences and Structures
In vitro versus in vivo compositional landscapes of histone sequence preferences in eucaryotic genomes
Motivation: Although the nucleosome occupancy along a genome can be in part predicted by in vitro experiments, it has been recently observed that the chromatin organization presents important differences in vitro with respect to in vivo. Such differences mainly regard the hierarchical and regular structures of the nucleosome fiber, whose existence has long been assumed, and in part also observed in vitro, but that does not apparently occur in vivo. It is also well known that the DNA sequence has a role in determining the nucleosome occupancy. Therefore, an important issue is to understand if, and to what extent, the structural differences in the chromatin organization between in vit…
Learning from Data to Speed-up Sorted Table Search Procedures: Methodology and Practical Guidelines
Sorted Table Search Procedures are the quintessential query-answering tool, with widespread usage that now includes also Web Applications, e.g., Search Engines (Google Chrome) and ad Bidding Systems (AppNexus). Speeding them up, at very little cost in space, is still a quite significant achievement. Here we study to what extent Machine Learning Techniques can contribute to obtain such a speed-up via a systematic experimental comparison of known efficient implementations of Sorted Table Search procedures, with different Data Layouts, and their Learned counterparts developed here. We characterize the scenarios in which the latter can be profitably used with respect to the former, accounting …
The Myriad Virtues of Wavelet Trees
A new data structure, the wavelet tree, is analyzed and discussed, with particular attention to data compression.
On the construction of classes of suffix trees for square matrices: Algorithms and applications
Given an n × n TEXT matrix with entries defined over an ordered alphabet Σ, we introduce $4^{n-1}$ classes of index data structures for TEXT. Those indices are informally the two-dimensional analog of the suffix tree of a string [15], allowing on-line searches and statistics to be performed on TEXT. We provide one simple algorithm that efficiently builds any chosen index in those classes in $O(n^2 \log n)$ worst case time using $O(n^2)$ space. The algorithm can be modified to require optimal $O(n^2)$ expected time for bounded Σ.
Mapreduce in computational biology - A synopsis
In the past 20 years, the Life Sciences have witnessed a paradigm shift in the way research is performed. Indeed, the computational part of biological and clinical studies has become central or is becoming so. Correspondingly, the amount of data that one needs to process, compare and analyze, has experienced an exponential growth. As a consequence, High Performance Computing (HPC, for short) is being used intensively, in particular in terms of multi-core architectures. However, recently and thanks to the advances in the processing of other scientific and commercial data, Distributed Computing is also being considered for Bioinformatics applications. In particular, the MapReduce paradigm, to…
Informational and linguistic analysis of large genomic sequence collections via efficient Hadoop cluster algorithms
Motivation: Information theoretic and compositional/linguistic analysis of genomes have a central role in bioinformatics, even more so since the associated methodologies are becoming very valuable also for epigenomic and meta-genomic studies. The kernel of those methods is based on the collection of k-mer statistics, i.e., how many times each k-mer in $\{A,C,G,T\}^k$ occurs in a DNA sequence. Although this problem is computationally very simple and efficiently solvable on a conventional computer, the sheer amount of data now available in applications demands resorting to parallel and distributed computing. Indeed, those types of algorithms have been developed to collect k-mer statistics in…
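The k-mer statistics mentioned above, i.e., the number of occurrences of each k-mer over {A,C,G,T} in a DNA sequence, can be collected on a single machine with a few lines of code. The sketch below is sequential and is not the Hadoop algorithm of the paper; the function name `kmer_counts` and the choice to skip windows containing other symbols (such as N) are assumptions made for illustration.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count how many times each k-mer over {A,C,G,T} occurs in seq.

    Windows containing any other symbol (e.g. 'N') are skipped; this is an
    assumption of the sketch, not a rule taken from the paper.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    alphabet = set("ACGT")
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= alphabet:
            counts[kmer] += 1
    return counts
```

In a distributed setting, one would typically compute such counters on chunks of the input and merge them by summation, which is the shape of computation that MapReduce-style frameworks support.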
2D-Pattern Indexing
Data Structures for two-dimensional pattern matching are presented and discussed.
Genome-wide characterization of chromatin binding and nucleosome spacing activity of the nucleosome remodelling ATPase ISWI
The evolutionarily conserved ATP-dependent nucleosome remodelling factor ISWI can space nucleosomes affecting a variety of nuclear processes. In Drosophila, loss of ISWI leads to global transcriptional defects and to dramatic alterations in higher-order chromatin structure, especially on the male X chromosome. In order to understand if chromatin condensation and gene expression defects, observed in ISWI mutants, are directly correlated with ISWI nucleosome spacing activity, we conducted a genome-wide survey of ISWI binding and nucleosome positioning in wild-type and ISWI mutant chromatin. Our analysis revealed that ISWI binds both genic and intergenic regions. Remarkably, we found that ISWI…
Longest Motifs with a Functionally Equivalent Central Block
This paper presents a generalization of the notion of longest repeats with a block of k don't care symbols introduced by [Crochemore et al., LATIN 2004] (for k fixed) to longest motifs composed of three parts: a first and last that parameterize match (that is, match via some symbol renaming, initially unknown), and a functionally equivalent central block. Such three-part motifs are called longest block motifs. Different types of functional equivalence, and thus of matching criteria for the central block are considered, which include as a subcase the one treated in [Crochemore et al., LATIN 2004] and extend to the case of regular expressions with no Kleene closure or …
An Approximate Determinization Algorithm for Weighted Finite-State Automata
Nondeterministic weighted finite-state automata are a key abstraction in automatic speech recognition systems. The efficiency of automatic speech recognition depends directly on the sizes of these automata and the degree of nondeterminism present, so recent research has studied ways to determinize and minimize them, using analogues of classical automata determinization and minimization. Although, as we describe here, determinization can in the worst case cause poly-exponential blowup in the number of states of a weighted finite-state automaton, in practice it is remarkably successful. In extensive experiments in automatic speech recognition systems, deterministic weighted finite-state autom…
The Alternating BWT: an algorithmic perspective
The Burrows-Wheeler Transform (BWT) is a word transformation introduced in 1994 for Data Compression. It has become a fundamental tool for designing self-indexing data structures, with important applications in several areas in science and engineering. The Alternating Burrows-Wheeler Transform (ABWT) is another transformation recently introduced in Gessel et al. (2012) [21] and studied in the field of Combinatorics on Words. It is analogous to the BWT, except that it uses an alternating lexicographical order instead of the usual one. Building on results in Giancarlo et al. (2018) [23], where we have shown that BWT and ABWT are part of a larger class of reversible transformations, …
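To make the difference between the two transformations tangible, here is a naive sketch that sorts all rotations of a word and outputs their last symbols, once under the usual lexicographic order and once under an alternating order. It assumes the convention that, at the first position where two rotations differ, odd positions (counting from 1) use the usual order and even positions use the reversed one. The function names are hypothetical, and the quadratic-space rotation sort is for illustration only; it is not one of the algorithms studied in the paper.

```python
from functools import cmp_to_key


def _rotations(s):
    return [s[i:] + s[:i] for i in range(len(s))]


def bwt(s):
    """BWT via rotation sorting: last symbols of the rotations in lexicographic order."""
    return "".join(r[-1] for r in sorted(_rotations(s)))


def _alt_compare(x, y):
    """Alternating lexicographic comparison of two words of equal length.

    Assumed convention: at the first mismatch, 1-based odd positions compare
    with the usual order, even positions with the reversed order.
    """
    for i, (a, b) in enumerate(zip(x, y)):
        if a != b:
            usual = -1 if a < b else 1
            return usual if i % 2 == 0 else -usual  # i is 0-based
    return 0


def abwt(s):
    """ABWT sketch: last symbols of the rotations sorted in alternating order."""
    ordered = sorted(_rotations(s), key=cmp_to_key(_alt_compare))
    return "".join(r[-1] for r in ordered)
```

For example, `bwt("banana")` yields "nnbaaa". Both functions return a permutation of the input symbols, as expected of transformations in this family.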
GenClust: A genetic algorithm for clustering gene expression data
Background: Clustering is a key step in the analysis of gene expression data, and in fact, many classical clustering algorithms are used, or more innovative ones have been designed and validated for the task. Despite the widespread use of artificial intelligence techniques in bioinformatics and, more generally, data analysis, there are very few clustering algorithms based on the genetic paradigm, yet that paradigm has great potential in finding good heuristic solutions to a difficult optimization problem such as clustering. Results: GenClust is a new genetic algorithm for clustering gene expression data. It has two key features: (a) a novel coding of the search space that is simple, …
Sparse Dynamic Programming for Longest Common Subsequence from Fragments
Sparse Dynamic Programming has emerged as an essential tool for the design of efficient algorithms for optimization problems coming from such diverse areas as computer science, computational biology, and speech recognition. We provide a new sparse dynamic programming technique that extends the Hunt-Szymanski paradigm for the computation of the longest common subsequence (LCS) and apply it to solve the LCS from Fragments problem: given a pair of strings X and Y (of length n and m, respectively) and a set M of matching substrings of X and Y, find the longest common subsequence based only on the symbol correspondences induced by the substrings. This problem arises in an application to analysis…
The Myriad Virtues of Suffix Trees
Wavelet Trees have been introduced in [Grossi, Gupta and Vitter, SODA ’03] and have been rapidly recognized as a very flexible tool for the design of compressed full-text indexes and data compressors. Although several papers have investigated the beauty and usefulness of this data structure in the full-text indexing scenario, its impact on data compression has not been fully explored. In this paper we provide a complete theoretical analysis of a wide class of compression algorithms based on Wavelet Trees. We also show how to improve their asymptotic performance by introducing a novel framework, called Generalized Wavelet Trees, that aims for the best combination of binary compressors (lik…
On-line Construction of Two-Dimensional Suffix Trees
We say that a data structure is built on-line if, at any instant, we have the data structure corresponding to the input we have seen up to that instant. For instance, consider the suffix tree of a string x[1,n]. An algorithm building it on-line is such that, when we have read the first i symbols of x[1,n], we have the suffix tree for x[1,i]. We present a new technique, which we refer to as implicit updates, based on which we obtain: (a) an algorithm for the on-line construction of the Lsuffix tree of an n × n matrix A (this data structure is the two-dimensional analog of the suffix tree of a string); (b) simple algorithms implementing primitive operations for LZ1-type on-line lossless image compression m…
Pattern Discovery in the Post-Genome
An $O(n^2 \log n)$ Time on-line construction of two-dimensional suffix trees
Analyzing big datasets of genomic sequences: fast and scalable collection of k-mer statistics
Background: Distributed approaches based on the MapReduce programming paradigm have started to be proposed in the Bioinformatics domain, due to the large amount of data produced by the next-generation sequencing techniques. However, the use of MapReduce and related Big Data technologies and frameworks (e.g., Apache Hadoop and Spark) does not necessarily produce satisfactory results, in terms of both efficiency and effectiveness. We discuss how the development of distributed and Big Data management technologies has affected the analysis of large datasets of biological sequences. Moreover, we show how the choice of different parameter configurations and the careful engineering of the …
A Learned Sorted Table Search Library
This library includes a collection of methods for performing element search in ordered tables, ranging from textbook implementations to more complex algorithms.
A Benchmarking Platform for Atomic Learned Indexes
This repository provides a benchmarking platform to evaluate how Feed Forward Neural Networks can be effectively used as index data structures.