Search results for "Data structures"
showing 10 items of 258 documents
Algorithmic paradigms for stability-based cluster validity and model selection statistical methods, with applications to microarray data analysis
2012
AbstractThe advent of high throughput technologies, in particular microarrays, for biological research has revived interest in clustering, resulting in a plethora of new clustering algorithms. However, model selection, i.e., the identification of the correct number of clusters in a dataset, has received relatively little attention. Indeed, although central for statistics, its difficulty is also well known. Fortunately, a few novel techniques for model selection, representing a sharp departure from previous ones in statistics, have been proposed and gained prominence for microarray data analysis. Among those, the stability-based methods are the most robust and best performing in terms of pre…
Indexed Two-Dimensional String Matching
2016
A New Class of Searchable and Provably Highly Compressible String Transformations
2019
The Burrows-Wheeler Transform is a string transformation that plays a fundamental role for the design of self-indexing compressed data structures. Over the years, researchers have successfully extended this transformation outside the domains of strings. However, efforts to find non-trivial alternatives of the original, now 25 years old, Burrows-Wheeler string transformation have met limited success. In this paper we bring new lymph to this area by introducing a whole new family of transformations that have all the "myriad virtues" of the BWT: they can be computed and inverted in linear time, they produce provably highly compressible strings, and they support linear time pattern search direc…
Identifying the k Best Targets for an Advertisement Campaign via Online Social Networks
2020
We propose a novel approach for the recommendation of possible customers (users) to advertisers (e.g., brands) based on two main aspects: (i) the comparison between On-line Social Network profiles, and (ii) neighborhood analysis on the On-line Social Network. Profile matching between users and brands is considered based on bag-of-words representation of textual contents coming from the social media, and measures such as the Term Frequency-Inverse Document Frequency are used in order to characterize the importance of words in the comparison. The approach has been implemented relying on Big Data Technologies, allowing this way the efficient analysis of very large Online Social Networks. Resul…
Clique Percolation Method: Memory Efficient Almost Exact Communities
2022
Automatic detection of relevant groups of nodes in large real-world graphs, i.e. community detection, has applications in many fields and has received a lot of attention in the last twenty years. The most popular method designed to find overlapping communities (where a node can belong to several communities) is perhaps the clique percolation method (CPM). This method formalizes the notion of community as a maximal union of $k$-cliques that can be reached from each other through a series of adjacent $k$-cliques, where two cliques are adjacent if and only if they overlap on $k-1$ nodes. Despite much effort CPM has not been scalable to large graphs for medium values of $k$. Recent work has sho…
Special factors and the combinatorics of suffix and factor automata
2011
AbstractThe suffix automaton (resp. factor automaton) of a finite word w is the minimal deterministic automaton recognizing the set of suffixes (resp. factors) of w. We study the relationships between the structure of the suffix and factor automata and classical combinatorial parameters related to the special factors of w. We derive formulae for the number of states of these automata. We also characterize the languages LSA and LFA of words having respectively suffix automaton and factor automaton with the minimal possible number of states.
Adaptive reference-free compression of sequence quality scores
2014
Motivation: Rapid technological progress in DNA sequencing has stimulated interest in compressing the vast datasets that are now routinely produced. Relatively little attention has been paid to compressing the quality scores that are assigned to each sequence, even though these scores may be harder to compress than the sequences themselves. By aggregating a set of reads into a compressed index, we find that the majority of bases can be predicted from the sequence of bases that are adjacent to them and hence are likely to be less informative for variant calling or other applications. The quality scores for such bases are aggressively compressed, leaving a relatively small number at full reso…
Repetitiveness Measures based on String Attractors and Burrows-Wheeler Transform: Properties and Applications
2023
SimpleBIM: From full ifcOWL graphs to simplified building graphs
2016
International audience; Recent research in semantic web technologies for the built environment has resulted in several proposals to further improve information exchange among stakeholders from the domain. Most notable is the production of several OWL ontologies that allow to capture building data in RDF graphs. For example, an ifcOWL ontology allows to capture IFC data in an RDF graph. As the building data is now available in a semantic graph with an explicit formal basis, it can be restructured and simplified so that it more easily matches the different requirements associated with practical use case scenarios. In this paper, we investigate several proposals and technological approaches to…
Dictionary-symbolwise flexible parsing
2012
AbstractLinear-time optimal parsing algorithms are rare in the dictionary-based branch of the data compression theory. A recent result is the Flexible Parsing algorithm of Matias and Sahinalp (1999) that works when the dictionary is prefix closed and the encoding of dictionary pointers has a constant cost. We present the Dictionary-Symbolwise Flexible Parsing algorithm that is optimal for prefix-closed dictionaries and any symbolwise compressor under some natural hypothesis. In the case of LZ78-like algorithms with variable costs and any, linear as usual, symbolwise compressor we show how to implement our parsing algorithm in linear time. In the case of LZ77-like dictionaries and any symbol…