Search results for "Data Structures"

showing 10 items of 80 documents

Algorithmic paradigms for stability-based cluster validity and model selection statistical methods, with applications to microarray data analysis

2012

Abstract: The advent of high-throughput technologies, in particular microarrays, for biological research has revived interest in clustering, resulting in a plethora of new clustering algorithms. However, model selection, i.e., the identification of the correct number of clusters in a dataset, has received relatively little attention. Indeed, although central for statistics, its difficulty is also well known. Fortunately, a few novel techniques for model selection, representing a sharp departure from previous ones in statistics, have been proposed and gained prominence for microarray data analysis. Among those, the stability-based methods are the most robust and best performing in terms of pre…

Settore INF/01 - Informatica | General Computer Science | business.industry | Computer science | Bioinformatics | Model selection | General statistics | Machine learning | computer.software_genre | Theoretical Computer Science | Computational biology | Analysis of massive datasets | Machine learning | Cluster (physics) | Algorithms and data structures General statistics Analysis of massive datasets Machine learning Computational biology Bioinformatics | Algorithms and data structures | Algorithm design | Artificial intelligence | Cluster analysis | business | Completeness (statistics) | computer | Computer Science(all) | Theoretical Computer Science

researchProduct

Indexed Two-Dimensional String Matching

2016

Settore INF/01 - Informatica | Two-dimensional index data structures | String searching algorithm | 0102 computer and information sciences | 02 engineering and technology | Approximate string matching | 01 natural sciences | Combinatorics | 010201 computation theory & mathematics | Index data structures for matrices or image | Indexing for matrices or image | 0202 electrical engineering electronic engineering information engineering | Two-dimensional indexing for pattern matching | 020201 artificial intelligence & image processing | String metric | Mathematics

researchProduct

A New Class of Searchable and Provably Highly Compressible String Transformations

2019

The Burrows-Wheeler Transform is a string transformation that plays a fundamental role for the design of self-indexing compressed data structures. Over the years, researchers have successfully extended this transformation outside the domains of strings. However, efforts to find non-trivial alternatives of the original, now 25 years old, Burrows-Wheeler string transformation have met limited success. In this paper we bring new lymph to this area by introducing a whole new family of transformations that have all the "myriad virtues" of the BWT: they can be computed and inverted in linear time, they produce provably highly compressible strings, and they support linear time pattern search direc…

Settore ING-INF/05 - Sistemi Di Elaborazione Delle Informazioni | FOS: Computer and information sciences | 050101 languages & linguistics | Burrows-wheeler transformation; Combinatorics on words; Data indexing and compression | 000 Computer science knowledge general works | Settore INF/01 - Informatica | Combinatorics on words | 05 social sciences | 02 engineering and technology | Data_CODINGANDINFORMATIONTHEORY | Computer Science | Burrows-wheeler transformation | Computer Science - Data Structures and Algorithms | 0202 electrical engineering electronic engineering information engineering | 020201 artificial intelligence & image processing | 0501 psychology and cognitive sciences | Data Structures and Algorithms (cs.DS) | Data indexing and compression | Combinatorics on word

researchProduct
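
For readers unfamiliar with the transformation mentioned in the abstract above, the following is a minimal sketch of the classic Burrows-Wheeler Transform and its inverse. It is not the new family of transformations the paper introduces, and it is not linear time: it sorts all rotations naively for clarity. The function names and the `$` sentinel are assumptions made for this illustration.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Naive BWT: sort all rotations of text+sentinel, keep the last column.

    Illustrative only; this takes roughly O(n^2 log n) time, unlike the
    linear-time constructions the abstract refers to.
    """
    if sentinel in text:
        raise ValueError("sentinel must not occur in the text")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last_column: str, sentinel: str = "$") -> str:
    """Naive inverse: repeatedly prepend the last column and re-sort."""
    n = len(last_column)
    table = [""] * n
    for _ in range(n):
        table = sorted(last_column[i] + table[i] for i in range(n))
    # The original string is the rotation that ends with the sentinel.
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]


if __name__ == "__main__":
    transformed = bwt("banana")
    print(transformed)               # annb$aa
    print(inverse_bwt(transformed))  # banana
```

The transformed string tends to group equal characters together (here `nn` and `aa`), which is what makes the output compressible.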

Identifying the k Best Targets for an Advertisement Campaign via Online Social Networks

2020

We propose a novel approach for the recommendation of possible customers (users) to advertisers (e.g., brands) based on two main aspects: (i) the comparison between On-line Social Network profiles, and (ii) neighborhood analysis on the On-line Social Network. Profile matching between users and brands is considered based on bag-of-words representation of textual contents coming from the social media, and measures such as the Term Frequency-Inverse Document Frequency are used in order to characterize the importance of words in the comparison. The approach has been implemented relying on Big Data Technologies, allowing this way the efficient analysis of very large Online Social Networks. Resul…

Social and Information Networks (cs.SI) | FOS: Computer and information sciences | Matching (statistics) | Social network | Settore INF/01 - Informatica | business.industry | Computer science | Big data | Databases (cs.DB) | Advertising | Computer Science - Social and Information Networks | Online Social Networks Social Advertising tf-idf Profile Matching. | Term (time) | Computer Science - Information Retrieval | Set (abstract data type) | Computer Science - Databases | Order (business) | Computer Science - Data Structures and Algorithms | Data Structures and Algorithms (cs.DS) | Social media | business | Representation (mathematics) | Information Retrieval (cs.IR)

researchProduct
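
The profile-matching idea in the abstract above (bag-of-words plus Term Frequency-Inverse Document Frequency) can be illustrated with a small sketch. This is not the authors' implementation, which relies on Big Data technologies and also uses neighborhood analysis; the whitespace tokenizer, the cosine similarity measure, and all names below are assumptions chosen for the example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (a dict) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors (0.0 if either is zero)."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


if __name__ == "__main__":
    # Hypothetical profiles: index 0 is a brand, the others are users.
    profiles = [
        "running shoes marathon",
        "marathon running training",
        "pasta recipes dinner",
    ]
    brand, *users = tfidf_vectors(profiles)
    ranked = sorted(range(len(users)), key=lambda i: cosine(brand, users[i]), reverse=True)
    print(ranked)  # user indices, best match first
```

Picking the k best targets then amounts to taking the first k entries of the ranked list.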

Clique Percolation Method: Memory Efficient Almost Exact Communities

2022

Automatic detection of relevant groups of nodes in large real-world graphs, i.e. community detection, has applications in many fields and has received a lot of attention in the last twenty years. The most popular method designed to find overlapping communities (where a node can belong to several communities) is perhaps the clique percolation method (CPM). This method formalizes the notion of community as a maximal union of $k$-cliques that can be reached from each other through a series of adjacent $k$-cliques, where two cliques are adjacent if and only if they overlap on $k-1$ nodes. Despite much effort CPM has not been scalable to large graphs for medium values of $k$. Recent work has sho…

Social and Information Networks (cs.SI) | FOS: Computer and information sciences | Physics - Physics and Society | [INFO.INFO-SI] Computer Science [cs]/Social and Information Networks [cs.SI] | [PHYS.PHYS.PHYS-SOC-PH]Physics [physics]/Physics [physics]/Physics and Society [physics.soc-ph] | [INFO.INFO-DS]Computer Science [cs]/Data Structures and Algorithms [cs.DS] | FOS: Physical sciences | [INFO.INFO-DS] Computer Science [cs]/Data Structures and Algorithms [cs.DS] | Computer Science - Social and Information Networks | Physics and Society (physics.soc-ph) | [INFO.INFO-SI]Computer Science [cs]/Social and Information Networks [cs.SI] | Computer Science - Information Retrieval | [PHYS.PHYS.PHYS-SOC-PH] Physics [physics]/Physics [physics]/Physics and Society [physics.soc-ph] | [INFO.INFO-IR]Computer Science [cs]/Information Retrieval [cs.IR] | Computer Science - Data Structures and Algorithms | Data Structures and Algorithms (cs.DS) | [INFO.INFO-IR] Computer Science [cs]/Information Retrieval [cs.IR] | Information Retrieval (cs.IR) | MathematicsofComputing_DISCRETEMATHEMATICS

researchProduct
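
The community definition given in the abstract above (a maximal union of $k$-cliques connected through adjacent $k$-cliques, where adjacency means sharing $k-1$ nodes) can be written down directly as a brute-force sketch. It is only meant to make the definition concrete on tiny graphs; it is not the memory-efficient method the paper proposes, and the function name and edge-list input format are assumptions.

```python
from itertools import combinations


def clique_percolation(edges, k):
    """Brute-force CPM: communities as unions of adjacent k-cliques."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    # Enumerate every k-subset of nodes and keep those that are cliques.
    cliques = [
        frozenset(candidate)
        for candidate in combinations(sorted(adj), k)
        if all(b in adj[a] for a, b in combinations(candidate, 2))
    ]

    # Union-find over cliques; two cliques are adjacent if they share k-1 nodes.
    parent = list(range(len(cliques)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) == k - 1:
            parent[find(i)] = find(j)

    # A community is the union of the nodes of all cliques in one component.
    communities = {}
    for i, clique in enumerate(cliques):
        communities.setdefault(find(i), set()).update(clique)
    return sorted(sorted(nodes) for nodes in communities.values())


if __name__ == "__main__":
    edges = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 4), (4, 5), (4, 6), (5, 6)]
    print(clique_percolation(edges, 3))  # [[1, 2, 3, 4], [4, 5, 6]]
```

Node 4 appears in both communities, which shows the overlapping behaviour the abstract mentions. Enumerating all $k$-subsets is exponential in general, which is one reason this approach does not scale.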

Adaptive reference-free compression of sequence quality scores

2014

Motivation: Rapid technological progress in DNA sequencing has stimulated interest in compressing the vast datasets that are now routinely produced. Relatively little attention has been paid to compressing the quality scores that are assigned to each sequence, even though these scores may be harder to compress than the sequences themselves. By aggregating a set of reads into a compressed index, we find that the majority of bases can be predicted from the sequence of bases that are adjacent to them and hence are likely to be less informative for variant calling or other applications. The quality scores for such bases are aggressively compressed, leaving a relatively small number at full reso…

Statistics and Probability | FOS: Computer and information sciences | Computer science | media_common.quotation_subject | Reference-free | computer.software_genre | Biochemistry | DNA sequencing | Set (abstract data type) | Redundancy (information theory) | BWT | Computer Science - Data Structures and Algorithms | Code (cryptography) | Animals | Humans | Quality (business) | Data Structures and Algorithms (cs.DS) | Quantitative Biology - Genomics | Caenorhabditis elegans | Molecular Biology | media_common | Genomics (q-bio.GN) | Sequence | Genome | Settore INF/01 - Informatica | reference-free compression | High-Throughput Nucleotide Sequencing | Genomics | Sequence Analysis DNA | Data Compression | compression | Computer Science Applications | Computational Mathematics | Computational Theory and Mathematics | FOS: Biological sciences | Data mining | quality score | Metagenomics | computer | BWT; compression; quality score; reference-free compression | Algorithms | Reference genome

researchProduct

Repetitiveness Measures based on String Attractors and Burrows-Wheeler Transform: Properties and Applications

2023

String Attractor | Settore INF/01 - Informatica | Measure of repetitivene | Burrows-Wheeler Transform | Compressed Data Structures | Data Compression | Combinatorics on Word | Stringology

researchProduct

$O(n^2 \log n)$ Time On-line Construction of Two-Dimensional Suffix Trees

2007

The two-dimensional suffix tree of an n × n square matrix A is a compacted trie that represents all square submatrices of A [11]. For the off-line case, i.e., A is given in advance to the algorithm, it is known how to build it in optimal time, for any type of alphabet size [11], [18]. Motivated by applications in Image Compression [22], Giancarlo and Guaiana [14] considered the on-line version of the two-dimensional suffix tree and presented an $O(n^2 \log^2 n)$-time algorithm, which we refer to as GG. That algorithm is a nontrivial generalization of Ukkonen’s on-line algorithm for standard suffix trees [23]. The main contribution in this paper is an $O(\log n)$ factor improvement in the time comple…

Two-dimensional suffix tree | On-line algorithm | Index data structures.

researchProduct

Decremental 2- and 3-connectivity on planar graphs

1996

We study the problem of maintaining the 2-edge-, 2-vertex-, and 3-edge-connected components of a dynamic planar graph subject to edge deletions. The 2-edge-connected components can be maintained in a total of O(n log n) time under any sequence of at most O(n) deletions. This gives O(log n) amortized time per deletion. The 2-vertex- and 3-edge-connected components can be maintained in a total of O(n log^2 n) time. This gives O(log^2 n) amortized time per deletion. The space required by all our data structures is O(n). All our time bounds improve previous bounds.

Vertex (graph theory) | Discrete mathematics | Dynamic data structures | Amortized analysis | General Computer Science | Applied Mathematics | Vertex connectivity | Planar graphs | Data structure | Edge connectivity | Computer Science Applications | Planar graph | Combinatorics | symbols.namesake | Analysis of algorithms Dynamic data structures Edge connectivity Planar graphs Vertex connectivity | symbols | Analysis of algorithms | Vertex connectivity | Dynamic data structures | Analysis of algorithms | Mathematics | Algorithmica

researchProduct

Lightweight LCP construction for next-generation sequencing datasets

2012

The advent of "next-generation" DNA sequencing (NGS) technologies has meant that collections of hundreds of millions of DNA sequences are now commonplace in bioinformatics. Knowing the longest common prefix array (LCP) of such a collection would facilitate the rapid computation of maximal exact matches, shortest unique substrings and shortest absent words. CPU-efficient algorithms for computing the LCP of a string have been described in the literature, but require the presence in RAM of large data structures. This prevents such methods from being feasible for NGS datasets. In this paper we propose the first lightweight method that simultaneously computes, via sequential scans, the LCP and B…

Whole genome sequencing | Genomics (q-bio.GN) | FOS: Computer and information sciences | Sequence | BWT; LCP; next-generation sequencing datasets | BWT LCP text indexes next-generation sequencing datasets massive datasets | Settore INF/01 - Informatica | Computer science | Computation | String (computer science) | LCP array | Parallel computing | Data structure | DNA sequencing | Substring | BWT | LCP | FOS: Biological sciences | Computer Science - Data Structures and Algorithms | Quantitative Biology - Genomics | Data Structures and Algorithms (cs.DS) | next-generation sequencing datasets

researchProduct
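
To make the longest common prefix array from the abstract above concrete, here is a minimal in-memory sketch for a single string: a naive suffix array followed by Kasai's linear-time LCP computation. This is exactly the kind of RAM-resident approach the abstract describes as infeasible for NGS collections, not the lightweight sequential-scan method the paper proposes. The function names are assumptions for this illustration.

```python
def suffix_array(s: str) -> list[int]:
    """Naive suffix array: starting positions of suffixes in sorted order."""
    return sorted(range(len(s)), key=lambda i: s[i:])


def lcp_array(s: str, sa: list[int]) -> list[int]:
    """Kasai's algorithm.

    lcp[r] is the length of the longest common prefix of the suffixes
    ranked r-1 and r in the suffix array; lcp[0] is 0 by convention.
    """
    n = len(s)
    rank = [0] * n
    for r, start in enumerate(sa):
        rank[start] = r

    lcp = [0] * n
    h = 0  # current match length, reused between consecutive text positions
    for i in range(n):
        if rank[i] == 0:
            h = 0
            continue
        j = sa[rank[i] - 1]  # suffix that precedes suffix i in sorted order
        while i + h < n and j + h < n and s[i + h] == s[j + h]:
            h += 1
        lcp[rank[i]] = h
        if h > 0:
            h -= 1  # the next suffix shares at least h-1 characters
    return lcp


if __name__ == "__main__":
    text = "banana"
    sa = suffix_array(text)
    print(sa)                   # [5, 3, 1, 0, 4, 2]
    print(lcp_array(text, sa))  # [0, 1, 3, 0, 0, 2]
```

Both `rank` and `lcp` are full-length arrays held in memory alongside the text and the suffix array, which illustrates why such methods become costly for collections of hundreds of millions of sequences.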