Search results for " data structure."

showing 8 items of 88 documents

A Big Data Approach for Sequences Indexing on the Cloud via Burrows Wheeler Transform

2020

Indexing sequence data is important in the context of Precision Medicine, where large amounts of ``omics'' data have to be daily collected and analyzed in order to categorize patients and identify the most effective therapies. Here we propose an algorithm for the computation of Burrows Wheeler transform relying on Big Data technologies, i.e., Apache Spark and Hadoop. Our approach is the first that distributes the index computation and not only the input dataset, allowing to fully benefit of the available cloud resources.

FOS: Computer and information sciencesArtificial Intelligence (cs.AI)Computer Science - Distributed Parallel and Cluster ComputingComputer Science - Artificial IntelligenceComputer Science - Data Structures and AlgorithmsData_FILESData Structures and Algorithms (cs.DS)Distributed Parallel and Cluster Computing (cs.DC)
researchProduct

Normal, Abby Normal, Prefix Normal

2014

A prefix normal word is a binary word with the property that no substring has more 1s than the prefix of the same length. This class of words is important in the context of binary jumbled pattern matching. In this paper we present results about the number $pnw(n)$ of prefix normal words of length $n$, showing that $pnw(n) =\Omega\left(2^{n - c\sqrt{n\ln n}}\right)$ for some $c$ and $pnw(n) = O \left(\frac{2^n (\ln n)^2}{n}\right)$. We introduce efficient algorithms for testing the prefix normal property and a "mechanical algorithm" for computing prefix normal forms. We also include games which can be played with prefix normal words. In these games Alice wishes to stay normal but Bob wants t…

FOS: Computer and information sciencesDiscrete Mathematics (cs.DM)Formal Languages and Automata Theory (cs.FL)Computer Science - Data Structures and AlgorithmsFOS: MathematicsMathematics - CombinatoricsData Structures and Algorithms (cs.DS)Computer Science - Formal Languages and Automata TheoryCombinatorics (math.CO)Data_CODINGANDINFORMATIONTHEORYComputer Science - Discrete Mathematics
researchProduct

2D-Pattern Indexing

2008

Data Structures for two-dimensional pattern matching are presented and discussed.

Settore INF/01 - InformaticaAlgorothms Data Structures
researchProduct

The Alternating BWT: an algorithmic perspective

2020

Abstract The Burrows-Wheeler Transform (BWT) is a word transformation introduced in 1994 for Data Compression. It has become a fundamental tool for designing self-indexing data structures, with important applications in several areas in science and engineering. The Alternating Burrows-Wheeler Transform (ABWT) is another transformation recently introduced in Gessel et al. (2012) [21] and studied in the field of Combinatorics on Words. It is analogous to the BWT, except that it uses an alternating lexicographical order instead of the usual one. Building on results in Giancarlo et al. (2018) [23] , where we have shown that BWT and ABWT are part of a larger class of reversible transformations, …

Discrete mathematicsFOS: Computer and information sciencesSettore INF/01 - InformaticaGeneral Computer ScienceBasis (linear algebra)Computer scienceAlternating Burrows-Wheeler TransformGalois wordRank-invertibilityField (mathematics)Data structureTheoretical Computer ScienceTransformation (function)Difference cover algorithmComputer Science - Data Structures and AlgorithmsData Structures and Algorithms (cs.DS)Time complexityAlternating Burrows-Wheeler Transform; Difference cover algorithm; Galois word; Rank-invertibilityWord (computer architecture)Data compression
researchProduct

Enhanced query processing for NoSQL crowdsourcing systems

2014

In this paper, we provide a novel approach for effectively and efficiently support query processing tasks in novel NoSQL crowdsourcing systems. The idea of our method is to exploit the social knowledge available from reviews about products of any kind, freely provided by customers through specialized web sites. We thus define a NoSQL database system for large collections of product reviews, where queries can be expressed in terms of natural language sentences whose answers are modeled as lists of products ranked based on the relevance of reviews w.r.t. the natural language sentences. The best ranked products in the result list can be seen as the best hints for the user based on crowd opinio…

Motion picturesData structuresExploitComputer scienceNatural languagesCrowdsourcingNoSQLcomputer.software_genreSemanticsComputational Theory and MathematicRelevance (information retrieval)Data miningComputational Theory and Mathematics; 1707; Software1707Information retrievalbusiness.industrySearch engine indexingSemantics; Natural languages; Motion pictures; Data mining; Indexing; Data structures;Data structureSemanticsComputational Theory and MathematicsIndexingbusinessSettore ING-INF/05 - Sistemi di Elaborazione delle InformazionicomputerNatural languageSoftware
researchProduct

Uncommon Suffix Tries

2011

Common assumptions on the source producing the words inserted in a suffix trie with $n$ leaves lead to a $\log n$ height and saturation level. We provide an example of a suffix trie whose height increases faster than a power of $n$ and another one whose saturation level is negligible with respect to $\log n$. Both are built from VLMC (Variable Length Markov Chain) probabilistic sources; they are easily extended to families of sources having the same properties. The first example corresponds to a ''logarithmic infinite comb'' and enjoys a non uniform polynomial mixing. The second one corresponds to a ''factorial infinite comb'' for which mixing is uniform and exponential.

FOS: Computer and information sciencesCompressed suffix arrayPolynomialLogarithmGeneral MathematicsSuffix treevariable length Markov chain[INFO.INFO-DS]Computer Science [cs]/Data Structures and Algorithms [cs.DS]Generalized suffix treeprobabilistic source0102 computer and information sciences02 engineering and technologysuffix trie01 natural scienceslaw.inventionCombinatoricslawComputer Science - Data Structures and AlgorithmsTrieFOS: Mathematics0202 electrical engineering electronic engineering information engineeringData Structures and Algorithms (cs.DS)Mixing (physics)[ INFO.INFO-DS ] Computer Science [cs]/Data Structures and Algorithms [cs.DS]MathematicsDiscrete mathematicsApplied MathematicsProbability (math.PR)020206 networking & telecommunicationssuffix trie.Computer Graphics and Computer-Aided Design[MATH.MATH-PR]Mathematics [math]/Probability [math.PR]010201 computation theory & mathematicsmixing properties60J05 37E05Suffix[ MATH.MATH-PR ] Mathematics [math]/Probability [math.PR]Mathematics - ProbabilitySoftware
researchProduct

Adaptive reference-free compression of sequence quality scores

2014

Motivation: Rapid technological progress in DNA sequencing has stimulated interest in compressing the vast datasets that are now routinely produced. Relatively little attention has been paid to compressing the quality scores that are assigned to each sequence, even though these scores may be harder to compress than the sequences themselves. By aggregating a set of reads into a compressed index, we find that the majority of bases can be predicted from the sequence of bases that are adjacent to them and hence are likely to be less informative for variant calling or other applications. The quality scores for such bases are aggressively compressed, leaving a relatively small number at full reso…

Statistics and ProbabilityFOS: Computer and information sciencesComputer sciencemedia_common.quotation_subjectReference-freecomputer.software_genreBiochemistryDNA sequencingSet (abstract data type)Redundancy (information theory)BWTComputer Science - Data Structures and AlgorithmsCode (cryptography)AnimalsHumansQuality (business)Data Structures and Algorithms (cs.DS)Quantitative Biology - GenomicsCaenorhabditis elegansMolecular Biologymedia_commonGenomics (q-bio.GN)SequenceGenomeSettore INF/01 - Informaticareference-free compressionHigh-Throughput Nucleotide SequencingGenomicsSequence Analysis DNAData CompressioncompressionComputer Science ApplicationsComputational MathematicsComputational Theory and MathematicsFOS: Biological sciencesData miningquality scoreMetagenomicscomputerBWT; compression; quality score; reference-free compressionAlgorithmsReference genome
researchProduct

Algorithms for Computing Abelian Periods of Words

2012

Constantinescu and Ilie (Bulletin EATCS 89, 167--170, 2006) introduced the notion of an \emph{Abelian period} of a word. A word of length $n$ over an alphabet of size $\sigma$ can have $\Theta(n^{2})$ distinct Abelian periods. The Brute-Force algorithm computes all the Abelian periods of a word in time $O(n^2 \times \sigma)$ using $O(n \times \sigma)$ space. We present an off-line algorithm based on a $\sel$ function having the same worst-case theoretical complexity as the Brute-Force one, but outperforming it in practice. We then present on-line algorithms that also enable to compute all the Abelian periods of all the prefixes of $w$.

FOS: Computer and information sciencesDiscrete Mathematics (cs.DM)Abelian repetitionElementary abelian groupRank of an abelian groupCombinatoricsComputer Science - Data Structures and AlgorithmsFOS: MathematicsDiscrete Mathematics and CombinatoricsMathematics - CombinatoricsData Structures and Algorithms (cs.DS)Abelian groupOnline algorithmMathematicsArithmetic of abelian varietiesDiscrete mathematicsCombinatorics on wordsApplied MathematicsAbelian periodText algorithmWeak repetitionPrefixCombinatorics on wordsDesign of algorithmCombinatorics (math.CO)AlgorithmWord (computer architecture)Computer Science::Formal Languages and Automata TheoryComputer Science - Discrete Mathematics
researchProduct