Search results for "Alignment-free"

showing 10 items of 11 documents

The colored longest common prefix array computed via sequential scans

2018

Due to the increased availability of large datasets of biological sequences, the tools for sequence comparison are now relying on efficient alignment-free approaches to a greater extent. Most of the alignment-free approaches require the computation of statistics of the sequences in the dataset. Such computations become impractical in internal memory when very large collections of long sequences are considered. In this paper, we present a new conceptual data structure, the colored longest common prefix array (cLCP), that makes it possible to tackle several problems efficiently with an alignment-free approach. In fact, we show that such a data structure can be computed via sequential scans in semi-exter…

0301 basic medicine; FOS: Computer and information sciences; Alignment-free methods; Burrows–Wheeler transform; Computer science; Computation; Average common substring; 0206 medical engineering; Matching statistics; Scale (descriptive set theory); 02 engineering and technology; Theoretical Computer Science; 03 medical and health sciences; Computer Science - Data Structures and Algorithms; Data Structures and Algorithms (cs.DS); String (computer science); Computer Science (all); LCP array; Data structure; Substring; 030104 developmental biology; Pairwise comparison; Longest common prefix; Algorithm; 020602 bioinformatics
researchProduct

An effective extension of the applicability of alignment-free biological sequence comparison algorithms with Hadoop

2016

Alignment-free methods are one of the mainstays of biological sequence comparison, i.e., the assessment of how similar two biological sequences are to each other, a fundamental and routine task in computational biology and bioinformatics. They have gained popularity since, even on standard desktop machines, they are faster than methods based on alignments. However, with the advent of Next-Generation Sequencing Technologies, datasets whose size, i.e., number of sequences and their total length, is a challenge to the execution of alignment-free methods on those standard machines are quite common. Here, we propose the first paradigm for the computation of k-mer-based alignment-free methods for…

0301 basic medicine; Theoretical computer science; 030102 biochemistry & molecular biology; Settore INF/01 - Informatica; Computer science; Computation; Extension (predicate logic); Information Systems; Hash table; Distributed computing; Task (project management); 03 medical and health sciences; 030104 developmental biology; Alignment-free sequence comparison and analysis; Hadoop; Hardware and Architecture; Sequence comparison; MapReduce; Software
researchProduct

Textual data compression in computational biology: Algorithmic techniques

2012

In a recent review [R. Giancarlo, D. Scaturro, F. Utro, Textual data compression in computational biology: a synopsis, Bioinformatics 25 (2009) 1575–1586] the first systematic organization and presentation of the impact of textual data compression for the analysis of biological data has been given. Its main focus was on a systematic presentation of the key areas of bioinformatics and computational biology where compression has been used together with a technical presentation of how well-known notions from information theory have been adapted to successfully work on biological data. Rather surprisingly, the use of data compression is pervasive in computational biology. Starting from…

Biological data; Data Compression Theory and Practice; Alignment-free sequence comparison; Entropy; Huffman coding; Hidden Markov Models; Kolmogorov complexity; Lempel–Ziv compressors; Minimum Description Length principle; Pattern discovery in bioinformatics; Reverse engineering of biological networks; Sequence alignment; Settore INF/01 - Informatica; General Computer Science; Computer science; Search engine indexing; Computational biology; Information theory; Information science; Theoretical Computer Science; Technical Presentation; Entropy (information theory); Data compression; Computer Science Review
researchProduct

Variable-order reference-free variant discovery with the Burrows-Wheeler Transform

2020

In [Prezza et al., AMB 2019], a new reference-free and alignment-free framework for the detection of SNPs was suggested and tested. The framework, based on the Burrows-Wheeler Transform (BWT), significantly improves sensitivity and precision of previous de Bruijn graph-based tools by overcoming several of their limitations, namely: (i) the need to establish a fixed value, usually small, for the order k, (ii) the loss of important information such as k-mer coverage and adjacency of k-mers within the same read, and (iii) bad performance in repeated regions longer than k bases. The preliminary tool, however, was able to identify only SNPs and it was too slow and memory con…

Burrows–Wheeler transform; Computer science; [SDV]Life Sciences [q-bio]; Value (computer science); SNP; Assembly-free; 0102 computer and information sciences; lcsh:Computer applications to medicine. Medical informatics; 01 natural sciences; Biochemistry; Polymorphism Single Nucleotide; 03 medical and health sciences; BWT; Chromosome (genetic algorithm); Structural Biology; Humans; Sensitivity (control systems); Molecular Biology; lcsh:QH301-705.5; 030304 developmental biology; De Bruijn sequence; 0303 health sciences; Settore INF/01 - Informatica; Alignment-free; Applied Mathematics; Research; Genomics; Sequence Analysis DNA; INDEL; Data structure; Graph; Computer Science Applications; Variable (computer science); lcsh:Biology (General); 010201 computation theory & mathematics; Adjacency list; lcsh:R858-859.7; Suffix; [INFO.INFO-BI]Computer Science [cs]/Bioinformatics [q-bio.QM]; Algorithm; Algorithms; BMC Bioinformatics
researchProduct

An extension of the Burrows-Wheeler Transform

2007

We describe and highlight a generalization of the Burrows–Wheeler Transform (bwt) to a multiset of words. The extended transformation, denoted by ebwt, is reversible. Moreover, it allows one to define a bijection between the words over a finite alphabet A and the finite multisets of conjugacy classes of primitive words in A∗. Besides its mathematical interest, the extended transform can be useful for applications in the context of string processing. In the last part of this paper we illustrate one such application, providing a similarity measure between sequences based on ebwt.

Discrete mathematics; Multiset; Similarity (geometry); General Computer Science; Burrows–Wheeler transform; Generalization; Context (language use); Similarity measure; Sequence comparison; Theoretical Computer Science; Conjugacy class; Bijection; Alignment-free distance measure; Computer Science::Formal Languages and Automata Theory; Computer Science(all); Mathematics
researchProduct

Applications of alignment-free methods in epigenomics

2013

Epigenetic mechanisms play an important role in the regulation of cell type-specific gene activities, yet how epigenetic patterns are established and maintained remains poorly understood. Recent studies have supported a role of DNA sequences in recruitment of epigenetic regulators. Alignment-free methods have been applied to identify distinct sequence features that are associated with epigenetic patterns and to predict epigenomic profiles. Here, we review recent advances in such applications, including the methods to map DNA sequence to feature space, sequence comparison and prediction models. Computational studies using these methods have provided important insights into the epigenetic reg…

Epigenomics; Support Vector Machine; DNA sequence; Sequence alignment; Computational biology; Biology; DNA sequencing; Epigenesis Genetic; Artificial Intelligence; Sequence comparison; Humans; Nucleosome; Epigenetics; Molecular Biology; Gene; Sequence (medicine); Genetics; Models Genetic; Settore INF/01 - Informatica; Chromosome Mapping; Sequence Analysis DNA; machine learning; Papers; alignment-free method; Information Systems; Briefings in Bioinformatics
researchProduct

2014

The majority of next-generation sequencing short-reads can be properly aligned by leading aligners at high speed. However, the alignment quality can still be further improved, since usually not all reads can be correctly aligned to large genomes, such as the human genome, even for simulated data. Moreover, even slight improvements in this area are important but challenging, and usually require significantly more computational endeavor. In this paper, we present CUSHAW3, an open-source parallelized, sensitive and accurate short-read aligner for both base-space and color-space sequences. In this aligner, we have investigated a hybrid seeding approach to improve alignment quality, which incorp…

Genetics; Multidisciplinary; Source code; Heuristic (computer science); Pipeline (computing); Sequence alignment; Color space; Biology; Ranking (information retrieval); Software; Algorithm; Alignment-free sequence analysis; PLOS ONE
researchProduct

Novel Combinatorial and Information-Theoretic Alignment-Free Distances for Biological Data Mining

2010

Among the plethora of alignment-free methods for comparing biological sequences, there are some that we have perceived as representative of the novel techniques that have been devised in the past few years and as being of a fundamental nature and of broad interest and applicability, ranging from combinatorics to information theory. In this chapter, we review these alignment-free methods, presenting both their mathematical definitions and the experiments in which they are involved.

Settore INF/01 - Informatica; Computer science; Alignment-free distances for biological sequence; Biological data mining; Data mining
researchProduct

Alignment-Free Sequence Comparison over Hadoop for Computational Biology

2015

Sequence comparison, i.e., the assessment of how similar two biological sequences are to each other, is a fundamental and routine task in Computational Biology and Bioinformatics. Classically, alignment methods are the de facto standard for such an assessment. In fact, considerable research efforts for the development of efficient algorithms, both on classic and parallel architectures, have been carried out in the past 50 years. Due to the growing amount of sequence data being produced, a new class of methods has emerged: Alignment-free methods. Research in this area has become very intense in the past few years, stimulated by the advent of Next Generation Sequencing technologies, since those…

Speedup; Theoretical computer science; Settore INF/01 - Informatica; Computer science; Sequence alignment; Context (language use); Computational biology; DNA sequencing; Distributed computing; Task (project management); Alignment-free sequence comparison and analysis; Hadoop; Hardware and Architecture; Mathematics (all); Relevance (information retrieval); MapReduce; Pattern matching; Software
researchProduct

The Power of Word-Frequency Based Alignment-Free Functions: a Comprehensive Large-Scale Experimental Analysis

2021

Alignment-free (AF) distance/similarity functions are a key tool for sequence analysis. Experimental studies on real datasets abound and, to some extent, there are also studies regarding their control of false positive rate (Type I error). However, assessment of their power, i.e. their ability to identify true similarity, has been limited to some members of the D2 family. The corresponding experimental studies have concentrated on short sequences, a scenario no longer adequate for current applications, where sequence lengths may vary considerably. Such a State of the Art is methodologically problematic, since information regarding a key feature such as power is either mi…

Statistics and Probability; Sequence; Similarity (geometry); Settore INF/01 - Informatica; sequence analysis; Computer science; power statistics; Alignment-Free Genomic Analysis Big Data Software Platforms Bioinformatics Algorithms; Scale (descriptive set theory); Function (mathematics); Biochemistry; Computer Science Applications; Set (abstract data type); Computational Mathematics; Range (mathematics); Computational Theory and Mathematics; alignment-free functions; Data mining; Completeness (statistics); Molecular Biology; Type I and type II errors
researchProduct