Search results for "BWT"

showing 7 items of 17 documents

Lightweight algorithms for constructing and inverting the BWT of string collections

2013

Recent progress in the field of DNA sequencing motivates us to consider the problem of computing the Burrows–Wheeler transform (BWT) of a collection of strings. A human genome sequencing experiment might yield a billion or more sequences, each 100 characters in length. Such a dataset can now be generated in just a few days on a single sequencing machine. Many algorithms and data structures for compression and indexing of text have the BWT at their heart, and it would be of great interest to explore their applications to sequence collections such as these. However, computing the BWT for 100 billion characters or more of data remains a computational challenge. In this work we ad…

Sequence; Theoretical computer science; Settore INF/01 - Informatica; General Computer Science; Computer science; String (computer science); Search engine indexing; Process (computing); Data_CODINGANDINFORMATIONTHEORY; Data structure; Field (computer science); BWT; Constant (computer programming); Text indexes; Next-generation sequencing; Alphabet; Algorithm; Auxiliary memory
researchProduct

Suffix array and Lyndon factorization of a text

2014

The main goal of this paper is to highlight the relationship between the suffix array of a text and its Lyndon factorization. It is proved in [15] that one can obtain the Lyndon factorization of a text from its suffix array. Conversely, here we show a new method for constructing the suffix array of a text that takes advantage of its Lyndon factorization. The surprising consequence of our results is that, in order to construct the suffix array, the local suffixes inside each Lyndon factor can be processed separately, allowing different implementation scenarios, such as online, external and internal memory, or parallel implementations. Based on our results, the algorithm that we prop…

Sorting suffixes; BWT; Suffix array; Lyndon word; Lyndon factorization; Compressed suffix array; Settore INF/01 - Informatica; Generalized suffix tree; Order (ring theory); Construct (python library); Theoretical Computer Science; law.invention; Computational Theory and Mathematics; Factorization; law; Factor (programming language); Internal memory; Discrete Mathematics and Combinatorics; Arithmetic; computer; Mathematics; computer.programming_language; Journal of Discrete Algorithms
researchProduct
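
The first direction mentioned in the abstract above (Lyndon factorization from the suffix array) can be illustrated with a small sketch. It assumes the standard characterization that a Lyndon factor starts at every position whose suffix is lexicographically smaller than all suffixes starting before it, and it cross-checks the result against Duval's algorithm. The function names are hypothetical, the suffix array is built by naive sorting, and this is not the construction method the paper proposes.

```python
def duval(s):
    """Lyndon factorization of s via Duval's algorithm (linear time)."""
    n, i, factors = len(s), 0, []
    while i < n:
        j, k = i + 1, i
        # Extend the current run of repeated Lyndon words as far as possible.
        while j < n and s[k] <= s[j]:
            k = i if s[k] < s[j] else k + 1
            j += 1
        # Emit each copy of the Lyndon word of length j - k.
        while i <= k:
            factors.append(s[i:i + j - k])
            i += j - k
    return factors


def lyndon_from_suffix_array(s):
    """Lyndon factorization read off the ranks of the suffixes of s."""
    n = len(s)
    # Naive suffix array (assumption: fine for short illustrative inputs only).
    sa = sorted(range(n), key=lambda i: s[i:])
    rank = [0] * n
    for r, i in enumerate(sa):
        rank[i] = r
    # A factor starts wherever the suffix rank reaches a new minimum.
    starts, best = [], n
    for i in range(n):
        if rank[i] < best:
            best = rank[i]
            starts.append(i)
    bounds = starts + [n]
    return [s[a:b] for a, b in zip(bounds, bounds[1:])]


print(duval("banana"))                     # ['b', 'an', 'an', 'a']
print(lyndon_from_suffix_array("banana"))  # ['b', 'an', 'an', 'a']
```

Both functions return the same factors on this input; the second one only reads the suffix ranks, which is the sense in which the factorization is "contained" in the suffix array.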

Adaptive reference-free compression of sequence quality scores

2014

Motivation: Rapid technological progress in DNA sequencing has stimulated interest in compressing the vast datasets that are now routinely produced. Relatively little attention has been paid to compressing the quality scores that are assigned to each sequence, even though these scores may be harder to compress than the sequences themselves. By aggregating a set of reads into a compressed index, we find that the majority of bases can be predicted from the sequence of bases that are adjacent to them and hence are likely to be less informative for variant calling or other applications. The quality scores for such bases are aggressively compressed, leaving a relatively small number at full reso…

Statistics and Probability; FOS: Computer and information sciences; Computer science; media_common.quotation_subject; Reference-free; computer.software_genre; Biochemistry; DNA sequencing; Set (abstract data type); Redundancy (information theory); BWT; Computer Science - Data Structures and Algorithms; Code (cryptography); Animals; Humans; Quality (business); Data Structures and Algorithms (cs.DS); Quantitative Biology - Genomics; Caenorhabditis elegans; Molecular Biology; media_common; Genomics (q-bio.GN); Sequence; Genome; Settore INF/01 - Informatica; reference-free compression; High-Throughput Nucleotide Sequencing; Genomics; Sequence Analysis DNA; Data Compression; compression; Computer Science Applications; Computational Mathematics; Computational Theory and Mathematics; FOS: Biological sciences; Data mining; quality score; Metagenomics; computer; Algorithms; Reference genome
researchProduct

The Burrows-Wheeler Transform between Data Compression and Combinatorics on Words

2013

The Burrows-Wheeler Transform (BWT) is a tool of fundamental importance in Data Compression and, recently, has found many applications well beyond its original purpose. The main goal of this paper is to highlight the mathematical and combinatorial properties on which the outstanding versatility of the BWT is based, i.e. its reversibility and the clustering effect on the output. Such properties have aroused curiosity and fervent interest in the scientific world both for theoretical aspects and for practical effects. In particular, in this paper we are interested both in surveying the theoretical research issues which, by taking their cue from Data Compression, have been developed in the conte…

Theoretical computer science; Settore INF/01 - Informatica; Burrows–Wheeler transform; media_common.quotation_subject; Theoretical research; Context (language use); Data_CODINGANDINFORMATIONTHEORY; Burrows Wheeler transform; Clustering effect; Combinatorial properties; Combinatorics on words; BWT balancing optimal partitioning text-compression; Curiosity; Arithmetic; Cluster analysis; Focus (optics); media_common; Data compression; Mathematics
researchProduct
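
The two properties named in the abstract above, reversibility and the clustering effect, can be seen on a toy input. The following is a minimal sketch, assuming a sentinel character "$" that does not occur in the text and sorts before every other character; the function names are hypothetical and the rotation sort is the naive textbook construction, not an efficient one.

```python
def bwt(text, sentinel="$"):
    """Naive BWT: sort all rotations of text + sentinel, keep the last column."""
    if sentinel in text:
        raise ValueError("sentinel must not occur in the text")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def inverse_bwt(last, sentinel="$"):
    """Invert the BWT using only the last column.

    A stable sort of the positions of `last` by character pairs the k-th
    occurrence of each character in the first column with its k-th
    occurrence in the last column.
    """
    n = len(last)
    order = sorted(range(n), key=lambda i: last[i])
    row = last.index(sentinel)  # row holding the unrotated text + sentinel
    out = []
    for _ in range(n - 1):
        row = order[row]
        out.append(last[row])
    return "".join(out)


transformed = bwt("banana")
print(transformed)               # annb$aa
print(inverse_bwt(transformed))  # banana
```

In the output `annb$aa` the equal letters sit in runs (`nn`, `aa`), a small instance of the clustering effect, and the original text is recovered from the transformed string alone.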

Lightweight LCP construction for next-generation sequencing datasets

2012

The advent of "next-generation" DNA sequencing (NGS) technologies has meant that collections of hundreds of millions of DNA sequences are now commonplace in bioinformatics. Knowing the longest common prefix array (LCP) of such a collection would facilitate the rapid computation of maximal exact matches, shortest unique substrings and shortest absent words. CPU-efficient algorithms for computing the LCP of a string have been described in the literature, but require the presence in RAM of large data structures. This prevents such methods from being feasible for NGS datasets. In this paper we propose the first lightweight method that simultaneously computes, via sequential scans, the LCP and B…

Whole genome sequencing; Genomics (q-bio.GN); FOS: Computer and information sciences; Sequence; BWT; LCP; next-generation sequencing datasets; BWT LCP text indexes next-generation sequencing datasets massive datasets; Settore INF/01 - Informatica; Computer science; Computation; String (computer science); LCP array; Parallel computing; Data structure; DNA sequencing; Substring; FOS: Biological sciences; Computer Science - Data Structures and Algorithms; Quantitative Biology - Genomics; Data Structures and Algorithms (cs.DS)
researchProduct
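
For readers unfamiliar with the LCP array mentioned in the abstract above, here is a minimal in-memory sketch for a single string. It uses a naively sorted suffix array and Kasai's algorithm, i.e. exactly the kind of RAM-resident approach the abstract says does not scale to NGS datasets; it is not the lightweight sequential-scan method the paper proposes. The function names are hypothetical.

```python
def suffix_array(s):
    """Naive suffix array: sort suffix start positions by the suffix text."""
    return sorted(range(len(s)), key=lambda i: s[i:])


def lcp_array(s, sa):
    """Kasai's algorithm.

    lcp[r] is the length of the longest common prefix of the suffixes at
    sa[r - 1] and sa[r]; lcp[0] is 0 by convention.
    """
    n = len(s)
    rank = [0] * n
    for r, i in enumerate(sa):
        rank[i] = r
    lcp = [0] * n
    h = 0
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]  # suffix preceding suffix i in sorted order
            while i + h < n and j + h < n and s[i + h] == s[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h > 0:
                h -= 1  # the next suffix shares at least h - 1 characters
        else:
            h = 0
    return lcp


sa = suffix_array("banana")
print(sa)                       # [5, 3, 1, 0, 4, 2]
print(lcp_array("banana", sa))  # [0, 1, 3, 0, 0, 2]
```

Large LCP values mark long repeats between lexicographically adjacent suffixes, which is why the array helps with tasks such as finding maximal exact matches.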

Lossless and nearly-lossless image compression based on combinatorial transforms

2011

Common image compression standards are usually based on frequency transforms such as the Discrete Cosine Transform or wavelets. We present a different approach for lossless image compression, based on a combinatorial transform. The main transform is the Burrows-Wheeler Transform (BWT), which tends to reorder symbols according to their following context. It becomes a promising compression approach based on context modelling. The BWT was initially applied in text compression software such as BZIP2; nevertheless, it has recently been applied to the image compression field. Compression schemes based on the Burrows-Wheeler Transform are usually lossless; therefore we implement this algorithm in medical imagi…

[INFO.INFO-OH] Computer Science [cs]/Other [cs.OH]; Compression sans perte et quasi sans; Transformé de Burrows-Wheeler; Burrows-Wheeler Transform (BWT); Lossless (nearly lossless) image
researchProduct

SNPs detection by eBWT positional clustering

2019

Sequencing technologies keep getting cheaper and faster, increasing the need for data structures designed to efficiently store raw data and possibly perform analysis therein. In this view, there is a growing interest in alignment-free and reference-free variant calling methods that only make use of (suitably indexed) raw read data. We develop the positional clustering theory that (i) describes how the extended Burrows–Wheeler Transform (eBWT) of a collection of reads tends to cluster together bases that cover the same genome position, (ii) predicts the size of such clusters, and (iii) exhibits an elegant and precise LCP array based procedure to locate such clusters in the e…

lcsh:QH426-470; Computer science; LCP array; Reference-free; [SDV]Life Sciences [q-bio]; 0206 medical engineering; Sequencing data; SNP; Assembly-free; 02 engineering and technology; BWT LCP array SNPs Reference-free Assembly-free; computer.software_genre; Software; BWT; Structural Biology; Cluster (physics); Cluster analysis; lcsh:QH301-705.5; Molecular Biology; ComputingMilieux_MISCELLANEOUS; Settore INF/01 - Informatica; business.industry; Research; Applied Mathematics; Data structure; Pipeline (software); lcsh:Genetics; Computational Theory and Mathematics; lcsh:Biology (General); Data mining; SNPs; [INFO.INFO-BI]Computer Science [cs]/Bioinformatics [q-bio.QM]; business; Raw data; computer; 020602 bioinformatics
researchProduct