Search results for "BWT"
showing 10 items of 17 documents
Detecting mutations by eBWT
2018
In this paper we develop a theory describing how the extended Burrows-Wheeler Transform (eBWT) of a collection of DNA fragments tends to cluster together the copies of nucleotides sequenced from a genome G. Our theory accurately predicts how many copies of any nucleotide are expected inside each such cluster, and how an elegant and precise LCP array based procedure can locate these clusters in the eBWT. Our findings are very general and can be applied to a wide range of different problems. In this paper, we consider the case of alignment-free and reference-free SNPs discovery in multiple collections of reads. We note that, in accordance with our theoretical results, SNPs are clustered in th…
Measuring the clustering effect of BWT via RLE
2017
Abstract The Burrows–Wheeler Transform (BWT) is a reversible transformation on which are based several text compressors and many other tools used in Bioinformatics and Computational Biology. The BWT is not actually a compressor, but a transformation that performs a context-dependent permutation of the letters of the input text that often create runs of equal letters (clusters) longer than the ones in the original text, usually referred to as the “clustering effect” of BWT. In particular, from a combinatorial point of view, great attention has been given to the case in which the BWT produces the fewest number of clusters (cf. [5] , [16] , [21] , [23] ). In this paper we are concerned about t…
Computing the Original eBWT Faster, Simpler, and with Less Memory
2021
Mantaci et al. [TCS 2007] defined the \(\mathrm {eBWT}\) to extend the definition of the \(\mathrm {BWT}\) to a collection of strings. However, since this introduction, it has been used more generally to describe any \(\mathrm {BWT}\) of a collection of strings, and the fundamental property of the original definition (i.e., the independence from the input order) is frequently disregarded. In this paper, we propose a simple linear-time algorithm for the construction of the original \(\mathrm {eBWT}\), which does not require the preprocessing of Bannai et al. [CPM 2021]. As a byproduct, we obtain the first linear-time algorithm for computing the \(\mathrm {BWT}\) of a single string that uses …
Burrows-Wheeler Transform on Purely Morphic Words
2022
The study of the compressibility of repetitive sequences is an issue that is attracting great interest. We consider purely morphic words, which are highly repetitive sequences generated by iterating a morphism φ that admits a fixed point (denoted by φ^∞(a) ) starting from a given character a belonging to the finite alphabet A , i.e. φ^∞(a)=lim_{i→∞}φ^i(a) . Such morphisms are called prolongable on a . Here we focus on the compressibility via the Burrows-Wheeler Transform (BWT) of infinite families of finite sequences generated by morphisms. In particular, denoted by r(w) the number of equal-letter runs of a word w , we provide new upper bounds on r(bwt(φ^i(a))) , i.e. the number of equal-le…
Variable-order reference-free variant discovery with the Burrows-Wheeler Transform
2020
Abstract Background In [Prezza et al., AMB 2019], a new reference-free and alignment-free framework for the detection of SNPs was suggested and tested. The framework, based on the Burrows-Wheeler Transform (BWT), significantly improves sensitivity and precision of previous de Bruijn graphs based tools by overcoming several of their limitations, namely: (i) the need to establish a fixed value, usually small, for the order k, (ii) the loss of important information such as k-mer coverage and adjacency of k-mers within the same read, and (iii) bad performance in repeated regions longer than k bases. The preliminary tool, however, was able to identify only SNPs and it was too slow and memory con…
Balanced Words Having Simple Burrows-Wheeler Transform
2009
The investigation of the "clustering effect" of the Burrows-Wheeler transform (BWT) leads to study the words having simple BWT , i.e. words w over an ordered alphabet $A=\{a_1,a_2,\ldots,a_k\}$, with $a_1 < a_2 < \ldots <a_k$, such that $bwt(w)$ is of the form $a_k^{n_k} a_{k-1}^{n_{k-1}} \cdots a_1^{n_1}$, for some non-negative integers $n_1, n_2, \ldots, n_k$. We remark that, in the case of binary alphabets, there is an equivalence between words having simple BWT, the family of (circular) balanced words and the conjugates of standard words. In the case of alphabets of size greater than two, there is no more equivalence between these notions. As a main result of this paper we prove that, u…
Sorting suffixes of a text via its Lyndon Factorization
2013
The process of sorting the suffixes of a text plays a fundamental role in Text Algorithms. They are used for instance in the constructions of the Burrows-Wheeler transform and the suffix array, widely used in several fields of Computer Science. For this reason, several recent researches have been devoted to finding new strategies to obtain effective methods for such a sorting. In this paper we introduce a new methodology in which an important role is played by the Lyndon factorization, so that the local suffixes inside factors detected by this factorization keep their mutual order when extended to the suffixes of the whole word. This property suggests a versatile technique that easily can b…
Suffixes, Conjugates and Lyndon Words
2013
In this paper we are interested in the study of the combinatorial aspects connecting three important constructions in the field of string algorithms: the suffix array, the Burrows-Wheeler transform (BWT) and the extended Burrows-Wheeler transform (EBWT). Such constructions involve the notions of suffixes and conjugates of words and are based on two different order relations, denoted by $\plex$ and $\pom$, that, even if strictly connected, are quite different from the computational point of view. In this study an important role is played by Lyndon words. In particular, we improve the upper bound on the number of symbol comparisons needed to establish the $\pom$ order between two primitive wo…
r-Indexing the eBWT
2021
The extended Burrows Wheeler Transform (\(\mathrm {eBWT}\)) was introduced by Mantaci et al. [TCS 2007] to extend the definition of the \(\mathrm {BWT}\) to a collection of strings. In our prior work [SPIRE 2021], we give a linear-time algorithm for the \(\mathrm {eBWT}\) that preserves the fundamental property of the original definition (i.e., the independence from the input order). The algorithm combines a modification of the Suffix Array Induced Sorting (SAIS) algorithm [IEEE Trans Comput 2011] with Prefix Free Parsing [AMB 2019; JCB 2020]. In this paper, we show how this construction algorithm leads to r-indexing the \(\mathrm {eBWT}\), i.e., run-length encoded \(\mathrm {eBWT}\) and \(…
Lightweight BWT Construction for Very Large String Collections
2011
A modern DNA sequencing machine can generate a billion or more sequence fragments in a matter of days. The many uses of the BWT in compression and indexing are well known, but the computational demands of creating the BWT of datasets this large have prevented its applications from being widely explored in this context. We address this obstacle by presenting two algorithms capable of computing the BWT of very large string collections. The algorithms are lightweight in that the first needs O(m log m) bits of memory to process m strings and the memory requirements of the second are constant with respect to m. We evaluate our algorithms on collections of up to 1 billion strings and compare thei…