6533b7d8fe1ef96bd126a554

RESEARCH PRODUCT

Computing the Original eBWT Faster, Simpler, and with Less Memory

Massimiliano RossiChristina BoucherMarinella SciortinoDavide CenzatoZsuzsanna Lipták

subject

2019-20 coronavirus outbreakSpeedupString collectionsBig BWTSettore INF/01 - InformaticaSevere acute respiratory syndrome coronavirus 2 (SARS-CoV-2)String (computer science)Suffix arrayOrder (ring theory)omega-orderQuantitative Biology::GenomicsBurrows-Wheeler-TransformBurrows-Wheeler-Transform String collections SAIS Big BWT prefix-free parsing extended BWTlaw.inventionCombinatoricsprefix-free parsingSimple (abstract algebra)lawSAISSAIS algorithmIndependence (probability theory)extended BWTMathematics

description

Mantaci et al. [TCS 2007] defined the \(\mathrm {eBWT}\) to extend the definition of the \(\mathrm {BWT}\) to a collection of strings. However, since this introduction, it has been used more generally to describe any \(\mathrm {BWT}\) of a collection of strings, and the fundamental property of the original definition (i.e., the independence from the input order) is frequently disregarded. In this paper, we propose a simple linear-time algorithm for the construction of the original \(\mathrm {eBWT}\), which does not require the preprocessing of Bannai et al. [CPM 2021]. As a byproduct, we obtain the first linear-time algorithm for computing the \(\mathrm {BWT}\) of a single string that uses neither an end-of-string symbol nor Lyndon rotations. We combine our new \(\mathrm {eBWT}\) construction with a variation of prefix-free parsing to allow for scalable construction of the \(\mathrm {eBWT}\). We evaluate our algorithm (pfpebwt) on sets of human chromosomes 19, Salmonella, and SARS-CoV2 genomes, and demonstrate that it is the fastest method for all collections, with a maximum speedup of 7.6\(\times \) on the second best method. The peak memory is at most 2\(\times \) larger than the second best method. Comparing with methods that are also, as our algorithm, able to report suffix array samples, we obtain a 57.1\(\times \) improvement in peak memory. The source code is publicly available at https://github.com/davidecenzato/PFP-eBWT.

https://doi.org/10.1007/978-3-030-86692-1_11