6533b824fe1ef96bd1281567

RESEARCH PRODUCT

Lightweight algorithms for constructing and inverting the BWT of string collections

Anthony J. CoxMarkus J. BauerGiovanna Rosone

subject

SequenceTheoretical computer scienceSettore INF/01 - InformaticaGeneral Computer ScienceComputer scienceString (computer science)Search engine indexingProcess (computing)Data_CODINGANDINFORMATIONTHEORYData structureField (computer science)Theoretical Computer ScienceBWTConstant (computer programming)Text indexeBWT; Text indexes; Next-generation sequencingText indexesNext-generation sequencingAlphabetAlgorithmAuxiliary memory

description

Recent progress in the field of \{DNA\} sequencing motivates us to consider the problem of computing the Burrows–Wheeler transform (BWT) of a collection of strings. A human genome sequencing experiment might yield a billion or more sequences, each 100 characters in length. Such a dataset can now be generated in just a few days on a single sequencing machine. Many algorithms and data structures for compression and indexing of text have the \{BWT\} at their heart, and it would be of great interest to explore their applications to sequence collections such as these. However, computing the \{BWT\} for 100 billion characters or more of data remains a computational challenge. In this work we address this obstacle by presenting a methodology for computing the \{BWT\} of a string collection in a lightweight fashion. A first implementation of our algorithm needs O ( m log m ) bits of memory to process m strings, while a second variant makes additional use of external memory to achieve \{RAM\} usage that is constant with respect to m and negligible in size for a small alphabet such as DNA. The algorithms work on any number of strings and any size. We evaluate our algorithms on collections of up to 1 billion strings and compare their performance to other approaches on smaller datasets. We take further steps toward making the \{BWT\} a practical tool for processing string collections on this scale. First, we give two algorithms for recovering the strings in a collection from its BWT. Second, we show that if sequences are added to or removed from the collection, then the \{BWT\} of the original collection can be efficiently updated to obtain the \{BWT\} of the revised collection.

https://doi.org/10.1016/j.tcs.2012.02.002