6533b7d0fe1ef96bd125a3ac

RESEARCH PRODUCT

kmcEx: memory-frugal and retrieval-efficient encoding of counted k-mers.

Limsoon WongBertil SchmidtNingjiang ChenYiqi WangJie LuoPeng JiangPingji DengLiang ZhaoLiang ZhaoXiangjun Tang

subject

Statistics and ProbabilitySource codeComputer sciencemedia_common.quotation_subject0206 medical engineeringHash function02 engineering and technologyBiochemistry03 medical and health sciencesEncoding (memory)Molecular BiologyTime complexity030304 developmental biologyBlock (data storage)media_common0303 health sciencesSequence Analysis DNAData structureComputer Science ApplicationsComputational MathematicsComputational Theory and MathematicsError detection and correctionAlgorithmSequence Alignment020602 bioinformaticsAlgorithmsSoftware

description

Abstract Motivation K-mers along with their frequency have served as an elementary building block for error correction, repeat detection, multiple sequence alignment, genome assembly, etc., attracting intensive studies in k-mer counting. However, the output of k-mer counters itself is large; very often, it is too large to fit into main memory, leading to highly narrowed usability. Results We introduce a novel idea of encoding k-mers as well as their frequency, achieving good memory saving and retrieval efficiency. Specifically, we propose a Bloom filter-like data structure to encode counted k-mers by coupled-bit arrays—one for k-mer representation and the other for frequency encoding. Experiments on five real datasets show that the average memory-saving ratio on all 31-mers is as high as 13.81 as compared with raw input, with 7 hash functions. At the same time, the retrieval time complexity is well controlled (effectively constant), and the false-positive rate is decreased by two orders of magnitude. Availability and implementation The source codes of our algorithm are available at github.com/lzhLab/kmcEx. Supplementary information Supplementary data are available at Bioinformatics online.

10.1093/bioinformatics/btz299https://pubmed.ncbi.nlm.nih.gov/31038666