6533b7d0fe1ef96bd125a3ac
RESEARCH PRODUCT
kmcEx: memory-frugal and retrieval-efficient encoding of counted k-mers.
Limsoon WongBertil SchmidtNingjiang ChenYiqi WangJie LuoPeng JiangPingji DengLiang ZhaoLiang ZhaoXiangjun Tangsubject
Statistics and ProbabilitySource codeComputer sciencemedia_common.quotation_subject0206 medical engineeringHash function02 engineering and technologyBiochemistry03 medical and health sciencesEncoding (memory)Molecular BiologyTime complexity030304 developmental biologyBlock (data storage)media_common0303 health sciencesSequence Analysis DNAData structureComputer Science ApplicationsComputational MathematicsComputational Theory and MathematicsError detection and correctionAlgorithmSequence Alignment020602 bioinformaticsAlgorithmsSoftwaredescription
Abstract Motivation K-mers along with their frequency have served as an elementary building block for error correction, repeat detection, multiple sequence alignment, genome assembly, etc., attracting intensive studies in k-mer counting. However, the output of k-mer counters itself is large; very often, it is too large to fit into main memory, leading to highly narrowed usability. Results We introduce a novel idea of encoding k-mers as well as their frequency, achieving good memory saving and retrieval efficiency. Specifically, we propose a Bloom filter-like data structure to encode counted k-mers by coupled-bit arrays—one for k-mer representation and the other for frequency encoding. Experiments on five real datasets show that the average memory-saving ratio on all 31-mers is as high as 13.81 as compared with raw input, with 7 hash functions. At the same time, the retrieval time complexity is well controlled (effectively constant), and the false-positive rate is decreased by two orders of magnitude. Availability and implementation The source codes of our algorithm are available at github.com/lzhLab/kmcEx. Supplementary information Supplementary data are available at Bioinformatics online.
year | journal | country | edition | language |
---|---|---|---|---|
2018-06-29 | Bioinformatics (Oxford, England) |