MetaCache: context-aware classification of metagenomic reads using minhashing.

Christian Hundt, André Müller, Bertil Schmidt, Andreas Hildebrandt, Thomas Hankeln

subject

0301 basic medicine; Statistics and Probability; Computer science; Sequence analysis; Context (language use); Biochemistry; Genome; 03 medical and health sciences; chemistry.chemical_compound; 0302 clinical medicine; RefSeq; Humans; Molecular Biology; Information retrieval; Shotgun sequencing; High-Throughput Nucleotide Sequencing; Sequence Analysis, DNA; Computer Science Applications; Computational Mathematics; 030104 developmental biology; Computational Theory and Mathematics; chemistry; Metagenomics; 030217 neurology & neurosurgery; DNA; Algorithms; Software; Reference genome

description

Abstract

Motivation: Metagenomic shotgun sequencing studies are becoming increasingly popular, with prominent examples including the sequencing of human microbiomes and diverse environments. A fundamental computational problem in this context is read classification, i.e. the assignment of each read to a taxonomic label. Due to the large number of reads produced by modern high-throughput sequencing technologies and the rapidly increasing number of available reference genomes, corresponding software tools suffer from long runtimes, large memory requirements or low accuracy.

Results: We introduce MetaCache, a novel software tool for read classification using the big data technique minhashing. Our approach performs context-aware classification of reads by computing representative subsamples of k-mers within both probed reads and locally constrained regions of the reference genomes. As a result, MetaCache consumes significantly less memory than the state-of-the-art read classifiers Kraken and CLARK while achieving highly competitive sensitivity and precision at comparable speed. For example, using NCBI RefSeq draft and completed genomes with a total length of around 140 billion bases as reference, MetaCache’s database consumes only 62 GB of memory, while both Kraken and CLARK fail to construct their respective databases on a workstation with 512 GB RAM. Our experimental results further show that classification accuracy continuously improves when increasing the amount of utilized reference genome data.

Availability and implementation: MetaCache is open source software written in C++ and can be downloaded at http://github.com/muellan/metacache.

Supplementary information: Supplementary data are available at Bioinformatics online.
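The minhashing idea named in the abstract can be illustrated with a small C++ sketch. This is not MetaCache's implementation: the function names (`mix64`, `minhash_sketch`, `sketch_windows`, `shared_features`), the hash function, and all parameter values are assumptions chosen for illustration only. The sketch keeps the s smallest k-mer hashes of a sequence as its representative subsample, applies the same procedure to fixed-length windows of a reference sequence, and counts the hash values a read shares with a window.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Illustrative 64-bit mixing function (an assumption, not MetaCache's hash).
inline std::uint64_t mix64(std::uint64_t x) {
    x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
    x ^= x >> 27; x *= 0x94d049bb133111ebULL;
    x ^= x >> 31;
    return x;
}

// Sketch of one sequence: the s smallest distinct hashes of its k-mers,
// sorted ascending. k-mers containing characters other than A, C, G, T
// are skipped.
std::vector<std::uint64_t> minhash_sketch(const std::string& seq,
                                          std::size_t k, std::size_t s) {
    assert(k > 0 && k <= 32);  // a k-mer must fit into 64 bits at 2 bits/base
    std::vector<std::uint64_t> hashes;
    for (std::size_t i = 0; i + k <= seq.size(); ++i) {
        std::uint64_t code = 0;
        bool valid = true;
        for (std::size_t j = 0; j < k && valid; ++j) {
            std::uint64_t base = 0;
            switch (seq[i + j]) {
                case 'A': base = 0; break;
                case 'C': base = 1; break;
                case 'G': base = 2; break;
                case 'T': base = 3; break;
                default:  valid = false; break;
            }
            code = (code << 2) | base;
        }
        if (valid) hashes.push_back(mix64(code));
    }
    std::sort(hashes.begin(), hashes.end());
    hashes.erase(std::unique(hashes.begin(), hashes.end()), hashes.end());
    if (hashes.size() > s) hashes.resize(s);
    return hashes;
}

// One sketch per window of a reference sequence (hypothetical window
// length and stride), so that a hit points to a local region.
std::vector<std::vector<std::uint64_t>> sketch_windows(
        const std::string& genome, std::size_t window_len, std::size_t stride,
        std::size_t k, std::size_t s) {
    assert(stride > 0);
    std::vector<std::vector<std::uint64_t>> sketches;
    for (std::size_t start = 0; start < genome.size(); start += stride)
        sketches.push_back(minhash_sketch(genome.substr(start, window_len), k, s));
    return sketches;
}

// Number of hash values two sorted sketches have in common.
std::size_t shared_features(const std::vector<std::uint64_t>& a,
                            const std::vector<std::uint64_t>& b) {
    std::size_t i = 0, j = 0, hits = 0;
    while (i < a.size() && j < b.size()) {
        if (a[i] < b[j]) ++i;
        else if (b[j] < a[i]) ++j;
        else { ++hits; ++i; ++j; }
    }
    return hits;
}
```

Under these assumptions, a read would be sketched with `minhash_sketch` and compared against each window sketch with `shared_features`; windows with many shared values are candidate origins of the read. Storing only s hashes per window rather than every k-mer is what keeps such an index small.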

year	journal	country	edition	language
2017-02-15	Bioinformatics (Oxford, England)

10.1093/bioinformatics/btx520 https://pubmed.ncbi.nlm.nih.gov/28961782