Partitioned learning of deep Boltzmann machines for SNP data.

Stefan Lenz, Moritz Hess, Harald Binder, Lars Bullinger, Tamara Jacqueline Blaette

subject

0301 basic medicine; Statistics and Probability; Computer science; Machine learning; 01 natural sciences; Biochemistry; Polymorphism, Single Nucleotide; 010104 statistics & probability; 03 medical and health sciences; Joint probability distribution; Humans; 0101 mathematics; Molecular Biology; Statistical hypothesis testing; Artificial neural network; Gene Expression Regulation, Leukemic; Deep learning; Univariate; Computational Biology; Manifold; Computer Science Applications; Data set; Computational Mathematics; 030104 developmental biology; ComputingMethodologies_PATTERNRECOGNITION; Computational Theory and Mathematics; Leukemia, Myeloid; Boltzmann constant; Data mining; Artificial intelligence; Software; Curse of dimensionality

description

Abstract

Motivation: Learning the joint distributions of measurements, and in particular identifying an appropriate low-dimensional manifold, has been found to be a powerful ingredient of deep learning approaches. Yet such approaches have hardly been applied to single nucleotide polymorphism (SNP) data, probably because the number of features typically exceeds the number of studied individuals.

Results: After a brief overview of how deep Boltzmann machines (DBMs), a deep learning approach, can in principle be adapted to SNP data, we present a way to alleviate the dimensionality problem by partitioned learning. We propose a sparse regression approach to coarsely screen the joint distribution of SNPs, followed by training several DBMs on the SNP partitions identified by the screening. Aggregate features representing SNP patterns, and the corresponding SNPs, are extracted from the DBMs by a combination of statistical tests and sparse regression. In simulated case–control data, we show how this can uncover complex SNP patterns and augment results from univariate approaches while maintaining type 1 error control. Time-to-event endpoints are considered in an application with acute myeloid leukemia patients, where SNP patterns are modeled after a pre-screening based on gene expression data. The proposed approach identified three SNPs that seem to jointly influence survival in a validation dataset. This indicates the added value of investigating SNPs jointly rather than by standard univariate analyses, and makes partitioned learning of DBMs an interesting complementary approach when analyzing SNP data.

Availability and implementation: A Julia package is provided at ‘http://github.com/binderh/BoltzmannMachines.jl’.

Supplementary information: Supplementary data are available at Bioinformatics online.
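
Illustration (not taken from the paper or its Julia package): the partitioning idea in the abstract, screening SNPs coarsely and then modeling each group separately, can be sketched in a few lines of Python. This sketch assumes a plain pairwise-correlation threshold as a stand-in for the sparse regression screening the authors describe; the function name `screen_partitions`, the `threshold` parameter, and the toy genotype matrix are all hypothetical. In the approach described above, each resulting partition would then be modeled by its own DBM, which is not shown here.

```python
import numpy as np


def screen_partitions(genotypes, threshold=0.5):
    """Group SNP columns into partitions of linked SNPs (illustrative only).

    genotypes: (n_individuals, n_snps) array of minor-allele counts 0/1/2.
    Assumption: two SNPs count as linked when the absolute Pearson
    correlation of their columns is >= threshold. This replaces the sparse
    regression screening described in the abstract with a simpler rule.
    Partitions are the connected components of the resulting link graph.
    Constant columns yield NaN correlations and therefore stay unlinked.
    """
    n_snps = genotypes.shape[1]
    corr = np.abs(np.corrcoef(genotypes, rowvar=False))

    # Union-find over SNP indices to collect connected components.
    parent = list(range(n_snps))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n_snps):
        for j in range(i + 1, n_snps):
            if corr[i, j] >= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n_snps):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Toy data: SNPs 0 and 1 are identical, SNPs 2 and 3 are identical,
# and SNP 4 is drawn independently of the others.
rng = np.random.default_rng(0)
base_a = rng.integers(0, 3, size=200)
base_b = rng.integers(0, 3, size=200)
toy = np.column_stack(
    [base_a, base_a, base_b, base_b, rng.integers(0, 3, size=200)]
)
partitions = screen_partitions(toy, threshold=0.9)
print(partitions)
```

With real SNP data the link rule and threshold would need to be chosen with care, since they determine how many variables each partition-level model has to handle.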

year	journal	country	edition	language
2016-12-20	Bioinformatics (Oxford, England)

10.1093/bioinformatics/btx408 https://pubmed.ncbi.nlm.nih.gov/28655145