Missing value imputation in proximity extension assay-based targeted proteomics data

RESEARCH PRODUCT

Philipp S. Wild, Thomas Koeck, Miguel A. Andrade-Navarro, Kirsten Leineweber, Andreas Schulz, Steffen Rapp, Lisa Eggebrecht, Karl J. Lackner, Vincent ten Cate, Thomas Münzel, Marina Panova-Noeva, Madeleine Sauer, Markus Nagler, Jürgen H. Prochaska, Michael Lenz

subject

Proteomics; Male; Multivariate analysis; Protein expression; Biochemistry; Database and Informatics Methods; Limit of Detection; Statistics; Medicine and Health Sciences; Biochemical Simulations; Imputation (statistics); Immune Response; Mathematics; Multidisciplinary; Proteomic Databases; Q; R; Eukaryota; Blood Proteins; Venous thromboembolism; Plants; Middle Aged; Legumes; Targeted proteomics; Engineering and Technology; Medicine; Female; Algorithms; Research Article; Quality Control; Adult; Science; Immunology; Research and Analysis Methods; Signs and Symptoms; Bias; Industrial Engineering; Protein Concentration Assays; Gene Expression and Vector Techniques; Missing value imputation; Humans; Molecular Biology Techniques; Molecular Biology; Aged; Inflammation; Molecular Biology Assays and Analysis Techniques; Interleukin-6; Organisms; Peas; Biology and Life Sciences; Computational Biology; Missing data; Pearson product-moment correlation coefficient; Biological Databases; Clinical Medicine

description

Targeted proteomics using antibody-based proximity extension assays provides sensitive and highly specific quantification of plasma protein levels. Multivariate analysis of these data is hampered by frequent missing values (random or left-censored), calling for imputation approaches. While appropriate missing-value imputation methods exist, benchmarks of their performance in targeted proteomics data are lacking. Here, we assessed the performance of two methods for imputation of values missing completely at random: the previously top-benchmarked ‘missForest’ and the recently published ‘GSimp’ method. Evaluation was accomplished by comparing imputed with remeasured relative concentrations of 91 inflammation-related circulating proteins in 86 samples from a cohort of 645 patients with venous thromboembolism. The median Pearson correlation between imputed and remeasured protein expression values was 69.0% for missForest and 71.6% for GSimp (p = 5.8e-4). Imputation with missForest resulted in a stronger reduction of variance than imputation with GSimp (median relative variance of 25.3% vs. 68.6%, p = 2.4e-16) and an undesirably larger bias in downstream analyses. Irrespective of the imputation method used, the 91 imputed proteins showed large variations in imputation accuracy, driven by differences in signal-to-noise ratio and information overlap between proteins. In summary, GSimp outperformed missForest, although both methods showed good overall imputation accuracy with large variations between proteins.
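
The two summary metrics quoted above (median Pearson correlation and median relative variance across proteins) can be illustrated with a minimal Python sketch. It is not the authors' code. It assumes that "relative variance" means the sample variance of the imputed values divided by the sample variance of the remeasured values for the same protein, which the description does not state explicitly. The function names, protein labels, and toy numbers are hypothetical.

```python
from statistics import mean, median, variance


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def relative_variance(imputed, remeasured):
    """ASSUMED definition: var(imputed) / var(remeasured), using sample variances."""
    return variance(imputed) / variance(remeasured)


def summarize(imputed_by_protein, remeasured_by_protein):
    """Return (median correlation, median relative variance) across proteins."""
    cors, rel_vars = [], []
    for protein, remeasured in remeasured_by_protein.items():
        imputed = imputed_by_protein[protein]
        cors.append(pearson(imputed, remeasured))
        rel_vars.append(relative_variance(imputed, remeasured))
    return median(cors), median(rel_vars)


# Made-up toy values for two hypothetical proteins (not data from the study).
remeasured = {"protein_A": [1.0, 2.0, 3.0, 4.0], "protein_B": [2.0, 4.0, 6.0, 8.0]}
imputed = {"protein_A": [1.5, 2.0, 3.0, 3.5], "protein_B": [4.0, 4.5, 5.5, 6.0]}

median_cor, median_rel_var = summarize(imputed, remeasured)
print(f"median correlation: {median_cor:.3f}")
print(f"median relative variance: {median_rel_var:.3f}")
```

Under this assumed definition, a relative variance below 1 means the imputed values are less spread out than the remeasured ones, which is the kind of variance reduction the description reports for both methods.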

year	journal
2020-12-14	PLOS ONE

https://doi.org/10.1371/journal.pone.0243487