
RESEARCH PRODUCT

Simulation Framework for Realistic Large-scale Individual-level Data Generation with an Application in the Health Domain

Santtu Tikka, Jussi Hakanen, Mirka Saarela, Juha Karvanen

subject

Methodology (stat.ME); FOS: Computer and information sciences; Applications (stat.AP); Statistics - Applications; Statistics - Methodology

description

We propose a framework for realistic data generation and simulation of complex systems and demonstrate its capabilities in the health domain. The main use cases of the framework are predicting the development of risk factors and disease occurrence, evaluating the impact of interventions and policy decisions, and developing statistical methods. We present the fundamentals of the framework using rigorous mathematical definitions. The framework supports calibration to a real population as well as various manipulations and data collection processes. The freely available open-source implementation in R uses efficient data structures, parallel computing, and fast random number generation, which ensure reproducibility and scalability. With the framework, it is possible to run daily-level simulations for populations of millions of individuals over decades of simulated time. An example on the occurrence of stroke, type 2 diabetes, and mortality illustrates the use of the framework in the Finnish context. In the example, we demonstrate the data-collection functionality by studying the impact of non-participation on the estimated risk models and interventions related to controlling additional salt intake.
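
The idea of a reproducible daily-level, individual-level simulation can be illustrated with a minimal sketch. The Python code below is not the framework's R implementation and does not use its API; the function name `simulate`, the population size, the constant daily event probability, and the seed are all hypothetical values chosen for illustration. It shows only that drawing every random number from a single seeded generator makes two runs with the same seed produce identical individual-level data.

```python
import random


def simulate(n_individuals, n_days, daily_event_prob, seed):
    """Return, for each individual, the day of the first event (None if no event).

    Hypothetical toy model: every individual has the same constant daily
    event probability. This is an assumption of the sketch, not the
    framework's calibrated risk model.
    """
    # A dedicated generator makes the result depend only on `seed`.
    rng = random.Random(seed)
    first_event_day = [None] * n_individuals
    for day in range(n_days):
        for i in range(n_individuals):
            # Only individuals who are still event-free are at risk.
            if first_event_day[i] is None and rng.random() < daily_event_prob:
                first_event_day[i] = day
    return first_event_day


# Two runs with the same (hypothetical) settings and the same seed.
run_a = simulate(n_individuals=1000, n_days=365, daily_event_prob=0.001, seed=42)
run_b = simulate(n_individuals=1000, n_days=365, daily_event_prob=0.001, seed=42)
print(run_a == run_b)
```

A simulation at the scale described above would additionally need vectorised data structures and independent random streams per parallel worker, which this sketch deliberately omits.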

https://dx.doi.org/10.48550/arxiv.2008.13558