RESEARCH PRODUCT

Big Data in Medical Science–a Biostatistical View

Harald Binder, Maria Blettner

subject

Big data, Byte, Gigabyte, Terabyte, Cloud computing, General Medicine, Medicine, Bioinformatics, Data science, Data analysis, Function (engineering), Datasets as Topic

description

“Big data” is a universal buzzword in business and science, referring to the retrieval and handling of ever-growing amounts of information. It can be assumed, for example, that a typical hospital generates hundreds of terabytes (1 TB = 10^12 bytes) of data annually in the course of patient care (1). Exome sequencing, for instance, which results in 5 gigabytes (1 GB = 10^9 bytes) of data per patient, is on the way to becoming routine (2). Analyzing such enormous volumes of information, i.e., organizing and describing the data and drawing (scientifically valid) conclusions, is already barely feasible with the traditional tools of computer science and statistics. For example, examining the exomes of several hundred patients requires sophisticated analytical approaches and the selection of statistical methods that optimize computation time and avoid exceeding the available storage capacity.
This is a challenge for the discipline of statistics, which has traditionally analyzed data not only from clinical studies but also from observational studies. Among other things, techniques have to cope with a number of characteristics per individual that greatly exceeds the number of individuals observed, e.g., when 5 million single-nucleotide polymorphisms are acquired from each of a cohort of 100 patients.
In the following description of scenarios, techniques, and problems we focus on medical science, i.e., on the question of where and how big data approaches to the processing of large volumes of information can contribute to the advancement of scientific knowledge in medicine. While the description of the corresponding data analysis techniques takes a predominantly scientific perspective, the three scenarios preceding the discussion of techniques are intended to guide the reader in how these approaches can be used in handling routine data.
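The orders of magnitude quoted above can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the original article: the function names, the cohort size of 300 (standing in for "several hundred" patients), and the choice of 1 or 8 bytes per stored genotype value are illustrative assumptions. Only the 5 GB per exome, the 5 million single-nucleotide polymorphisms, and the 100 patients come from the text.

```python
# Back-of-the-envelope sizes, using decimal units as in the text.
GB = 10**9   # 1 GB = 10^9 bytes
TB = 10**12  # 1 TB = 10^12 bytes


def exome_storage_bytes(n_patients, bytes_per_exome=5 * GB):
    """Raw storage for exome data at 5 GB per patient (figure from the text)."""
    return n_patients * bytes_per_exome


def genotype_matrix_bytes(n_patients, n_snps, bytes_per_value):
    """Memory for a dense patients-by-SNPs matrix (storage width is an assumption)."""
    return n_patients * n_snps * bytes_per_value


# Figures from the text: 5 million SNPs from each of 100 patients.
n_patients = 100
n_snps = 5_000_000

# Characteristics per individual relative to the number of individuals.
ratio = n_snps // n_patients

# Hypothetical storage widths: 1 byte (compact coding) vs. 8 bytes (float64).
as_int8 = genotype_matrix_bytes(n_patients, n_snps, 1)
as_float64 = genotype_matrix_bytes(n_patients, n_snps, 8)

# Hypothetical cohort of 300 patients standing in for "several hundred".
cohort_bytes = exome_storage_bytes(300)

print(f"Exomes of 300 patients: {cohort_bytes / TB} TB")
print(f"Characteristics per patient vs. patients: {ratio}:1")
print(f"Genotype matrix, 1 byte per value: {as_int8 / GB} GB")
print(f"Genotype matrix, 8 bytes per value: {as_float64 / GB} GB")
```

The comparison of the two storage widths illustrates why the choice of data representation, and not only the choice of statistical method, affects whether an analysis stays within the available memory.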
Because clinical studies are our reference point, applications that have little in common with the structure of such studies, e.g., the prediction of disease spread from search engine data (Box), will not be discussed. Furthermore, concepts for technical implementation, e.g., cloud computing (5), will not be presented. Instead, we focus on biostatistical aspects, such as the undistorted estimation of treatment effects, which represent a crucial precondition for progress in medical science (6).
Box. The debate about a big data showpiece: Google Flu Trends
In the Google Flu Trends project (3), the frequency of Google searches for certain terms is used to predict influenza activity at the regional level in a large number of countries. The original publication (3) shows that this method enables precise prediction of data that have traditionally been acquired in a much more cumbersome fashion, e.g., by the United States Centers for Disease Control and Prevention (CDC), and that used to become available only after some delay. The possibility of rapid reaction opened up by the Google approach is often cited as a successful application of big data. However, later investigations (4) showed serious systematic deviations from predicted values in the period covered by (3). These may have been caused by modification of the search engine algorithm for business reasons, i.e., to optimize the primary function, with resulting impairment of the secondary function of influenza prediction.

https://doi.org/10.3238/arztebl.2015.0137