RESEARCH PRODUCT

Missing values in deduplication of electronic patient data

Andreas Borg, Murat Sariyar, Klaus Pommerening

subject

Computer science, media_common.quotation_subject, Inference, Health Informatics, Ambiguity, Patient data, Missing data, computer.software_genre, Research and Applications, Regression, Neoplasms, Statistics, Data deduplication, Electronic Health Records, Humans, Data mining, Imputation (statistics), Medical Record Linkage, Registries, computer, Record linkage, media_common

description

Data deduplication is the process of detecting records that refer to the same real-world entity in a dataset so that duplicated records can be eliminated. The term ‘record linkage’ is used here for the same problem.[1] A typical application is the deduplication of medical registry data.[2,3] Medical registries are institutions that collect medical and personal data in a standardized and comprehensive way. Their primary aims are to create a pool of patients eligible for clinical or epidemiological studies and to compute indices such as the incidence in order to monitor the development of diseases. The latter task in particular requires a database in which synonyms and homonyms do not distort the measures. For instance, synonyms would lead to an overestimation of the incidence and thereby possibly to misallocated resources. The record linkage procedure must itself be reliable and of high quality in order to achieve clean data (for measures of the quality of record linkage methods see also Christen and Goiser[4]). A number of other important works have also investigated record linkage.[5–16]

Missing values pose serious problems in record linkage applications, beyond the difficulties they cause in areas where no comparison patterns have to be computed. In settings such as survey analysis, missing values arise, for example, from missing responses or from the participants' lack of knowledge. Analyses based on the gathered data can be biased in this case because of the unfilled fields; for example, higher wages are less likely to be revealed than lower ones. Papers that deal with missing values in survey analysis include those of Acock[17] and King et al.[18] In contrast, in record linkage of electronic health records using personal data, the impact of missing values is amplified, because a comparison field is missing whenever any of the underlying fields has a missing value.
Therefore, missingness in record linkage applications with a significant number of NA values is not ignorable, ie, not random. This non-randomness can also occur when blocking is applied to reduce the number of resulting record pairs: one or more features are selected as grouping variables and only pairs that agree in these variables are considered. A comprehensive survey of blocking is given by Christen.[19]

The distinction between missing completely at random (MCAR), missing at random (MAR) and missing not at random (MNAR) of Little and Rubin[20] is only relevant as a starting point. An introduction to missing values in clinical trials based on these distinctions is given by Molenberghs and Kenward.[21] Ding and Simonoff[22] show that the Little/Rubin distinctions are unrelated to the accuracy of different missing-value treatments when classification trees are used at prediction time and the missingness is independent of the class value. This holds for three of the four evaluated datasets in our study (see next section). We give a short overview of the notions in Little and Rubin:[20] MCAR applies when the probability that a value of a variable is missing (NA) does not depend on the values of other observed or unobserved variables o and u, that is, P(NA | o, u) = P(NA); MAR is present when the probability of NA depends only on (the values of other) observed variables, that is, P(NA | o, u) = P(NA | o); MNAR means that P(NA | o, u) cannot be quantified without additional assumptions.

The most widely used technique for dealing with missing values seems to be imputation, which replaces every NA by a value estimated from the available data. Imputation can be point based or distribution based. In the latter case the (conditional) distribution of the missing value is calculated and predictions are based on this estimated distribution.
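The way missing values carry over into comparison fields, and the way blocking restricts the candidate pairs, can be illustrated with a minimal Python sketch. It is not the procedure used in the study: the field names, the toy records, the binary equality comparison and the choice to skip records whose blocking value is missing are all assumptions made for illustration.

```python
from itertools import combinations

NA = None  # marker for a missing value (assumption of this sketch)

def compare(a, b):
    """Binary comparison of two field values; NA if either one is missing."""
    if a is NA or b is NA:
        return NA
    return 1 if a == b else 0

def comparison_pattern(rec1, rec2, fields):
    """Comparison pattern of a record pair: one comparison value per field."""
    return tuple(compare(rec1[f], rec2[f]) for f in fields)

def blocked_pairs(records, block_field):
    """Yield index pairs of records that agree on the blocking field.

    Records with a missing blocking value are skipped here, which is an
    assumption of this sketch and one way blocking can lose true matches.
    """
    blocks = {}
    for i, rec in enumerate(records):
        key = rec[block_field]
        if key is not NA:
            blocks.setdefault(key, []).append(i)
    for members in blocks.values():
        yield from combinations(members, 2)

# Hypothetical toy records, not taken from the registry data.
records = [
    {"first": "Anna", "last": "Meyer", "birth_year": 1970},
    {"first": "Anna", "last": NA,      "birth_year": 1970},
    {"first": NA,     "last": "Meyer", "birth_year": 1970},
    {"first": "Paul", "last": "Kurz",  "birth_year": 1985},
]
fields = ("first", "last")

patterns = {
    (i, j): comparison_pattern(records[i], records[j], fields)
    for i, j in blocked_pairs(records, "birth_year")
}

# Count missing values before and after forming comparison patterns.
field_na = sum(rec[f] is NA for rec in records for f in fields)
pattern_na = sum(v is NA for p in patterns.values() for v in p)
```

In this toy data, 2 of the 8 name values are missing, but 4 of the 6 comparison values are NA, which shows how pairing records amplifies missingness.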
Multiple (or repeated) imputation generates several complete versions of the data that are combined for final inference in a statistical setting. For further information on this variant we refer to Little and Rubin.[20]

As far as we know, there is no internationally published systematic approach to missing values in record linkage. Works such as those by McGlincy[23] or James[24] do not, as their titles might suggest, deal with missing values in the matching attributes but with predicting matches as such. The former paper states that the ‘problem of missing links is similar to the problem of non-response in surveys’, which leaves missing values in matching attributes out of consideration. Our paper is meant to serve as the basis for future work on missing values in record linkage.

Relevant papers on classification trees with missing values are those of Ding and Simonoff[22] and Saar-Tsechansky and Provost.[25] The former work investigates six different approaches to missing values (probabilistic split, complete case method, grand mode/mean imputation, separate class, surrogate split, and complete variable method) and concludes that treating missing values as a separate class (in this paper: imputation with the unique value 0.5) performs best when missingness is related to the response variable; otherwise the results are more ambiguous. The authors use real datasets and simulated datasets in which the number of missing values is increased based on MCAR, MAR and MNAR sampling. Among others, they use a classification-tree induction algorithm that is also used in this paper (ie, classification and regression trees (CART); see Methods section). In the article by Saar-Tsechansky and Provost[25] a set of C4.5 classification trees induced on reduced sets of attributes (ie, reduced-model classification) exhibits the best results.
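Three of the treatments named above (separate class via the value 0.5, mode imputation, and the complete case method) can be sketched on comparison patterns as follows. This is a simplified Python illustration under stated assumptions, not the implementation evaluated in the paper or in the cited works: the function names and the toy patterns are hypothetical, and mode imputation is computed per comparison field over all pairs.

```python
from collections import Counter

NA = None  # marker for a missing comparison value (assumption of this sketch)

def impute_separate_class(patterns, fill=0.5):
    """Replace every NA by a unique value so that 'missing' forms its own class."""
    return [tuple(fill if v is NA else v for v in p) for p in patterns]

def impute_mode(patterns):
    """Replace every NA by the most frequent observed value of its field."""
    modes = []
    for column in zip(*patterns):
        counts = Counter(v for v in column if v is not NA)
        # A field without any observed value stays NA.
        modes.append(counts.most_common(1)[0][0] if counts else NA)
    return [
        tuple(modes[k] if v is NA else v for k, v in enumerate(p))
        for p in patterns
    ]

def complete_cases(patterns):
    """Keep only the patterns without any missing value."""
    return [p for p in patterns if all(v is not NA for v in p)]

# Hypothetical comparison patterns over two fields (1 = agree, 0 = disagree).
patterns = [(1, NA), (NA, 0), (0, 0), (1, 1), (1, 0)]

as_separate_class = impute_separate_class(patterns)
as_mode = impute_mode(patterns)
only_complete = complete_cases(patterns)
```

With the separate-class treatment a tree can split on "value is 0.5", so missingness itself becomes usable information; mode imputation hides it, and the complete case method discards the affected pairs altogether.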
For further information on the classification-tree induction approach C4.5 we refer to Salzberg.[26] This reduced-model classification is compared with predictive value imputation (eg, the surrogate-split mechanism in CART; see Methods section) and distribution-based imputation (eg, sample-based induction; see Methods section) used by C4.5. Datasets with ‘naturally occurring’ missing values and with increased numbers of missing values (chosen at random: MCAR) were considered. The authors explicitly deal solely with missingness at prediction time. We want to address induction time as well.

This paper empirically studies the effect of different approaches to missing values on accuracy in a record linkage setting in which classification trees are used to classify record pairs as match or non-match. Our main aim is to determine the best record linkage strategy on a large amount of real-world data as well as on data based on them in which the number of NA values is manually increased. The number of data items considered in the evaluation is above five million, which is unusually large for classification-tree settings: the datasets in Saar-Tsechansky and Provost[25] have at most 21 000 items, and Ding and Simonoff[22] perform classification with CART on at most 100 000 items (their implementation of CART cannot cope with more data at prediction time).
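Increasing the number of missing values at random (MCAR) can be sketched as below. How NA values were actually increased in the study is not described in this section, so the function `inject_mcar`, its parameters and the toy records are assumptions made purely for illustration.

```python
import random

NA = None  # marker for a missing value (assumption of this sketch)

def inject_mcar(records, fields, rate, seed=0):
    """Return copies of the records in which each observed value of the
    given fields is set to NA with probability `rate`.

    The decision ignores all observed and unobserved values, which is what
    makes the added missingness MCAR.
    """
    rng = random.Random(seed)  # seeded for reproducibility
    result = []
    for rec in records:
        new = dict(rec)  # copy so the input records stay untouched
        for f in fields:
            if new[f] is not NA and rng.random() < rate:
                new[f] = NA
        result.append(new)
    return result

# Hypothetical toy records.
records = [
    {"first": "Anna", "last": "Meyer"},
    {"first": "Paul", "last": NA},
]
fields = ("first", "last")

unchanged = inject_mcar(records, fields, rate=0.0)
all_missing = inject_mcar(records, fields, rate=1.0)
partly_missing = inject_mcar(records, fields, rate=0.3, seed=42)
```

MAR or MNAR sampling would instead let the removal probability depend on other observed values or on the removed value itself.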

10.1136/amiajnl-2011-000461
https://europepmc.org/articles/PMC3392851/