0000000001250495
AUTHOR
Camelia Goga
Variance estimation and asymptotic confidence bands for the mean estimator of sampled functional data with high entropy unequal probability sampling designs
For fixed size sampling designs with high entropy it is well known that the variance of the Horvitz-Thompson estimator can be approximated by the Hájek formula. The interest of this asymptotic variance approximation is that it only involves the first order inclusion probabilities of the statistical units. We extend this variance formula when the variable under study is functional and we prove, under general conditions on the regularity of the individual trajectories and the sampling design, that we can get a uniformly convergent estimator of the variance function of the Horvitz-Thompson estimator of the mean function. Rates of convergence to the true variance function are given for the re…
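As a rough illustration of why only first order inclusion probabilities are needed, the sketch below computes a Horvitz-Thompson mean curve and a pointwise Hájek-type variance estimate from a sample of discretized curves. The function names, the list-of-lists data layout, and the omission of any finite-population correction factor are assumptions made for this example; it is not taken from the paper.

```python
def ht_mean_curve(curves, pi, N):
    """Horvitz-Thompson estimate of the population mean curve.

    curves: sampled trajectories, each a list of values on a common time grid
    pi:     first order inclusion probabilities of the sampled units
    N:      population size (assumed known)
    """
    n_times = len(curves[0])
    return [sum(y[t] / p for y, p in zip(curves, pi)) / N for t in range(n_times)]


def hajek_variance_curve(curves, pi, N):
    """Pointwise Hajek-type variance estimate for the HT mean curve.

    Uses only first order inclusion probabilities:
    V(t) = (1 / N^2) * sum_s (1 - pi_k) * (y_k(t) / pi_k - R(t))^2,
    with R(t) = sum_s (1 - pi_k) * y_k(t) / pi_k / sum_s (1 - pi_k).
    """
    n_times = len(curves[0])
    d = sum(1.0 - p for p in pi)
    if d == 0.0:
        # Every unit is sampled with certainty: no sampling variance.
        return [0.0] * n_times
    variance = []
    for t in range(n_times):
        r = sum((1.0 - p) * y[t] / p for y, p in zip(curves, pi)) / d
        v_total = sum((1.0 - p) * (y[t] / p - r) ** 2 for y, p in zip(curves, pi))
        variance.append(v_total / N ** 2)
    return variance
```

Calling both functions on the same sample gives an estimated mean curve and a variance curve on the same time grid, which is the input a pointwise or uniform confidence band would need.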
Conditional Bias Robust Estimation of the Total of Curve Data by Sampling in a Finite Population: An Illustration on Electricity Load Curves
For marketing or power grid management purposes, many studies based on the analysis of total electricity consumption curves of groups of customers are now carried out by electricity companies. Aggregated totals or mean load curves are estimated using individual curves measured on a fine time grid and collected according to some sampling design. Due to the skewness of the distribution of electricity consumptions, these samples often contain outlying curves which may have an important impact on the usual estimation procedures. We introduce several robust estimators of the total consumption curve which are not sensitive to such outlying curves. These estimators are based on the conditio…
Multivariate statistical analysis for exploring road crash-related factors in the Franche-Comté region of France
Understanding and modelling road crash data is crucial in fulfilling safety goals by helping national authorities to take necessary measures to reduce crash frequency and severity. This work aims at giving a multivariate statistical analysis of road crash data from the French region of Franche-Comté with special attention to road crash severity. The first step for this multivariate analysis was to perform Multiple Correspondence Analysis in order to assess associations between the road crash injury and several important accident-related factors and circumstances. Log-linear models are used next in order to detect associations between road crash severity and related factors such as alcohol/d…
Efficient Estimation of Nonlinear Finite Population Parameters Using Nonparametrics
Currently, the high-precision estimation of nonlinear parameters such as Gini indices, low-income proportions or other measures of inequality is particularly crucial. In the present paper, we propose a general class of estimators for such parameters that take into account univariate auxiliary information assumed to be known for every unit in the population. Through a nonparametric model-assisted approach, we construct a unique system of survey weights that can be used to estimate any nonlinear parameter associated with any study variable of the survey, using a plug-in principle. Based on a rigorous functional approach and a linearization principle, the asymptotic variance of the proposed es…
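The plug-in principle mentioned above can be sketched for the Gini index: once a single system of survey weights is available, the index is computed by substituting weighted quantities for population ones. The function name and inputs below are assumptions for illustration; the weights are taken as given and are not the nonparametric model-assisted weights constructed in the paper.

```python
def weighted_gini(y, w):
    """Plug-in Gini index from study values y and survey weights w.

    Units are sorted by y; with W_k the cumulative weight up to unit k,
    G = sum_k w_k * y_k * (2 * W_k - w_k) / (sum_k w_k * sum_k w_k * y_k) - 1.
    """
    pairs = sorted(zip(y, w))
    total_w = sum(wk for _, wk in pairs)
    total_y = sum(wk * yk for yk, wk in pairs)
    cum_w = 0.0
    acc = 0.0
    for yk, wk in pairs:
        cum_w += wk
        acc += wk * yk * (2.0 * cum_w - wk)
    return acc / (total_w * total_y) - 1.0
```

The same weights could be reused for another nonlinear parameter, such as a low-income proportion, which is the practical appeal of a unique weight system.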
Design-based estimation for geometric quantiles with application to outlier detection
Geometric quantiles are investigated using data collected from a complex survey. Geometric quantiles are an extension of univariate quantiles in a multivariate set-up that uses the geometry of multivariate data clouds. A very important application of geometric quantiles is the detection of outliers in multivariate data by means of quantile contours. A design-based estimator of geometric quantiles is constructed and used to compute quantile contours in order to detect outliers in both multivariate data and survey sampling set-ups. An algorithm for computing geometric quantile estimates is also developed. Under broad assumptions, the asymptotic variance of the quantile estimator is derived an…
Asymptotic efficiency of the calibration estimator in a high-dimensional data setting
In a finite population sampling survey, auxiliary information is commonly used to improve the Horvitz-Thompson estimators and calibration has been extensively used by national statistical agencies over the last decades for that purpose. This method makes estimators consistent with known totals of auxiliary variables and reduces variance if the calibration variables are explanatory for the variable of interest. Nowadays, it is no longer unusual to have high-dimensional auxiliary data sets, and adding too many calibration variables may increase the variance of calibration estimators. We study in this paper the asymptotic efficiency of the calibration estimator…
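To make the idea of consistency with known totals concrete, here is a minimal sketch of linear (chi-square distance) calibration with a single auxiliary variable. The function name and arguments are hypothetical, and the one-variable case is an assumption chosen to avoid a matrix solve; it does not address the high-dimensional setting studied in the paper.

```python
def linear_calibration_weights(d, x, total_x):
    """Calibrate design weights d on one auxiliary variable x.

    Returns w_k = d_k * (1 + lam * x_k), with lam chosen so that
    sum_k w_k * x_k equals the known population total total_x.
    """
    ht_x = sum(dk * xk for dk, xk in zip(d, x))       # Horvitz-Thompson total of x
    denom = sum(dk * xk * xk for dk, xk in zip(d, x))
    lam = (total_x - ht_x) / denom
    return [dk * (1.0 + lam * xk) for dk, xk in zip(d, x)]
```

When the study variable is exactly proportional to `x`, the calibrated total reproduces the true total, which is the variance-reduction mechanism the abstract refers to.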
Model-Assisted Estimation Through Random Forests in Finite Population Sampling
In surveys, the interest lies in estimating finite population parameters such as population totals and means. In most surveys, some auxiliary information is available at the estimation stage. This information may be incorporated in the estimation procedures to increase their precision. In this article, we use random forests (RFs) to estimate the functional relationship between the survey variable and the auxiliary variables. In recent years, RFs have become attractive as National Statistical Offices now have access to a variety of data sources, potentially exhibiting a large number of observations on a large number of variables. We establish the theoretical properties of model-assisted proc…
B-Spline Estimation in a Survey Sampling Framework
Nonparametric regression models have been used more and more in recent years to model survey data and to incorporate auxiliary information efficiently in order to improve the estimation of totals, means or other study parameters such as the Gini index or the poverty rate. B-spline nonparametric regression has the benefit of being very flexible in modeling nonlinear survey data while keeping many similarities and properties of the classical linear regression. This method proved to be efficient for deriving a unique system of weights which makes it possible to estimate many study parameters simultaneously and efficiently. Applications on real and simulated survey data showed its high efficiency. This …
Efficient Estimation of Non-Linear Finite Population Parameters by Using Non-Parametrics
Currently, high precision estimation of non-linear parameters such as Gini indices, low income proportions or other measures of inequality is particularly crucial. We propose a general class of estimators for such parameters that take into account univariate auxiliary information assumed to be known for every unit in the population. Through a non-parametric model-assisted approach, we construct a unique system of survey weights that can be used to estimate any non-linear parameter that is associated with any study variable of the survey, using a plug-in principle. Based on a rigorous functional approach and a linearization principle, the asymptotic variance of the estimators propose…
Use of functionals in linearization and composite estimation with application to two-sample survey data
An important problem associated with two-sample surveys is the estimation of nonlinear functions of finite population totals such as ratios, correlation coefficients or measures of income inequality. Computation and estimation of the variance of such complex statistics are made more difficult by the existence of overlapping units. In one-sample surveys, the linearization method based on the influence function approach is a powerful tool for variance estimation. We introduce a two-sample linearization technique that can be viewed as a generalization of the one-sample influence function approach. Our technique is based on expressing the parameters of interest as multivariate functionals of fi…
Variance Estimation and Asymptotic Confidence Bands for the Mean Estimator of Sampled Functional Data with High Entropy Unequal Probability Sampling Designs
For fixed size sampling designs with high entropy it is well known that the variance of the Horvitz-Thompson estimator can be approximated by the Hájek formula. The interest of this asymptotic variance approximation is that it only involves the first order inclusion probabilities of the statistical units. We extend this variance formula when the variable under study is functional and we prove, under general conditions on the regularity of the individual trajectories and the sampling design, that it asymptotically provides a uniformly consistent estimator of the variance function of the Horvitz-Thompson estimator of the mean function. Rates of convergence to the true variance function are gi…
Uniform convergence and asymptotic confidence bands for model-assisted estimators of the mean of sampled functional data
When the study variable is functional and storage capacities are limited or transmission costs are high, selecting with survey sampling techniques a small fraction of the observations is an interesting alternative to signal compression techniques, particularly when the goal is the estimation of simple quantities such as means or totals. We extend, in this functional framework, model-assisted estimators with linear regression models that can take account of auxiliary variables whose totals over the population are known. We first show, under weak hypotheses on the sampling design and the regularity of the trajectories, that the estimator of the mean function as well as its variance estimator …
Estimation of total electricity consumption curves of small areas by sampling in a finite population
Many studies carried out in the French electricity company EDF are based on the analysis of the total electricity consumption curves of groups of customers. These aggregated electricity consumption curves are estimated by using samples of thousands of curves measured at a small time step and collected according to a sampling design. Small area estimation is very common in survey sampling. It is often addressed by using implicit or explicit domain models between the interest variable and the auxiliary variables. The goal here is to estimate totals of electricity consumption curves over domains or areas. Three approaches are compared: the first one consists in modeling th…
Estimating with kernel smoothers the mean of functional data in a finite population setting. A note on variance estimation in presence of partially observed trajectories
In the near future, millions of load curves measuring the electricity consumption of French households in small time grids (probably half hours) will be available. All these collected load curves represent a huge amount of information which could be exploited using survey sampling techniques. In particular, the total consumption of a specific customer group (for example all the customers of an electricity supplier) could be estimated using unequal probability random sampling methods. Unfortunately, data collection may undergo technical problems resulting in missing values. In this paper we study a new estimation method for the mean curve in the presence of missing values which consists in…
Estimation of total electricity consumption curves by sampling in a finite population when some trajectories are partially unobserved
Millions of smart meters that are able to collect individual load curves, that is, electricity consumption time series, of residential and business customers at fine scale time grids are now deployed by electricity companies all around the world. It may be complex and costly to transmit and exploit such a large quantity of information, therefore it can be relevant to use survey sampling techniques to estimate mean load curves of specific groups of customers. Data collection, like every mass process, may undergo technical problems at every point of the metering and collection chain resulting in missing values. We consider imputation approaches (linear interpolation, k…
Using Complex Surveys to Estimate the L1-Median of a Functional Variable: Application to Electricity Load Curves
Mean profiles are widely used as indicators of the electricity consumption habits of customers. Currently, Électricité de France (EDF) estimates class load profiles by using the point-wise mean function. Unfortunately, it is well known that the mean is highly sensitive to the presence of outliers, such as one or more consumers with unusually high levels of consumption. In this paper, we propose an alternative to the mean profile: the L1-median profile, which is more robust. When dealing with large datasets of functional data (load curves for example), survey sampling approaches are useful for estimating the median profile and avoid storing all of the data. We propose here estimators of the median trajec…
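For intuition about the L1-median of curves observed on a common grid, the sketch below computes a survey-weighted L1-median with a Weiszfeld-type fixed-point iteration. The choice of Weiszfeld's algorithm, the function name, and the tolerance parameters are assumptions for this example; the truncated abstract does not say which algorithm the authors use.

```python
import math


def weighted_l1_median(curves, weights, tol=1e-9, max_iter=500, eps=1e-12):
    """Weighted L1-median (spatial median) of discretized curves.

    Minimizes sum_k w_k * ||y_k - m|| over m by Weiszfeld iterations,
    starting from the weighted mean. Distances are floored at eps so
    that an iterate landing on a data point does not divide by zero.
    """
    dim = len(curves[0])
    total_w = sum(weights)
    m = [sum(w * y[j] for y, w in zip(curves, weights)) / total_w for j in range(dim)]
    for _ in range(max_iter):
        coef = []
        for y, w in zip(curves, weights):
            dist = math.sqrt(sum((y[j] - m[j]) ** 2 for j in range(dim)))
            coef.append(w / max(dist, eps))
        s = sum(coef)
        new_m = [sum(c * y[j] for c, y in zip(coef, curves)) / s for j in range(dim)]
        shift = math.sqrt(sum((a - b) ** 2 for a, b in zip(new_m, m)))
        m = new_m
        if shift < tol:
            break
    return m
```

With weights set to the inverses of the inclusion probabilities, one extreme curve moves the result far less than it would move the weighted mean, which is the robustness property the abstract emphasizes.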
Estimation de paramètres non linéaires par des méthodes non-paramétriques en population finie
In this article we consider the estimation of nonlinear parameters of totals in a finite population when an auxiliary variable is available for every individual in the population. A new class of substitution estimators is obtained by replacing each total with a model-assisted estimator based on nonparametric regression. To obtain the asymptotic variance, the resulting complex statistic is then linearized using the influence function technique proposed by Deville (1999).
Quantiles géométriques et sondage
In this work, we are interested in estimating the geometric quantile for data collected under a sampling design. We give a design-based estimator of the geometric quantile together with an iterative method for computing it from sample data. Under general conditions, we derive the asymptotic variance of the quantile estimator and propose a consistent estimator of this variance. The good behavior of the geometric quantile estimator is verified by a simulation study.
Imputation Procedures in Surveys Using Nonparametric and Machine Learning Methods: An Empirical Comparison
Nonparametric and machine learning methods are flexible methods for obtaining accurate predictions. Nowadays, data sets with a large number of predictors and complex structures are fairly common. In the presence of item nonresponse, nonparametric and machine learning procedures may thus provide a useful alternative to traditional imputation procedures for deriving a set of imputed values, which are then used to estimate study parameters defined as solutions of population estimating equations. In this paper, we conduct an extensive empirical investigation that compares a number of imputation procedures in terms of bias and efficiency in a wide variety of settings, including high-dimens…
Functional Principal Components Analysis with Survey Data
This work aims at performing Functional Principal Components Analysis (FPCA) with Horvitz-Thompson estimators when the observations are curves collected with survey sampling techniques. FPCA relies on estimations of the eigenelements of the covariance operator which can be seen as nonlinear functionals. Adapting to our functional context the linearization technique based on the influence function developed by Deville (1999), we prove that these estimators are asymptotically design unbiased and convergent. Under mild assumptions, asymptotic variances are derived for the FPCA estimators and convergent estimators of them are proposed. Our approach is illustrated with a simulation study and we…
Properties of Design-Based Functional Principal Components Analysis.
This work aims at performing Functional Principal Components Analysis (FPCA) with Horvitz-Thompson estimators when the observations are curves collected with survey sampling techniques. One important motivation for this study is that FPCA is a dimension reduction tool which is the first step to develop model assisted approaches that can take auxiliary information into account. FPCA relies on the estimation of the eigenelements of the covariance operator which can be seen as nonlinear functionals. Adapting to our functional context the linearization technique based on the influence function developed by Deville (1999), we prove that these estimators are asymptotically design unbiased and con…
Using complex surveys to estimate the $L_1$-median of a functional variable: application to electricity load curves
Mean profiles are widely used as indicators of the electricity consumption habits of customers. Currently, in Électricité de France (EDF), class load profiles are estimated using the point-wise mean function. Unfortunately, it is well known that the mean is highly sensitive to the presence of outliers, such as one or more consumers with unusually high levels of consumption. In this paper, we propose an alternative to the mean profile: the $L_1$-median profile, which is more robust. When dealing with large datasets of functional data (load curves for example), survey sampling approaches are useful for estimating the median profile without storing the whole data set. We propose here estimators of…