AUTHOR
Mariantonietta Ruggieri
A Multisite-Multipollutant Air Quality Index
In this paper, starting from a multivariate spatio-temporal array containing air pollution data collected for the main pollutants at different monitoring sites over a 1-year period, a new approach is proposed to obtain a Multipollutant-Multisite Air Quality Index (AQI) time series. A two-step aggregation, over space and over pollutants, is considered. For the first aggregation (spatial synthesis), a PCA is performed on the suitably rearranged data array, while the index I2, proposed in Ruggieri and Plaia (2011), is used for the second aggregation (pollutant synthesis), yielding the new index I2MS. Daily data on four air pollutants from the city of Palermo (Italy) are analyzed…
Filling in long gap sequences by performing jointly EOF and FDA
In this paper the EOF methodology is applied jointly with the FDA approach to a spatiotemporal multivariate data set, with the aim of filling in missing values as accurately as possible when long gap sequences occur. Simulated data sets containing "artificial" gaps are considered in order to test the performance of two proposed procedures: in the first, observed data are reconstructed by EOF and then converted into functional data; in the second, observed data are transformed into functional data and then EOF reconstruction is applied. By comparing some performance indicators computed for the two procedures, it is shown that a pre-processing of data by FDA, followed by the EOF, may …
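As a rough illustration of the EOF reconstruction step mentioned above, the sketch below fills gaps by repeatedly replacing the missing entries with their low-rank approximation. It is a minimal sketch, not the procedure used in the paper: the function names (`rank1_approx`, `eof_fill`), the single retained mode, the mean initialization, the absence of centering and of any FDA smoothing, and the fixed iteration counts are all assumptions made for illustration.

```python
import math


def rank1_approx(X, iters=100):
    """Leading-mode (rank-1) approximation of X, computed by power iteration."""
    n, m = len(X), len(X[0])
    v = [1.0] * m
    u = [0.0] * n
    for _ in range(iters):
        # u <- X v, normalized to unit length
        u = [sum(X[i][j] * v[j] for j in range(m)) for i in range(n)]
        norm = math.sqrt(sum(a * a for a in u)) or 1.0
        u = [a / norm for a in u]
        # v <- X^T u (carries the singular value)
        v = [sum(X[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * v[j] for j in range(m)] for i in range(n)]


def eof_fill(X, iters=200):
    """Fill entries marked None by iterating a rank-1 reconstruction.

    Assumption: one mode is enough; real applications choose the number
    of modes (e.g. by cross-validation) and usually work on anomalies.
    """
    missing = [(i, j) for i, row in enumerate(X)
               for j, x in enumerate(row) if x is None]
    observed = [x for row in X for x in row if x is not None]
    start = sum(observed) / len(observed)
    filled = [[start if x is None else x for x in row] for row in X]
    for _ in range(iters):
        approx = rank1_approx(filled)
        for i, j in missing:
            # Only the gaps are updated; observed values are kept as they are.
            filled[i][j] = approx[i][j]
    return filled


# Toy data (hypothetical): rows are time points, columns are sites.
data = [[1.0, 2.0, 4.0],
        [2.0, 4.0, None],
        [3.0, 6.0, 12.0]]
filled = eof_fill(data)
print(round(filled[1][2], 3))
```

Because the toy matrix is exactly rank one, the filled entry should approach 8, the value implied by the other rows and columns; with real, noisy data the result depends on the number of modes retained.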
Extending Functional kriging to a multivariate context
Environmental data usually have a spatio-temporal structure; pollutant concentrations, for example, are recorded over time and space. Generalized Additive Models (GAMs) are a suitable tool for modelling the spatial and/or temporal trends of this kind of data, which can be treated as functional although they are collected as discrete observations. Frequently, attention is focused on the prediction of a single pollutant at an unmonitored site and, to this end, we extend kriging for functional data to a multivariate context by exploiting the correlation with the other pollutants. In particular, we propose two procedures: the first one (FKED) combines the regression of a variable (pollutant),…
A Statistical Calibration Method based on Non-Linear Mixed Model for Affymetrix Probe Level Data
Gene expression microarrays allow a researcher to measure the simultaneous response of thousands of genes to external conditions. Affymetrix GeneChip® expression array technology has become a standard tool in medical research. However, a preprocessing step is usually necessary in order to obtain a gene expression measure. The aim of this paper is to propose a calibration method, based on a non-linear mixed model, to estimate the nominal concentration. This method is an enhancement of a method proposed in Mineo et al. (2006). The relationship between raw intensities and concentration is obtained by using the Langmuir isotherm theory.
A new proposal for microarray background correction by means of a GLMM
Microarray technology has the great advantage of simultaneously measuring the expression level of thousands of genes. The large amount of information provided by a single chip is offset by the need for adequate pre-processing of the raw data in order to obtain a "reliable" measure of the gene expression level. The aim of this work is to analyze, by means of a generalized linear mixed model, the relationship between the observed intensity level and the concentration level, using the Spike-In experiments provided by Affymetrix. A new method for background correction is then proposed.
An aggregate air quality index considering interactions among pollutants
Several countries provide an Air Quality Index (AQI) to communicate air pollution, but there is no unique and internationally accepted methodology for constructing it. Most of the proposed indices are based on the USA AQI by EPA and are defined by the value of the pollutant with the highest concentration. For each pollutant, a sub-index is computed by linear interpolation according to the grid in a table, but the breakpoints of such a table may differ from one country to another, as may the descriptors of each category, the air quality standards, the functions chosen as daily synthesis to aggregate hourly values at each site for each pollutant, and so on. However, the main drawback …
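The sub-index computation described above (linear interpolation on a breakpoint grid, followed by taking the worst pollutant) can be sketched as follows. This is only an illustrative sketch: the breakpoint values in `GRID` are hypothetical and do not reproduce any agency's table, and the function names `sub_index` and `aqi_max` are introduced here for the example.

```python
# Hypothetical breakpoint grid: for each pollutant, a list of
# (conc_low, conc_high, index_low, index_high) segments.
GRID = {
    "PM10": [(0.0, 50.0, 0.0, 50.0),
             (50.0, 150.0, 50.0, 100.0),
             (150.0, 350.0, 100.0, 200.0)],
    "NO2": [(0.0, 100.0, 0.0, 50.0),
            (100.0, 200.0, 50.0, 100.0),
            (200.0, 400.0, 100.0, 200.0)],
}


def sub_index(conc, segments):
    """Linearly interpolate a concentration onto the index scale."""
    for c_lo, c_hi, i_lo, i_hi in segments:
        if c_lo <= conc <= c_hi:
            return i_lo + (i_hi - i_lo) * (conc - c_lo) / (c_hi - c_lo)
    raise ValueError(f"concentration {conc} is outside the breakpoint grid")


def aqi_max(daily_conc):
    """Return (pollutant, value) for the largest sub-index of the day."""
    subs = {p: sub_index(c, GRID[p]) for p, c in daily_conc.items()}
    worst = max(subs, key=subs.get)
    return worst, subs[worst]


# One day of (made-up) daily syntheses for two pollutants.
print(aqi_max({"PM10": 100.0, "NO2": 50.0}))
```

Taking the maximum means that only one pollutant determines the reported value, whatever the levels of the others; an aggregate index that combines all sub-indices would replace `aqi_max` with a different synthesis function.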
Comparing air quality indices aggregated by pollutant
In this paper a new aggregate Air Quality Index (AQI), useful for describing the global air pollution situation for a given area, is proposed. The index, unlike most currently used AQIs, takes into account the combined effects of all the considered pollutants on human health. Its good performance, tested by means of a simulation plan, is confirmed by a comparison, in an application to real data, with two other indices proposed in the literature, one of which is based on the Relative Risk of daily mortality.
From a multivariate spatio-temporal array to a multipollutant - multisite Air Quality Index
AQIs are computed on air pollution data that are usually collected according to time, space and type of pollutant: in a given town/region, data consisting of hourly levels of K pollutants recorded at S monitoring sites are usually organized in a three-mode array. A first aggregation step usually concerns time, and allows one to pass from hourly data to a daily synthesis: in this paper data will be aggregated by time according to the guidelines provided by the national agencies, producing the three-mode array X. Here we will propose a new approach to obtain a Multipollutant-Multisite Air Quality Index time series from a multivariate spatio-temporal array. This implies a two-step aggregation, accord…
Air quality assessment via functional principal component analysis
Knowledge of the global urban air quality situation is the first step in tackling air pollution issues. Over the last decades many urban areas have been able to rely on a monitoring network recording hourly data for the main pollutants. Such data need to be aggregated according to different dimensions, such as time, space and type of pollutant, in order to provide a synthetic air quality index which takes into account interactions among pollutants and correlation among monitoring sites. This paper focuses on Functional Principal Component techniques for the statistical analysis of a set of environmental data x(spt), where s stands for the monitoring site, p for the pollutant and t for time, usuall…
A new index to measure association between categorical and ordinal variables
In this paper a new index to analyse the dependence between categorical variables is presented and compared with other measures of association, mainly based on Pearson's χ² statistic. The new index is also compared with well-known measures of cograduation. To restrict our comparisons, the domain we consider includes all the square contingency tables belonging to the same Fréchet class, that is, all the contingency tables in which the marginal frequencies of both characters are fixed. However, the new index also performs well when there are no constraints on the marginal distributions, but only a constraint on n, the total number of observations.
Functional Principal Component Analysis for the explorative analysis of multisite-multivariate air pollution time series with long gaps
Knowledge of urban air quality is the first step in tackling air pollution issues. Over the last decades many cities have been able to rely on a network of monitoring stations recording concentration values for the main pollutants. This paper focuses on functional principal component analysis (FPCA) to investigate multiple pollutant datasets measured over time at multiple sites within a given urban area. Our purpose is to extend what has been proposed in the literature to data that are multisite and multivariate at the same time. The approach proves effective in highlighting some relevant statistical features of the time series, giving the opportunity to identify significant pollutants and…
Laboratorio informatico-statistico con R
This manual introduces the R environment, an object-oriented programming language, and provides the essential tools for carrying out a statistical analysis of data. It is a valid support for students attending the courses in Descriptive Statistics (module I) and Inferential Statistics (module II), but also for researchers who are about to use this language for the first time. The text recalls the fundamental concepts and topics of basic Statistics and is accompanied by numerous examples and solved exercises, features that make it useful to scholars of other disciplines as well.
A Software Tool for the Exponential Power Distribution: The normalp Package
In this paper we present the normalp package, a package for the statistical environment R that has a set of tools for dealing with the exponential power distribution. In this package there are functions to compute the density function, the distribution function and the quantiles from an exponential power distribution and to generate pseudo-random numbers from the same distribution. Moreover, methods concerning the estimation of the distribution parameters are described and implemented. It is also possible to estimate linear regression models when we assume the random errors distributed according to an exponential power distribution. A set of functions is designed to perform simulation studi…
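For orientation, the density of an exponential power distribution can be written down in a few lines. The sketch below assumes the parametrization with location `mu`, scale `sigma_p` and shape `p` in which p = 2 gives the normal density and p = 1 the Laplace density; the function name `ep_density` is hypothetical and is not the interface of the R package.

```python
import math


def ep_density(x, mu=0.0, sigma_p=1.0, p=2.0):
    """Exponential power density (assumed parametrization):

    f(x) = exp(-|x - mu|^p / (p * sigma_p^p))
           / (2 * p^(1/p) * sigma_p * Gamma(1 + 1/p))
    """
    if sigma_p <= 0 or p <= 0:
        raise ValueError("sigma_p and p must be positive")
    norm_const = 2.0 * p ** (1.0 / p) * sigma_p * math.gamma(1.0 + 1.0 / p)
    return math.exp(-abs(x - mu) ** p / (p * sigma_p ** p)) / norm_const


# p = 2 reproduces the standard normal density at zero, 1 / sqrt(2 * pi).
print(ep_density(0.0, p=2.0), 1.0 / math.sqrt(2.0 * math.pi))
```

Under this parametrization the shape parameter controls the tails: values of p below 2 give heavier tails than the normal, values above 2 give lighter ones.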
Modelling the background correction in microarray data analysis
Microarray technology has been adopted in many areas of biomedical research for quantitative and highly parallel measurements of gene expression. In this field, high-density oligonucleotide microarray technology is the most widely used platform; in this platform, oligonucleotides of 25 base pairs are used as probe genes. Two types of probes are considered: perfect match (PM) and mismatch (MM) probes. In theory, MM probes are used to quantify and remove two types of error: optical noise and non-specific binding. The correction of these two types of error is known as background correction. Preprocessing is an essential step of the analysis in which the intensity, read from each probe, is manipul…
Functional principal component analysis of quantile curves
The literature on functional data analysis is mainly focused on the estimation of individual curves and the characterization of average dynamics. The idea underlying this proposal is to focus attention on other particular features of the distribution of the observed data, moving from mean functions towards functional quantiles. The motivating examples are functional data sets that are collections of high frequency data recorded over time. As quantiles provide information on various aspects of a time series, we propose a modelling framework for the joint estimation of functional quantiles, varying over time, and functional principal components, summarizing some common dynamics shared by the functiona…
Air quality indices: a review
National directives on air quality oblige nations to monitor and report on their air quality, allowing the public to be informed about ambient pollution levels. This is the reason for the ever-increasing interest in air quality/pollution indices, demonstrated by the number of publications on this topic in recent years: since the concentrations of individual pollutants can be confusing, concentration measurements are conveniently transformed into an air quality index. In this way, complex situations are summarized in a single figure, making comparisons in time and space possible. In this paper we will give an overview of the Air Quality/Pollution Indices proposed in lite…
Long gaps in multivariate spatio-temporal data: an approach based on functional data analysis
The main aim of this paper is to perform Functional Principal Component Analysis (FPCA) taking into account spatio-temporal correlation structures, in order to fill in missing values in spatio-temporal multivariate data sets. A spatial and a spatio-temporal variant of the classical temporal FPCA are considered; in other words, FPCA is carried out after modeling data with respect to more than one dimension: space (long, lat) or space+time. Moreover, multidimensional FPCA is extended to the multivariate context (more than one variable). Information on spatial or spatiotemporal structures is efficiently extracted by applying Generalized Additive Models (GAMs). Both simulation studies and some perfo…
Functional principal component analysis for multivariate multidimensional environmental data
Data with spatio-temporal structure arise in many contexts and have therefore generated considerable interest in modelling, but the complexity of spatio-temporal models, together with the size of the datasets, makes this a challenging task. Modelling is even more complex in the presence of multivariate data. Since some modelling problems are more natural to think through in functional terms, even if only a finite number of observations is available, treating the data as functional can be useful (Berrendero et al. in Comput Stat Data Anal 55:2619–2634, 2011). Although in Ramsay and Silverman (Functional data analysis, 2nd edn. Springer, New York, 2005) the case of multiva…
Prediction of the gene expression measure by means of a GLMM
Microarrays allow scientists to screen thousands of genes simultaneously to determine, for example, whether those genes are active, hyperactive or silent in normal or cancerous tissues. A primary task in microarray analysis is to obtain a good measure of gene expression that can be used for so-called higher level analysis. Different methods have been proposed for high density oligonucleotide arrays (see Cope et al. (2004) for a review). The aim of this paper is to obtain a new gene expression measure based on the background correction model proposed by Mineo et al. (2006). The proposed method is validated by means of a freely available data set called Spike-In133 experiment, wher…
Aggregate air pollution indices: a new proposal
A new aggregate Air Quality Index (I2) to represent the global air pollution situation for a given city/region is proposed. By accounting for simultaneous exposure to common pollutants and their effects on human health, this index improves on existing AQIs. Its performance and usefulness are shown by a simulation plan and by an application to a real dataset on the main pollutants.
Missing Data in Space-time: Long Gaps Imputation Based On Functional Data Analysis
High dimensional data with spatio-temporal structures are of great interest in many fields of research, but their exhibited complexity leads to practical issues when formulating statistical models. Functional data analysis through smoothing methods is a proper framework for incorporating space-time structures: extending the basic methodology to the multivariate spatio-temporal setting, we refer to Generalized Additive Models for estimating functional data taking the spatial and temporal dependences into account, and to Functional Principal Component Analysis as a classical dimension reduction technique to cope with the high dimensionality and with the number of estimated effects. Since spatial …
Principal components for multivariate spatiotemporal functional data
Multivariate spatio-temporal data consist of a three-way array in which the domains of two dimensions are structured, one temporally and one spatially; think, for example, of a set of different pollutant levels recorded for a month/year at different sites. In this kind of dataset we can recognize time series along one dimension, spatial series along another and multivariate data along the third dimension. Statistical techniques aimed at handling huge amounts of information are very important in this context, and classical dimension reduction techniques, such as Principal Components, are relevant, allowing one to compress the information without much loss. Although time series, as well as spatial series, are recor…
Comparing Spatial and Spatio-temporal FPCA to Impute Large Continuous Gaps in Space
Multivariate spatio-temporal data analysis methods usually assume fairly complete data, while a number of gaps often occur over time or in space. In air quality data long gaps may be due to instrument malfunctions; moreover, not all the pollutants of interest are measured at all the monitoring stations of a network. In the literature, many statistical methods have been proposed for imputing short sequences of missing values, but most of them are not valid when the fraction of missing values is high. Furthermore, the commonly used methods are limited in that they exploit only the temporal, or only the spatial, correlation of the data. The objective of this paper is to provide an approach based …
An Association Index for Rectangular Contingency Tables with Ordered/Unordered Variables
In a previous work, Mineo and Ruggieri (2005) introduce a new index to measure the association in square contingency tables; in this paper such an index is extended to rectangular tables, preserving the same properties. The considered domain includes all the contingency tables with equal n (the total number of observations), since the tables of maximum dependence, useful for computing the denominator of the proposed index, belong to it. The main findings, which prove the effectiveness of the proposed measure, are presented. In particular, the new measure assumes values in the range [-1,+1], taking value zero in tables with distributive independence and positive/negative values if associatio…
Empirical Orthogonal Function and Functional Data Analysis Procedures to Impute Long Gaps in Environmental Data
Air pollution data sets are usually spatio-temporal multivariate data related to time series of different pollutants recorded by a monitoring network. To improve the estimate of functional data when missing values, and mainly long gaps, are present in the original data set, some procedures are proposed here that jointly consider the Functional Data Analysis and Empirical Orthogonal Function approaches. In order to compare and validate the proposed procedures, a simulation plan is carried out and some performance indicators are computed. The results obtained show that one of the proposed procedures works better than the others, providing a better reconstruction, especially in the presence of long gaps.
A BoD Composite Indicator to Measure the Italian “Sole 24 Ore” Quality of Life
The measurement of Quality of Life (QoL) is still a widely discussed topic in the literature. In Italy, the newspaper “Il Sole 24 Ore” publishes a famous ranking that highlights strong disparities among provinces. In this paper, “Il Sole 24 Ore” and BoD-DEA methods are compared in order to show how different types of normalization and aggregation significantly influence the results, making these rankings very fragile and questionable.
An aggregate AQI: comparing different standardizations and introducing a variability index
Many studies demonstrate a strong relationship between air pollution and respiratory and cardiovascular diseases. For this reason, assessing air pollution, and conveying information about its possible adverse health effects, may encourage the population and policy makers to reduce those activities that increase pollution levels. In this paper a relative index of variability, to be associated with the aggregate Air Quality Index (AQI) among pollutants proposed by Ruggieri and Plaia (2011), is developed in order to better investigate air pollution conditions for the whole area of a city/region. The most widely-used and up to date pollution indices, based mainly on AQI computed by the US Environmenta…
Comparing FPCA Based on Conditional Quantile Functions and FPCA Based on Conditional Mean Function
In this work functional principal component analysis (FPCA) based on quantile functions is proposed as an alternative to the classical approach, based on the functional mean. Quantile regression characterizes the conditional distribution of a response variable and, in particular, some features like tail behavior; smoothing splines have also been usefully applied to quantile regression to allow for more flexible modelling. This framework finds application in contexts involving multiple high frequency time series, for which the functional data analysis (FDA) approach is a natural choice. Quantile regression is then extended to the estimation of functional quantiles and our proposal exp…
EOFs for gap filling in multivariate air quality data: a FDA approach
Missing values are a common concern in spatiotemporal data sets. In recent years a great number of methods have been developed for gap filling. One of the emerging approaches is based on the Empirical Orthogonal Function (EOF) methodology, applied mainly to raw and univariate data sets presenting irregular missing patterns. In this paper EOF is carried out on a multivariate space-time data set, related to concentrations of pollutants recorded at different sites, after denoising the raw data by an FDA approach. Some performance indicators are computed on simulated incomplete data sets that also contain long gaps, in order to show that the EOF reconstruction appears to be an improved procedure especially…
GAMs and functional kriging for air quality data
Data having spatio-temporal structure are often observed in environmental sciences. They may be considered as discrete observations from curves over time and/or space and treated as functional. Generalized Additive Models (GAMs) are a useful tool for modelling, for example, pollutant concentrations, describing their spatial and/or temporal trends. Usually, the prediction of a curve at an unmonitored site is necessary and, with this aim, we extend kriging for functional data to a multivariate context. Moreover, even if we are interested only in predicting a single pollutant, such as PM10, the estimation can be improved by exploiting its correlation with the other pollutants. Cross valid…