0000000001039174
AUTHOR
Luigi Augugliaro
An Extension of the DgLARS Method to High-Dimensional Relative Risk Regression Models
In recent years, clinical studies in which patients are routinely screened for many genomic features have become more common. The general aim of such studies is to find genomic signatures useful for treatment decisions and the development of new treatments. However, genomic data are typically noisy and high dimensional, and the number of features often exceeds the number of patients included in the study. For this reason, sparse estimators are usually used in the study of high-dimensional survival data. In this paper, we propose an extension of the differential geometric least angle regression method to high-dimensional relative risk regression models.
Genetic Network construction in CML gene expression profile data analysis
The aim of this paper is to define a new statistical framework to identify central modules in Gaussian Graphical Models (GGMs) estimated from gene expression data measured on a sample of patients with negative molecular response to imatinib. A central module is defined as a module of a GGM that contains differentially expressed genes.
miR-155 regulative network in FLT3 mutated acute myeloid leukemia
Background: Acute myeloid leukemia (AML) is a heterogeneous disorder with recurrent chromosomal alterations and molecular abnormalities. Among AML cases with normal karyotype (NK-AML), the FLT3 activating mutation, internal tandem duplication (FLT3-ITD), is present in about 30% of patients and confers an unfavorable outcome. Our previous data demonstrated specific up-regulation of miR-155 in FLT3-ITD+ AML. miR-155 is known to be directly implicated in normal hematopoiesis and in some pathologies such as myeloid hyperplasia and acute lymphoblastic leukemia. Methods and results: To investigate the potential influence of miR-155 de-regulation in FLT3-mutated AML we generated a transcrip…
An efficient algorithm to estimate the sparse group structure of an high-dimensional generalized linear model
Massive regression is one of the new frontiers of computational statistics. In this paper we propose a generalization of the group least angle regression method based on the differential geometrical structure of a generalized linear model specified by a fixed and known group structure of the predictors. An efficient algorithm is also proposed to compute the proposed solution curve.
Extending graphical models for applications: on covariates, missingness and normality
The authors of the paper “Bayesian Graphical Models for Modern Biological Applications” have put forward an important framework for making graphical models more useful in applied settings. In this discussion paper, we give a number of suggestions for making this framework even more suitable for practical scenarios. Firstly, we show that an alternative and simplified definition of covariate might make the framework more manageable in high-dimensional settings. Secondly, we point out that the inclusion of missing variables is important for practical data analysis. Finally, we comment on the effect that the Gaussianity assumption has in identifying the underlying conditional independence graph…
Generalizing LARS algorithm using differential geometry
We propose a path-following algorithm for generalized linear models that can be considered a differential geometric generalization of the LARS algorithm. In our approach we use differential geometry to generalize the equiangular condition on which the LARS algorithm is based, and then we use a predictor-corrector method to compute the solution path of the coefficients.
Variable selection with unbiased estimation: the CDF penalty
We propose a new SCAD-type penalty in general regression models. The new penalty can be considered a competitor of the LASSO, SCAD or MCP penalties, as it guarantees sparse variable selection, i.e., null regression coefficient estimates, while attenuating bias for the non-null estimates. In this work, the method is discussed, and some comparisons are presented.
Hierarchical Bayesian models for analysing fish biomass data. An application to Parapenaeus longirostris biomass data
The Mediterranean International Trawl Survey (MEDITS) programme provides spatially referenced ecological data. We adopted a hierarchical Bayesian model to analyse Parapenaeus longirostris biomass data. The model comprises three parts, each of which identifies: the variability due to the explanatory variables, the variability due to the spatial domain (seen as a Gaussian Process) and the irregular component modelled as white noise. The estimated parameters show that some seabed characteristics affect biomass quantity and that the estimated behaviour of the Gaussian Process changes over different groups of years.
The Joint Censored Gaussian Graphical Lasso Model
The Gaussian graphical model is one of the most used tools for inferring genetic networks. Nowadays, the data are often collected from different sources or under different biological conditions, resulting in heterogeneous datasets that exhibit a dependency structure that varies across groups. The complex structure of these data is typically recovered using regularized inferential procedures that use two penalties, one that encourages sparsity within each graph and the other that encourages common structures among the different groups. To this date, these approaches have not been developed for handling the case of censored data. However, these data are often generated by gene expression tech…
A Software Tool For Sparse Estimation Of A General Class Of High-dimensional GLMs
Generalized linear models are the workhorse of many inferential problems. Also in the modern era with high-dimensional settings, such models have been proven to be effective exploratory tools. Most attention has been paid to Gaussian, binomial and Poisson settings, which have efficient computational implementations and where either the dispersion parameter is largely irrelevant or absent. However, general GLMs have dispersion parameters φ that affect the value of the log-likelihood. This, in turn, affects the value of various information criteria such as AIC and BIC, and has a considerable impact on the computation and selection of the optimal model. The R-package dglars is one of the standa…
cglasso: An R Package for Conditional Graphical Lasso Inference with Censored and Missing Values
Sparse graphical models have revolutionized multivariate inference. With the advent of high-dimensional multivariate data in many applied fields, these methods are able to detect a much lower-dimensional structure, often represented via a sparse conditional independence graph. There have been numerous extensions of such methods in the past decade. Many practical applications have additional covariates or suffer from missing or censored data. Despite the development of these extensions of sparse inference methods for graphical models, there have been so far no implementations for, e.g., conditional graphical models. Here we present the general-purpose package cglasso for estimating sparse co…
A differential-geometric approach to generalized linear models with grouped predictors
We propose an extension of the differential-geometric least angle regression method to perform sparse group inference in a generalized linear model. An efficient algorithm is proposed to compute the solution curve. The proposed group differential-geometric least angle regression method has important properties that distinguish it from the group lasso. First, its solution curve is based on the invariance properties of a generalized linear model. Second, it adds groups of variables based on a group equiangularity condition, which is shown to be related to score statistics. An adaptive version, which includes weights based on the Kullback-Leibler divergence, improves its variable selection fea…
Quantile regression via iterative least squares computations
We present an estimating framework for quantile regression where the usual L1-norm objective function is replaced by its smooth parametric approximation. An exact path-following algorithm is derived, leading to the well-known ‘basic’ solutions interpolating exactly a number of observations equal to the number of parameters being estimated. We briefly discuss possible practical implications of the proposed approach, such as early stopping for large data sets, confidence intervals, and additional topics for future research.
Differential geometric least angle regression: a differential geometric approach to sparse generalized linear models
Sparsity is an essential feature of many contemporary data problems. Remote sensing, various forms of automated screening and other high throughput measurement devices collect a large amount of information, typically about few independent statistical subjects or units. In certain cases it is reasonable to assume that the underlying process generating the data is itself sparse, in the sense that only a few of the measured variables are involved in the process. We propose an explicit method of monotonically decreasing sparsity for outcomes that can be modelled by an exponential family. In our approach we generalize the equiangular condition in a generalized linear model. Although the …
A Statistical Calibration Method based on Non-Linear Mixed Model for Affymetrix Probe Level Data
Gene expression microarrays allow a researcher to measure the simultaneous response of thousands of genes to external conditions. Affymetrix GeneChip® expression array technology has become a standard tool in medical research. However, a preprocessing step is usually necessary in order to obtain a gene expression measure. The aim of this paper is to propose a calibration method to estimate the nominal concentration based on a non-linear mixed model. This method is an enhancement of a method proposed in Mineo et al. (2006). The relationship between raw intensities and concentration is obtained by using the Langmuir isotherm theory.
Estimation of sparse generalized linear models: the dglars package
dglars is a publicly available R package that implements the method proposed in Augugliaro, Mineo and Wit (2013), developed to study the sparse structure of a generalized linear model. This method, called dgLARS, is based on a differential geometrical extension of the least angle regression method (LARS). The core of the dglars package consists of two algorithms implemented in Fortran 90 to efficiently compute the solution curve: a predictor-corrector algorithm and a cyclic coordinate descent algorithm.
Mixing modelling ideas for microarray data
Mixed models have typically been used for modelling structural effects in the presence of random variations. These types of models can be used rather naturally when we work with microarray data. In this paper, we shall look at two extensions of the usual mixed effect models.
Sparse relative risk survival modelling
Cancer survival is thought to be closely linked to the genomic constitution of the tumour. Discovering such signatures will be useful in the diagnosis of the patient and may be used for treatment decisions and perhaps even the development of new treatments. However, genomic data are typically noisy and high-dimensional, with the number of features often outstripping the number of patients included in the study. Regularized survival models have been proposed to deal with such scenarios. These methods typically induce sparsity by means of a coincidental match of the geometry of the convex likelihood and (near) non-convex regularizer.
Differential expression of specific microRNA and their targets in acute myeloid leukemia
Acute myeloid leukemia (AML), the most common acute leukemia in adults, is characterized by various cytogenetic and molecular abnormalities. However, the genetic etiology of the disease is not yet fully understood. MicroRNAs (miRNA) are small noncoding RNAs which regulate the expression of target mRNAs at both the transcriptional and translational level. In recent years, miRNAs have been identified as a novel mechanism in gene regulation, and they show variable expression during myeloid differentiation. We studied miRNA expression of leukemic blasts of 29 cases of newly diagnosed and genetically defined AML using quantitative reverse transcription polymerase chain reaction (RT-PCR) for 365 human miR…
A new proposal for microarray background correction by means of a GLMM
Microarray technology has the great advantage of simultaneously measuring the expression level of thousands of genes. The large amount of information provided by a single chip is offset by the need for adequate preprocessing of the raw data in order to obtain a “reliable” measure of the gene expression level. The aim of this work is to analyse, by means of a generalized linear mixed model, the relationship between the observed intensity level and the concentration level, using the Spike-In experiments provided by Affymetrix. We therefore propose a new method for background correction.
Sparse relative risk regression models
Clinical studies where patients are routinely screened for many genomic features are becoming more routine. In principle, this holds the promise of being able to find genomic signatures for a particular disease. In particular, cancer survival is thought to be closely linked to the genomic constitution of the tumor. Discovering such signatures will be useful in the diagnosis of the patient, may be used for treatment decisions and, perhaps, even the development of new treatments. However, genomic data are typically noisy and high-dimensional, and the number of features often exceeds the number of patients included in the study. Regularized survival models have been proposed to deal with such scenarios…
L1-Penalized Censored Gaussian Graphical Model
Graphical lasso is one of the most used estimators for inferring genetic networks. Despite its diffusion, there are several fields in applied research where the limits of detection of modern measurement technologies make the use of this estimator theoretically unfounded, even when the assumption of a multivariate Gaussian distribution is satisfied. Typical examples are data generated by polymerase chain reactions and flow cytometry. The combination of censoring and high-dimensionality makes inference of the underlying genetic networks from these data very challenging. In this article, we propose an $\ell_1$-penalized Gaussian graphical model for censored data and derive two EM-like algorithm…
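To illustrate why limits of detection change the likelihood, the following is a minimal univariate sketch, not the multivariate penalized model of the paper. It assumes a single Gaussian variable with known lower and upper detection limits; the function names (`norm_logpdf`, `norm_cdf`, `censored_loglik`) are hypothetical. Observations at or beyond a limit contribute a tail probability instead of a density value.

```python
import math


def norm_logpdf(x, mu, sigma):
    """Log-density of a N(mu, sigma^2) variable at x."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)


def norm_cdf(x, mu, sigma):
    """Distribution function of a N(mu, sigma^2) variable at x."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def censored_loglik(values, lower, upper, mu, sigma):
    """Log-likelihood of Gaussian data censored at [lower, upper].

    Assumption for this sketch: a value recorded at or below `lower`
    (at or above `upper`) is treated as left- (right-) censored.
    """
    total = 0.0
    for x in values:
        if x <= lower:
            # Left-censored: only P(X <= lower) is known.
            total += math.log(norm_cdf(lower, mu, sigma))
        elif x >= upper:
            # Right-censored: only P(X >= upper) is known.
            total += math.log(1.0 - norm_cdf(upper, mu, sigma))
        else:
            # Fully observed value: ordinary Gaussian log-density.
            total += norm_logpdf(x, mu, sigma)
    return total
```

Ignoring the censoring and plugging the detection limits into the Gaussian density would bias the estimates of the mean and variance, which is the kind of problem the abstract refers to.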
Extended differential geometric LARS for high-dimensional GLMs with general dispersion parameter
A large class of modeling and prediction problems involves outcomes that belong to an exponential family distribution. Generalized linear models (GLMs) are a standard way of dealing with such situations. Even in high-dimensional feature spaces GLMs can be extended to deal with such situations. Penalized inference approaches, such as the $\ell_1$ or SCAD, or extensions of least angle regression, such as dgLARS, have been proposed to deal with GLMs with high-dimensional feature spaces. Although the theory underlying these methods is in principle generic, the implementation has remained restricted to dispersion-free models, such as the Poisson and logistic regression models. The aim of this…
Statistical Analysis of the Gene Expression Profile in Patients with Chronic Myeloid Leukemia and Innately Resistant to Imatinib
ℓ1-Penalized Methods in High-Dimensional Gaussian Markov Random Fields
In the last 20 years, we have witnessed the dramatic development of new data acquisition technologies that allow massive amounts of data to be collected at relatively low cost. This new feature led Donoho to define the twenty-first century as the century of data. A major characteristic of these modern data sets is that the number of measured variables is larger than the sample size; the term high-dimensional data analysis refers to the statistical methods developed to make inference with this new kind of data. This chapter is devoted to the study of some of the most recent ℓ1-penalized methods proposed in the literature to make sparse inference in a Gaussian Markov random field (GMRF) defined …
An enhancement of the plaid model algorithm
Microarrays have become a standard tool for studying gene functions. For example, we can investigate if a subset of genes shows a coherent expression pattern under different conditions. The plaid model, a model-based biclustering method, can be used to incorporate the addiction structure used for the microarray experiment. In this paper we describe an enhancement for the plaid model algorithm based on the theory of the false discovery rate.
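The abstract does not say which false discovery rate procedure the enhancement relies on. As a hedged illustration only, the sketch below implements the Benjamini-Hochberg step-up procedure, which is one standard way to control the FDR; the function name `benjamini_hochberg` and the default level `alpha=0.05` are assumptions made for this example.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans marking the hypotheses rejected by the
    Benjamini-Hochberg step-up procedure at FDR level alpha."""
    m = len(p_values)
    # Indices of the p-values sorted in increasing order.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest rank k such that p_(k) <= alpha * k / m.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / m:
            cutoff_rank = rank
    # Reject every hypothesis whose rank is at most the cutoff.
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected
```

In a biclustering setting one could, for instance, apply such a rule to per-gene or per-condition p-values when deciding which memberships to keep; that use is a guess at the general idea, not a description of the paper's pruning step.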
Using Differential Geometry for Sparse High-Dimensional Risk Regression Models
With the introduction of high-throughput technologies in clinical and epidemiological studies, the need for inferential tools that are able to deal with fat data-structures, i.e., relatively small number of observations compared to the number of features, is becoming more prominent. In this paper we propose an extension of the dgLARS method to high-dimensional risk regression models. The main idea of the proposed method is to use the differential geometric structure of the partial likelihood function in order to select the optimal subset of covariates.
SPARSE INFERENCE IN COVARIATE ADJUSTED CENSORED GAUSSIAN GRAPHICAL MODELS
The covariate adjusted glasso is one of the most used estimators for inferring genetic networks. Despite its diffusion, there are several fields in applied research where the limits of detection of modern measurement technologies make the use of this estimator theoretically unfounded, even when the assumption of a multivariate Gaussian distribution is satisfied. In this paper we propose an extension to censored data.
Gene Expression Profile of Chronic Myeloid Leukemia Innately Resistant to Imatinib
Background. Most chronic myeloid leukemia patients who receive imatinib as first-line therapy will obtain, after 12 months of treatment, complete cytogenetic and molecular response. However, several cases will not achieve molecular response, and their innate mechanism(s) of resistance remain poorly understood. We tried to explore the molecular events involved in innate resistance in CML. Study design. Five patients who were molecular “non-responders” and seven “major” responders were investigated by using the expression profile of a set of 380 genes. Multiple testing procedure (MTP), Significance Analysis of Microarrays (SAM), Empirical Bayes Analysis of Microarrays (EBAM), False Discovery Rate (…
Modelling the background correction in microarray data analysis
Microarray technology has been adopted in many areas of biomedical research for quantitative and highly parallel measurements of gene expressions. In this field, the high density oligonucleotide microarray technology is the most used platform; in this platform oligonucleotides of 25 base pairs are used as probe genes. Two types of probes are considered: perfect match (PM) and mismatch (MM) probes. In theory, MM probes are used to quantify and remove two types of error: optical noise and non specific binding. The correction of these two types of error is known as background correction. Preprocessing is an essential step of the analysis in which the intensity, read from each probe, is manipul…
A computational method to estimate sparse multiple Gaussian graphical models
In recent years several researchers have proposed the use of the Gaussian graphical model defined in a high-dimensional setting to explore the dependence relationships between random variables. Standard methods proposed in the literature are based on the use of a specific penalty function, such as the L1-penalty function. In this paper our aim is to estimate and compare two or more Gaussian graphical models defined in a high-dimensional setting. To this end, we propose a new computational method, based on the glasso method, which lets us extend the notion of p-value.
An extension of the censored gaussian lasso estimator
The conditional glasso is one of the most used estimators for inferring genetic networks. Despite its diffusion, there are several fields in applied research where the limits of detection of modern measurement technologies make the use of this estimator theoretically unfounded, even when the assumption of a multivariate Gaussian distribution is satisfied. In this paper we propose an extension to censored data.
Using differential LARS algorithm to study the expression profile of a sample of patients with latex-fruit syndrome
Natural rubber latex IgE-mediated hypersensitivity is one of the most important health problems in allergy during recent years. The prevalence of individuals allergic to latex shows an associated hypersensitivity to some plant-derived foods, especially freshly consumed fruit. This association of latex allergy and allergy to plant-derived foods is called latex-fruit syndrome. The aim of this study is to use the differential geometric generalization of the LARS algorithm to identify candidate genes that may be associated with the pathogenesis of allergy to latex or vegetable food.
Prediction of the gene expression measure by means of a GLMM
Microarrays allow scientists to screen thousands of genes simultaneously to determine, for example, whether those genes are active, hyperactive or silent in normal or cancerous tissues. A primary task in microarray analysis is to obtain a good measure of the gene expression that can be used for a so-called higher-level analysis. Different methods have been proposed for high density oligonucleotide arrays (see Cope et al. (2004) for a review). The aim of this paper is to obtain a new gene expression measure based on the background correction model proposed by Mineo et al. (2006). The proposed method is validated by means of a freely available data-set called Spike-In133 experiment, wher…
The conditional censored graphical lasso estimator
© 2020, Springer Science+Business Media, LLC, part of Springer Nature. In many applied fields, such as genomics, different types of data are collected on the same system, and it is not uncommon that some of these datasets are subject to censoring as a result of the measurement technologies used, such as data generated by polymerase chain reactions and flow cytometry. When the overall objective is that of network inference, at possibly different levels of a system, information coming from different sources and/or different steps of the analysis can be integrated into one model with the use of conditional graphical models. In this paper, we develop a doubly penalized inferential procedure for…
Plaid model for microarray data: an enhancement of the pruning step
Microarrays have become a standard tool for studying gene functions. For example, we can investigate if a subset of genes shows a coherent expression pattern under different conditions. The plaid model, a model-based biclustering method, can be used to incorporate the addiction structure used for the microarray experiment. In this paper we describe an enhancement for the plaid model algorithm based on the theory of the false discovery rate.
Covariate adjusted censored gaussian lasso estimator
The covariate adjusted glasso is one of the most used estimators for inferring genetic networks. Despite its diffusion, there are several fields in applied research where the limits of detection of modern measurement technologies make the use of this estimator theoretically unfounded, even when the assumption of a multivariate Gaussian distribution is satisfied. In this paper we propose an extension to censored data.
Variable Selection with Quasi-Unbiased Estimation: the CDF Penalty
We propose a new non-convex penalty in linear regression models. The new penalty function can be considered a competitor of the LASSO, SCAD or MCP penalties, as it guarantees sparse variable selection while reducing bias for the non-null estimates. We introduce the methodology and present some comparisons among different approaches.
Model selection for factorial Gaussian graphical models with an application to dynamic regulatory networks.
Factorial Gaussian graphical Models (fGGMs) have recently been proposed for inferring dynamic gene regulatory networks from genomic high-throughput data. In the search for true regulatory relationships amongst the vast space of possible networks, these models allow the imposition of certain restrictions on the dynamic nature of these relationships, such as Markov dependencies of low order – some entries of the precision matrix are a priori zeros – or equal dependency strengths across time lags – some entries of the precision matrix are assumed to be equal. The precision matrix is then estimated by ℓ1-penalized maximum likelihood, imposing a further constraint on the absolute value…
Applying differential geometric LARS algorithm to ultra-high dimensional feature space
Variable selection is fundamental in high-dimensional statistical modeling. Many techniques to select relevant variables in generalized linear models are based on a penalized likelihood approach. In a recent paper, Fan and Lv (2008) proposed a sure independence screening (SIS) method to select relevant variables in a linear regression model defined on an ultrahigh dimensional feature space. The aim of this paper is to define a generalization of the SIS method for generalized linear models based on a differential geometric approach.
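For orientation, the screening idea for the linear model can be sketched as follows: rank the predictors by the absolute marginal correlation with the response and keep the top d. This is a minimal sketch of the linear-model screening step only, not the differential geometric generalization proposed in the paper; the function names `marginal_correlations` and `sis_select` and the list-of-rows data layout are assumptions for this example.

```python
import math


def marginal_correlations(X, y):
    """Absolute Pearson correlation between each column of X and y.

    X is a list of rows (one row per observation), y a list of responses.
    """
    n = len(y)
    p = len(X[0])
    y_mean = sum(y) / n
    y_dev = [v - y_mean for v in y]
    y_norm = math.sqrt(sum(d * d for d in y_dev))
    scores = []
    for j in range(p):
        col = [row[j] for row in X]
        c_mean = sum(col) / n
        c_dev = [v - c_mean for v in col]
        c_norm = math.sqrt(sum(d * d for d in c_dev))
        if c_norm == 0.0 or y_norm == 0.0:
            # A constant column (or response) carries no marginal signal.
            scores.append(0.0)
        else:
            dot = sum(a * b for a, b in zip(c_dev, y_dev))
            scores.append(abs(dot) / (c_norm * y_norm))
    return scores


def sis_select(X, y, d):
    """Indices of the d predictors with the largest marginal correlation."""
    scores = marginal_correlations(X, y)
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:d])
```

After screening, a sparse estimator is typically fitted on the retained predictors only, which keeps the second stage computationally manageable.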
dglars: An R Package to Estimate Sparse Generalized Linear Models
dglars is a publicly available R package that implements the method proposed in Augugliaro, Mineo, and Wit (2013), developed to study the sparse structure of a generalized linear model. This method, called dgLARS, is based on a differential geometrical extension of the least angle regression method proposed in Efron, Hastie, Johnstone, and Tibshirani (2004). The core of the dglars package consists of two algorithms implemented in Fortran 90 to efficiently compute the solution curve: a predictor-corrector algorithm, proposed in Augugliaro et al. (2013), and a cyclic coordinate descent algorithm, proposed in Augugliaro, Mineo, and Wit (2012). The latter algorithm, as shown here, is significan…
A statistical calibration model for Affymetrix probe level data
Gene expression microarrays allow a researcher to measure the simultaneous response of thousands of genes to external conditions. Affymetrix GeneChip® expression array technology has become a standard tool in medical research. However, a preprocessing step is usually necessary in order to obtain a gene expression measure. The aim of this paper is to propose a calibration method to estimate the nominal concentration based on a nonlinear mixed model. This method is an enhancement of a method proposed in Mineo et al. (2006). The relationship between raw intensities and concentration is obtained by using the Langmuir isotherm theory.
Random effects elliptically distributed in unbalanced linear models
In linear mixed effects models, random effects are used for modelling the variance-covariance structure of the response variable. These models are based on the assumption that the random effects are normally distributed, but alternative random effect distributions have been proposed in the literature, and the consequences of misspecification have been investigated. These studies consider only balanced designs. The aim of this paper is to study an unbalanced linear mixed model with elliptically distributed random effects.
Using the dglars Package to Estimate a Sparse Generalized Linear Model
dglars is a publicly available R package that implements the method proposed in Augugliaro et al. (J. R. Statist. Soc. B 75(3), 471-498, 2013) developed to study the sparse structure of a generalized linear model (GLM). This method, called dgLARS, is based on a differential geometrical extension of the least angle regression method. The core of the dglars package consists of two algorithms implemented in Fortran 90 to efficiently compute the solution curve.
Identifying modularity structure of a genetic network in gene expression profile data
The aim of this paper is to define a new statistical framework to identify central modules in Gaussian Graphical Models (GGMs) estimated from gene expression data measured on a sample of patients with negative molecular response to imatinib. Imatinib is a drug used to treat certain types of cancer that in many statistical studies has been reported to have a significant clinical effect on chronic myeloid leukemia (CML) in chronic phase as well as in blast crisis. By a central module of a GGM we mean a module containing genes that are differentially expressed.
Using differential geometric LARS algorithm to study the expression profile of a sample of patients with latex-fruit syndrome
Natural rubber latex IgE-mediated hypersensitivity is one of the most important health problems in allergy during recent years. The prevalence of individuals allergic to latex shows an associated hypersensitivity to some plant-derived foods, especially freshly consumed fruit. This association of latex allergy and allergy to plant-derived foods is called latex-fruit syndrome. The aim of this study is to use the differential geometric generalization of the LARS algorithm to identify candidate genes that may be associated with the pathogenesis of allergy to latex or vegetable.