Evaluation of the effect of chance correlations on variable selection using Partial Least Squares-Discriminant Analysis
Variable subset selection is often mandatory in high-throughput metabolomics and proteomics. However, depending on the variable-to-sample ratio, variable selection can be highly susceptible to chance correlations. Cross-validated estimates of the predictive capability of PLSDA models after feature selection are overly optimistic if the selection is performed on the entire data set and no external validation set is available. In this work, a simulation of the statistical null hypothesis is proposed to test whether the discrimination capability of a PLSDA model after variable selection estimated by cross-validation is statistically higher than t…
Using Unfold-PCA for batch-to-batch start-up process understanding and steady-state identification in a sequencing batch reactor
In chemical and biochemical processes, steady-state models are widely used for process assessment, control and optimisation. In these models, parameter adjustment requires data collected under nearly steady-state conditions. Several approaches have been developed for steady-state identification (SSID) in continuous processes, but no attempt has been made to adapt them to the singularities of batch processes. The main aim of this paper is to propose an automated method based on batch-wise unfolding of the three-way batch process data followed by a principal component analysis (Unfold-PCA) in combination with the methodology of Brown and Rhinehart 2 for SSID. A second goal of this paper is to…
Process understanding of a wastewater batch reactor with block-wise PLS
In this work, a systematic methodology, ‘block-wise PLS’, has been applied to thoroughly analyse data from a sequencing batch reactor (SBR) operated for biological phosphorus removal from wastewater. The aim of this study was to identify process variables (collected by the inexpensive and low-maintenance sensors installed in the SBR) likely to be related to the key indicator of process performance: the phosphorus removal efficiency (PRE), determined off-line in the quality control laboratory. The intention is to help process operators detect abnormal values of these critical variables, which would indicate undesirable process performance, so that they could act on the…
Missing Data
In this chapter, we deal with the problem of missing data in principal component analysis (PCA) and partial least squares (PLS) methods. First, we review several statistical methods proposed in the literature for handling missing data. Both single and multiple imputation (MI) methods are studied and compared using simulated data. After this, we particularize the missing data problem for building and exploiting multivariate calibration models. Several approaches proposed in the literature are introduced and their performance compared based on several real data sets.
MCR-ALS on metabolic networks: Obtaining more meaningful pathways
With the aim of understanding the flux distributions across a metabolic network, i.e. within living cells, Principal Component Analysis (PCA) has been proposed to obtain a set of orthogonal components (pathways) capturing most of the variance in the flux data. The problems with this method are (i) that no additional information can be included in the model, and (ii) that orthogonality imposes a hard constraint, which is not always reasonable. To overcome these drawbacks, here we propose to use a more flexible approach, Multivariate Curve Resolution-Alternating Least Squares (MCR-ALS), to obtain this set of biological pathways through the network. By using this method, different constraint…
Metabolic flux understanding of Pichia pastoris grown on heterogenous culture media
Within the emergent field of Systems Biology, mathematical models of microbial systems obtained from physical chemical laws (the so-called first principles-based models) are employed to discern the principles that govern cellular behaviour and achieve a predictive understanding of cellular functions. The reliance on this biochemical knowledge has the drawback that some of the assumptions (specific kinetics of the reaction system, unknown dynamics and values of the model parameters) may not be valid for all the possible metabolic states of the network. In this context of uncertainty, the combined use of fundamental knowledge and data measured in the fermentation that describe the behaviour…
Multivariate SPC of a sequencing batch reactor for wastewater treatment
Data from a sequencing batch reactor (SBR) operated for enhanced biological phosphorus removal from wastewater have been analysed in order to propose an efficient MSPC scheme of the process. Different multivariate bilinear approaches have been applied and compared in terms of their capabilities for on-line and off-line fault detection and diagnosis. The typical three-way data structure from a batch process was unfolded batch-wise and variable-wise. In the latter case, two models were built: with (AT) and without (WKFH) removing the main non-linear behaviour of the process data. Since the process consists of several stages, the monitoring strategies tested include: one model for all stages a…
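The two unfolding schemes mentioned above can be illustrated with a minimal sketch. It assumes the three-way batch data are stored as a NumPy array of shape (batches, time points, variables). That layout, the function names, and the toy data are illustrative assumptions and are not taken from the paper.

```python
import numpy as np


def unfold_batch_wise(x):
    """Reshape (I, K, J) to (I, K*J): one row per batch.

    Assumes axis 0 = batches, axis 1 = time points, axis 2 = variables.
    """
    i, k, j = x.shape
    return x.reshape(i, k * j)


def unfold_variable_wise(x):
    """Reshape (I, K, J) to (I*K, J): one row per batch and time point."""
    i, k, j = x.shape
    return x.reshape(i * k, j)


# Toy example: 3 batches, 4 time points, 2 variables (hypothetical data).
data = np.arange(3 * 4 * 2, dtype=float).reshape(3, 4, 2)
bw = unfold_batch_wise(data)     # shape (3, 8)
vw = unfold_variable_wise(data)  # shape (12, 2)
```

In the batch-wise matrix each row holds one complete batch trajectory, so a bilinear model fitted to it compares batches with each other. In the variable-wise matrix each row is a single observation in time.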
How to simulate normal data sets with the desired correlation structure
The Cholesky decomposition is a widely used method to draw samples from a multivariate normal distribution with a non-singular covariance matrix. In this work we introduce a simple method based on singular value decomposition (SVD) to simulate multivariate normal data even if the covariance matrix is singular, which is often the case in chemometric problems. The covariance matrix can be specified by the user or can be generated by specifying a subset of the eigenvalues. The latter can be an advantage for simulating data sets with a particular latent structure. This can be useful for testing the performance of chemometric methods with data sets matching the theoretical conditions for their app…
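A minimal sketch of the general idea follows. It assumes a symmetric positive semi-definite covariance matrix and uses NumPy. The function names and the rank-one example are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def svd_factor(cov):
    """Return A such that A @ A.T equals cov, for symmetric PSD cov.

    Unlike a Cholesky factor, this also works when cov is singular.
    """
    cov = np.asarray(cov, dtype=float)
    u, s, _ = np.linalg.svd(cov)   # cov = U diag(s) U^T for PSD cov
    return u * np.sqrt(s)          # scale each column of U by sqrt(s)


def sample_mvn_svd(mean, cov, n_samples, seed=None):
    """Draw n_samples rows from N(mean, cov) using the SVD factor."""
    rng = np.random.default_rng(seed)
    mean = np.asarray(mean, dtype=float)
    a = svd_factor(cov)
    z = rng.standard_normal((n_samples, mean.size))  # independent N(0, 1)
    return mean + z @ a.T


# Hypothetical rank-one (singular) covariance: both variables coincide.
cov = np.array([[1.0, 1.0],
                [1.0, 1.0]])
x = sample_mvn_svd([0.0, 0.0], cov, 1000, seed=0)
```

Because this covariance has rank one, the two simulated columns are equal up to rounding, a case in which `np.linalg.cholesky` would typically raise an error.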
Comparison of different predictive models for nutrient estimation in a sequencing batch reactor for wastewater treatment
In this paper, different predictive models for nutrient estimation in a sequencing batch reactor (SBR) for wastewater treatment are compared: principal component regression (PCR), partial least squares (PLS), and artificial neural networks (ANNs). Two unfolding procedures were used: batch-wise and variable-wise. For the latter unfolding method, X and Y matrix augmentation with lagged variables was used in some models to incorporate process dynamics. The results have shown that batch-wise unfolding PLS models outperform the other approaches. The ANN models are good predictive models, but in this particular case study, they do not outperform those multivariate projection models that …