AUTHOR
Paulo J. G. Lisboa
Clustering categorical data: A stability analysis framework
Clustering to identify inherent structure is an important first step in data exploration. The k-means algorithm is a popular choice, but it is not generally appropriate for categorical data. A specific extension of k-means for categorical data is the k-modes algorithm. Both of these partition clustering methods are sensitive to the initialization of prototypes, which makes it difficult to select the best solution for a given problem. Selecting the number of clusters can also be an issue. Further, the k-modes method is especially prone to instability when presented with ‘noisy’ data, since the calculation of the mode lacks the smoothing effect inherent in the calculation …
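The abstract stops short of the method details; as a minimal illustrative sketch (not the authors' code), the core of k-modes pairs a simple-matching dissimilarity with a per-attribute mode update:

    import numpy as np

    def matching_dissimilarity(a, b):
        # Simple matching: count the attributes on which two records disagree.
        return np.sum(a != b, axis=-1)

    def k_modes(X, k, n_iter=20, rng=None):
        # X: (n_samples, n_attributes) array of integer-coded categories.
        rng = np.random.default_rng(rng)
        modes = X[rng.choice(len(X), size=k, replace=False)].copy()
        for _ in range(n_iter):
            # Assign each record to its nearest mode.
            d = np.stack([matching_dissimilarity(X, m) for m in modes])
            labels = d.argmin(axis=0)
            # Update each mode attribute-wise to the most frequent category.
            for j in range(k):
                members = X[labels == j]
                if len(members) == 0:
                    continue
                for a in range(X.shape[1]):
                    vals, counts = np.unique(members[:, a], return_counts=True)
                    modes[j, a] = vals[counts.argmax()]
        return labels, modes

Because the mode update keeps only the most frequent category and discards the rest of the frequency distribution, small perturbations of noisy data can flip a mode entirely, which is the instability the abstract describes.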
An AI Walk from Pharmacokinetics to Marketing
This work provides a review of real-life practical applications of Artificial Intelligence (AI) methods. We focus on the use of Machine Learning (ML) methods applied to real problems rather than to synthetic problems in standard, controlled environments. In particular, we describe the following problems in the next sections:
• Optimization of Erythropoietin (EPO) dosages in anaemic patients with Chronic Renal Failure (CRF).
• Optimization of a recommender system for citizen web portal users.
• Optimization of a marketing campaign.
These problems were chosen for their relevance and their heterogeneity. This heterogeneity shows the capabilities and versatility …
Towards interpretable classifiers with blind signal separation
Blind signal separation (BSS) is a powerful tool for decomposing complex signals into component sources that are often interpretable. However, BSS methods are generally unsupervised, so the assignment of class membership from the elements of the mixing matrix may be sub-optimal. This paper proposes a three-stage approach that uses the Fisher information metric to define a natural metric for the data, from which a Euclidean approximation can then be used to drive BSS. Results with synthetic data models of real-world high-dimensional data show that the classification accuracy of the method is good for challenging problems, while retaining interpretability.
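The abstract does not name a specific BSS algorithm; as an illustrative stand-in, independent component analysis recovers a mixing matrix whose elements can then be inspected for class association, which is the step the paper argues may be sub-optimal without supervision:

    import numpy as np
    from sklearn.decomposition import FastICA

    # Toy mixture: two latent sources observed through a random mixing matrix.
    rng = np.random.default_rng(0)
    t = np.linspace(0, 8, 2000)
    S = np.c_[np.sin(3 * t), np.sign(np.sin(5 * t))]   # latent sources
    A_true = rng.normal(size=(2, 2))                   # unknown mixing
    X = S @ A_true.T

    ica = FastICA(n_components=2, random_state=0)
    S_hat = ica.fit_transform(X)   # estimated sources
    A_hat = ica.mixing_            # estimated mixing matrix
    # Interpretation step: each column of A_hat shows how strongly an
    # estimated source contributes to each observed variable.
    print(A_hat)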
Data Mining in Cancer Research [Application Notes]
This article is not intended as a comprehensive survey of data mining applications in cancer. Rather, it provides starting points for further, more targeted, literature searches, by embarking on a guided tour of computational intelligence applications in cancer medicine, structured in increasing order of the physical scales of biological processes.
An approach based on the Adaptive Resonance Theory for analysing the viability of recommender systems in a citizen Web portal
This paper proposes a methodology to optimise the future accuracy of a collaborative recommender application in a citizen Web portal. It comprises four stages, namely user modelling, benchmarking of clustering algorithms, prediction analysis and recommendation. The first stage develops analytical models of common characteristics of Web-user data. These artificial data sets are then used to evaluate the performance of clustering algorithms, in particular benchmarking the ART2 neural network against K-means clustering. Afterwards, the predictive accuracy of the clusters is evaluated on a real-world data set derived from access logs to the citizen Web portal Infoville XXI (http://www…
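A minimal sketch of the benchmarking stage under assumed conditions: synthetic binary user profiles are generated from known prototypes, and recoverability of the planted structure is scored (ART2 is not available in scikit-learn, so only the K-means arm is shown):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import adjusted_rand_score

    # Synthetic Web-user profiles: binary visit patterns around known prototypes.
    rng = np.random.default_rng(1)
    prototypes = rng.random((4, 20)) < 0.3          # 4 user types, 20 services
    labels_true = rng.integers(0, 4, size=500)
    X = (prototypes[labels_true] ^ (rng.random((500, 20)) < 0.05)).astype(float)

    km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
    # Agreement with the planted structure measures how well the
    # algorithm recovers the user types from noisy profiles.
    print(adjusted_rand_score(labels_true, km.labels_))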
Preface to Data Mining in Biomedical Informatics and Healthcare
Probabilistic quantum clustering
Quantum Clustering is a powerful method to detect clusters with complex shapes. However, it is very sensitive to a length parameter that controls the shape of the Gaussian kernel associated with a wave function, which is employed in the Schrödinger equation in the role of a density estimator. In addition, linking data points into clusters requires local estimates of covariance, which require further parameters. This paper proposes a Bayesian framework that provides an objective measure of goodness-of-fit to the data, to optimise the adjustable parameters. This also quantifies the probabilities of cluster membership, thus partitioning the data into a specific number of clusters, w…
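For reference, the standard quantum clustering construction (following Horn and Gottlieb's original formulation, on which this line of work builds) forms a Parzen-style wave function and a potential whose minima mark cluster centres; $\sigma$ below is the length parameter the abstract refers to:

$$\psi(\mathbf{x}) = \sum_{i=1}^{N} e^{-\|\mathbf{x}-\mathbf{x}_i\|^{2}/2\sigma^{2}}, \qquad V(\mathbf{x}) = E + \frac{\sigma^{2}}{2}\,\frac{\nabla^{2}\psi(\mathbf{x})}{\psi(\mathbf{x})},$$

with $E$ fixed so that $\min V = 0$. Small changes in $\sigma$ reshape $V$ and hence the cluster allocation, which is the sensitivity the proposed Bayesian framework is designed to resolve.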
A Novel Semi-Supervised Methodology for Extracting Tumor Type-Specific MRS Sources in Human Brain Data
Background: The clinical investigation of human brain tumors often starts with a non-invasive imaging study, providing information about the tumor extent and location, but little insight into the biochemistry of the analyzed tissue. Magnetic Resonance Spectroscopy can complement imaging by supplying a metabolic fingerprint of the tissue. This study analyses single-voxel magnetic resonance spectra, which represent signal information in the frequency domain. Given that a single voxel may contain a heterogeneous mix of tissues, signal source identification is a relevant challenge for the problem of tumor type classification from the spectroscopic signal. Methodology/Princ…
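The abstract truncates before naming the extraction method; as an illustration only (not necessarily the paper's technique), non-negative matrix factorization is a natural fit for decomposing non-negative spectra into constituent sources:

    import numpy as np
    from sklearn.decomposition import NMF

    # Toy stand-in for single-voxel spectra: non-negative mixtures of two sources.
    rng = np.random.default_rng(2)
    freq = np.linspace(0, 4, 200)
    sources = np.vstack([np.exp(-(freq - 1.2) ** 2 / 0.02),
                         np.exp(-(freq - 2.8) ** 2 / 0.05)])
    weights = rng.random((100, 2))                 # 100 voxels
    V = weights @ sources + 0.01 * rng.random((100, 200))

    model = NMF(n_components=2, init="nndsvd", max_iter=500)
    W = model.fit_transform(V)      # per-voxel source weights
    H = model.components_           # extracted spectral sources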
Scalable implementation of measuring distances in a Riemannian manifold based on the Fisher Information metric
This paper focuses on the scalability of the Fisher Information manifold by applying distributed computing techniques. The main objective is to investigate methodologies to relieve two bottlenecks associated with the measurement of distances in a Riemannian manifold induced by the Fisher Information metric. The first bottleneck is the quadratic growth in the number of pairwise distances. The second is the computation of global distances, approximated through a fully connected network of the observed pairwise distances, where the challenge is the computation of the all-sources shortest path (ASSP). The scalable implementation for the pairwise distances is performed in Spark. The scalable…
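A single-machine sketch of the second bottleneck, assuming the pairwise manifold distances are already available as a dense matrix (the abstract's Spark implementation is not reproduced here; Euclidean distances stand in for locally measured Fisher distances):

    import numpy as np
    from scipy.sparse.csgraph import shortest_path
    from scipy.spatial.distance import pdist, squareform

    rng = np.random.default_rng(3)
    X = rng.normal(size=(200, 5))
    D_local = squareform(pdist(X))   # stand-in for local Fisher distances

    # Keep only the k shortest edges per point, so long-range distances must
    # be accumulated along the manifold rather than measured directly.
    k = 10
    far = np.argsort(D_local, axis=1)[:, k + 1:]
    D_graph = D_local.copy()
    np.put_along_axis(D_graph, far, np.inf, axis=1)   # inf marks a non-edge

    # Approximate global geodesic distances via all-sources shortest paths.
    D_global = shortest_path(D_graph, method="D", directed=False)

The quadratic growth in pairwise distances is visible even here: the dense matrix already holds 200 x 200 entries, which is what motivates distributing the computation.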
Classical Training Methods
This chapter reviews classical training methods for multilayer neural networks. These methods are widely used for classification and function modelling tasks. Nevertheless, they show a number of flaws or drawbacks that should be addressed in the development of such systems. They work by searching for the minimum of an error function which defines the optimal behaviour of the neural network. Different standard problems are used to show the capabilities of these models; in particular, we have benchmarked the algorithms on a nonlinear classification problem and on three function modelling problems.
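In its simplest form, the search the chapter refers to is gradient descent on the error function; a minimal sketch for a one-hidden-layer network (illustrative, not the chapter's code; biases are omitted for brevity):

    import numpy as np

    def train_mlp(X, y, hidden=8, lr=0.5, epochs=5000, rng=None):
        # Gradient descent on the squared-error function E(w).
        rng = np.random.default_rng(rng)
        W1 = rng.normal(scale=0.5, size=(X.shape[1], hidden))
        W2 = rng.normal(scale=0.5, size=(hidden, 1))
        for _ in range(epochs):
            H = np.tanh(X @ W1)                 # hidden-layer activations
            err = H @ W2 - y                    # residual; E = 0.5*mean(err^2)
            g2 = H.T @ err / len(X)             # dE/dW2
            g1 = X.T @ ((err @ W2.T) * (1 - H ** 2)) / len(X)   # dE/dW1
            W1 -= lr * g1
            W2 -= lr * g2
        return W1, W2

    # Example: the XOR problem, a standard nonlinear classification benchmark.
    X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
    y = np.array([[0.], [1.], [1.], [0.]])
    W1, W2 = train_mlp(X, y, rng=0)
    print(np.round(np.tanh(X @ W1) @ W2, 2))   # should approach [0, 1, 1, 0]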
Quantum clustering in non-spherical data distributions: Finding a suitable number of clusters
Quantum Clustering (QC) provides an alternative approach to clustering algorithms, several of which are based on geometric relationships between data points. Instead, QC makes use of quantum mechanics concepts to find structures (clusters) in data sets by finding the minima of a quantum potential. The starting point of QC is a Parzen estimator with a fixed length scale, which significantly affects the final cluster allocation. This dependence on an adjustable parameter is common to other methods. We propose a framework to find suitable values of the length parameter σ by optimising twin measures of cluster separation and consistency for a given cluster number. This is an extension of the Se…
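The length-scale dependence can be read directly from the Parzen estimator that QC starts from; with Gaussian kernels of width $\sigma$,

$$\hat{p}(\mathbf{x}) = \frac{1}{N\,(2\pi\sigma^{2})^{d/2}} \sum_{i=1}^{N} e^{-\|\mathbf{x}-\mathbf{x}_i\|^{2}/2\sigma^{2}},$$

a small $\sigma$ fragments the estimate into many local structures (many clusters), while a large $\sigma$ merges them; the proposed framework searches over $\sigma$ using cluster separation and consistency as twin criteria for a given cluster number.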
Robust Conditional Independence maps of single-voxel Magnetic Resonance Spectra to elucidate associations between brain tumours and metabolites.
The aim of the paper is two-fold. First, we show that structure finding with the PC algorithm can be inherently unstable and requires further operational constraints to consistently obtain models that are faithful to the data. We propose a methodology to stabilise the structure finding process, minimising both false positive and false negative error rates; this is demonstrated with synthetic data. Second, we apply the proposed structure finding methodology to a data set comprising single-voxel Magnetic Resonance Spectra of normal brain and three classes of brain tumours, to elucidate the associations between brain tumour types and a range of observed metabolites that are known to b…
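A generic sketch of one way to stabilise structure finding (the paper's own operational constraints are not reproduced here): rerun the algorithm over bootstrap resamples and keep only edges that recur. Here run_pc is a hypothetical stand-in for any PC implementation that returns a set of undirected edges:

    import numpy as np

    def stable_edges(X, run_pc, n_boot=100, threshold=0.8, rng=None):
        # run_pc: hypothetical callable, data matrix -> set of frozenset edges.
        rng = np.random.default_rng(rng)
        counts = {}
        for _ in range(n_boot):
            sample = X[rng.integers(0, len(X), size=len(X))]
            for edge in run_pc(sample):
                counts[edge] = counts.get(edge, 0) + 1
        # Retain edges found in at least `threshold` of the resamples:
        # raising the threshold trades false positives for false negatives,
        # the same trade-off the paper's methodology addresses.
        return {e for e, c in counts.items() if c / n_boot >= threshold}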
An integrated framework for risk profiling of breast cancer patients following surgery.
Objective: An integrated decision support framework is proposed for clinical oncologists making prognostic assessments of patients with operable breast cancer. The framework may be delivered over a web interface. It comprises a triangulation of prognostic modelling, visualisation of historical patient data and an explanatory facility to interpret risk group assignments using empirically derived Boolean rules expressed directly in clinical terms. Methods and materials: The prognostic inferences in the interface are validated in a multicentre longitudinal cohort study by modelling retrospective data from 917 patients recruited at Christie Hospital, Wilmslow between 1983 and 1989 and predictin…
A principled approach to network-based classification and data representation
Measures of similarity are fundamental in pattern recognition and data mining. Typically the Euclidean metric is used in this context, weighting all variables equally and therefore assuming equal relevance, which is rarely the case in real applications. In contrast, given an estimate of a conditional density function, the Fisher information calculated in the primary data space implicitly measures the relevance of variables in a principled way, by reference to auxiliary data such as class labels. This paper proposes a framework that uses a distance metric based on Fisher information to construct similarity networks that achieve a more informative and principled representation of data. The framework ena…
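For context, the Fisher information metric referred to here weights directions in data space by how much they change the conditional class density; for a conditional model $p(c\mid\mathbf{x})$, the metric tensor is

$$g_{ij}(\mathbf{x}) = \sum_{c} p(c\mid\mathbf{x})\, \frac{\partial \log p(c\mid\mathbf{x})}{\partial x_i}\, \frac{\partial \log p(c\mid\mathbf{x})}{\partial x_j},$$

so a variable that does not affect class membership contributes nothing to the distance, which is the sense in which relevance is measured in a principled way by reference to the class labels.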
Making nonlinear manifold learning models interpretable: The manifold grand tour
Highlights:
• Smooth nonlinear topographic maps of the data distribution to guide a Grand Tour visualisation.
• Prioritisation of the linear views of the data that are most consistent with data structure in the maps.
• Useful visualisations that cannot be obtained by other, more classical approaches.
Dimensionality reduction is required to produce visualisations of high dimensional data. In this framework, one of the most straightforward approaches to visualising high dimensional data is based on reducing complexity and applying linear projections while tumbling the projection axes in a defined sequence, which generates a Grand Tour of the data. We propose using smooth nonlinear topographic maps of the data distribution to…
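A minimal sketch of the classical (linear) Grand Tour that the abstract builds on: a 2-D projection plane is tumbled smoothly through the high-dimensional space (the nonlinear-map guidance proposed in the paper is not shown):

    import numpy as np

    def grand_tour_frames(d, n_frames=50, step=0.05, rng=None):
        # Yield a sequence of orthonormal 2-D projection bases, each obtained
        # by a small random perturbation of the previous one ("tumbling").
        rng = np.random.default_rng(rng)
        basis, _ = np.linalg.qr(rng.normal(size=(d, 2)))
        for _ in range(n_frames):
            direction = rng.normal(size=(d, 2))
            basis, _ = np.linalg.qr(basis + step * direction)  # re-orthonormalise
            yield basis

    # Usage: project the data through each frame to animate the tour.
    X = np.random.default_rng(4).normal(size=(300, 6))
    for B in grand_tour_frames(d=6, n_frames=3, rng=4):
        view = X @ B        # (300, 2) coordinates for one animation frame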