Search results for "Clustering"
showing 10 items of 446 documents
Detection of spatial disease clusters with LISA functions.
2011
Detection of disease clusters is an important tool in epidemiology that can help to identify risk factors associated with the disease and in understanding its etiology. In this article we propose a method for the detection of spatial clusters where the locations of a set of cases and a set of controls are available. The method is based on local indicators of spatial association functions (LISA functions), particularly on the development of a local version of the product density, which is a second-order characteristic of spatial point processes. The behavior of the method is evaluated and compared with Kulldorff's spatial scan statistic by means of a simulation study. It is shown that the LI…
Sample size planning for survival prediction with focus on high-dimensional data
2011
Sample size planning should reflect the primary objective of a trial. If the primary objective is prediction, the sample size determination should focus on prediction accuracy instead of power. We present formulas for the determination of training set sample size for survival prediction. Sample size is chosen to control the difference between optimal and expected prediction error. Prediction is carried out by Cox proportional hazards models. The general approach considers censoring as well as low-dimensional and high-dimensional explanatory variables. For dimension reduction in the high-dimensional setting, a variable selection step is inserted. If not all informative variables are included…
Sparse relative risk regression models
2020
Summary Clinical studies where patients are routinely screened for many genomic features are becoming more routine. In principle, this holds the promise of being able to find genomic signatures for a particular disease. In particular, cancer survival is thought to be closely linked to the genomic constitution of the tumor. Discovering such signatures will be useful in the diagnosis of the patient, may be used for treatment decisions and, perhaps, even the development of new treatments. However, genomic data are typically noisy and high-dimensional, not rarely outstripping the number of patients included in the study. Regularized survival models have been proposed to deal with such scenarios…
A fast and recursive algorithm for clustering large datasets with k-medians
2012
Clustering with fast algorithms large samples of high dimensional data is an important challenge in computational statistics. Borrowing ideas from MacQueen (1967) who introduced a sequential version of the $k$-means algorithm, a new class of recursive stochastic gradient algorithms designed for the $k$-medians loss criterion is proposed. By their recursive nature, these algorithms are very fast and are well adapted to deal with large samples of data that are allowed to arrive sequentially. It is proved that the stochastic gradient algorithm converges almost surely to the set of stationary points of the underlying loss criterion. A particular attention is paid to the averaged versions, which…
Modeling the coupled return-spread high frequency dynamics of large tick assets
2015
Large tick assets, i.e. assets where one tick movement is a significant fraction of the price and bid-ask spread is almost always equal to one tick, display a dynamics in which price changes and spread are strongly coupled. We introduce a Markov-switching modeling approach for price change, where the latent Markov process is the transition between spreads. We then use a finite Markov mixture of logit regressions on past squared returns to describe the dependence of the probability of price changes. The model can thus be seen as a Double Chain Markov Model. We show that the model describes the shape of return distribution at different time aggregations, volatility clustering, and the anomalo…
Algorithms and tools for protein-protein interaction networks clustering, with a special focus on population-based stochastic methods
2014
Abstract Motivation: Protein–protein interaction (PPI) networks are powerful models to represent the pairwise protein interactions of the organisms. Clustering PPI networks can be useful for isolating groups of interacting proteins that participate in the same biological processes or that perform together specific biological functions. Evolutionary orthologies can be inferred this way, as well as functions and properties of yet uncharacterized proteins. Results: We present an overview of the main state-of-the-art clustering methods that have been applied to PPI networks over the past decade. We distinguish five specific categories of approaches, describe and compare their main features and …
Anthropometry: An R Package for Analysis of Anthropometric Data
2017
The development of powerful new 3D scanning techniques has enabled the generation of large up-to-date anthropometric databases which provide highly valued data to improve the ergonomic design of products adapted to the user population. As a consequence, Ergonomics and Anthropometry are two increasingly quantitative fields, so advanced statistical methodologies and modern software tools are required to get the maximum benefit from anthropometric data. This paper presents a new R package, called Anthropometry, which is available on the Comprehensive R Archive Network. It brings together some statistical methodologies concerning clustering, statistical shape analysis, statistical archetypal an…
Migration and students' performance: detecting geographical differences following a curves clustering approach
2020
Students’ migration mobility is the new form of migration: students migrate to improve their skills and become more valued for the job market. The data regard the migration of Italian Bachelors who enrolled at Master Degree level, moving typically from poor to rich areas. This paper investigates the migration and other possible determinants on the Master Degree students’ performance. The Clustering of Effects approach for Quantile Regression Coefficients Modelling has been used to cluster the effects of some variables on the students’ performance for three Italian macro-areas. Results show evidence of similarity between Southern and Centre students, with respect to the Northern ones.
Ranking Scientific Journals Via Latent Class Models for Polytomous Item Response Data
2015
Summary We propose a model-based strategy for ranking scientific journals starting from a set of observed bibliometric indicators that represent imperfect measures of the unobserved ‘value’ of a journal. After discretizing the available indicators, we estimate an extended latent class model for polytomous item response data and use the estimated model to cluster journals. We illustrate our approach by using the data from the Italian research evaluation exercise that was carried out for the period 2004–2010, focusing on the set of journals that are considered relevant for the subarea statistics and financial mathematics. Using four bibliometric indicators (IF, IF5, AIS and the h-index), some…
Bayesian Markov switching models for the early detection of influenza epidemics
2008
The early detection of outbreaks of diseases is one of the most challenging objectives of epidemiological surveillance systems. In this paper, a Markov switching model is introduced to determine the epidemic and non-epidemic periods from influenza surveillance data: the process of differenced incidence rates is modelled either with a first-order autoregressive process or with a Gaussian white-noise process depending on whether the system is in an epidemic or in a non-epidemic phase. The transition between phases of the disease is modelled as a Markovian process. Bayesian inference is carried out on the former model to detect influenza epidemics at the very moment of their onset. Moreover, t…