Search results for "statistical"
Showing 10 of 4960 documents
Combining multiple hypothesis testing with machine learning increases the statistical power of genome-wide association studies
2016
Mieth, Bettina et al.
Stagewise pseudo-value regression for time-varying effects on the cumulative incidence
2015
In a competing risks setting, the cumulative incidence of an event of interest describes the absolute risk for this event as a function of time. For regression analysis, one can either choose to model all competing events by separate cause-specific hazard models or directly model the association between covariates and the cumulative incidence of one of the events. With a suitable link function, direct regression models allow for a straightforward interpretation of covariate effects on the cumulative incidence. In practice, where data can be right-censored, these regression models are implemented using a pseudo-value approach. For a grid of time points, the possibly unobserved binary event s…
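The jackknife construction behind the pseudo-value approach mentioned above can be sketched as follows. This is a minimal illustration assuming no censoring, so the cumulative incidence estimator reduces to an empirical proportion (with right-censoring one would plug in the Aalen-Johansen estimator instead); the function name is my own.

```python
import numpy as np

def pseudo_values(times, events, t, cause=1):
    """Jackknife pseudo-values for the cumulative incidence of `cause`
    at time t. Sketch without censoring: the estimator is the empirical
    proportion of subjects with the event of interest by time t, and the
    i-th pseudo-value is n * theta_hat - (n - 1) * theta_hat_minus_i."""
    times = np.asarray(times, dtype=float)
    events = np.asarray(events)
    n = len(times)
    full = np.mean((times <= t) & (events == cause))  # full-sample estimate
    pv = np.empty(n)
    for i in range(n):
        mask = np.ones(n, dtype=bool)
        mask[i] = False  # leave-one-out estimate
        loo = np.mean((times[mask] <= t) & (events[mask] == cause))
        pv[i] = n * full - (n - 1) * loo
    return pv
```

With complete (uncensored) data the pseudo-values coincide with the individual binary event indicators; censoring is exactly the situation where they become non-trivial.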
Partitioned learning of deep Boltzmann machines for SNP data.
2016
Abstract Motivation Learning the joint distributions of measurements, and in particular identification of an appropriate low-dimensional manifold, has been found to be a powerful ingredient of deep learning approaches. Yet, such approaches have hardly been applied to single nucleotide polymorphism (SNP) data, probably due to the high number of features typically exceeding the number of studied individuals. Results After a brief overview of how deep Boltzmann machines (DBMs), a deep learning approach, can be adapted to SNP data in principle, we specifically present a way to alleviate the dimensionality problem by partitioned learning. We propose a sparse regression approach to coarsely screen…
Variance component analysis to assess protein quantification in biomarker discovery. Application to MALDI-TOF mass spectrometry.
2017
Controlling the technological variability on an analytical chain is critical for biomarker discovery. The sources of technological variability should be modeled, which calls for specific experimental design, signal processing, and statistical analysis. Furthermore, with unbalanced data, the various components of variability cannot be estimated with the sequential or adjusted sums of squares of usual software programs. We propose a novel approach to variance component analysis with application to the matrix-assisted laser desorption/ionization time-of-flight (MALDI-TOF) technology and use this approach for protein quantification by a classical signal processing algori…
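For contrast with the unbalanced-data problem the abstract raises, here is the textbook method-of-moments decomposition for a balanced one-way random-effects layout, which is exactly the simple ANOVA-sums-of-squares route that breaks down when group sizes differ. The function name is my own; this is not the paper's method.

```python
import numpy as np

def oneway_variance_components(groups):
    """Method-of-moments variance components (between, within) for a
    BALANCED one-way random-effects layout: sigma2_within = MS_within,
    sigma2_between = (MS_between - MS_within) / n, truncated at 0."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    k = len(groups)
    n = len(groups[0])  # balanced design: every group has n replicates
    grand = np.mean(np.concatenate(groups))
    ms_within = np.mean([g.var(ddof=1) for g in groups])
    ms_between = n * sum((g.mean() - grand) ** 2 for g in groups) / (k - 1)
    sigma2_between = max((ms_between - ms_within) / n, 0.0)
    return sigma2_between, ms_within
```

With unbalanced data these closed-form moment equations no longer identify the components cleanly, which is the gap the proposed approach addresses.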
LEGO-based generalized set of two linear algebraic 3D bio-macro-molecular descriptors: Theory and validation by QSARs
2019
Abstract Novel 3D protein descriptors based on bilinear, quadratic and linear algebraic maps in ℝⁿ are proposed. These descriptors employ the kth 2-tuple (dis)similarity matrix to codify information related to covalent and non-covalent interactions in these biopolymers. The calculation of the inter-amino acid distances is generalized by using several dissimilarity coefficients, where normalization procedures based on the simple stochastic and mutual probability schemes are applied. A new local-fragment approach based on amino acid-types and amino acid-groups is proposed to characterize regions of interest in proteins. Topological and geometric macromolecular cutoffs are defined using local and…
The intrinsic combinatorial organization and information theoretic content of a sequence are correlated to the DNA encoded nucleosome organization of…
2015
Abstract Motivation: Thanks to research spanning nearly 30 years, two major models have emerged that account for nucleosome organization in chromatin: statistical and sequence specific. The first is based on elegant, easy to compute, closed-form mathematical formulas that make no assumptions about the physical and chemical properties of the underlying DNA sequence. Moreover, they need no training on the data for their computation. The second is based on some sequence regularities but, as opposed to the statistical model, it lacks the same type of closed-form formulas that, in this case, should be based on the DNA sequence only. Results: We contribute to close this important methodological gap …
Reference genome assessment from a population scale perspective: an accurate profile of variability and noise.
2017
Abstract Motivation Current plant and animal genomic studies are often based on newly assembled genomes that have not been properly consolidated. In this scenario, misassembled regions can easily lead to false-positive findings. Although quality control scores are included within genotyping protocols, they are usually employed to evaluate individual sample quality rather than reference sequence reliability. We propose a statistical model that combines quality control scores across samples in order to detect incongruent patterns at every genomic region. Our model is inherently robust since common artifact signals are expected to be shared between independent samples over misassembled regions …
Evolutionary distances corrected for purifying selection and ancestral polymorphisms.
2019
Abstract Evolutionary distance formulas that take into account effects due to ancestral polymorphisms and purifying selection are obtained on the basis of the full solution of the Jukes–Cantor and Kimura DNA substitution models. In the case of purifying selection, two different methods are developed. It is shown that avoiding the dimensional reduction implicitly carried out in conventional model solving is instrumental in incorporating these effects into the formalism. The problem of estimating the numerical values of the model parameters, as well as those of the correction terms, is not addressed.
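For reference, the baseline the paper extends is the classic Jukes–Cantor corrected distance, d = -(3/4) ln(1 - 4p/3), where p is the observed proportion of differing sites. A minimal sketch (no selection or ancestral-polymorphism correction; function name my own):

```python
import math

def jukes_cantor_distance(p):
    """Standard JC69 corrected distance from the observed proportion p
    of differing sites: d = -(3/4) * ln(1 - 4p/3). Undefined (infinite)
    as p approaches 3/4, the saturation limit of the model."""
    if not 0.0 <= p < 0.75:
        raise ValueError("JC69 distance requires 0 <= p < 3/4")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)
```

The corrected distance always exceeds the raw proportion p, since it accounts for multiple substitutions at the same site.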
Multiplicity- and dependency-adjusted p-values for control of the family-wise error rate
2016
Abstract Under the multiple testing framework, we propose the multiplicity- and dependency-adjustment method (MADAM) which transforms test statistics into adjusted p-values for control of the family-wise error rate. For demonstration, we apply the MADAM to data from a genetic association study.
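As a point of comparison for adjusted p-values controlling the family-wise error rate, the Bonferroni correction is the simplest transform of that kind. It controls the FWER under arbitrary dependence but ignores the dependence structure, which is precisely what the MADAM is designed to exploit; this sketch is the baseline, not the MADAM itself.

```python
import numpy as np

def bonferroni_adjust(pvals):
    """Bonferroni-adjusted p-values: min(m * p_i, 1) for m tests.
    Rejecting whenever the adjusted p-value is below alpha controls
    the family-wise error rate at level alpha."""
    p = np.asarray(pvals, dtype=float)
    return np.minimum(p * len(p), 1.0)
```

Dependency-aware adjustments can be considerably less conservative when test statistics are strongly correlated, as in genetic association data.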
Pitfalls of hypothesis tests and model selection on bootstrap samples: Causes and consequences in biometrical applications
2015
The bootstrap method has become a widely used tool applied in diverse areas where results based on asymptotic theory are scarce. It can be applied, for example, for assessing the variance of a statistic, a quantile of interest or for significance testing by resampling from the null hypothesis. Recently, some approaches have been proposed in the biometrical field where hypothesis testing or model selection is performed on a bootstrap sample as if it were the original sample. P-values computed from bootstrap samples have been used, for example, in the statistics and bioinformatics literature for ranking genes with respect to their differential expression, for estimating the variability of p-v…
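The legitimate use of the bootstrap that the abstract contrasts with the pitfall it studies (treating a bootstrap sample as if it were the original sample) can be sketched as follows; function name and defaults are my own.

```python
import numpy as np

def bootstrap_variance(x, stat=np.median, n_boot=2000, seed=0):
    """Estimate Var(stat) by resampling with replacement from the
    observed sample and recomputing the statistic on each replicate --
    the standard, valid use of the bootstrap, as opposed to running
    hypothesis tests on a single bootstrap sample as if it were new data."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    reps = np.array([stat(rng.choice(x, size=len(x), replace=True))
                     for _ in range(n_boot)])
    return reps.var(ddof=1)
```

For the sample mean this estimate is close to the plug-in value s²/n, which is one easy sanity check on a bootstrap implementation.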