Search results for "machine"
Showing 10 of 2592 documents
Learning Molecular Classes from Small Numbers of Positive Examples Using Graph Grammars
2021
We consider the following problem: A researcher identified a small number of molecules with a certain property of interest and now wants to find further molecules sharing this property in a database. This can be described as learning molecular classes from small numbers of positive examples. In this work, we propose a method that is based on learning a graph grammar for the molecular class. We consider the type of graph grammars proposed by Althaus et al. [2], as it can be easily interpreted and allows relatively efficient queries. We identify rules that are frequently encountered in the positive examples and use these to construct a graph grammar. We then classify a molecule as being conta…
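The accept/reject step described above can be illustrated, very loosely, with a frequent-pattern stand-in: instead of the graph grammar formalism of Althaus et al., the toy sketch below mines labelled edges that recur across the positive molecules and accepts a molecule only if every edge is licensed by a frequent rule. The edge encoding and all names are invented for illustration and are not the paper's method.

```python
from collections import Counter

def mine_rules(molecules, min_support=2):
    # Keep edge "rules" seen in at least min_support positive molecules.
    counts = Counter()
    for mol in molecules:
        for edge in set(mol):
            counts[edge] += 1
    return {e for e, c in counts.items() if c >= min_support}

def accepts(rules, molecule):
    # Accept a molecule only if every edge is licensed by a frequent rule.
    return all(edge in rules for edge in molecule)

# Toy positives: molecules as lists of labelled edges (atom, atom, bond).
positives = [
    [("C", "C", "single"), ("C", "O", "single")],
    [("C", "C", "single"), ("C", "O", "single"), ("C", "N", "single")],
]
rules = mine_rules(positives, min_support=2)
print(accepts(rules, [("C", "C", "single"), ("C", "O", "single")]))  # True
print(accepts(rules, [("C", "N", "single")]))                        # False
```

A real grammar, unlike this edge-set check, would also constrain how the licensed fragments combine.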
A new paradigm for pattern classification: Nearest Border Techniques
2013
Published version of a chapter in the book AI 2013: Advances in Artificial Intelligence. Also available from the publisher at: http://dx.doi.org/10.1007/978-3-319-03680-9_44 There are many paradigms for pattern classification. In contrast to these, this paper introduces a paradigm not previously reported in the literature, which we refer to as the Nearest Border (NB) paradigm. The philosophy behind the NB strategy is as follows: given the training data set for each class, we shall first attempt to create a border for each individual class. We then advocate that testing be accomplished by assigning the test sample to the class whose border it lies closest to…
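As a rough illustration of the NB philosophy only (the abstract does not specify the authors' actual border construction), the sketch below uses one enclosing hypersphere per class as its "border" and assigns a test point to the class whose sphere surface lies nearest:

```python
import numpy as np

def fit_borders(X, y):
    # One crude "border" per class: a hypersphere centred at the class
    # mean, with radius set by the farthest training point of that class.
    borders = {}
    for c in np.unique(y):
        pts = X[y == c]
        center = pts.mean(axis=0)
        borders[c] = (center, np.linalg.norm(pts - center, axis=1).max())
    return borders

def predict(borders, x):
    # Assign x to the class whose border (sphere surface) is nearest.
    dist = {c: abs(np.linalg.norm(x - ctr) - r)
            for c, (ctr, r) in borders.items()}
    return min(dist, key=dist.get)

X = np.array([[0, 0], [1, 0], [0, 1], [10, 10], [11, 10], [10, 11]], float)
y = np.array([0, 0, 0, 1, 1, 1])
borders = fit_borders(X, y)
print(predict(borders, np.array([9.0, 9.0])))  # 1
```

Note that, unlike nearest-centroid rules, the decision depends on the distance to the border, not to the class center.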
Models of Computation, Riemann Hypothesis, and Classical Mathematics
1998
Classical mathematics has been a source of ideas for Computer Science since its very first days. Surprisingly, there is still much to be found. Computer scientists, especially those in Theoretical Computer Science, find inspiring ideas both in old notions and results and in 20th-century mathematics. Recent decades have brought evidence that computer people will soon study quantum physics and modern biology just to understand what computers are doing.
Variability of Classification Results in Data with High Dimensionality and Small Sample Size
2021
The study focuses on the analysis of biological data containing information on the number of genome sequences of intestinal microbiome bacteria before and after antibiotic use. The data have high dimensionality (bacterial taxa) and a small number of records, which is typical of bioinformatics data. Classification models induced on such data sets are usually unstable, and their accuracy metrics have high variance. The aim of the study is to create a preprocessing workflow and a classification model that can most accurately classify the microbiome into groups before and after antibiotic use, and lessen the variability of the classifier's accuracy measures. To ev…
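The instability described above is easy to reproduce: with far more features than samples, accuracy estimates swing widely across random splits. A minimal synthetic demonstration, using an invented nearest-centroid classifier rather than the study's models:

```python
import numpy as np

rng = np.random.default_rng(0)

def nearest_centroid_acc(Xtr, ytr, Xte, yte):
    # Tiny classifier: label each test point by its nearest class centroid.
    cents = {c: Xtr[ytr == c].mean(axis=0) for c in np.unique(ytr)}
    preds = [min(cents, key=lambda c: np.linalg.norm(x - cents[c]))
             for x in Xte]
    return np.mean(np.array(preds) == yte)

# High-dimensional, small-sample synthetic data: 20 samples, 500 features,
# only the first 5 features carry signal -- mimicking microbiome tables.
n, p = 20, 500
y = np.array([0] * 10 + [1] * 10)
X = rng.normal(size=(n, p))
X[y == 1, :5] += 1.0

# Accuracy over repeated random 50/50 splits: the spread of these values
# is the variability the abstract is concerned with.
accs = []
for _ in range(30):
    idx = rng.permutation(n)
    tr, te = idx[:10], idx[10:]
    accs.append(nearest_centroid_acc(X[tr], y[tr], X[te], y[te]))
print(round(float(np.std(accs)), 3))
```

Preprocessing that reduces the feature space (and repeated evaluation) is the usual remedy for this spread.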
Comparative analysis of architectures for monitoring cloud computing infrastructures
2015
The lack of control over cloud resources is one of the main disadvantages associated with cloud computing. The design of efficient architectures for monitoring such resources can help to overcome this problem. This contribution describes a complete set of architectures for monitoring cloud computing infrastructures and provides a taxonomy of them. The architectures are described in detail, compared with one another, and analysed in terms of performance, scalability, resource usage, and security capabilities. The architectures have been implemented in real-world settings and empirically validated against a real cloud computing infrastructure based on OpenStack. More than 1000 virtual machin…
A local complexity based combination method for decision forests trained with high-dimensional data
2012
Accurate machine learning with high-dimensional data is hampered by the phenomenon known as the “curse” of dimensionality. One of the main strategies explored in the last decade to deal with this problem is the use of multi-classifier systems. Several such approaches are inspired by the Random Subspace Method for the construction of decision forests. Other studies rely on estimates of the individual classifiers' competence to enhance the combination in the multi-classifier and improve accuracy. We propose a competence estimate based on local complexity measurements, used to perform a weighted-average combination of the decision forest. Experimental results show how thi…
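A hedged sketch of the general idea above: a Random Subspace ensemble whose members are weighted by an estimate of local competence. Here the base learners are nearest-centroid models and "competence" is plain local accuracy on the k nearest training points, a simple stand-in for the paper's local complexity measurements:

```python
import numpy as np

rng = np.random.default_rng(1)

def fit_subspace_ensemble(X, y, n_models=15, dim=3):
    # Random Subspace ensemble: each nearest-centroid learner sees only
    # a random subset of `dim` features.
    models = []
    for _ in range(n_models):
        feats = rng.choice(X.shape[1], size=dim, replace=False)
        cents = {c: X[y == c][:, feats].mean(axis=0) for c in np.unique(y)}
        models.append((feats, cents))
    return models

def predict_weighted(models, X, y, x, k=5):
    # Weight each model's vote by its accuracy on the k training points
    # nearest to x -- a crude local-competence estimate.
    nn = np.argsort(np.linalg.norm(X - x, axis=1))[:k]
    votes = {}
    for feats, cents in models:
        pred = lambda z: min(cents, key=lambda c: np.linalg.norm(z[feats] - cents[c]))
        w = np.mean([pred(X[i]) == y[i] for i in nn])
        p = pred(x)
        votes[p] = votes.get(p, 0.0) + w
    return max(votes, key=votes.get)

X = rng.normal(size=(40, 10))
y = np.array([0] * 20 + [1] * 20)
X[y == 1] += 2.0
models = fit_subspace_ensemble(X, y)
print(predict_weighted(models, X, y, np.full(10, 2.0)))  # 1
```

The design choice being illustrated is that the weights are recomputed per test point, so a classifier weak in one region of feature space can still dominate in another.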
Data Analysis and Bioinformatics
2007
Data analysis methods and techniques are revisited in the case of biological data sets. Particular emphasis is given to clustering and mining issues. Clustering is still a subject of active research in several fields, such as statistics, pattern recognition, and machine learning. Data mining adds to clustering the complications of very large data sets with many attributes of different types. And this is a typical situation in biology. Some case studies are also described.
Distance Functions, Clustering Algorithms and Microarray Data Analysis
2010
Distance functions are a fundamental ingredient of classification and clustering procedures, and this holds true also in the particular case of microarray data. In the general data mining and classification literature, functions such as Euclidean distance or Pearson correlation have gained the status of de facto standards thanks to a considerable amount of experimental validation. For microarray data, the issue of which distance function works best has been investigated, but no final conclusion has been reached. The aim of this extended abstract is to shed further light on that issue. Indeed, we present an experimental study, involving several distances, assessing (a) their intrinsic sepa…
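The practical difference between the two de facto standard distances mentioned above is easy to see on a toy expression profile: Euclidean distance reacts to absolute expression level, while a Pearson-based distance responds only to the shape of the profile.

```python
import numpy as np

def euclidean(u, v):
    return np.linalg.norm(u - v)

def pearson_dist(u, v):
    # 1 - Pearson correlation: two profiles that rise and fall together
    # are "close" even when their absolute levels differ.
    return 1.0 - np.corrcoef(u, v)[0, 1]

a = np.array([1.0, 2.0, 3.0, 4.0])
b = a + 10.0                          # same trend, shifted level
c = np.array([4.0, 3.0, 2.0, 1.0])    # reversed trend

print(euclidean(a, b))                # 20.0 -- far apart
print(round(pearson_dist(a, b), 6))   # ~0.0 -- perfectly correlated
print(round(pearson_dist(a, c), 6))   # ~2.0 -- anti-correlated
```

Which behaviour is "right" depends on whether absolute expression levels or co-expression patterns matter for the analysis, which is precisely why the question stays open for microarray data.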
Regularized Regression Incorporating Network Information: Simultaneous Estimation of Covariate Coefficients and Connection Signs
2014
We develop an algorithm that incorporates network information into regression settings. It simultaneously estimates the covariate coefficients and the signs of the network connections (i.e. whether the connections are of an activating or of a repressing type). For the coefficient estimation steps, an additional penalty is set on top of the lasso penalty, similarly to Li and Li (2008). We develop a fast implementation of the new method based on coordinate descent. Furthermore, we show how the new method can be applied to time-to-event data. The new method yields good results in simulation studies concerning the sensitivity and specificity of non-zero covariate coefficients, estimation of networ…
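A minimal sketch of the kind of coordinate-descent update involved, assuming an objective of the form (1/2n)‖y − Xb‖² + λ₁‖b‖₁ + (λ₂/2)·bᵀLb with L a fixed graph Laplacian, in the spirit of Li and Li (2008); the paper's simultaneous estimation of connection signs is omitted here:

```python
import numpy as np

def soft(z, t):
    # Soft-thresholding operator from the lasso update.
    return np.sign(z) * max(abs(z) - t, 0.0)

def network_lasso(X, y, L, lam1=0.05, lam2=0.1, n_iter=100):
    # Cyclic coordinate descent for
    #   (1/2n)||y - Xb||^2 + lam1*||b||_1 + (lam2/2) * b' L b.
    n, p = X.shape
    b = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ b + X[:, j] * b[j]      # partial residual w/o b_j
            z = X[:, j] @ r / n - lam2 * (L[j] @ b - L[j, j] * b[j])
            b[j] = soft(z, lam1) / (X[:, j] @ X[:, j] / n + lam2 * L[j, j])
    return b

# Toy example: b_true = (1, 1, 0), with covariates 0 and 1 connected
# in the network (hence the Laplacian below smooths them together).
rng = np.random.default_rng(3)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, 1.0, 0.0])
L = np.array([[1.0, -1.0, 0.0], [-1.0, 1.0, 0.0], [0.0, 0.0, 0.0]])
b = network_lasso(X, y, L)
print(np.round(b, 2))
```

The Laplacian term shrinks connected coefficients toward each other, while the ℓ₁ term still produces exact zeros.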
Bayesian versus data driven model selection for microarray data
2014
Clustering is one of the most well-known activities in scientific investigation and an object of research in many disciplines, ranging from Statistics to Computer Science. In this beautiful area, one of the most difficult challenges is a particular instance of the model selection problem, i.e., the identification of the correct number of clusters in a dataset. In what follows, for ease of reference, we continue to refer to that instance as model selection. It is an important part of any statistical analysis. The techniques used for solving it are mainly either Bayesian or data-driven, and both are based on internal knowledge. That is, they use information obtained by processing the input data. A…
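A small data-driven example of the model selection problem described above: choosing the number of clusters k by an internal criterion. The criterion here (average silhouette width) and the hand-rolled k-means are one common choice, not necessarily the techniques studied in the paper:

```python
import numpy as np

rng = np.random.default_rng(2)

def kmeans(X, k, iters=30, restarts=5):
    # Plain Lloyd's algorithm, best of several random restarts.
    best_lbl, best_ss = None, np.inf
    for _ in range(restarts):
        C = X[rng.choice(len(X), k, replace=False)]
        for _ in range(iters):
            lbl = np.argmin(((X[:, None] - C[None]) ** 2).sum(-1), axis=1)
            C = np.array([X[lbl == i].mean(0) if np.any(lbl == i) else C[i]
                          for i in range(k)])
        ss = ((X - C[lbl]) ** 2).sum()
        if ss < best_ss:
            best_lbl, best_ss = lbl, ss
    return best_lbl

def mean_silhouette(X, lbl):
    # Average silhouette width: an internal, data-driven criterion.
    D = np.linalg.norm(X[:, None] - X[None], axis=-1)
    s = []
    for i in range(len(X)):
        same = (lbl == lbl[i])
        same[i] = False
        if not same.any():
            continue  # singleton cluster: silhouette undefined
        a = D[i][same].mean()
        b = min(D[i][lbl == c].mean() for c in set(lbl) if c != lbl[i])
        s.append((b - a) / max(a, b))
    return float(np.mean(s))

# Three well-separated 2-D blobs; the criterion should peak at k = 3.
X = np.vstack([rng.normal(m, 0.3, size=(20, 2))
               for m in [(0, 0), (5, 0), (0, 5)]])
best = max(range(2, 6), key=lambda k: mean_silhouette(X, kmeans(X, k)))
print(best)
```

Bayesian alternatives replace the silhouette score with a posterior over k, but both families, as the abstract notes, rely only on the input data itself.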