AUTHOR
Stefan Kramer
Multi-label Classification Using Stacked Hierarchical Dirichlet Processes with Reduced Sampling Complexity
Nonparametric topic models based on hierarchical Dirichlet processes (HDPs) allow for the number of topics to be automatically discovered from the data. The computational complexity of standard Gibbs sampling techniques for model training is linear in the number of topics. Recently, it was reduced to be linear in the number of topics per word using a technique called alias sampling combined with Metropolis Hastings (MH) sampling. We propose a different proposal distribution for the MH step based on the observation that distributions on the upper hierarchy level change slower than the document-specific distributions at the lower level. This reduces the sampling complexity, making it linear i…
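The core of the proposed speed-up is a Metropolis-Hastings step whose proposal distribution is cheap to sample from because it changes only slowly and is therefore updated rarely. The following is a minimal illustrative sketch of such an independence MH step, not the paper's sampler: p_true stands for the unnormalized true conditional p(topic | rest), q_proposal for a cached, slowly changing topic distribution; in the actual papers, drawing from the cached proposal is made O(1) with alias tables.

    import numpy as np

    def mh_topic_resample(current_topic, p_true, q_proposal, rng):
        # p_true: unnormalized true conditional p(topic | rest), recomputed per token
        # q_proposal: cached, slowly changing proposal over topics
        q = q_proposal / q_proposal.sum()
        proposal = rng.choice(len(q), p=q)
        # independence MH acceptance ratio; normalization constants cancel
        accept = min(1.0, (p_true[proposal] * q[current_topic]) /
                          (p_true[current_topic] * q[proposal] + 1e-300))
        return proposal if rng.random() < accept else current_topic

    rng = np.random.default_rng(0)
    new_topic = mh_topic_resample(2, np.array([0.1, 0.5, 0.2, 0.2]),
                                  np.array([0.3, 0.3, 0.2, 0.2]), rng)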
Prototype-based learning on concept-drifting data streams
Data stream mining has gained growing attention due to its many emerging applications, such as target marketing, email filtering and network intrusion detection. In this paper, we propose a prototype-based classification model for evolving data streams, called SyncStream, which dynamically models time-changing concepts and makes predictions in a local fashion. Instead of learning a single model on a sliding window or ensemble learning, SyncStream captures evolving concepts by dynamically maintaining a set of prototypes in a new data structure called the P-tree. The prototypes are obtained by error-driven representativeness learning and synchronization-inspired constrained clustering. To ide…
Online Density Estimation of Heterogeneous Data Streams in Higher Dimensions
The joint density of a data stream is suitable for performing data mining tasks without having access to the original data. However, the methods proposed so far only target a small to medium number of variables, since their estimates rely on representing all the interdependencies between the variables of the data. High-dimensional data streams, which are becoming more and more frequent due to increasing numbers of interconnected devices, are, therefore, pushing these methods to their limits. To mitigate these limitations, we present an approach that projects the original data stream into a vector space and uses a set of representatives to provide an estimate. Due to the structure of the est…
Towards identifying drug side effects from social media using active learning and crowd sourcing.
Motivation Social media is a largely untapped source of information on side effects of drugs. Twitter in particular is widely used to report on everyday events and personal ailments. However, labeling this noisy data is a difficult problem because labeled training data is sparse and automatic labeling is error-prone. Crowd sourcing can help in such a scenario to obtain more reliable labels, but is expensive in comparison because workers have to be paid. To remedy this, semi-supervised active learning may reduce the number of labeled data needed and focus the manual labeling process on important information. Results We extracted data from Twitter using the public API. We subsequently use Ama…
Polymeric Nanoparticles: Polymeric Nanoparticles with Neglectable Protein Corona (Small 18/2020)
Secure Sum Outperforms Homomorphic Encryption in (Current) Collaborative Deep Learning
Deep learning (DL) approaches are achieving extraordinary results in a wide range of domains, but often require a massive collection of private data. Hence, methods for training neural networks on the joint data of different data owners, that keep each party's input confidential, are called for. We address a specific setting in federated learning, namely that of deep learning from horizontally distributed data with a limited number of parties, where their vulnerable intermediate results have to be processed in a privacy-preserving manner. This setting can be found in medical and healthcare as well as industrial applications. The predominant scheme for this is based on homomorphic encryption…
Forest of Normalized Trees: Fast and Accurate Density Estimation of Streaming Data
Density estimation of streaming data is a relevant task in numerous domains. In this paper, a novel non-parametric density estimator called FRONT (forest of normalized trees) is introduced. It uses a structure of multiple normalized trees, segments the feature space of the data stream through a periodically updated linear transformation and is able to adapt to ever evolving data streams. FRONT provides accurate density estimation and performs favorably compared to existing online density estimators in terms of the average log score on multiple standard data sets. Its low complexity, linear runtime and constant memory usage make FRONT by design suitable for large data streams. Final…
A label compression method for online multi-label classification
Many modern applications deal with multi-label data, such as functional categorizations of genes, image labeling and text categorization. Classification of such data with a large number of labels and latent dependencies among them is a challenging task, and it becomes even more challenging when the data is received online and in chunks. Many of the current multi-label classification methods require a lot of time and memory, which makes them infeasible for practical real-world applications. In this paper, we propose a fast linear label space dimension reduction method that transforms the labels into a reduced encoded space and trains models on the obtained pseudo labels. Additionally…
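As an illustration of the general label-compression idea (not the paper's specific encoding), the sketch below compresses the label matrix with a truncated SVD, trains one regressor per pseudo label, and decodes predictions back to the original label space; the use of scikit-learn's Ridge and a fixed decoding threshold are assumptions made for the example only.

    import numpy as np
    from sklearn.linear_model import Ridge

    def fit_label_compression(X, Y, k):
        # encode: project the binary label matrix Y onto its top-k right singular vectors
        _, _, Vt = np.linalg.svd(Y, full_matrices=False)
        V_k = Vt[:k].T                      # (n_labels, k) encoding/decoding matrix
        Z = Y @ V_k                         # pseudo labels (n_samples, k)
        models = [Ridge().fit(X, Z[:, j]) for j in range(k)]
        return models, V_k

    def predict_labels(models, V_k, X_new, threshold=0.5):
        Z_hat = np.column_stack([m.predict(X_new) for m in models])
        return (Z_hat @ V_k.T >= threshold).astype(int)   # decode and threshold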
Machine Learning and Knowledge Discovery in Databases. Research Track
Focusing Knowledge-based Graph Argument Mining via Topic Modeling
Decision-making usually takes five steps: identifying the problem, collecting data, extracting evidence, identifying pro and con arguments, and making decisions. Focusing on extracting evidence, this paper presents a hybrid model that combines latent Dirichlet allocation and word embeddings to obtain external knowledge from structured and unstructured data. We study the task of sentence-level argument mining, as arguments mostly require some degree of world knowledge to be identified and understood. Given a topic and a sentence, the goal is to classify whether a sentence represents an argument in regard to the topic. We use a topic model to extract topic- and sentence-specific evidence from…
An inductive learning perspective on automated generation of feature models from given product specifications
For explicit representation of commonality and variability of a product line, a feature model is mostly used. An open question is how a feature model can be inductively learned in an automated way from a limited number of given product specifications in terms of features. We propose to address this problem through machine learning, more precisely inductive generalization from examples. However, no counter-examples are assumed to exist. Basically, a feature model needs to be complete with respect to all the given example specifications. First results indicate the feasibility of this approach, even for generating hierarchies, but many open challenges remain.
Hub-Centered Gene Network Reconstruction Using Automatic Relevance Determination
Network inference deals with the reconstruction of biological networks from experimental data. A variety of different reverse engineering techniques are available; they differ in the underlying assumptions and mathematical models used. One common problem for all approaches stems from the complexity of the task, due to the combinatorial explosion of different network topologies for increasing network size. To handle this problem, constraints are frequently used, for example on the node degree, number of edges, or constraints on regulation functions between network components. We propose to exploit topological considerations in the inference of gene regulatory networks. Such systems are often…
Forecast of Study Success in the STEM Disciplines Based Solely on Academic Records
We present an approach to forecasting study success in selected STEM disciplines (computer science, mathematics, physics, and meteorology), based solely on the academic record of a student so far, without access to demographic or socioeconomic data. The purpose of the analysis is to improve student counseling, which may be essential for finishing a study program in one of the above-mentioned fields. Technically, we show the successful use of propositionalization on relational data from educational data mining, based on standard aggregates and basic LSTM-trained aggregates.
Adapted Transfer of Distance Measures for Quantitative Structure-Activity Relationships and Data-Driven Selection of Source Datasets
Quantitative structure–activity relationships are regression models relating chemical structure to biological activity. Such models make it possible to predict toxicologically relevant endpoints, which constitute the target outcomes of experiments. The task is often tackled by instance-based methods, which are all based on the notion of chemical (dis-)similarity. Our starting point is the observation by Raymond and Willett that the two families of chemical distance measures, fingerprint-based and maximum common subgraph-based measures, provide orthogonal information about chemical similarity. This paper presents a novel method for finding suitable combinations of them, called adapted tran…
Online Sparse Collapsed Hybrid Variational-Gibbs Algorithm for Hierarchical Dirichlet Process Topic Models
Topic models for text analysis are most commonly trained using either Gibbs sampling or variational Bayes. Recently, hybrid variational-Gibbs algorithms have been found to combine the best of both worlds. Variational algorithms are fast to converge and more efficient for inference on new documents. Gibbs sampling enables sparse updates since each token is only associated with one topic instead of a distribution over all topics. Additionally, Gibbs sampling is unbiased. Although Gibbs sampling takes longer to converge, it is guaranteed to arrive at the true posterior after infinitely many iterations. By combining the two methods it is possible to reduce the bias of variational methods while …
HPMA-Based Nanoparticles for Fast, Bioorthogonal iEDDA Ligation
Fast and bioorthogonally reacting nanoparticles are attractive tools for biomedical applications such as tumor pretargeting. In this study, we designed an amphiphilic block copolymer system based on HPMA using different strategies to introduce the highly reactive click units 1,2,4,5-tetrazines (Tz) either at the chain end (Tz-CTA) or statistically into the hydrophobic block. This reactive group undergoes a rapid, bioorthogonal inverse electron-demand Diels-Alder reaction (iEDDA) with trans-cyclooctenes (TCO). Subsequently, this polymer platform was used for the preparation of different Tz-covered nanoparticles, such as micell…
Alternating model trees
Model tree induction is a popular method for tackling regression problems requiring interpretable models. Model trees are decision trees with multiple linear regression models at the leaf nodes. In this paper, we propose a method for growing alternating model trees, a form of option tree for regression problems. The motivation is that alternating decision trees achieve high accuracy in classification problems because they represent an ensemble classifier as a single tree structure. As in alternating decision trees for classification, our alternating model trees for regression contain splitter and prediction nodes, but we use simple linear regression functions as opposed to constant predicto…
Ensembles of Randomized Time Series Shapelets Provide Improved Accuracy while Reducing Computational Costs
Shapelets are discriminative time series subsequences that allow generation of interpretable classification models, which provide faster and generally better classification than the nearest neighbor approach. However, the shapelet discovery process requires the evaluation of all possible subsequences of all time series in the training set, making it extremely computationally intensive. Consequently, shapelet discovery for large time series datasets quickly becomes intractable. A number of improvements have been proposed to reduce the training time. These techniques use approximation or discretization and often lead to reduced classification accuracy compared to the exact method. We are proposin…
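The primitive that all shapelet methods, exact or randomized, rely on is the minimum distance between a candidate subsequence and a time series; randomized ensembles then evaluate only sampled candidates instead of every subsequence. A minimal sketch of that primitive and of drawing one random candidate follows; the variable names and the synthetic data are hypothetical, and z-normalized Euclidean distance is assumed.

    import numpy as np

    def shapelet_distance(series, shapelet):
        # minimum z-normalized Euclidean distance between the shapelet and any
        # subsequence of the series (the core primitive of shapelet classifiers)
        m = len(shapelet)
        s = (shapelet - shapelet.mean()) / (shapelet.std() + 1e-8)
        best = np.inf
        for start in range(len(series) - m + 1):
            w = series[start:start + m]
            w = (w - w.mean()) / (w.std() + 1e-8)
            best = min(best, float(np.sqrt(np.sum((w - s) ** 2))))
        return best

    # randomized shapelet discovery samples candidates instead of enumerating them
    rng = np.random.default_rng(0)
    X_train = [np.sin(np.linspace(0, 6, 150)) + rng.normal(0, 0.1, 150) for _ in range(20)]
    ts = X_train[rng.integers(len(X_train))]
    length = int(rng.integers(10, 40))
    start = int(rng.integers(len(ts) - length))
    candidate = ts[start:start + length]
    print(shapelet_distance(X_train[0], candidate))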
Polymeric Nanoparticles with Neglectable Protein Corona
Small: nano micro 16(18), 1907574 (2020). doi:10.1002/smll.201907574
Structural clustering of millions of molecular graphs
We propose an algorithm for clustering very large molecular graph databases according to scaffolds (i.e., large structural overlaps) that are common between cluster members. Our approach first partitions the original dataset into several smaller datasets using a greedy clustering approach named APreClus based on dynamic seed clustering. APreClus is an online and instance incremental clustering algorithm delaying the final cluster assignment of an instance until one of the so-called pending clusters the instance belongs to has reached significant size and is converted to a fixed cluster. Once a cluster is fixed, APreClus recalculates the cluster centers, which are used as representatives for…
A structural cluster kernel for learning on graphs
In recent years, graph kernels have received considerable interest within the machine learning and data mining community. Here, we introduce a novel approach enabling kernel methods to utilize additional information hidden in the structural neighborhood of the graphs under consideration. Our novel structural cluster kernel (SCK) incorporates similarities induced by a structural clustering algorithm to improve state-of-the-art graph kernels. The approach taken is based on the idea that graph similarity can not only be described by the similarity between the graphs themselves, but also by the similarity they possess with respect to their structural neighborhood. We applied our novel kernel in…
Optimization of curation of the dataset with data on repeated dose toxicity
Introduction: For some areas of risk assessment, the use of alternative methods is supported by current directives and guidance (e.g. REACH, Cosmetics, BPD, PPP). According to OECD principles alternative methods need to be scientifically valid. Methods: Within a project on grouping and development of predictive models supported by a grant of the Federal Ministry of Education and Research, we curated a dataset based on the RepDose and ELINCS databases. The final dataset consists of rat repeated dose toxicity studies for 1022 compounds representing 28 endpoints as organ-effect-combinations. Toxicological and modelling experts jointly performed the curation and selection of endpoints as an iterative proces…
Identification of ELF3 as an early transcriptional regulator of human urothelium
Despite major advances in high-throughput and computational modelling techniques, understanding of the mechanisms regulating tissue specification and differentiation in higher eukaryotes, particularly man, remains limited. Microarray technology has been explored exhaustively in recent years and several standard approaches have been established to analyse the resultant datasets on a genome-wide scale. Gene expression time series offer a valuable opportunity to define temporal hierarchies and gain insight into the regulatory relationships of biological processes. However, unless datasets are exactly synchronous, time points cannot be compared directly. Here we present a data-driven ana…
Online Estimation of Discrete Densities
We address the problem of estimating a discrete joint density online, that is, the algorithm is only provided the current example and its current estimate. The proposed online estimator of discrete densities, EDDO (Estimation of Discrete Densities Online), uses classifier chains to model dependencies among features. Each classifier in the chain estimates the probability of one particular feature. Because a single chain may not provide a reliable estimate, we also consider ensembles of classifier chains and ensembles of weighted classifier chains. For all density estimators, we provide consistency proofs and propose algorithms to perform certain inference tasks. The empirical evaluation of t…
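EDDO itself uses classifier chains (ensembles of per-feature classifiers); the sketch below substitutes smoothed conditional frequency tables for those classifiers to illustrate the underlying chain factorization p(x) = Π_i p(x_i | x_1, ..., x_{i-1}) in an online fashion. All names are hypothetical and the smoothing scheme is an assumption, not the paper's.

    from collections import defaultdict

    class ChainDensityEstimator:
        # online estimate of a discrete joint density via the chain rule,
        # one Laplace-smoothed conditional estimator per feature
        def __init__(self, n_features, alpha=1.0):
            self.n_features = n_features
            self.alpha = alpha
            self.counts = [defaultdict(lambda: defaultdict(int)) for _ in range(n_features)]
            self.values = [set() for _ in range(n_features)]   # observed domain per feature

        def update(self, x):                                   # processes only the current example
            for i in range(self.n_features):
                self.counts[i][tuple(x[:i])][x[i]] += 1
                self.values[i].add(x[i])

        def density(self, x):                                  # current estimate of p(x)
            p = 1.0
            for i in range(self.n_features):
                c = self.counts[i][tuple(x[:i])]
                total = sum(c.values())
                k = max(len(self.values[i]), 1)
                p *= (c.get(x[i], 0) + self.alpha) / (total + self.alpha * k)
            return p

    est = ChainDensityEstimator(n_features=3)
    for example in [(0, 1, 1), (0, 0, 1), (1, 1, 0)]:
        est.update(example)
    print(est.density((0, 1, 1)))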
HPMA-Based Nanocarriers for Effective Immune System Stimulation.
The selective activation of the immune system using nanoparticles as a drug delivery system is a promising field in cancer therapy. Block copolymers from HPMA and laurylmethacrylate-co-hymecromone-methacrylate allow the preparation of multifunctionalized core-crosslinked micelles of variable size. To activate dendritic cells (DCs) as antigen presenting cells, the carbohydrates mannose and trimannose are introduced into the hydrophilic corona as DC targeting units. To activate DCs, a lipophilic adjuvant (L18-MDP) is incorporated into the core of the micelles. To elicit an immune response, a model antigen peptide (SIINFEKL) is attached to the polymeric nanoparticle, in addition, via a click rea…
Scalable Clustering by Iterative Partitioning and Point Attractor Representation
Clustering very large datasets while preserving cluster quality remains a challenging data-mining task to date. In this paper, we propose an effective scalable clustering algorithm for large datasets that builds upon the concept of synchronization. Inherited from the powerful concept of synchronization, the proposed algorithm, CIPA (Clustering by Iterative Partitioning and Point Attractor Representations), is capable of handling very large datasets by iteratively partitioning them into thousands of subsets and clustering each subset separately. Using dynamic clustering by synchronization, each subset is then represented by a set of point attractors and outliers. Finally, CIPA identifies the…
Efficient Redundancy Reduced Subgroup Discovery via Quadratic Programming
Subgroup discovery is a task at the intersection of predictive and descriptive induction, aiming at identifying subgroups that have the most unusual statistical (distributional) characteristics with respect to a property of interest. Although a great deal of work has been devoted to the topic, one remaining problem concerns the redundancy of subgroup descriptions, which often effectively convey very similar information. In this paper, we propose a quadratic programming based approach to reduce the amount of redundancy in the subgroup rules. Experimental results on 12 datasets show that the resulting subgroups are in fact less redundant compared to standard methods. In addition, our experime…
Convolutional Neural Networks for the Identification of Regions of Interest in PET Scans: A Study of Representation Learning for Diagnosing Alzheimer’s Disease
When diagnosing patients suffering from dementia based on imaging data like PET scans, the identification of suitable predictive regions of interest (ROIs) is of great importance. We present a case study of 3-D Convolutional Neural Networks (CNNs) for the detection of ROIs in this context, just using voxel data, without any knowledge given a priori. Our results on data from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) suggest that the predictive performance of the method is on par with that of state-of-the-art methods, with the additional benefit of potential insights into affected brain regions.
Exploring Multi-Objective Optimization for Multi-Label Classifier Ensembles
Multi-label classification deals with the task of predicting multiple class labels for a given sample. Several performance metrics are designed in the literature to measure the quality of any multi-label classification technique. In general, existing multi-label classification approaches focus on optimizing only a single performance measure. The current work builds on the hypothesis that a weighted ensemble of multiple multi-label classifiers will lead to improved results. The appropriate weight combinations for combining the outputs of multiple classifiers can be selected after simultaneously optimizing different multi-label classification metrics like micro F1, hamming loss, 0/1 los…
Cinema Data Mining
While the physiological response of humans to emotional events or stimuli is well-investigated for many modalities (like EEG, skin resistance, ...), surprisingly little is known about the exhalation of so-called Volatile Organic Compounds (VOCs) at quite low concentrations in response to such stimuli. VOCs are molecules of relatively small mass that quickly evaporate or sublimate and can be detected in the air that surrounds us. The paper introduces a new field of application for data mining, where trace gas responses of people reacting on-line to films shown in cinemas (or movie theaters) are related to the semantic content of the films themselves. To do so, we measured the VOCs from a mov…
Targeting cells of the immune system: mannosylated HPMA–LMA block-copolymer micelles for targeting of dendritic cells
Background: Successful tumor immunotherapy depends on the induction of strong and sustained tumor antigen-specific immune responses by activated antigen-presenting cells (APCs) such as dendritic cells (DCs). Since nanoparticles have the potential to codeliver tumor-specific antigen and DC-stimulating adjuvant in a DC-targeting manner, we wanted to assess the suitability of mannosylated HPMA-LMA block polymers for immunotherapy. Materials & methods: Fluorescence-labeled block copolymer micelles derived from P(HPMA)-block-P(LMA) copolymers and corresponding statistical copolymers were synthesized via RAFT polymerization, and loaded with the APC activator L18-MDP. Both types of copolymers wer…
Innovative Strategies to Develop Chemical Categories Using a Combination of Structural and Toxicological Properties.
Interest is increasing in the development of non-animal methods for toxicological evaluations. These methods are, however, particularly challenging for complex toxicological endpoints such as repeated dose toxicity. European Legislation, e.g., the European Union's Cosmetic Directive and REACH, demands the use of alternative methods. Frameworks, such as the Read-across Assessment Framework or the Adverse Outcome Pathway Knowledge Base, support the development of these methods. The aim of the project presented in this publication was to develop substance categories for a read-across with complex endpoints of toxicity based on existing databases. The basic conceptual approach was to combine str…
DySC: software for greedy clustering of 16S rRNA reads.
Summary: Pyrosequencing technologies are frequently used for sequencing the 16S ribosomal RNA marker gene for profiling microbial communities. Clustering of the produced reads is an important but time-consuming task. We present Dynamic Seed-based Clustering (DySC), a new tool based on the greedy clustering approach that uses a dynamic seeding strategy. Evaluations based on the normalized mutual information (NMI) criterion show that DySC produces higher quality clusters than UCLUST and CD-HIT at a comparable runtime. Availability and implementation: DySC, implemented in C, is available at http://code.google.com/p/dysc/ under GNU GPL license. Contact: bertil.schmidt@uni-mainz.de Sup…
Incremental linear model trees on massive datasets
The existence of massive datasets raises the need for algorithms that make efficient use of resources like memory and computation time. Besides well-known approaches such as sampling, online algorithms are being recognized as good alternatives, as they often process datasets faster using much less memory. The important class of algorithms learning linear model trees online (incremental linear model trees or ILMTs in the following) offers interesting options for regression tasks in this sense. However, surprisingly little is known about their performance, as there exists no large-scale evaluation on massive stationary datasets under equal conditions. Therefore, this paper shows their applica…
A Large-Scale Empirical Evaluation of Cross-Validation and External Test Set Validation in (Q)SAR.
(Q)SAR model validation is essential to ensure the quality of inferred models and to indicate future model predictivity on unseen compounds. Proper validation is also one of the requirements of regulatory authorities in order to accept the (Q)SAR model, and to approve its use in real world scenarios as an alternative testing method. However, at the same time, the question of how to validate a (Q)SAR model, in particular whether to employ variants of cross-validation or external test set validation, is still under discussion. In this paper, we empirically compare a k-fold cross-validation with external test set validation. To this end, we introduce a workflow that allows us to realistically simulate t…
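For readers unfamiliar with the two validation schemes being compared, here is a minimal scikit-learn illustration on synthetic data; the paper's workflow simulates external validation far more realistically than this, and the classifier and data here are placeholders.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score, train_test_split

    X, y = make_classification(n_samples=500, n_features=30, random_state=0)

    # variant 1: k-fold cross-validation on all available compounds
    cv_scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=10)

    # variant 2: external test set validation on a disjoint hold-out set
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    ext_score = RandomForestClassifier(random_state=0).fit(X_tr, y_tr).score(X_te, y_te)

    print(f"10-fold CV accuracy: {cv_scores.mean():.3f}")
    print(f"external test accuracy: {ext_score:.3f}")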
Eawag-Soil in enviPath: a new resource for exploring regulatory pesticide soil biodegradation pathways and half-life data.
Developing models for the prediction of microbial biotransformation pathways and half-lives of trace organic contaminants in different environments requires as training data easily accessible and sufficiently large collections of respective biotransformation data that are annotated with metadata on study conditions. Here, we present the Eawag-Soil package, a public database that has been developed to contain all freely accessible regulatory data on pesticide degradation in laboratory soil simulation studies for pesticides registered in the EU (282 degradation pathways, 1535 reactions, 1619 compounds and 4716 biotransformation half-life values with corresponding metadata on study conditions)…
Extracting information from support vector machines for pattern-based classification
Statistical machine learning algorithms building on patterns found by pattern mining algorithms have to cope with large solution sets and thus the high dimensionality of the feature space. Vice versa, pattern mining algorithms are frequently applied to irrelevant instances, thus causing noise in the output. Solution sets of pattern mining algorithms also typically grow with increasing input datasets. The paper proposes an approach to overcome these limitations. The approach extracts information from trained support vector machines, in particular their support vectors and their relevance according to their coefficients. It uses the support vectors along with their coefficients as input to pa…
Effect of Core-Crosslinking on Protein Corona Formation on Polymeric Micelles.
Most nanomaterials acquire a protein corona upon contact with biological fluids. The magnitude of this effect is strongly dependent both on surface and structure of the nanoparticle. To define the contribution of the internal nanoparticle structure, protein corona formation of block copolymer micelles with poly(N-2-hydroxypropylmethacrylamide) (pHPMA) as hydrophilic shell, which are crosslinked (or not) in the hydrophobic core, is comparatively analyzed. Both types of micelles are incubated with human blood plasma and separated by asymmetrical flow field-flow fractionation (AF4). Their size is determined by dynamic light scattering and proteins within the micellar fraction are characterized by…
Privacy Preserving Client/Vertical-Servers Classification
We present a novel client/vertical-servers architecture for a hybrid multi-party classification problem. The model consists of clients whose attributes are distributed on multiple servers and remain secret during training and testing. Our solution builds privacy-preserving random forests and completes them with a special private set intersection protocol that provides a central commodity server with anonymous conditional statistics. Subsequently, the private set intersection protocol can be used to privately classify the queries of new clients using the commodity server’s statistics. The proviso is that the commodity server must not collude with other parties. In cases where this restriction …
Modeling recurrent distributions in streams using possible worlds
Discovering changes in the data distribution of streams and discovering recurrent data distributions are challenging problems in data mining and machine learning. Both have received a lot of attention in the context of classification. With the ever increasing growth of data, however, there is a high demand for compact and universal representations of data streams that enable the user to analyze current as well as historic data without having access to the raw data. As a first step in this direction, we propose a condensed representation that captures the various — possibly recurrent — data distributions of the stream by extending the notion of possible worlds. The representation en…
Modeling Multi-label Recurrence in Data Streams
Most of the existing data stream algorithms assume a single label as the target variable. However, in many applications, each observation is assigned to several labels with latent dependencies among them, and the underlying target function may change over time. Classification of such non-stationary multi-label streaming data with the consideration of dependencies among labels and potential drifts is a challenging task. The few existing studies mostly cope with drifts implicitly, and all learn models on the original label space, which requires a lot of time and memory. None of them consider recurrent drifts in multi-label streams and particularly drifts and recurrences visible in a latent label spa…
Scavenger – A Framework for Efficient Evaluation of Dynamic and Modular Algorithms
Machine Learning methods and algorithms are often highly modular in the sense that they rely on a large number of subalgorithms that are in principle interchangeable. For example, it is often possible to use various kinds of pre- and post-processing and various base classifiers or regressors as components of the same modular approach. We propose a framework, called Scavenger, that allows evaluating whole families of conceptually similar algorithms efficiently. The algorithms are represented as compositions, couplings and products of atomic subalgorithms. This allows partial results to be cached and shared between different instances of a modular algorithm, so that potentially expensive part…
Gaussian Mixture Models and Model Selection for [18F] Fluorodeoxyglucose Positron Emission Tomography Classification in Alzheimer’s Disease
We present a method to discover discriminative brain metabolism patterns in [18F] fluorodeoxyglucose positron emission tomography (PET) scans, facilitating the clinical diagnosis of Alzheimer's disease. In this work, the term "pattern" stands for a certain brain region that characterizes a target group of patients and can be used for classification as well as interpretation purposes. Thus, it can be understood as a so-called "region of interest (ROI)". In the literature, an ROI is often found by a given brain atlas that defines a number of brain regions, which corresponds to an anatomical approach. The present work introduces a semi-data-driven approach that is based on learning the charac…
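A semi-data-driven ROI discovery of this kind can be prototyped by fitting Gaussian mixtures to discriminative voxel coordinates and selecting the number of components with an information criterion. The sketch below uses scikit-learn's GaussianMixture with the BIC and random stand-in data; it illustrates the model-selection step only, not the paper's pipeline.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    voxels = rng.normal(size=(1000, 3))      # stand-in for discriminative voxel coordinates

    # model selection: pick the number of Gaussian components by the BIC
    best_model, best_bic = None, np.inf
    for k in range(1, 11):
        gmm = GaussianMixture(n_components=k, covariance_type="full",
                              random_state=0).fit(voxels)
        bic = gmm.bic(voxels)
        if bic < best_bic:
            best_model, best_bic = gmm, bic

    print(best_model.n_components, best_bic)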
BMaD – A Boolean Matrix Decomposition Framework
Boolean matrix decomposition is a method to obtain a compressed representation of a matrix with Boolean entries. We present a modular framework that unifies several Boolean matrix decomposition algorithms, and provide methods to evaluate their performance. The main advantages of the framework are its modular approach and hence the flexible combination of the steps of a Boolean matrix decomposition and the capability of handling missing values. The framework is licensed under the GPLv3 and can be downloaded freely at http://projects.informatik.uni-mainz.de/bmad.
A probabilistic condensed representation of data for stream mining
Data mining and machine learning algorithms usually operate directly on the data. However, if the data is not available at once or consists of billions of instances, these algorithms easily become infeasible with respect to memory and run-time concerns. As a solution to this problem, we propose a framework, called MiDEO (Mining Density Estimates inferred Online), in which algorithms are designed to operate on a condensed representation of the data. In particular, we propose to use density estimates, which are able to represent billions of instances in a compact form and can be updated when new instances arrive. As an example for an algorithm that operates on density estimates, we consider t…
A Hybrid Machine Learning and Knowledge Based Approach to Limit Combinatorial Explosion in Biodegradation Prediction
One of the main tasks in chemical industry regarding the sustainability of a product is the prediction of its environmental fate, i.e., its degradation products and pathways. Current methods for the prediction of biodegradation products and pathways of organic environmental pollutants either do not take into account domain knowledge or do not provide probability estimates. In this chapter, we propose a hybrid knowledge-based and machine learning-based approach to overcome these limitations in the context of the University of Minnesota Pathway Prediction System (UM-PPS). The proposed solution performs relative reasoning in a machine learning framework, and obtains one probability estimate fo…
Session details: Volume I: Artificial intelligence & agents, distributed systems, and information systems: data mining track
Pairwise Learning to Rank by Neural Networks Revisited: Reconstruction, Theoretical Analysis and Practical Performance
We present a pairwise learning to rank approach based on a neural net, called DirectRanker, that generalizes the RankNet architecture. We show mathematically that our model is reflexive, antisymmetric, and transitive allowing for simplified training and improved performance. Experimental results on the LETOR MSLR-WEB10K, MQ2007 and MQ2008 datasets show that our model outperforms numerous state-of-the-art methods, while being inherently simpler in structure and using a pairwise approach only.
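The structural idea, that the pairwise output depends only on the difference of the two document representations and is therefore antisymmetric and reflexive by construction, can be sketched in a few lines of PyTorch. This is an illustrative sketch of that idea, not the published DirectRanker implementation; the layer sizes and the 136-feature input are assumptions.

    import torch
    import torch.nn as nn

    class PairwiseRanker(nn.Module):
        # shared feature network; the output depends only on the representation
        # difference, so f(x1, x2) = -f(x2, x1) and f(x, x) = 0 hold by design
        def __init__(self, n_features, hidden=64):
            super().__init__()
            self.features = nn.Sequential(
                nn.Linear(n_features, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
            )
            self.out = nn.Linear(hidden, 1, bias=False)   # no bias preserves antisymmetry

        def forward(self, x1, x2):
            diff = self.features(x1) - self.features(x2)
            return torch.tanh(self.out(diff))             # sign indicates the preferred order

    model = PairwiseRanker(n_features=136)                # e.g. LETOR-style feature vectors
    x1, x2 = torch.randn(8, 136), torch.randn(8, 136)
    target = torch.ones(8, 1)                             # +1: x1 should rank above x2
    loss = nn.MSELoss()(model(x1, x2), target)
    loss.backward()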
Integrating LSTMs with Online Density Estimation for the Probabilistic Forecast of Energy Consumption
In machine learning applications in the energy sector, it is often necessary to have both highly accurate predictions and information about the probabilities of certain scenarios to occur. We address this challenge by integrating and combining long short-term memory networks (LSTMs) and online density estimation into a real-time data streaming architecture of an energy trader. The online density estimation is done in the MiDEO framework, which estimates joint densities of data streams based on ensembles of chains of Hoeffding trees. One attractive feature of the solution is that queries can be sent to the here-called forecast-based point density estimators (FPDE) to derive information from …
cuBool: Bit-Parallel Boolean Matrix Factorization on CUDA-Enabled Accelerators
Boolean Matrix Factorization (BMF) is a commonly used technique in the field of unsupervised data analytics. The goal is to decompose a ground truth matrix C into a product of two matrices A and B, being either an exact or approximate rank-k factorization of C. Both exact and approximate factorization are time-consuming tasks due to their combinatorial complexity. In this paper, we introduce a massively parallel implementation of BMF - namely cuBool - in order to significantly speed up factorization of huge Boolean matrices. Our approach is based on alternately adjusting rows and columns of A and B using thousands of lightweight CUDA threads. The massively parallel manipulation of entries …
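A toy CPU analogue of alternately adjusting A and B, without the bit-parallel CUDA machinery, is a greedy single-bit local search over the factor matrices. The routine below is only meant to make the optimization target, the number of mismatches between C and the Boolean product A ∘ B, concrete; it is not cuBool.

    import numpy as np

    def boolean_product(A, B):
        # (A o B)[i, j] = OR_k (A[i, k] AND B[k, j])
        return (A.astype(int) @ B.astype(int) > 0).astype(np.uint8)

    def bmf_local_search(C, k, sweeps=20, seed=0):
        # greedily flip single bits of A and B whenever the flip reduces
        # the number of mismatched entries of C
        rng = np.random.default_rng(seed)
        n, m = C.shape
        A = rng.integers(0, 2, size=(n, k), dtype=np.uint8)
        B = rng.integers(0, 2, size=(k, m), dtype=np.uint8)
        error = int(np.sum(boolean_product(A, B) != C))
        for _ in range(sweeps):
            improved = False
            for M in (A, B):
                for i, j in np.ndindex(M.shape):
                    M[i, j] ^= 1
                    new_error = int(np.sum(boolean_product(A, B) != C))
                    if new_error < error:
                        error, improved = new_error, True
                    else:
                        M[i, j] ^= 1                      # revert the flip
            if not improved:
                break
        return A, B, error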
enviPath - The environmental contaminant biotransformation pathway resource
The University of Minnesota Biocatalysis/Biodegradation Database and Pathway Prediction System (UM-BBD/PPS) has been a unique resource covering microbial biotransformation pathways of primarily xenobiotic chemicals for over 15 years. This paper introduces the successor system, enviPath (The Environmental Contaminant Biotransformation Pathway Resource), which is a complete redesign and reimplementation of UM-BBD/PPS. enviPath uses the database from the UM-BBD/PPS as a basis, extends the use of this database, and allows users to include their own data to support multiple use cases. Relative reasoning is supported for the refinement of predictions and to allow its extensions in terms of previo…
Improving structural similarity based virtual screening using background knowledge
Background: Virtual screening in the form of similarity rankings is often applied in the early drug discovery process to rank and prioritize compounds from a database. This similarity ranking can be achieved with structural similarity measures. However, their general nature can lead to insufficient performance in some application cases. In this paper, we provide a link between ranking-based virtual screening and fragment-based data mining methods. The inclusion of binding-relevant background knowledge into a structural similarity measure improves the quality of the similarity rankings. This background knowledge in the form of binding relevant substructures can either be derived by hand selec…
Machine learning for a combined electroencephalographic anesthesia index to detect awareness under anesthesia
Spontaneous electroencephalogram (EEG) and auditory evoked potentials (AEP) have been suggested to monitor the level of consciousness during anesthesia. As both signals reflect different neuronal pathways, a combination of parameters from both signals may provide broader information about the brain status during anesthesia. Appropriate parameter selection and combination to a single index is crucial to take advantage of this potential. The field of machine learning offers algorithms for both parameter selection and combination. In this study, several established machine learning approaches including a method for the selection of suitable signal parameters and classification algorithms are a…
Rule Extraction From Binary Neural Networks With Convolutional Rules for Model Validation.
Classification approaches that allow the extraction of logical rules, such as decision trees, are often considered to be more interpretable than neural networks. Also, logical rules are comparatively easy to verify with any possible input. This is an important part in systems that aim to ensure correct operation of a given model. However, for high-dimensional input data such as images, the individual symbols, i.e., pixels, are not easily interpretable. Therefore, rule-based approaches are not typically used for this kind of high-dimensional data. We introduce the concept of first-order convolutional rules, which are logical rules that can be extracted using a convolutional neural network (CNN), and w…
Pruning Incremental Linear Model Trees with Approximate Lookahead
Incremental linear model trees with approximate lookahead are fast, but produce overly large trees. This is due to non-optimal splitting decisions boosted by a possibly unlimited number of examples obtained from a data source. To keep the processing speed high and the tree complexity low, appropriate incremental pruning techniques are needed. In this paper, we introduce a pruning technique for the class of incremental linear model trees with approximate lookahead on stationary data sources. Experimental results show that the advantage of approximate lookahead in terms of processing speed can be further improved by producing much smaller and consequently more explanatory, less memory consumi…
A Nonlinear Label Compression and Transformation Method for Multi-label Classification Using Autoencoders
Multi-label classification targets the prediction of multiple interdependent and non-exclusive binary target variables. Transformation-based algorithms transform the data set such that regular single-label algorithms can be applied to the problem. A special type of transformation-based classifiers are label compression methods, which compress the labels and then mostly use single label classifiers to predict the compressed labels. So far, there are no compression-based algorithms that follow a problem transformation approach and address non-linear dependencies in the labels. In this paper, we propose a new algorithm, called Maniac (Multi-lAbel classificatioN usIng AutoenCoders), which extra…
Long-term biodistribution study of HPMA-ran-LMA copolymers in vivo by means of 131I-labeling
Background: For the evaluation of macromolecular drug delivery systems suitable pre-clinical monitoring of potential nanocarrier systems is needed. In this regard, both short-term as well as long-term in vivo tracking is crucial to understand structure-property relationships of polymer carrier systems and their resulting pharmacokinetic profile. Based on former studies revealing favorable in vivo characteristics for 18F-labeled random (ran) copolymers consisting of N-(2-hydroxypropyl)methacrylamide (HPMA) and lauryl methacrylate (LMA) – including prolonged plasma half-life as well as enhanced tumor accumulation – the presented work focuses on their long-term investigation in the li…
CheS-Mapper - Chemical Space Mapping and Visualization in 3D
Analyzing chemical datasets is a challenging task for scientific researchers in the field of chemoinformatics. It is important, yet difficult to understand the relationship between the structure of chemical compounds, their physico-chemical properties, and biological or toxic effects. To that respect, visualization tools can help to better comprehend the underlying correlations. Our recently developed 3D molecular viewer CheS-Mapper (Chemical Space Mapper) divides large datasets into clusters of similar compounds and consequently arranges them in 3D space, such that their spatial proximity reflects their similarity. The user can indirectly determine similarity, by selecting which f…
Similarity boosted quantitative structure-activity relationship--a systematic study of enhancing structural descriptors by molecular similarity.
The concept of molecular similarity is one of the most central in the fields of predictive toxicology and quantitative structure-activity relationship (QSAR) research. Many toxicological responses result from a multimechanistic process and, consequently, structural diversity among the active compounds is likely. Combining this knowledge, we introduce similarity boosted QSAR modeling, where we calculate molecular descriptors using similarities with respect to representative reference compounds to aid a statistical learning algorithm in distinguishing between different structural classes. We present three approaches for the selection of reference compounds, one by literature search and two by…
Exploring Multiobjective Optimization for Multiview Clustering
We present a new multiview clustering approach based on multiobjective optimization. In contrast to existing clustering algorithms based on multiobjective optimization, it is generally applicable to data represented by two or more views and does not require specifying the number of clusters a priori. The approach builds upon the search capability of a multiobjective simulated annealing based technique, AMOSA, as the underlying optimization technique. In the first version of the proposed approach, an internal cluster validity index is used to assess the quality of different partitionings obtained using different views. A new way of checking the compatibility of these different partitioning…
A Survey of Multi-Label Topic Models
Every day, an enormous amount of text data is produced. Sources of text data include news, social media, emails, text messages, medical reports, scientific publications and fiction. To keep track of this data, there are categories, key words, tags or labels that are assigned to each text. Automatically predicting such labels is the task of multi-label text classification. Often however, we are interested in more than just the pure classification: rather, we would like to understand which parts of a text belong to the label, which words are important for the label or which labels occur together. Because of this, topic models may be used for multi-label classification as an interpretable mode…
Multi-label classification using boolean matrix decomposition
This paper introduces a new multi-label classifier based on Boolean matrix decomposition. Boolean matrix decomposition is used to extract, from the full label matrix, latent labels representing useful Boolean combinations of the original labels. Base level models predict latent labels, which are subsequently transformed into the actual labels by Boolean matrix multiplication with the second matrix from the decomposition. The new method is tested on six publicly available datasets with varying numbers of labels. The experimental evaluation shows that the new method works particularly well on datasets with a large number of labels and strong dependencies among them.
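Schematically, the classifier works as follows: given a Boolean decomposition Y ≈ A ∘ B obtained beforehand (e.g. with a Boolean matrix decomposition library such as BMaD), base models are trained to predict the latent labels in A, and their predictions are mapped back to the original labels by Boolean matrix multiplication with B. A hedged scikit-learn sketch of this two-step scheme, assuming each latent label column contains both classes:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def fit_bmd_classifier(X, A):
        # one base model per latent label, i.e. per column of A from Y ~ A o B
        return [LogisticRegression(max_iter=1000).fit(X, A[:, j])
                for j in range(A.shape[1])]

    def predict_bmd(models, B, X_new):
        A_hat = np.column_stack([m.predict(X_new) for m in models]).astype(int)
        # Boolean matrix multiplication maps latent labels back to the original labels
        return (A_hat @ B.astype(int) > 0).astype(int)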
A Brief History of Learning Symbolic Higher-Level Representations from Data (And a Curious Look Forward)
Learning higher-level representations from data has been on the agenda of AI research for several decades. In the paper, I will give a survey of various approaches to learning symbolic higher-level representations: feature construction and constructive induction, predicate invention, propositionalization, pattern mining, and mining time series patterns. Finally, I will give an outlook on how approaches to learning higher-level representations, symbolic and neural, can benefit from each other to solve current issues in machine learning.
Online Induction of Probabilistic Real Time Automata
Probabilistic real time automata (PRTAs) are a representation of dynamic processes arising in the sciences and industry. Currently, the induction of automata is divided into two steps: the creation of the prefix tree acceptor (PTA) and the merge procedure based on clustering of the states. These two steps can be very time-intensive when a PRTA is to be induced for massive or even unbounded data sets. The latter step can be processed efficiently, as scalable online clustering algorithms exist. However, the creation of the PTA can still be very time-consuming. To overcome this problem, we propose a genuine online PRTA induction approach that incorporates new instances by first collapsing…
Trading off accuracy for efficiency by randomized greedy warping
Dynamic Time Warping (DTW) is a widely used distance measure for time series data mining. Its quadratic complexity requires the application of various techniques (e.g. warping constraints, lower-bounds) for deployment in real-time scenarios. In this paper we propose a randomized greedy warping algorithm for finding similarity between time series instances. We show that the proposed algorithm outperforms the simple greedy approach and also provides very good time series similarity approximation consistently, as compared to DTW. We show that the Randomized Time Warping (RTW) can be used in place of DTW as a fast similarity approximation technique by trading some classification accuracy for ve…
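To make the idea of greedy warping concrete, the sketch below follows a single monotone warping path, always taking the locally cheapest of the three admissible moves and breaking ties at random; it is an O(n+m) approximation of DTW, not the published RTW algorithm, whose randomization goes beyond simple tie-breaking.

    import numpy as np

    def greedy_warping_distance(a, b, rng=None):
        # follow one monotone warping path from (0, 0) to (len(a)-1, len(b)-1),
        # always taking the locally cheapest step; ties are broken at random
        rng = rng or np.random.default_rng()
        i = j = 0
        cost = abs(a[0] - b[0])
        while i < len(a) - 1 or j < len(b) - 1:
            moves = []
            if i < len(a) - 1 and j < len(b) - 1:
                moves.append((abs(a[i + 1] - b[j + 1]), i + 1, j + 1))
            if i < len(a) - 1:
                moves.append((abs(a[i + 1] - b[j]), i + 1, j))
            if j < len(b) - 1:
                moves.append((abs(a[i] - b[j + 1]), i, j + 1))
            best = min(m[0] for m in moves)
            ties = [m for m in moves if m[0] == best]
            _, i, j = ties[rng.integers(len(ties))]
            cost += best
        return cost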
Model selection based product kernel learning for regression on graphs
The choice of a suitable graph kernel is intrinsically hard and often cannot be made in an informed manner for a given dataset. Methods for multiple kernel learning offer a possible remedy, as they combine and weight kernels on the basis of a labeled training set of molecules to define a new kernel. Whereas most methods for multiple kernel learning focus on learning convex linear combinations of kernels, we propose to combine kernels in products, which theoretically enables higher expressiveness. In experiments on ten publicly available chemical QSAR datasets we show that product kernel learning is on no dataset significantly worse than any of the competing kernel methods and on average the…
Deep neural networks to recover unknown physical parameters from oscillating time series.
PLOS ONE 17(5), e0268439 (2022). doi:10.1371/journal.pone.0268439
CheS-Mapper 2.0 for visual validation of (Q)SAR models
Background: Sound statistical validation is important to evaluate and compare the overall performance of (Q)SAR models. However, classical validation does not support the user in better understanding the properties of the model or the underlying data. Even though a number of visualization tools for analyzing (Q)SAR information in small molecule datasets exist, integrated visualization methods that allow the investigation of model validation results are still lacking. Results: We propose visual validation, as an approach for the graphical inspection of (Q)SAR model validation results. The approach applies the 3D viewer CheS-Mapper, an open-source application for the exploration of sm…
Towards Bankruptcy Prediction: Deep Sentiment Mining to Detect Financial Distress from Business Management Reports
Due to their disclosure required by law, business management reports have become publicly available for a large number of companies, and these reports offer the opportunity to assess the financial health or distress of a company, both quantitatively from the balance sheets and qualitatively from the text. In this paper, we analyze the potential of deep sentiment mining from the textual parts of business management reports and aim to detect signals for financial distress. We (1) created the largest corpus of business reports analyzed qualitatively to date, (2) defined a non-trivial target variable based on the so-called Altman Z-score, (3) developed a filtering of sentences based on class-co…
An In-Depth Experimental Comparison of RNTNs and CNNs for Sentence Modeling
The goal of modeling sentences is to accurately represent their meaning for different tasks. A variety of deep learning architectures have been proposed to model sentences, however, little is known about their comparative performance on a common ground, across a variety of datasets, and on the same level of optimization. In this paper, we provide such a novel comparison for two popular architectures, Recursive Neural Tensor Networks (RNTNs) and Convolutional Neural Networks (CNNs). Although RNTNs have been shown to work well in many cases, they require intensive manual labeling due to the vanishing gradient problem. To enable an extensive comparison of the two architectures, this paper empl…
Cinema audiences reproducibly vary the chemical composition of air during films, by broadcasting scene specific emissions on breath
Human beings continuously emit chemicals into the air by breath and through the skin. In order to determine whether these emissions vary predictably in response to audiovisual stimuli, we have continuously monitored carbon dioxide and over one hundred volatile organic compounds in a cinema. It was found that many airborne chemicals in cinema air varied distinctively and reproducibly with time for a particular film, even in different screenings to different audiences. Application of scene labels and advanced data mining methods revealed that specific film events, namely “suspense” or “comedy” caused audiences to change their emission of specific chemicals. These event-type synchronou…
Fair Pairwise Learning to Rank
Ranking algorithms based on Neural Networks have been a topic of recent research. Ranking is employed in everyday applications like product recommendations, search results, or even in finding good candidates for hiring. However, Neural Networks are mostly opaque tools, and it is hard to evaluate why a specific candidate, for instance, was not considered. Therefore, for neural-based ranking methods to be trustworthy, it is crucial to guarantee that the outcome is fair and that the decisions are not discriminating against people according to sensitive attributes such as gender, sexual orientation, or ethnicity. In this work we present a family of fair pairwise learning to rank approaches based on Neur…
Graph Clustering with Local Density-Cut
In this paper, we introduce a new graph clustering algorithm, called Dcut. The basic idea is to envision the graph clustering as a local density-cut problem. To identify meaningful communities in a graph, a density-connected tree is first constructed in a local fashion. Building upon the local intuitive density-connected tree, Dcut allows partitioning a graph into multiple densely tight-knit clusters effectively and efficiently. We have demonstrated that our method has several attractive benefits: (a) Dcut provides an intuitive criterion to evaluate the goodness of a graph clustering in a more precise way; (b) Building upon the density-connected tree, Dcut allows identifying high-quality cl…
Maximum Common Subgraph based locally weighted regression
This paper investigates a simple, yet effective method for regression on graphs, in particular for applications in chemoinformatics and for quantitative structure-activity relationships (QSARs). The method combines Locally Weighted Learning (LWL) with Maximum Common Subgraph (MCS) based graph distances. More specifically, we investigate a variant of locally weighted regression on graphs (structures) that uses the maximum common subgraph for determining and weighting the neighborhood of a graph and feature vectors for the actual regression model. We show that this combination, LWL-MCS, outperforms other methods that use the local neighborhood of graphs for regression. The performance of this…
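The combination can be sketched as generic locally weighted regression in which the neighborhood and the weights come from a (graph) distance, for example one minus an MCS-based similarity, while the local linear model is fit on ordinary feature vectors. All names below are hypothetical, and the Gaussian weighting kernel is an assumption for the example.

    import numpy as np

    def locally_weighted_prediction(dist_to_train, X_train_vec, y_train, x_query_vec, k=10):
        # dist_to_train: distances from the query graph to every training graph,
        # e.g. 1 - MCS-based similarity; the local model is fit on feature vectors
        order = np.argsort(dist_to_train)[:k]
        d = dist_to_train[order]
        w = np.exp(-(d / (d.max() + 1e-12)) ** 2)          # kernel weights from the distance
        sw = np.sqrt(w)                                    # weighted least squares scaling
        A = np.hstack([X_train_vec[order], np.ones((len(order), 1))]) * sw[:, None]
        coef, *_ = np.linalg.lstsq(A, y_train[order] * sw, rcond=None)
        return float(x_query_vec @ coef[:-1] + coef[-1])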