
AUTHOR

Stefan Kramer

showing 75 related works from this author

Multi-label Classification Using Stacked Hierarchical Dirichlet Processes with Reduced Sampling Complexity

2018

Nonparametric topic models based on hierarchical Dirichlet processes (HDPs) allow the number of topics to be discovered automatically from the data. The computational complexity of standard Gibbs sampling techniques for model training is linear in the number of topics. Recently, it was reduced to be linear in the number of topics per word using a technique called alias sampling combined with Metropolis–Hastings (MH) sampling. We propose a different proposal distribution for the MH step based on the observation that distributions on the upper hierarchy level change more slowly than the document-specific distributions at the lower level. This reduces the sampling complexity, making it linear i…

Topic model; Computational complexity theory; Computer science; 02 engineering and technology; Latent Dirichlet allocation; Dirichlet distribution; symbols.namesake; Artificial Intelligence; 020204 information systems; 0202 electrical engineering electronic engineering information engineering; Mathematics; Multi-label classification; business.industry; Sampling (statistics); Pattern recognition; Human-Computer Interaction; Dirichlet process; Metropolis–Hastings algorithm; Hardware and Architecture; Test set; symbols; 020201 artificial intelligence & image processing; Artificial intelligence; business; Algorithm; Software; Information Systems; Gibbs sampling

2017 IEEE International Conference on Big Knowledge (ICBK)
researchProduct

Prototype-based learning on concept-drifting data streams

2014

Data stream mining has attracted growing attention due to its wide range of emerging applications, such as target marketing, email filtering and network intrusion detection. In this paper, we propose a prototype-based classification model for evolving data streams, called SyncStream, which dynamically models time-changing concepts and makes predictions in a local fashion. Instead of learning a single model on a sliding window or ensemble learning, SyncStream captures evolving concepts by dynamically maintaining a set of prototypes in a new data structure called the P-tree. The prototypes are obtained by error-driven representativeness learning and synchronization-inspired constrained clustering. To ide…

Data stream; Concept drift; business.industry; Computer science; Data stream mining; Constrained clustering; computer.software_genre; Data structure; Machine learning; Ensemble learning; Synchronization (computer science); Data mining; Artificial intelligence; business; computer

Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining

Online Density Estimation of Heterogeneous Data Streams in Higher Dimensions

2016

The joint density of a data stream is suitable for performing data mining tasks without having access to the original data. However, the methods proposed so far only target a small to medium number of variables, since their estimates rely on representing all the interdependencies between the variables of the data. High-dimensional data streams, which are becoming more and more frequent due to increasing numbers of interconnected devices, are, therefore, pushing these methods to their limits. To mitigate these limitations, we present an approach that projects the original data stream into a vector space and uses a set of representatives to provide an estimate. Due to the structure of the est…

Data stream; Mahalanobis distance; Computer science; Data stream mining; business.industry; 02 engineering and technology; Density estimation; computer.software_genre; Set (abstract data type); Software; 020204 information systems; 0202 electrical engineering electronic engineering information engineering; 020201 artificial intelligence & image processing; Data mining; business; computer; Curse of dimensionality; Vector space

Towards identifying drug side effects from social media using active learning and crowd sourcing.

2019

Motivation: Social media is a largely untapped source of information on side effects of drugs. Twitter in particular is widely used to report on everyday events and personal ailments. However, labeling this noisy data is a difficult problem because labeled training data is sparse and automatic labeling is error-prone. Crowd sourcing can help in such a scenario to obtain more reliable labels, but is expensive in comparison because workers have to be paid. To remedy this, semi-supervised active learning may reduce the number of labeled data needed and focus the manual labeling process on important information. Results: We extracted data from Twitter using the public API. We subsequently use Ama…

0303 health sciences; Focus (computing); Information retrieval; Drug-Related Side Effects and Adverse Reactions; Process (engineering); business.industry; Active learning (machine learning); Computer science; Computational Biology; Crowdsourcing; 03 medical and health sciences; 0302 clinical medicine; Problem-based learning; Code (cryptography); Crowdsourcing; Humans; Social media; 030212 general & internal medicine; business; Baseline (configuration management); Social Media; 030304 developmental biology

Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing

Polymeric Nanoparticles: Polymeric Nanoparticles with Neglectable Protein Corona (Small 18/2020)

2020

Biomaterials; Materials science; Chemical engineering; Asymmetrical Flow Field-Flow Fractionation; Drug delivery; General Materials Science; Protein Corona; General Chemistry; Polymeric nanoparticles; Biotechnology

Small

Forest of Normalized Trees: Fast and Accurate Density Estimation of Streaming Data

2018

Density estimation of streaming data is a relevant task in numerous domains. In this paper, a novel non-parametric density estimator called FRONT (forest of normalized trees) is introduced. It uses a structure of multiple normalized trees, segments the feature space of the data stream through a periodically updated linear transformation and is able to adapt to ever evolving data streams. FRONT provides accurate density estimation and performs favorably compared to existing online density estimators in terms of the average log score on multiple standard data sets. Its low complexity (linear runtime and constant memory usage) makes FRONT suitable by design for large data streams. Final…

Data stream; Computer science; Data stream mining; Feature vector; Estimator; 02 engineering and technology; Density estimation; 01 natural sciences; Data modeling; 010104 statistics & probability; Kernel (statistics); 0202 electrical engineering electronic engineering information engineering; 020201 artificial intelligence & image processing; 0101 mathematics; Random variable; Algorithm

2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA)

A label compression method for online multi-label classification

2018

Many modern applications deal with multi-label data, such as functional categorizations of genes, image labeling and text categorization. Classification of such data with a large number of labels and latent dependencies among them is a challenging task, and it becomes even more challenging when the data is received online and in chunks. Many of the current multi-label classification methods require a lot of time and memory, which makes them infeasible for practical real-world applications. In this paper, we propose a fast linear label space dimension reduction method that transforms the labels into a reduced encoded space and trains models on the obtained pseudo labels. Additionally…

Multi-label classification; Current (mathematics); business.industry; Computer science; Pattern recognition; 02 engineering and technology; Space (commercial competition); Compression method; Task (project management); Reduction (complexity); ComputingMethodologies_PATTERNRECOGNITION; Artificial Intelligence; 020204 information systems; Signal Processing; 0202 electrical engineering electronic engineering information engineering; 020201 artificial intelligence & image processing; Computer Vision and Pattern Recognition; Artificial intelligence; business; Software

Pattern Recognition Letters

Machine Learning and Knowledge Discovery in Databases. Research Track

2021

Information retrieval; Knowledge extraction; Computer science; Track (disk drive)

An inductive learning perspective on automated generation of feature models from given product specifications

2018

For explicit representation of commonality and variability of a product line, a feature model is most commonly used. An open question is how a feature model can be inductively learned in an automated way from a limited number of given product specifications in terms of features. We propose to address this problem through machine learning, more precisely inductive generalization from examples. However, no counter-examples are assumed to exist. Basically, a feature model needs to be complete with respect to all the given example specifications. First results indicate the feasibility of this approach, even for generating hierarchies, but many open challenges remain.

Product design specification; Theoretical computer science; Feature (computer vision); Generalization; Computer science; 020204 information systems; Product line; 0202 electrical engineering electronic engineering information engineering; Learning theory; 020207 software engineering; 02 engineering and technology; Representation (mathematics); Feature model

Proceedings of the 22nd International Systems and Software Product Line Conference - Volume 1

Hub-Centered Gene Network Reconstruction Using Automatic Relevance Determination

2012

Network inference deals with the reconstruction of biological networks from experimental data. A variety of different reverse engineering techniques are available; they differ in the underlying assumptions and mathematical models used. One common problem for all approaches stems from the complexity of the task, due to the combinatorial explosion of different network topologies for increasing network size. To handle this problem, constraints are frequently used, for example on the node degree, number of edges, or constraints on regulation functions between network components. We propose to exploit topological considerations in the inference of gene regulatory networks. Such systems are often…

Dynamic network analysis; Transcription Genetic; Microarrays; Science; Posterior probability; Gene regulatory network; Biology; computer.software_genre; Bioinformatics; Network topology; 03 medical and health sciences; 0302 clinical medicine; Yeasts; Genetics; Computer Simulation; Gene Regulatory Networks; Gene Networks; Biology; 030304 developmental biology; Regulatory Networks; Hyperparameter; 0303 health sciences; Multidisciplinary; Models Genetic; Systems Biology; Quantitative Biology::Molecular Networks; Cell Cycle; Q; R; Computational Biology; Bayesian network; Gene Expression Regulation; ROC Curve; Medicine; Data mining; computer; Algorithms; 030217 neurology & neurosurgery; Combinatorial explosion; Biological network; Research Article

PLoS ONE

Forecast of Study Success in the STEM Disciplines Based Solely on Academic Records

2020

We present an approach to forecasting study success in selected STEM disciplines (computer science, mathematics, physics, and meteorology), based solely on a student's academic record so far, without access to demographic or socioeconomic data. The purpose of the analysis is to improve student counseling, which may be essential for finishing a study program in one of the above-mentioned fields. Technically, we show the successful use of propositionalization on relational data from educational data mining, based on standard aggregates and basic LSTM-trained aggregates.

Relational database; 020204 information systems; 0202 electrical engineering electronic engineering information engineering; Mathematics education; 020201 artificial intelligence & image processing; 02 engineering and technology; Socioeconomic status; Educational data mining

Adapted Transfer of Distance Measures for Quantitative Structure-Activity Relationships and Data-Driven Selection of Source Datasets

2012

Quantitative structure–activity relationships are regression models relating chemical structure to biological activity. Such models make it possible to predict toxicologically relevant endpoints, which constitute the target outcomes of experiments. The task is often tackled by instance-based methods, which are all based on the notion of chemical (dis-)similarity. Our starting point is the observation by Raymond and Willett that the two families of chemical distance measures, fingerprint-based and maximum common subgraph-based measures, provide orthogonal information about chemical similarity. This paper presents a novel method for finding suitable combinations of them, called adapted tran…

General Computer Science; business.industry; Computer science; Fingerprint (computing); Chemical similarity; computer.software_genre; Machine learning; Distance measures; Data-driven; Task (project management); Similarity (network science); Learning curve; Data mining; Artificial intelligence; business; Transfer of learning; computer

The Computer Journal

Online Sparse Collapsed Hybrid Variational-Gibbs Algorithm for Hierarchical Dirichlet Process Topic Models

2017

Topic models for text analysis are most commonly trained using either Gibbs sampling or variational Bayes. Recently, hybrid variational-Gibbs algorithms have been found to combine the best of both worlds. Variational algorithms are fast to converge and more efficient for inference on new documents. Gibbs sampling enables sparse updates since each token is only associated with one topic instead of a distribution over all topics. Additionally, Gibbs sampling is unbiased. Although Gibbs sampling takes longer to converge, it is guaranteed to arrive at the true posterior after infinitely many iterations. By combining the two methods it is possible to reduce the bias of variational methods while …

Topic model; Hierarchical Dirichlet process; Speedup; Gibbs algorithm; Computer science; Nonparametric statistics; 02 engineering and technology; 010501 environmental sciences; 01 natural sciences; Latent Dirichlet allocation; Bayes' theorem; symbols.namesake; ComputingMethodologies_PATTERNRECOGNITION; 020204 information systems; 0202 electrical engineering electronic engineering information engineering; symbols; Algorithm; 0105 earth and related environmental sciences; Gibbs sampling

HPMA-Based Nanoparticles for Fast, Bioorthogonal iEDDA Ligation

2019

Fast and bioorthogonally reacting nanoparticles are attractive tools for biomedical applications such as tumor pretargeting. In this study, we designed an amphiphilic block copolymer system based on HPMA using different strategies to introduce the highly reactive click units 1,2,4,5-tetrazines (Tz) either at the chain end (Tz-CTA) or statistically into the hydrophobic block. This reactive group undergoes a rapid, bioorthogonal inverse electron-demand Diels-Alder reaction (iEDDA) with trans-cyclooctenes (TCO). Subsequently, this polymer platform was used for the preparation of different Tz-covered nanoparticles, such as micell…

Polymers and Plastics; Nanoparticle; Bioengineering; Fluorescence correlation spectroscopy; 02 engineering and technology; Conjugated system; 010402 general chemistry; 01 natural sciences; Micelle; Article; Biomaterials; Amphiphile; Materials Chemistry; Copolymer; Benzene Derivatives; Colloids; Micelles; Pretargeting; Aza Compounds; Cycloaddition Reaction; Chemistry; Other Research Radboud Institute for Health Sciences [Radboudumc 0]; 021001 nanoscience & nanotechnology; Combinatorial chemistry; 0104 chemical sciences; Cross-Linking Reagents; Methacrylates; Nanoparticles; Click Chemistry; Bioorthogonal chemistry; 0210 nano-technology

Alternating model trees

2015

Model tree induction is a popular method for tackling regression problems requiring interpretable models. Model trees are decision trees with multiple linear regression models at the leaf nodes. In this paper, we propose a method for growing alternating model trees, a form of option tree for regression problems. The motivation is that alternating decision trees achieve high accuracy in classification problems because they represent an ensemble classifier as a single tree structure. As in alternating decision trees for classification, our alternating model trees for regression contain splitter and prediction nodes, but we use simple linear regression functions as opposed to constant predicto…

Boosting (machine learning); Computer science; Weight-balanced tree; Decision tree; Logistic model tree; Statistics::Machine Learning; ComputingMethodologies_PATTERNRECOGNITION; Tree structure; Statistics; Linear regression; Alternating decision tree; Gradient boosting; Simple linear regression; Algorithm

Proceedings of the 30th Annual ACM Symposium on Applied Computing

Polymeric Nanoparticles with Neglectable Protein Corona

2020

Small : nano micro 16(18), 1907574 (2020). doi:10.1002/smll.201907574

540 Chemistry and allied sciences; Dispersity; 610 Medizin; micellar structures; Nanoparticle; Protein Corona; 02 engineering and technology; 010402 general chemistry; 01 natural sciences; Polyethylene Glycols; Biomaterials; chemistry.chemical_compound; Adsorption; 610 Medical sciences; Humans; General Materials Science; Particle Size; Gel electrophoresis; Chemistry; asymmetrical flow field-flow fractionation; Sarcosine; General Chemistry; 021001 nanoscience & nanotechnology; 0104 chemical sciences; Chemical engineering; 540 Chemie; drug delivery; Nanoparticles; Particle; Protein Corona; Particle size; Peptides; 0210 nano-technology; Hydrophobic and Hydrophilic Interactions; Ethylene glycol; Biotechnology

Small

Structural clustering of millions of molecular graphs

2014

We propose an algorithm for clustering very large molecular graph databases according to scaffolds (i.e., large structural overlaps) that are common between cluster members. Our approach first partitions the original dataset into several smaller datasets using a greedy clustering approach named APreClus based on dynamic seed clustering. APreClus is an online and instance incremental clustering algorithm delaying the final cluster assignment of an instance until one of the so-called pending clusters the instance belongs to has reached significant size and is converted to a fixed cluster. Once a cluster is fixed, APreClus recalculates the cluster centers, which are used as representatives for…

Clustering high-dimensional data; Fuzzy clustering; Theoretical computer science; k-medoids; Computer science; Single-linkage clustering; Correlation clustering; Constrained clustering; computer.software_genre; Complete-linkage clustering; Graph; Hierarchical clustering; ComputingMethodologies_PATTERNRECOGNITION; Data stream clustering; CURE data clustering algorithm; Canopy clustering algorithm; FLAME clustering; Affinity propagation; Data mining; Cluster analysis; computer; k-medians clustering; Clustering coefficient

Proceedings of the 29th Annual ACM Symposium on Applied Computing

A structural cluster kernel for learning on graphs

2012

In recent years, graph kernels have received considerable interest within the machine learning and data mining community. Here, we introduce a novel approach enabling kernel methods to utilize additional information hidden in the structural neighborhood of the graphs under consideration. Our novel structural cluster kernel (SCK) incorporates similarities induced by a structural clustering algorithm to improve state-of-the-art graph kernels. The approach taken is based on the idea that graph similarity can not only be described by the similarity between the graphs themselves, but also by the similarity they possess with respect to their structural neighborhood. We applied our novel kernel in…

Graph kernel; business.industry; Pattern recognition; ComputingMethodologies_PATTERNRECOGNITION; Kernel method; String kernel; Polynomial kernel; Kernel embedding of distributions; Radial basis function kernel; Artificial intelligence; Tree kernel; Cluster analysis; business; Mathematics

Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining

Optimization of curation of the dataset with data on repeated dose toxicity

2015

Introduction: For some areas of risk assessment, the use of alternative methods is supported by current directives and guidance (e.g. REACH, Cosmetics, BPD, PPP). According to OECD principles, alternative methods need to be scientifically valid. Methods: Within a project on grouping and development of predictive models supported by a grant from the Federal Ministry of Education and Research, we curated a dataset based on the RepDose and ELINCS databases. The final dataset consists of rat repeated dose toxicity studies for 1022 compounds representing 28 endpoints as organ-effect-combinations. Toxicological and modelling experts jointly carried out the curation and selection of endpoints as an iterative proces…

business.industry; Toxicity; Medicine; General Medicine; Toxicology; Bioinformatics; business

Toxicology Letters

Identification of ELF3 as an early transcriptional regulator of human urothelium

2014

Despite major advances in high-throughput and computational modelling techniques, understanding of the mechanisms regulating tissue specification and differentiation in higher eukaryotes, particularly man, remains limited. Microarray technology has been explored exhaustively in recent years and several standard approaches have been established to analyse the resultant datasets on a genome-wide scale. Gene expression time series offer a valuable opportunity to define temporal hierarchies and gain insight into the regulatory relationships of biological processes. However, unless datasets are exactly synchronous, time points cannot be compared directly. Here we present a data-driven ana…

Hepatocyte Nuclear Factor 3-alpha; Time series; Time Factors; PPARγ; Microarray; Normal Human Urothelium; Computational biology; Biology; Real-Time Polymerase Chain Reaction; Bioinformatics; Proto-Oncogene Proteins; Gene expression; Electric Impedance; Transcriptional regulation; Humans; RNA Small Interfering; Gene; Transcription factor; Molecular Biology; DNA Primers; Gene knockdown; Proto-Oncogene Proteins c-ets; Reverse Transcriptase Polymerase Chain Reaction; Microarray analysis techniques; Gene Expression Regulation Developmental; Cell Differentiation; Cell Biology; Microarray Analysis; Immunohistochemistry; ELF3; DNA-Binding Proteins; Differentiation; Gene Knockdown Techniques; Gene chip analysis; Gene expression; Urothelium; Transcription Factors; Developmental Biology

Developmental Biology

Online Estimation of Discrete Densities

2013

We address the problem of estimating a discrete joint density online, that is, the algorithm is only provided the current example and its current estimate. The proposed online estimator of discrete densities, EDDO (Estimation of Discrete Densities Online), uses classifier chains to model dependencies among features. Each classifier in the chain estimates the probability of one particular feature. Because a single chain may not provide a reliable estimate, we also consider ensembles of classifier chains and ensembles of weighted classifier chains. For all density estimators, we provide consistency proofs and propose algorithms to perform certain inference tasks. The empirical evaluation of t…

Concept drift; Stochastic process; Estimation theory; Bayesian probability; Estimator; Inference; Data mining; Classifier chains; computer.software_genre; Classifier (UML); computer; Mathematics

2013 IEEE 13th International Conference on Data Mining

HPMA-Based Nanocarriers for Effective Immune System Stimulation.

2019

The selective activation of the immune system using nanoparticles as a drug delivery system is a promising field in cancer therapy. Block copolymers from HPMA and laurylmethacrylate-co-hymecromone-methacrylate allow the preparation of multifunctionalized core-crosslinked micelles of variable size. To activate dendritic cells (DCs) as antigen-presenting cells, the carbohydrates mannose and trimannose are introduced into the hydrophilic corona as DC targeting units. To activate DCs, a lipophilic adjuvant (L18-MDP) is incorporated into the core of the micelles. To elicit an immune response, a model antigen peptide (SIINFEKL) is additionally attached to the polymeric nanoparticle via a click rea…

Azides; Polymers and Plastics; Ovalbumin; Polymers; Mannose; Bioengineering; 02 engineering and technology; 010402 general chemistry; 01 natural sciences; Micelle; Biomaterials; chemistry.chemical_compound; Drug Delivery Systems; Antigen; Adjuvants Immunologic; Materials Chemistry; Humans; Particle Size; Antigen-presenting cell; Micelles; Mannan; Chemistry; Dendritic Cells; 021001 nanoscience & nanotechnology; Peptide Fragments; 0104 chemical sciences; Immune System; Drug delivery; Biophysics; Methacrylates; Nanoparticles; Click Chemistry; Nanocarriers; 0210 nano-technology; Hydrophobic and Hydrophilic Interactions; Mannose receptor; Biotechnology

Macromolecular bioscience

Scalable Clustering by Iterative Partitioning and Point Attractor Representation

2016

Clustering very large datasets while preserving cluster quality remains a challenging data-mining task to date. In this paper, we propose an effective scalable clustering algorithm for large datasets that builds upon the concept of synchronization. Inherited from the powerful concept of synchronization, the proposed algorithm, CIPA (Clustering by Iterative Partitioning and Point Attractor Representations), is capable of handling very large datasets by iteratively partitioning them into thousands of subsets and clustering each subset separately. Using dynamic clustering by synchronization, each subset is then represented by a set of point attractors and outliers. Finally, CIPA identifies the…

Fuzzy clustering; General Computer Science; Computer science; Single-linkage clustering; Correlation clustering; Constrained clustering; 02 engineering and technology; computer.software_genre; ComputingMethodologies_PATTERNRECOGNITION; Data stream clustering; CURE data clustering algorithm; 020204 information systems; 0202 electrical engineering electronic engineering information engineering; Canopy clustering algorithm; 020201 artificial intelligence & image processing; Data mining; Cluster analysis; computer

ACM Transactions on Knowledge Discovery from Data

Efficient Redundancy Reduced Subgroup Discovery via Quadratic Programming

2012

Subgroup discovery is a task at the intersection of predictive and descriptive induction, aiming at identifying subgroups that have the most unusual statistical (distributional) characteristics with respect to a property of interest. Although a great deal of work has been devoted to the topic, one remaining problem concerns the redundancy of subgroup descriptions, which often effectively convey very similar information. In this paper, we propose a quadratic programming based approach to reduce the amount of redundancy in the subgroup rules. Experimental results on 12 datasets show that the resulting subgroups are in fact less redundant compared to standard methods. In addition, our experime…

Mathematical optimization; Redundancy (information theory); Theoretical computer science; Quadratic programming; Standard methods; Mathematics

Convolutional Neural Networks for the Identification of Regions of Interest in PET Scans: A Study of Representation Learning for Diagnosing Alzheimer…

2017

When diagnosing patients suffering from dementia based on imaging data like PET scans, the identification of suitable predictive regions of interest (ROIs) is of great importance. We present a case study of 3-D Convolutional Neural Networks (CNNs) for the detection of ROIs in this context, just using voxel data, without any knowledge given a priori. Our results on data from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) suggest that the predictive performance of the method is on par with that of state-of-the-art methods, with the additional benefit of potential insights into affected brain regions.

Computer science; business.industry; Deep learning; 05 social sciences; Context (language use); medicine.disease; computer.software_genre; Machine learning; Convolutional neural network; 03 medical and health sciences; Identification (information); 0302 clinical medicine; Neuroimaging; Voxel; mental disorders; medicine; Dementia; 0501 psychology and cognitive sciences; 050102 behavioral science & comparative psychology; Artificial intelligence; business; computer; Feature learning; 030217 neurology & neurosurgery

Exploring Multi-Objective Optimization for Multi-Label Classifier Ensembles

2019

Multi-label classification deals with the task of predicting multiple class labels for a given sample. Several performance metrics are designed in the literature to measure the quality of any multi-label classification technique. In general, existing multi-label classification approaches focus on optimizing only a single performance measure. The current work builds on the hypothesis that a weighted ensemble of multiple multi-label classifiers will lead to improved results. The appropriate weight combinations for combining the outputs of multiple classifiers can be selected after simultaneously optimizing different multi-label classification metrics like micro F1, hamming loss, 0/1 los…

Optimization problem; Linear programming; business.industry; Computer science; 02 engineering and technology; Machine learning; computer.software_genre; Multi-objective optimization; ComputingMethodologies_PATTERNRECOGNITION; 020204 information systems; 0202 electrical engineering electronic engineering information engineering; 020201 artificial intelligence & image processing; Artificial intelligence; business; Classifier (UML); computer

2019 IEEE Congress on Evolutionary Computation (CEC)

Cinema Data Mining

2015

While the physiological response of humans to emotional events or stimuli is well-investigated for many modalities (like EEG, skin resistance, ...), surprisingly little is known about the exhalation of so-called Volatile Organic Compounds (VOCs) at quite low concentrations in response to such stimuli. VOCs are molecules of relatively small mass that quickly evaporate or sublimate and can be detected in the air that surrounds us. The paper introduces a new field of application for data mining, where trace gas responses of people reacting on-line to films shown in cinemas (or movie theaters) are related to the semantic content of the films themselves. To do so, we measured the VOCs from a mov…

Movie theater; Granger causality; business.industry; Computer science; Data mining; computer.software_genre; Skin conductance; business; Causality; computer; Abductive reasoning

Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

Targeting cells of the immune system: mannosylated HPMA–LMA block-copolymer micelles for targeting of dendritic cells

2016

Background: Successful tumor immunotherapy depends on the induction of strong and sustained tumor antigen-specific immune responses by activated antigen-presenting cells (APCs) such as dendritic cells (DCs). Since nanoparticles have the potential to codeliver tumor-specific antigen and DC-stimulating adjuvant in a DC-targeting manner, we wanted to assess the suitability of mannosylated HPMA-LMA block polymers for immunotherapy. Materials & methods: Fluorescence-labeled block copolymer micelles derived from P(HPMA)-block-P(LMA) copolymers and the corresponding statistical copolymers were synthesized via RAFT polymerization, and loaded with the APC activator L18-MDP. Both types of copolymers wer…

Materials science; Polymers; Surface Properties; medicine.medical_treatment; Biomedical Engineering; Medicine (miscellaneous); Bone Marrow Cells; Bioengineering; 02 engineering and technology; Development; 01 natural sciences; Micelle; Polymerization; Immune system; Antigen; medicine; Humans; General Materials Science; Reversible addition−fragmentation chain-transfer polymerization; Micelles; 010405 organic chemistry; Dendritic Cells; Immunotherapy; Dendritic cell; 021001 nanoscience & nanotechnology; Molecular biology; 0104 chemical sciences; Cell biology; Methacrylates; Nanoparticles; Immunotherapy; 0210 nano-technology; Acetylmuramyl-Alanyl-Isoglutamine; Mannose; Adjuvant; Spleen; Mannose receptor

Nanomedicine
researchProduct

Innovative Strategies to Develop Chemical Categories Using a Combination of Structural and Toxicological Properties.

2016

Interest is increasing in the development of non-animal methods for toxicological evaluations. These methods are, however, particularly challenging for complex toxicological endpoints such as repeated dose toxicity. European legislation, e.g., the European Union's Cosmetic Directive and REACH, demands the use of alternative methods. Frameworks, such as the Read-across Assessment Framework or the Adverse Outcome Pathway Knowledge Base, support the development of these methods. The aim of the project presented in this publication was to develop substance categories for a read-across with complex endpoints of toxicity based on existing databases. The basic conceptual approach was to combine str…

0301 basic medicine; Quantitative structure–activity relationship; read across; Predictive Clustering Tree (PCT) method; Computer science; 610; 010501 environmental sciences; computer.software_genre; 600 Technology, medicine, applied sciences::610 Medicine and health; 01 natural sciences; 03 medical and health sciences; Pharmacology (medical); Cluster analysis; 0105 earth and related environmental sciences; Original Research; Alternative methods; Pharmacology; toxicological and structural similarity; business.industry; QSAR; lcsh:RM1-950; non-animal methods; QSAR; readacross; Predictive Clustering Tree (PCT) method; toxicological and structural similarity; Identification (information); Tree (data structure); 030104 developmental biology; Conceptual approach; lcsh:Therapeutics. Pharmacology; Knowledge base; non-animal methods; Data mining; Web service; business; computer
Frontiers in pharmacology
researchProduct

DySC: software for greedy clustering of 16S rRNA reads.

2012

Summary: Pyrosequencing technologies are frequently used for sequencing the 16S ribosomal RNA marker gene for profiling microbial communities. Clustering of the produced reads is an important but time-consuming task. We present Dynamic Seed-based Clustering (DySC), a new tool based on the greedy clustering approach that uses a dynamic seeding strategy. Evaluations based on the normalized mutual information (NMI) criterion show that DySC produces higher quality clusters than UCLUST and CD-HIT at a comparable runtime. Availability and implementation: DySC, implemented in C, is available at http://code.google.com/p/dysc/ under the GNU GPL license. Contact: bertil.schmidt@uni-mainz.de Sup…

Statistics and Probability; Computer science; business.industry; Sequence Analysis RNA; 16S ribosomal RNA; computer.software_genre; Biochemistry; Computer Science Applications; Computational Mathematics; Software; Computational Theory and Mathematics; RNA Ribosomal 16S; Cluster Analysis; Metagenome; Data mining; Cluster analysis; business; Molecular Biology; computer; Software
Bioinformatics (Oxford, England)
researchProduct

Incremental linear model trees on massive datasets

2013

The existence of massive datasets raises the need for algorithms that make efficient use of resources like memory and computation time. Besides well-known approaches such as sampling, online algorithms are being recognized as good alternatives, as they often process datasets faster using much less memory. The important class of algorithms learning linear model trees online (incremental linear model trees or ILMTs in the following) offers interesting options for regression tasks in this sense. However, surprisingly little is known about their performance, as there exists no large-scale evaluation on massive stationary datasets under equal conditions. Therefore, this paper shows their applica…

Class (computer programming); Computer science; Process (engineering); business.industry; Computation; Linear model; Sampling (statistics); computer.software_genre; Machine learning; KISS principle; Data mining; Artificial intelligence; Online algorithm; business; computer
Proceedings of the 28th Annual ACM Symposium on Applied Computing
researchProduct

A Large-Scale Empirical Evaluation of Cross-Validation and External Test Set Validation in (Q)SAR.

2013

(Q)SAR model validation is essential to ensure the quality of inferred models and to indicate future model predictivity on unseen compounds. Proper validation is also one of the requirements of regulatory authorities in order to accept the (Q)SAR model, and to approve its use in real-world scenarios as an alternative testing method. However, at the same time, the question of how to validate a (Q)SAR model, in particular whether to employ variants of cross-validation or external test set validation, is still under discussion. In this paper, we empirically compare k-fold cross-validation with external test set validation. To this end, we introduce a workflow that makes it possible to realistically simulate t…

Computer science; media_common.quotation_subject; Organic Chemistry; Scale (descriptive set theory); Variance (accounting); computer.software_genre; Cross-validation; Computer Science Applications; Model validation; Workflow; Structural Biology; Cheminformatics; Test set; Drug Discovery; Molecular Medicine; Quality (business); Data mining; computer; media_common
Molecular informatics
researchProduct

Eawag-Soil in enviPath: a new resource for exploring regulatory pesticide soil biodegradation pathways and half-life data.

2017

Developing models for the prediction of microbial biotransformation pathways and half-lives of trace organic contaminants in different environments requires as training data easily accessible and sufficiently large collections of respective biotransformation data that are annotated with metadata on study conditions. Here, we present the Eawag-Soil package, a public database that has been developed to contain all freely accessible regulatory data on pesticide degradation in laboratory soil simulation studies for pesticides registered in the EU (282 degradation pathways, 1535 reactions, 1619 compounds and 4716 biotransformation half-life values with corresponding metadata on study conditions)…

0301 basic medicine; 10120 Department of Chemistry; Databases Factual; Soil biodegradation; 010501 environmental sciences; Management Monitoring Policy and Law; 01 natural sciences; Models Biological; 03 medical and health sciences; Soil; Resource (project management); Biotransformation; 2308 Management Monitoring Policy and Law; Soil retrogression and degradation; 540 Chemistry; Environmental Chemistry; Soil Pollutants; Pesticides; Biotransformation; 0105 earth and related environmental sciences; Training set; Chemistry; Public Health Environmental and Occupational Health; General Medicine; 2739 Public Health Environmental and Occupational Health; 15. Life on land; Pesticide; Metadata; 030104 developmental biology; Biodegradation Environmental; 13. Climate action; Environmental chemistry; 2304 Environmental Chemistry; Pesticide degradation; Biochemical engineering; Half-Life
Environmental science. Processes & impacts
researchProduct

Extracting information from support vector machines for pattern-based classification

2014

Statistical machine learning algorithms building on patterns found by pattern mining algorithms have to cope with large solution sets and thus the high dimensionality of the feature space. Vice versa, pattern mining algorithms are frequently applied to irrelevant instances, thus causing noise in the output. Solution sets of pattern mining algorithms also typically grow with increasing input datasets. The paper proposes an approach to overcome these limitations. The approach extracts information from trained support vector machines, in particular their support vectors and their relevance according to their coefficients. It uses the support vectors along with their coefficients as input to pa…

business.industry; Computer science; Feature vector; Solution set; Pattern recognition; computer.software_genre; Graph; Domain (software engineering); Support vector machine; Relevance (information retrieval); Fraction (mathematics); Noise (video); Artificial intelligence; Data mining; business; computer
Proceedings of the 29th Annual ACM Symposium on Applied Computing
researchProduct

Effect of Core-Crosslinking on Protein Corona Formation on Polymeric Micelles.

2021

Most nanomaterials acquire a protein corona upon contact with biological fluids. The magnitude of this effect depends strongly on both the surface and the structure of the nanoparticle. To define the contribution of the internal nanoparticle structure, protein corona formation on block copolymer micelles with poly(N-2-hydroxypropylmethacrylamide) (pHPMA) as the hydrophilic shell, which are either crosslinked or not crosslinked in the hydrophobic core, is comparatively analyzed. Both types of micelles are incubated with human blood plasma and separated by asymmetrical flow field-flow fractionation (AF4). Their size is determined by dynamic light scattering and proteins within the micellar fraction are characterized by…

Polymers and Plastics; Chemical Phenomena; Light; Polymers; Nanoparticle; Bioengineering; Protein Corona; 02 engineering and technology; 010402 general chemistry; 01 natural sciences; Micelle; Mass Spectrometry; Polyethylene Glycols; Biomaterials; Corona (optical phenomenon); Plasma; Dynamic light scattering; Materials Chemistry; Copolymer; Humans; Scattering Radiation; Chromatography High Pressure Liquid; Micelles; Gel electrophoresis; Chemistry; 021001 nanoscience & nanotechnology; Blood proteins; 0104 chemical sciences; Nanostructures; Cross-Linking Reagents; Biophysics; Protein Corona; Adsorption; 0210 nano-technology; Hydrophobic and Hydrophilic Interactions; Biotechnology
Macromolecular bioscience
researchProduct

Privacy Preserving Client/Vertical-Servers Classification

2019

We present a novel client/vertical-servers architecture for the hybrid multi-party classification problem. The model consists of clients whose attributes are distributed on multiple servers and remain secret during training and testing. Our solution builds privacy-preserving random forests and completes them with a special private set intersection protocol that provides a central commodity server with anonymous conditional statistics. Subsequently, the private set intersection protocol can be used to privately classify the queries of new clients using the commodity server’s statistics. The proviso is that the commodity server must not collude with other parties. In cases where this restriction …

Public-key cryptography; Computer science; business.industry; Server; Commodity; Secure multi-party computation; Effective method; Architecture; business; Protocol (object-oriented programming); Random forest; Computer network
researchProduct

Modeling recurrent distributions in streams using possible worlds

2015

Discovering changes in the data distribution of streams and discovering recurrent data distributions are challenging problems in data mining and machine learning. Both have received a lot of attention in the context of classification. With the ever-increasing growth of data, however, there is a high demand for compact and universal representations of data streams that enable the user to analyze current as well as historic data without having access to the raw data. As a first step in this direction, we propose a condensed representation that captures the various (possibly recurrent) data distributions of the stream by extending the notion of possible worlds. The representation en…

Possible world; Basis (linear algebra); Computer science; Data stream mining; Representation (systemics); Context (language use); Data pre-processing; Data mining; Raw data; computer.software_genre; computer; Data modeling
2015 IEEE International Conference on Data Science and Advanced Analytics (DSAA)
researchProduct

Modeling Multi-label Recurrence in Data Streams

2019

Most of the existing data stream algorithms assume a single label as the target variable. However, in many applications, each observation is assigned to several labels with latent dependencies among them, and the target function of these labels may change over time. Classification of such non-stationary multi-label streaming data with the consideration of dependencies among labels and potential drifts is a challenging task. The few existing studies mostly cope with drifts implicitly, and all learn models on the original label space, which requires a lot of time and memory. None of them consider recurrent drifts in multi-label streams and particularly drifts and recurrences visible in a latent label spa…

Change over time; Multi-label classification; Data stream; business.industry; Computer science; Data stream mining; Space dimension; Pattern recognition; ComputingMethodologies_PATTERNRECOGNITION; Streaming data; Artificial intelligence; business; Classifier (UML); Decoding methods
2019 IEEE International Conference on Big Knowledge (ICBK)
researchProduct

Scavenger – A Framework for Efficient Evaluation of Dynamic and Modular Algorithms

2015

Machine Learning methods and algorithms are often highly modular in the sense that they rely on a large number of subalgorithms that are in principle interchangeable. For example, it is often possible to use various kinds of pre- and post-processing and various base classifiers or regressors as components of the same modular approach. We propose a framework, called Scavenger, that allows evaluating whole families of conceptually similar algorithms efficiently. The algorithms are represented as compositions, couplings and products of atomic subalgorithms. This allows partial results to be cached and shared between different instances of a modular algorithm, so that potentially expensive part…

Theoretical computer science; Backup; business.industry; Computer science; Distributed computing; Cache; Modular algorithm; Load balancing (computing); Modular design; business; Algorithm
researchProduct

BMaD – A Boolean Matrix Decomposition Framework

2014

Boolean matrix decomposition is a method to obtain a compressed representation of a matrix with Boolean entries. We present a modular framework that unifies several Boolean matrix decomposition algorithms, and provide methods to evaluate their performance. The main advantages of the framework are its modular approach and hence the flexible combination of the steps of a Boolean matrix decomposition and the capability of handling missing values. The framework is licensed under the GPLv3 and can be downloaded freely at http://projects.informatik.uni-mainz.de/bmad.

Matrix (mathematics); Theoretical computer science; And-inverter graph; Boolean circuit; Decomposition (computer science); Logical matrix; Circuit minimization for Boolean functions; Representation (mathematics); Standard Boolean model; Mathematics
researchProduct

A probabilistic condensed representation of data for stream mining

2014

Data mining and machine learning algorithms usually operate directly on the data. However, if the data is not available at once or consists of billions of instances, these algorithms easily become infeasible with respect to memory and run-time concerns. As a solution to this problem, we propose a framework, called MiDEO (Mining Density Estimates inferred Online), in which algorithms are designed to operate on a condensed representation of the data. In particular, we propose to use density estimates, which are able to represent billions of instances in a compact form and can be updated when new instances arrive. As an example of an algorithm that operates on density estimates, we consider t…

Task (computing); Association rule learning; Data stream mining; Simple (abstract algebra); Computer science; Probabilistic logic; Probabilistic analysis of algorithms; Algorithm design; Data mining; Representation (mathematics); computer.software_genre; computer
2014 International Conference on Data Science and Advanced Analytics (DSAA)
researchProduct

A Hybrid Machine Learning and Knowledge Based Approach to Limit Combinatorial Explosion in Biodegradation Prediction

2016

One of the main tasks in the chemical industry regarding the sustainability of a product is the prediction of its environmental fate, i.e., its degradation products and pathways. Current methods for the prediction of biodegradation products and pathways of organic environmental pollutants either do not take into account domain knowledge or do not provide probability estimates. In this chapter, we propose a hybrid knowledge-based and machine learning-based approach to overcome these limitations in the context of the University of Minnesota Pathway Prediction System (UM-PPS). The proposed solution performs relative reasoning in a machine learning framework, and obtains one probability estimate fo…

Engineering; business.industry; Context (language use); Machine learning; computer.software_genre; Random forest; Set (abstract data type); Transformation (function); Domain knowledge; Sensitivity (control systems); Artificial intelligence; Precision and recall; business; computer; Combinatorial explosion
researchProduct

Session details: Volume I: Artificial intelligence & agents, distributed systems, and information systems: data mining track

2013

Computer science; Track (disk drive); Real-time computing; Volume (computing); Information system; Session (computer science)
Proceedings of the 28th Annual ACM Symposium on Applied Computing
researchProduct

Pairwise Learning to Rank by Neural Networks Revisited: Reconstruction, Theoretical Analysis and Practical Performance

2020

We present a pairwise learning to rank approach based on a neural net, called DirectRanker, that generalizes the RankNet architecture. We show mathematically that our model is reflexive, antisymmetric, and transitive, allowing for simplified training and improved performance. Experimental results on the LETOR MSLR-WEB10K, MQ2007 and MQ2008 datasets show that our model outperforms numerous state-of-the-art methods, while being inherently simpler in structure and using a pairwise approach only.
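
To make the antisymmetry property concrete, the following is a minimal sketch of one way a pairwise scorer can be antisymmetric by construction. The abstract does not spell out the architecture, so everything here is an assumption for illustration: both items pass through the same feature layer, the feature difference is weighted without a bias term, and an odd activation (tanh) is applied. The names `features`, `pairwise_score`, `weights` and `out_weights` are hypothetical.

```python
import math


def features(x, weights):
    """Shared feature layer (hypothetical): one tanh unit per weight row."""
    return [math.tanh(sum(w * v for w, v in zip(row, x))) for row in weights]


def pairwise_score(x1, x2, weights, out_weights):
    """Score in (-1, 1); positive means x1 is ranked above x2.

    Both inputs use the same feature layer, and the output is an odd
    function of the feature difference with no bias term. Swapping the
    arguments therefore flips the sign, and scoring an item against
    itself gives exactly zero.
    """
    f1 = features(x1, weights)
    f2 = features(x2, weights)
    return math.tanh(sum(w * (u - v) for w, u, v in zip(out_weights, f1, f2)))
```

Under these assumptions the sign of the score depends only on which item has the larger weighted feature sum, so the induced ordering is also transitive.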

Transitive relation; Pairwise learning; Theoretical computer science; Artificial neural network; Antisymmetric relation; Computer science; Rank (computer programming); Structure (category theory); Pairwise comparison; Learning to rank
researchProduct

Integrating LSTMs with Online Density Estimation for the Probabilistic Forecast of Energy Consumption

2019

In machine learning applications in the energy sector, it is often necessary to have both highly accurate predictions and information about the probabilities of certain scenarios occurring. We address this challenge by integrating and combining long short-term memory networks (LSTMs) and online density estimation into a real-time data streaming architecture of an energy trader. The online density estimation is done in the MiDEO framework, which estimates joint densities of data streams based on ensembles of chains of Hoeffding trees. One attractive feature of the solution is that queries can be sent to the so-called forecast-based point density estimators (FPDE) to derive information from …

Data stream; Computer science; Data stream mining; 020209 energy; Probabilistic logic; Estimator; 02 engineering and technology; Energy consumption; Density estimation; computer.software_genre; 0202 electrical engineering electronic engineering information engineering; Feature (machine learning); 020201 artificial intelligence & image processing; Data mining; Representation (mathematics); computer
researchProduct

cuBool: Bit-Parallel Boolean Matrix Factorization on CUDA-Enabled Accelerators

2018

Boolean Matrix Factorization (BMF) is a commonly used technique in the field of unsupervised data analytics. The goal is to decompose a ground truth matrix C into a product of two matrices A and B that form either an exact or an approximate rank-k factorization of C. Both exact and approximate factorization are time-consuming tasks due to their combinatorial complexity. In this paper, we introduce a massively parallel implementation of BMF, namely cuBool, in order to significantly speed up factorization of huge Boolean matrices. Our approach is based on alternately adjusting rows and columns of A and B using thousands of lightweight CUDA threads. The massively parallel manipulation of entries …
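
The bit-parallel idea can be illustrated with a small CPU-side sketch. This is not cuBool's kernel or data layout, which the abstract does not describe; it only assumes that each matrix row is packed into one machine word (a Python `int` stands in for it here) so that a whole row of the Boolean product A∘B and its disagreement with C are obtained with OR, XOR and a population count. The helper names `pack`, `product_row` and `hamming_error` are hypothetical.

```python
def pack(bits):
    """Pack a list of 0/1 values into an int (bit j holds column j)."""
    word = 0
    for j, bit in enumerate(bits):
        if bit:
            word |= 1 << j
    return word


def product_row(a_row, b_rows):
    """One row of the Boolean product: OR the packed rows of B selected by a_row."""
    out = 0
    for k, selected in enumerate(a_row):
        if selected:
            out |= b_rows[k]
    return out


def hamming_error(c_rows, a, b_rows):
    """Number of cells in which C and the Boolean product of A and B differ."""
    return sum(
        bin(c_row ^ product_row(a_row, b_rows)).count("1")
        for c_row, a_row in zip(c_rows, a)
    )
```

A factorization is exact when `hamming_error` returns 0; an approximate factorization tries to make this count small.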

Speedup; Rank (linear algebra); Computer science; 02 engineering and technology; Parallel computing; Matrix decomposition; CUDA; Matrix (mathematics); Factorization; 020204 information systems; Singular value decomposition; 0202 electrical engineering electronic engineering information engineering; 020201 artificial intelligence & image processing; Massively parallel; Integer (computer science)
2018 IEEE 24th International Conference on Parallel and Distributed Systems (ICPADS)
researchProduct

Machine learning for a combined electroencephalographic anesthesia index to detect awareness under anesthesia

2020

Spontaneous electroencephalogram (EEG) and auditory evoked potentials (AEP) have been suggested to monitor the level of consciousness during anesthesia. As both signals reflect different neuronal pathways, a combination of parameters from both signals may provide broader information about the brain status during anesthesia. Appropriate parameter selection and combination to a single index is crucial to take advantage of this potential. The field of machine learning offers algorithms for both parameter selection and combination. In this study, several established machine learning approaches including a method for the selection of suitable signal parameters and classification algorithms are a…

Support Vector Machine; Physiology; Computer science; Electroencephalography; computer.software_genre; Field (computer science); Machine Learning; 0302 clinical medicine; Level of consciousness; Anesthesiology; 030202 anesthesiology; Medicine and Health Sciences; Anesthesia; media_common; Clinical Neurophysiology; Anesthesiology Monitoring; Brain Mapping; Multidisciplinary; Artificial neural network; medicine.diagnostic_test; Pharmaceutics; Applied Mathematics; Simulation and Modeling; Q; Unconsciousness; R; Electroencephalography; Neuronal pathway; Electrophysiology; Bioassays and Physiological Analysis; Brain Electrophysiology; Anesthesia; Physical Sciences; Evoked Potentials Auditory; Medicine; medicine.symptom; Algorithms; Anesthetics Intravenous; Research Article; Computer and Information Sciences; Consciousness; Imaging Techniques; Cognitive Neuroscience; Science; media_common.quotation_subject; Neurophysiology; Neuroimaging; Anesthesia General; Research and Analysis Methods; Bayesian inference; Machine learning; Machine Learning Algorithms; 03 medical and health sciences; Consciousness Monitors; Drug Therapy; Artificial Intelligence; Monitoring Intraoperative; Support Vector Machines; medicine; Humans; Monitoring Physiologic; business.industry; Electrophysiological Techniques; Biology and Life Sciences; Support vector machine; Statistical classification; Cognitive Science; Neural Networks Computer; Artificial intelligence; Clinical Medicine; Consciousness; business; computer; Mathematics; 030217 neurology & neurosurgery; Neuroscience
PLOS ONE
researchProduct

Pruning Incremental Linear Model Trees with Approximate Lookahead

2014

Incremental linear model trees with approximate lookahead are fast, but produce overly large trees. This is due to non-optimal splitting decisions boosted by a possibly unlimited number of examples obtained from a data source. To keep the processing speed high and the tree complexity low, appropriate incremental pruning techniques are needed. In this paper, we introduce a pruning technique for the class of incremental linear model trees with approximate lookahead on stationary data sources. Experimental results show that the advantage of approximate lookahead in terms of processing speed can be further improved by producing much smaller and consequently more explanatory, less memory consumi…

Stationary process; Computational Theory and Mathematics; Computer science; Linear model; Pruning (decision trees); Algorithm; Tree (graph theory); Computer Science Applications; Information Systems; Data modeling
IEEE Transactions on Knowledge and Data Engineering
researchProduct

A Nonlinear Label Compression and Transformation Method for Multi-label Classification Using Autoencoders

2016

Multi-label classification targets the prediction of multiple interdependent and non-exclusive binary target variables. Transformation-based algorithms transform the data set such that regular single-label algorithms can be applied to the problem. A special type of transformation-based classifiers are label compression methods, which compress the labels and then mostly use single label classifiers to predict the compressed labels. So far, there are no compression-based algorithms that follow a problem transformation approach and address non-linear dependencies in the labels. In this paper, we propose a new algorithm, called Maniac (Multi-lAbel classificatioN usIng AutoenCoders), which extra…

Multi-label classification; Computer science; business.industry; Binary number; Pattern recognition; Context (language use); 02 engineering and technology; Autoencoder; Data set; ComputingMethodologies_PATTERNRECOGNITION; Transformation (function); Cardinality; Ranking; 020204 information systems; 0202 electrical engineering electronic engineering information engineering; 020201 artificial intelligence & image processing; Artificial intelligence; business
researchProduct

Long-term biodistribution study of HPMA-ran-LMA copolymers in vivo by means of 131I-labeling

2018

Background: For the evaluation of macromolecular drug delivery systems, suitable pre-clinical monitoring of potential nanocarrier systems is needed. In this regard, both short-term and long-term in vivo tracking are crucial to understand structure-property relationships of polymer carrier systems and their resulting pharmacokinetic profile. Based on former studies revealing favorable in vivo characteristics for 18F-labeled random (ran) copolymers consisting of N-(2-hydroxypropyl)methacrylamide (HPMA) and lauryl methacrylate (LMA) (including prolonged plasma half-life as well as enhanced tumor accumulation), the presented work focuses on their long-term investigation in the li…

chemistry.chemical_classification; Cancer Research; Biodistribution; 02 engineering and technology; Polymer; 010402 general chemistry; 021001 nanoscience & nanotechnology; 01 natural sciences; 0104 chemical sciences; chemistry.chemical_compound; chemistry; In vivo; Critical micelle concentration; Biophysics; Molecular Medicine; Distribution (pharmacology); Methacrylamide; Radiology Nuclear Medicine and imaging; Nanocarriers; 0210 nano-technology; Ex vivo
Nuclear Medicine and Biology
researchProduct

CheS-Mapper - Chemical Space Mapping and Visualization in 3D

2012

Analyzing chemical datasets is a challenging task for scientific researchers in the field of chemoinformatics. It is important, yet difficult, to understand the relationship between the structure of chemical compounds, their physico-chemical properties, and biological or toxic effects. In this respect, visualization tools can help to better comprehend the underlying correlations. Our recently developed 3D molecular viewer CheS-Mapper (Chemical Space Mapper) divides large datasets into clusters of similar compounds and consequently arranges them in 3D space, such that their spatial proximity reflects their similarity. The user can indirectly determine similarity, by selecting which f…

Process (engineering); Computer science; media_common.quotation_subject; Library and Information Sciences; computer.software_genre; 01 natural sciences; lcsh:Chemistry; 03 medical and health sciences; Similarity (psychology); Physical and Theoretical Chemistry; Function (engineering); 030304 developmental biology; media_common; Structure (mathematical logic); 0303 health sciences; lcsh:T58.5-58.64; lcsh:Information technology; 004 Computer science; Computer Graphics and Computer-Aided Design; Chemical space; Field (geography); 0104 chemical sciences; Visualization; Computer Science Applications; 010404 medicinal & biomolecular chemistry; lcsh:QD1-999; Cheminformatics; Data mining; computer; 004 Data processing; Software
Journal of Cheminformatics
researchProduct

Similarity boosted quantitative structure-activity relationship--a systematic study of enhancing structural descriptors by molecular similarity.

2013

The concept of molecular similarity is one of the most central in the fields of predictive toxicology and quantitative structure-activity relationship (QSAR) research. Many toxicological responses result from a multimechanistic process and, consequently, structural diversity among the active compounds is likely. Combining this knowledge, we introduce similarity boosted QSAR modeling, where we calculate molecular descriptors using similarities with respect to representative reference compounds to aid a statistical learning algorithm in distinguishing between different structural classes. We present three approaches for the selection of reference compounds, one by literature search and two by…

Quantitative structure–activity relationship; Informatics; business.industry; Statistical learning; General Chemical Engineering; Structural diversity; Quantitative Structure-Activity Relationship; Pattern recognition; General Chemistry; Predictive toxicology; Library and Information Sciences; computer.software_genre; Toxicology; Computer Science Applications; Similarity (network science); Molecular descriptor; Artificial intelligence; Data mining; business; Cluster analysis; computer; Mathematics
Journal of chemical information and modeling
researchProduct

Exploring Multiobjective Optimization for Multiview Clustering

2018

We present a new multiview clustering approach based on multiobjective optimization. In contrast to existing clustering algorithms based on multiobjective optimization, it is generally applicable to data represented by two or more views and does not require specifying the number of clusters a priori. The approach builds upon the search capability of a multiobjective simulated-annealing-based technique, AMOSA, as the underlying optimization technique. In the first version of the proposed approach, an internal cluster validity index is used to assess the quality of different partitionings obtained using different views. A new way of checking the compatibility of these different partitioning…

General Computer Science; Computer science; 02 engineering and technology; computer.software_genre; Multi-objective optimization; Cluster validity index; 020204 information systems; Simulated annealing; New mutation; 0202 electrical engineering electronic engineering information engineering; A priori and a posteriori; 020201 artificial intelligence & image processing; Data mining; Cluster analysis; Multiple view; computer
ACM Transactions on Knowledge Discovery from Data
researchProduct

A Survey of Multi-Label Topic Models

2019

Every day, an enormous amount of text data is produced. Sources of text data include news, social media, emails, text messages, medical reports, scientific publications and fiction. To keep track of this data, there are categories, key words, tags or labels that are assigned to each text. Automatically predicting such labels is the task of multi-label text classification. Often, however, we are interested in more than just the pure classification: rather, we would like to understand which parts of a text belong to the label, which words are important for the label or which labels occur together. Because of this, topic models may be used for multi-label classification as an interpretable mode…

Topic model; Information retrieval; Computer science; Geography Planning and Development; Flexibility (personality); 02 engineering and technology; Task (project management); ComputingMethodologies_PATTERNRECOGNITION; 020204 information systems; 0202 electrical engineering electronic engineering information engineering; Key (cryptography); General Earth and Planetary Sciences; 020201 artificial intelligence & image processing; Social media; Water Science and Technology
ACM SIGKDD Explorations Newsletter
researchProduct

Multi-label classification using boolean matrix decomposition

2012

This paper introduces a new multi-label classifier based on Boolean matrix decomposition. Boolean matrix decomposition is used to extract, from the full label matrix, latent labels representing useful Boolean combinations of the original labels. Base level models predict latent labels, which are subsequently transformed into the actual labels by Boolean matrix multiplication with the second matrix from the decomposition. The new method is tested on six publicly available datasets with varying numbers of labels. The experimental evaluation shows that the new method works particularly well on datasets with a large number of labels and strong dependencies among them.
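
The decoding step described above, turning predicted latent labels back into original labels by Boolean matrix multiplication with the second factor, can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function name `decode_labels` and the example basis matrix are hypothetical, and the base-level models that produce the latent predictions are left out.

```python
def decode_labels(latent, basis):
    """Boolean product of one latent-label vector with the basis matrix.

    latent: 0/1 predictions for the k latent labels of a single instance.
    basis:  k x L 0/1 matrix (the second factor of the decomposition).
    Original label j is switched on if at least one active latent label
    has a 1 in column j.
    """
    n_labels = len(basis[0])
    return [
        int(any(z and row[j] for z, row in zip(latent, basis)))
        for j in range(n_labels)
    ]


# Hypothetical basis: 3 latent labels describing 4 original labels.
basis = [
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 1, 1, 0],
]
predicted = decode_labels([1, 0, 1], basis)  # latent labels 0 and 2 are active
```

Because the product uses OR rather than addition, overlapping latent labels never produce values other than 0 or 1, so the decoded vector is directly a label set.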

Multi-label classification; Matrix (mathematics); ComputingMethodologies_PATTERNRECOGNITION; Computer science; business.industry; Boolean matrix multiplication; Logical matrix; Pattern recognition; Artificial intelligence; business; Classifier (UML); Sparse matrix
Proceedings of the 27th Annual ACM Symposium on Applied Computing
researchProduct

A Brief History of Learning Symbolic Higher-Level Representations from Data (And a Curious Look Forward)

2020

Learning higher-level representations from data has been on the agenda of AI research for several decades. In the paper, I will give a survey of various approaches to learning symbolic higher-level representations: feature construction and constructive induction, predicate invention, propositionalization, pattern mining, and mining time series patterns. Finally, I will give an outlook on how approaches to learning higher-level representations, symbolic and neural, can benefit from each other to solve current issues in machine learning.

Computer science
Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence
researchProduct

Online Induction of Probabilistic Real Time Automata

2012

Probabilistic real time automata (PRTAs) are a representation of dynamic processes arising in the sciences and industry. Currently, the induction of automata is divided into two steps: the creation of the prefix tree acceptor (PTA) and the merge procedure based on clustering of the states. These two steps can be very time intensive when a PRTA is to be induced for massive or even unbounded data sets. The latter one can be efficiently processed, as there exist scalable online clustering algorithms. However, the creation of the PTA still can be very time consuming. To overcome this problem, we propose a genuine online PRTA induction approach that incorporates new instances by first collapsing…

Theoretical computer science; business.industry; Computer science; Probabilistic logic; computer.software_genre; Automaton; Data set; Trie; Automata theory; The Internet; Data mining; business; Cluster analysis; computer
2012 IEEE 12th International Conference on Data Mining
researchProduct

Trading off accuracy for efficiency by randomized greedy warping

2016

Dynamic Time Warping (DTW) is a widely used distance measure for time series data mining. Its quadratic complexity requires the application of various techniques (e.g. warping constraints, lower-bounds) for deployment in real-time scenarios. In this paper we propose a randomized greedy warping algorithm for finding similarity between time series instances. We show that the proposed algorithm outperforms the simple greedy approach and also provides very good time series similarity approximation consistently, as compared to DTW. We show that the Randomized Time Warping (RTW) can be used in place of DTW as a fast similarity approximation technique by trading some classification accuracy for ve…
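The quadratic complexity mentioned above comes from filling a full cost table over all pairs of positions in the two series. The following is a minimal sketch of the classic, unconstrained DTW recurrence only; it is not the randomized greedy algorithm proposed in the paper, and the function name `dtw_distance` and the absolute-difference local cost are assumptions made for illustration.

```python
import math

def dtw_distance(x, y):
    """Classic DTW distance between two numeric sequences (O(len(x) * len(y)))."""
    n, m = len(x), len(y)
    # cost[i][j] = minimal accumulated cost of aligning x[:i] with y[:j].
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(x[i - 1] - y[j - 1])  # assumed local cost: absolute difference
            cost[i][j] = local + min(
                cost[i - 1][j],      # x advances, y[j-1] is reused
                cost[i][j - 1],      # y advances, x[i-1] is reused
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]
```

Every cell of the table is visited once, which is what warping constraints, lower bounds and approximate schemes aim to avoid.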

Dynamic time warping; Series (mathematics); Computer science; business.industry; Pattern recognition; Data_CODINGANDINFORMATIONTHEORY; 02 engineering and technology; Measure (mathematics); TheoryofComputation_MATHEMATICALLOGICANDFORMALLANGUAGES; ComputingMethodologies_PATTERNRECOGNITION; Similarity (network science); Computer Science::Sound; 020204 information systems; ComputingMethodologies_SYMBOLICANDALGEBRAICMANIPULATION; 0202 electrical engineering electronic engineering information engineering; 020201 artificial intelligence & image processing; Artificial intelligence; Image warping; business; GeneralLiterature_REFERENCE(e.g.dictionariesencyclopediasglossaries); Computer Science::Databases
Proceedings of the 31st Annual ACM Symposium on Applied Computing
researchProduct

Model selection based product kernel learning for regression on graphs

2013

The choice of a suitable graph kernel is intrinsically hard and often cannot be made in an informed manner for a given dataset. Methods for multiple kernel learning offer a possible remedy, as they combine and weight kernels on the basis of a labeled training set of molecules to define a new kernel. Whereas most methods for multiple kernel learning focus on learning convex linear combinations of kernels, we propose to combine kernels in products, which theoretically enables higher expressiveness. In experiments on ten publicly available chemical QSAR datasets we show that product kernel learning is on no dataset significantly worse than any of the competing kernel methods and on average the…

Graph kernel; Training set; Multiple kernel learning; Computer science; business.industry; Pattern recognition; Semi-supervised learning; Machine learning; computer.software_genre; Kernel (linear algebra); Kernel method; Kernel embedding of distributions; Polynomial kernel; Kernel (statistics); Radial basis function kernel; Artificial intelligence; Tree kernel; business; computer
Proceedings of the 28th Annual ACM Symposium on Applied Computing
researchProduct

Deep neural networks to recover unknown physical parameters from oscillating time series.

2022

PLOS ONE 17(5), e0268439 (2022). doi:10.1371/journal.pone.0268439

FOS: Computer and information sciences; Computer Science - Machine Learning; Multidisciplinary; Time Factors; Physics; 610; FOS: Physical sciences; Signal Processing Computer-Assisted; Numerical Analysis (math.NA); Machine Learning (cs.LG); Knowledge; Physics - Data Analysis Statistics and Probability; FOS: Mathematics; Humans; Mathematics - Numerical Analysis; ddc:610; Neural Networks Computer; Data Analysis Statistics and Probability (physics.data-an)
PloS one
researchProduct

CheS-Mapper 2.0 for visual validation of (Q)SAR models

2014

Background: Sound statistical validation is important to evaluate and compare the overall performance of (Q)SAR models. However, classical validation does not support the user in better understanding the properties of the model or the underlying data. Even though a number of visualization tools for analyzing (Q)SAR information in small molecule datasets exist, integrated visualization methods that allow the investigation of model validation results are still lacking. Results: We propose visual validation as an approach for the graphical inspection of (Q)SAR model validation results. The approach applies the 3D viewer CheS-Mapper, an open-source application for the exploration of sm…

Visualization methods; Computer science; Feature vector; Library and Information Sciences; computer.software_genre; 01 natural sciences; (Q)SAR; Model validation; 03 medical and health sciences; Software; Validation; Overall performance; Physical and Theoretical Chemistry; Visualization; 030304 developmental biology; 0303 health sciences; business.industry; Statistical validation; Computer Graphics and Computer-Aided Design; 0104 chemical sciences; Computer Science Applications; 010404 medicinal & biomolecular chemistry; 3D space; Data mining; business; computer
Journal of Cheminformatics
researchProduct

Towards Bankruptcy Prediction: Deep Sentiment Mining to Detect Financial Distress from Business Management Reports

2018

Due to their disclosure required by law, business management reports have become publicly available for a large number of companies, and these reports offer the opportunity to assess the financial health or distress of a company, both quantitatively from the balance sheets and qualitatively from the text. In this paper, we analyze the potential of deep sentiment mining from the textual parts of business management reports and aim to detect signals for financial distress. We (1) created the largest corpus of business reports analyzed qualitatively to date, (2) defined a non-trivial target variable based on the so-called Altman Z-score, (3) developed a filtering of sentences based on class-co…

050208 finance; Computer science; 05 social sciences; Sentiment analysis; 050201 accounting; Data science; Task (project management); Visualization; Distress; Bankruptcy; 0502 economics and business; Task analysis; Bankruptcy prediction; Balance sheet
2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA)
researchProduct

An In-Depth Experimental Comparison of RNTNs and CNNs for Sentence Modeling

2017

The goal of modeling sentences is to accurately represent their meaning for different tasks. A variety of deep learning architectures have been proposed to model sentences; however, little is known about their comparative performance on a common ground, across a variety of datasets, and on the same level of optimization. In this paper, we provide such a novel comparison for two popular architectures, Recursive Neural Tensor Networks (RNTNs) and Convolutional Neural Networks (CNNs). Although RNTNs have been shown to work well in many cases, they require intensive manual labeling due to the vanishing gradient problem. To enable an extensive comparison of the two architectures, this paper empl…

Structure (mathematical logic); Vanishing gradient problem; Phrase; business.industry; Computer science; Deep learning; 05 social sciences; Pattern recognition; 010501 environmental sciences; 01 natural sciences; Convolutional neural network; Set (abstract data type); 0502 economics and business; Benchmark (computing); Artificial intelligence; 050207 economics; business; Sentence; 0105 earth and related environmental sciences
researchProduct

Cinema audiences reproducibly vary the chemical composition of air during films, by broadcasting scene specific emissions on breath

2016

Human beings continuously emit chemicals into the air by breath and through the skin. In order to determine whether these emissions vary predictably in response to audiovisual stimuli, we have continuously monitored carbon dioxide and over one hundred volatile organic compounds in a cinema. It was found that many airborne chemicals in cinema air varied distinctively and reproducibly with time for a particular film, even in different screenings to different audiences. Application of scene labels and advanced data mining methods revealed that specific film events, namely “suspense” or “comedy”, caused audiences to change their emission of specific chemicals. These event-type synchronou…

Human Chemosignals; Continuous measurement; Time Factors; 010504 meteorology & atmospheric sciences; Motion Pictures; 010501 environmental sciences; Broadcasting; 01 natural sciences; Article; Acetone; Movie theater; Hemiterpenes; Pentanes; Butadienes; Humans; Human group; Simulation; 0105 earth and related environmental sciences; Air Pollutants; Volatile Organic Compounds; Multidisciplinary; Film making; business.industry; Respiration; Advertising; Carbon Dioxide; Comedy; Air Pollution Indoor; business; Environmental Monitoring
Scientific Reports
researchProduct

Fair Pairwise Learning to Rank

2020

Ranking algorithms based on Neural Networks have been a topic of recent research. Ranking is employed in everyday applications like product recommendations, search results, or even in finding good candidates for hiring. However, Neural Networks are mostly opaque tools, and it is hard to evaluate why a specific candidate, for instance, was not considered. Therefore, for neural-based ranking methods to be trustworthy, it is crucial to guarantee that the outcome is fair and that the decisions are not discriminating against people according to sensitive attributes such as gender, sexual orientation, or ethnicity. In this work we present a family of fair pairwise learning to rank approaches based on Neur…

Fairness; Artificial neural network; Neural Networks; business.industry; Computer science; 05 social sciences; Rank (computer programming); 02 engineering and technology; Machine learning; computer.software_genre; Fairness Neural Networks Ranking; Outcome (game theory); Ranking (information retrieval); Correlation; 020204 information systems; 0202 electrical engineering electronic engineering information engineering; Relevance (information retrieval); Learning to rank; Product (category theory); Artificial intelligence; Ranking; 0509 other social sciences; 050904 information & library sciences; business; computer
researchProduct

Graph Clustering with Local Density-Cut

2018

In this paper, we introduce a new graph clustering algorithm, called Dcut. The basic idea is to envision the graph clustering as a local density-cut problem. To identify meaningful communities in a graph, a density-connected tree is first constructed in a local fashion. Building upon the local intuitive density-connected tree, Dcut allows partitioning a graph into multiple densely tight-knit clusters effectively and efficiently. We have demonstrated that our method has several attractive benefits: (a) Dcut provides an intuitive criterion to evaluate the goodness of a graph clustering in a more precise way; (b) Building upon the density-connected tree, Dcut allows identifying high-quality cl…

The intuitive criterion; Theoretical computer science; Computer science; 020204 information systems; 0202 electrical engineering electronic engineering information engineering; Graph (abstract data type); 020201 artificial intelligence & image processing; 02 engineering and technology; Cluster analysis; Clustering coefficient
researchProduct

Maximum Common Subgraph based locally weighted regression

2012

This paper investigates a simple, yet effective method for regression on graphs, in particular for applications in chem-informatics and for quantitative structure-activity relationships (QSARs). The method combines Locally Weighted Learning (LWL) with Maximum Common Subgraph (MCS) based graph distances. More specifically, we investigate a variant of locally weighted regression on graphs (structures) that uses the maximum common subgraph for determining and weighting the neighborhood of a graph and feature vectors for the actual regression model. We show that this combination, LWL-MCS, outperforms other methods that use the local neighborhood of graphs for regression. The performance of this…

Computer science; business.industry; Feature vector; Local regression; Pattern recognition; Regression analysis; Graph; Weighting; Combinatorics; Lazy learning; Simple (abstract algebra); Artificial intelligence; Cluster analysis; business; MathematicsofComputing_DISCRETEMATHEMATICS
Proceedings of the 27th Annual ACM Symposium on Applied Computing
researchProduct

Secure Sum Outperforms Homomorphic Encryption in (Current) Collaborative Deep Learning

2020

Deep learning (DL) approaches are achieving extraordinary results in a wide range of domains, but often require a massive collection of private data. Hence, methods for training neural networks on the joint data of different data owners, that keep each party's input confidential, are called for. We address a specific setting in federated learning, namely that of deep learning from horizontally distributed data with a limited number of parties, where their vulnerable intermediate results have to be processed in a privacy-preserving manner. This setting can be found in medical and healthcare as well as industrial applications. The predominant scheme for this is based on homomorphic encryption…

FOS: Computer and information sciences; Computer Science - Machine Learning; Computer Science - Cryptography and Security; Statistics - Machine Learning; Machine Learning (stat.ML); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
researchProduct

Focusing Knowledge-based Graph Argument Mining via Topic Modeling

2021

Decision-making usually takes five steps: identifying the problem, collecting data, extracting evidence, identifying pro and con arguments, and making decisions. Focusing on extracting evidence, this paper presents a hybrid model that combines latent Dirichlet allocation and word embeddings to obtain external knowledge from structured and unstructured data. We study the task of sentence-level argument mining, as arguments mostly require some degree of world knowledge to be identified and understood. Given a topic and a sentence, the goal is to classify whether a sentence represents an argument in regard to the topic. We use a topic model to extract topic- and sentence-specific evidence from…

FOS: Computer and information sciences; Computer Science - Machine Learning; Artificial Intelligence (cs.AI); Computer Science - Artificial Intelligence; Information Retrieval (cs.IR); Computer Science - Information Retrieval; Machine Learning (cs.LG)
researchProduct

Ensembles of Randomized Time Series Shapelets Provide Improved Accuracy while Reducing Computational Costs

2017

Shapelets are discriminative time series subsequences that allow generation of interpretable classification models, which provide faster and generally better classification than the nearest neighbor approach. However, the shapelet discovery process requires the evaluation of all possible subsequences of all time series in the training set, making it extremely computation intensive. Consequently, shapelet discovery for large time series datasets quickly becomes intractable. A number of improvements have been proposed to reduce the training time. These techniques use approximation or discretization and often lead to reduced classification accuracy compared to the exact method. We are proposin…

FOS: Computer and information sciences; Computer Science - Learning; ComputingMethodologies_PATTERNRECOGNITION; Machine Learning (cs.LG)
researchProduct

Gaussian Mixture Models and Model Selection for [18F] Fluorodeoxyglucose Positron Emission Tomography Classification in Alzheimer’s Disease

2015

We present a method to discover discriminative brain metabolism patterns in [18F] fluorodeoxyglucose positron emission tomography (PET) scans, facilitating the clinical diagnosis of Alzheimer's disease. In the work, the term "pattern" stands for a certain brain region that characterizes a target group of patients and can be used for classification as well as interpretation purposes. Thus, it can be understood as a so-called "region of interest (ROI)". In the literature, an ROI is often found by a given brain atlas that defines a number of brain regions, which corresponds to an anatomical approach. The present work introduces a semi-data-driven approach that is based on learning the charac…

Aged 80 and over; Male; MILD COGNITIVE IMPAIRMENT; Science & Technology; PREDICTION; General Science & Technology; Normal Distribution; Brain; Models Theoretical; DIAGNOSIS; Sensitivity and Specificity; Multidisciplinary Sciences; PET; Alzheimer Disease; Fluorodeoxyglucose F18; Positron-Emission Tomography; MD Multidisciplinary; Humans; Science & Technology - Other Topics; Female; Radiopharmaceuticals; Research Article; Aged
researchProduct

enviPath - The environmental contaminant biotransformation pathway resource

2016

The University of Minnesota Biocatalysis/Biodegradation Database and Pathway Prediction System (UM-BBD/PPS) has been a unique resource covering microbial biotransformation pathways of primarily xenobiotic chemicals for over 15 years. This paper introduces the successor system, enviPath (The Environmental Contaminant Biotransformation Pathway Resource), which is a complete redesign and reimplementation of UM-BBD/PPS. enviPath uses the database from the UM-BBD/PPS as a basis, extends the use of this database, and allows users to include their own data to support multiple use cases. Relative reasoning is supported for the refinement of predictions and to allow its extensions in terms of previo…

User-Computer Interface; Biocatalysis; Database Issue; Environmental Pollutants; Biotransformation; Databases Chemical; Xenobiotics
researchProduct

Improving structural similarity based virtual screening using background knowledge

2013

Background: Virtual screening in the form of similarity rankings is often applied in the early drug discovery process to rank and prioritize compounds from a database. This similarity ranking can be achieved with structural similarity measures. However, their general nature can lead to insufficient performance in some application cases. In this paper, we provide a link between ranking-based virtual screening and fragment-based data mining methods. The inclusion of binding-relevant background knowledge into a structural similarity measure improves the quality of the similarity rankings. This background knowledge in the form of binding relevant substructures can either be derived by hand selec…

Virtual screening; Enrichment; Physical and Theoretical Chemistry; Library and Information Sciences; Structural similarity; 004 Informatik; Computer Graphics and Computer-Aided Design; Data mining; Background knowledge; 004 Data processing; Computer Science Applications; Research Article
researchProduct

Rule Extraction From Binary Neural Networks With Convolutional Rules for Model Validation.

2020

Classification approaches that allow the extraction of logical rules, such as decision trees, are often considered to be more interpretable than neural networks. Also, logical rules are comparatively easy to verify with any possible input. This is an important part in systems that aim to ensure correct operation of a given model. However, for high-dimensional input data such as images, the individual symbols, i.e. pixels, are not easily interpretable. Therefore, rule-based approaches are not typically used for this kind of high-dimensional data. We introduce the concept of first-order convolutional rules, which are logical rules that can be extracted using a convolutional neural network (CNN), and w…

FOS: Computer and information sciences; Computer Science - Machine Learning; stochastic local search; rule extraction; Computer Science - Artificial Intelligence; logical rules; QA75.5-76.95; 004 Informatik; Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Artificial Intelligence; Electronic computers. Computer science; convolutional neural networks; k-term DNF; interpretability; 004 Data processing; Original Research
Frontiers in artificial intelligence
researchProduct

Filtered circular fingerprints improve either prediction or runtime performance while retaining interpretability

2016

Physical and Theoretical Chemistry; Library and Information Sciences; 004 Informatik; Computer Graphics and Computer-Aided Design; 004 Data processing; Computer Science Applications
Journal of Cheminformatics
researchProduct