6533b825fe1ef96bd1281d31

RESEARCH PRODUCT

Structural clustering of millions of molecular graphs

Madeleine SeelandAndreas Karwath JohannesStefan Kramer

subject

Clustering high-dimensional dataFuzzy clusteringTheoretical computer sciencek-medoidsComputer scienceSingle-linkage clusteringCorrelation clusteringConstrained clusteringcomputer.software_genreComplete-linkage clusteringGraphHierarchical clusteringComputingMethodologies_PATTERNRECOGNITIONData stream clusteringCURE data clustering algorithmCanopy clustering algorithmFLAME clusteringAffinity propagationData miningCluster analysiscomputerk-medians clusteringClustering coefficient

description

We propose an algorithm for clustering very large molecular graph databases according to scaffolds (i.e., large structural overlaps) that are common between cluster members. Our approach first partitions the original dataset into several smaller datasets using a greedy clustering approach named APreClus based on dynamic seed clustering. APreClus is an online and instance incremental clustering algorithm delaying the final cluster assignment of an instance until one of the so-called pending clusters the instance belongs to has reached significant size and is converted to a fixed cluster. Once a cluster is fixed, APreClus recalculates the cluster centers, which are used as representatives for further cluster assignments. The pre-clustering methodology employs a cluster membership test based on a set-abstraction of graphs. The resulting clusters are further partitioned into a finer level of granularity using a highly parallelized scaffold-based clustering approach that produces overlapping (non-disjoint) and non-exhaustive clusters. In our experiments on very large datasets of chemical structures, we show that our algorithm is able to cluster at least 3,000,000 molecular graph structures within reasonable time. Compared to a recently proposed scaffold-based graph clustering algorithm, our approach is able to cluster an order of magnitude more graph structures in a fraction of the time. This demonstrates that our algorithm is applicable in the context of virtual screening of realistically sized compound libraries and a first step towards structuring chemical space.

https://doi.org/10.1145/2554850.2555063