Structural clustering of millions of molecular graphs

6533b825fe1ef96bd1281d31

RESEARCH PRODUCT

Structural clustering of millions of molecular graphs

Madeleine Seeland Andreas Karwath Johannes Stefan Kramer

subject

Clustering high-dimensional data Fuzzy clustering Theoretical computer science k-medoids Computer science Single-linkage clustering Correlation clustering Constrained clustering computer.software_genre Complete-linkage clustering Graph Hierarchical clustering ComputingMethodologies_PATTERNRECOGNITION Data stream clustering CURE data clustering algorithm Canopy clustering algorithm FLAME clustering Affinity propagation Data mining Cluster analysis computer k-medians clustering Clustering coefficient

description

We propose an algorithm for clustering very large molecular graph databases according to scaffolds (i.e., large structural overlaps) that are common between cluster members. Our approach first partitions the original dataset into several smaller datasets using a greedy clustering approach named APreClus based on dynamic seed clustering. APreClus is an online and instance incremental clustering algorithm delaying the final cluster assignment of an instance until one of the so-called pending clusters the instance belongs to has reached significant size and is converted to a fixed cluster. Once a cluster is fixed, APreClus recalculates the cluster centers, which are used as representatives for further cluster assignments. The pre-clustering methodology employs a cluster membership test based on a set-abstraction of graphs. The resulting clusters are further partitioned into a finer level of granularity using a highly parallelized scaffold-based clustering approach that produces overlapping (non-disjoint) and non-exhaustive clusters. In our experiments on very large datasets of chemical structures, we show that our algorithm is able to cluster at least 3,000,000 molecular graph structures within reasonable time. Compared to a recently proposed scaffold-based graph clustering algorithm, our approach is able to cluster an order of magnitude more graph structures in a fraction of the time. This demonstrates that our algorithm is applicable in the context of virtual screening of realistically sized compound libraries and a first step towards structuring chemical space.

year	journal	country	edition	language
2014-03-24	Proceedings of the 29th Annual ACM Symposium on Applied Computing

https://doi.org/10.1145/2554850.2555063