0000000000230843

AUTHOR

Anna Katharina Hildebrandt

CUDA-enabled hierarchical ward clustering of protein structures based on the nearest neighbour chain algorithm

Clustering of molecular systems according to their three-dimensional structure is an important step in many bioinformatics workflows. In applications such as docking or structure prediction, many algorithms initially generate large numbers of candidate poses (or decoys), which are then clustered to allow for subsequent computationally expensive evaluations of reasonable representatives. Since the number of such candidates can easily range from thousands to millions, performing the clustering on standard central processing units (CPUs) is highly time consuming. In this paper, we analyse and evaluate different approaches to parallelize the nearest neighbour chain algorithm to perform hierarc…

research product

ballaxy: web services for structural bioinformatics.

Abstract Motivation: Web-based workflow systems have gained considerable momentum in sequence-oriented bioinformatics. In structural bioinformatics, however, such systems are still relatively rare; while commercial stand-alone workflow applications are common in the pharmaceutical industry, academic researchers often still rely on command-line scripting to glue individual tools together. Results: In this work, we address the problem of building a web-based system for workflows in structural bioinformatics. For the underlying molecular modelling engine, we opted for the BALL framework because of its extensive and well-tested functionality in the field of structural bioinformatics. The large …

research product

A Greedy Algorithm for Hierarchical Complete Linkage Clustering

We are interested in the greedy method to compute an hierarchical complete linkage clustering. There are two known methods for this problem, one having a running time of \({\mathcal O}(n^3)\) with a space requirement of \({\mathcal O}(n)\) and one having a running time of \({\mathcal O}(n^2 \log n)\) with a space requirement of Θ(n 2), where n is the number of points to be clustered. Both methods are not capable to handle large point sets. In this paper, we give an algorithm with a space requirement of \({\mathcal O}(n)\) which is able to cluster one million points in a day on current commodity hardware.

research product

Parallelized Clustering of Protein Structures on CUDA-Enabled GPUs

Estimation of the pose in which two given molecules might bind together to form a potential complex is a crucial task in structural biology. To solve this so-called "docking problem", most algorithms initially generate large numbers of candidate poses (or decoys) which are then clustered to allow for subsequent computationally expensive evaluations of reasonable representatives. Since the number of such candidates ranges from thousands to millions, performing the clustering on standard CPUs is highly time consuming. In this paper we analyze and evaluate different approaches to parallelize the nearest neighbor chain algorithm to perform hierarchical Ward clustering of protein structures usin…

research product

Efficient computation of root mean square deviations under rigid transformations

The computation of root mean square deviations (RMSD) is an important step in many bioinformatics applications. If approached naively, each RMSD computation takes time linear in the number of atoms. In addition, a careful implementation is required to achieve numerical stability, which further increases runtimes. In practice, the structural variations under consideration are often induced by rigid transformations of the protein, or are at least dominated by a rigid component. In this work, we show how RMSD values resulting from rigid transformations can be computed in constant time from the protein's covariance matrix, which can be precomputed in linear time. As a typical application scenar…

research product