
AUTHOR

Christian Hundt

showing 26 related works from this author

CUDA-Accelerated Alignment of Subsequences in Streamed Time Series Data

2014

Euclidean Distance (ED) and Dynamic Time Warping (DTW) are cornerstones in the field of time series data mining. Many high-level algorithms like kNN-classification, clustering or anomaly detection make extensive use of these distance measures as subroutines. Furthermore, the vast growth of recorded data produced by automated monitoring systems or integrated sensors establishes the need for efficient implementations. In this paper, we introduce linear memory parallelization schemes for the alignment of a given query Q in a stream of time series data S for both ED and DTW using CUDA-enabled accelerators. The ED parallelization features a log-linear calculation scheme in contrast to the naive …

Euclidean distance; CUDA; Dynamic time warping; Data stream mining; Computer science; Anomaly detection; Parallel computing; Cluster analysis; Time complexity; Distance measures
2014 43rd International Conference on Parallel Processing
researchProduct

Suffix Array Construction on Multi-GPU Systems

2019

Suffix arrays are prevalent data structures that are fundamental to a wide range of applications including bioinformatics, data compression, and information retrieval. Therefore, various algorithms for (parallel) suffix array construction both on CPUs and GPUs have been proposed over the years. Although providing significant speedup over their CPU-based counterparts, existing GPU implementations share a common disadvantage: input text sizes are limited by the scarce memory of a single GPU. In this paper, we overcome the aforementioned memory limitations by exploiting multi-GPU nodes featuring fast NVLink interconnects. In order to achieve high performance for this communication-intensive task, we …

Multi-core processor; Speedup; Computer science; Suffix array; 0102 computer and information sciences; 02 engineering and technology; Parallel computing; Data structure; 01 natural sciences; CUDA; Shared memory; 010201 computation theory & mathematics; 0202 electrical engineering electronic engineering information engineering; 020201 artificial intelligence & image processing; Suffix; Data compression
Proceedings of the 28th International Symposium on High-Performance Parallel and Distributed Computing
researchProduct

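Automatische Detektion der primär sklerosierenden Cholangitis (PSC) anhand von 3D-MRCP Datensätzen mittels Deep Learning [Automatic detection of primary sclerosing cholangitis (PSC) from 3D MRCP datasets using deep learning]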

2018

RöFo - Fortschritte auf dem Gebiet der Röntgenstrahlen und der bildgebenden Verfahren
researchProduct

parSRA: A framework for the parallel execution of short read aligners on compute clusters

2018

The growth of next-generation sequencing datasets poses a challenge to the alignment of reads to reference genomes in terms of both accuracy and speed. In this work we present parSRA, a parallel framework to accelerate the execution of existing short read aligners on distributed-memory systems. parSRA can be used to parallelize a variety of short read alignment tools installed in the system without any modification to their source code. We show that our framework provides good scalability on a compute cluster for accelerating the popular BWA-MEM and Bowtie2 aligners. On average, it is able to accelerate sequence alignments on 16 64-core nodes (in total, 1024 cores) with a speedup of 10.48 …

0301 basic medicine; Source code; Speedup; General Computer Science; Computer science; Parallel computing; Supercomputer; Theoretical Computer Science; 03 medical and health sciences; 030104 developmental biology; 0302 clinical medicine; 030220 oncology & carcinogenesis; Modeling and Simulation; Computer cluster; Scalability; Fuse (electrical); Node (circuits); Partitioned global address space
Journal of Computational Science
researchProduct

Accelerating metagenomic read classification on CUDA-enabled GPUs.

2016

Metagenomic sequencing studies are becoming increasingly popular with prominent examples including the sequencing of human microbiomes and diverse environments. A fundamental computational problem in this context is read classification, i.e. the assignment of each read to a taxonomic label. Due to the large number of reads produced by modern high-throughput sequencing technologies and the rapidly increasing number of available reference genomes, software tools for fast and accurate metagenomic read classification are urgently needed. We present cuCLARK, a read-level classifier for CUDA-enabled GPUs, based on the fast and accurate classification of metagenomic sequences using reduced k-mers (…

0301 basic medicine; Theoretical computer science; Workstation; GPUs; Computer science; Context (language use); CUDA; Parallel computing; Biochemistry; Genome; 03 medical and health sciences; User-Computer Interface; 0302 clinical medicine; Structural Biology; Taxonomic assignment; Humans; Microbiome; Molecular Biology; Internet; Xeon; Applied Mathematics; High-Throughput Nucleotide Sequencing; Sequence Analysis DNA; Exact k-mer matching; Computer Science Applications; 030104 developmental biology; Titan (supercomputer); Metagenomics; 030220 oncology & carcinogenesis; DNA microarray; Software
BMC Bioinformatics
researchProduct

Gossip

2019

Nowadays, a growing number of servers and workstations feature an increasing number of GPUs. However, slow communication among GPUs can lead to poor application performance. Thus, there is a latent demand for efficient multi-GPU communication primitives on such systems. This paper focuses on the gather, scatter and all-to-all collectives, which are important operations for various algorithms including parallel sorting and distributed hashing. We present two distinct communication strategies (ring-based and flow-oriented) to generate transfer plans for their topology-aware implementation on NVLink-connected multi-GPU systems. We achieve a throughput of up to 526 GB/s for all-to-all and 148 G…

CUDA; Computer science; Gossip; Distributed computing; Transfer (computing); Server; Hash function; Overhead (computing); Throughput (business)
Proceedings of the 48th International Conference on Parallel Processing
researchProduct

GEM

2014

The widespread use of digital sensor systems causes a tremendous demand for high-quality time series analysis tools. In this domain the majority of data mining algorithms rely on established distance measures like Dynamic Time Warping (DTW) or Euclidean distance (ED). However, the notion of similarity induced by ED and DTW may lead to unsatisfactory clusterings. In order to address this shortcoming we introduce the Gliding Elastic Match (GEM) algorithm. It determines an optimal local similarity measure of a query time series Q and a subject time series S. The measure is invariant under both local deformation on the measurement-axis and scaling in the time domain. GEM is compared to ED and…

Euclidean distance; Dynamic time warping; Similarity (network science); Computer science; Data mining; Invariant (mathematics); Similarity measure; Measure (mathematics); Algorithm; Distance measures
Proceedings of the 29th Annual ACM Symposium on Applied Computing
researchProduct

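Deep Learning für die automatische Bestimmung von klinisch relevanten Herzparametern mittels Kardio-MRT [Deep learning for the automatic determination of clinically relevant cardiac parameters using cardiac MRI]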

2018

RöFo - Fortschritte auf dem Gebiet der Röntgenstrahlen und der bildgebenden Verfahren
researchProduct

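Deep Semantic Segmentation von 4D DCE MRT Untersuchungen der Lunge zum Erheben Klinischer Biomarker bei Chronisch Obstruktiver Lungenerkrankung [Deep semantic segmentation of 4D DCE MRI examinations of the lung for obtaining clinical biomarkers in chronic obstructive pulmonary disease]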

2019

Einheit in Vielfalt
researchProduct

MetaCache: context-aware classification of metagenomic reads using minhashing.

2017

Motivation: Metagenomic shotgun sequencing studies are becoming increasingly popular with prominent examples including the sequencing of human microbiomes and diverse environments. A fundamental computational problem in this context is read classification, i.e. the assignment of each read to a taxonomic label. Due to the large number of reads produced by modern high-throughput sequencing technologies and the rapidly increasing number of available reference genomes, corresponding software tools suffer from either long runtimes, large memory requirements or low accuracy. Results: We introduce MetaCache, a novel software tool for read classification using the big data technique minhashing. Our…

0301 basic medicine; Statistics and Probability; Computer science; Sequence analysis; Context (language use); Biochemistry; Genome; 03 medical and health sciences; 0302 clinical medicine; RefSeq; Humans; Molecular Biology; Information retrieval; Shotgun sequencing; High-Throughput Nucleotide Sequencing; Sequence Analysis DNA; Computer Science Applications; Computational Mathematics; 030104 developmental biology; Computational Theory and Mathematics; Metagenomics; 030217 neurology & neurosurgery; DNA; Algorithms; Software; Reference genome
Bioinformatics (Oxford, England)
researchProduct

Massively parallel computation of atmospheric neutrino oscillations on CUDA-enabled accelerators

2019

The computation of neutrino flavor transition amplitudes through inhomogeneous matter is a time-consuming step and thus could benefit from optimization and parallelization. Next to reliable parameter estimation of intrinsic physical quantities such as neutrino masses and mixing angles, these transition amplitudes are important in hypothesis testing of potential extensions of the standard model of elementary particle physics, such as additional neutrino flavors. Hence, fast yet precise implementations are of high importance to research. In the recent past, massively parallel accelerators such as CUDA-enabled GPUs featuring thousands of compute units have been widely adopted due to t…

Computer science; Computation; General Physics and Astronomy; Memory bandwidth; 01 natural sciences; 010305 fluids & plasmas; Standard Model; Computational science; CUDA; Hardware and Architecture; 0103 physical sciences; Neutrino; 010306 general physics; Neutrino oscillation; Massively parallel; Physical quantity
Computer Physics Communications
researchProduct

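Vollautomatische, lappenbasierte Segmentierung von MR-Perfusionsmessungen in COPD Patienten mit Methoden des maschinellen Lernens [Fully automatic, lobe-based segmentation of MR perfusion measurements in COPD patients using machine learning methods]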

2017

Radiology Nuclear Medicine and imaging
RöFo - Fortschritte auf dem Gebiet der Röntgenstrahlen und der bildgebenden Verfahren
researchProduct

WarpCore: A Library for fast Hash Tables on GPUs

2020

Hash tables are ubiquitous. Properties such as an amortized constant time complexity for insertion and querying as well as a compact memory layout make them versatile associative data structures with manifold applications. The rapidly growing amount of data emerging in many fields motivated the need for accelerated hash tables designed for modern parallel architectures. In this work, we exploit the fast memory interface of modern GPUs together with a parallel hashing scheme tailored to improve global memory access patterns, to design WarpCore -- a versatile library of hash table data structures. Unique device-sided operations allow for building high performance data processing pipelines ent…

FOS: Computer and information sciences; Scheme (programming language); Amortized analysis; Computer science; Hash function; Parallel computing; Data structure; Hash table; CUDA; Computer Science - Distributed Parallel and Cluster Computing; Server; Distributed Parallel and Cluster Computing (cs.DC); Throughput (business)
2020 IEEE 27th International Conference on High Performance Computing, Data, and Analytics (HiPC)
researchProduct

Unified Parallel C++

2018

Although MPI is commonly used for parallel programming on distributed-memory systems, Partitioned Global Address Space (PGAS) approaches are gaining attention for programming modern multi-core CPU clusters. They feature a hybrid memory abstraction: distributed memory is viewed as a shared memory that is partitioned among nodes in order to simplify programming. In this chapter you will learn about Unified Parallel C++ (UPC++), a library-based extension of C++ that gathers the advantages of both PGAS and object-oriented paradigms. The examples included in this chapter will help you to understand the main features of PGAS languages and how they can simplify the task of programming par…

Object-oriented programming; Source code; Computer science; Parallel computing; Software_PROGRAMMINGTECHNIQUES; Shared memory; Asynchronous communication; Unified Parallel C; Distributed memory; Partitioned global address space; Abstraction (linguistics)
researchProduct

MetaCache-GPU: Ultra-Fast Metagenomic Classification

2021

The cost of DNA sequencing has dropped exponentially over the past decade, making genomic data accessible to a growing number of scientists. In bioinformatics, localization of short DNA sequences (reads) within large genomic sequences is commonly facilitated by constructing index data structures which allow for efficient querying of substrings. Recent metagenomic classification pipelines annotate reads with taxonomic labels by analyzing their k-mer histograms with respect to a reference genome database. CPU-based index construction is often performed in a preprocessing phase due to the relatively high cost of building irregular data structures such as hash maps. However, the rapidly growi…

Genomics (q-bio.GN); FOS: Computer and information sciences; Source code; Computer science; Hash function; Context (language use); MinHash; Data structure; Hash table; Computer Science - Distributed Parallel and Cluster Computing; FOS: Biological sciences; Preprocessor; Quantitative Biology - Genomics; Distributed Parallel and Cluster Computing (cs.DC); Data mining; Reference genome
50th International Conference on Parallel Processing
researchProduct

WarpDrive: Massively Parallel Hashing on Multi-GPU Nodes

2018

Hash maps are among the most versatile data structures in computer science because of their compact data layout and expected constant time complexity for insertion and querying. However, associated memory access patterns during the probing phase are highly irregular, resulting in strongly memory-bound implementations. Massively parallel accelerators such as CUDA-enabled GPUs may overcome this limitation by virtue of their fast video memory featuring almost one TB/s bandwidth in comparison to main memory modules of state-of-the-art CPUs with less than 100 GB/s. Unfortunately, the size of hash maps supported by existing single-GPU hashing implementations is restricted by the limited amount of …

020203 distributed computing; Computer science; Hash function; 0102 computer and information sciences; 02 engineering and technology; Parallel computing; Data structure; 01 natural sciences; Hash table; Electronic mail; Memory management; 010201 computation theory & mathematics; Scalability; 0202 electrical engineering electronic engineering information engineering; Massively parallel; Time complexity
2018 IEEE International Parallel and Distributed Processing Symposium (IPDPS)
researchProduct

Ultra-Fast Detection of Higher-Order Epistatic Interactions on GPUs

2017

Detecting higher-order epistatic interactions in Genome-Wide Association Studies (GWAS) remains a challenging task in the fields of genetic epidemiology and computer science. A number of algorithms have recently been proposed for epistasis discovery. However, they suffer from a high computational cost since statistical measures have to be evaluated for each possible combination of markers. Hence, many algorithms use additional filtering stages discarding potentially non-interacting markers in order to reduce the overall number of combinations to be examined. Among others, Mutual Information Clustering (MIC) is a common pre-processing filter for grouping markers into partitions using K-Means…

0301 basic medicine; Theoretical computer science; Computer science; Contrast (statistics); Genome-wide association study; 02 engineering and technology; Mutual information; Machine learning; Reduction (complexity); 03 medical and health sciences; 030104 developmental biology; Genetic epidemiology; 0202 electrical engineering electronic engineering information engineering; Epistasis; 020201 artificial intelligence & image processing; Artificial intelligence; Cluster analysis; Genetic association
researchProduct

FeatherCNN: Fast Inference Computation with TensorGEMM on ARM Architectures

2020

Deep Learning is ubiquitous in a wide field of applications ranging from research to industry. In comparison to time-consuming iterative training of convolutional neural networks (CNNs), inference is a relatively lightweight operation making it amenable to execution on mobile devices. Nevertheless, lower latency and higher computation efficiency are crucial to allow for complex models and prolonged battery life. Addressing the aforementioned challenges, we propose FeatherCNN – a fast inference library for ARM CPUs – targeting the performance ceiling of mobile devices. FeatherCNN employs three key techniques: 1) A highly efficient TensorGEMM (generalized matrix multiplication) routine is app…

020203 distributed computing; Source code; Iterative method; Computer science; Deep learning; Inference; 02 engineering and technology; Parallel computing; Convolutional neural network; Matrix multiplication; ARM architecture; Computational Theory and Mathematics; Hardware and Architecture; Signal Processing; 0202 electrical engineering electronic engineering information engineering; Artificial intelligence
IEEE Transactions on Parallel and Distributed Systems
researchProduct

SAUCE: A web application for interactive teaching and learning of parallel programming

2017

Prevalent hardware trends towards parallel architectures and algorithms create a growing demand for graduate students familiar with the programming of concurrent software. However, learning parallel programming is challenging due to complex communication and memory access patterns as well as the avoidance of common pitfalls such as deadlocks and race conditions. Hence, the learning process has to be supported by adequate software solutions in order to enable future computer scientists and engineers to write robust and efficient code. This paper discusses a selection of well-known parallel algorithms based on C++11 threads, OpenMP, MPI, and CUDA that can be interactively embedded i…

Computer Networks and Communications; Computer science; Programming language; White-box testing; Parallel algorithm; Process (computing); 020206 networking & telecommunications; 02 engineering and technology; Parallel computing; Thread (computing); Theoretical Computer Science; CUDA; Software; Artificial Intelligence; Hardware and Architecture; 0202 electrical engineering electronic engineering information engineering; Code (cryptography); Web application; 020201 artificial intelligence & image processing
Journal of Parallel and Distributed Computing
researchProduct

cuBool: Bit-Parallel Boolean Matrix Factorization on CUDA-Enabled Accelerators

2018

Boolean Matrix Factorization (BMF) is a commonly used technique in the field of unsupervised data analytics. The goal is to decompose a ground truth matrix C into a product of two matrices A and B being either an exact or approximate rank k factorization of C. Both exact and approximate factorization are time-consuming tasks due to their combinatorial complexity. In this paper, we introduce a massively parallel implementation of BMF - namely cuBool - in order to significantly speed up factorization of huge Boolean matrices. Our approach is based on alternately adjusting rows and columns of A and B using thousands of lightweight CUDA threads. The massively parallel manipulation of entries …

Speedup; Rank (linear algebra); Computer science; 02 engineering and technology; Parallel computing; Matrix decomposition; CUDA; Matrix (mathematics); Factorization; 020204 information systems; Singular value decomposition; 0202 electrical engineering electronic engineering information engineering; 020201 artificial intelligence & image processing; Massively parallel; Integer (computer science)
2018 IEEE 24th International Conference on Parallel and Distributed Systems (ICPADS)
researchProduct

S-Aligner: Ultrascalable Read Mapping on Sunway Taihu Light

2017

The availability and amount of sequenced genomes have been rapidly growing in recent years because of the adoption of next-generation sequencing (NGS) technologies that enable high-throughput short-read generation at highly competitive cost. Since this trend is expected to continue in the foreseeable future, the design and implementation of efficient and scalable NGS bioinformatics algorithms are important to research and industrial applications. In this paper, we introduce S-Aligner, a highly scalable read mapper designed for the Sunway Taihu Light supercomputer and its fourth-generation ShenWei many-core architecture (SW26010). S-Aligner employs a combination of optimization techniques to o…

0301 basic medicine; Instruction set; 03 medical and health sciences; 030104 developmental biology; Xeon; Asynchronous communication; Computer science; Multithreading; Scalability; SIMD; Parallel computing; SW26010; Supercomputer
2017 IEEE International Conference on Cluster Computing (CLUSTER)
researchProduct

Verwendung eines 3D Neuronalen Netzwerkes zur Lebervolumenbestimmung im 3T MRT [Use of a 3D neural network for liver volume determination in 3T MRI]

2019

Einheit in Vielfalt
researchProduct

A 3D Deep Neural Network for Liver Volumetry in 3T Contrast-Enhanced MRI.

2020

 To create a fully automated, reliable, and fast segmentation tool for Gd-EOB-DTPA-enhanced MRI scans using deep learning. Datasets of Gd-EOB-DTPA-enhanced liver MR images of 100 patients were assembled. Ground truth segmentation of the hepatobiliary phase images was performed manually. Automatic image segmentation was achieved with a deep convolutional neural network. Our neural network achieves an intraclass correlation coefficient (ICC) of 0.987, a Sørensen-Dice coefficient of 96.7 ± 1.9 % (mean ± std), an overlap of 92 ± 3.5 %, and a Hausdorff distance of 24.9 ± 14.7 mm compared with two expert readers who corresponded to an ICC of 0.973, a Sørensen-Dice coefficient of 95.2 ± 2.8 %, and…

Ground truth; Artificial neural network; Computer science; Deep learning; Pattern recognition; Image processing; Image segmentation; Convolutional neural network; Magnetic Resonance Imaging; Hausdorff distance; Liver; Image Processing Computer-Assisted; Humans; Radiology Nuclear Medicine and imaging; Segmentation; Artificial intelligence; Neural Networks Computer
RöFo: Fortschritte auf dem Gebiete der Röntgenstrahlen und der Nuklearmedizin
researchProduct

Deep semantic lung segmentation for tracking potential pulmonary perfusion biomarkers in chronic obstructive pulmonary disease (COPD): The multi‐ethn…

2019

Background: Chronic obstructive pulmonary disease (COPD) is associated with high morbidity and mortality. Identification of imaging biomarkers for phenotyping is necessary for future treatment and therapy monitoring. However, translation of visual analytic pipelines into clinics or their use in large-scale studies is significantly slowed by time-consuming postprocessing steps. Purpose: To implement an automated tool chain for regional quantification of pulmonary microvascular blood flow in order to reduce analysis time and user variability. Study type: Prospective. Population: In all, 90 MRI scans of 63 patients, of whom 31 had COPD with a mean Global Initiative for Chronic Obstructive Lung …

Intraclass correlation; Concordance; Population; 030218 nuclear medicine & medical imaging; Pulmonary Disease Chronic Obstructive; 03 medical and health sciences; 0302 clinical medicine; Humans; Radiology Nuclear Medicine and imaging; Prospective Studies; Lung; COPD; Blood flow; Atherosclerosis; Magnetic Resonance Imaging; Obstructive lung disease; Semantics; Perfusion; Angiography; Nuclear medicine; Biomarkers
Journal of Magnetic Resonance Imaging
researchProduct

SAUCE: A Web-Based Automated Assessment Tool for Teaching Parallel Programming

2015

Many curricula for undergraduate studies in computer science provide a lecture on the fundamentals of parallel programming like multi-threaded computation on shared memory architectures using POSIX threads or OpenMP. The complex structure of parallel programs can be challenging, especially for inexperienced students. Thus, there is a latent need for software supporting the learning process. Subsequent lectures may cover more advanced parallelization techniques such as the Message Passing Interface (MPI) and the Compute Unified Device Architecture (CUDA) languages. Unfortunately, the majority of students cannot easily access MPI clusters or modern hardware accelerators in order to effectivel…

Class (computer programming); POSIX Threads; Computer science; Message Passing Interface; Parallel computing; CUDA; Software; Shared memory; Virtual machine; Web application
researchProduct

Advanced C++11 Multithreading

2018

The previous chapter introduced the basic concepts of multithreading using the C++11 threading API starting with basic spawn and join approaches, while finishing with non-trivial synchronization based on mutexes and condition variables. However, the major bottleneck of application performance is usually caused by contention for a shared resource. In the case of mutex-based programming, all participating threads usually try to acquire the same lock in parallel, which effectively serializes the program for lightweight operations such as increment/decrement or updates of a single scalar value. Fortunately, modern CPUs provide dedicated commands that allow for the efficient execution of unin…

Computer science; Monitor; Multithreading; Threading (manufacturing); Operating system; Semaphore; Data type; Bottleneck; Spawn (computing); Shared resource
researchProduct