AUTHOR

Zejun Zheng

showing 4 related works from this author

Efficient and Accurate OTU Clustering with GPU-Based Sequence Alignment and Dynamic Dendrogram Cutting.

2015

De novo clustering is a popular technique to perform taxonomic profiling of a microbial community by grouping 16S rRNA amplicon reads into operational taxonomic units (OTUs). In this work, we introduce a new dendrogram-based OTU clustering pipeline called CRiSPy. The key idea used in CRiSPy to improve clustering accuracy is the application of an anomaly detection technique to obtain a dynamic distance cutoff instead of using the de facto value of 97 percent sequence similarity as in most existing OTU clustering pipelines. This technique works by detecting an abrupt change in the merging heights of a dendrogram. To produce the output dendrograms, CRiSPy employs the OTU hierarchical clusterin…

Computer science; Correlation clustering; Single-linkage clustering; Molecular Sequence Data; Machine learning; Pattern Recognition Automated; CURE data clustering algorithm; RNA Ribosomal 16S; Genetics; Computer Graphics; Cluster analysis; Base Sequence; Applied Mathematics; Dendrogram; High-Throughput Nucleotide Sequencing; Pattern recognition; Signal Processing Computer-Assisted; Equipment Design; Hierarchical clustering; Equipment Failure Analysis; RNA Bacterial; Canopy clustering algorithm; Artificial intelligence; Hierarchical clustering of networks; Sequence Alignment; Algorithms; Biotechnology
IEEE/ACM Transactions on Computational Biology and Bioinformatics
researchProduct
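
The dynamic cutoff idea in the abstract above (detecting an abrupt change in the merging heights of a dendrogram) can be illustrated with a minimal sketch. The abstract does not say which anomaly detection technique CRiSPy uses, so the code below substitutes a simple largest-gap heuristic; the function names `dynamic_cutoff` and `count_clusters` are hypothetical and are not taken from the CRiSPy software.

```python
def dynamic_cutoff(merge_heights):
    """Place a cut height in the middle of the largest jump between
    consecutive merge heights (a stand-in for the anomaly detection
    step; the real CRiSPy method is not described in the abstract)."""
    heights = sorted(merge_heights)
    if len(heights) < 2:
        raise ValueError("need at least two merge heights")
    # Index of the first largest gap between neighbouring heights.
    i = max(range(len(heights) - 1),
            key=lambda j: heights[j + 1] - heights[j])
    return (heights[i] + heights[i + 1]) / 2.0


def count_clusters(merge_heights, cutoff):
    """A dendrogram over n leaves has n - 1 merges; every merge at or
    below the cutoff reduces the number of clusters by one."""
    n_leaves = len(merge_heights) + 1
    merged = sum(1 for h in merge_heights if h <= cutoff)
    return n_leaves - merged


if __name__ == "__main__":
    # Toy merge heights (distances): four tight merges, then a jump.
    heights = [0.01, 0.02, 0.03, 0.04, 0.20, 0.25]
    cut = dynamic_cutoff(heights)
    print(cut, count_clusters(heights, cut))
```

With the toy heights, the cut lands in the gap between 0.04 and 0.20 rather than at a fixed distance of 0.03 (97 percent similarity), which is the contrast the abstract draws.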

DySC: software for greedy clustering of 16S rRNA reads.

2012

Summary: Pyrosequencing technologies are frequently used for sequencing the 16S ribosomal RNA marker gene for profiling microbial communities. Clustering of the produced reads is an important but time-consuming task. We present Dynamic Seed-based Clustering (DySC), a new tool based on the greedy clustering approach that uses a dynamic seeding strategy. Evaluations based on the normalized mutual information (NMI) criterion show that DySC produces higher quality clusters than UCLUST and CD-HIT at a comparable runtime. Availability and implementation: DySC, implemented in C, is available at http://code.google.com/p/dysc/ under the GNU GPL license. Contact: bertil.schmidt@uni-mainz.de Sup…

Statistics and Probability; Computer science; Sequence Analysis RNA; 16S ribosomal RNA; Biochemistry; Computer Science Applications; Computational Mathematics; Software; Computational Theory and Mathematics; RNA Ribosomal 16S; Cluster analysis; Metagenome; Data mining; Molecular Biology
Bioinformatics (Oxford, England)
researchProduct
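
For readers unfamiliar with the greedy clustering approach that the DySC abstract builds on, here is a minimal sketch of the basic fixed-seed variant. It is not DySC itself: the abstract does not describe the dynamic seeding strategy, so none is implemented here. The `identity` measure (position-wise matches without alignment), the input-order processing, and the function names are all assumptions chosen to keep the example short.

```python
def identity(a, b):
    """Toy similarity: fraction of matching positions, without any
    alignment. Real tools use alignment-based identity instead."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def greedy_cluster(reads, threshold=0.97):
    """Assign each read to the first seed it matches at or above the
    threshold; a read matching no seed becomes a new seed."""
    seeds = []
    clusters = []
    for read in reads:
        for idx, seed in enumerate(seeds):
            if identity(read, seed) >= threshold:
                clusters[idx].append(read)
                break
        else:
            # No existing seed was close enough: start a new cluster.
            seeds.append(read)
            clusters.append([read])
    return clusters


if __name__ == "__main__":
    reads = ["ACGTACGTAC", "ACGTACGTAA", "TTTTTTTTTT", "ACGTACGTAC"]
    for cluster in greedy_cluster(reads, threshold=0.9):
        print(cluster)
```

Each read is compared only against seeds, not against every other read, which is why greedy methods avoid the cost of a full pairwise distance matrix.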

CRiSPy-CUDA: Computing Species Richness in 16S rRNA Pyrosequencing Datasets with CUDA

2011

Pyrosequencing technologies are frequently used for sequencing the 16S rRNA marker gene for metagenomic studies of microbial communities. Computing a pairwise genetic distance matrix from the produced reads is an important but highly time-consuming task. In this paper, we present a parallelized tool (called CRiSPy) for scalable pairwise genetic distance matrix computation and clustering that is based on the processing pipeline of the popular ESPRIT software package. To achieve high computational efficiency, we have designed massively parallel CUDA algorithms for pairwise k-mer distance and pairwise genetic distance computation. We have also implemented a memory-efficient sparse matrix clust…

CUDA; Distance matrix; Computer science; Metagenomics; Pipeline (computing); Pairwise comparison; Parallel computing; Cluster analysis; Quantitative Biology::Genomics; Massively parallel; Sparse matrix
researchProduct
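
The pairwise k-mer distance mentioned in the CRiSPy-CUDA abstract can be sketched on the CPU as follows. The abstract does not give the formula, so this assumes a commonly used definition: one minus the number of shared k-mers (counted with multiplicity) divided by the number of k-mers in the shorter read. The function names and the default `k=6` are illustrative choices, and the plain double loop stands in for the massively parallel CUDA kernels, which are not reproduced here.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-mer in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_distance(a, b, k=6):
    """Assumed definition: 1 - shared k-mers / k-mers in the shorter read."""
    shortest = min(len(a), len(b))
    if shortest < k:
        raise ValueError("both sequences must be at least k long")
    # Counter intersection keeps the minimum count of each k-mer.
    shared = sum((kmer_counts(a, k) & kmer_counts(b, k)).values())
    return 1.0 - shared / (shortest - k + 1)


def pairwise_kmer_distances(reads, k=6):
    """Fill a symmetric distance matrix; every (i, j) pair is
    independent, which is what makes the task easy to parallelize."""
    n = len(reads)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = kmer_distance(reads[i], reads[j], k)
            matrix[i][j] = matrix[j][i] = d
    return matrix


if __name__ == "__main__":
    reads = ["ACGTACGT", "ACGTACGA", "TTTTTTTT"]
    for row in pairwise_kmer_distances(reads, k=3):
        print(row)
```

The number of pairs grows quadratically with the number of reads, which is consistent with the abstract's description of the distance matrix computation as the time-consuming step.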

Large-Scale Clustering of Short Reads for Metagenomics On GPUs

2013

Scale (ratio); Computer science; Metagenomics; Parallel computing; Cluster analysis; Computational science
researchProduct