RESEARCH PRODUCT

SpCLUST: Towards a fast and reliable clustering for potentially divergent biological sequences

Jean-Claude Charr, Stéphane Chrétien, Johny Matar, Christophe Guyeux, Hicham El Khoury

subject

0301 basic medicine; Computer science; [INFO.INFO-SE] Computer Science [cs]/Software Engineering [cs.SE]; Health Informatics; [INFO.INFO-IU] Computer Science [cs]/Ubiquitous Computing; 03 medical and health sciences; [INFO.INFO-CR] Computer Science [cs]/Cryptography and Security [cs.CR]; 0302 clinical medicine; Software; [INFO.INFO-ET] Computer Science [cs]/Emerging Technologies [cs.ET]; [INFO.INFO-DC] Computer Science [cs]/Distributed Parallel and Cluster Computing [cs.DC]; Cluster analysis; Humans; computer.programming_language; business.industry; Similarity matrix; Pattern recognition; DNA; Genomics; Sequence Analysis DNA; Python (programming language); Mixture model; [INFO.INFO-MO] Computer Science [cs]/Modeling and Simulation; Spectral clustering; Computer Science Applications; 030104 developmental biology; ComputingMethodologies_PATTERNRECOGNITION; [INFO.INFO-MA] Computer Science [cs]/Multiagent Systems [cs.MA]; Artificial intelligence; business; computer; Algorithms; 030217 neurology & neurosurgery

description

This paper presents SpCLUST, a new C++ package that takes a list of sequences as input, aligns them with MUSCLE, computes their similarity matrix in parallel, and then performs the clustering. SpCLUST extends previously released software by integrating additional scoring matrices, which extends its coverage to the clustering of amino-acid sequences. The similarity matrix is now computed in parallel following a master/slave distributed architecture, using MPI. A performance analysis, carried out on two real datasets of 100 nucleotide sequences and 1049 amino-acid sequences, shows that the resulting library substantially outperforms the original Python package. The proposed package was also extensively evaluated on simulated and real genomic and protein data sets. The clustering results were compared with those of the best-known traditional tools, such as UCLUST, CD-HIT and DNACLUST. The comparison showed that SpCLUST outperforms the other tools when clustering divergent sequences and that, unlike them, it does not require any user intervention or prior knowledge about the input sequences.
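
To make the parallel similarity-matrix step more concrete, the following is a minimal Python sketch and not SpCLUST's actual code. It assumes the sequences are already aligned (for example by MUSCLE), uses a toy identity score in place of a real scoring matrix, and uses a thread pool as a stand-in for MPI worker ranks. All function names (`identity_score`, `pair_similarity`, `split_rows`, `compute_rows`, `similarity_matrix`) are hypothetical and chosen for illustration only.

```python
from concurrent.futures import ThreadPoolExecutor


def identity_score(a, b):
    """Toy score: 1 for a match, 0 otherwise (stand-in for a real scoring matrix)."""
    return 1.0 if a == b else 0.0


def pair_similarity(s, t, score=identity_score):
    """Average per-column score of two aligned sequences, skipping gap-gap columns."""
    if len(s) != len(t):
        raise ValueError("aligned sequences must have equal length")
    total, cols = 0.0, 0
    for a, b in zip(s, t):
        if a == "-" and b == "-":
            continue  # column carries no information for this pair
        # A residue facing a gap counts as a column but scores zero (assumption).
        total += 0.0 if "-" in (a, b) else score(a, b)
        cols += 1
    return total / cols if cols else 0.0


def split_rows(n, workers):
    """Deal row indices round-robin so each worker gets a similar share of pairs."""
    return [list(range(w, n, workers)) for w in range(workers)]


def compute_rows(aligned, rows, score=identity_score):
    """Worker task: similarities for the upper-triangle part of the given rows."""
    return [
        (i, j, pair_similarity(aligned[i], aligned[j], score))
        for i in rows
        for j in range(i + 1, len(aligned))
    ]


def similarity_matrix(aligned, workers=2, score=identity_score):
    """Coordinator: hand out rows, then gather results into a symmetric matrix."""
    n = len(aligned)
    # Diagonal set to 1.0, assuming a score normalised so self-similarity is 1.
    sim = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(compute_rows, aligned, rows, score)
            for rows in split_rows(n, workers)
        ]
        for fut in futures:
            for i, j, value in fut.result():
                sim[i][j] = sim[j][i] = value
    return sim


aligned = ["ACGT", "ACGA", "A-GT"]
matrix = similarity_matrix(aligned, workers=2)
```

Rows are dealt round-robin because row `i` of the upper triangle holds `n - i - 1` pairs, so contiguous blocks would leave the first worker with most of the work. In an MPI setting, each list returned by `split_rows` would be sent to a worker rank and the triples gathered back by the coordinating rank.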

https://hal.archives-ouvertes.fr/hal-02366767/file/55cb4984-a90c-4ad7-8fa6-89eab3ba3f05-author.pdf