Search results for "Computational Mathematic"
showing 10 items of 987 documents
Informational and linguistic analysis of large genomic sequence collections via efficient Hadoop cluster algorithms
2018
Abstract Motivation Information theoretic and compositional/linguistic analysis of genomes have a central role in bioinformatics, even more so since the associated methodologies are becoming very valuable also for epigenomic and meta-genomic studies. The kernel of those methods is based on the collection of k-mer statistics, i.e. how many times each k-mer in {A,C,G,T}k occurs in a DNA sequence. Although this problem is computationally very simple and efficiently solvable on a conventional computer, the sheer amount of data available now in applications demands to resort to parallel and distributed computing. Indeed, those type of algorithms have been developed to collect k-mer statistics in…
FASTdoop: A versatile and efficient library for the input of FASTA and FASTQ files for MapReduce Hadoop bioinformatics applications
2017
Abstract Summary MapReduce Hadoop bioinformatics applications require the availability of special-purpose routines to manage the input of sequence files. Unfortunately, the Hadoop framework does not provide any built-in support for the most popular sequence file formats like FASTA or BAM. Moreover, the development of these routines is not easy, both because of the diversity of these formats and the need for managing efficiently sequence datasets that may count up to billions of characters. We present FASTdoop, a generic Hadoop library for the management of FASTA and FASTQ files. We show that, with respect to analogous input management routines that have appeared in the Literature, it offers…
FastaHerder2: Four Ways to Research Protein Function and Evolution with Clustering and Clustered Databases.
2016
The accelerated growth of protein databases offers great possibilities for the study of protein function using sequence similarity and conservation. However, the huge number of sequences deposited in these databases requires new ways of analyzing and organizing the data. It is necessary to group the many very similar sequences, creating clusters with automated derived annotations useful to understand their function, evolution, and level of experimental evidence. We developed an algorithm called FastaHerder2, which can cluster any protein database, putting together very similar protein sequences based on near-full-length similarity and/or high threshold of sequence identity. We compressed 50…
Assessment of in vivo organ-uptake and in silico prediction of CYP mediated metabolism of DA-Phen, a new dopaminergic agent
2017
Abstract The drug development process strives to predict metabolic fate of a drug candidate, together with its uptake in major organs, whether they act as target, deposit or metabolism sites, to the aim of establish a relationship between the pharmacodynamics and the pharmacokinetics and highlight the potential toxicity of the drug candidate. The present study was aimed at evaluating the in vivo uptake of 2-Amino-N-[2-(3,4-dihydroxy-phenyl)-ethyl]-3-phenyl-propionamide (DA-Phen) − a new dopaminergic neurotransmission modulator, in target and non-target organs of animal subjects and integrating these data with SMARTCyp results, an in silico method that predicts the sites of cytochrome P450-m…
Assessing statistical significance in multivariable genome wide association analysis
2016
Motivation: Although Genome Wide Association Studies (GWAS) genotype a very large number of single nucleotide polymorphisms (SNPs), the data are often analyzed one SNP at a time. The low predictive power of single SNPs, coupled with the high significance threshold needed to correct for multiple testing, greatly decreases the power of GWAS. Results: We propose a procedure in which all the SNPs are analyzed in a multiple generalized linear model, and we show its use for extremely high-dimensional datasets. Our method yields P-values for assessing significance of single SNPs or groups of SNPs while controlling for all other SNPs and the family wise error rate (FWER). Thus, our method tests whe…
SpaceScanner: COPASI wrapper for automated management of global stochastic optimization experiments
2017
Abstract Motivation Due to their universal applicability, global stochastic optimization methods are popular for designing improvements of biochemical networks. The drawbacks of global stochastic optimization methods are: (i) no guarantee of finding global optima, (ii) no clear optimization run termination criteria and (iii) no criteria to detect stagnation of an optimization run. The impact of these drawbacks can be partly compensated by manual work that becomes inefficient when the solution space is large due to combinatorial explosion of adjustable parameters or for other reasons. Results SpaceScanner uses parallel optimization runs for automatic termination of optimization tasks in case…
Partitioned learning of deep Boltzmann machines for SNP data.
2016
Abstract Motivation Learning the joint distributions of measurements, and in particular identification of an appropriate low-dimensional manifold, has been found to be a powerful ingredient of deep leaning approaches. Yet, such approaches have hardly been applied to single nucleotide polymorphism (SNP) data, probably due to the high number of features typically exceeding the number of studied individuals. Results After a brief overview of how deep Boltzmann machines (DBMs), a deep learning approach, can be adapted to SNP data in principle, we specifically present a way to alleviate the dimensionality problem by partitioned learning. We propose a sparse regression approach to coarsely screen…
FLYCOP: metabolic modeling-based analysis and engineering microbial communities
2018
10 p.-5 fig.-2 tab.
MetaCache: context-aware classification of metagenomic reads using minhashing.
2017
Abstract Motivation Metagenomic shotgun sequencing studies are becoming increasingly popular with prominent examples including the sequencing of human microbiomes and diverse environments. A fundamental computational problem in this context is read classification, i.e. the assignment of each read to a taxonomic label. Due to the large number of reads produced by modern high-throughput sequencing technologies and the rapidly increasing number of available reference genomes corresponding software tools suffer from either long runtimes, large memory requirements or low accuracy. Results We introduce MetaCache—a novel software for read classification using the big data technique minhashing. Our…
Reactome diagram viewer: data structures and strategies to boost performance
2017
Abstract Motivation Reactome is a free, open-source, open-data, curated and peer-reviewed knowledgebase of biomolecular pathways. For web-based pathway visualization, Reactome uses a custom pathway diagram viewer that has been evolved over the past years. Here, we present comprehensive enhancements in usability and performance based on extensive usability testing sessions and technology developments, aiming to optimize the viewer towards the needs of the community. Results The pathway diagram viewer version 3 achieves consistently better performance, loading and rendering of 97% of the diagrams in Reactome in less than 1 s. Combining the multi-layer html5 canvas strategy with a space partit…