Search results for "DATA MINING"
showing 10 items of 907 documents
A web application for the unspecific detection of differentially expressed DNA regions in strand-specific expression data
2015
Abstract Genomic technologies allow laboratories to produce large-scale data sets, either through the use of next-generation sequencing or microarray platforms. To explore these data sets and obtain maximum value from the data, researchers view their results alongside all the known features of a given reference genome. To study transcriptional changes that occur under a given condition, researchers search for regions of the genome that are differentially expressed between experimental conditions. To identify these regions, several algorithms have been developed over the years, along with some bioinformatic platforms that enable their use. However, currently available appli…
The Power of Word-Frequency Based Alignment-Free Functions: a Comprehensive Large-Scale Experimental Analysis
2021
Abstract Motivation: Alignment-free (AF) distance/similarity functions are a key tool for sequence analysis. Experimental studies on real datasets abound and, to some extent, there are also studies regarding their control of the false positive rate (Type I error). However, assessment of their power, i.e. their ability to identify true similarity, has been limited to some members of the D2 family. The corresponding experimental studies have concentrated on short sequences, a scenario no longer adequate for current applications, where sequence lengths may vary considerably. Such a state of the art is methodologically problematic, since information regarding a key feature such as power is either mi…
Overlap and diversity in antimicrobial peptide databases: Compiling a non-redundant set of sequences
2015
Abstract Motivation: The large variety of antimicrobial peptide (AMP) databases developed to date are characterized by a substantial overlap of data and similarity of sequences. Our goals are to analyze the levels of redundancy for all available AMP databases and use this information to build a new non-redundant sequence database. For this purpose, a new software tool is introduced. Results: A comparative study of 25 AMP databases reveals the overlap and diversity among them and the internal diversity within each database. The overlap analysis shows that only one database (Peptaibol) contains exclusive data, not present in any other, whereas all sequences in the LAMP_Patent database are inc…
ArtiFuse—computational validation of fusion gene detection tools without relying on simulated reads
2019
Abstract Motivation: Gene fusions are an important class of transcriptional variants that can influence cancer development and can be predicted from RNA sequencing (RNA-seq) data by multiple existing tools. However, the real-world performance of these tools is unclear due to the lack of known positive and negative events, especially with regard to fusion genes in individual samples. Often simulated reads are used, but these cannot account for all technical biases in RNA-seq data generated from real samples. Results: Here, we present ArtiFuse, a novel approach that simulates fusion genes by sequence modification to the genomic reference and can therefore be applied to any RNA-seq dataset wit…
Fully Bayesian Approach to Image Restoration with an Application in Biogeography
1994
Summary: A common method of studying biogeographical ranges is an atlas survey, in which the research area is divided into a square grid and the data consist of the squares where observations occur. Often the observations form only an incomplete map of the true range, and a method is required to decide whether the blank squares indicate true absence or merely a lack of study there. This is essentially an image restoration problem, but it has properties that make the common empirical Bayesian procedures inadequate. Most notably, the observed image is heavily degraded, causing difficulties in the estimation of spatial interaction, and the assessment of reliability of the restoration is emphasi…
Testing for local structure in spatiotemporal point pattern data
2017
The detection of clustering structure in a point pattern is one of the main focuses of attention in spatiotemporal data mining. Indeed, statistical tools for clustering detection and identification of individual events belonging to clusters are welcome in epidemiology and seismology. Local second-order characteristics provide information on how an event relates to nearby events. In this work, we extend local indicators of spatial association (known as LISA functions) to the spatiotemporal context (which will then be called LISTA functions). These functions are then used to build local tests of clustering to analyse differences in local spatiotemporal structures. We present a simulation stud…
RNA-Seq Atlas—a reference database for gene expression profiling in normal tissue by next-generation sequencing
2012
Abstract Motivation: Next-generation sequencing technology enables an entirely new perspective for clinical research and will speed up personalized medicine. In contrast to microarray-based approaches, RNA-Seq analysis provides a much more comprehensive and unbiased view of gene expression. Although the perspective is clear and the long-term success of this new technology obvious, bioinformatics resources that make these data easily available, especially to the biomedical research community, are still evolving. Results: We have generated RNA-Seq Atlas, a web-based repository of RNA-Seq gene expression profiles and query tools. The website offers open and easy access to RNA-Seq gene expression pr…
Outlier detection with automatic modelling: TRAMO/SEATS versus X-12-ARIMA
2012
Efficient change point detection in genomic sequences of continuous measurements
2010
Abstract Motivation: Knowing the exact locations of multiple change points in genomic sequences serves several biological needs, for instance when data represent aCGH profiles and it is of interest to identify possibly damaged genes involved in cancer and other diseases. Only a few of the currently available methods deal explicitly with estimation of the number and location of change points, and moreover these methods may be somewhat vulnerable to deviations from the model assumptions usually employed. Results: We present a computationally efficient method to obtain estimates of the number and location of the change points. The method is based on a simple transformation of data and it provides re…
Systematic handling of missing data in complex study designs: experiences from the Health 2000 and 2011 Surveys
2016
We present a systematic approach to the practical and comprehensive handling of missing data motivated by our experiences of analyzing longitudinal survey data. We consider the Health 2000 and 2011 Surveys (BRIF8901) where increased non-response and non-participation from 2000 to 2011 was a major issue. The model assumptions involved in the complex sampling design, repeated measurements design, non-participation mechanisms and associations are presented graphically using methodology previously defined as a causal model with design, i.e. a functional causal model extended with the study design. This tool forces the statistician to make the study design and the missing-data mechanism explicit…