0000000001101791
AUTHOR
Simona E. Rombo
Characterization and Extraction of Irredundant Tandem Motifs
We address the problem of extracting pairs of subwords (m1,m2) from a text string s of length n, such that, given also an integer constant d in input, m1 and m2 occur in tandem within a maximum distance of d symbols in s. The main effort of this work is to eliminate the possible redundancy from the candidate set of tandem motifs so found. To this aim, we first introduce the concept of maximality, characterized by four specific conditions, which we show not to be deducible from the corresponding notion of maximality already defined for "simple" (i.e., non-tandem) motifs. Then, we further eliminate the remaining redundancy by defining the concept of irredundancy for tandem motifs. We prove t…
Discriminating graph pattern mining from gene expression data
We consider the problem of mining gene expression data in order to single out interesting features that characterize healthy/unhealthy samples of an input dataset. We present an approach based on a network model of the input gene expression data, where there is a labelled graph for each sample. To the best of our knowledge, this is the first attempt to build a different graph for each sample and, then, to have a database of graphs for representing a sample set. Our main goal is that of singling out interesting differences between healthy and unhealthy samples, through the extraction of "discriminating patterns" among graphs belonging to the two different sample sets. Differently from the …
Image classification based on 2D feature motifs
The classification of raw data often involves the problem of selecting the appropriate set of features to represent the input data. In general, various features can be extracted from the input dataset, but only some of them are actually relevant for the classification process. Since relevant features are often unknown in real-world problems, many candidate features are usually introduced. This degrades both the speed and the predictive accuracy of the classifier due to the presence of redundancy in the candidate feature set. In this paper, we study the capability of a special class of motifs previously introduced in the literature, i.e. 2D irredundant motifs, when they are exploited as feat…
A Collaborative Filtering Approach for Drug Repurposing
A recommendation system is proposed based on the construction of Knowledge Graphs, where physical interactions between proteins and associations between drugs and targets are taken into account. The system suggests new targets for a given drug depending on how proteins are linked to each other in the graph. The framework adopted for the implementation of the proposed approach is Apache Spark, useful for loading, managing and manipulating data by means of appropriate Resilient Distributed Datasets (RDD). Moreover, the Alternating Least Squares (ALS) machine learning algorithm, a Matrix Factorization algorithm for distributed and parallel computing, is applied. Preliminary obtained results seem to…
An evolutionary restricted neighborhood search clustering approach for PPI networks
Protein-protein interaction networks have been broadly studied in the last few years, in order to understand the behavior of proteins inside the cell. Proteins interacting with each other often share common biological functions or they participate in the same biological process. Thus, discovering protein complexes made of a group of proteins strictly related can be useful to predict protein functions. Clustering techniques have been widely employed to detect significant biological complexes. In this paper, we integrate one of the most popular network clustering techniques, namely the Restricted Neighborhood Search Clustering (RNSC), with evolutionary computation. The two cost functions intr…
PROTEIN SECONDARY STRUCTURE PREDICTION: HOW TO IMPROVE ACCURACY BY INTEGRATION
In this paper a technique to improve protein secondary structure prediction is proposed. The approach is based on the idea of combining the results of a set of prediction tools, choosing the most correct parts of each prediction. The correctness of the resulting prediction is measured referring to accuracy parameters used in several editions of CASP. Experimental evaluations validating the proposed approach are also reported.
Image Compression by 2D Motif Basis
Approaches to image compression and indexing based on extensions to 2D of some of the Lempel-Ziv incremental parsing techniques have been proposed in the recent past. In these approaches, an image is decomposed into a number of patches, consisting each of a square or rectangular solid block. This paper proposes image compression techniques based on patches that are not necessarily solid blocks, but are affected instead by a controlled number of undetermined or don't care pixels. Such patches are chosen from a set of candidate motifs that are extracted in turn from the image 2D motif basis, the latter consisting of a compact set of patterns that result from the autocorrelation of the image w…
Approximate Matching over Biological RDF Graphs
In the last few years, the amount of biological interaction data discovered and stored in public databases (e.g., KEGG [2]) has considerably increased. In this context, RDF is a powerful representation for interactions (or pathways), since they can be modeled as directed graphs, often referred to as biological networks, where nodes represent cellular components and the (labeled or unlabeled) edges correspond to interactions among components. Often, for a given organism, some components are known to be linked by well-studied interactions. Such groups of components are called modules and they can be represented by sub-graphs in the corresponding biological network model. To date, one of the most impor…
An Integrative Framework for the Construction of Big Functional Networks
We present a methodology for biological data integration, aiming at building and analysing large functional networks which model complex genotype-phenotype associations. A functional network is a graph where nodes represent cellular components (e.g., genes, proteins, mRNA, etc.) and edges represent associations among such molecules. Different types of components may coexist in the same network, and associations may be related to physical/biochemical interactions or functional/phenotypic relationships. Due to both the large amount of involved information and the computational complexity typical of the problems in this domain, the proposed framework is based on big data technologies (Spark a…
Entropic Profiles, Maximal Motifs and the Discovery of Significant Repetitions in Genomic Sequences
The degree of predictability of a sequence can be measured by its entropy and it is closely related to its repetitiveness and compressibility. Entropic profiles are useful tools to study the under- and over-representation of subsequences, providing also information about the scale of each conserved DNA region. On the other hand, compact classes of repetitive motifs, such as maximal motifs, have been proved to be useful for the identification of significant repetitions and for the compression of biological sequences. In this paper we show that there is a relationship between entropic profiles and maximal motifs, and in particular we prove that the former are a subset of the latter. As a furt…
Network Centralities and Node Ranking
An important problem in network analysis is understanding how important each node is for “propagating” information across the input network. To this aim, many centrality measures have been proposed in the literature, and our main goal here is that of providing an overview of the most important of them. In particular, we distinguish centrality measures based on walk computation from those based on shortest-path computation. We also provide some examples in order to clarify how these measures can be calculated, with special attention to Degree Centrality, Closeness Centrality and Betweenness Centrality.
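As an illustration of two of the measures named above, the following minimal Python sketch computes Degree Centrality and Closeness Centrality on a small undirected, unweighted graph. The adjacency-dictionary representation, the function names and the normalization by n - 1 are assumptions made for this example and are not taken from the article; Betweenness Centrality is omitted for brevity.

```python
from collections import deque


def bfs_distances(adj, source):
    """Shortest-path distances (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def degree_centrality(adj):
    """Number of neighbours of each node, divided by n - 1."""
    n = len(adj)
    return {u: len(adj[u]) / (n - 1) for u in adj}


def closeness_centrality(adj):
    """(n - 1) divided by the sum of distances to all other nodes.

    Nodes that cannot reach the whole graph get 0.0 (an assumption of this sketch).
    """
    n = len(adj)
    result = {}
    for u in adj:
        dist = bfs_distances(adj, u)
        total = sum(dist.values())
        result[u] = (n - 1) / total if total > 0 and len(dist) == n else 0.0
    return result


# Hypothetical toy network: the path a - b - c.
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
print(degree_centrality(path))
print(closeness_centrality(path))
```

On this path graph the middle node b is adjacent to every other node, so both measures rank it highest.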
JSSPrediction: a Framework to Predict Protein Secondary Structures Using Integration
Identifying protein secondary structures is a difficult task. Recently, a lot of software tools for protein secondary structure prediction have been produced and made available on-line, mostly with good performances. However, prediction tools work correctly for families of proteins, such that users have to know which predictor to use for a given unknown protein. We propose a framework to improve secondary structure prediction by integrating results obtained from a set of available predictors. Our contribution consists in the definition of a two phase approach: (i) select a set of predictors which have good performances with the unknown protein family, and (ii) integrate the prediction resul…
Data Sources and Models
Biological networks rely on the storage and retrieval of data associated to the physical interactions and/or functional relationships among different actors. In particular, the attention may be on the interactions among cellular components, such as proteins, genes, RNA, or for example on phenotype–genotype associations. Data from which biological networks are built are usually stored in public databases, and we provide here a brief summary of the main types of both data and associations, publicly available. Moreover, we also explain how it is possible to construct suitable network models from these associations, focusing on protein–protein interaction networks, gene–disease networks and net…
2D motif basis applied to the classification of digital images
The classification of raw data often involves the problem of selecting the appropriate set of features to represent the input data. Different types of features can be extracted from the input dataset, but only some of them are actually relevant for the classification process. Since relevant features are often unknown in real-world problems, many candidate features are usually introduced. This degrades both the speed and the predictive accuracy of the classifier due to the presence of redundancy in the set of candidate features. Recently, a special class of bidimensional motifs, i.e. the 2D motif basis, has been introduced in the literature. The 2D motif basis has been shown to be powerful in capturing the r…
Design and Prototyping of a Smart University Campus
The authors propose a framework to support the “smart planning” of a university environment, intended as a “smart campus.” The main goal is to improve the management, storage, and mining of information coming from the university areas and main players. The platform allows for interaction with the main players of the system, generating and displaying useful data in real time for a better user experience. The proposed framework also provides a chat assistant able to respond to user requests in real time. This not only improves the communication between the university environment and students, but also allows one to investigate their habits and needs. Moreover, information collected from the …
Flexible pattern discovery with (extended) disjunctive logic programming
The post-genomic era showed up a wide range of new challenging issues for the areas of knowledge discovery and intelligent information management. Among them, the discovery of complex pattern repetitions in string databases plays an important role, specifically in those contexts where even what are to be considered the interesting pattern classes is unknown. This paper provides a contribution in this precise setting, proposing a novel approach, based on disjunctive logic programming extended with several advanced features, for discovering interesting pattern classes from a given data set.
(Discriminative) Pattern Discovery on Biological Networks
This work provides a review of biological networks as a model for analysis, presenting and discussing a number of illuminating analyses. Biological networks are an effective model for providing insights about biological mechanisms. Networks with different characteristics are employed for representing different scenarios. This powerful model allows analysts to perform many kinds of analyses which can be mined to provide interesting information about underlying biological behaviors. The text also covers techniques for discovering exceptional patterns, such as a pattern accounting for local similarities and also collaborative effects involving interactions between multiple actors (for example …
Prediction of Disease–lncRNA Associations via Machine Learning and Big Data Approaches
This chapter introduces long non-coding RNAs and their role in the occurrence and progress of diseases. The discovery of novel lncRNA-disease associations may provide valuable input to the understanding of disease mechanisms at the lncRNA level, as well as to the detection of biomarkers for disease diagnosis, treatment, prognosis, and prevention. Unfortunately, due to costs and time complexity, the number of possible disease-related lncRNAs verified by traditional biological experiments is very limited. Computational approaches for the prediction of potential disease-lncRNA associations can effectively decrease the time and cost of biological experiments. We first review the main computatio…
FEDRO: a software tool for the automatic discovery of candidate ORFs in plants with c→u RNA editing
RNA editing is an important mechanism for gene expression in plant organelles. It alters the direct transfer of genetic information from DNA to proteins, due to the introduction of differences between RNAs and the corresponding coding DNA sequences. Software tools that are successful in searching for genes in other organisms are not always able to perform this task correctly in plant organellar genomes. Moreover, the available software tools predicting RNA editing events utilise algorithms that do not account for events which may generate a novel start codon. We present Fedro, a Java software tool implementing a novel strategy to generate candidate Open Reading Frames (ORFs) resulting from Cytidi…
Efficient Algorithms for Sequence Analysis with Entropic Profiles
Entropy, being closely related to repetitiveness and compressibility, is a widely used information-related measure to assess the degree of predictability of a sequence. Entropic profiles are based on information theory principles, and can be used to study the under-/over-representation of subwords, by also providing information about the scale of conserved DNA regions. Here, we focus on the algorithmic aspects related to entropic profiles. In particular, we propose linear time algorithms for their computation that rely on suffix-based data structures, more specifically on the truncated suffix tree (TST) and on the enhanced suffix array (ESA). We performed an extensive experimental campaign …
Protein data condensation for effective quaternary structure classification
Many proteins are composed of two or more subunits, each associated with different polypeptide chains. The number and the arrangement of subunits forming a protein are referred to as quaternary structure. The quaternary structure of a protein is important, since it characterizes the biological function of the protein when it is involved in specific biological processes. Unfortunately, quaternary structures are not trivially deducible from protein amino acid sequences. In this work, we propose a protein quaternary structure classification method exploiting the functional domain composition of proteins. It is based on a nearest neighbor condensation technique in order to reduce both the porti…
Protein-protein interaction network querying by a "focus and zoom" approach
We propose an approach to network querying in protein-protein interaction networks based on bipartite graph weighted matching. An algorithm is presented that first “focuses” on the potentially relevant portion of the target graph by performing a global alignment of the latter with the query graph, and then “zooms” in on the actual matching nodes by considering their topological arrangement, thereby obtaining a (possibly) approximated occurrence of the query graph within the target graph. Approximation is related to node insertions, node deletions and edge deletions possibly intervening in the query graph. The technique manages networks of arbitrary topology. Moreover, edge labels are used to represe…
Epigenomic k-mer dictionaries: shedding light on how sequence composition influences in vivo nucleosome positioning
Abstract Motivation: Information-theoretic and compositional analysis of biological sequences, in terms of k-mer dictionaries, has a well established role in genomic and proteomic studies. Much less so in epigenomics, although the role of k-mers in chromatin organization and nucleosome positioning is particularly relevant. Fundamental questions concerning the informational content and compositional structure of nucleosome favouring and disfavoring sequences with respect to their basic building blocks still remain open. Results: We present the first analysis on the role of k-mers in the composition of nucleosome enriched and depleted genomic regions (NER and NDR for short) that is: (i) exhau…
PINCoC: a Co-Clustering based Method to Analyze Protein-Protein Interaction Networks
A novel technique to search for functional modules in a protein-protein interaction network is presented. The network is represented by the adjacency matrix associated with the undirected graph modelling it. The algorithm introduces the concept of quality of a sub-matrix of the adjacency matrix, and applies a greedy search technique for finding local optimal solutions made of dense submatrices containing the maximum number of ones. An initial random solution, constituted by a single protein, is evolved to search for a locally optimal solution by adding/removing connected proteins that best contribute to improve the quality function. Experimental evaluations carried out on Saccharomyces Cerevis…
Restricted Neighborhood Search Clustering Revisited: An Evolutionary Computation Perspective
Protein-protein interaction networks have been broadly studied in the last few years, in order to understand the behavior of proteins inside the cell. Proteins interacting with each other often share common biological functions or they participate in the same biological process. Thus, discovering protein complexes made of groups of strictly related proteins can be useful to predict protein functions. Clustering techniques have been widely employed to detect significant biological complexes. In this paper, we integrate one of the most popular network clustering techniques, namely the Restricted Neighborhood Search Clustering (RNSC), with evolutionary computation. The two cost functions in…
Extracting string motif bases for quorum higher than two
Bases of generators of motifs consisting of strings in which some positions can be occupied by a don’t care provide a useful conceptual tool for their description and a way to reduce the time and space involved in the discovery process. In the last few years, a few algorithms have been proposed for the extraction of a basis, building in large part on combinatorial properties of strings and their autocorrelations. Currently, the most efficient techniques for binary alphabets and quorum q = 2 require time quadratic in the length of the host string. The present paper explores properties of motif bases for quorum q ≥ 2, both with binary and general alphabets, by also showing that important resu…
Algorithms and tools for protein-protein interaction networks clustering, with a special focus on population-based stochastic methods
Abstract Motivation: Protein–protein interaction (PPI) networks are powerful models to represent the pairwise protein interactions of the organisms. Clustering PPI networks can be useful for isolating groups of interacting proteins that participate in the same biological processes or that perform together specific biological functions. Evolutionary orthologies can be inferred this way, as well as functions and properties of yet uncharacterized proteins. Results: We present an overview of the main state-of-the-art clustering methods that have been applied to PPI networks over the past decade. We distinguish five specific categories of approaches, describe and compare their main features and …
Exceptional Pattern Discovery
This chapter is devoted to a discussion on exceptional pattern discovery, namely on scenarios, contexts, and techniques concerning the mining of patterns which are so rare or so frequent as to be considered exceptional and, then, of interest for an expert to shed light on the domain. Frequent patterns have found broad applications in areas like association rule mining, indexing, and clustering [1, 20, 23]. The application of frequent patterns in classification also achieved some success in the classification of relational data [6, 13, 14, 19, 25], text [15], and graphs [7]. The part is organized as follows. First, the frequent pattern mining on classical datasets is presented. This is not …
Algorithms for Graph and Network Analysis: Graph Alignment
In this article we discuss the problem of graph alignment, which has long been used for the purpose of analyzing and comparing biological networks. In particular, we describe different facets of graph alignment, according to the number of input networks, the fixed output objective, and the possible heterogeneity of input data. Accordingly, we will discuss pairwise and multiple alignment, global and local alignment, etc. Moreover, we provide a comprehensive overview of the algorithms and techniques proposed in the literature to solve each of the specific considered types of graph alignment. In order to make the material presented here complete and useful to guide the reader in the use o…
DNA combinatorial messages and Epigenomics: The case of chromatin organization and nucleosome occupancy in eukaryotic genomes
Abstract Epigenomics is the study of modifications on the genetic material of a cell that do not depend on changes in the DNA sequence, since those latter involve specific proteins around which DNA wraps. The end result is that Epigenomic changes have a fundamental role in the proper working of each cell in Eukaryotic organisms. A particularly important part of Epigenomics concentrates on the study of chromatin, that is, a fiber composed of a DNA-protein complex that is highly characteristic of Eukaryotes. Understanding how chromatin is assembled and how it changes is fundamental for Biology. In more than thirty years of research in this area, Mathematics and Theoretical Computer Science have gai…
Motif patterns in 2D
Abstract Motif patterns consisting of sequences of intermixed solid and don’t-care characters have been introduced and studied in connection with pattern discovery problems of computational biology and other domains. In order to alleviate the exponential growth of such motifs, notions of maximal saturation and irredundancy have been formulated, whereby more or less compact subsets of the set of all motifs can be extracted, that are capable of expressing all others by suitable combinations. In this paper, we introduce the notion of maximal irredundant motifs in a two-dimensional array and develop initial properties and a combinatorial argument that poses a linear bound on the total number of …
Searching for repetitions in biological networks: methods, resources and tools
We present here a compact overview of the data, models and methods proposed for the analysis of biological networks based on the search for significant repetitions. In particular, we concentrate on three problems widely studied in the literature: ‘network alignment’, ‘network querying’ and ‘network motif extraction’. We provide (i) details of the experimental techniques used to obtain the main types of interaction data, (ii) descriptions of the models and approaches introduced to solve such problems and (iii) pointers to both the available databases and software tools. The intent is to lay out a useful roadmap for identifying suitable strategies to analyse cellular data, possibly based on t…
Customer recommendation based on profile matching and customized campaigns in on-line social networks
We propose a general framework for the recommendation of possible customers (users) to advertisers (e.g., brands) based on the comparison between On-Line Social Network profiles. In particular, we associate suitable categories and subcategories to both user and brand profiles in the considered On-line Social Network. When categories involve posts and comments, the comparison is based on word embedding, and this allows taking into account the similarity between the topics of particular interest for a brand and the user preferences. Furthermore, user personal information, such as age, job or genre, is used for targeting specific advertising campaigns. Results on a real Facebook dataset show t…
Identifying the k Best Targets for an Advertisement Campaign via Online Social Networks
We propose a novel approach for the recommendation of possible customers (users) to advertisers (e.g., brands) based on two main aspects: (i) the comparison between On-line Social Network profiles, and (ii) neighborhood analysis on the On-line Social Network. Profile matching between users and brands is considered based on a bag-of-words representation of textual contents coming from the social media, and measures such as the Term Frequency-Inverse Document Frequency are used in order to characterize the importance of words in the comparison. The approach has been implemented relying on Big Data Technologies, thus allowing the efficient analysis of very large Online Social Networks. Resul…
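To make the bag-of-words and TF-IDF idea mentioned above concrete, here is a minimal Python sketch that weights words by TF-IDF and compares two profiles by cosine similarity. It is not the authors' implementation: whitespace tokenization, the specific TF-IDF variant (relative term frequency times log(N/df)), the use of cosine similarity and the toy profile texts are all assumptions made for this example.

```python
import math
from collections import Counter


def tf_idf_vectors(documents):
    """Return one sparse TF-IDF vector (dict term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical profiles: one brand and two users.
profiles = [
    "running shoes marathon",   # brand
    "marathon running gear",    # user 1
    "pasta recipes dinner",     # user 2
]
brand, user1, user2 = tf_idf_vectors(profiles)
print(cosine(brand, user1), cosine(brand, user2))
```

In this toy setting user 1 shares weighted terms with the brand and scores higher than user 2, who shares none.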
Protein Structure Metapredictors
Discovering discriminative graph patterns from gene expression data
We consider the problem of mining gene expression data in order to single out interesting features characterizing healthy/unhealthy samples of an input dataset. We present an approach based on a network model of the input gene expression data, where there is a labelled graph for each sample. To the best of our knowledge, this is the first attempt to build a different graph for each sample and, then, to have a database of graphs for representing a sample set. Our main goal is that of singling out interesting differences between healthy and unhealthy samples, through the extraction of "discriminative patterns" among graphs belonging to the two different sample sets. Differently from the other…
Compressive biological sequence analysis and archival in the era of high-throughput sequencing technologies
High-throughput sequencing technologies produce large collections of data, mainly DNA sequences with additional information, requiring the design of efficient and effective methodologies for both their compression and storage. In this context, we first provide a classification of the main techniques that have been proposed, according to three specific research directions that have emerged from the literature and, for each, we provide an overview of the current techniques. Finally, to make this review useful to researchers and technicians applying the existing software and tools, we include a synopsis of the main characteristics of the described approaches, including details on their impleme…
Contributions from ADBIS 2018 workshops
The ADBIS conferences provide an international forum for the presentation of research on database theory, development of advanced DBMS technologies, and their applications. The 22nd edition of ADBIS, held on September 2–5, 2018, in Budapest, Hungary, includes six thematic workshops collecting contributions from various domains representing new trends in the broad research areas of databases and information systems.
Problems and Techniques
When biological networks are considered, the extraction of interesting knowledge often involves subgraph isomorphism checking, which is known to be NP-complete. For this reason, many approaches try to simplify the problem under consideration by considering structures simpler than graphs, such as trees or paths. Furthermore, the number of existing approximate techniques is notably greater than the number of exact methods. In this chapter, we provide an overview of three important problems defined on biological networks: network alignment, network clustering, and motifs extraction from biological networks. For each of these problems, we also describe some of the most important techniques proposed…
Multi-functional Protein Clustering in PPI Networks
Protein-Protein Interaction (PPI) networks contain valuable information for the isolation of groups of proteins that participate in the same biological function. Many proteins play different roles in the cell by taking part in several processes, but isolating the different processes in which a protein is involved is often a difficult task. In this paper we present a method based on a greedy local search technique to detect functional modules in PPI graphs. The approach is conceived as a generalization of the algorithm PINCoC to generate overlapping clusters of the interaction graph in input. Due to this peculiarity, multi-faceted proteins are allowed to belong to different groups correspondi…
Asymmetric Comparison and Querying of Biological Networks
Comparing and querying the protein-protein interaction (PPI) networks of different organisms is important to infer knowledge about conservation across species. Known methods that perform these tasks operate symmetrically, i.e., they do not assign a distinct role to the input PPI networks. However, in most cases, the input networks are indeed distinguishable on the basis of how the corresponding organism is biologically well characterized. In this paper a new idea is developed, that is, to exploit differences in the characterization of organisms at hand in order to devise methods for comparing their PPI networks. We use the PPI network (called Master) of the best characterized organism as a …
New Trends in Graph Mining
Searching for repeated features characterizing biological data is fundamental in computational biology. When biological networks are under analysis, the presence of repeated modules across the same network (or several distinct ones) is shown to be very relevant. Indeed, several studies prove that biological networks can be often understood in terms of coalitions of basic repeated building blocks, often referred to as network motifs. This work provides a review of the main techniques proposed for motif extraction from biological networks. In particular, main intrinsic difficulties related to the problem are pointed out, along with solutions proposed in the literature to overcome them. Open ch…
"Master-Slave" Biological Network Alignment
Performing global alignment between protein-protein interaction (PPI) networks of different organisms is important to infer knowledge about conservation across species. Known methods that perform this task operate symmetrically, that is to say, they do not assign a distinct role to the input PPI networks. However, in most cases, the input networks are indeed distinguishable on the basis of how well the corresponding organism is biologically characterized. For well-characterized organisms, the associated PPI network supposedly encodes in a sound manner all the information about their proteins and associated interactions, which is far from being the case for not well characterized ones. He…
Irredundant tandem motifs
Eliminating the possible redundancy from a set of candidate motifs occurring in an input string is fundamental in many applications. The existing techniques proposed to extract irredundant motifs are not suitable when the motifs to search for are structured, i.e., they are made of two (or several) subwords that co-occur in a text string s of length n. The main effort of this work is studying and characterizing a compact class of tandem motifs, that is, pairs of substrings {m1, m2} occurring in tandem within a maximum distance of d symbols in s, where d is an integer constant given in input. To this aim, we first introduce the concept of maximality, related to four specific conditions that h…
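The notion of two substrings occurring in tandem within a maximum distance d can be illustrated with a brute-force Python sketch. It only enumerates tandem occurrences of a given pair and does not implement the maximality or irredundancy conditions studied in the paper. The function names are hypothetical, and the reading of "distance" as the number of symbols between the end of m1 and the start of m2 is an assumption of this example.

```python
def occurrences(s, m):
    """All start positions of m in s, including overlapping ones."""
    positions = []
    start = s.find(m)
    while start != -1:
        positions.append(start)
        start = s.find(m, start + 1)
    return positions


def tandem_occurrences(s, m1, m2, d):
    """Pairs (i, j) such that m1 starts at i, m2 starts at j, and m2 begins
    at most d symbols after the end of m1 (assumed distance convention)."""
    pairs = []
    starts2 = occurrences(s, m2)
    for i in occurrences(s, m1):
        end1 = i + len(m1)
        for j in starts2:
            if 0 <= j - end1 <= d:
                pairs.append((i, j))
    return pairs


# Hypothetical text: "ab" is followed by "cd" after a gap of 2 and of 1 symbols.
text = "abxxcdabycd"
print(tandem_occurrences(text, "ab", "cd", 2))
print(tandem_occurrences(text, "ab", "cd", 1))
```

With d = 2 both occurrences of the pair qualify, while with d = 1 only the second one does.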
Burrows Wheeler Transform on a Large Scale: Algorithms Implemented in Apache Spark
With the rapid growth of Next Generation Sequencing (NGS) technologies, large amounts of "omics" data are collected daily and need to be processed. Indexing and compressing large sequence datasets are some of the most important tasks in this context. Here we propose algorithms for the computation of the Burrows-Wheeler transform relying on Big Data technologies, i.e., Apache Spark and Hadoop. Our algorithms are the first ones that distribute the index computation and not only the input dataset, making it possible to benefit fully from the available cloud resources.
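For readers unfamiliar with the transform itself, the following is a minimal single-machine sketch of the Burrows-Wheeler transform based on sorting all rotations. It is not the distributed Spark/Hadoop algorithm proposed in the paper; the function names and the choice of `$` as sentinel are assumptions made for illustration.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Naive Burrows-Wheeler transform: sort all rotations, take last column.

    Assumption: the sentinel does not occur in the text and sorts before
    every other symbol (true for "$" versus ASCII letters).
    """
    if sentinel in text:
        raise ValueError("sentinel must not occur in the input text")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(transformed: str, sentinel: str = "$") -> str:
    """Invert the transform by repeatedly prepending and sorting columns."""
    table = [""] * len(transformed)
    for _ in range(len(transformed)):
        table = sorted(c + row for c, row in zip(transformed, table))
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]  # drop the sentinel


print(bwt("banana"))               # annb$aa
print(inverse_bwt(bwt("banana")))  # banana
```

Sorting all rotations takes memory quadratic in the input length, so this form is only suitable for short strings; large-scale implementations typically build on suffix sorting instead.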
Integrative bioinformatics and omics data source interoperability in the next-generation sequencing era-Editorial.
With the advent of high-throughput and next-generation sequencing (NGS) technologies [1], huge amounts of ‘omics’ data (i.e. data from genomics, proteomics, pharmacogenomics, metagenomics, etc.) are continuously produced. Combining and integrating diverse omics data types is important in order to investigate the molecular machinery of complex diseases, with the hope for better disease prevention and treatment [2]. Experimental data repositories of omics data are publicly available, with the main aim of fostering the cooperation among research groups and laboratories all over the world. However, despite their openness, the effective integrated use of available public sources is hampered by t…
In vitro versus in vivo compositional landscapes of histone sequence preferences in eucaryotic genomes
Motivation: Although the nucleosome occupancy along a genome can be in part predicted by in vitro experiments, it has been recently observed that the chromatin organization presents important differences in vitro with respect to in vivo. Such differences mainly regard the hierarchical and regular structures of the nucleosome fiber, whose existence has long been assumed, and in part also observed in vitro, but that do not apparently occur in vivo. It is also well known that the DNA sequence has a role in determining the nucleosome occupancy. Therefore, an important issue is to understand if, and to what extent, the structural differences in the chromatin organization between in vit…
DIAMIN: a software library for the distributed analysis of large-scale molecular interaction networks
Background: Huge amounts of molecular interaction data are continuously produced and stored in public databases. Although many bioinformatics tools have been proposed in the literature for their analysis, based on their modeling through different types of biological networks, several problems still remain unsolved when the analysis is carried out on a large scale. Results: We propose DIAMIN, that is, a high-level software library to facilitate the development of applications for the efficient analysis of large-scale molecular interaction networks. DIAMIN relies on distributed computing, and it is implemented in Java on top of the Apache Spark framework. It delivers a set of functionalities implementing different ta…
Efficient Classification of Digital Images based on Pattern-features
A summary of genomic databases: overview and discussion
In the last few years, both the amount of electronically stored biological data and the number of biological data repositories have grown significantly (today, more than eight hundred can be counted thereof). In spite of the enormous amount of available resources, a user may be disoriented when searching for specific data. Thus, the accurate analysis of biological data and repositories turns out to be useful to obtain a systematic view of biological database structures, tools and contents and, eventually, to facilitate the access and recovery of such data. In this chapter, we propose an analysis of genomic databases, which are databases of fundamental importance for the research in bioinfo…
IP6K gene identification in plant genomes by tag searching
Background: Plants have played a special role in inositol polyphosphate (IP) research since the first IP, the fully phosphorylated inositol ring of phytic acid (IP6), was discovered in plant seeds. It is now known that phytic acid is further metabolized by the IP6 Kinases (IP6Ks) to generate IPs containing a pyrophosphate moiety. The IP6Ks are evolutionarily conserved enzymes identified in several mammalian, fungal and amoeba species. Although IP6K has not yet been identified in plant chromosomes, there are many clues suggesting its presence in plant cells. Results: In this paper we propose a new approach to search for the plant IP6K gene, that leads to the identification in plant genome…
Experimental Evaluation of Protein Secondary Structure Predictors
Understanding the biological function of a protein, which is largely determined by its 3D shape, is a key issue in modern biology. Protein 3D shape, in turn, is functionally implied by its amino acid sequence. Since the direct inspection of such 3D structures is rather expensive and time-consuming, a number of software techniques have been developed in the last few years that predict a spatial model, either of the secondary or of the tertiary form, for a given target protein starting from its amino acid sequence. This paper offers a comparison of several available automatic secondary structure prediction tools. The comparison is of the experimental kind, where two relevant sets of proteins, a non…
Discovering representative models in large time series databases
The discovery of frequently occurring patterns in a time series could be important in several application contexts. As an example, the analysis of frequent patterns in biomedical observations could support diagnosis and/or prognosis. Moreover, the efficient discovery of frequent patterns may play an important role in several data mining tasks such as association rule discovery, clustering and classification. However, in order to identify interesting repetitions, it is necessary to allow errors in the matching patterns; in this context, it is difficult to select one pattern particularly suited to represent the set of similar ones, whereas modelling this set with a single model could…
A Big Data Approach for Sequences Indexing on the Cloud via Burrows Wheeler Transform
Indexing sequence data is important in the context of Precision Medicine, where large amounts of "omics" data have to be collected and analyzed daily in order to categorize patients and identify the most effective therapies. Here we propose an algorithm for the computation of the Burrows-Wheeler transform relying on Big Data technologies, i.e., Apache Spark and Hadoop. Our approach is the first that distributes the index computation and not only the input dataset, making it possible to benefit fully from the available cloud resources.
Prediction of lncRNA-Disease Associations from Tripartite Graphs
The discovery of novel lncRNA-disease associations may provide valuable input to the understanding of disease mechanisms at the lncRNA level, as well as to the detection of biomarkers for disease diagnosis, treatment, prognosis and prevention. Unfortunately, due to costs and time complexity, the number of possible disease-related lncRNAs verified by traditional biological experiments is very limited. Computational approaches for the prediction of potential disease-lncRNA associations can effectively decrease the time and cost of biological experiments. We propose an approach for the prediction of lncRNA-disease associations based on neighborhood analysis performed on a tripartite graph, built upon …
Automatic simulation of RNA editing in plants for the identification of novel putative Open Reading Frames
In plant mitochondria an essential mechanism for gene expression is RNA editing, often influencing the synthesis of functional proteins. RNA editing alters the linearity of genetic information transfer, introducing differences between RNAs and their coding DNA sequences that hinder both experimental and computational research of genes. Thus, common software tools for gene search, successfully exploited to find canonical genes, can often fail in discovering genes encrypted in the genome of plants. In this work we propose a novel strategy useful to intercept candidate coding sequences resulting from some possible editing substitutions on the start and stop codons of a given input organism DNA. O…
Improving protein secondary structure predictions by prediction fusion
Protein secondary structure prediction is still a challenging problem today. Although a number of prediction methods have been presented in the literature, the various prediction tools that are available on-line produce results whose quality is not always fully satisfactory. Therefore, a user has to know which predictor to use for a given protein to be analyzed. In this paper, we propose a server implementing a method to improve the accuracy in protein secondary structure prediction. The method is based on integrating the prediction results computed by some available on-line prediction tools to obtain a combined prediction of higher quality. Given an input protein p whose secondary struct…
Analyzing big datasets of genomic sequences: fast and scalable collection of k-mer statistics
Background: Distributed approaches based on the MapReduce programming paradigm have started to be proposed in the Bioinformatics domain, due to the large amount of data produced by the next-generation sequencing techniques. However, the use of MapReduce and related Big Data technologies and frameworks (e.g., Apache Hadoop and Spark) does not necessarily produce satisfactory results, in terms of both efficiency and effectiveness. We discuss how the development of distributed and Big Data management technologies has affected the analysis of large datasets of biological sequences. Moreover, we show how the choice of different parameter configurations and the careful engineering of the …
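To make the notion of k-mer statistics concrete, here is a minimal single-machine sketch that counts k-mers per sequence and then merges the partial counts, loosely mirroring a map step followed by a reduce step. It is an illustration under assumptions, not the distributed implementation discussed in the paper; the function names `count_kmers` and `merge_counts` are hypothetical.

```python
from collections import Counter
from typing import Iterable


def count_kmers(sequence: str, k: int) -> Counter:
    """Count every length-k substring (k-mer) of one sequence ("map" step)."""
    if k <= 0:
        raise ValueError("k must be positive")
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def merge_counts(partial_counts: Iterable[Counter]) -> Counter:
    """Sum per-sequence counts into a global table ("reduce" step)."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return total


sequences = ["ACGTACG", "CGTT"]
totals = merge_counts(count_kmers(seq, 3) for seq in sequences)
print(totals["ACG"], totals["CGT"])  # 2 2
```

Counting per whole sequence avoids losing k-mers that would straddle a split point; a partitioning scheme that cuts inside sequences would need overlapping boundaries of k-1 symbols.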
Discovering new proteins in plant mitochondria by RNA editing simulation
In plant mitochondria an essential mechanism for gene expression is RNA editing, often influencing the synthesis of functional proteins. RNA editing alters the linearity of genetic information transfer. Indeed, it causes differences between RNAs and their coding DNA sequences that hinder both experimental and computational research of genes. Therefore, common software tools for gene search, successfully applied to find canonical genes, often fail in discovering genes encrypted in the genome of plants. Here we propose a novel strategy useful to identify candidate coding sequences resulting from possible editing substitutions. In particular, we consider C→U substitutions leading to the creation…
Extracting similar sub-graphs across PPI Networks
Singling out conserved modules (corresponding to connected sub-graphs) throughout protein-protein interaction networks of different organisms is a major issue in bioinformatics because of its potential applications in biology. This paper presents a method to discover highly matching sub-graphs in such networks. Sub-graph extraction is carried out by taking into account, on the one hand, both protein sequence and network structure similarities and, on the other hand, both quantitative and reliability information possibly available about interactions. The method is conceived as a generalization of a known technique, able to discover functional orthologs in interaction networks. Some preliminar…