Search results for "Substring"

showing 10 items of 29 documents

Deep learning network for exploiting positional information in nucleosome related sequences

2017

A nucleosome is a DNA-histone complex, wrapping about 150 pairs of double-stranded DNA. The role of nucleosomes is to pack the DNA into the nucleus of the Eukaryote cells to form the Chromatin. Nucleosome positioning genome wide play an important role in the regulation of cell type-specific gene activities. Several biological studies have shown sequence specificity of nucleosome presence, clearly underlined by the organization of precise nucleotides substrings. Taking into consideration such advances, the identification of nucleosomes on a genomic scale has been successfully performed by DNA sequence features representation and classical supervised classification methods such as Support Vec…

0301 basic medicineComputer scienceSpeech recognitionCell02 engineering and technologyComputational biologyGenomeDNA sequencing03 medical and health scienceschemistry.chemical_compoundDeep Learning0202 electrical engineering electronic engineering information engineeringmedicineNucleosomeNucleotideGeneSettore ING-INF/05 - Sistemi Di Elaborazione Delle Informazionichemistry.chemical_classificationSequenceSettore INF/01 - Informaticabiologybusiness.industryDeep learningnucleosomebiology.organism_classificationSubstringChromatinIdentification (information)030104 developmental biologymedicine.anatomical_structurechemistry020201 artificial intelligence & image processingEukaryoteNucleosome classification Epigenetic Deep learning networks Recurrent Neural NetworksArtificial intelligencebusinessDNA

researchProduct

The colored longest common prefix array computed via sequential scans

2018

Due to the increased availability of large datasets of biological sequences, the tools for sequence comparison are now relying on efficient alignment-free approaches to a greater extent. Most of the alignment-free approaches require the computation of statistics of the sequences in the dataset. Such computations become impractical in internal memory when very large collections of long sequences are considered. In this paper, we present a new conceptual data structure, the colored longest common prefix array (cLCP), that allows to efficiently tackle several problems with an alignment-free approach. In fact, we show that such a data structure can be computed via sequential scans in semi-exter…

0301 basic medicineFOS: Computer and information sciencesAlignment-free methodsBurrows–Wheeler transformComputer scienceComputationAverage common substring0206 medical engineeringMatching statisticsScale (descriptive set theory)02 engineering and technologyTheoretical Computer Science03 medical and health sciencesComputer Science - Data Structures and AlgorithmsData Structures and Algorithms (cs.DS)Burrows-wheeler transformString (computer science)Computer Science (all)LCP arrayMatching statisticData structureSubstring030104 developmental biologyAlignment-free methods; Average common substring; Burrows-wheeler transform; Longest common prefix; Matching statistics; Theoretical Computer Science; Computer Science (all)Pairwise comparisonLongest common prefixAlgorithm020602 bioinformaticsAlignment-free method

researchProduct

Discovering unbounded unions of regular pattern languages from positive examples

1996

The problem of learning unions of certain pattern languages from positive examples is considered. We restrict to the regular patterns, i.e., patterns where each variable symbol can appear only once, and to the substring patterns, which is a subclass of regular patterns of the type xαy, where x and y are variables and α is a string of constant symbols. We present an algorithm that, given a set of strings, finds a good collection of patterns covering this set. The notion of a ‘good covering’ is defined as the most probable collection of patterns likely to be present in the examples, assuming a simple probabilistic model, or equivalently using the Minimum Description Length (MDL) principle. Ou…

0303 health sciencesComputer scienceString (computer science)0102 computer and information sciences01 natural sciencesSubstringCombinatoricsSet (abstract data type)03 medical and health sciencesVariable (computer science)Cover (topology)010201 computation theory & mathematicsSimple (abstract algebra)Minimum description length030304 developmental biology

researchProduct

On Combinatorial Generation of Prefix Normal Words

2014

A prefix normal word is a binary word with the property that no substring has more 1s than the prefix of the same length. This class of words is important in the context of binary jumbled pattern matching. In this paper we present an efficient algorithm for exhaustively listing the prefix normal words with a fixed length. The algorithm is based on the fact that the language of prefix normal words is a bubble language, a class of binary languages with the property that, for any word w in the language, exchanging the first occurrence of 01 by 10 in w results in another word in the language. We prove that each prefix normal word is produced in O(n) amortized time, and conjecture, based on expe…

Amortized analysisConjecturePrefix Normal WordBinary numbercombinatorial generation; formal languages; prefix normal words; binary strings; jumbled pattern matching; bubble languages; efficient algorithmsContext (language use)prefix normal wordsData_CODINGANDINFORMATIONTHEORYformal languagesbubble languagesSubstringcombinatorial generationbinary stringsPrefixCombinatoricsjumbled pattern matchingefficient algorithmsPattern matchingAlgorithmsWord (computer architecture)Mathematics

researchProduct

Maximal Closed Substrings

2022

A string is closed if it has length 1 or has a nonempty border without internal occurrences. In this paper we introduce the definition of a maximal closed substring (MCS), which is an occurrence of a closed substring that cannot be extended to the left nor to the right into a longer closed substring. MCSs with exponent at least 2 are commonly called runs; those with exponent smaller than 2, instead, are particular cases of maximal gapped repeats. We show that a string of length n contains O(n1.5) MCSs. We also provide an output-sensitive algorithm that, given a string of length n over a constant-size alphabet, locates all m MCSs the string contains in O(nlog n+ m) time.

Closed word Maximal closed substring Run

researchProduct

On the construction of classes of suffix trees for square matrices: Algorithms and applications

1995

Given an n × n TEXT matrix with entries defined over an ordered alphabet σ, we introduce 4n−1 classes of index data structures for TEXT. Those indices are informally the two-dimensional analog of the suffix tree of a string [15], allowing on-line searches and statistics to be performed on TEXT. We provide one simple algorithm that efficiently builds any chosen index in those classes in O(n2 log n) worst case time using O(n2) space. The algorithm can be modified to require optimal O(n2) expected time for bounded σ.

CombinatoricsCompressed suffix arraylawSuffix treeString (computer science)Generalized suffix treeSuffix arraySuffixAlgorithmFM-indexlaw.inventionMathematicsLongest common substring problem

researchProduct

Irredundant tandem motifs

2014

Eliminating the possible redundancy from a set of candidate motifs occurring in an input string is fundamental in many applications. The existing techniques proposed to extract irredundant motifs are not suitable when the motifs to search for are structured, i.e., they are made of two (or several) subwords that co-occur in a text string s of length n. The main effort of this work is studying and characterizing a compact class of tandem motifs, that is, pairs of substrings {m1, m2} occurring in tandem within a maximum distance of d symbols in s, where d is an integer constant given in input. To this aim, we first introduce the concept of maximality, related to four specific conditions that h…

CombinatoricsDiscrete mathematicsMotifs Tandem Patterns Irredundant motifs String algorithm Suffix treeGeneral Computer ScienceTandemlawSuffix treeText stringSubstringTheoretical Computer ScienceLinear numberMathematicslaw.inventionTheoretical Computer Science

researchProduct

Forbidden Factors and Fragment Assembly

2001

In this paper methods and results related to the notion of minimal forbidden words are applied to the fragment assembly problem. The fragment assembly problem can be formulated, in its simplest form, as follows: reconstruct a word w from a given set I of substrings (fragments ) of a word w . We introduce an hypothesis involving the set of fragments I and the maximal length m(w) of the minimal forbidden factors of w . Such hypothesis allows us to reconstruct uniquely the word w from the set I in linear time. We prove also that, if w is a word randomly generated by a memoryless source with identical symbol probabilities, m(w) is logarithmic with respect to the size of w . This result shows th…

CombinatoricsSet (abstract data type)Fragment (logic)LogarithmDeterministic automatonSymbol (programming)General MathematicsTime complexitySoftwareWord (computer architecture)SubstringComputer Science ApplicationsMathematicsRAIRO - Theoretical Informatics and Applications

researchProduct

Linear-size suffix tries

2016

Suffix trees are highly regarded data structures for text indexing and string algorithms [MCreight 76, Weiner 73]. For any given string w of length n = | w | , a suffix tree for w takes O ( n ) nodes and links. It is often presented as a compacted version of a suffix trie for w, where the latter is the trie (or digital search tree) built on the suffixes of w. Here the compaction process replaces each maximal chain of unary nodes with a single arc. For this, the suffix tree requires that the labels of its arcs are substrings encoded as pointers to w (or equivalent information). On the contrary, the arcs of the suffix trie are labeled by single symbols but there can be Θ ( n 2 ) nodes and lin…

Compressed suffix arrayGeneral Computer ScienceSuffix tree[INFO.INFO-DS]Computer Science [cs]/Data Structures and Algorithms [cs.DS]Generalized suffix tree0102 computer and information sciences02 engineering and technologyData_CODINGANDINFORMATIONTHEORYText indexing01 natural sciencesY-fast trielaw.inventionLongest common substring problemTheoretical Computer ScienceCombinatoricsSuffix treelawFactor and suffix automata0202 electrical engineering electronic engineering information engineeringData_FILESArithmeticFactor and suffix automata; Pattern matching; Suffix tree; Text indexing; Theoretical Computer Science; Computer Science (all)Pattern matchingMathematicsSettore INF/01 - InformaticaX-fast trieComputer Science (all)LCP array010201 computation theory & mathematics020201 artificial intelligence & image processingFM-index

researchProduct

On Table Arrangements, Scrabble Freaks, and Jumbled Pattern Matching

2010

Given a string s, the Parikh vector of s, denoted p(s), counts the multiplicity of each character in s. Searching for a match of Parikh vector q (a “jumbled string”) in the text s requires to find a substring t of s with p(t) = q. The corresponding decision problem is to verify whether at least one such match exists. So, for example for the alphabet Σ = {a, b, c}, the string s = abaccbabaaa has Parikh vector p(s) = (6,3,2), and the Parikh vector q = (2,1,1) appears once in s in position (1,4). Like its more precise counterpart, the renown Exact String Matching, Jumbled Pattern Matching has ubiquitous applications, e.g., string matching with a dyslectic word processor, table rearrangements, …

Discrete mathematicsParikh vectors jumbled pattern matching scrabble approximate pattern matching000AnagramParikh vectorsString searching algorithmApproximate string matchingDecision problemalgorithmsData structureJumbled Pattern MatchingSubstringscrabbleapproximate pattern matchingString MatchingWavelet TreePattern matchingMathematics

researchProduct