Search results for "Substring"

showing 9 items of 29 documents

Sparse Dynamic Programming for Longest Common Subsequence from Fragments

2002

Sparse Dynamic Programming has emerged as an essential tool for the design of efficient algorithms for optimization problems coming from such diverse areas as computer science, computational biology, and speech recognition. We provide a new sparse dynamic programming technique that extends the Hunt?Szymanski paradigm for the computation of the longest common subsequence (LCS) and apply it to solve the LCS from Fragments problem: given a pair of strings X and Y (of length n and m, respectively) and a set M of matching substrings of X and Y, find the longest common subsequence based only on the symbol correspondences induced by the substrings. This problem arises in an application to analysis…

Longest common subsequence problemCombinatoricsDynamic programmingSet (abstract data type)Computational MathematicsControl and OptimizationOptimization problemComputational Theory and MathematicsMatching (graph theory)Symbol (programming)ComputationSubstringMathematicsJournal of Algorithms
researchProduct

On Approximate Jumbled Pattern Matching in Strings

2011

Given a string s, the Parikh vector of s, denoted p(s), counts the multiplicity of each character in s. Searching for a match of a Parikh vector q in the text s requires finding a substring t of s with p(t) = q. This can be viewed as the task of finding a jumbled (permuted) version of a query pattern, hence the term Jumbled Pattern Matching. We present several algorithms for the approximate version of the problem: Given a string s and two Parikh vectors u, v (the query bounds), find all maximal occurrences in s of some Parikh vector q such that u <= q <= v. This definition encompasses several natural versions of approximate Parikh vector search. We present an algorithm solving this problem …

Parikh vectors: Average case analysiApproximate searchString algorithmsDiscrete mathematicsWeight functionanalysisSearch engine indexingParikh vectorsAverage case analysisApproximate string matchingSubstringString algorithmTheoretical Computer ScienceCombinatoricsComputational Theory and MathematicsString algorithms Pattern matching Parikh vectors Average case analysis Approximate search Permuted stringsPermuted stringsAverage caseTheory of computationWavelet TreePreprocessorPattern matchingPattern matchingMathematicsTheory of Computing Systems
researchProduct

Learning of regular expressions by pattern matching

1995

We consider the problem of restoring regular expressions from good examples. We describe a natural learning algorithm for obtaining a “plausible” regular expression from one example. The algorithm is based on finding the longest substring which can be matched by some part of the so far obtained expression. We believe that the algorithm to a certain extent mimics humans guessing regular expressions from the same sort of examples. We show that for regular expressions of bounded length successful learning takes time linear in the length of the example, provided that the example is “good”. Under certain natural restrictions the run-time of the learning algorithm is polynomial also in unsuccessf…

PolynomialFinite-state machineRegular languageComputer scienceBounded functionRegular expressionPattern matchingAlgorithmExpression (mathematics)Substring
researchProduct

Forbidden Factors and Fragment Assembly

2002

In this paper we approach the fragment assembly problem by using the notion of minimal forbidden factors introduced in previous paper. Denoting by M(w) the set of minimal forbidden factors of a word w, we first focus on the evaluation of the size of elements in M(w) and on designing of an algorithm to recover the word w from M(w). Actually we prove that for a word w randomly generated by a memoryless source with identical symbol probabilities, the maximal length m(w) of words in M(w) is logarithmic and that the reconstruction algorithm runs in linear time. These results have an interesting application to the fragment assembly problem, i.e. reconstruct a word w from a given set I of substrin…

Set (abstract data type)CombinatoricsLogarithmFragment (logic)Reconstruction algorithmFocus (optics)AlgorithmTime complexitySubstringWord (computer architecture)Mathematics
researchProduct

On-line Construction of Two-Dimensional Suffix Trees

1999

AbstractWe say that a data structure is builton-lineif, at any instant, we have the data structure corresponding to the input we have seen up to that instant. For instance, consider the suffix tree of a stringx[1,n]. An algorithm building iton-lineis such that, when we have read the firstisymbols ofx[1,n], we have the suffix tree forx[1,i]. We present a new technique, which we refer to asimplicit updates, based on which we obtain: (a) an algorithm for theon-lineconstruction of the Lsuffix tree of ann×nmatrixA—this data structure is the two-dimensional analog of the suffix tree of a string; (b) simple algorithms implementing primitive operations forLZ1-typeon-line losslessimage compression m…

Statistics and ProbabilityCompressed suffix arrayNumerical AnalysisControl and OptimizationAlgebra and Number TheoryTheoretical computer scienceApplied MathematicsGeneral MathematicsSuffix treeString (computer science)Generalized suffix treelaw.inventionLongest common substring problemTree (data structure)lawSuffixAlgorithmFM-indexMathematicsJournal of Complexity
researchProduct

Boosting Textual Compression in Optimal Linear Time

2005

We provide a general boosting technique for Textual Data Compression. Qualitatively, it takes a good compression algorithm and turns it into an algorithm with a better compression performance guarantee. It displays the following remarkable properties: (a) it can turn any memoryless compressor into a compression algorithm that uses the “best possible” contexts; (b) it is very simple and optimal in terms of time; and (c) it admits a decompression algorithm again optimal in time. To the best of our knowledge, this is the first boosting technique displaying these properties.Technically, our boosting technique builds upon three main ingredients: the Burrows--Wheeler Transform, the Suffix Tree d…

Theoretical computer scienceBurrows–Wheeler transformSuffix treeString (computer science)Data_CODINGANDINFORMATIONTHEORYBurrows-Wheeler transformSubstringArithmetic codinglaw.inventionLempel-Ziv compressorsArtificial IntelligenceHardware and ArchitectureControl and Systems Engineeringlawtext compressionempirical entropyArithmetic codingGreedy algorithmTime complexityAlgorithmSoftwareInformation SystemsMathematicsData compression
researchProduct

Lightweight LCP construction for next-generation sequencing datasets

2012

The advent of "next-generation" DNA sequencing (NGS) technologies has meant that collections of hundreds of millions of DNA sequences are now commonplace in bioinformatics. Knowing the longest common prefix array (LCP) of such a collection would facilitate the rapid computation of maximal exact matches, shortest unique substrings and shortest absent words. CPU-efficient algorithms for computing the LCP of a string have been described in the literature, but require the presence in RAM of large data structures. This prevents such methods from being feasible for NGS datasets. In this paper we propose the first lightweight method that simultaneously computes, via sequential scans, the LCP and B…

Whole genome sequencingGenomics (q-bio.GN)FOS: Computer and information sciencesSequenceBWT; LCP; next-generation sequencing datasetsBWT LCP text indexes next-generation sequencing datasets massive datasetsSettore INF/01 - InformaticaComputer scienceComputationString (computer science)LCP arrayParallel computingData structureDNA sequencingSubstringBWTLCPFOS: Biological sciencesComputer Science - Data Structures and AlgorithmsQuantitative Biology - GenomicsData Structures and Algorithms (cs.DS)next-generation sequencing datasets
researchProduct

Normal, Abby Normal, Prefix Normal

2014

A prefix normal word is a binary word with the property that no substring has more 1s than the prefix of the same length. This class of words is important in the context of binary jumbled pattern matching. In this paper we present results about the number \(\textit{pnw}(n)\) of prefix normal words of length n, showing that \(\textit{pnw}(n) =\Omega\left(2^{n - c\sqrt{n\ln n}}\right)\) for some c and \(\textit{pnw}(n) = O \left(\frac{2^n (\ln n)^2}{n}\right)\). We introduce efficient algorithms for testing the prefix normal property and a “mechanical algorithm” for computing prefix normal forms. We also include games which can be played with prefix normal words. In these games Alice wishes t…

binary jumbled pattern matchingEfficient algorithmmembership testBinary numberContext (language use)Prefix Normal Word AlgorithmData_CODINGANDINFORMATIONTHEORYprefix normal wordsOmegaSubstringenumerationCombinatoricsPrefixprefix normal words; binary jumbled pattern matching; normal forms; enumeration; membership test; binary languagesEnumerationnormal formsbinary languagesWord (group theory)Mathematics
researchProduct

A New Feature Selection Methodology for K-mers Representation of DNA Sequences

2015

DNA sequence decomposition into k-mers and their frequency counting, defines a mapping of a sequence into a numerical space by a numerical feature vector of fixed length. This simple process allows to compare sequences in an alignment free way, using common similarities and distance functions on the numerical codomain of the mapping. The most common used decomposition uses all the substrings of a fixed length k making the codomain of exponential dimension. This obviously can affect the time complexity of the similarity computation, and in general of the machine learning algorithm used for the purpose of sequence analysis. Moreover, the presence of possible noisy features can also affect the…

k-mers DNA sequence similarity feature selection DNA sequence classification.Settore INF/01 - InformaticaComputer scienceSequence analysisbusiness.industryFeature vectorPattern recognitionFeature selectionDNA sequencingSubstringExponential functionArtificial intelligencebusinessAlgorithmTime complexity
researchProduct