6533b854fe1ef96bd12ae8fc

RESEARCH PRODUCT

Discovering unbounded unions of regular pattern languages from positive examples

Jaak ViloAlvis BrazmaEsko Ukkonen

subject

0303 health sciencesComputer scienceString (computer science)0102 computer and information sciences01 natural sciencesSubstringCombinatoricsSet (abstract data type)03 medical and health sciencesVariable (computer science)Cover (topology)010201 computation theory & mathematicsSimple (abstract algebra)Minimum description length030304 developmental biology

description

The problem of learning unions of certain pattern languages from positive examples is considered. We restrict to the regular patterns, i.e., patterns where each variable symbol can appear only once, and to the substring patterns, which is a subclass of regular patterns of the type xαy, where x and y are variables and α is a string of constant symbols. We present an algorithm that, given a set of strings, finds a good collection of patterns covering this set. The notion of a ‘good covering’ is defined as the most probable collection of patterns likely to be present in the examples, assuming a simple probabilistic model, or equivalently using the Minimum Description Length (MDL) principle. Our algorithm is shown to approximate the optimal cover within a logarithmic factor. This extends a similar recent result for the so-called simple patterns. For substring patterns the running time of the algorithm is O(nN), where n is the number and N the total lenght of the sequences.

https://doi.org/10.1007/bfb0009485