Linear-size suffix tries

Roberto Grossi, Maxime Crochemore, Chiara Epifanio, Filippo Mignosi

subject

Compressed suffix array; General Computer Science; Suffix tree; [INFO.INFO-DS] Computer Science [cs]/Data Structures and Algorithms [cs.DS]; Generalized suffix tree; 0102 computer and information sciences; 02 engineering and technology; Data_CODINGANDINFORMATIONTHEORY; Text indexing; 01 natural sciences; Y-fast trie; Longest common substring problem; Theoretical Computer Science; Combinatorics; Factor and suffix automata; 0202 electrical engineering, electronic engineering, information engineering; Data_FILES; Arithmetic; Pattern matching; Computer Science (all); Mathematics; Settore INF/01 - Informatica; X-fast trie; LCP array; 010201 computation theory & mathematics; 020201 artificial intelligence & image processing; FM-index

description

Suffix trees are highly regarded data structures for text indexing and string algorithms [McCreight 76, Weiner 73]. For any given string $w$ of length $n = |w|$, a suffix tree for $w$ takes $O(n)$ nodes and links. It is often presented as a compacted version of a suffix trie for $w$, where the latter is the trie (or digital search tree) built on the suffixes of $w$. Here the compaction process replaces each maximal chain of unary nodes with a single arc. For this, the suffix tree requires that the labels of its arcs be substrings encoded as pointers to $w$ (or equivalent information). By contrast, the arcs of the suffix trie are labeled by single symbols, but there can be $\Theta(n^2)$ nodes and links for suffix tries in the worst case because of their unary nodes. It is an interesting question whether the suffix trie can be stored using $O(n)$ nodes. We present the linear-size suffix trie, which guarantees $O(n)$ nodes. We use a new technique for reducing the number of unary nodes to $O(n)$, which stems from some results on antidictionaries. For instance, by using the linear-size suffix trie, we are able to check whether a pattern $p$ of length $m = |p|$ occurs in $w$ in $O(m \log |\Sigma|)$ time, and we can find the longest common substring of two strings $w_1$ and $w_2$ in $O((|w_1| + |w_2|) \log |\Sigma|)$ time for an alphabet $\Sigma$.
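
To make the size gap concrete, the following minimal Python sketch builds the plain (uncompacted) suffix trie described above, counts its nodes, and checks pattern occurrence by walking one arc per symbol. It is an illustration only and not the linear-size suffix trie of the paper: the names Node, build_suffix_trie, count_nodes and occurs are hypothetical, and children are stored in a hash map, so each step costs expected constant time rather than the log |Σ| factor of an ordered branching structure.

```python
from typing import Dict, Tuple


class Node:
    """Trie node; each outgoing arc is labeled by a single symbol."""
    __slots__ = ("children",)

    def __init__(self) -> None:
        self.children: Dict[str, "Node"] = {}


def build_suffix_trie(w: str) -> Node:
    """Insert every suffix of w symbol by symbol (naive, quadratic time)."""
    root = Node()
    for i in range(len(w)):
        node = root
        for c in w[i:]:
            child = node.children.get(c)
            if child is None:
                child = Node()
                node.children[c] = child
            node = child
    return root


def count_nodes(root: Node) -> Tuple[int, int]:
    """Return (total nodes, unary nodes) using an explicit stack."""
    total = unary = 0
    stack = [root]
    while stack:
        node = stack.pop()
        total += 1
        if len(node.children) == 1:
            unary += 1
        stack.extend(node.children.values())
    return total, unary


def occurs(root: Node, p: str) -> bool:
    """True if p is a substring of the indexed string: follow one arc per symbol."""
    node = root
    for c in p:
        node = node.children.get(c)
        if node is None:
            return False
    return True


if __name__ == "__main__":
    k = 50
    w = "a" * k + "b" * k  # a string family with quadratically many trie nodes
    total, unary = count_nodes(build_suffix_trie(w))
    print(len(w), total, unary)
```

Each trie node corresponds to one distinct substring of w (the root being the empty string), so for w = a^k b^k the trie has (k + 1)^2 nodes, most of them unary. These chains of unary nodes are what the suffix tree compacts and what the linear-size suffix trie reduces to a linear number.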

year

2016-01-01

10.1016/j.tcs.2016.04.002 https://kclpure.kcl.ac.uk/en/publications/5cfcb891-f458-4e27-a5fe-50f786f6b043