Suffix tree implementation
The suffix tree is a data structure used for various string manipulation tasks like pattern matching, substring search, and more. It's particularly efficient for solving these problems with linear time complexity. The goal of this article is to explain the implementation of a suffix tree using Java,c#,Python etc, along with a step-by-step breakdown of the provided code, a suitable example, and an analysis of its time complexity.
The problem revolves around building a suffix tree for a given input string. A suffix tree is a trie-like data structure that represents all the suffixes of a given string in a way that allows for efficient substring searches and pattern matching.
Let's consider two examples:
Input string: "coconut" The suffix tree would look like:
┐ ├─┐ co │ ├─⤑ conut │ └─⤑ nut ├─┐ o │ ├─⤑ conut │ └─⤑ nut ├─⤑ nut ├─⤑ ut └─⤑ t
Input string: "BAABAIAIIBI" The suffix tree would look like:
┐ ├─┐ B │ ├─┐ A │ │ ├─⤑ ABAIAIIBI │ │ └─⤑ IAIIBI │ └─⤑ I ├─┐ A │ ├─⤑ ABAIAIIBI │ ├─⤑ BAIAIIBI │ └─┐ I │ ├─⤑ AIIBI │ └─⤑ IBI └─┐ I ├─⤑ AIIBI ├─⤑ IBI └─⤑ BI
Idea to Solve the Problem
The idea behind constructing a suffix tree involves iteratively adding suffixes of the input string to the tree. This is done by considering each character in the string as the starting point for a new suffix. As each suffix is added, the tree is updated to accommodate the new suffix while sharing common prefixes between suffixes.
class TreeNode sub // substring label of the edge leading to this node child // list of child nodes' indices in the nodes array class SuffixTree nodes // list of tree nodes constructor(data) Create an empty list 'nodes' Create and add the root node to 'nodes' Build the suffix tree using the buildTree(data) method buildTree(str) for i from 0 to length of str - 1 addSuffix(str.substring(i)) addSuffix(suf) n = 0 // index of root node i = 0 // index for iterating over suf while i < length of suf b = character at index i in suf children = child nodes of node at index n process = true for each child c in children if first character of sublabel of node c is b set process to false n = c break if process is true create a new node temp with sub label suf[i:] add temp to nodes add index of temp to the child list of node n return sub2 = sub label of node n j = 0 while j < length of sub2 and i+j < length of suf and suf[i+j] == sub2[j] increment j if j < length of sub2 create a new node temp with sub label sub2[0:j] add n to the child list of temp update sub label of node n to sub2[j:] update child at index x2 in the child list of node n to the index of temp i += j n = index of temp printData(n, prefix) children = child nodes of node at index n if children is empty print "⤑ " followed by sub label of node n return print "┐ " followed by sub label of node n for each child c in children except the last one print prefix + "├─" recursively call printData(c, prefix + "│ ") print prefix + "└─" recursively call printData(last child, prefix + " ") visualize() if nodes is empty print "Empty Tree" return call printData(0, "") main Create a suffix tree tree1 for the input "coconut" Create a suffix tree tree2 for the input "BAABAIAIIBI" Visualize tree1 Visualize tree2
- The TreeNode class represents a node in the suffix tree. Each node has a sub label and a list of indices pointing to its child nodes.
- The SuffixTree class encapsulates the construction and visualization of the suffix tree. It initializes with an empty list of nodes and adds the root node. The buildTree method iterates over each character in the input string and calls the addSuffix method to add the corresponding suffix to the tree.
- The addSuffix method takes a suffix (suf) and adds it to the tree. It traverses the tree, comparing characters of the suffix with the first character of sub labels of child nodes. It handles cases where a node needs to split to accommodate the new suffix.
- The printData method recursively prints the tree structure, indicating nodes and their relationships using ASCII characters.
- The visualize method prints the overall tree structure by calling the printData method starting from the root node.
- In the main method, two suffix trees are created for different input strings, and their structures are visualized.
1) Suffix tree implementation in java
2) Suffix tree implementation in c++
3) Suffix tree implementation in vb.net
4) Suffix tree implementation in c#
5) Suffix tree implementation in python
6) Suffix tree implementation in ruby
7) Suffix tree implementation in scala
The time complexity of constructing a suffix tree using this approach is generally considered to be O(n^2), where n is the length of the input string. This is because in the worst case, each new suffix added might require a traversal of the existing tree, leading to a quadratic growth in the number of nodes and edges.