Suffix tree implementation
The suffix tree is a data structure used for various string manipulation tasks like pattern matching, substring search, and more. It's particularly efficient for solving these problems with linear time complexity. The goal of this article is to explain the implementation of a suffix tree using Java,c#,Python etc, along with a step-by-step breakdown of the provided code, a suitable example, and an analysis of its time complexity.
Problem Statement
The problem revolves around building a suffix tree for a given input string. A suffix tree is a trie-like data structure that represents all the suffixes of a given string in a way that allows for efficient substring searches and pattern matching.
Example
Let's consider two examples:
Input string: "coconut" The suffix tree would look like:
┐ ├─┐ co │ ├─⤑ conut │ └─⤑ nut ├─┐ o │ ├─⤑ conut │ └─⤑ nut ├─⤑ nut ├─⤑ ut └─⤑ t
Input string: "BAABAIAIIBI" The suffix tree would look like:
┐ ├─┐ B │ ├─┐ A │ │ ├─⤑ ABAIAIIBI │ │ └─⤑ IAIIBI │ └─⤑ I ├─┐ A │ ├─⤑ ABAIAIIBI │ ├─⤑ BAIAIIBI │ └─┐ I │ ├─⤑ AIIBI │ └─⤑ IBI └─┐ I ├─⤑ AIIBI ├─⤑ IBI └─⤑ BI
Idea to Solve the Problem
The idea behind constructing a suffix tree involves iteratively adding suffixes of the input string to the tree. This is done by considering each character in the string as the starting point for a new suffix. As each suffix is added, the tree is updated to accommodate the new suffix while sharing common prefixes between suffixes.
Pseudocode
class TreeNode
sub // substring label of the edge leading to this node
child // list of child nodes' indices in the nodes array
class SuffixTree
nodes // list of tree nodes
constructor(data)
Create an empty list 'nodes'
Create and add the root node to 'nodes'
Build the suffix tree using the buildTree(data) method
buildTree(str)
for i from 0 to length of str - 1
addSuffix(str.substring(i))
addSuffix(suf)
n = 0 // index of root node
i = 0 // index for iterating over suf
while i < length of suf
b = character at index i in suf
children = child nodes of node at index n
process = true
for each child c in children
if first character of sublabel of node c is b
set process to false
n = c
break
if process is true
create a new node temp with sub label suf[i:]
add temp to nodes
add index of temp to the child list of node n
return
sub2 = sub label of node n
j = 0
while j < length of sub2 and i+j < length of suf and suf[i+j] == sub2[j]
increment j
if j < length of sub2
create a new node temp with sub label sub2[0:j]
add n to the child list of temp
update sub label of node n to sub2[j:]
update child at index x2 in the child list of node n to the index of temp
i += j
n = index of temp
printData(n, prefix)
children = child nodes of node at index n
if children is empty
print "⤑ " followed by sub label of node n
return
print "┐ " followed by sub label of node n
for each child c in children except the last one
print prefix + "├─"
recursively call printData(c, prefix + "│ ")
print prefix + "└─"
recursively call printData(last child, prefix + " ")
visualize()
if nodes is empty
print "Empty Tree"
return
call printData(0, "")
main
Create a suffix tree tree1 for the input "coconut"
Create a suffix tree tree2 for the input "BAABAIAIIBI"
Visualize tree1
Visualize tree2
Algorithm Explanation
- The TreeNode class represents a node in the suffix tree. Each node has a sub label and a list of indices pointing to its child nodes.
- The SuffixTree class encapsulates the construction and visualization of the suffix tree. It initializes with an empty list of nodes and adds the root node. The buildTree method iterates over each character in the input string and calls the addSuffix method to add the corresponding suffix to the tree.
- The addSuffix method takes a suffix (suf) and adds it to the tree. It traverses the tree, comparing characters of the suffix with the first character of sub labels of child nodes. It handles cases where a node needs to split to accommodate the new suffix.
- The printData method recursively prints the tree structure, indicating nodes and their relationships using ASCII characters.
- The visualize method prints the overall tree structure by calling the printData method starting from the root node.
- In the main method, two suffix trees are created for different input strings, and their structures are visualized.
Code Solution
-
1) Suffix tree implementation in java
2) Suffix tree implementation in c++
3) Suffix tree implementation in vb.net
4) Suffix tree implementation in c#
5) Suffix tree implementation in python
6) Suffix tree implementation in ruby
7) Suffix tree implementation in scala
Time Complexity
The time complexity of constructing a suffix tree using this approach is generally considered to be O(n^2), where n is the length of the input string. This is because in the worst case, each new suffix added might require a traversal of the existing tree, leading to a quadratic growth in the number of nodes and edges.
Please share your knowledge to improve code and content standard. Also submit your doubts, and test case. We improve by your feedback. We will try to resolve your query as soon as possible.
New Comment