Posted on by Kalkicode
Code Tree

Suffix tree implementation

The suffix tree is a data structure used for various string manipulation tasks like pattern matching, substring search, and more. It's particularly efficient for solving these problems with linear time complexity. The goal of this article is to explain the implementation of a suffix tree using Java,c#,Python etc, along with a step-by-step breakdown of the provided code, a suitable example, and an analysis of its time complexity.

Problem Statement

The problem revolves around building a suffix tree for a given input string. A suffix tree is a trie-like data structure that represents all the suffixes of a given string in a way that allows for efficient substring searches and pattern matching.

Example

Let's consider two examples:

  1. Input string: "coconut" The suffix tree would look like:

    ┐
    ├─┐ co
    │ ├─⤑ conut
    │ └─⤑ nut
    ├─┐ o
    │ ├─⤑ conut
    │ └─⤑ nut
    ├─⤑ nut
    ├─⤑ ut
    └─⤑ t
  2. Input string: "BAABAIAIIBI" The suffix tree would look like:

    ┐
    ├─┐ B
    │ ├─┐ A
    │ │ ├─⤑ ABAIAIIBI
    │ │ └─⤑ IAIIBI
    │ └─⤑ I
    ├─┐ A
    │ ├─⤑ ABAIAIIBI
    │ ├─⤑ BAIAIIBI
    │ └─┐ I
    │   ├─⤑ AIIBI
    │   └─⤑ IBI
    └─┐ I
      ├─⤑ AIIBI
      ├─⤑ IBI
      └─⤑ BI

Idea to Solve the Problem

The idea behind constructing a suffix tree involves iteratively adding suffixes of the input string to the tree. This is done by considering each character in the string as the starting point for a new suffix. As each suffix is added, the tree is updated to accommodate the new suffix while sharing common prefixes between suffixes.

Pseudocode

class TreeNode
    sub // substring label of the edge leading to this node
    child // list of child nodes' indices in the nodes array

class SuffixTree
    nodes // list of tree nodes

    constructor(data)
        Create an empty list 'nodes'
        Create and add the root node to 'nodes'
        Build the suffix tree using the buildTree(data) method

    buildTree(str)
        for i from 0 to length of str - 1
            addSuffix(str.substring(i))

    addSuffix(suf)
        n = 0 // index of root node
        i = 0 // index for iterating over suf
        while i < length of suf
            b = character at index i in suf
            children = child nodes of node at index n
            process = true
            for each child c in children
                if first character of sublabel of node c is b
                    set process to false
                    n = c
                    break
            if process is true
                create a new node temp with sub label suf[i:]
                add temp to nodes
                add index of temp to the child list of node n
                return
            sub2 = sub label of node n
            j = 0
            while j < length of sub2 and i+j < length of suf and suf[i+j] == sub2[j]
                increment j
            if j < length of sub2
                create a new node temp with sub label sub2[0:j]
                add n to the child list of temp
                update sub label of node n to sub2[j:]
                update child at index x2 in the child list of node n to the index of temp
            i += j
            n = index of temp

    printData(n, prefix)
        children = child nodes of node at index n
        if children is empty
            print "⤑ " followed by sub label of node n
            return
        print "┐ " followed by sub label of node n
        for each child c in children except the last one
            print prefix + "├─"
            recursively call printData(c, prefix + "│ ")
        print prefix + "└─"
        recursively call printData(last child, prefix + "  ")

    visualize()
        if nodes is empty
            print "Empty Tree"
            return
        call printData(0, "")

    main
        Create a suffix tree tree1 for the input "coconut"
        Create a suffix tree tree2 for the input "BAABAIAIIBI"
        Visualize tree1
        Visualize tree2

Algorithm Explanation

  1. The TreeNode class represents a node in the suffix tree. Each node has a sub label and a list of indices pointing to its child nodes.
  2. The SuffixTree class encapsulates the construction and visualization of the suffix tree. It initializes with an empty list of nodes and adds the root node. The buildTree method iterates over each character in the input string and calls the addSuffix method to add the corresponding suffix to the tree.
  3. The addSuffix method takes a suffix (suf) and adds it to the tree. It traverses the tree, comparing characters of the suffix with the first character of sub labels of child nodes. It handles cases where a node needs to split to accommodate the new suffix.
  4. The printData method recursively prints the tree structure, indicating nodes and their relationships using ASCII characters.
  5. The visualize method prints the overall tree structure by calling the printData method starting from the root node.
  6. In the main method, two suffix trees are created for different input strings, and their structures are visualized.

Code Solution

Time Complexity

The time complexity of constructing a suffix tree using this approach is generally considered to be O(n^2), where n is the length of the input string. This is because in the worst case, each new suffix added might require a traversal of the existing tree, leading to a quadratic growth in the number of nodes and edges.

Comment

Please share your knowledge to improve code and content standard. Also submit your doubts, and test case. We improve by your feedback. We will try to resolve your query as soon as possible.

New Comment