This post walks through an example of linear-time suffix tree construction to ease the understanding and implementation of Ukkonen (1995). I assume that you have the paper at hand, so I can refer to it and don't need to introduce all the terminology and every detail of the full algorithm from scratch.
When I decided to implement the algorithm, I first read Gusfield's (1997) chapter on the linear-time construction of suffix trees, because I saw it cited a lot and thought I could cut a corner using an easier description instead of the original paper. However, I found the chapter rather difficult to follow: for didactic reasons, it starts with a cubic-time version of the algorithm (although the naive approach is quadratic), it introduces some verbose and unintuitive terminology, and illustrates steps with confusing tree diagrams. After that I read Ukkonen's original paper and found it much clearer.1 With a few variable definitions scribbled on the margins of my copy, it was straightforward to implement a basic working version in Python.
The algorithm moves over each position i in string T from left to right and constructs the (implicit) suffix tree for substring T[..i] from the previously constructed tree for T[..i-1]. Implicit because the tree may contain suffixes that do not end in a leaf node if the final character is not unique. While doing so, it maintains a canonical reference pair to the state which represents the longest suffix from the previous step that occured at least twice, the active point. The canonical reference pair of an explicit state consists of the state itself and the empty string. The canonical reference pair of an implicit state consists of its closest explicit ancestor and the string spelled out by the transition between this explicit ancestor state and the implicit state. In each step i, we walk the boundary path from the active point and add ti-transitions until we reach the end point, a state which already has a ti-transition, where ti is the character at index i. Transitions on leaf nodes are added automatically by using a variable to keep track of the end index.
Constructing the tree
We will construct the suffix tree for the string abaabcab#. For each step you'll find a visualization of the intermediate tree structure and a dry, repetitive description of what is happening. Circles represent states, arrows transitions. The state numbers indicate the order in which the states were created. Each edge is labeled with the transition's start and end indices (one-based) and the corresponding characters. Blue arrows represent suffix links.
Before the first step, we initialize our tree with an auxiliary state '⊥' and the root 'R'. The auxiliary state has transitions to the root for each character of the alphabet. The root has a suffix link to the auxiliary state. The auxiliary state is there to simplify the algorithm (since it has transitions for each character, it is the ultimate end point). In a real-world implementation, we would probably get rid of it.
Our active point is (R, ε) and we look at the first character a. We find that the root does not have an a-transition (very last
else block in the
test_and_split routine), so we add an a-transition to the newly created leaf node 1 (within the main loop of the
update routine). We follow the suffix link from the root node, which brings us to the auxiliary state. This state has all possible transitions, so it also has an a-transition and we're done. We didn't split any node, so no suffix link needs to be set (at the end of the
According to lemma 2, following the a-transition from our previous end point, the auxiliary state, gives us the new active point with the canonic reference pair (R, ε). The next steps are exactly the same as before, only for character b. The transition to our first state grows automatically since we use a variable tracking the length of the processed string as the second index on transitions leading up to leafs.
Our existing two transitions again grow automatically. We notice that our active point already has an a-transition (
test_and_split routine), so we don't need to create a new state. Active point and end point are the same.
Based on our previous end point, the new active point is (R, a) and we're looking at another a. The transition we are on continues with b, so we need to split it at this point (we jump into the first
if block and then the nested
test_and_split). We create state 3. We insert state 3 between the root and state 1 (adjusting the start index of state 1). Now we add an a-transition from state 3 to the second newly created state 4. We follow the suffix link to the auxiliary state and canonize it ('consuming' one symbol), which brings us back to the root. We find that there already is an a-transition from the root, so we reached the end point. Now we suffix-link our newly created internal state 3 to the end point, the root (final step of the update routine).
The new active point is (3, ε). We find a b-transition from our active point, so again active point and end point are the same.
The new active point is (3, b). The transition we are on does not continue with the current character c, so we need to split again, as in step 4, creating states 5 and 6. We keep a reference to the new internal state 5 (to set its suffix link later) and follow the suffix link of the active point to the root, where we find the need to split the existing transition baab after the first b to add the suffix bc. After creating internal state 7, we can set the suffix link on state 5. We follow the suffix link from the active point (root) to the auxiliary state. Canonization 'consumes' one symbol and brings us back to the root, where we add the missing c-transition to the new state 9. State 7 was the last internal state created and gets a suffix link to the last canonized suffix link we followed, root. The final suffix link we follow brings us to the auxiliary state, which is the end point.
The active point is (R, ε) and we are looking at character a, for which there is already a transition. Active point and end point are the same.
The canonized reference pair to the active point is (3, ε) and the current character is b, for which state 3 has a transition. Active point and endpoint are the same.
The canonized reference pair to the active point is (5, ε) and the current character is the string terminator $. Because this character is unique, we have to follow the suffix links from the active point and create new states at each state on the path until we arrive at the auxiliary state, which is the end point.
We're done, finally. The unique terminating character ensures that there is an explicit leaf for each suffix.
1 Maybe I'm doing Gusfield's description injustice and the original paper only seemed clearer because I read it afterwards.
Gusfield, Dan (1997). Algorithms on strings, trees, and sequences: computer science and computational biology. Cambridge University Press.
Ukkonen, E. (1995). On-line construction of suffix trees. Algorithmica, 14(3), 249-260.