
On-Line Construction of Suffix Trees

This document presents an online algorithm for constructing suffix trees from a given string in linear time. The algorithm processes the string symbol by symbol from left to right, maintaining the suffix tree for the scanned portion of the string at each step. It is based on the observation that the suffixes of the string after scanning a new symbol can be obtained by appending that symbol to each suffix of the previously scanned portion and adding an empty suffix. The algorithm operates by traversing a "boundary path" in the current suffix tree and adding new states and transitions as needed when a new symbol is scanned. This results in a linear-time algorithm, in contrast to a simpler suffix trie construction algorithm that has quadratic worst-case time complexity.


(To appear in ALGORITHMICA)

On-line construction of suffix trees


Esko Ukkonen
Department of Computer Science, University of Helsinki, P. O. Box 26 (Teollisuuskatu 23), FIN-00014 University of Helsinki, Finland
Tel.: +358-0-7084172, fax: +358-0-7084441
Email: [email protected]

Abstract. An on-line algorithm is presented for constructing the suffix tree for a given string in time linear in the length of the string. The new algorithm has the desirable property of processing the string symbol by symbol from left to right. It always has the suffix tree for the scanned part of the string ready. The method is developed as a linear-time version of a very simple algorithm for (quadratic size) suffix tries. Despite its quadratic worst case, this latter algorithm can be a good practical method when the string is not too long. Another variation of this method is shown to give in a natural way the well-known algorithms for constructing suffix automata (DAWGs).

Key Words. Linear-time algorithm, suffix tree, suffix trie, suffix automaton, DAWG.

Research supported by the Academy of Finland and by the Alexander von Humboldt Foundation (Germany).

1. INTRODUCTION

A suffix tree is a trie-like data structure representing all suffixes of a string. Such trees have a central role in many algorithms on strings, see e.g. [3, 7, 2]. It is quite commonly felt, however, that the linear-time suffix tree algorithms presented in the literature are rather difficult to grasp. The main purpose of this paper is to develop an understandable suffix tree construction based on a natural idea that seems to complete our picture of suffix trees in an essential way. The new algorithm has the important property of being on-line. It processes the string symbol by symbol from left to right, and always has the suffix tree for the scanned part of the string ready. The algorithm is based on the simple observation that the suffixes of a string $T^i = t_1 \cdots t_i$ can be obtained from the suffixes of string $T^{i-1} = t_1 \cdots t_{i-1}$ by catenating symbol $t_i$ at the end of each suffix of $T^{i-1}$ and by adding the empty suffix. The suffixes of the whole string $T = T^n = t_1 t_2 \cdots t_n$ can be obtained by first expanding the suffixes of $T^0$ into the suffixes of $T^1$ and so on, until the suffixes of $T$ are obtained from the suffixes of $T^{n-1}$.

This is in contrast with the method by Weiner [13] that proceeds right-to-left and adds the suffixes to the tree in increasing order of their length, starting from the shortest suffix, and with the method by McCreight [9] that adds the suffixes to the tree in decreasing order of their length. It should be noted, however, that despite the clear difference in the intuitive view on the problem, our algorithm and McCreight's algorithm are in their final form functionally rather closely related.

Our algorithm is best understood as a linear-time version of another algorithm from [12] for (quadratic-size) suffix tries. The latter very elementary algorithm, which resembles the position tree algorithm in [8], is given in Section 2. Unfortunately, it does not run in linear time: it takes time proportional to the size of the suffix trie, which can be quadratic. However, a rather transparent modification, which we describe in Section 4, gives our on-line, linear-time method for suffix trees. This also offers a natural perspective which makes the linear-time suffix tree construction understandable.

We also point out in Section 5 that the suffix trie augmented with the suffix links gives an elementary characterization of the suffix automata (also known as directed acyclic word graphs or DAWGs). This immediately leads to an algorithm for constructing such automata. Fortunately, the resulting method is essentially the same as already given in [4–6]. Again it is felt that our new perspective is very natural and helps in understanding the suffix automata constructions.

2. CONSTRUCTING SUFFIX TRIES

Let $T = t_1 t_2 \cdots t_n$ be a string over an alphabet $\Sigma$. Each string $x$ such that $T = uxv$ for some (possibly empty) strings $u$ and $v$ is a substring of $T$, and each string $T_i = t_i \cdots t_n$ where $1 \le i \le n + 1$ is a suffix of $T$; in particular, $T_{n+1} = \epsilon$ is the empty suffix. The set of all suffixes of $T$ is denoted $\sigma(T)$. The suffix trie of $T$ is a trie representing $\sigma(T)$.

More formally, we denote the suffix trie of $T$ as $STrie(T) = (Q \cup \{\bot\}, root, F, g, f)$ and define such a trie as an augmented deterministic finite-state automaton which has a tree-shaped transition graph representing the trie for $\sigma(T)$ and which is augmented with the so-called suffix function $f$ and auxiliary state $\bot$. The set $Q$ of the states of $STrie(T)$ can be put in a one-to-one correspondence with the substrings of $T$. We denote by $\bar{x}$ the state that corresponds to a substring $x$.

The initial state $root$ corresponds to the empty string $\epsilon$, and the set $F$ of the final states corresponds to $\sigma(T)$. The transition function $g$ is defined as $g(\bar{x}, a) = \bar{y}$ for all $\bar{x}, \bar{y}$ in $Q$ such that $y = xa$, where $a \in \Sigma$.

The suffix function $f$ is defined for each state $\bar{x} \in Q$ as follows. Let $\bar{x} \neq root$. Then $x = ay$ for some $a \in \Sigma$, and we set $f(\bar{x}) = \bar{y}$. Moreover, $f(root) = \bot$.

Auxiliary state $\bot$ allows us to write the algorithms in the sequel such that an explicit distinction between the empty and the nonempty suffixes (or, between $root$ and the other states) can be avoided. State $\bot$ is connected to the trie by $g(\bot, a) = root$ for every $a \in \Sigma$. We leave $f(\bot)$ undefined. (Note that the transitions from $\bot$ to $root$ are defined consistently with the other transitions: State $\bot$ corresponds to the inverse $a^{-1}$ of all symbols $a \in \Sigma$. Because $a^{-1}a = \epsilon$, we can set $g(\bot, a) = root$ as $root$ corresponds to $\epsilon$.)

Following [9] we call $f(r)$ the suffix link of state $r$. The suffix links will be utilized during the construction of a suffix tree; they also have many uses in the applications (e.g. [11, 12]).

Automaton $STrie(T)$ is identical to the Aho-Corasick string matching automaton [1] for the keyword set $\{T_i \mid 1 \le i \le n + 1\}$ (the suffix links are called the failure transitions in [1]).
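As an illustration only, the definitions of $Q$, $g$ and $f$ can be written out by brute force for a short string. The following Python sketch is not the construction developed in this paper; the function name `suffix_trie_by_definition` is hypothetical, states are identified with the substrings they represent, and the auxiliary state $\bot$ is modelled as `None` (its transitions $g(\bot, a) = root$ are left out).

```python
def suffix_trie_by_definition(T):
    """Return (Q, g, f) for STrie(T), taken directly from the definitions."""
    n = len(T)
    # Q: one state per distinct substring of T; the empty string "" is root.
    Q = {T[i:j] for i in range(n + 1) for j in range(i, n + 1)}
    # g(x, a) = xa for every substring of the form xa.
    g = {(x[:-1], x[-1]): x for x in Q if x}
    # f(x) = y where x = ay; f(root) is the auxiliary state, written None here.
    f = {x: (x[1:] if x else None) for x in Q}
    return Q, g, f


Q, g, f = suffix_trie_by_definition("cacao")

# Following suffix links from the deepest state visits every suffix of T,
# which is exactly the set F of final states.
state, path = "cacao", []
while state is not None:
    path.append(state)
    state = f[state]
print(len(Q), path)
```

For `"cacao"` the path of suffix links runs through `cacao`, `acao`, `cao`, `ao`, `o` and the empty string before reaching the auxiliary state.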

Fig. 1. Construction of $STrie(cacao)$: state transitions shown in bold arrows, failure transitions in thin arrows. Note: Only the last two layers of suffix links shown explicitly.

It is easy to construct $STrie(T)$ on-line, in a left-to-right scan over $T$, as follows. Let $T^i$ denote the prefix $t_1 \cdots t_i$ of $T$ for $0 \le i \le n$. As intermediate results the construction gives $STrie(T^i)$ for $i = 0, 1, \ldots, n$. Fig. 1 shows the different phases of constructing $STrie(T)$ for $T = cacao$.

The key observation explaining how $STrie(T^i)$ is obtained from $STrie(T^{i-1})$ is that the suffixes of $T^i$ can be obtained by catenating $t_i$ to the end of each suffix of $T^{i-1}$ and by adding an empty suffix. That is,

$$\sigma(T^i) = \sigma(T^{i-1})\, t_i \cup \{\epsilon\}.$$

By definition, $STrie(T^{i-1})$ accepts $\sigma(T^{i-1})$. To make it accept $\sigma(T^i)$, we must examine the final state set $F_{i-1}$ of $STrie(T^{i-1})$. If $r \in F_{i-1}$ does not already have a $t_i$-transition, such a transition from $r$ to a new state (which becomes a new leaf of the trie) is added. The states to which there is an old or new $t_i$-transition from some state in $F_{i-1}$ constitute, together with $root$, the final states $F_i$ of $STrie(T^i)$.

The states $r \in F_{i-1}$ that get new transitions can be found using the suffix links as follows. The definition of the suffix function implies that $r \in F_{i-1}$ if and only if $r = f^j(\overline{t_1 \cdots t_{i-1}})$ for some $0 \le j \le i - 1$. Therefore all states in $F_{i-1}$ are on the path of suffix links that starts from the deepest state $\overline{t_1 \cdots t_{i-1}}$ of $STrie(T^{i-1})$ and ends at $\bot$. We call this important path the boundary path of $STrie(T^{i-1})$.

The boundary path is traversed. If a state $\bar{z}$ on the boundary path does not have a transition on $t_i$ yet, a new state $\overline{zt_i}$ and a new transition $g(\bar{z}, t_i) = \overline{zt_i}$ are added. This gives updated $g$. To get updated $f$, the new states $\overline{zt_i}$ are linked together with new suffix links that form a path starting from state $\overline{t_1 \cdots t_i}$. Obviously, this is the boundary path of $STrie(T^i)$.

The traversal over $F_{i-1}$ along the boundary path can be stopped immediately when the first state $\bar{z}$ is found such that state $\overline{zt_i}$ (and hence also transition $g(\bar{z}, t_i) = \overline{zt_i}$) already exists. Namely, let $\overline{zt_i}$ already be a state. Then $STrie(T^{i-1})$ has to contain state $\overline{z't_i}$ and transition $g(\bar{z}', t_i) = \overline{z't_i}$ for all $\bar{z}' = f^j(\bar{z})$, $j \ge 1$. In other words, if $zt_i$ is a substring of $T^{i-1}$ then every suffix $z't_i$ of $zt_i$ is a substring of $T^{i-1}$. Note that $\bar{z}$ always exists because $\bot$ is the last state on the boundary path and $\bot$ has a transition for every possible $t_i$.

When the traversal is stopped in this way, the procedure will create a new state for every suffix link examined during the traversal. This implies that the whole procedure will take time proportional to the size of the resulting automaton.

Summarized, the procedure for building $STrie(T^i)$ from $STrie(T^{i-1})$ is as follows [12]. Here $top$ denotes the state $\overline{t_1 \cdots t_{i-1}}$.

Algorithm 1.

    $r \leftarrow top$;
    while $g(r, t_i)$ is undefined do
        create new state $r'$ and new transition $g(r, t_i) = r'$;
        if $r \neq top$ then create new suffix link $f(oldr') = r'$;
        $oldr' \leftarrow r'$;
        $r \leftarrow f(r)$;
    create new suffix link $f(oldr') = g(r, t_i)$;
    $top \leftarrow g(top, t_i)$.

Starting from $STrie(\epsilon)$, which consists only of $root$ and $\bot$ and the links between them, and repeating Algorithm 1 for $t_i = t_1, t_2, \ldots, t_n$, we obviously get $STrie(T)$. The algorithm is optimal in the sense that it takes time proportional to the size of its end result $STrie(T)$. This in turn is proportional to $|Q|$, that is, to the number of different substrings of $T$. Unfortunately, this can be quadratic in $|T|$, as is the case for example if $T = a^n b^n$.

Theorem 1 Suffix trie $STrie(T)$ can be constructed in time proportional to the size of $STrie(T)$ which, in the worst case, is $O(|T|^2)$.

3. SUFFIX TREES

Suffix tree $STree(T)$ of $T$ is a data structure that represents $STrie(T)$ in space linear in the length $|T|$ of $T$. This is achieved by representing only a subset $Q' \cup \{\bot\}$ of the states of $STrie(T)$. We call the states in $Q' \cup \{\bot\}$ the explicit states. Set $Q'$ consists of all branching states (states from which there are at least two transitions) and all leaves (states from which there are no transitions) of $STrie(T)$. By definition, $root$ is included into the branching states. The other states of $STrie(T)$ (the states other than $root$ and $\bot$ from which there is exactly one transition) are called implicit states as states of $STree(T)$; they are not explicitly present in $STree(T)$.

The string $w$ spelled out by the transition path in $STrie(T)$ between two explicit states $s$ and $r$ is represented in $STree(T)$ as generalized transition $g'(s, w) = r$. To save space the string $w$ is actually represented as a pair $(k, p)$ of pointers (the left pointer $k$ and the right pointer $p$) to $T$ such that $t_k \ldots t_p = w$. In this way the generalized transition gets form $g'(s, (k, p)) = r$.

Such pointers exist because there must be a suffix $T_i$ such that the transition path for $T_i$ in $STrie(T)$ goes through $s$ and $r$. We could select the smallest such $i$, and let $k$ and $p$ point to the substring of this $T_i$ that is spelled out by the transition path from $s$ to $r$. A transition $g'(s, (k, p)) = r$ is called an $a$-transition if $t_k = a$. Each $s$ can have at most one $a$-transition for each $a \in \Sigma$.

Transitions $g(\bot, a) = root$ are represented in a similar fashion: Let $\Sigma = \{a_1, a_2, \ldots, a_m\}$. Then $g(\bot, a_j) = root$ is represented as $g'(\bot, (-j, -j)) = root$ for $j = 1, \ldots, m$.

Hence suffix tree $STree(T)$ has two components: the tree itself and the string $T$. It is of linear size in $|T|$ because $Q'$ has at most $|T|$ leaves (there is at most one leaf for each nonempty suffix) and therefore $Q'$ has to contain at most $|T| - 1$ branching states (when $|T| > 1$). There can be at most $2|T| - 2$ transitions between the states in $Q'$, each taking constant space because pointers are used instead of an explicit string. (Here we have assumed the standard RAM model in which a pointer takes constant space.)
We again augment the structure with the suffix function $f'$, now defined only for all branching states $\bar{x} \neq root$ as $f'(\bar{x}) = \bar{y}$ where $\bar{y}$ is a branching state such that $x = ay$ for some $a \in \Sigma$, and $f'(root) = \bot$. Such an $f'$ is well-defined: If $\bar{x}$ is a branching state, then also $f'(\bar{x})$ is a branching state. These suffix links are explicitly represented. It will sometimes be helpful to speak about implicit suffix links, i.e. imaginary suffix links between the implicit states.

The suffix tree of $T$ is denoted as $STree(T) = (Q' \cup \{\bot\}, root, g', f')$.

We refer to an explicit or implicit state $r$ of a suffix tree by a reference pair $(s, w)$ where $s$ is some explicit state that is an ancestor of $r$ and $w$ is the string spelled out by the transitions from $s$ to $r$ in the corresponding suffix trie. A reference pair is canonical if $s$ is the closest ancestor of $r$ (and hence, $w$ is shortest possible). For an explicit $r$ the canonical reference pair obviously is $(r, \epsilon)$. Again, we represent string $w$ as a pair $(k, p)$ of pointers such that $t_k \ldots t_p = w$. In this way a reference pair $(s, w)$ gets form $(s, (k, p))$. Pair $(s, \epsilon)$ is represented as $(s, (p + 1, p))$.

It is technically convenient to omit the final states in the definition of a suffix tree. When explicit final states are needed in some application, one gets them gratuitously by adding to $T$ an end marking symbol that does not occur elsewhere in $T$. The leaves of the suffix tree for such a $T$ are in one-to-one correspondence with the suffixes of $T$ and constitute the set of the final states. Another possibility is to traverse the suffix link path from leaf $\bar{T}$ to $root$ and make all states on the path explicit; these states are the final states of $STree(T)$. In many applications of $STree(T)$, the start location of each suffix is stored with the corresponding state. Such an augmented tree can be used as an index for finding any substring of $T$.
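The gap between the quadratic trie and the linear set of explicit states can be checked on small inputs. The following Python sketch repeats Algorithm 1 of Section 2 for each symbol and then counts the states that $STree(T)$ would keep, namely $root$, the branching states and the leaves. The names `State`, `build_suffix_trie` and `count_explicit` are hypothetical, and restricting the transitions from $\bot$ to the symbols that occur in the input is a simplifying assumption of this sketch.

```python
class State:
    def __init__(self):
        self.next = {}    # transition function g: symbol -> State
        self.link = None  # suffix function f


def build_suffix_trie(T):
    """Repeat Algorithm 1 for t_1, ..., t_n; return (root, all states of Q)."""
    bottom, root = State(), State()
    root.link = bottom
    for a in set(T):
        bottom.next[a] = root          # g(bottom, a) = root
    states = [root]
    top = root                         # deepest state, initially root
    for t in T:
        r, old = top, None
        while t not in r.next:         # walk the boundary path
            new = State()
            r.next[t] = new
            states.append(new)
            if old is not None:        # corresponds to the test r != top
                old.link = new
            old = new
            r = r.link
        if old is not None:
            old.link = r.next[t]
        top = top.next[t]
    return root, states


def count_explicit(root, states):
    """Number of states in Q': root, branching states and leaves."""
    return sum(1 for s in states if s is root or len(s.next) != 1)


for text in ("cacao", "aaabbb"):
    root, states = build_suffix_trie(text)
    print(text, len(states), count_explicit(root, states))
```

For $T = a^n b^n$ the first count grows as $(n + 1)^2$, while the second stays within the bound of $|T|$ leaves plus $|T| - 1$ branching states given above.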

4. ON-LINE CONSTRUCTION OF SUFFIX TREES

The algorithm for constructing $STree(T)$ will be patterned after Algorithm 1. What has to be done is for the most part immediately clear. Fig. 2 shows the phases of constructing $STree(cacao)$; for simplicity, the strings associated with each transition are shown explicitly in the figure. However, to get a linear-time algorithm some details need a more careful examination.

We first make more precise what Algorithm 1 does. Let $s_1 = \overline{t_1 \ldots t_{i-1}}, s_2, s_3, \ldots, s_i = root, s_{i+1} = \bot$ be the states of $STrie(T^{i-1})$ on the boundary path. Let $j$ be the smallest index such that $s_j$ is not a leaf, and let $j'$ be the smallest index such that $s_{j'}$ has a $t_i$-transition. As $s_1$ is a leaf and $\bot$ is a non-leaf that has a $t_i$-transition, both $j$ and $j'$ are well-defined and $j \le j'$. Now the following lemma should be obvious.

Fig. 2. Construction of $STree(cacao)$.

Lemma 1 Algorithm 1 adds to $STrie(T^{i-1})$ a $t_i$-transition for each of the states $s_h$, $1 \le h < j'$, such that for $1 \le h < j$, the new transition expands an old branch of the trie that ends at leaf $s_h$, and for $j \le h < j'$, the new transition initiates a new branch from $s_h$. Algorithm 1 does not create any other transitions.

We call state $s_j$ the active point and $s_{j'}$ the end point of $STrie(T^{i-1})$. These states are present, explicitly or implicitly, in $STree(T^{i-1})$, too. For example, the active points of the last three trees in Fig. 2 are $(root, c)$, $(root, ca)$, $(root, \epsilon)$.

Lemma 1 says that Algorithm 1 inserts two different groups of $t_i$-transitions into $STrie(T^{i-1})$:

(i) First, the states on the boundary path before the active point $s_j$ get a transition. These states are leaves, hence each such transition has to expand an existing branch of the trie.

(ii) Second, the states from the active point $s_j$ to the end point $s_{j'}$, the end point excluded, get a new transition. These states are not leaves, hence each new transition has to initiate a new branch.

Let us next interpret this in terms of suffix tree $STree(T^{i-1})$. The first group of transitions that expand an existing branch could be implemented by updating the right pointer of each transition that represents the branch. Let $g'(s, (k, i - 1)) = r$ be such a transition. The right pointer has to point to the last position $i - 1$ of $T^{i-1}$. This is because $r$ is a leaf and therefore a path leading to $r$ has to spell out a suffix of $T^{i-1}$ that does not occur elsewhere in $T^{i-1}$. Then the updated transition must be $g'(s, (k, i)) = r$. This only makes the string spelled out by the transition longer but does not change the states $s$ and $r$.

Making all such updates would take too much time. Therefore we use the following trick.

Any transition of $STree(T^{i-1})$ leading to a leaf is called an open transition. Such a transition is of the form $g'(s, (k, i - 1)) = r$ where, as stated above, the right pointer has to point to the last position $i - 1$ of $T^{i-1}$. Therefore it is not necessary to represent the actual value of the right pointer. Instead, open transitions are represented as $g'(s, (k, \infty)) = r$ where $\infty$ indicates that this transition is open to grow. In fact, $g'(s, (k, \infty)) = r$ represents a branch of any length between state $s$ and the imaginary state $r$ that is in infinity. An explicit updating of the right pointer when $t_i$ is inserted into this branch is not needed. Symbols $\infty$ can be replaced by $n = |T|$ after completing $STree(T)$. In this way the first group of transitions is implemented without any explicit changes to $STree(T^{i-1})$.

We still have to describe how to add to $STree(T^{i-1})$ the second group of transitions. These create entirely new branches that start from states $s_h$, $j \le h < j'$. Finding such states $s_h$ needs some care as they need not be explicit states at the moment.
They will be found along the boundary path of $STree(T^{i-1})$ using reference pairs and suffix links.

Let $h = j$ and let $(s, w)$ be the canonical reference pair for $s_h$, i.e., for the active point. As $s_h$ is on the boundary path of $STrie(T^{i-1})$, $w$ has to be a suffix of $T^{i-1}$. Hence $(s, w) = (s, (k, i - 1))$ for some $k \le i$.

We want to create a new branch starting from the state represented by $(s, (k, i - 1))$. However, first we test whether or not $(s, (k, i - 1))$ already refers to the end point $s_{j'}$. If it does, we are done. Otherwise a new branch has to be created. To this end the state $s_h$ referred to by $(s, (k, i - 1))$ has to be explicit. If it is not, an explicit state, denoted $s_h$, is created by splitting the transition that contains the corresponding implicit state. Then a $t_i$-transition from $s_h$ is created. It has to be an open transition $g'(s_h, (i, \infty)) = s_h'$ where $s_h'$ is a new leaf. Moreover, the suffix link $f'(s_h)$ is added if $s_h$ was created by splitting a transition.

Next the construction proceeds to $s_{h+1}$. As the reference pair for $s_h$ was $(s, (k, i - 1))$, the canonical reference pair for $s_{h+1}$ is canonize$(f'(s), (k, i - 1))$ where canonize makes the reference pair canonical by updating the state and the left pointer (note that the right pointer $i - 1$ remains unchanged in canonization). The above operations are then repeated for $s_{h+1}$, and so on until the end point $s_{j'}$ is found.

In this way we obtain the procedure update, given below, that transforms $STree(T^{i-1})$ into $STree(T^i)$ by inserting the $t_i$-transitions in the second group. The procedure uses procedure canonize mentioned above, and procedure test-and-split that tests whether or not a given reference pair refers to the end point. If it does not, then the procedure creates and returns an explicit state for the reference pair provided that the pair does not already represent an explicit state. Procedure update returns a reference pair for the end point $s_{j'}$ (actually only the state and the left pointer of the pair, as the second pointer remains $i - 1$ for all states on the boundary path).

procedure update($s$, $(k, i)$):
    $(s, (k, i - 1))$ is the canonical reference pair for the active point;
    1. $oldr \leftarrow root$; (end-point, $r$) $\leftarrow$ test-and-split($s$, $(k, i - 1)$, $t_i$);
    2. while not(end-point) do
    3.     create new transition $g'(r, (i, \infty)) = r'$ where $r'$ is a new state;
    4.     if $oldr \neq root$ then create new suffix link $f'(oldr) = r$;
    5.     $oldr \leftarrow r$;
    6.     $(s, k) \leftarrow$ canonize($f'(s)$, $(k, i - 1)$);
    7.     (end-point, $r$) $\leftarrow$ test-and-split($s$, $(k, i - 1)$, $t_i$);
    8. if $oldr \neq root$ then create new suffix link $f'(oldr) = s$;
    9. return $(s, k)$.

Procedure test-and-split tests whether or not a state with canonical reference pair $(s, (k, p))$ is the end point, that is, a state that in $STrie(T^{i-1})$ would have a $t_i$-transition. Symbol $t_i$ is given as input parameter $t$. The test result is returned as the first output parameter. If $(s, (k, p))$ is not the end point, then state $(s, (k, p))$ is made explicit (if not already so) by splitting a transition. The explicit state is returned as the second output parameter.

procedure test-and-split($s$, $(k, p)$, $t$):
    1. if $k \le p$ then
    2.     let $g'(s, (k', p')) = s'$ be the $t_k$-transition from $s$;
    3.     if $t = t_{k'+p-k+1}$ then return(true, $s$)
    4.     else
    5.         replace the $t_k$-transition above by transitions $g'(s, (k', k' + p - k)) = r$ and $g'(r, (k' + p - k + 1, p')) = s'$ where $r$ is a new state;
    6.         return(false, $r$)
    7. else
    8.     if there is no $t$-transition from $s$ then return(false, $s$)
    9.     else return(true, $s$).

This procedure benefits from the fact that $(s, (k, p))$ is canonical: The answer to the end point test can be found in constant time by considering only one transition from $s$.

Procedure canonize is as follows. Given a reference pair $(s, (k, p))$ for some state $r$, it finds and returns state $s'$ and left link $k'$ such that $(s', (k', p))$ is the canonical reference pair for $r$. State $s'$ is the closest explicit ancestor of $r$ (or $r$ itself if $r$ is explicit). Therefore the string that leads from $s'$ to $r$ must be a suffix of the string $t_k \ldots t_p$ that leads from $s$ to $r$. Hence the right link $p$ does not change but the left link $k$ can become $k'$, $k' \ge k$.

procedure canonize($s$, $(k, p)$):
    1. if $p < k$ then return $(s, k)$
    2. else
    3.     find the $t_k$-transition $g'(s, (k', p')) = s'$ from $s$;
    4.     while $p' - k' \le p - k$ do
    5.         $k \leftarrow k + p' - k' + 1$;
    6.         $s \leftarrow s'$;
    7.         if $k \le p$ then find the $t_k$-transition $g'(s, (k', p')) = s'$ from $s$;
    8.     return $(s, k)$.

To be able to continue the construction for the next text symbol $t_{i+1}$, the active point of $STree(T^i)$ has to be found. To this end, note first that $s_j$ is the active point of $STree(T^{i-1})$ if and only if $s_j = \overline{t_j \cdots t_{i-1}}$ where $t_j \cdots t_{i-1}$ is the longest suffix of $T^{i-1}$ that occurs at least twice in $T^{i-1}$. Second, note that $s_{j'}$ is the end point of $STree(T^{i-1})$ if and only if $s_{j'} = \overline{t_{j'} \cdots t_{i-1}}$ where $t_{j'} \cdots t_{i-1}$ is the longest suffix of $T^{i-1}$ such that $t_{j'} \cdots t_{i-1} t_i$ is a substring of $T^{i-1}$. But this means that if $s_{j'}$ is the end point of $STree(T^{i-1})$ then $t_{j'} \cdots t_{i-1} t_i$ is the longest suffix of $T^i$ that occurs at least twice in $T^i$, that is, state $g(s_{j'}, t_i)$ is the active point of $STree(T^i)$. We have shown the following result.

Lemma 2 Let $(s, (k, i - 1))$ be a reference pair of the end point $s_{j'}$ of $STree(T^{i-1})$. Then $(s, (k, i))$ is a reference pair of the active point of $STree(T^i)$.
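The two characterizations used in this argument can be tested by brute force on short strings. The Python sketch below computes the active point and the end point as strings, directly from "longest suffix that occurs at least twice" and "longest suffix whose extension by $t_i$ is a substring", and then checks the statement of Lemma 2. The function names are hypothetical; the empty string stands for $root$ and `None` for the auxiliary state $\bot$, from which the $t_i$-transition leads to $root$.

```python
def occurrences(text, w):
    """Number of (possibly overlapping) occurrences of w in text."""
    return sum(1 for i in range(len(text) - len(w) + 1) if text.startswith(w, i))


def active_point(prefix):
    """Longest suffix of prefix that occurs at least twice in prefix.

    The empty suffix (root) is the fallback when no nonempty suffix repeats.
    """
    for j in range(len(prefix) + 1):
        w = prefix[j:]
        if w == "" or occurrences(prefix, w) >= 2:
            return w


def end_point(prefix, t):
    """Longest suffix w of prefix such that w + t is a substring of prefix.

    Returns None when even the empty suffix fails, i.e. t does not occur in
    prefix and the end point is the auxiliary state.
    """
    for j in range(len(prefix) + 1):
        if prefix[j:] + t in prefix:
            return prefix[j:]
    return None


text = "cacao"
for i in range(1, len(text) + 1):
    e = end_point(text[:i - 1], text[i - 1])
    predicted = "" if e is None else e + text[i - 1]
    # Lemma 2: taking the t_i-transition from the end point gives the
    # active point of the next tree.
    print(i, repr(predicted), repr(active_point(text[:i])))
```

On `"cacao"` the last three active points come out as `c`, `ca` and the empty string, in agreement with the example given after Lemma 1.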

The overall algorithm for constructing $STree(T)$ is finally as follows. String $T$ is processed symbol by symbol, in one left-to-right scan. Writing $\Sigma = \{t_{-1}, \ldots, t_{-m}\}$ makes it possible to present the transitions from $\bot$ in the same way as the other transitions.

Algorithm 2. Construction of $STree(T)$ for string $T = t_1 t_2 \ldots \#$ in alphabet $\Sigma = \{t_{-1}, \ldots, t_{-m}\}$; $\#$ is the end marker not appearing elsewhere in $T$.

    1. create states $root$ and $\bot$;
    2. for $j \leftarrow 1, \ldots, m$ do create transition $g'(\bot, (-j, -j)) = root$;
    3. create suffix link $f'(root) = \bot$;
    4. $s \leftarrow root$; $k \leftarrow 1$; $i \leftarrow 0$;
    5. while $t_{i+1} \neq \#$ do
    6.     $i \leftarrow i + 1$;
    7.     $(s, k) \leftarrow$ update($s$, $(k, i)$);
    8.     $(s, k) \leftarrow$ canonize($s$, $(k, i)$).
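Algorithm 2, together with procedures update, test-and-split and canonize, can be transcribed almost line by line. The Python sketch below is one such transcription under stated assumptions, not a reference implementation: the names `Node`, `build_suffix_tree` and `contains` are hypothetical, the open right pointer $\infty$ is `float("inf")`, the main loop runs over a string of known length instead of testing for an end marker, and the transitions from $\bot$ are handled by a special case in `find` instead of the pointers $(-j, -j)$.

```python
INF = float("inf")  # open right pointer of a transition leading to a leaf


class Node:
    def __init__(self):
        self.edges = {}   # first symbol -> (k, p, child), i.e. g'(s, (k, p)) = child
        self.link = None  # suffix link f'


def build_suffix_tree(text):
    T = " " + text        # pad so that T[1..n] follows the paper's indexing
    root, bottom = Node(), Node()
    root.link = bottom

    def find(s, k):
        """The t_k-transition from s; from bottom every symbol leads to root."""
        return (k, k, root) if s is bottom else s.edges[T[k]]

    def canonize(s, k, p):
        if p < k:
            return s, k
        k1, p1, s1 = find(s, k)
        while p1 - k1 <= p - k:
            k += p1 - k1 + 1
            s = s1
            if k <= p:
                k1, p1, s1 = find(s, k)
        return s, k

    def test_and_split(s, k, p, t):
        if k <= p:
            k1, p1, s1 = find(s, k)
            if t == T[k1 + p - k + 1]:
                return True, s
            r = Node()                                   # make the state explicit
            s.edges[T[k1]] = (k1, k1 + p - k, r)
            r.edges[T[k1 + p - k + 1]] = (k1 + p - k + 1, p1, s1)
            return False, r
        if s is bottom:
            return True, s                               # bottom has every transition
        return (t in s.edges), s

    def update(s, k, i):
        oldr = root
        end_point, r = test_and_split(s, k, i - 1, T[i])
        while not end_point:
            r.edges[T[i]] = (i, INF, Node())             # new open transition
            if oldr is not root:
                oldr.link = r
            oldr = r
            s, k = canonize(s.link, k, i - 1)
            end_point, r = test_and_split(s, k, i - 1, T[i])
        if oldr is not root:
            oldr.link = s
        return s, k

    s, k = root, 1
    for i in range(1, len(text) + 1):
        s, k = update(s, k, i)
        s, k = canonize(s, k, i)
    return root


def contains(root, text, w):
    """Spell w from root along generalized transitions; True iff w occurs in text."""
    s, i, n = root, 0, len(text)
    while i < len(w):
        if w[i] not in s.edges:
            return False
        k, p, child = s.edges[w[i]]
        label = text[k - 1:min(p, n)]    # t_k ... t_p, open pointers end at n
        m = min(len(label), len(w) - i)
        if label[:m] != w[i:i + m]:
            return False
        i += m
        s = child
    return True


root = build_suffix_tree("cacao")
print(contains(root, "cacao", "aca"), contains(root, "cacao", "cc"))  # True False
```

The helper `contains` is only there to exercise the result as a substring index. Keeping the edges in a dictionary keyed by the first symbol is a convenience of this sketch; the running time analysis below assumes a bounded alphabet instead.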

Steps 7–8 are based on Lemma 2: After step 7 pair $(s, (k, i - 1))$ refers to the end point of $STree(T^{i-1})$, and hence, $(s, (k, i))$ refers to the active point of $STree(T^i)$.

Theorem 2 Algorithm 2 constructs the suffix tree $STree(T)$ for a string $T = t_1 \ldots t_n$ on-line in time $O(n)$.

Proof. The algorithm constructs $STree(T)$ through intermediate trees $STree(T^0), STree(T^1), \ldots, STree(T^n) = STree(T)$. It is on-line because to construct $STree(T^i)$ it only needs access to the first $i$ symbols of $T$.

For the running time analysis we divide the time requirement into two components, both of which turn out to be $O(n)$. The first component consists of the total time for procedure canonize. The second component consists of the rest: The time for repeatedly traversing the suffix link path from the present active point to the end point and creating the new branches by update and then finding the next active point by taking a transition from the end point (step 8 of Alg. 2). We call the states (reference pairs) on these paths the visited states.

The second component takes time proportional to the total number of the visited states, because the operations at each such state (create an explicit state and a new branch, follow an explicit or implicit suffix link, test for the end point) can be implemented in constant time as canonize is excluded. (To be precise, this also requires that $|\Sigma|$ is bounded independently of $n$.) Let $r_i$ be the active point of $STree(T^i)$ for $0 \le i \le n$. The visited states between $r_{i-1}$ and $r_i$ are on a path that consists of some suffix links and one $t_i$-transition. Taking a suffix link decreases the depth (the length of the string spelled out on the transition path from $root$) of the current state by one, and taking a $t_i$-transition increases it by one. The number of the visited states (including $r_{i-1}$, excluding $r_i$) on the path is therefore $depth(r_{i-1}) - depth(r_i) + 2$, and their total number is

$$\sum_{i=1}^{n} \bigl(depth(r_{i-1}) - depth(r_i) + 2\bigr) = depth(r_0) - depth(r_n) + 2n \le 2n.$$

This implies the second time component is $O(n)$.

The time spent by each execution of canonize has an upper bound of the form $a + bq$ where $a$ and $b$ are constants and $q$ is the number of executions of the body of the loop in steps 5–7 of canonize. The total time spent by canonize therefore has a bound that is proportional to the sum of the number of the calls of canonize and the total number of the executions of the body of the loop in all calls. There are $O(n)$ calls as there is one call for each visited state (either in step 6 of update or directly in step 8 of Alg. 2). Each execution of the body deletes a nonempty string from the left end of string $w = t_k \ldots t_p$ represented by the pointers in reference pair $(s, (k, p))$. String $w$ can grow during the whole process only in step 8 of Alg. 2, which catenates $t_i$ for $i = 1, \ldots, n$ to the right end of $w$. Hence a nonempty deletion is possible at most $n$ times. The total time for the body of the loop is therefore $O(n)$, and altogether canonize, our first component, needs time $O(n)$. $\Box$

Remark 1. (due to J. Kärkkäinen) In its final form our algorithm is a rather close relative of McCreight's method [9]. The principal technical difference seems to be that each execution of the body of the main loop of our Algorithm 2 consumes one text symbol $t_i$, whereas each execution of the body of the main loop of McCreight's algorithm traverses one suffix link and consumes zero or more text symbols.

Remark 2. It is not hard to generalize Algorithm 2 for the following dynamic version of the suffix tree problem (cf. the adaptive dictionary matching problem of [2]): Maintain a generalized linear-size suffix tree representing all suffixes of strings $T_i$ in set $\{T_1, \ldots, T_k\}$ under operations that insert or delete a string $T_i$. The resulting algorithm will make such updates in time $O(|T_i|)$.

5. CONSTRUCTING SUFFIX AUTOMATA

The suffix automaton $SA(T)$ of a string $T = t_1 \ldots t_n$ is the minimal DFA that accepts all the suffixes of $T$. As our $STrie(T)$ is a DFA for the suffixes of $T$, $SA(T)$ could be obtained by minimizing $STrie(T)$ in the standard way. Minimization works by combining the equivalent states, i.e., states from which $STrie(T)$ accepts the same set of strings. Using the suffix links we will obtain a natural characterization of the equivalent states as follows.

A state $s$ of $STrie(T)$ is called essential if there are at least two different suffix links pointing to $s$ or $s = \overline{t_1 \cdots t_k}$ for some $k$.

Theorem 3 Let $s$ and $r$ be two states of $STrie(T)$. The set of strings accepted from $s$ is equal to the set of strings accepted from $r$ if and only if the suffix link path that starts from $s$ contains $r$ (the path from $r$ contains $s$) and the subpath from $s$ to $r$ (from $r$ to $s$) does not contain any other essential states than possibly $s$ ($r$).

Proof. The theorem is implied by the following observations. The set of strings accepted from some state of $STrie(T)$ is a subset of the suffixes of $T$ and therefore each accepted string is of different length. A string of length $i$ is accepted from a state $s$ of $STrie(T)$ if and only if the suffix link path that starts from state $\overline{t_1 \cdots t_{n-i}}$ contains $s$. The suffix links form a tree that is directed to its root $root$. $\Box$

This suggests a method for constructing $SA(T)$ with a modified Algorithm 1. The new feature is that the construction should create a new state only if the state is essential. An unessential state $s$ is merged with the first essential state that is before $s$ on the suffix link path through $s$. This is correct as, by Theorem 3, the states are equivalent. As there are $O(|T|)$ essential states, the resulting algorithm can be made to work in linear time. The algorithm turns out to be similar to the algorithms in [4–6]. We therefore omit the details.
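The notion of equivalent states can be made concrete by brute force. The Python sketch below groups the states of $STrie(T)$, identified with the substrings of $T$, by the set of strings accepted from them; each group is one state of the minimal automaton. It illustrates the definition of equivalence only and does not implement the linear-time construction mentioned above. The function name `suffix_automaton_classes` is hypothetical.

```python
from collections import defaultdict


def suffix_automaton_classes(T):
    """Map each accepted-string set to the substrings (trie states) sharing it."""
    n = len(T)
    suffixes = {T[i:] for i in range(n + 1)}
    substrings = {T[i:j] for i in range(n + 1) for j in range(i, n + 1)}
    classes = defaultdict(list)
    for x in substrings:
        # y is accepted from state x exactly when xy is a suffix of T.
        accepted = frozenset(s[len(x):] for s in suffixes if s.startswith(x))
        classes[accepted].append(x)
    return {key: sorted(members) for key, members in classes.items()}


classes = suffix_automaton_classes("cacao")
for members in sorted(classes.values()):
    print(members)
print(len(classes), "states in the minimal automaton")
```

For `"cacao"` the substrings `a` and `ca` fall into the same class, since the strings accepted from both are `cao` and `o`.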

Acknowledgements. J. Kärkkäinen pointed out some inaccuracies in the earlier version [10] of this work. The author is also indebted to E. Sutinen, D. Wood, and, in particular, S. Kurtz and G. A. Stephen for several useful comments.

References
1. A. Aho and M. Corasick, Efficient string matching: An aid to bibliographic search, Comm. ACM 18 (1975), 333–340.
2. A. Amir and M. Farach, Adaptive dictionary matching, in Proc. 32nd IEEE Ann. Symp. on Foundations of Computer Science, 1991, pp. 760–766.
3. A. Apostolico, The myriad virtues of subword trees, in Combinatorial Algorithms on Words (A. Apostolico and Z. Galil, eds.), Springer-Verlag, 1985, pp. 85–95.
4. A. Blumer & al., The smallest automaton recognizing the subwords of a text, Theor. Comp. Sci. 40 (1985), 31–55.
5. M. Crochemore, Transducers and repetitions, Theor. Comp. Sci. 45 (1986), 63–86.
6. M. Crochemore, String matching with constraints, in Mathematical Foundations of Computer Science 1988 (M.P. Chytil, L. Janiga and V. Koubek, eds.), Lect. Notes in Computer Science, vol. 324, Springer-Verlag, 1988, pp. 44–58.
7. Z. Galil and R. Giancarlo, Data structures and algorithms for approximate string matching, J. Complexity 4 (1988), 33–72.
8. M. Kempf, R. Bayer and U. Güntzer, Time optimal left to right construction of position trees, Acta Informatica 24 (1987), 461–474.
9. E. McCreight, A space-economical suffix tree construction algorithm, Journal of the ACM 23 (1976), 262–272.
10. E. Ukkonen, Constructing suffix trees on-line in linear time, in Algorithms, Software, Architecture. Information Processing 92, vol. I (J. van Leeuwen, ed.), Elsevier, 1992, pp. 484–492.
11. E. Ukkonen, Approximate string-matching over suffix trees, in Combinatorial Pattern Matching, CPM 93 (A. Apostolico, M. Crochemore, Z. Galil, and U. Manber, eds.), Lect. Notes in Computer Science, vol. 684, Springer-Verlag, 1993, pp. 228–242.
12. E. Ukkonen and D. Wood, Approximate string matching with suffix automata, Algorithmica 10 (1993), 353–364.
13. P. Weiner, Linear pattern matching algorithms, in IEEE 14th Ann. Symp. on Switching and Automata Theory, 1973, pp. 1–11.
