Suffix Automaton - Algorithms for Competitive Programming
Suffix Automaton - Algorithms for Competitive Programming
ru
Suffix Automaton
A suffix automaton is a powerful data structure that allows solving many string-related problems.
For example, you can search for all occurrences of one string in another, or count the amount of different substrings of a given string.
Both tasks can be solved in linear time with the help of a suffix automaton.
Intuitively a suffix automaton can be understood as a compressed form of all substrings of a given string. An impressive fact is, that the
suffix automaton contains all this information in a highly compressed form. For a string of length n it only requires O(n) memory.
Moreover, it can also be built in O(n) time (if we consider the size k of the alphabet as a constant), otherwise both the memory and the
time complexity will be O(n log k) .
The linearity of the size of the suffix automaton was first discovered in 1983 by Blumer et al., and in 1985 the first linear algorithms for the
construction was presented by Crochemore and Blumer.
In other words:
A suffix automaton is an oriented acyclic graph. The vertices are called states, and the edges are called transitions between states.
One of the states t0 is the initial state, and it must be the source of the graph (all other states are reachable from t0 ).
Each transition is labeled with some character. All transitions originating from a state must have different labels.
One or multiple states are marked as terminal states. If we start from the initial state t0 and move along transitions to a terminal
state, then the labels of the passed transitions must spell one of the suffixes of the string s . Each of the suffixes of s must be
spellable using a path from t0 to a terminal state.
The suffix automaton contains the minimum number of vertices among all automata satisfying the conditions described above.
Substring property
The simplest and most important property of a suffix automaton is, that it contains information about all substrings of the string s . Any
path starting at the initial state t0 , if we write down the labels of the transitions, forms a substring of s . And conversely every substring of
s corresponds to a certain path starting at t0 .
In order to simplify the explanations, we will say that the substring corresponds to that path (starting at t0 and the labels spell the
substring). And conversely we say that any path corresponds to the string spelled by its labels.
One or multiple paths can lead to a state. Thus, we will say that a state corresponds to the set of strings, which correspond to these
paths.
Here we will show some examples of suffix automata for several simple strings.
We will denote the initial state with blue and the terminal states with green.
Consider any non-empty substring t of the string s . We will denote with endpos(t) the set of all positions in the string s , in which the
occurrences of t end. For instance, we have endpos("bc") = {2, 4} for the string "abcbc" .
We will call two substrings t1 and t2 endpos -equivalent, if their ending sets coincide: endpos(t1) = endpos(t2) . Thus all non-empty
substrings of the string s can be decomposed into several equivalence classes according to their sets endpos .
It turns out, that in a suffix machine endpos -equivalent substrings correspond to the same state. In other words the number of states in a
suffix automaton is equal to the number of equivalence classes among all substrings, plus the initial state. Each state of a suffix
automaton corresponds to one or more substrings having the same value endpos .
We will later describe the construction algorithm using this assumption. We will then see, that all the required properties of a suffix
automaton, except for the minimality, are fulfilled. And the minimality follows from Nerode's theorem (which will not be proven in this
article).
Lemma 1: Two non-empty substrings u and w (with length(u) ≤ length(w) ) are endpos -equivalent, if and only if the string u occurs
in s only in the form of a suffix of w .
The proof is obvious. If u and w have the same endpos values, then u is a suffix of w and appears only in the form of a suffix of w in s .
And if u is a suffix of w and appears only in the form as a suffix in s , then the values endpos are equal by definition.
Lemma 2: Consider two non-empty substrings u and w (with length(u) ≤ length(w) ). Then their sets endpos either don't intersect
at all, or endpos(w) is a subset of endpos(u) . And it depends on if u is a suffix of w or not.
Proof: If the sets endpos(u) and endpos(w) have at least one common element, then the strings u and w both end in that position, i.e.
u is a suffix of w . But then at every occurrence of w also appears the substring u , which means that endpos(w) is a subset of
endpos(u) .
Lemma 3: Consider an endpos -equivalence class. Sort all the substrings in this class by decreasing length. Then in the resulting
sequence each substring will be one shorter than the previous one, and at the same time will be a suffix of the previous one. In other
words, in a same equivalence class, the shorter substrings are actually suffixes of the longer substrings, and they take all possible lengths
in a certain interval [x; y] .
Proof: Fix some endpos -equivalence class. If it only contains one string, then the lemma is obviously true. Now let's say that the number
of strings in the class is greater than one.
According to Lemma 1, two different endpos -equivalent strings are always in such a way, that the shorter one is a proper suffix of the
longer one. Consequently, there cannot be two strings of the same length in the equivalence class.
Let's denote by w the longest, and through u the shortest string in the equivalence class. According to Lemma 1, the string u is a proper
suffix of the string w . Consider now any suffix of w with a length in the interval [length(u); length(w)] . It is easy to see, that this suffix
is also contained in the same equivalence class. Because this suffix can only appear in the form of a suffix of w in the string s (since also
the shorter suffix u occurs in s only in the form of a suffix of w ). Consequently, according to Lemma 1, this suffix is endpos -equivalent
to the string w .
Consider some state v ≠ t0 in the automaton. As we know, the state v corresponds to the class of strings with the same endpos
values. And if we denote by w the longest of these strings, then all the other strings are suffixes of w .
We also know the first few suffixes of a string w (if we consider suffixes in descending order of their length) are all contained in this
equivalence class, and all other suffixes (at least one other - the empty suffix) are in some other classes. We denote by t the biggest such
suffix, and make a suffix link to it.
In other words, a suffix link link(v) leads to the state that corresponds to the longest suffix of w that is in another endpos -equivalence
class.
Here we assume that the initial state t0 corresponds to its own equivalence class (containing only the empty string), and for convenience
we set endpos(t0) = {−1, 0, … , length(s) − 1} .
Proof: Consider an arbitrary state v ≠ t0 . A suffix link link(v) leads to a state corresponding to strings with strictly smaller length (this
follows from the definition of the suffix links and from Lemma 3). Therefore, by moving along the suffix links, we will sooner or later come
to the initial state t0 , which corresponds to the empty string.
Lemma 5: If we construct a tree using the sets endpos (by the rule that the set of a parent node contains the sets of all children as
subsets), then the structure will coincide with the tree of suffix links.
Proof: The fact that we can construct a tree using the sets endpos follows directly from Lemma 2 (that any two sets either do not
intersect or one is contained in the other).
Let us now consider an arbitrary state v ≠ t0 , and its suffix link link(v) . From the definition of the suffix link and from Lemma 2 it
follows that
endpos(v) ⊆ endpos(link(v)),
which together with the previous lemma proves the assertion: the tree of suffix links is essentially a tree of sets endpos .
Here is an example of a tree of suffix links in the suffix automaton build for the string "abcbc" . The nodes are labeled with the longest
substring from the corresponding equivalence class.
Recap
Before proceeding to the algorithm itself, we recap the accumulated knowledge, and introduce a few auxiliary notations.
The substrings of the string s can be decomposed into equivalence classes according to their end positions endpos .
The suffix automaton consists of the initial state t0 , as well as of one state for each endpos -equivalence class.
For each state v one or multiple substrings match. We denote by longest(v) the longest such string, and through len(v) its length.
We denote by shortest(v) the shortest such substring, and its length with minlen(v) . Then all the strings corresponding to this
state are different suffixes of the string longest(v) and have all possible lengths in the interval [minlen(v); len(v)] .
For each state v ≠ t0 a suffix link is defined as a link, that leads to a state that corresponds to the suffix of the string longest(v) of
length minlen(v) − 1 . The suffix links form a tree with the root in t0 , and at the same time this tree forms an inclusion relationship
between the sets endpos .
We can express minlen(v) for v ≠ t0 using the suffix link link(v) as:
minlen(v) = len(link(v)) + 1
If we start from an arbitrary state v0 and follow the suffix links, then sooner or later we will reach the initial state t0 . In this case we
obtain a sequence of disjoint intervals [minlen(vi); len(vi)] , which in union forms the continuous interval [0; len(v0)] .
Algorithm
Now we can proceed to the algorithm itself. The algorithm will be online, i.e. we will add the characters of the string one by one, and
modify the automaton accordingly in each step.
To achieve linear memory consumption, we will only store the values len , link and a list of transitions in each state. We will not label
terminal states (but we will later show how to arrange these labels after constructing the suffix automaton).
Initially the automaton consists of a single state t0 , which will be the index 0 (the remaining states will receive the indices 1, 2, … ). We
assign it len = 0 and link = −1 for convenience ( −1 will be a fictional, non-existing state).
Now the whole task boils down to implementing the process of adding one character c to the end of the current string. Let us describe
this process:
Let last be the state corresponding to the entire string before adding the character c . (Initially we set last = 0 , and we will change
last in the last step of the algorithm accordingly.)
Create a new state cur , and assign it with len(cur) = len(last) + 1 . The value link(cur) is not known at the time.
Now we to the following procedure: We start at the state last . While there isn't a transition through the letter c , we will add a
transition to the state cur , and follow the suffix link. If at some point there already exists a transition through the letter c , then we will
stop and denote this state with p .
If it haven't found such a state p , then we reached the fictitious state −1 , then we can just assign link(cur) = 0 and leave.
Suppose now that we have found a state p , from which there exists a transition through the letter c . We will denote the state, to
which the transition leads, with q .
Otherwise it is a bit more complicated. It is necessary to clone the state q : we create a new state clone , copy all the data from q
(suffix link and transition) except the value len . We will assign len(clone) = len(p) + 1 .
After cloning we direct the suffix link from cur to clone , and also from q to clone.
Finally we need to walk from the state p back using suffix links as long as there is a transition through c to the state q , and redirect
all those to the state clone .
In any of the three cases, after completing the procedure, we update the value last with the state cur .
If we also want to know which states are terminal and which are not, the we can find all terminal states after constructing the complete
suffix automaton for the entire string s . To do this, we take the state corresponding to the entire string (stored in the variable last ), and
follow its suffix links until we reach the initial state. We will mark all visited states as terminal. It is easy to understand that by doing so we
will mark exactly the states corresponding to all the suffixes of the string s , which are exactly the terminal states.
In the next section we will look in detail at each step and show its correctness.
Here we only note that, since we only create one or two new states for each character of s , the suffix automaton contains a linear number
of states.
The linearity of the number of transitions, and in general the linearity of the runtime of the algorithm is less clear, and they will be proven
after we proved the correctness.
Correctness
We will call a transition (p, q) continuous if len(p) + 1 = len(q) . Otherwise, i.e. when len(p) + 1 < len(q) , the transition will be
called non-continuous.
As we can see from the description of the algorithm, continuous and non-continuous transitions will lead to different cases of the
algorithm. Continuous transitions are fixed, and will never change again. In contrast non-continuous transition may change, when new
letters are added to the string (the end of the transition edge may change).
To avoid ambiguity we will denote the string, for which the suffix automaton was built before adding the current character c , with s .
The algorithm begins with creating a new state cur , which will correspond to the entire string s + c . It is clear why we have to create
a new state. Together with the new character a new equivalence class is created.
After creating a new state we traverse by suffix links starting from the state corresponding to the entire string s . For each state we try
to add a transition with the character c to the new state cur . Thus we append to each suffix of s the character c . However we can
only add these new transitions, if they don't conflict with an already existing one. Therefore as soon as we find an already existing
transition with c we have to stop.
In the simplest case we reached the fictitious state −1 . This means we added the transition with c to all suffixes of s . This also
means, that the character c hasn't been part of the string s before. Therefore the suffix link of cur has to lead to the state 0 .
In the second case we came across an existing transition (p, q) . This means that we tried to add a string x + c (where x is a suffix
of s ) to the machine that already exists in the machine (the string x + c already appears as a substring of s ). Since we assume that
the automaton for the string s is build correctly, we should not add a new transition here.
However there is a difficulty. To which state should the suffix link from the state cur lead? We have to make a suffix link to a state, in
which the longest string is exactly x + c , i.e. the len of this state should be len(p) + 1 . However it is possible, that such a state
doesn't yet exists, i.e. len(q) > len(p) + 1 . In this case we have to create such a state, by splitting the state q .
If the transition (p, q) turns out to be continuous, then len(q) = len(p) + 1 . In this case everything is simple. We direct the suffix
link from cur to the state q .
Otherwise the transition is non-continuous, i.e. len(q) > len(p) + 1 . This means that the state q corresponds to not only the suffix
of s + c with length len(p) + 1 , but also to longer substrings of s . We can do nothing other than splitting the state q into two sub-
states, so that the first one has length len(p) + 1 .
How can we split a state? We clone the state q , which gives us the state clone , and we set len(clone) = len(p) + 1 . We copy all
the transitions from q to clone , because we don't want to change the paths that traverse through q . Also we set the suffix link from
clone to the target of the suffix link of q , and set the suffix link of q to clone .
And after splitting the state, we set the suffix link from cur to clone .
In the last step we change some of the transitions to q , we redirect them to clone . Which transitions do we have to change? It is
enough to redirect only the transitions corresponding to all the suffixes of the string w + c (where w is the longest string of p ), i.e.
we need to continue to move along the suffix links, starting from the vertex p until we reach the fictitious state −1 or a transition that
leads to a different state than q .
First we immediately make the assumption that the size of the alphabet is constant. If this is not the case, then it will not be possible to
talk about the linear time complexity. The list of transitions from one vertex will be stored in a balanced tree, which allows you to quickly
perform key search operations and adding keys. Therefore if we denote with k the size of the alphabet, then the asymptotic behavior of
the algorithm will be O(n log k) with O(n) memory. However if the alphabet is small enough, then you can sacrifice memory by
avoiding balanced trees, and store the transitions at each vertex as an array of length k (for quick searching by key) and a dynamic list (to
quickly traverse all available keys). Thus we reach the O(n) time complexity for the algorithm, but at a cost of O(nk) memory
complexity.
So we will consider the size of the alphabet to be constant, i.e. each operation of searching for a transition on a character, adding a
transition, searching for the next transition - all these operations can be done in O(1) .
If we consider all parts of the algorithm, then it contains three places in the algorithm in which the linear complexity is not obvious:
The first place is the traversal through the suffix links from the state last , adding transitions with the character c .
The second place is the copying of transitions when the state q is cloned into a new state clone .
We use the fact that the size of the suffix automaton (both in number of states and in the number of transitions) is linear. (The proof of
the linearity of the number of states is the algorithm itself, and the proof of linearity of the number of states is given below, after the
implementation of the algorithm).
Thus the total complexity of the first and second places is obvious, after all each operation adds only one amortized new transition to the
automaton.
It remains to estimate the total complexity of the third place, in which we redirect transitions, that pointed originally to q , to clone . We
denote v = longest(p) . This is a suffix of the string s , and with each iteration its length decreases - and therefore the position v as the
suffix of the string s increases monotonically with each iteration. In this case, if before the first iteration of the loop, the corresponding
string v was at the depth k ( k ≥ 2 ) from last (by counting the depth as the number of suffix links), then after the last iteration the string
v + c will be a 2 -th suffix link on the path from cur (which will become the new value last ).
Thus, each iteration of this loop leads to the fact that the position of the string longest(link(link(last)) as suffix of the current string
will monotonically increase. Therefore this cycle cannot be executed more than n iterations, which was required to prove.
Implementation
First we describe a data structure that will store all information about a specific transition ( len , link and the list of transitions). If
necessary you can add a terminal flag here, as well as other information. We will store the list of transitions in the form of a map , which
allows us to achieve total O(n) memory and O(n log k) time for processing the entire string.
struct state {
int len, link;
map<char, int> next;
};
The suffix automaton itself will be stored in an array of these structures state . We store the current size sz and also the variable last ,
the state corresponding to the entire string at the moment.
We give a function that initializes a suffix automaton (creating a suffix automaton with a single state).
void sa_init() {
st[0].len = 0;
st[0].link = -1;
sz++;
last = 0;
}
And finally we give the implementation of the main function - which adds the next character to the end of the current line, rebuilding the
machine accordingly.
void sa_extend(char c) {
int cur = sz++;
st[cur].len = st[last].len + 1;
int p = last;
while (p != -1 && !st[p].next.count(c)) {
st[p].next[c] = cur;
p = st[p].link;
}
if (p == -1) {
st[cur].link = 0;
} else {
int q = st[p].next[c];
if (st[p].len + 1 == st[q].len) {
st[cur].link = q;
} else {
int clone = sz++;
st[clone].len = st[p].len + 1;
st[clone].next = st[q].next;
st[clone].link = st[q].link;
while (p != -1 && st[p].next[c] == q) {
st[p].next[c] = clone;
p = st[p].link;
}
st[q].link = st[cur].link = clone;
}
}
last = cur;
}
As mentioned above, if you sacrifice memory ( O(nk) , where k is the size of the alphabet), then you can achieve the build time of the
machine in O(n) , even for any alphabet size k . But for this you will have to store an array of size k in each state (for quickly jumping to
the transition of the letter), and additional a list of all transitions (to quickly iterate over the transitions them).
Additional properties
Number of states
The number of states in a suffix automaton of the string s of length n doesn't exceed 2n − 1 (for n ≥ 2 ).
The proof is the construction algorithm itself, since initially the automaton consists of one state, and in the first and second iteration only
a single state will be created, and in the remaining n − 2 steps at most 2 states will be created each.
However we can also show this estimation without knowing the algorithm. Let us recall that the number of states is equal to the number
of different sets endpos . In addition theses sets endpos form a tree (a parent vertex contains all children sets in his set). Consider this
tree and transform it a little bit: as long as it has an internal vertex with only one child (which means that the set of the child misses at
least one position from the parent set), we create a new child with the set of the missing positions. In the end we have a tree in which
each inner vertex has a degree greater than one, and the number of leaves does not exceed n . Therefore there are no more than 2n − 1
vertices in such a tree.
This bound of the number of states can actually be achieved for each n . A possible string is:
"abbb … bbb"
In each iteration, starting at the third one, the algorithm will split a state, resulting in exactly 2n − 1 states.
Number of transitions
The number of transitions in a suffix automaton of a string s of length n doesn't exceed 3n − 4 (for n ≥ 3 ).
Let us first estimate the number of continuous transitions. Consider a spanning tree of the longest paths in the automaton starting in the
state t0 . This skeleton will consist of only the continuous edges, and therefore their number is less than the number of states, i.e. it does
not exceed 2n − 2 .
Now let us estimate the number of non-continuous transitions. Let the current non-continuous transition be (p, q) with the character c .
We take the correspondent string u + c + w , where the string u corresponds to the longest path from the initial state to p , and w to the
longest path from q to any terminal state. On one hand, each such string u + c + w for each incomplete strings will be different (since
the strings u and w are formed only by complete transitions). On the other hand each such string u + c + w , by the definition of the
terminal states, will be a suffix of the entire string s . Since there are only n non-empty suffixes of s , and non of the strings u + c + w
can contain s (because the entire string only contains complete transitions), the total number of incomplete transitions does not exceed
n − 1.
Combining these two estimates gives us the bound 3n − 3 . However, since the maximum number of states can only be achieved with
the test case "abbb\dots bbb" and this case has clearly less than 3n − 3 transitions, we get the tighter bound of 3n − 4 for the
number of transitions in a suffix automaton.
"abbb … bbbc"
Applications
Here we look at some tasks that can be solved using the suffix automaton. For the simplicity we assume that the alphabet size k is
constant, which allows us to consider the complexity of appending a character and the traversal as constant.
Given a text T , and multiple patters P . We have to check whether or not the strings P appear as a substring of T .
We build a suffix automaton of the text T in O(length(T )) time. To check if a pattern P appears in T , we follow the transitions,
starting from t0 , according to the characters of P . If at some point there doesn't exists a transition, then the pattern P doesn't appear as
a substring of T . If we can process the entire string P this way, then the string appears in T .
It is clear that this will take O(length(P )) time for each string P . Moreover the algorithm actually finds the length of the longest prefix
of P that appears in the text.
Each substring of S corresponds to some path in the automaton. Therefore the number of different substrings is equal to the number of
different paths in the automaton starting at t0 .
Given that the suffix automaton is a directed acyclic graph, the number of different ways can be computed using dynamic programming.
Namely, let d[v] be the number of ways, starting at the state v (including the path of length zero). Then we have the recursion:
d[v] = 1 + ∑ d[w]
w:(v,w,c)∈DAW G
I.e. d[v] can be expressed as the sum of answers for all ends of the transitions of v .
The number of different substrings is the value d[t0] − 1 (since we don't count the empty substring).
Alternatively, we can take advantage of the fact that each state v matches to substrings of length [minlen(v), len(v)] . Therefore, given
minlen(v) = 1 + len(link(v)) , we have total distinct substrings at state v being
len(v) − minlen(v) + 1 = len(v) − (1 + len(link(v))) + 1 = len(v) − len(link(v)) .
While this is also O(length(S)) , it requires no extra space and no recursive calls, consequently running faster in practice.
Given a string S . We want to compute the total length of all its various substrings.
The solution is similar to the previous one, only now it is necessary to consider two quantities for the dynamic programming part: the
number of different substrings d[v] and their total length ans[v] .
We already described how to compute d[v] in the previous task. The value ans[v] can be computed using the recursion:
We take the answer of each adjacent vertex w , and add to it d[w] (since every substrings is one character longer when starting from the
state v ).
Alternatively, we can, again, take advantage of the fact that each state v matches to substrings of length [minlen(v), len(v)] . Since
minlen(v) = 1 + len(link(v)) and the arithmetic series formula Sn = n ⋅ a1+an
2 (where Sn denotes the sum of n terms, a1
representing the first term, and an representing the last), we can compute the length of substrings at a state in constant time. We then
sum up these totals for each state v ≠ t0 in the automaton. This is shown by the code below:
This approaches runs in O(length(S)) time, but experimentally runs 20x faster than the memoized dynamic programming version on
randomized strings. It requires no extra space and no recursion.
Given a string S . We have to answer multiple queries. For each given number Ki we have to find the Ki -th string in the lexicographically
ordered list of all substrings.
The solution of this problem is based on the idea of the previous two problems. The lexicographically k -th substring corresponds to the
lexicographically k -th path in the suffix automaton. Therefore after counting the number of paths from each state, we can easily search
for the k -th path starting from the root of the automaton.
This takes O(length(S)) time for preprocessing and then O(length(ans) ⋅ k) for each query (where ans is the answer for the query
and k is the size of the alphabet).
We construct a suffix automaton for the string S + S . Then the automaton will contain in itself as paths all the cyclic shifts of the string
S.
Consequently the problem is reduced to finding the lexicographically smallest path of length length(S) , which can be done in a trivial
way: we start in the initial state and greedily pass through the transitions with the minimal character.
Number of occurrences
For a given text T . We have to answer multiple queries. For each given pattern P we have to find out how many times the string P
appears in the string T as substring.
Next we do the following preprocessing: for each state v in the automaton we calculate the number cnt[v] that is equal to the size of the
set endpos(v) . In fact all strings corresponding to the same state v appear in the text T an equal amount of times, which is equal to the
number of positions in the set endpos .
However we cannot construct the sets endpos explicitly, therefore we only consider their sizes cnt .
To compute them we proceed as follows. For each state, if it was not created by cloning (and if it is not the initial state t0 ), we initialize it
with cnt = 1 . Then we will go through all states in decreasing order of their length len , and add the current value cnt[v] to the suffix
links:
cnt[link(v)] += cnt[v]
This gives the correct value for each state.
Why is this correct? The total number of states obtained not via cloning is exactly length(T ) , and the first i of them appeared when we
added the first i characters. Consequently for each of these states we count the corresponding position at which it was processed.
Therefore initially we have cnt = 1 for each such state, and cnt = 0 for all other.
Then we apply the following operation for each v : cnt[link(v)] += cnt[v] . The meaning behind this is, that if a string v appears cnt[v]
times, then also all its suffixes appear at the exact same end positions, therefore also cnt[v] times.
Why don't we overcount in this procedure (i.e. don't count some position twice)? Because we add the positions of a state to only one other
state, so it can not happen that one state directs its positions to another state twice in two different ways.
Thus we can compute the quantities cnt for all states in the automaton in O(length(T )) time.
After that answering a query by just looking up the value cnt[t] , where t is the state corresponding to the pattern, if such a state exists.
Otherwise answer with 0 . Answering a query takes O(length(P )) time.
Given a text T and multiple queries. For each query string P we want to find the position of the first occurrence of P in the string T (the
position of the beginning of P ).
We again construct a suffix automaton. Additionally we precompute the position firstpos for all states in the automaton, i.e. for each
state v we want to find the position firstpos[v] of the end of the first occurrence. In other words, we want to find in advance the
minimal element of each set endpos (since obviously cannot maintain all sets endpos explicitly).
To maintain these positions firstpos we extend the function sa_extend() . When we create a new state cur , we set:
firstpos(cur) = len(cur) − 1
firstpos(clone) = firstpos(q)
(since the only other option for a value would be firstpos(cur) which is definitely too big)
Thus the answer for a query is simply firstpos(t) − length(P ) + 1 , where t is the state corresponding to the string P . Answering a
query again takes only O(length(P )) time.
This time we have to display all positions of the occurrences in the string T .
Again we construct a suffix automaton for the text T . Similar as in the previous task we compute the position firstpos for all states.
Clearly firstpos(t) is part of the answer, if t is the state corresponding to a query string P . So we took into account the state of the
automaton containing P . What other states do we need to take into account? All states that correspond to strings for which P is a suffix.
In other words we need to find all the states that can reach the state t via suffix links.
Therefore to solve the problem we need to save for each state a list of suffix references leading to it. The answer to the query then will
then contain all firstpos for each state that we can find on a DFS / BFS starting from the state t using only the suffix references.
Overall, this requires O(length(T )) for preprocessing and O(length(P ) + answer(P )) for each request, where answer(P ) — this
is the size of the answer.
First, we walk down the automaton for each character in the pattern to find our starting node requiring O(length(P )) . Then, we use our
workaround which will work in time O(answer(P )) , because we will not visit a state twice (because only one suffix link leaves each
state, so there cannot be two different paths leading to the same state).
We only must take into account that two different states can have the same firstpos value. This happens if one state was obtained by
cloning another. However, this doesn't ruin the complexity, since each state can only have at most one clone.
Moreover, we can also get rid of the duplicate positions, if we don't output the positions from the cloned states. In fact a state, that a
cloned state can reach, is also reachable from the original state. Thus if we remember the flag is_cloned for each state, we can simply
ignore the cloned states and only output firstpos for all other states.
struct state {
...
bool is_clone;
int first_pos;
vector<int> inv_link;
};
Given a string S and a certain alphabet. We have to find a string of smallest length, that doesn't appear in S .
We will apply dynamic programming on the suffix automaton built for the string S .
Let d[v] be the answer for the node v , i.e. we already processed part of the substring, are currently in the state v , and want to find the
smallest number of characters that have to be added to find a non-existent transition. Computing d[v] is very simple. If there is not
transition using at least one character of the alphabet, then d[v] = 1 . Otherwise one character is not enough, and so we need to take the
minimum of all answers of all transitions:
The answer to the problem will be d[t0] , and the actual string can be restored using the computed array d[] .
Given two strings S and T . We have to find the longest common substring, i.e. such a string X that appears as substring in S and also
in T .
We will now take the string T , and for each prefix look for the longest suffix of this prefix in S . In other words, for each position in the
string T , we want to find the longest common substring of S and T ending in that position.
For this we will use two variables, the current state v , and the current length l . These two variables will describe the current matching
part: its length and the state that corresponds to it.
Now let us describe how we can add a character T [i] and recalculate the answer for it.
If there is a transition from v with the character T [i] , then we simply follow the transition and increase l by one.
If there is no such transition, we have to shorten the current matching part, which means that we need to follow the suffix link:
v = link(v) . At the same time, the current length has to be shortened. Obviously we need to assign l = len(v) , since after passing
through the suffix link we end up in state whose corresponding longest string is a substring.
If there is still no transition using the required character, we repeat and again go through the suffix link and decrease l , until we find a
transition or we reach the fictional state −1 (which means that the symbol T [i] doesn't appear at all in S , so we assign v = l = 0 ).
The answer to the task will be the maximum of all the values l .
The complexity of this part is O(length(T )) , since in one move we can either increase l by one, or make several passes through the
suffix links, each one ends up reducing the value l .
Implementation:
There are k strings Si given. We have to find the longest common substring, i.e. such a string X that appears as substring in each string
Si .
We join all strings into one large string T , separating the strings by a special characters Di (one for each string):
T = S1 + D1 + S2 + D2 + ⋯ + Sk + Dk.
Now we need to find a string in the machine, which is contained in all the strings Si , and this can be done by using the special added
characters. Note that if a substring is included in some string Sj , then in the suffix automaton exists a path starting from this substring
containing the character Dj and not containing the other characters D1, … , Dj−1, Dj+1, … , Dk .
Thus we need to calculate the attainability, which tells us for each state of the machine and each symbol Di if there exists such a path.
This can easily be computed by DFS or BFS and dynamic programming. After that, the answer to the problem will be the string
longest(v) for the state v , from which the paths were exists for all special characters.
Practice Problems
CSES - Finding Patterns
SPOJ - SUBLEX
Codeforces - String
Contributors:
Jakob Kogler (91.77%) Michael Hayter (3.92%) iagorrr (0.89%) Oleksandr Kulkov (0.89%) James Wilson (0.76%) Tarun Khullar (0.76%)
Ir1dXD (0.25%) Alfredo Beaumont (0.13%) Robin Hafid Quintero Lopez (0.13%) Bergur Snorrason (0.13%) Le Bao Hiep (0.13%)
wiktorkuchta (0.13%) Md. Abdulla Al Mamun (0.13%)