You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Copy file name to clipboardExpand all lines: src/string/suffix-automaton.md
+15-15Lines changed: 15 additions & 15 deletions
Original file line number
Diff line number
Diff line change
@@ -221,11 +221,11 @@ Let us describe this process:
221
221
(Initially we set $last = 0$, and we will change $last$ in the last step of the algorithm accordingly.)
222
222
- Create a new state $cur$, and assign it with $len(cur) = len(last) + 1$.
223
223
The value $link(cur)$ is not known at the time.
224
-
- Now we to the following procedure:
224
+
- Now we do the following procedure:
225
225
We start at the state $last$.
226
226
While there isn't a transition through the letter $c$, we will add a transition to the state $cur$, and follow the suffix link.
227
227
If at some point there already exists a transition through the letter $c$, then we will stop and denote this state with $p$.
228
-
- If it haven't found such a state $p$, then we reached the fictitious state $-1$, then we can just assign $link(cur) = 0$ and leave.
228
+
- If we haven't found such a state $p$, then we reached the fictitious state $-1$, then we can just assign $link(cur) = 0$ and leave.
229
229
- Suppose now that we have found a state $p$, from which there exists a transition through the letter $c$.
230
230
We will denote the state, to which the transition leads, with $q$.
231
231
- Now we have two cases. Either $len(p) + 1 = len(q)$, or not.
@@ -241,7 +241,7 @@ Let us describe this process:
241
241
242
242
- In any of the three cases, after completing the procedure, we update the value $last$ with the state $cur$.
243
243
244
-
If we also want to know which states are **terminal** and which are not, the we can find all terminal states after constructing the complete suffix automaton for the entire string $s$.
244
+
If we also want to know which states are **terminal** and which are not, we can find all terminal states after constructing the complete suffix automaton for the entire string $s$.
245
245
To do this, we take the state corresponding to the entire string (stored in the variable $last$), and follow its suffix links until we reach the initial state.
246
246
We will mark all visited states as terminal.
247
247
It is easy to understand that by doing so we will mark exactly the states corresponding to all the suffixes of the string $s$, which are exactly the terminal states.
@@ -280,7 +280,7 @@ The linearity of the number of transitions, and in general the linearity of the
280
280
281
281
- In the second case we came across an existing transition $(p, q)$.
282
282
This means that we tried to add a string $x + c$ (where $x$ is a suffix of $s$) to the machine that **already exists** in the machine (the string $x + c$ already appears as a substring of $s$).
283
-
Since we assume that the automaton for the string $s$ is build correctly, we should not add a new transition here.
283
+
Since we assume that the automaton for the string $s$ is built correctly, we should not add a new transition here.
284
284
285
285
However there is a difficulty.
286
286
To which state should the suffix link from the state $cur$ lead?
@@ -324,7 +324,7 @@ If we consider all parts of the algorithm, then it contains three places in the
324
324
- The second place is the copying of transitions when the state $q$ is cloned into a new state $clone$.
325
325
- Third place is changing the transition leading to $q$, redirecting them to $clone$.
326
326
327
-
We use the fact that the size of the suffix automaton (both in number of states and in the number of transitions) is **linear**.
327
+
We use the fact that the size of the suffix automaton (both in the number of states and in the number of transitions) is **linear**.
328
328
(The proof of the linearity of the number of states is the algorithm itself, and the proof of linearity of the number of states is given below, after the implementation of the algorithm).
329
329
330
330
Thus the total complexity of the **first and second places** is obvious, after all each operation adds only one amortized new transition to the automaton.
@@ -334,7 +334,7 @@ We denote $v = longest(p)$.
334
334
This is a suffix of the string $s$, and with each iteration its length decreases - and therefore the position $v$ as the suffix of the string $s$ increases monotonically with each iteration.
335
335
In this case, if before the first iteration of the loop, the corresponding string $v$ was at the depth $k$ ($k \ge 2$) from $last$ (by counting the depth as the number of suffix links), then after the last iteration the string $v + c$ will be a $2$-th suffix link on the path from $cur$ (which will become the new value $last$).
336
336
337
-
Thus, each iteration of this loop leads to the fact that the position of the string $longest(link(link(last))$ as suffix of the current string will monotonically increase.
337
+
Thus, each iteration of this loop leads to the fact that the position of the string $longest(link(link(last))$ as a suffix of the current string will monotonically increase.
338
338
Therefore this cycle cannot be executed more than $n$ iterations, which was required to prove.
339
339
340
340
### Implementation
@@ -444,7 +444,7 @@ Let the current non-continuous transition be $(p, q)$ with the character $c$.
444
444
We take the correspondent string $u + c + w$, where the string $u$ corresponds to the longest path from the initial state to $p$, and $w$ to the longest path from $q$ to any terminal state.
445
445
On one hand, each such string $u + c + w$ for each incomplete strings will be different (since the strings $u$ and $w$ are formed only by complete transitions).
446
446
On the other hand each such string $u + c + w$, by the definition of the terminal states, will be a suffix of the entire string $s$.
447
-
Since there are only $n$ non-empty suffixes of $s$, and non of the strings $u + c + w$ can contain $s$ (because the entire string only contains complete transitions), the total number of incomplete transitions does not exceed $n - 1$.
447
+
Since there are only $n$ non-empty suffixes of $s$, and none of the strings $u + c + w$ can contain $s$ (because the entire string only contains complete transitions), the total number of incomplete transitions does not exceed $n - 1$.
448
448
449
449
Combining these two estimates gives us the bound $3n - 3$.
450
450
However, since the maximum number of states can only be achieved with the test case $\text{"abbb\dots bbb"}$ and this case has clearly less than $3n - 3$ transitions, we get the tighter bound of $3n - 4$ for the number of transitions in a suffix automaton.
@@ -460,7 +460,7 @@ For the simplicity we assume that the alphabet size $k$ is constant, which allow
460
460
461
461
### Check for occurrence
462
462
463
-
Given a text $T$, and multiple patters $P$.
463
+
Given a text $T$, and multiple patterns $P$.
464
464
We have to check whether or not the strings $P$ appear as a substring of $T$.
465
465
466
466
We build a suffix automaton of the text $T$ in $O(length(T))$ time.
@@ -525,7 +525,7 @@ The value $ans[v]$ can be computed using the recursion:
We take the answer of each adjacent vertex $w$, and add to it $d[w]$ (since every substrings is one character longer when starting from the state $v$).
528
+
We take the answer of each adjacent vertex $w$, and add to it $d[w]$ (since every substring is one character longer when starting from the state $v$).
529
529
530
530
Again this task can be computed in $O(length(S))$ time.
531
531
@@ -547,15 +547,15 @@ long long get_tot_len_diff_substings() {
547
547
}
548
548
```
549
549
550
-
This approaches runs in $O(length(S))$ time, but experimentally runs 20x faster than the memoized dynamic programming version on randomized strings. It requires no extra space and no recursion.
550
+
This approach runs in $O(length(S))$ time, but experimentally runs 20x faster than the memoized dynamic programming version on randomized strings. It requires no extra space and no recursion.
For each given number $K_i$ we have to find the $K_i$-th string in the lexicographically ordered list of all substrings.
557
557
558
-
The solution of this problem is based on the idea of the previous two problems.
558
+
The solution to this problem is based on the idea of the previous two problems.
559
559
The lexicographically $k$-th substring corresponds to the lexicographically $k$-th path in the suffix automaton.
560
560
Therefore after counting the number of paths from each state, we can easily search for the $k$-th path starting from the root of the automaton.
561
561
@@ -577,7 +577,7 @@ Total time complexity is $O(length(S))$.
577
577
578
578
For a given text $T$.
579
579
We have to answer multiple queries.
580
-
For each given pattern $P$ we have to find out how many times the string $P$ appears in the string $T$ as substring.
580
+
For each given pattern $P$ we have to find out how many times the string $P$ appears in the string $T$ as a substring.
581
581
582
582
We construct the suffix automaton for the text $T$.
583
583
@@ -603,7 +603,7 @@ Therefore initially we have $cnt = 1$ for each such state, and $cnt = 0$ for all
603
603
Then we apply the following operation for each $v$: $cnt[link(v)] \text{ += } cnt[v]$.
604
604
The meaning behind this is, that if a string $v$ appears $cnt[v]$ times, then also all its suffixes appear at the exact same end positions, therefore also $cnt[v]$ times.
605
605
606
-
Why don't we overcount in this procedure (i.e. don't count some position twice)?
606
+
Why don't we overcount in this procedure (i.e. don't count some positions twice)?
607
607
Because we add the positions of a state to only one other state, so it can not happen that one state directs its positions to another state twice in two different ways.
608
608
609
609
Thus we can compute the quantities $cnt$ for all states in the automaton in $O(length(T))$ time.
@@ -690,7 +690,7 @@ void output_all_occurrences(int v, int P_length) {
690
690
### Shortest non-appearing string
691
691
692
692
Given a string $S$ and a certain alphabet.
693
-
We have to find a string of smallest length, that doesn't appear in $S$.
693
+
We have to find a string of the smallest length, that doesn't appear in $S$.
694
694
695
695
We will apply dynamic programming on the suffix automaton built for the string $S$.
696
696
@@ -706,7 +706,7 @@ The answer to the problem will be $d[t_0]$, and the actual string can be restore
706
706
### Longest common substring of two strings
707
707
708
708
Given two strings $S$ and $T$.
709
-
We have to find the longest common substring, i.e. such a string $X$ that appears as substring in $S$ and also in $T$.
709
+
We have to find the longest common substring, i.e. such a string $X$ that appears as a substring in $S$ and also in $T$.
710
710
711
711
We construct a suffix automaton for the string $S$.
0 commit comments