-
-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Update suffix-array.md for Linear Time approach #1369
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Thanks! I'm not very familiar with linear-time suffix array construction algorithms, but random places over the internet claim that SA-iS is simpler and faster than Kärkkäinen-Sanders, is that true? I wonder if we should describe SA-IS instead of DC3... 🤔 |
Thank you for the feedback! You're absolutely right , While the DC3 algorithm is efficient for large inputs and widely used when memory consumption is a constraint, I agree that SA-IS is simpler and faster in many practical scenarios, especially in competitive programming. Given this, I’m happy to switch the implementation to SA-IS, as it better aligns with the goals of this repository. Would you prefer I update the pull request to describe SA-IS instead of DC3, and I can add that? |
Ideally we'd rely on some benchmarks, for example implement both algorithms (in an as simple way as possible) and see how they fare on Library Checker. Then, if the performance turns out to be a close match, we should describe the one that is simpler to understand, otherwise we should probably focus on the one with better performance. |
Yes, let's do it then. |
I have updated the Pull Request with the latest changes as discussed |
It seems like this slipped through the cracks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, thanks for the pull request, and sorry for the late review!
I only reviewed part of it, but I think we should add some explanation, in text, of what constitutes the algorithm (not just tldr), and also some proper explanation on why it is linear, so I think it makes sense to return to review once it is done.
Also a lot of itemized/numerated lists misrender because of missing newlines before the first list item, which should be fixed.
Also, so that I better understand this pull request, could you please tell if you have used AI when making it (to which extent, if yes), and if the code implementation is your own?
|
||
### Understanding L-Type and S-Type Suffixes | ||
|
||
In the SA-IS algorithm, suffixes are classified as: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please check your suggestion against the preview. Lists require a newline before the first item, otherwise they misrender.
- **S-type (Right)**: A suffix is S-type if it is lexicographically smaller than or equal to the suffix immediately following it. | ||
|
||
For the string `banana$`, let's classify each suffix: | ||
1. Starting from the end, `$` is considered S-type by definition. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This also requires a newline before the first list item.
| 3 | `ana$` | S | | ||
| 2 | `nana$` | L | | ||
| 1 | `anana$`| S | | ||
| 0 | `banana$`| S | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think banana$
is actually larger than anana$
?
The `L` and `S` types provide crucial information for sorting suffixes using induced sorting. | ||
### Example: | ||
For the string `s = "banana"`, The steps were : | ||
1. The algorithm first classifies suffixes into L-type and S-type. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This misses the newline.
### Example: | ||
For the string `s = "banana"`, The steps were : | ||
1. The algorithm first classifies suffixes into L-type and S-type. | ||
2. It identifies LMS positions based on the L and S classifications. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Intro text says "the steps were", but the example above actually doesn't contain or explain any further steps.
if (n == 2) return (s[0] < s[1]) ? std::vector<int>{0, 1} : std::vector<int>{1, 0}; | ||
``` | ||
This part initializes the `sa_is` function, which constructs the suffix array for a given input vector `s`. It first handles edge cases: | ||
- If the string is empty (`n == 0`), it returns an empty array. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Newline needed before the first list item.
|
||
### 1. **Suffix Array Initialization and Base Cases** | ||
```cpp | ||
std::vector<int> sa_is(const std::vector<int>& s, int upper) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is upper
? It appears to be max possible character, but it's not explicitly stated anywhere.
if (!ls[i]) sum_s[s[i]]++; | ||
else sum_l[s[i] + 1]++; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't it be the other way around? ls[i] = true
2. **Identifying LMS Substrings**: | ||
- SA-IS identifies LMS (Leftmost S-type) positions, which are boundaries between L-type and S-type suffixes. LMS substrings are critical as they serve as anchor points for the sorting process. | ||
|
||
3. **Induced Sorting**: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we really need to add further explanation, in text, of how exactly induced sorting works, and why recursive calls would be linear.
3. **Induced Sorting**: | ||
- The algorithm first sorts LMS substrings recursively, then induces the order of the remaining suffixes. Sorting the LMS substrings is the key part of the algorithm, and once sorted, the remaining suffixes are arranged by their lexicographical order relative to LMS substrings. | ||
|
||
4. **Recursive Call for LMS Substrings**: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't this already included in the previous step?..
Hi @adamant-pwn ,
The previous version of suffix Array did not include the Linear time algorithm , I have modified the same using the Skew Algorithm to introduce the O(n) time complexity approach
It will benefit competitive programmers by optimizing time-critical problems.
Reference: Skew Algorithm Paper