Skip to content

Update suffix-array.md for Linear Time approach #1369

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

PeroxideParadox
Copy link

Hi @adamant-pwn ,

The previous version of suffix Array did not include the Linear time algorithm , I have modified the same using the Skew Algorithm to introduce the O(n) time complexity approach
It will benefit competitive programmers by optimizing time-critical problems.

Reference: Skew Algorithm Paper

@adamant-pwn
Copy link
Member

Thanks! I'm not very familiar with linear-time suffix array construction algorithms, but random places over the internet claim that SA-iS is simpler and faster than Kärkkäinen-Sanders, is that true? I wonder if we should describe SA-IS instead of DC3... 🤔

@PeroxideParadox
Copy link
Author

Thank you for the feedback! You're absolutely right , While the DC3 algorithm is efficient for large inputs and widely used when memory consumption is a constraint, I agree that SA-IS is simpler and faster in many practical scenarios, especially in competitive programming. Given this, I’m happy to switch the implementation to SA-IS, as it better aligns with the goals of this repository.

Would you prefer I update the pull request to describe SA-IS instead of DC3, and I can add that?

@adamant-pwn
Copy link
Member

Ideally we'd rely on some benchmarks, for example implement both algorithms (in an as simple way as possible) and see how they fare on Library Checker. Then, if the performance turns out to be a close match, we should describe the one that is simpler to understand, otherwise we should probably focus on the one with better performance.

@PeroxideParadox
Copy link
Author

on benchmarks SA-IS performs better than DC3 and is easier to implement as well
Screenshot 2024-10-26 at 2 55 12 PM

So let's switch to SA-IS and let me update the pull request
Is that good ?

@adamant-pwn
Copy link
Member

Yes, let's do it then.

@PeroxideParadox
Copy link
Author

I have updated the Pull Request with the latest changes as discussed

@mhayter mhayter closed this Apr 11, 2025
github-actions bot added a commit that referenced this pull request Apr 11, 2025
@mhayter mhayter reopened this Apr 11, 2025
@mhayter
Copy link
Contributor

mhayter commented Apr 11, 2025

It seems like this slipped through the cracks
Is this ready to go?

@mhayter mhayter requested a review from adamant-pwn April 11, 2025 23:23
Copy link
Member

@adamant-pwn adamant-pwn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, thanks for the pull request, and sorry for the late review!

I only reviewed part of it, but I think we should add some explanation, in text, of what constitutes the algorithm (not just tldr), and also some proper explanation on why it is linear, so I think it makes sense to return to review once it is done.

Also a lot of itemized/numerated lists misrender because of missing newlines before the first list item, which should be fixed.

Also, so that I better understand this pull request, could you please tell if you have used AI when making it (to which extent, if yes), and if the code implementation is your own?


### Understanding L-Type and S-Type Suffixes

In the SA-IS algorithm, suffixes are classified as:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please check your suggestion against the preview. Lists require a newline before the first item, otherwise they misrender.

- **S-type (Right)**: A suffix is S-type if it is lexicographically smaller than or equal to the suffix immediately following it.

For the string `banana$`, let's classify each suffix:
1. Starting from the end, `$` is considered S-type by definition.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also requires a newline before the first list item.

| 3 | `ana$` | S |
| 2 | `nana$` | L |
| 1 | `anana$`| S |
| 0 | `banana$`| S |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think banana$ is actually larger than anana$?

The `L` and `S` types provide crucial information for sorting suffixes using induced sorting.
### Example:
For the string `s = "banana"`, The steps were :
1. The algorithm first classifies suffixes into L-type and S-type.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This misses the newline.

### Example:
For the string `s = "banana"`, The steps were :
1. The algorithm first classifies suffixes into L-type and S-type.
2. It identifies LMS positions based on the L and S classifications.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Intro text says "the steps were", but the example above actually doesn't contain or explain any further steps.

if (n == 2) return (s[0] < s[1]) ? std::vector<int>{0, 1} : std::vector<int>{1, 0};
```
This part initializes the `sa_is` function, which constructs the suffix array for a given input vector `s`. It first handles edge cases:
- If the string is empty (`n == 0`), it returns an empty array.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Newline needed before the first list item.


### 1. **Suffix Array Initialization and Base Cases**
```cpp
std::vector<int> sa_is(const std::vector<int>& s, int upper) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is upper? It appears to be max possible character, but it's not explicitly stated anywhere.

Comment on lines +296 to +297
if (!ls[i]) sum_s[s[i]]++;
else sum_l[s[i] + 1]++;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't it be the other way around? ls[i] = true $\iff$ the suffix at $i$ is smaller than the suffix at $i+1$, so it means it is $S$-suffix?

2. **Identifying LMS Substrings**:
- SA-IS identifies LMS (Leftmost S-type) positions, which are boundaries between L-type and S-type suffixes. LMS substrings are critical as they serve as anchor points for the sorting process.

3. **Induced Sorting**:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we really need to add further explanation, in text, of how exactly induced sorting works, and why recursive calls would be linear.

3. **Induced Sorting**:
- The algorithm first sorts LMS substrings recursively, then induces the order of the remaining suffixes. Sorting the LMS substrings is the key part of the algorithm, and once sorted, the remaining suffixes are arranged by their lexicographical order relative to LMS substrings.

4. **Recursive Call for LMS Substrings**:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this already included in the previous step?..

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants