Skip to content

bpo-41972: Use the "Two-Way" algorithm when searching for long substrings #22679

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 21 commits into from

Conversation

sweeneyde
Copy link
Member

@sweeneyde sweeneyde commented Oct 12, 2020

@sweeneyde sweeneyde changed the title bpo-41972: Use the "Two-Way" algorithm for substring search bpo-41972: (WIP) Use the "Two-Way" algorithm for substring search Oct 12, 2020
Copy link
Contributor

@eamanu eamanu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO you should send the PR when it is ready, to avoid maintain WIP PR for a long time

@tim-one
Copy link
Member

tim-one commented Oct 13, 2020

IMO you should send the PR when it is ready,, to avoid maintain WIP PR for a long time

Disagree in this case: this is a change to extremely important core functionality, so needs to be made as easy as possible for others to try. The changes are all in one file that's typically altered less than once per year (for example, the most recent change was an edit to the comments over a year ago - the most recent non-trivial change was in March of 2017).

@sweeneyde
Copy link
Member Author

sweeneyde commented Oct 14, 2020

The most recent batch of commits added a jump table.
Between master and this PR now, there are 151 cases slower than master and 463 that faster than master.
The slower cases are at most twice as slow, but the faster cases are often 10-20x faster.
I could add a cutoff to use a simpler algorithm instead, for needles of length less than ~10,
but I wanted to get the "purer" data out before making that change.

The benchmark data is here: https://pastebin.com/raw/bzQ4xQgM

@sweeneyde sweeneyde marked this pull request as ready for review October 17, 2020 20:01
@sweeneyde sweeneyde changed the title bpo-41972: (WIP) Use the "Two-Way" algorithm for substring search bpo-41972: Use the "Two-Way" algorithm for substring search Oct 17, 2020
@sweeneyde sweeneyde changed the title bpo-41972: Use the "Two-Way" algorithm for substring search bpo-41972: Use the "Two-Way" algorithm when searching for long substrings Oct 17, 2020
@taleinat
Copy link
Contributor

@sweeneyde, could you share how exactly you're running the benchmarks? I'd like to check this out myself :)

@sweeneyde
Copy link
Member Author

sweeneyde commented Oct 18, 2020

I used this static file containing a bunch of randomly generated strings:
https://gist.github.com/sweeneyde/f77ccf0d25f9c41d41163b11fe64e9b4
This file has:

  • One length-1_000_000 haystack
  • Needles with lengths ranging from length 1 to 10, then increasing by 50% + 1 up until 100_000
  • ~20 needles of each length
  • All were generated with this function:
import random
from string import ascii_uppercase as alphabet
zipf = [1/x for x in range(1, 1+26)]

def zipf_string(length):
    letters = random.choices(alphabet, weights=zipf, k=length)
    return ''.join(letters)

Then I ran the benchmarks with this script (string_benchmarks.py):

from lots_of_benches import needles, haystack
needles: list[str]
haystack: str

from pyperf import Runner
runner = Runner()

for needle in needles:
    n = len(needle)
    abbrev = needle if n <= 10 else f"{needle[:10]}..."
    runner.timeit(
        name=f"length={n}, value={abbrev}",
        stmt=f"needle in haystack",
        globals=globals(),
    )

Run as python string_benchmarks.py -o whatever.json

@pitrou
Copy link
Member

pitrou commented Oct 19, 2020

@maartenbreddels if you are interested in string algorithms, this may tick your curiosity.

@sweeneyde sweeneyde closed this Oct 19, 2020
@taleinat
Copy link
Contributor

@sweeneyde, why did you close this?

@sweeneyde
Copy link
Member Author

The source code was too heavily based on glibc's, which has a more restrictive license that forbids commercial use.

I opened this as a "clean-room" implementation instead:
#22904

@sweeneyde sweeneyde deleted the two-way branch December 19, 2021 05:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants