-
-
Notifications
You must be signed in to change notification settings - Fork 32.1k
bpo-41972: Use the "Two-Way" algorithm when searching for long substrings #22679
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO you should send the PR when it is ready, to avoid maintain WIP PR for a long time
Disagree in this case: this is a change to extremely important core functionality, so needs to be made as easy as possible for others to try. The changes are all in one file that's typically altered less than once per year (for example, the most recent change was an edit to the comments over a year ago - the most recent non-trivial change was in March of 2017). |
The most recent batch of commits added a jump table. The benchmark data is here: https://pastebin.com/raw/bzQ4xQgM |
@sweeneyde, could you share how exactly you're running the benchmarks? I'd like to check this out myself :) |
I used this static file containing a bunch of randomly generated strings:
import random
from string import ascii_uppercase as alphabet
zipf = [1/x for x in range(1, 1+26)]
def zipf_string(length):
letters = random.choices(alphabet, weights=zipf, k=length)
return ''.join(letters) Then I ran the benchmarks with this script ( from lots_of_benches import needles, haystack
needles: list[str]
haystack: str
from pyperf import Runner
runner = Runner()
for needle in needles:
n = len(needle)
abbrev = needle if n <= 10 else f"{needle[:10]}..."
runner.timeit(
name=f"length={n}, value={abbrev}",
stmt=f"needle in haystack",
globals=globals(),
) Run as |
@maartenbreddels if you are interested in string algorithms, this may tick your curiosity. |
@sweeneyde, why did you close this? |
The source code was too heavily based on glibc's, which has a more restrictive license that forbids commercial use. I opened this as a "clean-room" implementation instead: |
https://bugs.python.org/issue41972