gh-106865: Improve difflib.find_longest_match performance with cache #106877
Conversation
@tim-one I know this is probably ancient code to you :) git blames you for working on the code 23 years ago. Could you maybe share some thoughts on the optimization? Thanks!
To the contrary, comparing lists of lines is by far the most common - and most important intended - use case, where a "letter in the alphabet" is the entire content of a line. I don't know why you would want to compare text files as sequences of characters instead, but "so don't do that" comes to mind 😉. Two common uses for comparing sequences of characters do come to mind, though.

So the patch looks good, but I'm skeptical that it addresses a "real problem" that's common enough to be worth any slowdown in the actual common cases. What would be welcome is a different algorithm specifically designed for "long sequences from a small alphabet" cases.
BTW, if you want to pursue this, there's an "optimization" I never bothered with because I didn't expect it would pay in the common cases (which I apparently have a different idea of than you have 😉): the one-at-a-time comparisons of the `j`s against `blo` and `bhi` can be replaced with bisection:

```python
from bisect import bisect_left

js = b2j[a[i]]  # this is assuming we know a[i] is in the dict
lo = bisect_left(js, blo)
hi = bisect_left(js, bhi, lo)
if lo or hi < len(js):  # or maybe faster to always do the slice; it's correct either way
    js = js[lo : hi]
```

That finds all the `j`s in range efficiently regardless of how many are out of range, and, regardless of how many are in range, materializes them in a list "at C speed" instead of appending one at a time "at Python speed". The result is the value of your cache.
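To illustrate the bisection trick with concrete, made-up values - `js` here stands in for one `b2j` entry (a sorted list of positions in `b`):

```python
from bisect import bisect_left

js = [2, 5, 9, 14, 21, 30]     # sorted positions of some element in b
blo, bhi = 5, 21               # only j's with blo <= j < bhi are usable

lo = bisect_left(js, blo)      # first index whose value is >= blo
hi = bisect_left(js, bhi, lo)  # first index whose value is >= bhi
print(js[lo:hi])               # [5, 9, 14]
```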
Okay, I might have misunderstood the intended usage. If we do not expect duplicate items, then even the bisection may not be able to compensate for the extra cost of creating the lists and doing the bisection, right? We might still slow down the cases where there are no duplicates. I'll leave the decision to you; I'm okay if things stay the way they are. If you think this could be helpful when users pass strings as the sequences instead of lists of strings, I'll do the bisect optimization.
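To make that worry concrete, here is a tiny hypothetical micro-benchmark (not from the PR) comparing the two filtering strategies when every `b2j` entry holds a single index, i.e. nothing in `b` repeats:

```python
from bisect import bisect_left
from timeit import timeit

js = [7]          # typical b2j entry when nothing in b repeats
blo, bhi = 0, 100

def one_at_a_time():
    # current approach: check each j against the range
    return [j for j in js if blo <= j < bhi]

def bisected():
    # proposed approach: locate the range boundaries, then slice
    return js[bisect_left(js, blo):bisect_left(js, bhi)]

print(timeit(one_at_a_time, number=200_000))
print(timeit(bisected, number=200_000))
```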
`difflib.SequenceMatcher` does not perform well on large input. This patch tries to improve it with some micro-optimizations on the hot path.

When dealing with large inputs, for example large files, almost all of the time is spent in `find_longest_match`, and the hottest code is its nested `for` loop. The body of the inner loop executes on the order of `(ahi - alo) * len(b2j[a[i]])` times, so any optimization inside the loop could be a large improvement.
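For reference, here is roughly the loop in question, condensed from CPython's `Lib/difflib.py` (junk handling and surrounding setup are omitted; the wrapper function is added here for self-containment):

```python
# Hot loop of SequenceMatcher.find_longest_match, condensed from
# CPython's Lib/difflib.py.
def _find_longest_match_core(a, b2j, alo, ahi, blo, bhi):
    besti, bestj, bestsize = alo, blo, 0
    j2len = {}
    for i in range(alo, ahi):
        j2lenget = j2len.get
        newj2len = {}
        for j in b2j.get(a[i], []):
            # a[i] matches b[j]
            if j < blo:
                continue
            if j >= bhi:
                break
            k = newj2len[j] = j2lenget(j - 1, 0) + 1
            if k > bestsize:
                besti, bestj, bestsize = i - k + 1, j - k + 1, k
        j2len = newj2len
    return besti, bestj, bestsize
```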
The patch runs the same loop the first time a given `a[i]` is encountered, except that it also builds a cache of the valid `b2j` entries - the list of `j`s inside `range(blo, bhi)`. If `a[i]` has appeared before, the cached list is used instead, and the two `if` range checks can be dropped.

Also, if `a[i]` has appeared before and there is at least one valid `j` in the cached list, `bestsize` is already at least `1` - so the `if k > bestsize` check only needs to happen when `k` is greater than `1`, and that is already established if we use an `if` to check whether `j-1` is in the dict.

In this way, we eliminate plenty of code on the cached path (see the sketch below). The optimization is significant when the cached path is hit frequently, i.e. when the items in `a` repeat frequently. That is the case for any large file (especially files with a small vocabulary, like ASCII text).

I tried this by hand on a couple of files - a C source file and a README file.
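A minimal sketch of the cached fast path described above (the function name and structure are mine; this is not the exact patch):

```python
def cached_longest_match(a, b, alo, ahi, blo, bhi):
    # Build b2j as SequenceMatcher does: element -> sorted indices in b.
    b2j = {}
    for j, item in enumerate(b):
        b2j.setdefault(item, []).append(j)

    cache = {}  # a[i] -> list of j's already filtered to blo <= j < bhi
    besti, bestj, bestsize = alo, blo, 0
    j2len = {}
    for i in range(alo, ahi):
        newj2len = {}
        js = cache.get(a[i])
        if js is None:
            # First occurrence of this item: do the range checks once
            # and remember the surviving j's.
            js = cache[a[i]] = [j for j in b2j.get(a[i], ())
                                if blo <= j < bhi]
        for j in js:
            # Cached path: no blo/bhi checks needed here.
            k = newj2len[j] = j2len.get(j - 1, 0) + 1
            if k > bestsize:
                besti, bestj, bestsize = i - k + 1, j - k + 1, k
        j2len = newj2len
    return besti, bestj, bestsize

# e.g. cached_longest_match(list("abcab"), list("abc"), 0, 5, 0, 3)
```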
benchmark code
The improvement in time spent was significant:
This patch has a small negative performance impact on smaller inputs, and on cases where none of the items in `a` repeat. For example, it is significantly slower in `test_recursion_limit`. However, in real-world scenarios it rarely makes sense to diff/match two sequences with a giant vocabulary, so in almost all actual use cases this should improve performance, especially on large sequences.
We could, however, check the vocabulary ratio (`len(set(a)) / len(a)`) and use it to decide whether to apply this optimization - extra overhead, but potentially better performance in the worst case (see the sketch below).

P.S. Changing `cache.append` to `cacheappend = cache.append` does not have an observable impact, but I can do that if needed.
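A sketch of that vocabulary check (the function name and threshold are arbitrary assumptions, not part of the patch):

```python
def should_cache(a, threshold=0.5):
    # Small vocabulary relative to sequence length means items repeat
    # a lot, so the cached fast path would be hit frequently.
    return len(set(a)) / len(a) < threshold

# e.g. should_cache(open("large_file.c").readlines()) -> likely True
```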