gh-106865: Improve difflib.find_longest_match performance with cache #106877
`difflib.SequenceMatcher` does not perform well on large input. This patch tries to improve it with some micro-optimizations on the hot path.

When dealing with large inputs, for example large files, almost all of the time is spent in `find_longest_match`, and the hottest code is the nested for loop, shown below.
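For reference, this is roughly that loop as it appears in CPython's `difflib` (paraphrased from the stdlib source, with the surrounding setup omitted):

```python
# The hot nested loop in find_longest_match, paraphrased from CPython's
# difflib. b2j maps each element of b to the sorted list of indices where
# it occurs; j2len[j] holds the length of the longest match ending at
# a[i-1], b[j]; nothing is a pre-bound empty list.
for i in range(alo, ahi):
    j2lenget = j2len.get
    newj2len = {}
    for j in b2j.get(a[i], nothing):
        # a[i] matches b[j]
        if j < blo:
            continue
        if j >= bhi:
            break
        k = newj2len[j] = j2lenget(j - 1, 0) + 1
        if k > bestsize:
            besti, bestj, bestsize = i - k + 1, j - k + 1, k
    j2len = newj2len
```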
The loop body executes `(ahi - alo) * len(b2j[a[i]])` times, so any optimization inside it could be a large improvement.

The patch runs the same loop the first time a given `a[i]` is encountered, except that it also builds a cache of the valid part of the `b2j` list, i.e. the positions `j` inside `range(blo, bhi)`. If `a[i]` has appeared before, the cached list is used instead, and the two range checks (`j < blo`, `j >= bhi`) can be removed.
Also, if `a[i]` has appeared before and the cached list is non-empty, `bestsize` is already at least `1`, so the `if k > bestsize` check is only needed when `k` is greater than `1`, and that is exactly the case already distinguished by using an `if` to check whether `j - 1` is in the dict. In this way, we eliminate plenty of code on the cached path.
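A minimal sketch of the idea (hypothetical variable names; the actual patch differs in details):

```python
# Hypothetical sketch of the caching scheme inside find_longest_match;
# b2j_cache and the exact control flow are illustrative, not the patch
# verbatim. Setup (besti, bestj, bestsize, j2len, b2j) is as in the stdlib.
b2j_cache = {}  # a[i] -> j positions already filtered to range(blo, bhi)
for i in range(alo, ahi):
    newj2len = {}
    elt = a[i]
    if elt in b2j_cache:
        # Cached path: the blo/bhi range checks were done when the cache
        # was built, and bestsize >= 1 whenever this list is non-empty,
        # so a fresh k == 1 match can never beat it.
        for j in b2j_cache[elt]:
            if j - 1 in j2len:
                k = newj2len[j] = j2len[j - 1] + 1
                if k > bestsize:
                    besti, bestj, bestsize = i - k + 1, j - k + 1, k
            else:
                newj2len[j] = 1
    else:
        # First occurrence: run the original loop and record the valid j's.
        cache = b2j_cache[elt] = []
        j2lenget = j2len.get
        for j in b2j.get(elt, []):
            if j < blo:
                continue
            if j >= bhi:
                break
            cache.append(j)
            k = newj2len[j] = j2lenget(j - 1, 0) + 1
            if k > bestsize:
                besti, bestj, bestsize = i - k + 1, j - k + 1, k
    j2len = newj2len
```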
This optimization is significant when the cached path is hit frequently, i.e. when the items in `a` repeat frequently. That is the case for any large file (especially files with a small vocabulary, such as ASCII text). I tried this by hand on a couple of files: a C source file and a README file.
Benchmark code:
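Roughly along these lines (a reconstruction, not the original snippet; the file names and the timing harness are assumptions):

```python
# Hypothetical benchmark harness; the inputs and repetition count are
# assumptions, not the PR's original snippet.
import difflib
import time

def bench(path_a, path_b, runs=5):
    with open(path_a) as f:
        a = f.read()
    with open(path_b) as f:
        b = f.read()
    best = float("inf")
    for _ in range(runs):
        sm = difflib.SequenceMatcher(None, a, b)
        start = time.perf_counter()
        sm.get_matching_blocks()  # time dominated by find_longest_match
        best = min(best, time.perf_counter() - start)
    return best

if __name__ == "__main__":
    print(bench("before.c", "after.c"))
```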
The improvement in time spent was significant (before/after timing screenshots omitted).
This patch will have a small negative performance impact on smaller test cases and on cases where none of the items in `a` repeat. For example, it is significantly slower in `test_recursion_limit` (timings omitted). However, in real-world scenarios it does not make much sense to diff/match two sequences with a giant vocabulary, so in almost all actual use cases this should improve performance, especially on large sequences.
We can, however, check the vocabulary ratio (`len(set(a)) / len(a)`) and use it to decide whether to apply this optimization: extra overhead, but potentially better performance in the worst case.
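A hypothetical guard along those lines (the cutoff value is an assumption):

```python
# Hypothetical heuristic; the 0.5 cutoff is an assumption, not from the patch.
def should_use_cache(a):
    # A vocabulary small relative to the sequence length means items
    # repeat often, so the cached path would be hit frequently.
    return len(set(a)) / len(a) < 0.5
```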
P.S. Changing `cache.append` to a pre-bound `cacheappend = cache.append` does not have an observable impact, but I can do that if that's needed.
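For clarity, that micro-optimization would look like this (illustrative only):

```python
# Hoisting the bound method saves one attribute lookup per iteration;
# as noted above, the measured impact here was negligible.
cache = []
cacheappend = cache.append  # bind once, outside the hot loop
for j in range(1000):
    cacheappend(j)
```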