MAINT: mutual information using upstream KDTree #31347
Open
[...] `KDTree`. In particular, a few select use cases of the `sklearn` in-house `KDTree` by the mutual information infrastructure are replaced with the upstream `KDTree` from SciPy. The full test suite appears to pass locally.

If there is an appetite for this, we could spend some time on it (scientific-python/summit-2025#31). I've placed some more detailed discussion points below. This may also interest @sturlamolden.
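To make the kind of replacement concrete, here is a minimal sketch (not the actual diff in this PR) of a radius neighbor count under the Chebyshev metric done with SciPy's `KDTree.query_ball_point`. The helper names `count_within_radius` and `count_within_radius_brute` are hypothetical and exist only for illustration; the brute-force version is just a cross-check. I'm assuming the mutual information code needs per-point neighbor counts of this sort, which may not match every call site.

```python
import numpy as np
from scipy.spatial import KDTree


def count_within_radius(points, radii, workers=1):
    """Count, for each point, how many points lie within its radius.

    Uses the Chebyshev (max-norm) metric via p=np.inf. The count includes
    the query point itself. `workers` is SciPy's concurrency option.
    """
    tree = KDTree(points)
    counts = tree.query_ball_point(
        points, r=radii, p=np.inf, workers=workers, return_length=True
    )
    return np.asarray(counts)


def count_within_radius_brute(points, radii):
    """Brute-force reference (hypothetical helper, O(n^2) memory)."""
    cheb = np.abs(points[:, None, :] - points[None, :, :]).max(axis=-1)
    return (cheb <= radii[:, None]).sum(axis=1)


rng = np.random.default_rng(0)
points = rng.random((200, 2))
radii = np.full(points.shape[0], 0.1)

fast = count_within_radius(points, radii)
slow = count_within_radius_brute(points, radii)
```

Note that `query_ball_point` treats the radius as inclusive, so any call site that relies on a strict inequality would need to shrink the radius slightly before querying.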
More Detailed Analysis of the Situation
Potential Advantages of Doing This:
- The upstream `KDTree` has been heavily optimized and supports concurrency in many places that the `sklearn` version does not, as far as I can tell.
- [...] `KDTree` could focus their efforts on a single (or at least, fewer) implementation(s). This is the main one I had in mind.

Considerations, Drawbacks, Points that are not clear:
- [...] `KDTree` sources and offerings alone for quite some time to see how that goes first, progressively performing replacements in cases where there's an obvious benefit, or at least no loss of performance or functionality.
- The `KDTree` code is quite complex. The features offered by both libraries are not identical, and if you have a code base using two types of `KDTree` it may temporarily make things even more complex.
- [...] `asv` benchmarks for your `KDTree` implementation I don't think? We'd want to add that to avoid performance regressions I suspect, and/or to demonstrate improvements where SciPy adds the `workers`/concurrency option.
- `sklearn` uses a "base" `BinaryTree` that is "specialized" to `KDTree` and `BallTree` in an object-oriented style. SciPy doesn't have `BallTree` (should it?). How would this be coordinated if there was an appetite for upstreaming?
- [...] `sklearn` activity surrounding 32- and 64-bit "variations" or type preservation with the binary tree infrastructure? I'm not so sure how this would fly/work for SciPy.
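On the benchmarking point, a rough sketch of what an `asv`-style benchmark for the radius count could look like is below. The class name `KDTreeRadiusCount`, the parameter grid, and the radius are all hypothetical choices for illustration, not something that exists in either code base. It relies only on the usual `asv` conventions (`params`, `param_names`, `setup`, and `time_`-prefixed methods), so it also runs as a plain Python class.

```python
import numpy as np
from scipy.spatial import KDTree


class KDTreeRadiusCount:
    """Hypothetical asv-style benchmark for Chebyshev radius counts."""

    # asv runs every combination of these parameters.
    params = ([1_000, 10_000], [1, -1])
    param_names = ["n_samples", "workers"]

    def setup(self, n_samples, workers):
        # Build the data and the tree outside the timed region.
        rng = np.random.default_rng(0)
        self.points = rng.random((n_samples, 3))
        self.tree = KDTree(self.points)

    def time_query_ball_point(self, n_samples, workers):
        # Only the query is timed; workers=-1 uses all available cores.
        self.tree.query_ball_point(
            self.points, r=0.05, p=np.inf, workers=workers, return_length=True
        )


# Manual smoke run outside asv, with a small input.
bench = KDTreeRadiusCount()
bench.setup(500, 1)
bench.time_query_ball_point(500, 1)
```

A matching benchmark against the in-house tree, with the same data and radius, would give the side-by-side numbers needed to spot regressions or show gains from `workers`.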