MAINT: mutual information using upstream KDTree #31347

tylerjereddy · 2025-05-09T23:16:18Z

This patch aims to test the waters for reducing community duplication of effort/maintenance with KDTree. In particular, a few select use cases of the sklearn in-house KDTree by the mutual information infrastructure are replaced with upstream KDTree from SciPy. The full test suite appears to pass locally.

If there is an appetite for this, we could spend some time on it (scientific-python/summit-2025#31). I've placed some more detailed discussion points below. This may also interest @sturlamolden.

More Detailed Analysis of the Situation

Potential Advantages of Doing This:

The SciPy version of KDTree has been heavily optimized and supports concurrency in many places that the sklearn version does not, as far as I can tell.
Reduced duplication of effort--community folks who are experts on KDTree could focus their efforts on a single implementation (or at least on fewer of them). This is the main advantage I had in mind.
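To make the concurrency point concrete, here is a minimal sketch of the `workers` argument on SciPy's `KDTree.query`. The data, sizes, and variable names are illustrative assumptions and are not taken from this patch; the sketch only shows that the serial and multi-threaded calls return the same distances.

```python
import numpy as np
from scipy.spatial import KDTree

# Illustrative random data; not taken from the sklearn test suite.
rng = np.random.default_rng(0)
points = rng.random((2000, 3))

tree = KDTree(points)

# k=4 nearest neighbours under the Chebyshev metric (p=inf), single thread.
dist_serial, idx_serial = tree.query(points, k=4, p=np.inf, workers=1)

# Same query, with workers=-1 asking SciPy to use all available CPU threads.
dist_parallel, idx_parallel = tree.query(points, k=4, p=np.inf, workers=-1)

# Each query point is processed independently, so the results should agree.
print(np.allclose(dist_serial, dist_parallel))
```

Whether the threaded call is actually faster depends on problem size and hardware, so it would need to be measured rather than assumed.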

Considerations, Drawbacks, Points that are not clear:

An obvious question is what "replacing" would even mean and how far it would go. We could start by replacing a few obvious cases, as in this patch, and leave the in-house KDTree sources and offerings alone for quite some time to see how that goes. From there we could progressively perform replacements in cases where there's an obvious benefit, or at least no loss of performance or functionality.
The switchover would require reviewer bandwidth, and KDTree code is quite complex. The features offered by both libraries are not identical, and if you have a code base using two types of KDTree it may temporarily make things even more complex.
(minor) As far as I can tell, you don't have asv benchmarks for your KDTree implementation. I suspect we'd want to add them to avoid performance regressions, and/or to demonstrate improvements where SciPy adds the workers/concurrency option.
A design challenge is that sklearn uses a "base" BinaryTree that is "specialized" to KDTree and BallTree in an object-oriented style. SciPy doesn't have BallTree (should it?)--how would this be coordinated if there was an appetite for upstreaming?
There appears to be some sklearn activity surrounding 32- and 64-bit "variations" or type preservation with the binary tree infrastructure? I'm not so sure how this would fly/work for SciPy.
Compatibility with pipelines may require quite a bit more thought when upstreaming.
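As a rough idea of what the asv benchmark mentioned above could look like, here is a hedged sketch following the usual asv conventions (`params`, `param_names`, `setup`, and `time_*` methods). The class name, parameter values, and data shapes are hypothetical and do not exist in either code base.

```python
import numpy as np
from scipy.spatial import KDTree


class KDTreeBallCountSuite:
    """Hypothetical asv suite; the name and sizes are illustrative only."""

    # asv runs every time_* method once per combination of these parameters.
    params = ([1_000, 10_000], [1, -1])
    param_names = ["n_samples", "workers"]

    def setup(self, n_samples, workers):
        # Build the tree outside the timed region.
        rng = np.random.default_rng(0)
        self.y = rng.random((n_samples, 1))
        self.radius = np.full(n_samples, 0.01)  # one radius per query point
        self.tree = KDTree(self.y)

    def time_query_ball_point_count(self, n_samples, workers):
        # Count neighbours within each radius under the Chebyshev metric.
        self.tree.query_ball_point(
            self.y, self.radius, p=np.inf, return_length=True, workers=workers
        )
```

Parametrizing over `workers` would let the same suite guard against regressions and show where the concurrency option helps.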

github-actions · 2025-05-09T23:17:12Z

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: da64dc8.

tylerjereddy · 2025-05-10T22:03:13Z

sklearn/feature_selection/_mutual_info.py

-    ny = kd.query_radius(y, radius, count_only=True, return_distance=False)
+    kd = KDTree(y)
+    ny = kd.query_ball_point(y, radius, p=np.inf)
+    ny = [len(sub_list) for sub_list in ny]


Here and elsewhere in this diff, we can do even better (see below). It seems that the two APIs have effectively developed via convergent evolution to offer similar options with different names.

     kd = KDTree(y)
-    ny = kd.query_ball_point(y, radius, p=np.inf)
-    ny = [len(sub_list) for sub_list in ny]
+    ny = kd.query_ball_point(y, radius, p=np.inf, return_length=True)
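The following sketch checks that the two SciPy spellings agree, using made-up data rather than the arrays from `_mutual_info.py`. It assumes a SciPy version in which `query_ball_point` accepts `return_length`, and it compares both results against a brute-force Chebyshev count.

```python
import numpy as np
from scipy.spatial import KDTree

# Illustrative data: 200 one-dimensional samples, one radius per query point.
rng = np.random.default_rng(0)
y = rng.random((200, 1))
radius = rng.uniform(0.01, 0.1, size=200)

kd = KDTree(y)

# First version from the diff: materialize neighbour lists, then take lengths.
neighbors = kd.query_ball_point(y, radius, p=np.inf)
ny_from_lists = np.array([len(sub_list) for sub_list in neighbors])

# Simplified version: ask SciPy for the counts directly.
ny_direct = kd.query_ball_point(y, radius, p=np.inf, return_length=True)

# Brute-force reference: count points with Chebyshev distance <= radius.
chebyshev = np.abs(y[:, None, :] - y[None, :, :]).max(axis=-1)
ny_brute = (chebyshev <= radius[:, None]).sum(axis=1)

print(np.array_equal(ny_from_lists, ny_direct))
print(np.array_equal(ny_direct, ny_brute))
```

Each count includes the query point itself, since every point lies within its own ball.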


tylerjereddy · 2025-05-12T00:02:25Z

I applied the simplification mentioned at #31347 (comment) and tests were still happy locally.

github-actions bot added the module:feature_selection label May 9, 2025

tylerjereddy force-pushed the treddy_kdtree_txn_1 branch from 90d9ed9 to d30b4fc Compare May 9, 2025 23:19

tylerjereddy mentioned this pull request May 9, 2025

Reducing Ecosystem Duplication scientific-python/summit-2025#31

Open


tylerjereddy force-pushed the treddy_kdtree_txn_1 branch from d30b4fc to da64dc8 Compare May 12, 2025 00:01
