Skip to content

MAINT: mutual information using upstream KDTree #31347

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

tylerjereddy
Copy link
Contributor

@tylerjereddy tylerjereddy commented May 9, 2025

  • This patch aims to test the waters for reducing community duplication of effort/maintenance with KDTree. In particular, a few select use cases of the sklearn in-house KDTree by the mutual information infrastructure are replaced with upstream KDTree from SciPy. The full test suite appears to pass locally.

If there is an appetite for this, we could spend some time on it (scientific-python/summit-2025#31). I've placed some more detailed discussion points below. This may also interest @sturlamolden.


More detailed Analysis of the Situation

Potential Advantages of Doing This:

  1. The SciPy version of KDTree has been heavily optimized and supports concurrency in many places that the sklearn version does not, as far as I can tell.
  2. Reduced duplication of effort--community folks who are experts on KDTree could focus their efforts on a single (or at least, less) implementation(s). This is the main one I had in mind.

Considerations, Drawbacks, Points that are not clear:

  1. An obvious question is what "replacing" would even mean and how far it would go--we could just replace a few obvious cases to start, like here, but leave the in-house KDTree sources and offerings alone for quite some time to see how that goes first, progressively performing replacements in cases where there's an obvious benefit, or at least no loss of performance or functionality.
  2. The switchover would require reviewer bandwidth, and KDTree code is quite complex. The features offered by both libraries are not identical, and if you have a code base using two types of KDTree it may temporarily make things even more complex.
  3. (minor) you don't have asv benchmarks for your KDTree implementation I don't think? We'd want to add that to avoid performance regressions I suspect, and/or to demonstrate improvements where SciPy adds the workers/concurrency option.
  4. A design challenge is that sklearn uses a "base" BinaryTree that is "specialized" to KDTree and BallTree in an object-oriented style. SciPy doesn't have BallTree (should it?)--how would this be coordinated if there was an appetite for upstreaming?
  5. There appears to be some sklearn activity surrounding 32- and 64-bit "variations" or type preservation with the binary tree infrastructure? I'm not so sure how this would fly/work for SciPy.
  6. Compatibility with pipelines may require quite a bit more thought when upstreaming.

Copy link

github-actions bot commented May 9, 2025

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: da64dc8. Link to the linter CI: here

ny = kd.query_radius(y, radius, count_only=True, return_distance=False)
kd = KDTree(y)
ny = kd.query_ball_point(y, radius, p=np.inf)
ny = [len(sub_list) for sub_list in ny]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here and elsewhere in this diff, we can do even better (see below). It seems that the two APIs have effectively developed via convergent evolution to offer similar options with different names.

     kd = KDTree(y)
-    ny = kd.query_ball_point(y, radius, p=np.inf)
-    ny = [len(sub_list) for sub_list in ny]
+    ny = kd.query_ball_point(y, radius, p=np.inf, return_length=True)

* This patch aims to test the waters for reducing
community duplication of effort/maintenance with
KDTree. In particular, a few select use cases
of the `sklearn` in-house `KDTree` by the mutual information
infrastructure are replaced with upstream `KDTree` from SciPy.
The full test suite appears to pass locally.
@tylerjereddy tylerjereddy force-pushed the treddy_kdtree_txn_1 branch from d30b4fc to da64dc8 Compare May 12, 2025 00:01
@tylerjereddy
Copy link
Contributor Author

I applied the simplification mentioned at #31347 (comment) and tests were still happy locally.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant