Skip to content

BUG: fix KD Tree node construction #11556

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Jul 19, 2018
Merged

Conversation

jakevdp
Copy link
Member

@jakevdp jakevdp commented Jul 16, 2018

Fixes #11313

The issue is in pre-computation of KD Tree node radii, which are used during querying to rank the queue of nodes to be split & explored. As such, I think the main effect of this bug will be related to performance – this is why our tests didn't catch it.

@agramfort
Copy link
Member

do you have a tiny bench script to share?

@jakevdp
Copy link
Member Author

jakevdp commented Jul 16, 2018

do you have a tiny bench script to share?

No, I could create one if it would be useful

@agramfort
Copy link
Member

agramfort commented Jul 16, 2018 via email

@jakevdp
Copy link
Member Author

jakevdp commented Jul 16, 2018

Here's a quick test:

from sklearn.datasets import make_blobs
from sklearn.neighbors import KDTree

import sklearn
print(sklearn.__version__)

X, y = make_blobs(n_samples=100000, n_features=3)

X1 = X[:50000]
X2 = X[50000:]

%timeit KDTree(X1).query(X2, 1)

Current release:

0.19.1
151 ms ± 1.66 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

On this branch:

0.20.dev0
136 ms ± 2.28 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

@jakevdp
Copy link
Member Author

jakevdp commented Jul 16, 2018

Exploring a bit more... the interesting thing is that the use of the correct radius does not seem to appreciably impact the query time... as currently written, it ends up being essentially a proxy for number of points in the node.

We could probably explicitly rank query nodes by number of points rather than by radius and marginally improve both memory use (from not having to store each node's radius) as well as tree construction time (from not having to compute each node's radius), without an appreciable impact on query time in most cases... but that's probably for another PR.

from sklearn.datasets import make_blobs
from sklearn.neighbors import KDTree

import sklearn
print(sklearn.__version__)

X, y = make_blobs(n_samples=100000, n_features=3)

X1 = X[:50000]
X2 = X[50000:]

%timeit KDTree(X1)

tree = KDTree(X1)
%timeit tree.query(X2, 5)
0.19.1
32.4 ms ± 920 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
214 ms ± 14.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
0.20.dev0
14.9 ms ± 550 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
214 ms ± 3.28 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

@jnothman
Copy link
Member

Appveyor had failed. I've restarted.

@agramfort
Copy link
Member

thx @jakevdp

@jnothman can you reastart appveyor here?

@agramfort
Copy link
Member

@jakevdp can you just add a what's new entry? thx

jakevdp added a commit to jakevdp/scikit-learn that referenced this pull request Jul 17, 2018
Copy link
Member

@rth rth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks a lot for this fix @jakevdp !

I'm not sure if it's worth waiting for Appveyor or merge as it is, given the issues we have been having lately with it.

@ogrisel
Copy link
Member

ogrisel commented Jul 17, 2018

This PR is soon to be processed by appveyor, let's wait a bit.

@agramfort
Copy link
Member

can someone restart appveyor here?

@agramfort
Copy link
Member

rebased to restart CIs

@rth
Copy link
Member

rth commented Jul 19, 2018

Thanks a lot for this fix @jakevdp ! Appveyor passed, merging.

BTW, if you have an opinion on #11103 and #597 (comment) it would be very much appreciated.

@rth rth merged commit 73fc2e4 into scikit-learn:master Jul 19, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants