[WIP] allow nearest neighbors algorithm to be an estimator (v2) #8999
Conversation
Provide `kneighbors` and `radius_neighbors` methods on the BinaryTree classes. Also, ensure that an array of arrays is returned from `radius_neighbors`.
I've not looked again at the code, but I still think this is worthwhile. Not sure that https://github.com/erikbern/ann-benchmarks seems to adopt the latter, but I'd be inclined towards the former, maybe even renaming the methods in
Yeah, that's not ideal. The pros and cons of each API choice are below,
Will wait for more feedback and change the API accordingly. Thanks @jnothman !
I don't think NearestNeighbors(algorithm=x) is the application.
KNeighborsClassifier(algorithm=x), radius_neighbors_graph(algorithm=x),
etc. are. So I do get what you mean about wrapping something with something
similar. The problem is actually that KNeighborsClassifier shouldn't be
exposing kneighbors and radius_neighbors from an API purist perspective.
…On 6 June 2017 at 23:14, Roman Yurchak ***@***.***> wrote:
Not sure that query and query_radius are more appropriate than kneighbors
and radius_neighbors. There is an API inconsistency between the more public
NearestNeighbors family and the underlying BinaryTree. The former has
kneighbors and radius_neighbors, which return (dist, ind), and the latter
has query, query_radius returning (ind, dist). Yuck!
Yeah, that's not ideal. The pros and cons of each API choice are below:

1. *BinaryTree*-like API
   - (+) Same API as KDTree, BallTree and scipy.spatial.KDTree
   - (+) https://github.com/erikbern/ann-benchmarks seems to adopt the latter [this one]
   - (-) Not an estimator: it does not have a fit method, and fitting happens during init, which means that it couldn't be used directly as is in this PR. Using a BinaryTree-like API with a fit method would create yet a 3rd type of NN API...
2. *NearestNeighbors*-like API
   - (-) *NearestNeighbors* is the only class that uses it.
   - (-/+) If the custom algorithm is a NearestNeighbors-like object, on some level it feels strange and awkward for NearestNeighbors(algorithm=algorithm) to return a NearestNeighbors wrapper around a NearestNeighbors object. It would have been much simpler to have NearestNeighbors(algorithm=algorithm) return algorithm directly (which would involve adding a __new__ method <https://stackoverflow.com/a/3209240/1791279> to NearestNeighbors, and would allow simplifying this PR significantly, I think).
   - (+) +1 from Joel
Will wait for more feedback and change the API accordingly. Thanks
@jnothman <https://github.com/jnothman> !
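To make the two API styles under discussion concrete, here is a minimal pure-Python sketch. The class names are illustrative assumptions; only the method names and the return orders mirror the trade-off described above:

```python
import math

# Style 1: BinaryTree-like. Data is indexed at construction time, so there
# is no separate fit step -- this is what makes it "not an estimator".
class TreeStyleIndex:
    def __init__(self, X):
        self.X = X  # "fitting" happens here, during __init__

    def query_radius(self, x, r, return_distance=True):
        pairs = [(math.dist(x, p), i) for i, p in enumerate(self.X)]
        hits = [(d, i) for d, i in pairs if d <= r]
        ind = [i for d, i in hits]
        dist = [d for d, i in hits]
        # BinaryTree-style methods return (ind, dist) -- note the order
        return (ind, dist) if return_distance else ind

# Style 2: NearestNeighbors-like. A separate fit step (estimator
# semantics) and the opposite (dist, ind) return order.
class EstimatorStyleIndex:
    def fit(self, X):
        self.X_ = X
        return self

    def radius_neighbors(self, x, radius, return_distance=True):
        pairs = [(math.dist(x, p), i) for i, p in enumerate(self.X_)]
        hits = [(d, i) for d, i in pairs if d <= radius]
        dist = [d for d, i in hits]
        ind = [i for d, i in hits]
        # NearestNeighbors-style methods return (dist, ind)
        return (dist, ind) if return_distance else ind
```

The swapped return orders are exactly the inconsistency called out above ("Yuck!"): adopting either style for custom backends forces wrappers to reorder results from the other.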
Definitely, my point is that there are two ways of achieving the desired behavior. What do you think?
So in the current case, those would be accessible via
Looks like the failures on Travis are genuine.
@lesteve They are. I just wanted to discuss the general direction of this PR and fix the API before spending time on making everything work fully, particularly as there was talk in #8996 (comment) that @ogrisel or you were already working on related topics...
In any case, KNeighborsClassifier doesn't call NearestNeighbors internally
and so it would need to be handled separately.
But we're modifying NeighborsBase which affects both. I'm not sure what
you're trying to get at, except if you're talking about the special case of
someone calling NearestNeighbors(algorithm=MyNeighborFinder()) which I
think is not helpful in user code, and the indirection matters little in
library code (except insofar as validation is concerned: we should minimise
data validation in NeighborsBase when an alternative algorithm is
provided). Implementing __new__ is not a good solution to anything. Rather,
the role of NeighborsBase becomes handling parameters and initialising
attributes.
I could imagine NeighborsBase doing something like self.index_ =
_fit_algorithm(...) which returns a fitted object with kneighbors and
radius_neighbors. This may be a BallTree, a _BruteForceIndex, an LSHForest,
or a MissingValueNearestNeighbor (not yet sure if this is useful).
…On 7 June 2017 at 02:41, Roman Yurchak ***@***.***> wrote:
Looks like the failures on Travis are genuine.
@lesteve <https://github.com/lesteve> They are. I just wanted to discuss
the general direction of this PR and fix the API before spending time on
making everything work fully, particularly as there were talk in #8996
(comment)
<#8996 (comment)>
that @ogrisel <https://github.com/ogrisel> or you were already working on
related topics...
Do you want someone else to work on this, @rth? @SebastinSanty may be keen.
On 01/10/17 05:18, Joel Nothman wrote:
Realistically, I don't have much time for this PR at the moment, so anyone should feel free to take it over, as mentioned above in #8999 (comment). It may take some work. If not, I will try to find some time for it over the next few months. cc @vrishank97
Yes, this is what I'd been thinking of.
Our milestones are unfortunately fairly meaningless unless close to release. Release is not as soon as that deadline. This remains something good to have, and I suspect @rth will happily mentor the work, but it is not a priority. It's the kind of functionality where most users don't know what they're missing, and even once it's implemented it will be hard for all but advanced users to get into.
It would be nice to have. @thechargedneutron feel free to take the PR over. Please ping me if you want a review of the continued PR.
@rth Yes, I need a brief review of what has been done so far. It would be good if you could comment on this PR about what the particular patch does. I can't think of an easier way to convey the progress to me.
@thechargedneutron There isn't really much additional information beyond what is written in the description, the comments, and the parent PR. As far as I remember, the general status of this was prototyping, in the process of fixing unit tests. The API was still not fully settled and may need to be discussed again. I synced with master to make the changes more up to date. Hope this helps.
Ping @rth
```python
# check algorithm is queried and parameter is passed
dist, ind = est.radius_neighbors([[0]], radius=radius,
                                 return_distance=True)
assert_array_equal(dist[0][0], radius)
```
How is it supposed to be equal to `radius`? Doesn't `dist[0][0]` mean the minimum distance between `[[0]]` and `[[0], [1]]`, provided `radius` is bigger than that distance?
That part came from the parent PR. This was just intended to be a consistency test, I think. If some part of it doesn't make sense to you, you should change it )
```python
def fit(self, X):
    self._fitted = True

def query_radius(self, X, r, return_distance=True):
```
This function is not using the parameters passed to the `fit` function. How will it find the points from `fit`? It would be okay if you described the motivation behind this function and the assert condition below (the other reviewed part).
@thechargedneutron Sorry for the late reply. Yes, you are right it should take an X from fit. The idea was to prototype the simplest possible example of a custom NN search class with an API similar to KDTree.
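Along those lines, a hedged sketch of "the simplest possible example of a custom NN search class with an API similar to KDTree", corrected so that `fit` actually retains the data that `query_radius` then searches (the class name and implementation details are illustrative, not code from the PR):

```python
import math

class CustomRadiusSearch:
    """Smallest possible custom NN search with a KDTree-like API."""

    def fit(self, X):
        self.X_ = X          # retain the training points for querying
        self._fitted = True
        return self

    def query_radius(self, X, r, return_distance=True):
        # Brute-force search over the data stored by fit; returns one
        # (ind, dist) pair per query point, mirroring BinaryTree's
        # array-of-arrays output.
        all_ind, all_dist = [], []
        for x in X:
            pairs = [(i, math.dist(x, p)) for i, p in enumerate(self.X_)]
            hits = [(i, d) for i, d in pairs if d <= r]
            all_ind.append([i for i, d in hits])
            all_dist.append([d for i, d in hits])
        return (all_ind, all_dist) if return_distance else all_ind
```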
Note there is also discussion at #10463 about whether enabling the use of nearest neighbors as a transformer, together with sparse precomputed distance matrices, means we don't need this functionality most of the time.
I should note that the conversation around #10463 suggests we may not even need this. Allowing nearest neighbours estimators to transform into a sparse nearest neighbourhood graph should be sufficient for most applications.
Agreed, that discussion/work looks quite interesting. (My response to comments wasn't intended to imply that this PR is still needed.)
Thanks for letting me know that this issue needs no further work. :)
Other solution merged in #10482 |
This PR continues work in PR #3922 and aims to allow the `algorithm` parameter of `NearestNeighbors` to be an estimator. For instance, the prototype could be,
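As a hedged sketch of what such a backend object might look like (the class name `AnnWrapper` and its method signatures are illustrative assumptions, not code from this PR):

```python
import math

class AnnWrapper:
    """Illustrative custom backend exposing the interface the PR expects
    from an `algorithm` object: a fit step, then query methods."""

    def fit(self, X):
        self.X_ = X
        return self

    def kneighbors(self, X, n_neighbors):
        # Brute-force k-NN; a real wrapper would delegate to an ANN library.
        dists, inds = [], []
        for x in X:
            pairs = sorted((math.dist(x, p), i) for i, p in enumerate(self.X_))
            d, i = zip(*pairs[:n_neighbors])
            dists.append(list(d))
            inds.append(list(i))
        return dists, inds

# Intended usage under this PR (hypothetical -- not valid in released
# scikit-learn, where `algorithm` only accepts the built-in strings):
#   nn = NearestNeighbors(algorithm=AnnWrapper()).fit(X)
#   dist, ind = nn.kneighbors(X_query, n_neighbors=5)
```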
Since DBSCAN, LocallyLinearEmbedding, KNeighborsClassifier, RadiusNeighborsClassifier, Isomap and KNeighborsRegressor rely on nearest neighbors, such a custom `algorithm` backend could also be used there. A direct application is that external ANN implementations could be used with scikit-learn methods by writing a wrapper around their API. The illustration could be done either with a wrapper around brute-force nearest neighbors, or with some popular ANN library such as annoy.
While the original PR #3922 focused on illustrating the usage of a custom backend with LSHForest, that's no longer the case here, since LSHForest is to be deprecated (cf. the discussion in #8996).
This is still work in progress and several tests fail. I opened this PR to make sure that such an approach would be compatible with other work in progress on adding ANN to scikit-learn. If there are some conflicts, please let me know so I can work on something else instead.
@jnothman @lesteve @ogrisel @amueller What do you think?