Skip to content

[WIP] allow nearest neighbors algorithm to be an estimator (v2) #8999

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 9 commits into from

Conversation

rth
Copy link
Member

@rth rth commented Jun 6, 2017

This PR continues work in PR #3922 and aims to allow the algorithm parameter of NearestNeighbors to be an estimator.

For instance, the prototype could be,

class MyKNNAlgorithm(object):
     def fit(self, X): 
         self._fitted = True
  
     def query(self, X, k, return_distance=True)
         # optional method
         # [some KNN implementation]
         return ind, dist
     
    
     def query_radius(self, X, r, return_distance=True):
         # optional method
         # [some radius KNN implementation]
         return ind, dist

nn = NearestNeighbors(n_neighbors=1, algorithm=MyKNNAlgorithm())

Since DBSCAN, LocallyLinearEmbedding, KNeighborsClassifier, RadiusNeighborsClassifier, Isomap, KNeighborsRegressor rely on nearest neighbors, such custom algorithm backend could also be used there.

A direct application, is that external ANN implementations could be used with scikit-learn methods, by writing a wrapper around their API. The illustration could be done either with a wrapper around brute force nearest neighbors, or some popular ANN library such as annoy.

While the original PR #3922 focused on illustrating the usage of custom backend with LSHForest, it's no longer the case here, since LSHForest would be deprecated (cf discussion in #8996 )

This is still work in progress and several tests fail. I opened this PR to make sure that such approach would be compatible with other work in progress on adding ANN to scikit-learn. If there are some conflicts please let me know, so I can work on something else instead.

@jnothman @lesteve @ogrisel @amueller What do you think?

@jnothman
Copy link
Member

jnothman commented Jun 6, 2017

I've not looked again at code, but I still think this is worthwhile. Not sure that query and query_radius are more appropriate than kneighbors and radius_neighbors. There is an API inconsistency between the more public NearestNeighbors family and the underlying BinaryTree. The former has kneighbors and radius_neighbors, which return (dist, ind), and the latter has query, query_radius returning (ind, dist). Yuck!

https://github.com/erikbern/ann-benchmarks seems to adopt the latter, but I'd be inclined towards the former, maybe even renaming the methods in BinaryTree.

@rth
Copy link
Member Author

rth commented Jun 6, 2017

Not sure that query and query_radius are more appropriate than kneighbors and radius_neighbors. There is an API inconsistency between the more public NearestNeighbors family and the underlying BinaryTree. The former has kneighbors and radius_neighbors, which return (dist, ind), and the latter has query, query_radius returning (ind, dist). Yuck!

Yeah, that's not ideal. The pro and cons of each API choice are below,

  1. BinaryTree like API

    • (+) Same API as KDTree, BallTree and scipy.spatial.KDTree
    • (+)

      https://github.com/erikbern/ann-benchmarks seems to adopt the latter [this one],

    • (-) Not an estimator: does not have a fit method and fit happens during init. Which means that it couldn't be used directly as is in this PR. Using a BinaryTree like API with a fit method would create yet a 3rd type of NN API...
  2. NearestNeigbors like API

    • (-) NearestNeigbors is the only class that uses it.
    • (-/+) If custom algorithm is a NearestNeigbors like object, on some level it feels strange and awkward for NearestNeigbors(algorithm=algorithm) to return a NearestNeigbour wrapper around a NearesNeigbor object. It would have been much simpler to have NearestNeigbors(algorithm=algorithm) return algorithm directly (which would involved adding a __new__ method to NearestNeigbors and would allow to simplify this PR significantly, I think).
    • (+) +1 from Joel

Will wait for more feedback and change the API accordingly. Thanks @jnothman !

@jnothman
Copy link
Member

jnothman commented Jun 6, 2017 via email

@rth
Copy link
Member Author

rth commented Jun 6, 2017

I don't think NearestNeighbors(algorithm=x) is the application.
KNeighborsClassifier(algorithm=x), radius_neighbors_graph(algorithm=x),
etc. are.

Definitely, my point is that there are two ways of achieving the desired behavior:

  1. First is roughly what is done in this PR (e.g. making sure that NearestNeighbors can handle an algorithm that's an estimator)
  2. Define NearestNeighbors as something along the lines of,
     import sklearn.neighbors
     
     class NearestNeighbors(sklearn.neighbors.NearestNeighbors):
         def __new__(cls, *args, **kwargs):
             algorithm = kwargs.get('algorithm', None)
             if not algorithm and len(args) >= 3:
                 algorithm = args[3] 
             if algorithm is not None and hasattr(algorithm, 'fit'):
                 print('Using custom NN backend')
                 # check that custom NN algorithm is picklable
                 pickle.loads(pickle.dumps(self.algorithm))  
                 # some other tests could be added here
                 return algorithm
             else:
                 return super(cls, NearestNeighbors).__new__(cls)

Since radius_neighbors_graph or DBSCAN, etc. internally call NearestNeighbors(algorithm=x) and do something with the returned object, the second choice might allows to have one less level of indirection and achieve the results with very little changes to the code base. But it might have other issues. Not sure what's the best way.

In any case, KNeighborsClassifier doesn't call NearestNeighbors internally and so it would need to be handled separately.

What do you think?

The problem is actually that KNeighborsClassifier shouldn't be
exposing kneighbors and radius_neighbors from an API purist perspective.

So in the current case, those would be accessible via KNeighborsClassifier.algorithm.kneighbors . Are you saying that it shouldn't be the case?

@lesteve
Copy link
Member

lesteve commented Jun 6, 2017

Looks like the failures on Travis are genuine.

@rth
Copy link
Member Author

rth commented Jun 6, 2017

Looks like the failures on Travis are genuine.

@lesteve They are. I just wanted to discuss the general direction of this PR and fix the API before spending time on making everything work fully, particularly as there were talk in #8996 (comment) that @ogrisel or you were already working on related topics...

@jnothman
Copy link
Member

jnothman commented Jun 6, 2017 via email

@jnothman jnothman added this to the 0.20 milestone Jun 14, 2017
@rth rth mentioned this pull request Aug 15, 2017
@jnothman
Copy link
Member

jnothman commented Aug 30, 2017

Do you want someone else to work on this, @rth? @SebastinSanty may be keen.

@rth
Copy link
Member Author

rth commented Aug 30, 2017

@jnothman Sorry this is taking a while. If someone is eager to work on this they can feel free to take over this PR. @lesteve Might have also worked on related topics as far as I understood. In the other case, I will get back to this, but it would take some time..

@rth
Copy link
Member Author

rth commented Oct 2, 2017

On 01/10/17 05:18, Joel Nothman wrote:

Would you mind completing someone else's work? @rth
has a few that he is currently unavailable to
complete. (Not that I've asked him if he'd like to donate them to a new
contributor...) I also think they might be quite challenging still for
you.

Realistically, I don't have much time for this PR at the moment so anyone can feel free to take it over, as mentioned above #8999 (comment). It may take some work. If not I will try to find some time for it over the next few months. cc @vrishank97

@jnothman
Copy link
Member

jnothman commented Oct 2, 2017

Yes, this is what I'd been thinking of.

@thechargedneutron
Copy link
Contributor

@rth @jnothman Since this is in the 0.20 milestone, is this up for grabs? The 0.20 deadline is near ;) but I am not very sure how strict the deadline is.

@jnothman
Copy link
Member

jnothman commented Jan 8, 2018

Our milestones are unfortunately fairly meaningless unless close to release. Release is not as soon as that deadline. This remains something good to have, and I suspect @rth will happily mentor the work, but not a priority. It's the kind of functionality where most users don't know what they're missing, and even once it's implemented will be hard for all but advanced users to get into.

@rth
Copy link
Member Author

rth commented Jan 8, 2018

It would be nice to have. @thechargedneutron feel free to take the PR over. Please ping me if you want a review of the continued PR.

@thechargedneutron
Copy link
Contributor

@rth Yes, I need a brief review of what has been done so far. It would be good if you comment on this PR what the particular patch does. Can't think of an easier way to convey the progress to me.

@rth
Copy link
Member Author

rth commented Jan 9, 2018

Yes, I need a brief review of what has been done so far. It would be good if you comment on this PR what the particular patch does. Can't think of an easier way to convey the progress to me.

@thechargedneutron There isn't really much additional information outside of what is written in the description, comments and the parent PR. As far as I remember, the general status of this was prototyping and in the process of fixing unit tests. The API was still not fully fixed and may need to be re-discussed again. I synched with master to make the changes more up to date. Hope this helps.

Copy link
Contributor

@thechargedneutron thechargedneutron left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ping @rth

# check algorithm is queried and parameter is passed
dist, ind = est.radius_neighbors([[0]], radius=radius,
return_distance=True)
assert_array_equal(dist[0][0], radius)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is it supposed to be equal to radius?
Doesn't dist[0][0] mean minimum distance between [[0]] and [[0],[1]] provided radius is bigger than distance?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That part came from the parent PR. This was just intended to be a consistency test, I think. If some part of it doesn't make sense to you, you should change it )

def fit(self, X):
self._fitted = True

def query_radius(self, X, r, return_distance=True):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is not using parameters passed in fit function. How will it compute points from fit function?
It would be okay of you describe the motivation behind this function and the below assert condition (other reviewed part)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@thechargedneutron Sorry for the late reply. Yes, you are right it should take an X from fit. The idea was to prototype the simplest possible example of a custom NN search class with an API similar to KDTree.

@jnothman
Copy link
Member

Note there is also discussion at #10463 about whether enabling using nearest neighbors as a transformer, together with sparse precomputed distance matrices, means we don't need this functionality most of the time.

@jnothman
Copy link
Member

I should note that the conversation around #10463 suggests we may not even need this. Allowing nearest neighbours estimators to transform into a sparse nearest neighbourhood graph should be sufficient for moist applications.

@rth
Copy link
Member Author

rth commented Jan 22, 2018

I should note that the conversation around #10463 suggests we may not even need this.

Agreed, that discussion/work looks quite interesting. (My response to comments wasn't intended to imply that this PR is still needed).

@thechargedneutron
Copy link
Contributor

Thanks for informing that this issue needs no further work. Thanks :)

@TomDLT
Copy link
Member

TomDLT commented Sep 18, 2019

Other solution merged in #10482

@TomDLT TomDLT closed this Sep 18, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants