Neighbor barycenter #23
Conversation
…arse matrix MISC : pep8 in neighbors module
…eighbors API : change api for kneighbors_graph function
We should explain in the Neighbors docstring what window_size stands for.
More following Gael and Olivier's comments.
Ok great. If I understand correctly, this is a non-linear regression model that approaches the ground-truth function by small, locally linear regression problems? Maybe you could make this more explicit in the class docstring. Also, is this an original design? If not, you should include the canonical reference in the docstring too. Otherwise I am +1 for the merge. That looks like a fun new tool. Is there a natural machine learning problem that could be solved efficiently with this tool? I have a hammer, everything looks like a nail.
@olivier: anything that has a very structured output in high dimension. For instance, if you have images of faces that have label maps: 'mouth', 'chin', 'nose', 'eyes', given a new image, you can try to label it using a KNN (actually this would correspond more to the classification problem, which is our Neighbors object, but you get the idea ;-)). @alex: all good for the merge.
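For readers following along, here is a minimal sketch of the barycenter idea as described above (brute force, with illustrative names; not the actual scikit code): each prediction is a weighted average of the targets of the k nearest training points, with weights chosen so that the query point is best reconstructed as a barycenter of its neighbors.

```python
import numpy as np

def barycenter_weights(x, neighbors):
    """Weights (summing to one) that best reconstruct x as a
    barycenter of its neighbors, via the local Gram matrix."""
    z = neighbors - x                     # center the neighborhood on x
    gram = z @ z.T                        # local Gram matrix
    # regularize the diagonal in case the Gram matrix is singular
    gram.flat[:: len(gram) + 1] += 1e-3 * np.trace(gram)
    w = np.linalg.solve(gram, np.ones(len(neighbors)))
    return w / w.sum()

def knn_barycenter_predict(X_train, y_train, x, k=5):
    """Predict y at x as the barycenter-weighted average of the
    targets of the k nearest training samples."""
    dist = np.linalg.norm(X_train - x, axis=1)
    idx = np.argsort(dist)[:k]
    w = barycenter_weights(x, X_train[idx])
    return w @ y_train[idx]
```

This is essentially the reconstruction-weights step of locally linear embedding applied to regression, which is why the "locally linear" reading above fits.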
Done.
I think it's standard. I've added the Wikipedia page, even though a better reference probably exists.
Let's wait for Fabian and Matthieu Brucher's feedback.
No clue. Note that k-NN based estimators need a lot of samples to be accurate.
@GaelVarroquaux I knew about k-NN as a baseline for classification tasks such as face recognition, but using it for regression is not that common.
Using the weighted average of the neighbor responses is also part of the intuition in Gaussian Processes for Regression, with the covariance function.
OK. Do you have a reference, to maybe add it in the docstring?
See for example equation 6.66 (page 308) of Bishop's book, and also the small paragraph just below 6.60. If you think it's worth mentioning, maybe something like this: "for a more theoretically-motivated model that also uses the notion of proximity with training examples, see GaussianProcess"... I'm no expert in GPs, so maybe Vincent can say if he agrees... Regarding the merge, does it overlap with what Matthieu Brucher has been doing? This is my only potential concern; otherwise the code looks good.
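For context, the equation referred to is the Gaussian Process predictive mean, which is indeed a weighted sum of the training targets (stated here from memory, so check the book for the exact form):

$$
m(\mathbf{x}_{N+1}) = \mathbf{k}^{\top} \mathbf{C}_N^{-1} \mathbf{t} = \sum_{n=1}^{N} w_n t_n,
\qquad
\mathbf{k} = \big( k(\mathbf{x}_1, \mathbf{x}_{N+1}), \ldots, k(\mathbf{x}_N, \mathbf{x}_{N+1}) \big)
$$

where $\mathbf{C}_N$ is the covariance matrix of the training points, and the weights $w_n = (\mathbf{k}^{\top} \mathbf{C}_N^{-1})_n$ typically decay with the distance of $\mathbf{x}_n$ from the query point, hence the analogy with neighbor averaging.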
I don't have the Bishop with me, so I'll let you add it once it's merged.
Yes indeed. Matthieu has been notified and will take a look as soon as possible.
In my opinion, feedback from Matthieu Brucher is paramount. I'd like to be sure that the above code would suit his purposes well before merging it in. Matthieu, if you are reading this, don't hesitate to pull from this branch into your manifold branch once you are happy with the code. I am not like the nipy guys: I don't mind pulling from master into feature branches.
I don't have enough time until this weekend (I have a gig to prepare ;)).
Also, it may be worth changing/deprecating the k argument of the Neighbors class to something more obvious like n_neighbors.
+1 for n_neighbors. As for "from scipy import sparse", some people complained that it is slow to import (in other modules), hence the lazy loading approach... I am -0 on the lazy import and would prefer top-level imports as well, but I don't really care as long as there is a comment to explain why it's not a top-level import.
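For reference, the lazy-import pattern under discussion looks something like this (a brute-force sketch with illustrative internals, not the actual module code): the scipy.sparse import is deferred into the function that needs it, so importing the neighbors module itself stays cheap.

```python
import numpy as np

def kneighbors_graph(X, n_neighbors):
    """Connectivity graph of the k nearest neighbors (brute force)."""
    # Deferred import: the cost of loading scipy.sparse is only paid
    # when a graph is actually built, keeping the module import fast.
    from scipy import sparse
    n_samples = X.shape[0]
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    idx = np.argsort(dist, axis=1)[:, 1:n_neighbors + 1]  # skip self
    rows = np.repeat(np.arange(n_samples), n_neighbors)
    graph = sparse.coo_matrix(
        (np.ones(len(rows)), (rows, idx.ravel())),
        shape=(n_samples, n_samples))
    return graph.tocsr()
```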
All manifold code that is based on k-neighbors (or Parzen windows, for that matter) suffers from shortcuts (http://www.ncbi.nlm.nih.gov/pubmed/11778013). It is a real PITA that has to be addressed (such questions come up repeatedly at conferences when you talk about manifold learning, even in sessions where it is not the main topic), and the only way is to allow the selection of another class. That is what create_neighborer tried to do. It covers 90% of the use cases, but the issue is that I have a lot of trouble making everyone understand these issues. I'm afraid that if it is not labelled as an important use case, you'll never want to consider it in the future :(
+1 for n_neighbors. The lazy import is indeed there for import times. I would like to keep it there, but not if people are violently against... Matthieu, your comments on the shortcomings of the neighbors API are important to me. If I understand your remark correctly, you want the option of using a different metric than the Euclidean one to build the nearest neighbors, right? Do you think that you could achieve this reasonably well in the proposed design by sub-classing the proposed Neighbors class?
Yes, it means using something different than the Euclidean distance (perhaps the Euclidean distance in 99% of the cases, and at other moments another distance, or even rejecting the point altogether), and no, it cannot be achieved by derivation. One may use BallTree, but there would be a lot of additional code, and this means no derivation/inheritance. I guess the only way would be replacing the construction of the BallTree by something else with a similar interface (which should then be clarified and made part of the official API).
The API of the BallTree is historical and was not given any thought. I am more than open to redesigning it. In particular, my guideline for designing an API is 'what does this object do?', in other words: 'what are the questions we will ask it?'. In the case of the BallTree, it is not clear to me. If you come in with a simple API (a set of a few methods) that is well-posed for the BallTree and for your use cases, it would probably clean up the code in the scikit.
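To make this concrete, the contract could be as small as a single query method; any replacement (a BallTree wrapper, or a neighbor finder that rejects shortcut points) would only need to honor it. A hypothetical sketch, not the current BallTree API:

```python
import numpy as np

class BruteForceNeighbors:
    """Minimal stand-in honoring a BallTree-like contract:
    store the points, then answer k-nearest-neighbor queries."""

    def __init__(self, X):
        self.X = np.asarray(X)

    def query(self, x, k):
        """Return (distances, indices) of the k points nearest to x."""
        dist = np.linalg.norm(self.X - x, axis=1)
        idx = np.argsort(dist)[:k]
        return dist[idx], idx
```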
If I understand correctly, the problem of k-NN when used in manifold learning is the high sensitivity to the k hyper-parameter, which requires a-priori knowledge to get right? What kind of alternative implementation do you have in mind? Reading the abstract of the paper by Balasubramanian & Schwartz, they don't seem to give any solution. Why not implement a new Neighbors class with the alternative implementation and use that instead of NeighborRegression? Maybe NeighborRegression is too generic a name and should be renamed to BalltreeNeighborRegression, and we could then have YourAlternativeNeighborRegression as a side class in the neighbors package (in a similar way as we have both Lasso and ElasticNet in the coordinate_descent package, for instance).
@ogrisel Your solution is almost what I would like. The issue is that I need more than YourAlternativeNeighborRegression. For manifold construction, I cannot use the regression directly; the fit parameters are the result of an embedding algorithm. I need to use the functions Alexandre created, but with a different BallTree. @Gael
Question: the NeighborRegression input for y is an integer array. Couldn't it just be any array? (It is needed if I want to get rid of my version of the same mapping.)
I agree with you; I don't understand why it should be an integer. With regards to whether the BallTree should be hidden inside the Neighbors, I don't know; I would say that you are better placed to tell us. Can a minimalist API for Neighbors define the API that you need to suit your needs? Maybe neighbors_graph could take such an object as a keyword argument (with the default being the simple Euclidean nearest neighbors). Would that help your code? @olivier:
Yes, it would be enough, although it might mean that each user has to write a wrapper function for her Neighbors replacement. Well, I can live with it :)
But couldn't you provide the few classes that cover all the use cases you need? It is true that quite often in the scikit we have taken the point of view that we prefer having simple classes, and having advanced users subclass them to implement the logic that they need. We do this a lot at Neurospin to add our neuro-imaging specific needs.
+1 on simple default classes that can be overridden for advanced research purposes. I don't want an overly generic API with explicit dictionaries of parameters passed to the constructors or public methods. The basic API user should not have to manipulate abstract dictionaries of parameters. They should only have to deal with the named parameters that are explicitly documented in the docstring of the method. To be more specific: n_neighbors and neigh are admissible public parameters for the public API (although neigh is a bad parameter name in my opinion; neighbors_finder or topology_extractor is probably better), while neigh_alternate_arguments is not an admissible parameter.
How would you tell the function that it has to create a specific class? I don't mind a functor that acts as a builder, but I would still have to have the current number of arguments I use for the default case, because each dataset may need its own number of neighbors; it cannot be arbitrarily fixed. Also, don't forget that the LSP (Liskov Substitution Principle) limits what you can do with subclassing.
OK, so we can agree on neighbors_finder, a functor that will create the instance that will then be used. I will also change the mapping builder to reflect this in my current branch. As for the question in #23 (comment), perhaps Alexandre can also give a clue as to whether he can live with Neighbors instead of the BallTree?
Pass the pre-configured instance and not a factory. If you need to adjust n_neighbors later, then pass the number of neighbors to extract explicitly as part of the public API of the neighbors_finder or topology_extractor instance. For instance, we can state that instances of the "neighbors_finder" contract must implement a predict method that accepts an optional argument n_neighbors (which will otherwise be set to a reasonable default such as 3).
You prefer the instance directly? Why not. It could be reused or built differently. And it's fewer lines :)
Yes for instances instead of parameterized factories that force the user to pass generic maps of parameters all around the API and make it overly complex! As for n_neighbors, it was just an optional parameter proposal to introduce some of the flexibility that might have been lost by switching from the factory pattern to the configured-instance pattern: if you don't need it for the manifold module, we don't have to introduce it (unless it is useful for other use cases).
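To summarize the pattern being converged on (reusing the hypothetical BruteForceNeighbors sketch from above; the names follow the discussion, not any final API): the user builds and configures the neighbor finder, and the consuming code only relies on the agreed contract with explicit, documented arguments, rather than on a factory plus a generic dictionary of parameters.

```python
import numpy as np

X_train = np.random.rand(100, 3)

# Configured-instance pattern: the user constructs the finder; any
# object honoring the same `query` contract could be swapped in.
finder = BruteForceNeighbors(X_train)

# The consuming API receives the instance and passes explicit,
# documented parameters (no neigh_alternate_arguments dictionary).
distances, indices = finder.query(np.random.rand(3), k=5)
```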
I have to admit I have a hard time following the discussion. These might be naive comments, but I had in mind that my kneighbors_graph function...
What should I do?
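Since kneighbors_graph is the function in question, here is a usage example based on the brute-force sketch given earlier in this thread (the real function's signature and modes may differ):

```python
import numpy as np

X = np.random.rand(30, 2)
graph = kneighbors_graph(X, n_neighbors=3)  # sparse CSR adjacency matrix
print(graph.shape, graph.nnz)               # (30, 30), 90 stored edges
```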
+1 for the merge.
I would say that if the APIs are going to change, I'd rather merge into 'origin', but in a 0.7.X branch, to avoid putting moving targets in the 0.6 release.
I think that the only thing that will change in my branch is adding the neighbors_finder keyword. The rest should be OK. I may have to add one function for a symmetrical graph, but this would not modify the existing API.
Could you do it in a separate branch, so that we can merge this into a common branch and try to use it on a real problem (to discover where it sucks)?
I can ;)
new estimator for k-NN based regression
kneighbors_graph function to build the optionally weighted graph of neighbors