Neighbor barycenter #23

Merged · 10 commits · Dec 2, 2010

Conversation

agramfort
Member

new estimator for k-NN based regression
+
kneighbors_graph function to build the (optionally weighted) graph of neighbors
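For context, a minimal usage sketch. The names and signatures (NeighborsBarycenter, kneighbors_graph, the k and n_neighbors arguments, the import path) are my guess from this discussion and may differ from the merged code:

```python
import numpy as np
from scikits.learn.neighbors import NeighborsBarycenter, kneighbors_graph

X = np.random.rand(40, 2)        # training samples
y = np.sin(4 * X[:, 0])          # continuous targets

# k-NN based regression: predict from the nearest training points
reg = NeighborsBarycenter(k=5).fit(X, y)
y_new = reg.predict(np.random.rand(3, 2))

# sparse (n_samples, n_samples) graph connecting each sample to its
# k nearest neighbors, optionally weighted
A = kneighbors_graph(X, n_neighbors=5)
```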

@ogrisel
Member

ogrisel commented Nov 28, 2010

we should explain in the Neighbors docstring what window_size stands for.

@agramfort
Member Author

More changes following Gael and Olivier's comments.

@ogrisel
Member

ogrisel commented Nov 28, 2010

Ok great. If I understand correctly, this is a non-linear regression model that approaches the ground-truth function with small, locally linear regression problems? Maybe you could make this more explicit in the class docstring.

Also, is this an original design? If not you should include the canonical reference in the docstring too.

Otherwise I am +1 for the merge. That looks like a fun new tool. Is there a natural machine learning problem that could be solved efficiently with this tool? I have a hammer, everything looks like a nail.

@GaelVaroquaux
Member

@olivier: anything that has a very structured output in high dimension. For instance, if you have images of faces with label maps ('mouth', 'chin', 'nose', 'eyes'), given a new image you can try to label it using k-NN (actually this corresponds more to the classification problem, which is our Neighbors object, but you get the idea ;-) )

@alex: all good for the merge.

@agramfort
Member Author

Ok great. If I understand correctly, this is a non-linear regression model that approaches the ground-truth function with small, locally linear regression problems? Maybe you could make this more explicit in the class docstring.

done

Also, is this an original design? If not you should include the canonical reference in the docstring too.

I think it's standard. I've added the Wikipedia page, even if a better reference probably exists.

Otherwise I am +1 for the merge.

Let's wait for Fabian's and Matthieu Brucher's feedback.

That looks like a fun new tool. Is there a natural machine learning problem that could be solved efficiently with this tool? I have a hammer, everything looks like a nail.

no clue. Note that k-NN based estimators need a lot of samples to be accurate.
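To make the "locally linear" reading above concrete, here is a NumPy-only sketch of one way a barycenter prediction can be computed (my reading of the idea, not necessarily the PR's exact code): the query is predicted as a weighted average of its neighbors' targets, with the weights chosen so that the neighbors' barycenter reconstructs the query.

```python
import numpy as np

def barycenter_predict(X, y, x_query, k=5, reg=1e-3):
    # k nearest training points under the Euclidean distance
    dist = np.sqrt(((X - x_query) ** 2).sum(axis=1))
    idx = np.argsort(dist)[:k]
    # weights w solving min ||x_query - sum_i w_i X_i||^2 s.t. sum_i w_i = 1
    Z = X[idx] - x_query                   # neighbors centered on the query
    G = np.dot(Z, Z.T)                     # local Gram matrix
    G.flat[::k + 1] += reg * np.trace(G)   # regularize the diagonal for stability
    w = np.linalg.solve(G, np.ones(k))
    w /= w.sum()
    return np.dot(w, y[idx])
```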

@ogrisel
Member

ogrisel commented Nov 28, 2010

@GaelVaroquaux I knew about k-NN as a baseline for classification tasks such as face recognition, but using it for regression is not that common.

@mblondel
Member

Using the weighted average of the neighbor responses is also part of the intuition behind Gaussian processes for regression, via the covariance function.

@agramfort
Member Author

OK. Do you have a reference, maybe to add in the docstring?
Are you +1 for the merge?

@mblondel
Member

See for example equation 6.66 (page 308) of Bishop's book, and also the small paragraph just below 6.60. If you think it's worth mentioning, maybe something like this: "for a more theoretically motivated model that also uses the notion of proximity to training examples, see GaussianProcess"... I'm no expert in GPs, so maybe Vincent can say if he agrees...
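For readers without the book at hand, (6.66) is, if I remember correctly, the Gaussian process predictive mean, which is indeed a weighted combination of the training targets:

$$m(\mathbf{x}_{N+1}) = \mathbf{k}^\top C_N^{-1}\,\mathbf{t} = \sum_{n=1}^{N} a_n\, k(\mathbf{x}_n, \mathbf{x}_{N+1}), \qquad \mathbf{a} = C_N^{-1}\,\mathbf{t}$$

where $\mathbf{k}$ holds the covariances $k(\mathbf{x}_n, \mathbf{x}_{N+1})$ between the new point and the $N$ training points, $C_N$ is the covariance matrix of the training inputs, and $\mathbf{t}$ the vector of training targets.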

Regarding the merge, does it overlap with what Matthieu Brucher has been doing? This is my only potential concern; otherwise the code looks good.

@agramfort
Member Author

See for example equation 6.66 (page 308) of Bishop's book, and also the small paragraph just below 6.60.
If you think it's worth mentioning, maybe something like this: "for a more theoretically motivated model
that also uses the notion of proximity to training examples, see GaussianProcess"... I'm no expert in GPs,
so maybe Vincent can say if he agrees...

I don't have the Bishop with me, so I'll let you add it once it's merged.

Regarding the merge, does it overlap with what Matthieu Brucher has been doing?
This is my only potential concern; otherwise the code looks good.

Yes indeed. Matthieu has been notified and will take a look as soon as possible
to integrate this into the manifold module.

@GaelVaroquaux
Member

In my opinion, feedback from Matthieu Brucher is paramount. I'd like to be sure that the above code suits his purposes well before merging it in. Matthieu, if you are reading this, don't hesitate to pull from this branch into your manifold branch once you are happy with the code. I am not like the nipy guys; I don't mind pulling from master to feature branches.

@mbrucher
Contributor

I don't have enough time until this weekend (I have a gig to prepare ;))
I think it can cover 90% of my needs, and the design is good (no derivation from Neighbors ;)), but I'd like to be able to use something other than BallTree for scientific purposes (cf. the manifold review, and mainly http://www.ncbi.nlm.nih.gov/pubmed/11778013). My solution is not the preferred one for scikits.learn according to the review I got; if someone has another design, it would fit nicely in this pull request.

@mbrucher
Contributor

Also, it may be worth changing/deprecating the k argument of the Neighbors class to something more obvious like n_neighbors.
I also thought about the "from scipy import sparse" statements in the code; shouldn't they be located at the beginning of the files?

@agramfort
Member Author

  • could you clarify what you mean by something other than BallTree? If it covers 90% of your needs, I think it's a good first pass :)
  • why not go for n_neighbors, even if the k in k-NN is quite understandable
  • importing sparse inside the function is done on purpose, to avoid a long loading time (see the sketch below). That's the only exception, I think.
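For reference, the lazy-import pattern in question looks like this (a sketch, not the actual function body):

```python
def kneighbors_graph(X, n_neighbors):
    # deferred on purpose: scipy.sparse is only imported on the first call,
    # so that importing the top-level package stays fast
    from scipy import sparse
    n_samples = X.shape[0]
    A = sparse.lil_matrix((n_samples, n_samples))
    # ... fill A with the neighborhood structure ...
    return A
```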

@ogrisel
Member

ogrisel commented Nov 29, 2010

+1 for n_neighbors

As for "from scipy import sparse", some people complained that it is slow to import (in other modules) hence the lazy loading approach... I am -0 on the lazy import and would prefer top level imports as well but don't really care if there is comment to explain why it's not a top level import.

@mbrucher
Contributor

All manifold code that is based on k-neighbors (or Parzen windows, for that matter) suffers from shortcuts (http://www.ncbi.nlm.nih.gov/pubmed/11778013). It is a real PITA that has to be addressed (such questions come up repeatedly at conferences when you talk about manifold learning, even in sessions where it is not the main topic), and the only way is to allow the selection of another class. That's what create_neighborer tried to do.

It covers 90% of the use cases, but I have a lot of trouble making everyone understand these issues. I'm afraid that if it is not labelled as an important use case, you'll never want to consider it in the future :(

@GaelVaroquaux
Member

+1 for n_neighbors

The lazy import is indeed there for import times. I would like to keep it, but if people are violently against...

Matthieu, your comments on the shortcomings of the neighbors API are important to me. If I understand your remark correctly, you want the option of using a different metric than the Euclidean one to build the nearest neighbors, right? Do you think you could achieve this reasonably well in the proposed design by subclassing the proposed Neighbors class?

@mbrucher
Contributor

Yes, it is about using something different from the Euclidean distance (perhaps the Euclidean distance in 99% of the cases; at other moments it would use another distance, or even reject the point altogether), and no, it cannot be achieved by derivation. One may use BallTree, but there would be a lot of additional code, and this means no derivation/inheritance. I guess the only way would be to replace the construction of the BallTree by something else with a similar interface (which should then be clarified and made part of the official API).

@GaelVaroquaux
Member

The API of the ball tree is historical and was not given any thought. I am more than open to redesigning it.

In particular, my guideline for designing an API is 'what does this object do?', in other words: 'what are the questions we will ask it?'

In the case of the BallTree, it is not clear to me. If you come up with a simple API (a set of a few methods) that is well-posed for the BallTree and for your use cases, it would probably clean up the code in the scikit.
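To make the suggestion concrete, such a contract could be as small as two methods. The class below is a hypothetical brute-force stand-in with a BallTree-like query, only meant to illustrate the shape of the API:

```python
import numpy as np

class BruteForceNeighbors(object):
    """Hypothetical drop-in for BallTree: fit on points, query for neighbors."""

    def fit(self, X):
        self._X = np.asarray(X)
        return self

    def query(self, x, k):
        # distances from the query to every stored point
        dist = np.sqrt(((self._X - np.asarray(x)) ** 2).sum(axis=1))
        idx = np.argsort(dist)[:k]
        return dist[idx], idx
```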

@ogrisel
Member

ogrisel commented Nov 29, 2010

If I understand correctly, the problem with k-NN when used in manifold learning is its high sensitivity to the k hyper-parameter, which requires a priori knowledge to get right? What kind of alternative implementation do you have in mind? From the abstract of the paper by Balasubramanian & Schwartz, they don't seem to give any solution.

Why not implement a new Neighbor class with the alternative implementation and use that instead of NeighborRegression? Maybe NeighborRegression is too generic a name and should be renamed to BallTreeNeighborRegression; we could then have YourAlternativeNeighborRegression as a side class in the neighbors package (in the same way we have both Lasso and ElasticNet in the coordinate_descent package, for instance).

@mbrucher
Contributor

@ogrisel
Tuning the k parameter is not enough to find a solution. There may be only one problematic point, and no k value could fit the purpose, even if the k parameter differs only for the neighbors of this point. Imagine a Swiss roll with noise: it is the perfect showcase for bad k-NN behavior.
The paper does not give a solution because it is still an open research topic.

Your solution is almost what I would like. The issue is that I need more than YourAlternativeNeighborRegression. For manifold construction, I cannot use the regression directly; the fit parameters are the result of an embedding algorithm. I need to use the functions Alexandre created, but with a different BallTree.
I could be happy with NeighborRegression and allow people to provide their own way of populating the neighbor graph, but this is what I currently have in my branch, and that solution has not reached consensus, so I'd like to offer the possibility of selecting the neighbor class in this request instead of in my branch (it would be a higher-level solution; I only use Neighbors, not BallTree).

@Gael
I'm also wondering if BallTree could be completely hidden inside Neighbors, to get rid of the issue of its interface. Or perhaps, in this request, only Neighbors should be used, since it has an official API?

@mbrucher
Contributor

Question: NeighborRegression's input for y is an integer array. Couldn't it just be any array? (This is needed if I want to get rid of my version of the same mapping.)

@GaelVaroquaux
Member

@mbrucher

I agree with you, I don't understand why it should be an integer.

With regard to whether the BallTree should be hidden inside Neighbors, I don't know; I would say that you are the one who can tell us. Could a minimalist Neighbors API suit your needs? Maybe neighbors_graph could take such an object as a keyword argument (with the default being simple Euclidean nearest neighbors). Would that help your code?
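Concretely, the keyword-argument idea could look like this (hypothetical signature, reusing the BruteForceNeighbors stand-in sketched earlier in the thread):

```python
def kneighbors_graph(X, n_neighbors, neighbors_finder=None):
    if neighbors_finder is None:
        # simple Euclidean nearest neighbors as the default
        neighbors_finder = BruteForceNeighbors()
    neighbors_finder.fit(X)
    # one row of neighbor indices per sample
    return [neighbors_finder.query(x, n_neighbors)[1] for x in X]
```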

@olivier:
Mentioning the BallTree in the name of the Euclidean nearest neighbors class seems clunky to me: most users are not interested in what a BallTree is, or even in the fact that nearest neighbors makes sense in non-Euclidean spaces. Hell, I think that many of the people processing data do not even know what non-Euclidean spaces are.

@mbrucher
Contributor

Yes, it would be enough, although it might mean that each user has to write a wrapper function for her Neighbor replacement. Well, I can live with it :)
It would only mean one less argument in https://github.com/mbrucher/scikit-learn/blob/manifold-light/scikits/learn/manifold/embedding/tools.py#L14

@GaelVaroquaux
Member

But couldn't you provide the few classes that cover all the use cases you need?

It is true that quite often in the scikit we have taken the point of view that we prefer having simple classes and having advanced users subclass them to implement the logic they need. We do this a lot at Neurospin to address our neuroimaging-specific needs.

@ogrisel
Member

ogrisel commented Nov 29, 2010

+1 on simple default classes that can be overridden for advanced research purposes. I don't want an overly generic API with explicit dictionaries of parameters passed to the constructors or public methods. The basic API user should not have to manipulate abstract dictionaries of parameters. They should only have to deal with the named parameters that are explicitly documented in the method's docstring.

To be more specific: n_neighbors and neigh are admissible parameters for the public API (although neigh is a bad parameter name in my opinion; neighbors_finder or topology_extractor is probably better); neigh_alternate_arguments is not an admissible parameter.

@mbrucher
Contributor

How would you tell the function that it has to create a specific class? I don't mind a functor that acts as a builder, but I would still need the current number of arguments I use for the default case, because each dataset may need its own number of neighbors; it cannot be arbitrarily fixed.

Also, don't forget that the LSP (Liskov Substitution Principle) limits what you can do with subclassing.

@mbrucher
Contributor

OK, so we can agree on neighbors_finder: a functor that will create the instance to be used afterwards.

I will also change the mapping builder to reflect this in my current branch.

As for the question in #23 (comment), perhaps Alexandre can also tell us whether he can live with Neighbors instead of the BallTree?

@ogrisel
Member

ogrisel commented Nov 29, 2010

Pass the pre-configured instance, not a factory. If you need to adjust n_neighbors later, then explicitly pass the number of neighbors to extract as part of the public API of the neighbors_finder or topology_extractor instance.

For instance, we can say that instances of the "neighbors_finder" contract must implement a predict method that accepts an optional argument n_neighbors (which will otherwise be set to a reasonable default, such as 3).
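A hypothetical instance honoring that contract (the class name and default only illustrate the proposal):

```python
import numpy as np

class EuclideanNeighborsFinder(object):
    """Pre-configured instance implementing the proposed contract."""

    def __init__(self, default_n_neighbors=3):
        self.default_n_neighbors = default_n_neighbors

    def fit(self, X):
        self._X = np.asarray(X)
        return self

    def predict(self, x, n_neighbors=None):
        # fall back to the configured default when the caller does not override
        k = n_neighbors if n_neighbors is not None else self.default_n_neighbors
        dist = ((self._X - np.asarray(x)) ** 2).sum(axis=1)
        return np.argsort(dist)[:k]
```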

@mbrucher
Contributor

You prefer the instance directly? Why not. It could be reused or built differently. And it's fewer lines :)
The predict method cannot expect n_neighbors: for some algorithms it makes no sense (an example would be a pure Parzen-window-based Neighbors class).

@ogrisel
Member

ogrisel commented Nov 29, 2010

Yes: instances instead of parameterized factories that force the user to pass generic maps of parameters all around the API and make it overly complex!

As for n_neighbors, it was just a proposal for an optional parameter to restore some of the flexibility that might have been lost by switching from the factory pattern to the configured-instance pattern: if you don't need it for the manifold module, we don't have to introduce it (unless it is useful for other use cases).

@agramfort
Member Author

I have to admit I have a hard time following the discussion.
@matthieu: do not hesitate to pull from this branch and work from there.

These might be naive comments, but I had in mind that my kneighbors_graph function
was all that was needed for the manifold module. A new weight option can be added
if necessary, though. To me, manifold learning works by building an affinity matrix
directly derived from the kneighbors_graph output and then applying simple linear
algebra, hence a few lines of code (see the sketch below). But I might be wrong.
The NeighborsBarycenter object would then stay out of the manifold module.
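To illustrate the claim, here is hypothetical glue code using Laplacian eigenmaps as the example, assuming kneighbors_graph returns a sparse adjacency matrix (dense math for simplicity; not part of this PR):

```python
import numpy as np
from scikits.learn.neighbors import kneighbors_graph  # name from this PR

def laplacian_embedding(X, n_neighbors=5, n_components=2):
    # affinity: symmetrized k-NN adjacency
    W = kneighbors_graph(X, n_neighbors).toarray()
    W = np.maximum(W, W.T)
    # unnormalized graph Laplacian L = D - W
    L = np.diag(W.sum(axis=1)) - W
    # eigenvectors with the smallest eigenvalues, skipping the constant one,
    # give the embedding coordinates
    vals, vecs = np.linalg.eigh(L)
    return vecs[:, 1:n_components + 1]
```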

@agramfort
Member Author

What should I do?
Wait for Matthieu to work from there, or do the merge, close the pull request,
and have Matthieu merge from origin/master?

@mbrucher
Contributor

mbrucher commented Dec 2, 2010

+1 for the merge.
Better to close some "tickets" than to leave them open when they are OK ;)

@GaelVaroquaux
Member

I would say that if the APIs are going to change, I'd rather merge in 'origin', but in a 0.7.X branch, to avoid putting moving targets in the 0.6 release.

@mbrucher
Contributor

mbrucher commented Dec 2, 2010

I think that the only thing that will change in my branch is adding the keyword neighbors_finder. The rest should be OK. I may have to add one function for a symmetrical graph, but that does not modify the existing API.

@GaelVaroquaux
Member

Could you do it in a separate branch, so that we can merge this into a common branch and try to use it on real problems (to discover where it sucks)?

@mbrucher
Contributor

mbrucher commented Dec 2, 2010

I can ;)
