Neighbor barycenter #23

Merged · 10 commits · Dec 2, 2010

Conversation

agramfort
Member

new estimator for k-NN based regression
+
kneighbors_graph function to build the (optionally weighted) graph of neighbors
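For context, a minimal usage sketch. The names and signatures (NeighborsBarycenter, kneighbors_graph, the k and n_neighbors arguments, the import path) are my guess from this discussion and may differ from the merged code:

```python
import numpy as np
from scikits.learn.neighbors import NeighborsBarycenter, kneighbors_graph

X = np.random.rand(40, 2)        # training samples
y = np.sin(4 * X[:, 0])          # continuous targets

# k-NN based regression: predict from the nearest training points
reg = NeighborsBarycenter(k=5).fit(X, y)
y_new = reg.predict(np.random.rand(3, 2))

# sparse (n_samples, n_samples) graph connecting each sample to its
# k nearest neighbors, optionally weighted
A = kneighbors_graph(X, n_neighbors=5)
```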

@ogrisel
Member

ogrisel commented Nov 28, 2010

we should explain in the Neighbors docstring what window_size stands for.

@agramfort
Member Author

More changes following Gael and Olivier's comments.

@ogrisel
Member

ogrisel commented Nov 28, 2010

Ok great. If I understand correctly, this is a non-linear regression model that approaches the ground-truth function with small, locally linear regression problems? Maybe you could make this more explicit in the class docstring.

Also, is this an original design? If not you should include the canonical reference in the docstring too.

Otherwise I am +1 for the merge. That looks like a fun new tool. Is there a natural machine learning problem that could be solved efficiently with this tool? I have a hammer, everything looks like a nail.

@GaelVaroquaux
Member

@olivier: anything that has a very structured output in high dimension. For instance, if you have images of faces with label maps ('mouth', 'chin', 'nose', 'eyes'), given a new image you can try to label it using k-NN (actually this corresponds more to the classification problem, which is our Neighbors object, but you get the idea ;-) )

@alex: all good for the merge.

@agramfort
Member Author

Ok great. If I understand correctly, this is a non-linear regression model that approaches the ground-truth function with small, locally linear regression problems? Maybe you could make this more explicit in the class docstring.

done

Also, is this an original design? If not you should include the canonical reference in the docstring too.

I think it's standard. I've added the Wikipedia page, even if a better reference probably exists.

Otherwise I am +1 for the merge.

Let's wait for Fabian's and Matthieu Brucher's feedback.

That looks like a fun new tool. Is there a natural machine learning problem that could be solved efficiently with this tool? I have a hammer, everything looks like a nail.

no clue. Note that k-NN based estimators need a lot of samples to be accurate.
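To make the "locally linear" reading above concrete, here is a NumPy-only sketch of one way a barycenter prediction can be computed (my reading of the idea, not necessarily the PR's exact code): the query is predicted as a weighted average of its neighbors' targets, with the weights chosen so that the neighbors' barycenter reconstructs the query.

```python
import numpy as np

def barycenter_predict(X, y, x_query, k=5, reg=1e-3):
    # k nearest training points under the Euclidean distance
    dist = np.sqrt(((X - x_query) ** 2).sum(axis=1))
    idx = np.argsort(dist)[:k]
    # weights w solving min ||x_query - sum_i w_i X_i||^2 s.t. sum_i w_i = 1
    Z = X[idx] - x_query                   # neighbors centered on the query
    G = np.dot(Z, Z.T)                     # local Gram matrix
    G.flat[::k + 1] += reg * np.trace(G)   # regularize the diagonal for stability
    w = np.linalg.solve(G, np.ones(k))
    w /= w.sum()
    return np.dot(w, y[idx])
```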

@ogrisel
Member

ogrisel commented Nov 28, 2010

@GaelVaroquaux I knew about k-NN as a baseline for classification tasks such as face recognition, but using it for regression is not that common.

@mblondel
Member

Using the weighted average of the neighbor responses is also part of the intuition behind Gaussian processes for regression, via the covariance function.

@agramfort
Member Author

OK. Do you have a reference, maybe to add in the docstring?
Are you +1 for the merge?

@mblondel
Member

See for example equation 6.66 (page 308) of Bishop's book, and also the small paragraph just below 6.60. If you think it's worth mentioning, maybe something like this: "for a more theoretically motivated model that also uses the notion of proximity to training examples, see GaussianProcess"... I'm no expert in GPs, so maybe Vincent can say if he agrees...
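For readers without the book at hand, (6.66) is, if I remember correctly, the Gaussian process predictive mean, which is indeed a weighted combination of the training targets:

$$m(\mathbf{x}_{N+1}) = \mathbf{k}^\top C_N^{-1}\,\mathbf{t} = \sum_{n=1}^{N} a_n\, k(\mathbf{x}_n, \mathbf{x}_{N+1}), \qquad \mathbf{a} = C_N^{-1}\,\mathbf{t}$$

where $\mathbf{k}$ holds the covariances $k(\mathbf{x}_n, \mathbf{x}_{N+1})$ between the new point and the $N$ training points, $C_N$ is the covariance matrix of the training inputs, and $\mathbf{t}$ the vector of training targets.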

Regarding the merge, does it overlap with what Matthieu Brucher has been doing? This is my only potential concern; otherwise the code looks good.

@agramfort
Member Author

See for example equation 6.66 (page 308) of Bishop's book, and also the small paragraph just below 6.60.
If you think it's worth mentioning, maybe something like this: "for a more theoretically motivated model
that also uses the notion of proximity to training examples, see GaussianProcess"... I'm no expert in GPs,
so maybe Vincent can say if he agrees...

I don't have the Bishop with me, so I'll let you add it once it's merged.

Regarding the merge, does it overlap with what Matthieu Brucher has been doing?
This is my only potential concern; otherwise the code looks good.

Yes indeed. Matthieu has been notified and will take a look as soon as possible
to integrate this into the manifold module.

@GaelVaroquaux
Member

In my opinion, feedback from Matthieu Brucher is paramount. I'd like to be sure that the above code suits his purposes well before merging it in. Matthieu, if you are reading this, don't hesitate to pull from this branch into your manifold branch once you are happy with the code. I am not like the nipy guys; I don't mind pulling from master to feature branches.

@mbrucher
Contributor

I don't have enough time until this weekend (I have a gig to prepare ;))
I think it can cover 90% of my needs, and the design is good (no derivation from Neighbors ;)), but I'd like to be able to use something other than BallTree for scientific purposes (cf. the manifold review, and mainly http://www.ncbi.nlm.nih.gov/pubmed/11778013). My solution is not the preferred one for scikits.learn according to the review I got; if someone has another design, it would fit nicely in this pull request.

@mbrucher
Contributor

Also, it may be worth changing/deprecating the k argument of the Neighbors class to something more obvious like n_neighbors.
I also thought about the "from scipy import sparse" statements in the code; shouldn't they be located at the beginning of the files?

@agramfort
Member Author

  • could you clarify what you mean by something other than BallTree? If it covers 90% of your needs, I think it's a good first pass :)
  • why not go for n_neighbors, even if the k in k-NN is quite understandable
  • importing sparse inside the function is done on purpose, to avoid a long loading time (see the sketch below). That's the only exception, I think.
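For reference, the lazy-import pattern in question looks like this (a sketch, not the actual function body):

```python
def kneighbors_graph(X, n_neighbors):
    # deferred on purpose: scipy.sparse is only imported on the first call,
    # so that importing the top-level package stays fast
    from scipy import sparse
    n_samples = X.shape[0]
    A = sparse.lil_matrix((n_samples, n_samples))
    # ... fill A with the neighborhood structure ...
    return A
```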

@ogrisel
Member

ogrisel commented Nov 29, 2010

+1 for n_neighbors

As for "from scipy import sparse", some people complained that it is slow to import (in other modules) hence the lazy loading approach... I am -0 on the lazy import and would prefer top level imports as well but don't really care if there is comment to explain why it's not a top level import.

@mbrucher
Contributor

All manifold code that is based on k-neighbors (or Parzen windows, for that matter) suffers from shortcuts (http://www.ncbi.nlm.nih.gov/pubmed/11778013). It is a real PITA that has to be addressed (such questions come up repeatedly at conferences when you talk about manifold learning, even in sessions where it is not the main topic), and the only way is to allow the selection of another class. That's what create_neighborer tried to do.

It covers 90% of the use cases, but I have a lot of trouble making everyone understand these issues. I'm afraid that if it is not labelled as an important use case, you'll never want to consider it in the future :(

@GaelVaroquaux
Member

+1 for n_neighbors

The lazy import is indeed there for import times. I would like to keep it, but if people are violently against...

Matthieu, your comments on the shortcomings of the neighbors API are important to me. If I understand your remark correctly, you want the option of using a different metric than the Euclidean one to build the nearest neighbors, right? Do you think you could achieve this reasonably well in the proposed design by subclassing the proposed Neighbors class?

@mbrucher
Contributor

Yes, it is about using something different from the Euclidean distance (perhaps the Euclidean distance in 99% of the cases; at other moments it would use another distance, or even reject the point altogether), and no, it cannot be achieved by derivation. One may use BallTree, but there would be a lot of additional code, and this means no derivation/inheritance. I guess the only way would be to replace the construction of the BallTree by something else with a similar interface (which should then be clarified and made part of the official API).

@GaelVaroquaux
Member

The API of the ball tree is historical and was not given any thought. I am more than open to redesigning it.

In particular, my guideline for designing an API is 'what does this object do?', in other words: 'what are the questions we will ask it?'

In the case of the BallTree, it is not clear to me. If you come up with a simple API (a set of a few methods) that is well-posed for the BallTree and for your use cases, it would probably clean up the code in the scikit.
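To make the suggestion concrete, such a contract could be as small as two methods. The class below is a hypothetical brute-force stand-in with a BallTree-like query, only meant to illustrate the shape of the API:

```python
import numpy as np

class BruteForceNeighbors(object):
    """Hypothetical drop-in for BallTree: fit on points, query for neighbors."""

    def fit(self, X):
        self._X = np.asarray(X)
        return self

    def query(self, x, k):
        # distances from the query to every stored point
        dist = np.sqrt(((self._X - np.asarray(x)) ** 2).sum(axis=1))
        idx = np.argsort(dist)[:k]
        return dist[idx], idx
```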

@ogrisel
Member

ogrisel commented Nov 29, 2010

If I understand correctly, the problem with k-NN when used in manifold learning is its high sensitivity to the k hyper-parameter, which requires a priori knowledge to get right? What kind of alternative implementation do you have in mind? From the abstract of the paper by Balasubramanian & Schwartz, they don't seem to give any solution.

Why not implement a new Neighbor class with the alternative implementation and use that instead of NeighborRegression? Maybe NeighborRegression is too generic a name and should be renamed to BallTreeNeighborRegression; we could then have YourAlternativeNeighborRegression as a side class in the neighbors package (in the same way we have both Lasso and ElasticNet in the coordinate_descent package, for instance).

@mbrucher
Contributor

@ogrisel
Tuning the k parameter is not enough to find a solution. There may be only one problematic point, and no k value could fit the purpose, even if the k parameter differs only for the neighbors of this point. Imagine a Swiss roll with noise: it is the perfect showcase for bad k-NN behavior.
The paper does not give a solution because it is still an open research topic.

Your solution is almost what I would like. The issue is that I need more than YourAlternativeNeighborRegression. For manifold construction, I cannot use the regression directly; the fit parameters are the result of an embedding algorithm. I need to use the functions Alexandre created, but with a different BallTree.
I could be happy with NeighborRegression and allow people to provide their own way of populating the neighbor graph, but this is what I currently have in my branch, and that solution has not reached consensus, so I'd like to offer the possibility of selecting the neighbor class in this request instead of in my branch (it would be a higher-level solution; I only use Neighbors, not BallTree).

@Gael
I'm also wondering if BallTree could be completely hidden inside Neighbors, to get rid of the issue of its interface. Or perhaps, in this request, only Neighbors should be used, since it has an official API?

@mbrucher
Contributor

Question: NeighborRegression's input for y is an integer array. Couldn't it just be any array? (This is needed if I want to get rid of my version of the same mapping.)

@GaelVaroquaux
Member

@mbrucher

I agree with you, I don't understand why it should be an integer.

With regard to whether the BallTree should be hidden inside Neighbors, I don't know; I would say that you are the one who can tell us. Could a minimalist Neighbors API suit your needs? Maybe neighbors_graph could take such an object as a keyword argument (with the default being simple Euclidean nearest neighbors). Would that help your code?
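Concretely, the keyword-argument idea could look like this (hypothetical signature, reusing the BruteForceNeighbors stand-in sketched earlier in the thread):

```python
def kneighbors_graph(X, n_neighbors, neighbors_finder=None):
    if neighbors_finder is None:
        # simple Euclidean nearest neighbors as the default
        neighbors_finder = BruteForceNeighbors()
    neighbors_finder.fit(X)
    # one row of neighbor indices per sample
    return [neighbors_finder.query(x, n_neighbors)[1] for x in X]
```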

@olivier:
Mentioning the BallTree in the name of the Euclidean nearest neighbors class seems clunky to me: most users are not interested in what a BallTree is, or even in the fact that nearest neighbors makes sense in non-Euclidean spaces. Hell, I think that many of the people processing data do not even know what non-Euclidean spaces are.

@mbrucher
Contributor

Yes, it would be enough, although it might mean that each user has to write a wrapper function for her Neighbor replacement. Well, I can live with it :)
It would only mean one less argument in https://github.com/mbrucher/scikit-learn/blob/manifold-light/scikits/learn/manifold/embedding/tools.py#L14

@GaelVaroquaux
Member

But couldn't you provide the few classes that cover all the use cases you need?

It is true that quite often in the scikit we have taken the point of view that we prefer having simple classes and having advanced users subclass them to implement the logic they need. We do this a lot at Neurospin to address our neuroimaging-specific needs.

@ogrisel
Member

ogrisel commented Nov 29, 2010

+1 on simple default classes that can be overridden for advanced research purposes. I don't want an overly generic API with explicit dictionaries of parameters passed to the constructors or public methods. The basic API user should not have to manipulate abstract dictionaries of parameters. They should only have to deal with the named parameters that are explicitly documented in the method's docstring.

To be more specific: n_neighbors and neigh are admissible parameters for the public API (although neigh is a bad parameter name in my opinion; neighbors_finder or topology_extractor is probably better); neigh_alternate_arguments is not an admissible parameter.

@mbrucher
Contributor

How would you tell the function that it has to create a specific class? I don't mind a functor that acts as a builder, but I would still need the current number of arguments I use for the default case, because each dataset may need its own number of neighbors; it cannot be arbitrarily fixed.

Also, don't forget that the LSP (Liskov Substitution Principle) limits what you can do with subclassing.

@mbrucher
Contributor

OK, so we can agree on neighbors_finder: a functor that will create the instance to be used afterwards.

I will also change the mapping builder to reflect this in my current branch.

As for the question in #23 (comment), perhaps Alexandre can also tell us whether he can live with Neighbors instead of the BallTree?

@ogrisel
Member

ogrisel commented Nov 29, 2010

Pass the pre-configured instance, not a factory. If you need to adjust n_neighbors later, then explicitly pass the number of neighbors to extract as part of the public API of the neighbors_finder or topology_extractor instance.

For instance, we can say that instances of the "neighbors_finder" contract must implement a predict method that accepts an optional argument n_neighbors (which will otherwise be set to a reasonable default, such as 3).
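A hypothetical instance honoring that contract (the class name and default only illustrate the proposal):

```python
import numpy as np

class EuclideanNeighborsFinder(object):
    """Pre-configured instance implementing the proposed contract."""

    def __init__(self, default_n_neighbors=3):
        self.default_n_neighbors = default_n_neighbors

    def fit(self, X):
        self._X = np.asarray(X)
        return self

    def predict(self, x, n_neighbors=None):
        # fall back to the configured default when the caller does not override
        k = n_neighbors if n_neighbors is not None else self.default_n_neighbors
        dist = ((self._X - np.asarray(x)) ** 2).sum(axis=1)
        return np.argsort(dist)[:k]
```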

@mbrucher
Contributor

You prefer the instance directly? Why not. It could be reused or built differently. And it's fewer lines :)
The predict method cannot expect n_neighbors: for some algorithms it makes no sense (an example would be a pure Parzen-window-based Neighbors class).

@ogrisel
Member

ogrisel commented Nov 29, 2010

Yes: instances instead of parameterized factories that force the user to pass generic maps of parameters all around the API and make it overly complex!

As for n_neighbors, it was just a proposal for an optional parameter to restore some of the flexibility that might have been lost by switching from the factory pattern to the configured-instance pattern: if you don't need it for the manifold module, we don't have to introduce it (unless it is useful for other use cases).

@agramfort
Member Author

I have to admit I have a hard time following the discussion.
@matthieu: do not hesitate to pull from this branch and work from there.

These might be naive comments, but I had in mind that my kneighbors_graph function
was all that was needed for the manifold module. A new weight option can be added
if necessary, though. To me, manifold learning works by building an affinity matrix
directly derived from the kneighbors_graph output and then applying simple linear
algebra, hence a few lines of code (see the sketch below). But I might be wrong.
The NeighborsBarycenter object would then stay out of the manifold module.
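To illustrate the claim, here is hypothetical glue code using Laplacian eigenmaps as the example, assuming kneighbors_graph returns a sparse adjacency matrix (dense math for simplicity; not part of this PR):

```python
import numpy as np
from scikits.learn.neighbors import kneighbors_graph  # name from this PR

def laplacian_embedding(X, n_neighbors=5, n_components=2):
    # affinity: symmetrized k-NN adjacency
    W = kneighbors_graph(X, n_neighbors).toarray()
    W = np.maximum(W, W.T)
    # unnormalized graph Laplacian L = D - W
    L = np.diag(W.sum(axis=1)) - W
    # eigenvectors with the smallest eigenvalues, skipping the constant one,
    # give the embedding coordinates
    vals, vecs = np.linalg.eigh(L)
    return vecs[:, 1:n_components + 1]
```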

@agramfort
Member Author

What should I do?
Wait for Matthieu to work from there, or do the merge, close the pull request,
and have Matthieu merge from origin/master?

@mbrucher
Contributor

mbrucher commented Dec 2, 2010

+1 for the merge.
Better to close some "tickets" than to leave them open when they are OK ;)

@GaelVaroquaux
Member

I would say that if the APIs are going to change, I'd rather merge in 'origin', but in a 0.7.X branch, to avoid putting moving targets in the 0.6 release.

@mbrucher
Contributor

mbrucher commented Dec 2, 2010

I think that the only thing that will change in my branch is adding the keyword neighbors_finder. The rest should be OK. I may have to add one function for a symmetrical graph, but that does not modify the existing API.

@GaelVaroquaux
Member

Could you do it in a separate branch, so that we can merge this into a common branch and try to use it on real problems (to discover where it sucks)?

@mbrucher
Contributor

mbrucher commented Dec 2, 2010

I can ;)
