
ENH: use Bayesian priors in Nearest Neighbors classifier (Issue 399) #970

Closed

Conversation

@cricketsong cricketsong commented Jul 19, 2012

The k-Nearest-Neighbors and Radius Neighbors classifiers now make explicit use of the Bayesian prior probabilities of the classes (self.class_prior_). By default, each of these prior probabilities (for a given class) is set to the proportion of sample points which are in the given class; but they can also be set to other values by the class user.

(This is my first submission so I'd be grateful for any and all advice. Thank you!)

Fixes #399

@jakevdp
Member

jakevdp commented Jul 19, 2012

Thanks for submitting this! I'll take a detailed look, but I'm travelling today so it might have to wait until tomorrow or the weekend.

@@ -567,8 +567,11 @@ def fit(self, X, y):
y : {array-like, sparse matrix}, shape = [n_samples]
Target values, array of integer values.
"""
self._y = np.asarray(y)
Member

I'm not so familiar with this module, but if the y wasn't preprocessed before, you should probably do a `unique` with return_inverse (use the one from utils for backward compatibility) to get integer labels from whatever was put in. Not sure how this fit is called, though, so maybe you should wait for Jake's opinion.

Author

If I understand correctly, this will also take care of the next line (line 571):

self._classes, self._y = _unique(y, return_inverse=True)

Member

Yes, it should. And I looked it up: there is no prior processing, so this is the right place to do this, I think.

@amueller
Member

Thanks for your contribution.
I am not so familiar with this module, but here are some general comments.
First, I think we should not lightly change the default behavior of our classifiers. People who upgrade might get unexpected results. Also, this might not be the expected behavior for textbook KNN.
So I would rather like to have a parameter to control the behavior.
Also, you mentioned the user could set the prior, but I don't see that you provided a way to do this. I think people would usually use KNeighborsClassifier and RadiusNeighborsClassifier, so these should provide this option.

There is already a weights parameter, but I think this works on the samples, not the classes. Other estimators have a class_weight parameter, maybe you can adopt this name.

While you commented your code, you did not change the docstring or the narrative documentation. People who want to use a class usually look into these two places, not the code. So the functionality should be explained there.

If possible (and sensible) I like new functionality also to be integrated in an example, preferably an existing one.

# given tested point), the corresponding element in `probabilities`
# would still be incremented only once.
outliers = [] # row indices of the outliers (if any)
for row in range(len(pred_labels)):
Member

I didn't check the details, but I think you can do the whole thing using np.bincount with the weights parameter.

Author

I could, but then I'd have to use the minlength parameter, thus:

np.bincount(pi, weights[i], minlength=self._classes.size)

The problem is that minlength was introduced in NumPy 1.6.0, so it's too recent for what we're supporting. One approach would be:

np.bincount(np.append(pi, np.arange(self._classes.size)), np.append(weights[i], np.zeros(self._classes.size)))

which basically adds the whole range of classes, but weighted with 0.0 so they have no impact besides making sure the returned array is self._classes.size long. Though I'm not sure this is more efficient or less kludgy than the loop in the original code I submitted.

Or I'd be happy to submit a wrapper in utils/fixes.py around bincount (like we have for unique).

Member

Can't you just use bincount and then append zeros?

Author

I guess I was too focused on solving it in only 1 line :-) So something like:

unpadded_probs = np.bincount(pi, weights[i])
probabilities[i] = np.append(unpadded_probs, np.zeros(self._classes.size - unpadded_probs.shape[0]))
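For reference, here is the same padding trick as a standalone sketch (toy values; pi, weights_i and n_classes are illustrative names, not the PR's actual variables):

import numpy as np

pi = np.array([0, 2, 2, 1])                 # class index of each neighbor of one test point
weights_i = np.array([1.0, 0.5, 0.5, 1.0])  # weight of each of those neighbors
n_classes = 5                               # total number of classes in the training set

# bincount only counts up to max(pi); padding with zeros makes the result
# always n_classes long (same effect as minlength=n_classes in NumPy >= 1.6)
unpadded_probs = np.bincount(pi, weights_i)
probs = np.append(unpadded_probs, np.zeros(n_classes - unpadded_probs.shape[0]))
# probs is now [1., 1., 1., 0., 0.]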

@amueller
Member

@cricketsong I think this looks good in principle, but I didn't check details. I hope my comments make sense; for a detailed review, let's wait for Jake :)

@cricketsong
Author

@amueller First of all, thank you for your numerous suggestions.

  1. I will definitely look into using np.bincount where you've suggested it.
  2. This PR does not in fact change the default behavior: in its current form, the KNN and Radius Neighbor classifiers already implicitly rely on a Bayesian prior computed as the proportion of sample points which are in each class. (When using such a prior, the posterior probability for class k reduces simply to the ratio of points in the volume V which are in class k, where the volume is either the one covering the k nearest neighbors (as in KNeighborsClassifier), or the volume with the given radius (as in RadiusNeighborsClassifier). More details can be found in Bishop, Pattern Recognition and Machine Learning, Springer, 2006, p. 125.)
  3. Per Jake's last comment in Issue 399, that case is similar to the class_prior_ parameter in GaussianNB. It was my understanding that in the latter, the user could set the value of class_prior_ using set_params (inherited from class BaseEstimator); but now I'm not so sure. I'll wait to see what Jake had in mind.
  4. I will modify the docstring and the narrative documentation, and will include an example.

Thanks again!

@ogrisel
Member

ogrisel commented Jul 20, 2012

Just a note on the naming conventions for attributes and parameters in sklearn: attributes that have a trailing _ are fitted parameters learned from the data, and you should not expect the user to set them manually. User-adjustable parameters should be stored under their name (without the trailing _) in the constructor (and can then be adjusted using set_params when nesting / stacking models in a sklearn.pipeline.Pipeline compound construct).

Please have a look at APIs of scikit-learn objects. The rest of the contributors guide is interesting too.
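For illustration, the convention looks roughly like this (a toy estimator, not code from this PR):

import numpy as np
from sklearn.base import BaseEstimator

class ToyPriorClassifier(BaseEstimator):
    def __init__(self, class_prior=None):
        # user-adjustable parameter: no trailing underscore, stored as passed
        self.class_prior = class_prior

    def fit(self, X, y):
        # fitted attribute: trailing underscore, computed from the data in fit
        if self.class_prior is None:
            self.class_prior_ = np.bincount(y).astype(float) / len(y)
        else:
            self.class_prior_ = np.asarray(self.class_prior, dtype=float)
        return self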

@ogrisel
Member

ogrisel commented Jul 20, 2012

Also, could you add a References section in the docstrings of the models that gives "Bishop, Pattern Recognition and Machine Learning, Springer, 2006, p. 125." as a reference for the explicit use of priors in k-NN models? It should also be mentioned in the narrative doc.

@amueller
Member

@cricketsong Can you explain 2) again? The new predict_proba method gives different results than the old predict_proba method, right? Maybe I misunderstood, though.

@cricketsong
Author

@amueller The new predict_proba, by default (without the user changing the class priors) should -- and does -- give the same result as the old one. Here's why:

We want the posterior P(C_k | x) to correctly classify x. We can use the following approximations (the same are used in kNN and in Radius Neighbors):

  1. P(x | C_k) = K_k / (N_k V) where K_k is the number of points in the neighborhood of x which are in class k; N_k is the total number of points in class k; and V is the volume of the neighborhood. (About that volume: we can safely ignore it because it will get simplified away when we apply Bayes' rule).
  2. p(x) = K / (NV) where K is the total number of points in the neighborhood (fixed in kNN, variable in Radius Neighbors) and N is the total number of points in the training set.
  3. p(C_k) = N_k / N. This is the prior by default in the new predict_proba and (as we'll see) the implicit prior in the old predict_proba.

When we use Bayes' rule to compute P(C_k | x) from these 3 approximations, we get P(C_k | x) = K_k / K. And that expression is precisely what was used by the old predict_proba: it was implicitly using the prior P(C_k) = N_k / N.

In the new version the prior is arbitrary, so when we apply Bayes' rule we first compute P(x | C_k) as the (possibly weighted) neighbor count K_k divided by N_k (ignoring V), multiply by the arbitrary prior P(C_k), and then divide by p(x) (which also ignores V).
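A quick numeric sanity check of this equivalence (toy counts, not the PR's code):

import numpy as np

N_k = np.array([50, 30, 20])   # training points per class
K_k = np.array([3, 1, 1])      # neighbors of x per class
K = K_k.sum()                  # total neighbors of x
N = N_k.sum()                  # total training points

old_posterior = K_k / K                  # what the old predict_proba returns

likelihood = K_k / N_k                   # P(x | C_k), ignoring the volume V
prior = N_k / N                          # default prior p(C_k)
new_posterior = likelihood * prior
new_posterior /= new_posterior.sum()     # dividing by p(x), i.e. normalizing

assert np.allclose(old_posterior, new_posterior)  # identical by construction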

# M += 1: if a predicted index was to occur more than once (for a
# given tested point), the corresponding element in `probabilities`
# would still be incremented only once.
for i, pi in enumerate(pred_labels):
Member

Hm, I thought you could eliminate this one, too. But I guess it is not so obvious. I think you'd need my favourite non-existent numpy function, bincount2d (analogous to histogram2d). One of these days I'm gonna write it ;)

@amueller
Member

nice work :)

self._classes, self._y = unique(y, return_inverse=True)
self.class_prior_ = np.zeros(self._classes.size)
for i, y_i in enumerate(self._classes):
self.class_prior_[i] = np.float(np.sum(y == y_i)) / len(y)
Member

This can be accomplished more efficiently with

self.class_prior_ = np.bincount(self._y).astype(float) / len(self._classes)

I believe this should work

Author

Unless I'm mistaken, we should divide by len(self._y) (instead of len(self._classes)), which will give us the proportion of points in each class.

Member

You're right: we should divide by len(self._y)
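i.e., presumably the corrected one-liner would be:

self.class_prior_ = np.bincount(self._y).astype(float) / len(self._y)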

@jakevdp
Member

jakevdp commented Jul 23, 2012

This is looking good. A few thoughts:

  • in order to maintain backward compatibility, the Bayesian prior should be opt-in. I'm confused by the comment above saying that adding a Bayesian prior does not change the default behavior. Granted, I think I need to sit down and try to recall what I had in mind when I opened Issue Bayesian priors in nearest neighbors classification/regression #399 😄. I pictured having an input flag in the constructor named use_bayesian_prior which is set to False by default.
  • Olivier's comments are good: a reference to Bishop would be nice, as well as some mention of the behavior in the doc string
  • I just noticed that KNeighbors has a predict_proba method, while RadiusNeighbors does not. This is left over from a previous PR which added predict_proba only to one of the classes: it would be nice to add that in RadiusNeighbors now.
  • I'm still confused by the effect of the class prior (see my line note above). Let me know if I'm missing something on that!

All that aside: great work. I think once this is cleaned up, it will be a nice addition to the module.

@jakevdp
Member

jakevdp commented Jul 24, 2012

Looking good. Above I said a boolean flag would be good; I now think it would be better to allow the user to pass their own Bayesian priors via a class_prior argument, which defaults to None. If it's None, the priors would be computed as they are right now. We could also allow class_prior='flat' or other standard options to make it easier on the user.

Getting close!

@cricketsong
Author

If everything is satisfactory, I'll work on the corresponding narrative documentation tomorrow morning (EDT/Montreal time).

@jakevdp
Member

jakevdp commented Jul 24, 2012

@cricketsong - all looks good. Thanks for all the explanations and responses above. I'm surprised that predict_proba is faster... looking at the source of stats.mode, however, I can see why that would be. Your bincount solution in predict_proba is better.

I'm excited to see the narrative documentation! Again, thanks for all the work you're doing on this.

@amueller
Member

I think naming the parameter class_weight instead of class_prior would be better as this name is already used in other classifiers. What do you think @jakevdp (and @ogrisel)?

@jakevdp
Member

jakevdp commented Jul 28, 2012

@amueller - good question: I had to think about it a bit!

class_weight is used for many of the classifiers in sklearn.linear_model. class_prior is currently used in Gaussian Naive Bayes. In discriminative classifiers like those in sklearn.linear_model, class_weight defaults to 1. In Naive Bayes, which is a generative classifier, class_prior defaults to the frequency of the label in the training set.

Nearest neighbors classification is a bit of a hybrid: it has characteristics of both generative and discriminative classification, so it could go either way. But this PR frames nearest neighbors classification squarely in a Bayesian, generative sense. Because of this, and because the default behavior is similar to that of class_prior in GaussianNB, I'd prefer to stick with the label class_prior here.

@amueller
Member

@jakevdp Ok, I agree. I didn't know that GNB used the same parameter name. The difference in default behavior is also a good argument. I didn't realize that. Thanks for your thoughts :)

@@ -94,7 +94,19 @@ be accomplished through the ``weights`` keyword. The default value,
distance from the query point. Alternatively, a user-defined function of the
distance can be supplied which is used to compute the weights.


The nearest neighbors classification algorithm is implicitly based on
Member

I would rather say something like "there is a probabilistic interpretation of NN".

@amueller
Member

This looks pretty good. Could you maybe add a little explanation to the example explaining the plot? I haven't looked at it myself but will do tomorrow.

@cricketsong
Author

@amueller Thanks for the comments. I've corrected the =/- problem, applied the modification you suggested re: the probabilistic interpretation and commented on the example.

@amueller
Member

@cricketsong Sorry I forgot a bit about this PR, I have been pretty busy. What is the current status?
@jakevdp what do you think?

@jakevdp
Member

jakevdp commented Sep 17, 2012

@amueller - Thanks for bringing this up again - I need to take some time to look back at this. I recall it was in pretty good shape, but I'd like to do a full review.

@chkoar
Contributor

chkoar commented Jan 5, 2018

What is the current status of this PR? Can we recap? If @cricketsong is not available maybe I could carry it out. cc @amueller @jakevdp @ogrisel

@jnothman
Member

jnothman commented Jan 7, 2018

Status is stalled in ancient history! I like the idea of supporting a bayesian KNN (mostly because users don't think about this approach to KNN), but I don't think we're going to make it a top priority!

@jnothman
Member

jnothman commented Jan 7, 2018

One thing that would help here is having a reference (a textbook, a paper, etc) that would allow us to ensure correctness according to an established standard.

Member

@jnothman jnothman left a comment

Am I correct in thinking that under this approach, n_neighbors=1 will produce a posterior with all its mass in one class? (I've not tried running the code because it would involve looking at 6-year-old compilation instructions lol.)

The math here seems in accordance with Bishop, but it goes against my intuition of a Bayesian prior, wherein we should rely on the prior more when the number of neighbors available to inform us is small (e.g. when using radius_neighbors in a region that is sparse in the training set). Am I mistaken?

class_prior : str, list or ndarray, optional (default = 'default')
class prior probabilities used in prediction. Possible values:

- 'default': default prior probabilities. For each class, its
Member

should this be called 'mle'?

- 'default': default prior probabilities. For each class, its
prior probability is the proportion of points in the dataset
that are in this class.
- 'flat': equiprobable prior probabilites. If there are C classes,
Member

or 'uniform'?

if class_prior in (None, 'default'):
return np.bincount(y).astype(float) / len(y)
elif class_prior == 'flat':
return np.ones((len(np.unique(y)),)) / len(np.unique(y))
Member

Since we normalise later, we can just return 1 here.

class_prior_arr: array of the same shape as ``np.unique(y)``
"""
if class_prior in (None, 'default'):
return np.bincount(y).astype(float) / len(y)
Member

since we normalize later, we can ignore the / len(y)
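Folding both simplifications together, the helper could presumably reduce to something like this (the function name and signature are assumed here, since only its body appears in the diff):

import numpy as np

def _compute_class_prior(class_prior, y):
    # hypothetical helper: returns unnormalized per-class prior weights for
    # integer-encoded y; normalization is assumed to happen later in predict_proba
    if class_prior in (None, 'default'):
        return np.bincount(y).astype(float)          # proportional to class counts
    elif class_prior == 'flat':
        return np.ones(len(np.unique(y)))            # equiprobable prior
    else:
        return np.asarray(class_prior, dtype=float)  # user-supplied prior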


# Compute the unnormalized posterior probability, taking
# self.class_prior_ into consideration.
class_count = np.bincount(self._y)
Member

nowadays we need to handle multioutput y. Perhaps each will have its own prior...

probabilities[i] = 1e-6 # prevent division by zero later
continue # outlier
# When we support NumPy >= 1.6, we'll be able to simply use:
# np.bincount(pi, weights, minlength=self._classes.size)
Member

we support this now.
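i.e., presumably the padding workaround can now be replaced with the commented-out one-liner:

probabilities[i] = np.bincount(pi, weights, minlength=self._classes.size)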

Base automatically changed from master to main January 22, 2021 10:48
@lorentzenchr
Member

Should we close? We still have the open issue #399.
