ENH: use Bayesian priors in Nearest Neighbors classifier (Issue 399) #970
Conversation
Thanks for submitting this! I'll take a detailed look, but I'm travelling today so it might have to wait until tomorrow or the weekend.
sklearn/neighbors/base.py
Outdated
@@ -567,8 +567,11 @@ def fit(self, X, y):
    y : {array-like, sparse matrix}, shape = [n_samples]
        Target values, array of integer values.
    """
    self._y = np.asarray(y)
I'm not so familiar with this module, but if the `y` wasn't preprocessed before, you should probably do a `unique` with `return_inverse` (use the one from utils for backward compatibility) to get integer labels from whatever was put in. Not sure how this `fit` is called, though, so maybe you should wait for Jake's opinion.
If I understand correctly, this will also take care of the next (line 571) line:
self._classes, self._y = _unique(y, return_inverse=True)
Yes, it should. And I looked it up: there is no prior processing, so this is the right place to do it, I think.
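For illustration, here is a minimal sketch (not the PR's code) of the kind of preprocessing being discussed, using `np.unique` directly rather than the backward-compatible wrapper in `utils`:

```python
import numpy as np

# Example labels as they might be passed to fit(); string labels work too.
y = np.array(["cat", "dog", "cat", "bird", "dog"])

# return_inverse gives integer codes into the sorted array of unique classes,
# which is what the classifier can store internally as self._y.
classes, y_encoded = np.unique(y, return_inverse=True)

print(classes)    # ['bird' 'cat' 'dog']
print(y_encoded)  # [1 2 1 0 2]
```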
Thanks for your contribution. There is already a While you commented your code, you did not change the docstring or the narrative documentation. People who want to use a class usually look into these two places, not the code, so the functionality should be explained there. If possible (and sensible) I'd like new functionality to also be integrated into an example, preferably an existing one.
sklearn/neighbors/classification.py
Outdated
# given tested point), the corresponding element in `probabilities`
# would still be incremented only once.
outliers = []  # row indices of the outliers (if any)
for row in range(len(pred_labels)):
I didn't check the details, but I think you can do the whole thing using `np.bincount` with the `weights` parameter.
I could, but then I'd have to use the `minlength` parameter, thus:

    np.bincount(pi, weights[i], minlength=self._classes.size)

The problem is that `minlength` was introduced in NumPy 1.6.0, so it's too recent for what we're supporting. One approach would be:

    np.bincount(np.append(pi, np.arange(self._classes.size)),
                np.append(weights[i], np.zeros(self._classes.size)))

which basically adds the whole range of classes, but weighted with 0.0 so they have no impact besides making sure the returned array is `self._classes.size` long. Though I'm not sure this is more efficient or less kludgy than the loop in the original code I submitted.
Or I'd be happy to submit a wrapper in `utils/fixes.py` around `bincount` (like we have for `unique`).
Can't you just use bincount and then append zeros?
I guess I was too focused on solving it in only 1 line :-) So something like:
    unpadded_probs = np.bincount(pi, weights[i])
    probabilities[i] = np.append(unpadded_probs, np.zeros(self._classes.size - unpadded_probs.shape[0]))
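Putting those two suggestions together, a self-contained sketch of the padding approach might look like this (the function name and example values are illustrative, not from the PR):

```python
import numpy as np

def padded_bincount(indices, weights, n_classes):
    """Weighted bincount padded with zeros to length n_classes.

    Equivalent to np.bincount(indices, weights, minlength=n_classes)
    without requiring NumPy >= 1.6.
    """
    counts = np.bincount(indices, weights=weights)
    return np.append(counts, np.zeros(n_classes - counts.shape[0]))

# Example: 4 neighbors falling into classes 0, 2, 2, 0 with uniform weights,
# out of 5 possible classes.
print(padded_bincount(np.array([0, 2, 2, 0]), np.ones(4), 5))
# -> [ 2.  0.  2.  0.  0.]
```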
@cricketsong I think this looks good in principle, but I didn't check the details. I hope my comments make sense; for a detailed review, let's wait for Jake :)
@amueller First of all, thank you for your numerous suggestions.
Thanks again!
Just a note on the naming conventions for attributes and parameters in sklearn: attributes that have a trailing underscore are those estimated from the data during `fit`. Please have a look at APIs of scikit-learn objects. The rest of the contributors guide is interesting too.
Also could you add
@cricketsong Can you explain 2) again? The new
@amueller The new `class_prior` parameter changes which prior enters the computation. We want the posterior probability of each class given the query point.

When we use Bayes' rule to compute it, the original code implicitly used the empirical class proportions as the prior, which makes the posterior reduce to the (weighted) fraction of neighbors belonging to each class. In the new version, our prior is arbitrary, so when we use Bayes' rule, we compute the likelihood of the neighborhood under each class, multiply it by the user-supplied prior, and normalize.
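For what it's worth, here is my reading of that computation in symbols, following Bishop's nearest-neighbour density argument; the notation is mine, not the PR's:

```latex
% K neighbours of the query point x fall inside a small volume V;
% K_c of them belong to class c, and N_c of the N training points are in class c.
\[
  p(x \mid c) \approx \frac{K_c}{N_c V}, \qquad
  p(c \mid x) \;\propto\; p(x \mid c)\, p(c) \;\propto\; \frac{K_c}{N_c}\, p(c).
\]
% With the empirical prior p(c) = N_c / N this collapses to the familiar rule
% p(c | x) = K_c / K; with an arbitrary class_prior it does not.
```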
# M += 1: if a predicted index was to occur more than once (for a
# given tested point), the corresponding element in `probabilities`
# would still be incremented only once.
for i, pi in enumerate(pred_labels):
Hm, I thought you could eliminate this one, too. But I guess it is not so obvious. I think you'd need my favourite non-existing numpy function `bincount2d` (analogous to `histogram2d`). One of these days I'm gonna write it ;)
nice work :)
sklearn/neighbors/base.py
Outdated
self._classes, self._y = unique(y, return_inverse=True)
self.class_prior_ = np.zeros(self._classes.size)
for i, y_i in enumerate(self._classes):
    self.class_prior_[i] = np.float(np.sum(y == y_i)) / len(y)
This can be accomplished more efficiently with

    self.class_prior_ = np.bincount(self._y).astype(float) / len(self._classes)

I believe this should work.
Unless I'm mistaken, we should divide by `len(self._y)` (instead of `len(self._classes)`), which will give us the proportion of points in each class.
You're right: we should divide by `len(self._y)`.
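For concreteness, a small example of the corrected computation (the label array is made up):

```python
import numpy as np

# self._y holds integer-encoded labels; e.g. 3 classes over 6 samples.
_y = np.array([0, 0, 1, 2, 2, 2])

# Class prior = proportion of training points in each class.
class_prior_ = np.bincount(_y).astype(float) / len(_y)
print(class_prior_)  # approximately [0.333, 0.167, 0.5]
```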
This is looking good. A few thoughts:
All that aside: great work. I think once this is cleaned up, it will be a nice addition to the module.
Looking good. Above I said a boolean flag would be good; I now think it would be better to allow the user to pass their own Bayesian priors via a `class_prior` parameter. Getting close!
If everything is satisfactory, I'll work on the corresponding narrative documentation tomorrow morning (EDT/Montreal time).
@cricketsong - all looks good. Thanks for all the explanations and responses above. I'm excited to see the narrative documentation! Again, thanks for all the work you're doing on this.
@amueller - good question: I had to think about it a bit!
Nearest neighbors classification is a bit of a hybrid: it has characteristics of both generative and discriminative classification, so it could go either way. But this PR frames nearest neighbors classification squarely in a Bayesian, generative sense. Because of this, and because the default behavior is similar to that of `GaussianNB`, which also uses a `class_prior` parameter, I think keeping the same name makes sense.
@jakevdp Ok, I agree. I didn't know that GNB used the same parameter name. The difference in default behavior is also a good argument. I didn't realize that. Thanks for your thoughts :) |
doc/modules/neighbors.rst
Outdated
@@ -94,7 +94,19 @@ be accomplished through the ``weights`` keyword. The default value,
distance from the query point. Alternatively, a user-defined function of the
distance can be supplied which is used to compute the weights.

The nearest neighbors classification algorithm is implicitly based on
I would rather say something like "there is a probabilistic interpretation of nn".
This looks pretty good. Could you maybe add a little explanation to the example explaining the plot? I haven't looked at it myself but will do tomorrow.
@amueller Thanks for the comments. I've corrected the =/- problem, applied the modification you suggested re: the probabilistic interpretation, and commented on the example.
@cricketsong Sorry, I forgot a bit about this PR, I have been pretty busy. What is the current status?
@amueller - Thanks for bringing this up again - I need to take some time to look back at this. I recall it was in pretty good shape, but I'd like to do a full review.
What is the current status of this PR? Can we recap? If @cricketsong is not available maybe I could carry it out. cc @amueller @jakevdp @ogrisel
Status is stalled in ancient history! I like the idea of supporting a Bayesian KNN (mostly because users don't think about this approach to KNN), but I don't think we're going to make it a top priority!
One thing that would help here is having a reference (a textbook, a paper, etc.) that would allow us to ensure correctness according to an established standard.
Am I correct in thinking that under this approach, `n_neighbors=1` will produce a posterior with all its mass in one class? (I've not tried running the code because it would involve looking at 6-year-old compilation instructions, lol.)

The math here seems in accordance with Bishop, but it goes against my intuition of a Bayesian prior, wherein we should rely on the prior more when the number of neighbors available to inform us is small (e.g. when using `radius_neighbors` in a region that is sparse in the training set). Am I mistaken?
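To illustrate the first question, here is a toy calculation assuming the posterior is computed as sketched earlier (K_c / N_c times the prior, then normalized); all numbers are made up:

```python
import numpy as np

# Hypothetical setup: 2 classes with training counts N_c = [90, 10],
# a prior strongly favouring class 1, and n_neighbors=1 so the single
# neighbor belongs to class 0 (K_c = [1, 0]).
N_c = np.array([90.0, 10.0])
prior = np.array([0.2, 0.8])
K_c = np.array([1.0, 0.0])

posterior = (K_c / N_c) * prior
posterior /= posterior.sum()
print(posterior)  # [ 1.  0.] -- all mass on the single neighbor's class
```

Under that reading, the prior indeed has no influence when every neighbor belongs to the same class, which is what happens with `n_neighbors=1`.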
class_prior : str, list or ndarray, optional (default = 'default')
    Class prior probabilities used in prediction. Possible values:

    - 'default': default prior probabilities. For each class, its
should this be called 'mle'?
    - 'default': default prior probabilities. For each class, its
      prior probability is the proportion of points in the dataset
      that are in this class.
    - 'flat': equiprobable prior probabilities. If there are C classes,
or 'uniform'?
if class_prior in (None, 'default'):
    return np.bincount(y).astype(float) / len(y)
elif class_prior == 'flat':
    return np.ones((len(np.unique(y)),)) / len(np.unique(y))
Since we normalise later, we can just return 1 here.
class_prior_arr: array of the same shape as ``np.unique(y)``
"""
if class_prior in (None, 'default'):
    return np.bincount(y).astype(float) / len(y)
Since we normalize later, we can ignore the `/ len(y)`.
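Pulling those review suggestions together, one possible shape of the helper could be the following; the function name, error message, and exact behaviour are my own, not the PR's:

```python
import numpy as np

def _resolve_class_prior(class_prior, y):
    """Illustrative sketch: turn the class_prior argument into an
    unnormalized per-class weight vector (normalization happens later)."""
    n_classes = len(np.unique(y))
    if isinstance(class_prior, str):
        if class_prior == 'default':
            # empirical class counts; dividing by len(y) is unnecessary
            return np.bincount(y).astype(float)
        if class_prior == 'flat':
            # every class gets the same weight
            return np.ones(n_classes)
        raise ValueError("unknown class_prior string: %r" % class_prior)
    if class_prior is None:
        return np.bincount(y).astype(float)
    prior = np.asarray(class_prior, dtype=float)
    if prior.shape != (n_classes,):
        raise ValueError("class_prior has the wrong shape")
    return prior

print(_resolve_class_prior('flat', np.array([0, 0, 1, 2])))  # [ 1.  1.  1.]
print(_resolve_class_prior(None, np.array([0, 0, 1, 2])))    # [ 2.  1.  1.]
```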
# Compute the unnormalized posterior probability, taking
# self.class_prior_ into consideration.
class_count = np.bincount(self._y)
Nowadays we need to handle multioutput `y`. Perhaps each output will have its own prior...
probabilities[i] = 1e-6  # prevent division by zero later
continue  # outlier
# When we support NumPy >= 1.6, we'll be able to simply use:
# np.bincount(pi, weights, minlength=self._classes.size)
we support this now.
Should we close? We still have the open issue #399.
The k-Nearest-Neighbors and Radius Neighbors classifiers now make explicit use of the Bayesian prior probabilities of the classes (`self.class_prior_`). By default, each of these prior probabilities (for a given class) is set to the proportion of sample points which are in the given class, but they can also be set to other values by the class user.

(This is my first submission, so I'd be grateful for any and all advice. Thank you!)
Fixes #399
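Had this been merged, usage might have looked roughly like the sketch below. Note that the `class_prior` constructor parameter is only what this PR proposed; it is not part of released scikit-learn:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[0.0], [0.1], [0.2], [1.0], [1.1]])
y = np.array([0, 0, 0, 1, 1])

# Default behaviour: empirical class proportions as the prior.
clf = KNeighborsClassifier(n_neighbors=3, class_prior='default')
clf.fit(X, y)

# A user-supplied prior strongly favouring class 1.
clf_prior = KNeighborsClassifier(n_neighbors=3, class_prior=[0.1, 0.9])
clf_prior.fit(X, y)

print(clf.predict_proba([[0.5]]))
print(clf_prior.predict_proba([[0.5]]))
```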