[MRG+1] Add predict_proba(X) and outlier handler for RadiusNeighborsClassifier #9597
Conversation
Is there a reason to provide this for radius neighbors and not kneighbors?
@jnothman Hi, the predict_proba function of kneighbors has already been implemented, and I do not find it for radius neighbors. Sorry, I am new here. Is there anything I should do first?
ahh okay!
…On 22 Aug 2017 10:05 am, "Wenbo Zhao" wrote:
@jnothman The predict_proba function of kneighbors has already been implemented.
Is there more that can be shared between the radius and k implementations?
Please add a unit test
… of outlier label
I thought I commented here, but I can't see it now. Perhaps I commented in my head. #970 makes the MAP prediction. You sample from the posterior distribution. This is a very substantial difference. I can see the benefit of sampling from the posterior, but I think we should prefer the MAP. Certainly, if we allow stochastic predictions (which I don't think we do anywhere else, but of course …
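As a minimal sketch of the two behaviours under discussion (not code from the PR; proba stands for a hypothetical predict_proba row for one sample):

import numpy as np

proba = np.array([0.7, 0.3])  # hypothetical posterior over two classes

map_pred = proba.argmax()                        # MAP: deterministically class 0
sampled = np.random.choice(len(proba), p=proba)  # stochastic: class 0 about 70% of the time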
@jnothman I think I am a little bit confused about prior and posterior in Neighbors models. In my mind, the prior distribution is the distribution of the target variable, 'y', in the training data. Assuming that we have 5 samples in the training data with labels of [0, 0, 0, 1, 1], then the prior is P(0) = 0.6 and P(1) = 0.4.
The posteriors are something like …
Am I correct?
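The prior as described can be checked in two lines; y here is the toy label vector from the comment above:

import numpy as np

y = np.array([0, 0, 0, 1, 1])
prior = np.bincount(y) / len(y)  # array([0.6, 0.4]), i.e. P(0) = 0.6 and P(1) = 0.4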
I think you should avoid making predictions stochastic, at least for now. As I said, a user can do so by sampling from predict_proba. I'm not sure what you're trying to say, so let me try from the top. In a bayesian approach to NN, the predicted distribution is the posterior distribution of classes, accounting for both a prior distribution (our belief of how classes should be distributed before we look at the labels in the neighborhood of the sample) and the observations (i.e. the class distribution in the neighborhood of the sample). The posterior is generally P(class | neighborhood) ∝ P(neighborhood | class) * P(class without regard to neighborhood). So you calculate this for each class.
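A small numeric sketch of that rule, with a made-up likelihood purely for illustration:

import numpy as np

prior = np.array([0.6, 0.4])       # P(class), e.g. training label frequencies
likelihood = np.array([0.2, 0.5])  # P(neighborhood | class), however it is modelled
posterior = prior * likelihood
posterior /= posterior.sum()       # P(class | neighborhood): array([0.375, 0.625])
map_class = posterior.argmax()     # MAP prediction: class 1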
@jnothman Thanks for your answer! It makes it clear to me now. The reason I considered adding an outlier handler is that, since an outlier doesn't have any neighbor within a fixed radius, all of its class probabilities will be 0. A 0 probability will have a very huge impact on scoring such as log_loss. Therefore, I thought assigning outliers the prior or uniform probabilities in predict_proba would help.
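To illustrate the log_loss point, a sketch with made-up probabilities, where the second sample stands for an outlier whose true class would get probability 0:

import numpy as np
from sklearn.metrics import log_loss

y_true = [0, 1]
proba_zero = np.array([[1.0, 0.0], [1.0, 0.0]])   # outlier: P(true class) = 0
proba_prior = np.array([[1.0, 0.0], [0.6, 0.4]])  # outlier assigned the class prior

log_loss(y_true, proba_zero)   # huge; the clipped zero probability dominates the average
log_loss(y_true, proba_prior)  # modest: -log(0.4) / 2, about 0.46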
Okay, but having a prior is fundamentally different to outputting a different label in the case of an outlier.
This still looks good to me, though I've not benchmarked it myself. We might want to benchmark in comparison to #13783.
doc/whats_new/v0.22.rst (outdated):

- |Efficiency| Efficiency improvements for
  :func:`neighbors.RadiusNeighborsClassifier.prdict` by changing
  implementation from scipy.stats.mode to numpy.bincount.
I don't think this is the right description anymore.
Isn't it?
I think @jnothman meant something along the lines of,
- |Efficiency| Efficiency improvements for
:func:`neighbors.RadiusNeighborsClassifier.predict_proba` by changing
implementation from scipy.stats.mode to numpy.bincount, and for
:func:`neighbors.RadiusNeighborsClassifier.predict` that is now computed as
an `argmax` of `predict_proba`.
?
We don't have neighbors.RadiusNeighborsClassifier.predict_proba in the current master. It is created in this branch.
How about
- |Efficiency| Efficiency improvements for :func:`neighbors.RadiusNeighborsClassifier.predict` by changing implementation from using `scipy.stats.mode` to using `numpy.bincount` in `predict_proba`.
Right, of course. But then we can't say that the implementation of predict_proba changed from using scipy.stats.mode, as it didn't exist previously. Maybe just say that "predict is now computed from predict_proba" (and that it's faster) without going into much detail.
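As a sketch of what that wording describes, assuming the branch's API (predict_proba exists and predict reduces to an argmax over it):

import numpy as np
from sklearn.neighbors import RadiusNeighborsClassifier

X = np.array([[0.0], [0.1], [0.9], [1.0]])
y = np.array([0, 0, 1, 1])
clf = RadiusNeighborsClassifier(radius=0.3).fit(X, y)

proba = clf.predict_proba(X)  # available once this branch is merged
y_pred = clf.classes_.take(np.argmax(proba, axis=1))
# matches clf.predict(X): array([0, 0, 1, 1])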
fix typo Co-Authored-By: Joel Nothman <joel.nothman@gmail.com>
A small benchmark shows a nice speed up compared to master. I did not compare with #14543 though. @rth

import itertools
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.neighbors import RadiusNeighborsClassifier
class OldRadiusNeighborsClassifier(RadiusNeighborsClassifier):
def predict(self, X):
""""""
from sklearn.utils.validation import check_array
from sklearn.neighbors.base import _get_weights
from sklearn.utils.extmath import weighted_mode
from scipy import stats
X = check_array(X, accept_sparse='csr')
n_samples = X.shape[0]
neigh_dist, neigh_ind = self.radius_neighbors(X)
inliers = [i for i, nind in enumerate(neigh_ind) if len(nind) != 0]
outliers = [i for i, nind in enumerate(neigh_ind) if len(nind) == 0]
classes_ = self.classes_
_y = self._y
if not self.outputs_2d_:
_y = self._y.reshape((-1, 1))
classes_ = [self.classes_]
n_outputs = len(classes_)
if self.outlier_label is not None:
neigh_dist[outliers] = 1e-6
elif outliers:
raise ValueError('No neighbors found for test samples %r, '
'you can try using larger radius, '
'give a label for outliers, '
'or consider removing them from your dataset.' %
outliers)
weights = _get_weights(neigh_dist, self.weights)
y_pred = np.empty((n_samples, n_outputs), dtype=classes_[0].dtype)
for k, classes_k in enumerate(classes_):
pred_labels = np.zeros(len(neigh_ind), dtype=object)
pred_labels[:] = [_y[ind, k] for ind in neigh_ind]
if weights is None:
mode = np.array(
[stats.mode(pl)[0] for pl in pred_labels[inliers]],
dtype=np.int)
else:
mode = np.array([
weighted_mode(pl, w)[0]
for (pl, w) in zip(pred_labels[inliers], weights[inliers])
], dtype=np.int)
mode = mode.ravel()
y_pred[inliers, k] = classes_k.take(mode)
if outliers:
y_pred[outliers, :] = self.outlier_label
if not self.outputs_2d_:
y_pred = y_pred.ravel()
return y_pred
OldRadiusNeighborsClassifier.branch = 'master'
RadiusNeighborsClassifier.branch = 'branch'
results = []  # (branch, weights, n_samples, n_features, average time)
for klass, weights, n_samples, n_features in itertools.product(
[OldRadiusNeighborsClassifier, RadiusNeighborsClassifier],
['uniform', 'distance'],
[100, 1000, 10000],
[10, 100, 1000, 10000],
):
X = np.random.randn(n_samples, n_features)
y = np.random.randint(2, size=n_samples)
neigh = klass(weights=weights, radius=0.2).fit(X, y)
    out = %timeit -o neigh.predict(X)  # IPython magic; run in IPython/Jupyter
results.append((klass.branch, weights, n_samples, n_features, out.average))
##################
# plot the results
df = pd.DataFrame(
results, columns='name weights n_samples n_features average'.split(' '))
table = df.pivot_table(index='n_features',
columns=['name', 'weights', 'n_samples'],
values='average')
table = table['branch'] / table['master']
table.plot(marker='.', logx=True)
plt.title('time(branch) / time(master)')
plt.show()
typo fix Co-Authored-By: Tom Dupré la Tour <tom.dupre-la-tour@m4x.org>
Co-Authored-By: Tom Dupré la Tour <tom.dupre-la-tour@m4x.org>
fix typo Co-Authored-By: Tom Dupré la Tour <tom.dupre-la-tour@m4x.org>
… of np.where in outlier_label checking
…rn into RadNeiClfPredProb
Update fork repo
…rn into RadNeiClfPredProb
The existence of that PR shouldn't prevent this one from getting merged. It's a clear improvement. Looks like there are 2 approvals, and comments were addressed apart from #9597 (comment).
The case benchmarked above (#9597 (comment)) with …
I haven't reviewed the code since there are enough reviews, but +1 for merge based on the benchmark results.
I agree.
Thanks for this nice work @webber26232! 🎉
Sigh. Finally! Thank you!
Reference Issue
What does this implement/fix? Explain your changes.
Currently, RadiusNeighborsClassifier doesn't provide the function predict_proba(X). When an outlier (a sample that doesn't have any neighbors within a fixed radius) is detected, no class label can be assigned.
This branch implements predict_proba(X) for RadiusNeighborsClassifier. The class probabilities are generated from each sample's neighbors and their weights within the radius r.
When outliers (samples with no neighbors within the radius r) are detected, 4 solutions are provided, controlled by the parameter outlier_label:
Any other comments?
Documentation is added for both the RadiusNeighborsClassifier class (the outlier solution and an example) and the predict_proba function.
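A rough sketch of the bincount-based probability accumulation described above; the helper name and data layout are illustrative, not the PR's exact code:

import numpy as np

def radius_predict_proba(neigh_labels, neigh_weights, n_classes):
    """neigh_labels[i]: class indices of sample i's neighbors within radius r;
    neigh_weights[i]: matching weights, or neigh_weights=None for uniform."""
    proba = np.zeros((len(neigh_labels), n_classes))
    for i, labels in enumerate(neigh_labels):
        if len(labels) == 0:
            continue  # outlier: left all-zero here; the PR handles it via outlier_label
        w = None if neigh_weights is None else neigh_weights[i]
        counts = np.bincount(labels, weights=w, minlength=n_classes)
        proba[i] = counts / counts.sum()
    return proba

# Two samples: one with neighbors of classes [0, 0, 1], one outlier with none.
radius_predict_proba([np.array([0, 0, 1]), np.array([], dtype=int)], None, 2)
# -> array([[0.667, 0.333], [0.   , 0.   ]])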