[MRG+1] Add predict_proba(X) and outlier handler for RadiusNeighborsClassifier #9597

Merged
merged 63 commits from webber26232:RadNeiClfPredProb into scikit-learn:master on Aug 7, 2019

Conversation

@webber26232 (Contributor) commented Aug 21, 2017

Reference Issue

What does this implement/fix? Explain your changes.

Currently, RadiusNeighborsClassifier doesn't provide a predict_proba(X) method. When an outlier (a sample that doesn't have any neighbors within the fixed radius) is detected, no class label can be assigned.

This branch implements predict_proba(X) for RadiusNeighborsClassifier. The class probabilities are computed from each sample's neighbors and their weights within the radius r.

When outliers (samples with no neighbors within the radius r) are detected, the behaviour is controlled by the parameter outlier_label; the available options are listed in the table below, followed by a short usage sketch:

| outlier_label | predict(X) | predict_proba(X) |
| --- | --- | --- |
| None | An exception is raised. | An exception is raised. |
| int, str, or a list of corresponding labels | The given label is assigned to outliers. | If the given label appears in the training y, its probability is 1 and all others are 0; otherwise all probabilities are 0. |
| 'most_frequent' | The most frequent label in the training target is assigned to outliers. | The probability of the most frequent training label is 1 and all other probabilities are 0. |
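
A minimal usage sketch of the 'most_frequent' option (illustrative only; the data, radius, and printed values are made up, and predict_proba is only available with this branch):

import numpy as np
from sklearn.neighbors import RadiusNeighborsClassifier

X_train = np.array([[0.0], [0.5], [1.0], [5.0], [5.5]])
y_train = np.array([0, 0, 0, 1, 1])

# Outliers (samples with no neighbors within the radius) get the most frequent training label.
clf = RadiusNeighborsClassifier(radius=1.0, outlier_label='most_frequent')
clf.fit(X_train, y_train)

X_test = np.array([[0.2], [5.2], [100.0]])  # the last sample has no neighbors
print(clf.predict(X_test))                  # expected: [0 1 0]
print(clf.predict_proba(X_test))            # the outlier row should be [1., 0.]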

Any other comments?

Documentation is added for the RadiusNeighborsClassifier class (outlier handling and an example) and for the predict_proba method.

@webber26232 changed the title from "Add predict_proba(X) for RaduisNeighborsClassifier" to "Add predict_proba(X) for RadiusNeighborsClassifier" on Aug 21, 2017
@jnothman (Member) left a comment

Is there a reason to provide this for radius neighbors and not kneighbors?

@webber26232 (Contributor, Author) commented Aug 22, 2017

@jnothman Hi, predict_proba has already been implemented for kneighbors, but I could not find it for radius neighbors.

Sorry, I am new here. Is there anything I should do first?

@jnothman (Member) commented Aug 22, 2017 via email

@jnothman (Member) left a comment

Is there more that can be shared between the radius and k implementations?

Please add a unit test.

@webber26232 changed the title from "Add predict_proba(X) for RadiusNeighborsClassifier" to "Add predict_proba(X) and outlier handler for RadiusNeighborsClassifier" on Aug 23, 2017
@webber26232 (Contributor, Author)

@jnothman Hi, I found that #2606, #399 and #970 all mention the same idea: using a Bayesian prior in both K and Radius neighbors. However, this feature has not been implemented yet.

@jnothman (Member)

I thought I commented here, but I can't see it now. Perhaps I commented in my head.

#970 makes the MAP prediction. You sample from the posterior distribution. This is a very substantial difference. I can see the benefit of sampling from the posterior, but I think we should prefer the MAP. Certainly, if we allow stochastic predictions (which I don't think we do anywhere else, but of course predict_proba allows users to sample for themselves), it needs to be controlled by a random_state that the user can set for replicability.

@webber26232 (Contributor, Author)

@jnothman I think I am a little bit confused about the prior and posterior in neighbors models.

In my mind, the prior distribution is the distribution of the target variable 'y' in the training data. Assume we have 5 samples in the training data with labels [0, 0, 0, 1, 1].

Then the prior is P(0) = 0.6 and P(1) = 0.4.
Let's say we want to use two neighbors to do the classification. The possible labels of these two neighbors are

  • [0, 0]
  • [0, 1]
  • [1, 1]

The posteriors are something like

  • P(1 | [0, 0])
  • P(1 | [0, 1])
  • P(1 | [1, 1])

Am I correct?
I will add the random_state feature soon.

@jnothman (Member)

I think you should avoid making predictions stochastic, at least for now. As I said, a user can do so by sampling from predict_proba.

I'm not sure what you're trying to say, so let me try from the top. In predict_proba we return a distribution over classes for each sample. In predict we return the single most likely class from that distribution.

In a Bayesian approach to NN, that distribution is the posterior distribution of classes, accounting for both a prior distribution (our belief of how classes should be distributed before we look at the labels in the neighborhood of the sample) and the observations (i.e. the class distribution in the neighborhood of the sample).

The posterior is generally P(class | neighborhood) ∝ P(neighborhood | class) * P(class without regard to neighborhood). So you calculate this for each class.
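
Purely as an illustration of that formula (this is not code from the branch; the likelihood values are made up):

import numpy as np

# prior[c]      = P(class c), e.g. the class frequencies in the training labels
# likelihood[c] = P(neighborhood | class c), from some model of the neighborhood
prior = np.array([0.6, 0.4])
likelihood = np.array([0.2, 0.5])   # hypothetical values

posterior = likelihood * prior      # unnormalized P(class | neighborhood)
posterior /= posterior.sum()        # normalize over classes -> [0.375, 0.625]

map_class = posterior.argmax()      # a MAP-style predict() would return class 1 here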

@webber26232 (Contributor, Author) commented Sep 20, 2017

@jnothman Thanks for your answer! It is clear to me now.

The reason I considered adding an outlier handler for RadiusNeighborsClassifier is to avoid abnormal scoring, especially scoring based on the predict_proba method.

Since an outlier doesn't have any neighbors within the fixed radius, all of its class probabilities will be 0. A probability of 0 has a huge impact on scores such as log_loss, as the sketch below illustrates.
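
For example (a minimal sketch using sklearn.metrics.log_loss; the numbers are made up):

import numpy as np
from sklearn.metrics import log_loss

y_true = [0, 1, 1]
y_prob = np.array([[0.9, 0.1],
                   [0.2, 0.8],
                   [1.0 - 1e-15, 1e-15]])  # near-zero probability for the true class

# The single (near-)zero probability dominates the averaged loss; an exact zero
# probability for the true class would make the log loss infinite.
print(log_loss(y_true, y_prob))  # roughly 11.6, versus about 0.16 for the first two rows alone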

When we use GridSearchCV or RFE, which depend on scoring, the search will be influenced by the outliers' scores and won't provide an accurate result.

Therefore, I thought assigning outliers prior or uniform probabilities in the predict_proba method might work. To make predict correspond with predict_proba, I added random outlier predictions to predict.

@jnothman (Member)

Okay, but having a prior is fundamentally different from outputting a different label in the case of an outlier.

@webber26232 (Contributor, Author)

@jnothman @TomDLT I've changed the implementation of predict() to improve performance. This branch is ready for review and merging.

@jnothman (Member) left a comment

This still looks good to me, though I've not benchmarked it myself. We might want to benchmark against #13783.


- |Efficiency| Efficiency improvements for
:func:`neighbors.RadiusNeighborsClassifier.prdict` by changing
implementation from scipy.stats.mode to numpy.bincount.
Member

I don't think this is the right description anymore.

Member

Isn't it?

Member

I think @jnothman meant something along the lines of,

- |Efficiency| Efficiency improvements for
  :func:`neighbors.RadiusNeighborsClassifier.predict_proba` by changing
  implementation from scipy.stats.mode to numpy.bincount, and for 
  :func:`neighbors.RadiusNeighborsClassifier.predict` that is now computed as
  an `argmax` of `predict_proba`.

?

Member

We don't have neighbors.RadiusNeighborsClassifier.predict_proba in the current master. It is created in this branch.

How about

  • |Efficiency| Efficiency improvements for :func:`neighbors.RadiusNeighborsClassifier.predict` by changing the implementation from using `scipy.stats.mode` to using `numpy.bincount` in `predict_proba`.

Right, of course. But then we can't say that the implementation of predict_proba changed from using scipy.stats.mode, as it didn't exist previously. Maybe just say that "predict is now computed from predict_proba" (and that it's faster), without going into much detail.
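
Schematically, the relationship being proposed is just an argmax over the predicted probabilities (illustrative snippet, not the branch's exact code; predict_proba assumes this branch):

import numpy as np
from sklearn.neighbors import RadiusNeighborsClassifier

X = np.array([[0.0], [0.5], [1.0], [5.0], [5.5]])
y = np.array(['a', 'a', 'a', 'b', 'b'])
clf = RadiusNeighborsClassifier(radius=1.0).fit(X, y)

probs = clf.predict_proba(X)                     # shape (n_samples, n_classes)
manual = clf.classes_[np.argmax(probs, axis=1)]  # pick the most probable class per sample
assert (manual == clf.predict(X)).all()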

fix typo

Co-Authored-By: Joel Nothman <joel.nothman@gmail.com>
@TomDLT (Member) commented Aug 6, 2019

A small benchmark shows a nice speed up compared to master.

I did not compare with #14543 though. @rth

[Figure_1: benchmark plot of time(branch) / time(master) against n_features, for each (weights, n_samples) combination]

import itertools
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.neighbors import RadiusNeighborsClassifier


class OldRadiusNeighborsClassifier(RadiusNeighborsClassifier):
    def predict(self, X):
        """"""
        from sklearn.utils.validation import check_array
        from sklearn.neighbors.base import _get_weights
        from sklearn.utils.extmath import weighted_mode
        from scipy import stats
        X = check_array(X, accept_sparse='csr')
        n_samples = X.shape[0]

        neigh_dist, neigh_ind = self.radius_neighbors(X)
        inliers = [i for i, nind in enumerate(neigh_ind) if len(nind) != 0]
        outliers = [i for i, nind in enumerate(neigh_ind) if len(nind) == 0]

        classes_ = self.classes_
        _y = self._y
        if not self.outputs_2d_:
            _y = self._y.reshape((-1, 1))
            classes_ = [self.classes_]
        n_outputs = len(classes_)

        if self.outlier_label is not None:
            neigh_dist[outliers] = 1e-6
        elif outliers:
            raise ValueError('No neighbors found for test samples %r, '
                             'you can try using larger radius, '
                             'give a label for outliers, '
                             'or consider removing them from your dataset.' %
                             outliers)

        weights = _get_weights(neigh_dist, self.weights)

        y_pred = np.empty((n_samples, n_outputs), dtype=classes_[0].dtype)
        for k, classes_k in enumerate(classes_):
            pred_labels = np.zeros(len(neigh_ind), dtype=object)
            pred_labels[:] = [_y[ind, k] for ind in neigh_ind]
            if weights is None:
                mode = np.array(
                    [stats.mode(pl)[0] for pl in pred_labels[inliers]],
                    dtype=np.int)
            else:
                mode = np.array([
                    weighted_mode(pl, w)[0]
                    for (pl, w) in zip(pred_labels[inliers], weights[inliers])
                ], dtype=np.int)

            mode = mode.ravel()

            y_pred[inliers, k] = classes_k.take(mode)

        if outliers:
            y_pred[outliers, :] = self.outlier_label

        if not self.outputs_2d_:
            y_pred = y_pred.ravel()

        return y_pred


OldRadiusNeighborsClassifier.branch = 'master'
RadiusNeighborsClassifier.branch = 'branch'

averages = []
results = []
for klass, weights, n_samples, n_features in itertools.product(
    [OldRadiusNeighborsClassifier, RadiusNeighborsClassifier],
    ['uniform', 'distance'],
    [100, 1000, 10000],
    [10, 100, 1000, 10000],
):
    X = np.random.randn(n_samples, n_features)
    y = np.random.randint(2, size=n_samples)

    neigh = klass(weights=weights, radius=0.2).fit(X, y)
    out = %timeit -o neigh.predict(X)  # IPython magic
    results.append((klass.branch, weights, n_samples, n_features, out.average))

##################
# plot the results
df = pd.DataFrame(
    results, columns='name weights n_samples n_features average'.split(' '))

table = df.pivot_table(index='n_features',
                       columns=['name', 'weights', 'n_samples'],
                       values='average')
table = table['branch'] / table['master']
table.plot(marker='.', logx=True)
plt.title('time(branch) / time(master)')
plt.show()


webber26232 and others added 9 commits August 5, 2019 21:22
typo fix

Co-Authored-By: Tom Dupré la Tour <tom.dupre-la-tour@m4x.org>
Co-Authored-By: Tom Dupré la Tour <tom.dupre-la-tour@m4x.org>
fix typo

Co-Authored-By: Tom Dupré la Tour <tom.dupre-la-tour@m4x.org>
@rth (Member) commented Aug 6, 2019

> I did not compare with #14543 though.

The existence of that PR shouldn't prevent this one from getting merged. It's a clear improvement.

Looks like there are 2 approvals and the comments were addressed, apart from #9597 (comment).

@rth (Member) commented Aug 6, 2019

The case benchmarked above in #9597 (comment) with n_class=2 is somewhat of a best case, since for a large number of classes computing the probability is probably more expensive. Still, this PR is faster than master even with up to 1000 classes:

name                                   speedup (lower is better)                                                           
weights  n_samples n_features n_class                           
distance 2000      10         2                         0.367820
                              100                       0.368969
                              1000                      0.564461
                   100        2                         0.598981
                              100                       0.671586
                              1000                      0.787753
                   1000       2                         0.890036
                              100                       0.916686
                              1000                      0.975508
         10000     10         2                         0.438604
                              100                       0.440602
                              1000                      0.665863
                   100        2                         0.768036
                              100                       0.727307
                              1000                      0.827298
                   1000       2                         0.966119
                              100                       0.953986
                              1000                      0.990590
import itertools
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.neighbors import RadiusNeighborsClassifier
from neurtu import timeit, delayed


class OldRadiusNeighborsClassifier(RadiusNeighborsClassifier):
    def predict(self, X):
        """"""
        from sklearn.utils.validation import check_array
        from sklearn.neighbors.base import _get_weights
        from sklearn.utils.extmath import weighted_mode
        from scipy import stats
        X = check_array(X, accept_sparse='csr')
        n_samples = X.shape[0]

        neigh_dist, neigh_ind = self.radius_neighbors(X)
        inliers = [i for i, nind in enumerate(neigh_ind) if len(nind) != 0]
        outliers = [i for i, nind in enumerate(neigh_ind) if len(nind) == 0]

        classes_ = self.classes_
        _y = self._y
        if not self.outputs_2d_:
            _y = self._y.reshape((-1, 1))
            classes_ = [self.classes_]
        n_outputs = len(classes_)

        if self.outlier_label is not None:
            neigh_dist[outliers] = 1e-6
        elif outliers:
            raise ValueError('No neighbors found for test samples %r, '
                             'you can try using larger radius, '
                             'give a label for outliers, '
                             'or consider removing them from your dataset.' %
                             outliers)

        weights = _get_weights(neigh_dist, self.weights)

        y_pred = np.empty((n_samples, n_outputs), dtype=classes_[0].dtype)
        for k, classes_k in enumerate(classes_):
            pred_labels = np.zeros(len(neigh_ind), dtype=object)
            pred_labels[:] = [_y[ind, k] for ind in neigh_ind]
            if weights is None:
                mode = np.array(
                    [stats.mode(pl)[0] for pl in pred_labels[inliers]],
                    dtype=np.int)
            else:
                mode = np.array([
                    weighted_mode(pl, w)[0]
                    for (pl, w) in zip(pred_labels[inliers], weights[inliers])
                ], dtype=np.int)

            mode = mode.ravel()

            y_pred[inliers, k] = classes_k.take(mode)

        if outliers:
            y_pred[outliers, :] = self.outlier_label

        if not self.outputs_2d_:
            y_pred = y_pred.ravel()

        return y_pred


OldRadiusNeighborsClassifier.branch = 'master'
RadiusNeighborsClassifier.branch = 'branch'

results = []
for klass, weights, n_samples, n_features, n_class in itertools.product(
    [OldRadiusNeighborsClassifier, RadiusNeighborsClassifier],
    ['uniform', 'distance'],
    [2000, 10000],
    [10, 100, 1000],
    [2, 100, 1000]
):
    X = np.random.randn(n_samples, n_features)
    y = np.random.randint(n_class, size=n_samples)

    neigh = klass(weights=weights, radius=0.2).fit(X, y)
    tags = {'name': klass.branch, "weights": weights,  'n_samples': n_samples,
            'n_features': n_features, "n_class": n_class}
    out = delayed(neigh, tags=tags).predict(X)
    results.append(out)


results = timeit(results)
results = results.wall_time.unstack(0)
results['speedup (lower is better)'] = results['branch'] / results['master']

print(results)

I haven't reviewed the code since there are enough reviews, but +1 for merge based on the benchmark results.

@rth (Member) commented Aug 7, 2019

> I think the implementation in #13783 and #14543 is helpful for neighbors.KNeighborsClassifier but not for neighbors.RadiusNeighborsClassifier, because neigh_ind is a 2D array in neighbors.KNeighborsClassifier but a nested array (each 1D array has a different length) in neighbors.RadiusNeighborsClassifier.

I agree.
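
A small sketch of that structural difference (illustrative only; the data and radius are arbitrary):

import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.random.RandomState(0).randn(20, 3)
nn = NearestNeighbors(n_neighbors=3, radius=1.0).fit(X)

k_dist, k_ind = nn.kneighbors(X)        # regular 2D arrays, shape (20, 3)
r_dist, r_ind = nn.radius_neighbors(X)  # arrays of per-sample arrays, each with its own length

print(k_ind.shape)                      # (20, 3)
print([len(ind) for ind in r_ind[:5]])  # neighbor counts differ from sample to sample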

@TomDLT merged commit 36bca23 into scikit-learn:master on Aug 7, 2019

@TomDLT (Member) commented Aug 7, 2019

Thanks for this nice work, @webber26232! 🎉

@jnothman (Member) commented Aug 8, 2019 via email

@webber26232 deleted the RadNeiClfPredProb branch on October 22, 2020 21:15
@webber26232 restored the RadNeiClfPredProb branch on October 22, 2020 21:19