[MRG+1] Add predict_proba(X) and outlier handler for RadiusNeighborsClassifier #9597

Merged
merged 63 commits from webber26232:RadNeiClfPredProb into scikit-learn:master on Aug 7, 2019

Conversation

@webber26232 (Contributor) commented Aug 21, 2017

Reference Issue

What does this implement/fix? Explain your changes.

Currently, RadiusNeighborsClassifier doesn't provide a predict_proba(X) method. When an outlier (a sample that doesn't have any neighbors within the fixed radius) is detected, no class label can be assigned.

This branch implements predict_proba(X) for RadiusNeighborsClassifier. The class probabilities are computed from each sample's neighbors and their weights within the radius r.

When outliers (samples with no neighbors within the radius r) are detected, the behaviour is controlled by the parameter outlier_label; the available options are listed in the table below, followed by a short usage sketch:

| outlier_label | predict(X) | predict_proba(X) |
| --- | --- | --- |
| None | An exception is raised. | An exception is raised. |
| int, str, or a list of corresponding labels | The given label is assigned to outliers. | If the given label appears in the training y, its probability is 1 and all others are 0; otherwise all probabilities are 0. |
| 'most_frequent' | The most frequent label in the training target is assigned to outliers. | The probability of the most frequent training label is 1 and all other probabilities are 0. |
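
A minimal usage sketch of the 'most_frequent' option (illustrative only; the data, radius, and printed values are made up, and predict_proba is only available with this branch):

import numpy as np
from sklearn.neighbors import RadiusNeighborsClassifier

X_train = np.array([[0.0], [0.5], [1.0], [5.0], [5.5]])
y_train = np.array([0, 0, 0, 1, 1])

# Outliers (samples with no neighbors within the radius) get the most frequent training label.
clf = RadiusNeighborsClassifier(radius=1.0, outlier_label='most_frequent')
clf.fit(X_train, y_train)

X_test = np.array([[0.2], [5.2], [100.0]])  # the last sample has no neighbors
print(clf.predict(X_test))                  # expected: [0 1 0]
print(clf.predict_proba(X_test))            # the outlier row should be [1., 0.]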

Any other comments?

Documentation is added for the RadiusNeighborsClassifier class (outlier handling and an example) and for the predict_proba method.

@webber26232 changed the title from "Add predict_proba(X) for RaduisNeighborsClassifier" to "Add predict_proba(X) for RadiusNeighborsClassifier" on Aug 21, 2017
@jnothman (Member) left a comment

Is there a reason to provide this for radius neighbors and not kneighbors?

@webber26232 (Contributor, Author) commented Aug 22, 2017

@jnothman Hi, predict_proba has already been implemented for kneighbors, but I could not find it for radius neighbors.

Sorry, I am new here. Is there anything I should do first?

@jnothman (Member) commented Aug 22, 2017 via email

@jnothman (Member) left a comment

Is there more that can be shared between the radius and k implementations?

Please add a unit test.

@webber26232 changed the title from "Add predict_proba(X) for RadiusNeighborsClassifier" to "Add predict_proba(X) and outlier handler for RadiusNeighborsClassifier" on Aug 23, 2017
@webber26232 (Contributor, Author)

@jnothman Hi, I found that #2606, #399 and #970 all mention the same idea: using a Bayesian prior in both K and Radius neighbors. However, this feature has not been implemented yet.

@jnothman (Member)

I thought I commented here, but I can't see it now. Perhaps I commented in my head.

#970 makes the MAP prediction. You sample from the posterior distribution. This is a very substantial difference. I can see the benefit of sampling from the posterior, but I think we should prefer the MAP. Certainly, if we allow stochastic predictions (which I don't think we do anywhere else, but of course predict_proba allows users to sample for themselves), it needs to be controlled by a random_state that the user can set for replicability.

@webber26232 (Contributor, Author)

@jnothman I think I am a little bit confused about the prior and posterior in neighbors models.

In my mind, the prior distribution is the distribution of the target variable 'y' in the training data. Assume we have 5 samples in the training data with labels [0, 0, 0, 1, 1].

Then the prior is P(0) = 0.6 and P(1) = 0.4.
Let's say we want to use two neighbors to do the classification. The possible labels of these two neighbors are

  • [0, 0]
  • [0, 1]
  • [1, 1]

The posteriors are something like

  • P(1 | [0, 0])
  • P(1 | [0, 1])
  • P(1 | [1, 1])

Am I correct?
I will add the random_state feature soon.

@jnothman (Member)

I think you should avoid making predictions stochastic, at least for now. As I said, a user can do so by sampling from predict_proba.

I'm not sure what you're trying to say, so let me try from the top. In predict_proba we return a distribution over classes for each sample. In predict we return the single most likely class from that distribution.

In a Bayesian approach to NN, that distribution is the posterior distribution of classes, accounting for both a prior distribution (our belief of how classes should be distributed before we look at the labels in the neighborhood of the sample) and the observations (i.e. the class distribution in the neighborhood of the sample).

The posterior is generally P(class | neighborhood) ∝ P(neighborhood | class) * P(class without regard to neighborhood). So you calculate this for each class.
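
Purely as an illustration of that formula (this is not code from the branch; the likelihood values are made up):

import numpy as np

# prior[c]      = P(class c), e.g. the class frequencies in the training labels
# likelihood[c] = P(neighborhood | class c), from some model of the neighborhood
prior = np.array([0.6, 0.4])
likelihood = np.array([0.2, 0.5])   # hypothetical values

posterior = likelihood * prior      # unnormalized P(class | neighborhood)
posterior /= posterior.sum()        # normalize over classes -> [0.375, 0.625]

map_class = posterior.argmax()      # a MAP-style predict() would return class 1 here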

@webber26232 (Contributor, Author) commented Sep 20, 2017

@jnothman Thanks for your answer! It is clear to me now.

The reason I considered adding an outlier handler for RadiusNeighborsClassifier is to avoid abnormal scoring, especially scoring based on the predict_proba method.

Since an outlier doesn't have any neighbors within the fixed radius, all of its class probabilities will be 0. A probability of 0 has a huge impact on scores such as log_loss, as the sketch below illustrates.
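
For example (a minimal sketch using sklearn.metrics.log_loss; the numbers are made up):

import numpy as np
from sklearn.metrics import log_loss

y_true = [0, 1, 1]
y_prob = np.array([[0.9, 0.1],
                   [0.2, 0.8],
                   [1.0 - 1e-15, 1e-15]])  # near-zero probability for the true class

# The single (near-)zero probability dominates the averaged loss; an exact zero
# probability for the true class would make the log loss infinite.
print(log_loss(y_true, y_prob))  # roughly 11.6, versus about 0.16 for the first two rows alone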

When we use GridSearchCV or RFE, which depend on scoring, the search will be influenced by the outliers' scores and won't provide an accurate result.

Therefore, I thought assigning outliers prior or uniform probabilities in the predict_proba method might work. To make predict correspond with predict_proba, I added random outlier predictions to predict.

@jnothman (Member)

Okay, but having a prior is fundamentally different from outputting a different label in the case of an outlier.

@webber26232 (Contributor, Author)

@jnothman @TomDLT I've changed the implementation of predict() to improve performance. This branch is ready for review and merging.

@jnothman (Member) left a comment

This still looks good to me, though I've not benchmarked it myself. We might want to benchmark against #13783.


- |Efficiency| Efficiency improvements for
:func:`neighbors.RadiusNeighborsClassifier.prdict` by changing
implementation from scipy.stats.mode to numpy.bincount.
Member

I don't think this is the right description anymore.

Member

Isn't it?

Member

I think @jnothman meant something along the lines of,

- |Efficiency| Efficiency improvements for
  :func:`neighbors.RadiusNeighborsClassifier.predict_proba` by changing
  implementation from scipy.stats.mode to numpy.bincount, and for 
  :func:`neighbors.RadiusNeighborsClassifier.predict` that is now computed as
  an `argmax` of `predict_proba`.

?

Member

We don't have neighbors.RadiusNeighborsClassifier.predict_proba in the current master. It is created in this branch.

How about

  • |Efficiency| Efficiency improvements for :func:`neighbors.RadiusNeighborsClassifier.predict` by changing the implementation from using `scipy.stats.mode` to using `numpy.bincount` in `predict_proba`.

Right, of course. But then we can't say that the implementation of predict_proba changed from using scipy.stats.mode, as it didn't exist previously. Maybe just say that "predict is now computed from predict_proba" (and that it's faster), without going into much detail.
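
Schematically, the relationship being proposed is just an argmax over the predicted probabilities (illustrative snippet, not the branch's exact code; predict_proba assumes this branch):

import numpy as np
from sklearn.neighbors import RadiusNeighborsClassifier

X = np.array([[0.0], [0.5], [1.0], [5.0], [5.5]])
y = np.array(['a', 'a', 'a', 'b', 'b'])
clf = RadiusNeighborsClassifier(radius=1.0).fit(X, y)

probs = clf.predict_proba(X)                     # shape (n_samples, n_classes)
manual = clf.classes_[np.argmax(probs, axis=1)]  # pick the most probable class per sample
assert (manual == clf.predict(X)).all()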

fix typo

Co-Authored-By: Joel Nothman <joel.nothman@gmail.com>
@TomDLT (Member) commented Aug 6, 2019

A small benchmark shows a nice speed up compared to master.

I did not compare with #14543 though. @rth

[Figure_1: benchmark plot of time(branch) / time(master) against n_features, for each (weights, n_samples) combination]

import itertools
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.neighbors import RadiusNeighborsClassifier


class OldRadiusNeighborsClassifier(RadiusNeighborsClassifier):
    def predict(self, X):
        """"""
        from sklearn.utils.validation import check_array
        from sklearn.neighbors.base import _get_weights
        from sklearn.utils.extmath import weighted_mode
        from scipy import stats
        X = check_array(X, accept_sparse='csr')
        n_samples = X.shape[0]

        neigh_dist, neigh_ind = self.radius_neighbors(X)
        inliers = [i for i, nind in enumerate(neigh_ind) if len(nind) != 0]
        outliers = [i for i, nind in enumerate(neigh_ind) if len(nind) == 0]

        classes_ = self.classes_
        _y = self._y
        if not self.outputs_2d_:
            _y = self._y.reshape((-1, 1))
            classes_ = [self.classes_]
        n_outputs = len(classes_)

        if self.outlier_label is not None:
            neigh_dist[outliers] = 1e-6
        elif outliers:
            raise ValueError('No neighbors found for test samples %r, '
                             'you can try using larger radius, '
                             'give a label for outliers, '
                             'or consider removing them from your dataset.' %
                             outliers)

        weights = _get_weights(neigh_dist, self.weights)

        y_pred = np.empty((n_samples, n_outputs), dtype=classes_[0].dtype)
        for k, classes_k in enumerate(classes_):
            pred_labels = np.zeros(len(neigh_ind), dtype=object)
            pred_labels[:] = [_y[ind, k] for ind in neigh_ind]
            if weights is None:
                mode = np.array(
                    [stats.mode(pl)[0] for pl in pred_labels[inliers]],
                    dtype=np.int)
            else:
                mode = np.array([
                    weighted_mode(pl, w)[0]
                    for (pl, w) in zip(pred_labels[inliers], weights[inliers])
                ], dtype=np.int)

            mode = mode.ravel()

            y_pred[inliers, k] = classes_k.take(mode)

        if outliers:
            y_pred[outliers, :] = self.outlier_label

        if not self.outputs_2d_:
            y_pred = y_pred.ravel()

        return y_pred


OldRadiusNeighborsClassifier.branch = 'master'
RadiusNeighborsClassifier.branch = 'branch'

averages = []
results = []
for klass, weights, n_samples, n_features in itertools.product(
    [OldRadiusNeighborsClassifier, RadiusNeighborsClassifier],
    ['uniform', 'distance'],
    [100, 1000, 10000],
    [10, 100, 1000, 10000],
):
    X = np.random.randn(n_samples, n_features)
    y = np.random.randint(2, size=n_samples)

    neigh = klass(weights=weights, radius=0.2).fit(X, y)
    out = %timeit -o neigh.predict(X)  # IPython magic
    results.append((klass.branch, weights, n_samples, n_features, out.average))

##################
# plot the results
df = pd.DataFrame(
    results, columns='name weights n_samples n_features average'.split(' '))

table = df.pivot_table(index='n_features',
                       columns=['name', 'weights', 'n_samples'],
                       values='average')
table = table['branch'] / table['master']
table.plot(marker='.', logx=True)
plt.title('time(branch) / time(master)')
plt.show()


webber26232 and others added 9 commits August 5, 2019 21:22
typo fix

Co-Authored-By: Tom Dupré la Tour <tom.dupre-la-tour@m4x.org>
Co-Authored-By: Tom Dupré la Tour <tom.dupre-la-tour@m4x.org>
fix typo

Co-Authored-By: Tom Dupré la Tour <tom.dupre-la-tour@m4x.org>
@rth (Member) commented Aug 6, 2019

> I did not compare with #14543 though.

The existence of that PR shouldn't prevent this one from getting merged. It's a clear improvement.

Looks like there are 2 approvals and the comments were addressed, apart from #9597 (comment).

@rth (Member) commented Aug 6, 2019

The case benchmarked above in #9597 (comment) with n_class=2 is somewhat of a best case, since for a large number of classes computing the probability is probably more expensive. Still, this PR is faster than master even with up to 1000 classes:

name                                   speedup (lower is better)                                                           
weights  n_samples n_features n_class                           
distance 2000      10         2                         0.367820
                              100                       0.368969
                              1000                      0.564461
                   100        2                         0.598981
                              100                       0.671586
                              1000                      0.787753
                   1000       2                         0.890036
                              100                       0.916686
                              1000                      0.975508
         10000     10         2                         0.438604
                              100                       0.440602
                              1000                      0.665863
                   100        2                         0.768036
                              100                       0.727307
                              1000                      0.827298
                   1000       2                         0.966119
                              100                       0.953986
                              1000                      0.990590
import itertools
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.neighbors import RadiusNeighborsClassifier
from neurtu import timeit, delayed


class OldRadiusNeighborsClassifier(RadiusNeighborsClassifier):
    def predict(self, X):
        """"""
        from sklearn.utils.validation import check_array
        from sklearn.neighbors.base import _get_weights
        from sklearn.utils.extmath import weighted_mode
        from scipy import stats
        X = check_array(X, accept_sparse='csr')
        n_samples = X.shape[0]

        neigh_dist, neigh_ind = self.radius_neighbors(X)
        inliers = [i for i, nind in enumerate(neigh_ind) if len(nind) != 0]
        outliers = [i for i, nind in enumerate(neigh_ind) if len(nind) == 0]

        classes_ = self.classes_
        _y = self._y
        if not self.outputs_2d_:
            _y = self._y.reshape((-1, 1))
            classes_ = [self.classes_]
        n_outputs = len(classes_)

        if self.outlier_label is not None:
            neigh_dist[outliers] = 1e-6
        elif outliers:
            raise ValueError('No neighbors found for test samples %r, '
                             'you can try using larger radius, '
                             'give a label for outliers, '
                             'or consider removing them from your dataset.' %
                             outliers)

        weights = _get_weights(neigh_dist, self.weights)

        y_pred = np.empty((n_samples, n_outputs), dtype=classes_[0].dtype)
        for k, classes_k in enumerate(classes_):
            pred_labels = np.zeros(len(neigh_ind), dtype=object)
            pred_labels[:] = [_y[ind, k] for ind in neigh_ind]
            if weights is None:
                mode = np.array(
                    [stats.mode(pl)[0] for pl in pred_labels[inliers]],
                    dtype=np.int)
            else:
                mode = np.array([
                    weighted_mode(pl, w)[0]
                    for (pl, w) in zip(pred_labels[inliers], weights[inliers])
                ], dtype=np.int)

            mode = mode.ravel()

            y_pred[inliers, k] = classes_k.take(mode)

        if outliers:
            y_pred[outliers, :] = self.outlier_label

        if not self.outputs_2d_:
            y_pred = y_pred.ravel()

        return y_pred


OldRadiusNeighborsClassifier.branch = 'master'
RadiusNeighborsClassifier.branch = 'branch'

results = []
for klass, weights, n_samples, n_features, n_class in itertools.product(
    [OldRadiusNeighborsClassifier, RadiusNeighborsClassifier],
    ['uniform', 'distance'],
    [2000, 10000],
    [10, 100, 1000],
    [2, 100, 1000]
):
    X = np.random.randn(n_samples, n_features)
    y = np.random.randint(n_class, size=n_samples)

    neigh = klass(weights=weights, radius=0.2).fit(X, y)
    tags = {'name': klass.branch, "weights": weights,  'n_samples': n_samples,
            'n_features': n_features, "n_class": n_class}
    out = delayed(neigh, tags=tags).predict(X)
    results.append(out)


results = timeit(results)
results = results.wall_time.unstack(0)
results['speedup (lower is better)'] = results['branch'] / results['master']

print(results)

I haven't reviewed the code since there are enough reviews, but +1 for merge based on the benchmark results.

@rth (Member) commented Aug 7, 2019

> I think the implementation in #13783 and #14543 is helpful for neighbors.KNeighborsClassifier but not for neighbors.RadiusNeighborsClassifier, because neigh_ind is a 2D array in neighbors.KNeighborsClassifier but a nested array (each 1D array has a different length) in neighbors.RadiusNeighborsClassifier.

I agree.
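
A small sketch of that structural difference (illustrative only; the data and radius are arbitrary):

import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.random.RandomState(0).randn(20, 3)
nn = NearestNeighbors(n_neighbors=3, radius=1.0).fit(X)

k_dist, k_ind = nn.kneighbors(X)        # regular 2D arrays, shape (20, 3)
r_dist, r_ind = nn.radius_neighbors(X)  # arrays of per-sample arrays, each with its own length

print(k_ind.shape)                      # (20, 3)
print([len(ind) for ind in r_ind[:5]])  # neighbor counts differ from sample to sample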

@TomDLT merged commit 36bca23 into scikit-learn:master on Aug 7, 2019

@TomDLT (Member) commented Aug 7, 2019

Thanks for this nice work, @webber26232! 🎉

@jnothman (Member) commented Aug 8, 2019 via email

@webber26232 deleted the RadNeiClfPredProb branch on October 22, 2020 21:15
@webber26232 restored the RadNeiClfPredProb branch on October 22, 2020 21:19