KBinsDiscretizer: quantile strategy fails due to unsorted bin_edges #13194


Closed
SandroCasagrande opened this issue Feb 19, 2019 · 3 comments

@SandroCasagrande
Contributor

Description

KBinsDiscretizer with strategy='quantile' fails with an exception in certain situations. It happens when multiple percentiles returned by numpy are expected to be identical but, due to numerical instability, differ slightly, rendering bin_edges non-monotonic, which is fatal for np.digitize. This is probably related to numpy/numpy#10373
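The failure mode is easy to reproduce in isolation: np.digitize rejects any bin array containing even the tiniest inversion. The edge values below are toy numbers, not the ones the discretizer actually computes:

```python
import numpy as np

# Toy bin edges with a tiny inversion, mimicking numerically unstable
# percentiles: 0.05 - 1e-16 is a representable float slightly below 0.05.
bad_edges = np.array([0.0, 0.05, 0.05 - 1e-16, 0.95])

try:
    np.digitize([0.1], bad_edges)
except ValueError as exc:
    print(exc)  # bins must be monotonically increasing or decreasing
```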

Steps/Code to Reproduce

import numpy as np
from sklearn.preprocessing import KBinsDiscretizer
X = np.array([0.05, 0.05, 0.95]).reshape(-1, 1)
KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='quantile').fit_transform(X)

The example is a bit contrived (3 values and 10 bins), but isolates the problem well enough.
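Until the edges are sanitized upstream, one possible stop-gap (a sketch only, not anything scikit-learn does) is to restore monotonicity with a running maximum before the edges reach np.digitize:

```python
import numpy as np

# Hypothetical post-processing of quantile edges (not part of scikit-learn):
# a running maximum forces the edges to be non-decreasing, absorbing tiny
# inversions caused by unstable percentile computations.
edges = np.array([0.0, 0.05, 0.05 - 1e-16, 0.23, 0.95])
fixed = np.maximum.accumulate(edges)

assert np.all(np.diff(fixed) >= 0)  # edges are non-decreasing again
```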

Expected Results

No error is thrown; close percentiles are handled robustly.

Actual Results

ValueError                                Traceback (most recent call last)
<ipython-input-2-304908356f18> in <module>()
      2 from sklearn.preprocessing import KBinsDiscretizer
      3 X = np.array([0.05, 0.05, 0.95]).reshape(-1, 1)
----> 4 KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='quantile').fit_transform(X)
      5 """
      6 Xdi = transformer.inverse_transform(Xd)

/home/sandro/code/scikit-learn/sklearn/base.py in fit_transform(self, X, y, **fit_params)
    461         if y is None:
    462             # fit method of arity 1 (unsupervised transformation)
--> 463             return self.fit(X, **fit_params).transform(X)
    464         else:
    465             # fit method of arity 2 (supervised transformation)

/home/sandro/code/scikit-learn/sklearn/preprocessing/_discretization.py in transform(self, X)
    259             atol = 1.e-8
    260             eps = atol + rtol * np.abs(Xt[:, jj])
--> 261             Xt[:, jj] = np.digitize(Xt[:, jj] + eps, bin_edges[jj][1:])
    262         np.clip(Xt, 0, self.n_bins_ - 1, out=Xt)
    263 

ValueError: bins must be monotonically increasing or decreasing

Versions

System:
   machine: Linux-4.15.0-45-generic-x86_64-with-Ubuntu-16.04-xenial
executable: /home/sandro/.virtualenvs/scikit-learn/bin/python
    python: 3.5.2 (default, Nov 23 2017, 16:37:01)  [GCC 5.4.0 20160609]

BLAS:
    macros: 
cblas_libs: cblas
  lib_dirs: 

Python deps:
       pip: 10.0.1
setuptools: 39.1.0
   sklearn: 0.21.dev0
    Cython: 0.28.5
     scipy: 1.1.0
    pandas: 0.23.4
     numpy: 1.15.2
@jnothman
Member

So do we remove bins that are indistinguishable within a tolerance? Or do we sort the bin edges?

Could a similar issue affect QuantileTransformer too?
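For illustration, the two options could even be combined. clean_bin_edges below is a hypothetical helper, not a scikit-learn API; it sorts the edges, then drops any edge indistinguishable from its predecessor within a small tolerance (the same flavour of rtol/atol check that transform() already applies):

```python
import numpy as np

# Hypothetical helper (not part of scikit-learn): sort, then deduplicate
# edges that are equal within tolerance, yielding strictly increasing bins.
def clean_bin_edges(edges, rtol=1e-5, atol=1e-8):
    edges = np.sort(edges)
    keep = np.ones(len(edges), dtype=bool)
    keep[1:] = np.diff(edges) > atol + rtol * np.abs(edges[1:])
    return edges[keep]

edges = clean_bin_edges(np.array([0.0, 0.05, 0.05 - 1e-16, 0.95]))
assert np.all(np.diff(edges) > 0)  # strictly increasing again
```

Whether dropping edges like this should also change n_bins_ is exactly the open question in #12774.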

@SandroCasagrande
Contributor Author

This could be an option if #12774 is solved by removing empty bins in every case. I'm not sure whether this should be optional behaviour or not. When trying that, I stumbled upon some strange edge cases similar to #13165, but that's better discussed there.

I've tested QuantileTransformer with the toy data from above. The resulting quantiles_ are indeed non-monotonic with n_quantiles=11, but the transform method seems to handle this well anyway and gives the expected result.

@qinhanmin2014
Member

I've updated #13165 to solve this issue, but honestly I think this is a bug in numpy and it's not scikit-learn's duty to work around it.
