FIX Remove bins whose width <= 0 with a warning in KBinsDiscretizer #13165
Conversation
if self.strategy in ('quantile', 'kmeans'):
    bin_edges[jj] = np.unique(bin_edges[jj])
    if len(bin_edges[jj]) - 1 != n_bins[jj]:
        warnings.warn('Redundant bins (i.e., bins whose width = 0)'
This means the user would get a UserWarning without being able to fix it, or knowing how. We usually try to tell the user how to handle and remove the warnings, don't we?
We've solved the problem for the user. We've already raised a similar warning when a certain feature is constant.
Then we can add something like "Setting the number of bins for feature %d to %d."
> Then we can add something like "Setting the number of bins for feature %d to %d."

I don't think this will solve the problem when the data is skewed.
We should make sure to remove bins with width < 0. See #13194
@jnothman The PR now solves the new issue (tagged 0.20.3 by you) as well.

Yes, I'd be okay to include this in 0.20.3 if others agree that that is the right place to fix up a design flaw in KBinsDiscretizer.
bin_edges[jj] = np.array(
    [bin_edges[jj][0]] +
    [bin_edges[jj][i] for i in range(1, len(bin_edges[jj]))
     if bin_edges[jj][i] - bin_edges[jj][i - 1] > 1e-8])
We would still be solving both issues if we had > 0 here. Are we sure we want a hard-coded tolerance of 1e-8?
Is `mask = np.ediff1d(bin_edges[jj], to_begin=np.inf) > 1e-8; bin_edges[jj] = bin_edges[jj][mask]` clearer?
I think we need 1e-8 here, see e.g.,

import numpy as np

a = np.ones(279) * 0.56758051638767337
ans = []
for p in range(0, 100, 5):
    ans.append(np.percentile(a, p))

print([ans[0]] + [ans[i] for i in range(1, len(ans)) if ans[i] - ans[i - 1] > 0])
# [0.5675805163876734, 0.5675805163876735]
print([ans[0]] + [ans[i] for i in range(1, len(ans)) if ans[i] - ans[i - 1] > 1e-8])
# [0.5675805163876734]

Though we should blame numpy for this.
> Is `mask = np.ediff1d(bin_edges[jj], to_begin=np.inf) > 1e-8; bin_edges[jj] = bin_edges[jj][mask]` clearer?

At least not from my side :) But I assume this can make use of some optimization in numpy.
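A minimal standalone sketch of the np.ediff1d masking being discussed; the edges array below is made up and stands in for bin_edges[jj]:

import numpy as np

# Hypothetical sorted bin edges with a duplicate and a near-duplicate.
edges = np.array([0.0, 0.25, 0.25, 0.5, 0.5 + 1e-12, 1.0])

# to_begin=np.inf always keeps the first edge; every other edge survives only
# if it lies more than 1e-8 above its predecessor.
mask = np.ediff1d(edges, to_begin=np.inf) > 1e-8
print(edges[mask])
# [0.   0.25 0.5  1.  ]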
ping @jnothman ready for another review :)
@@ -102,6 +103,11 @@ class KBinsDiscretizer(BaseEstimator, TransformerMixin):
    :class:`sklearn.compose.ColumnTransformer` if you only want to preprocess
    part of the features.

    ``KBinsDiscretizer`` might produce constant features (e.g., when
    ``encode = 'onehot'`` and certain bins do not contain any data).
Maybe mention strategy=uniform here
> Maybe mention strategy=uniform here

Why? Other strategies also suffer from this problem.
Do they? Since the others follow the empirical distribution of the feature and remove empty bins, how can they have no data at training time?
> Do they? Since the others follow the empirical distribution of the feature and remove empty bins, how can they have no data at training time?
We only remove bins whose width <= 0, e.g.,

from sklearn.preprocessing import KBinsDiscretizer

X = [[1], [2]]
kb = KBinsDiscretizer(n_bins=5, encode='ordinal')
kb.fit_transform(X)
# array([[0.], [4.]])

Maybe I should use "bins whose width <= 0" instead of "Redundant bins (i.e., bins whose width <= 0)", since it's difficult to define redundant bins?
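For comparison, a hypothetical continuation of the snippet above (not part of the PR): with one-hot encoding, the empty middle bins become all-zero, i.e. constant, columns, which is the situation the docstring note warns about.

from sklearn.preprocessing import KBinsDiscretizer

X = [[1], [2]]
# Same toy input, but one-hot encoded: bins 1-3 receive no data, so their
# columns are constant (all zeros).
kb = KBinsDiscretizer(n_bins=5, encode='onehot-dense')
print(kb.fit_transform(X))
# [[1. 0. 0. 0. 0.]
#  [0. 0. 0. 0. 1.]]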
Ahh... Okay. A bit of an edge case since there are fewer samples than bins, which maybe we should prohibit anyway.
@@ -177,6 +183,16 @@ def fit(self, X, y=None):
                bin_edges[jj] = (centers[1:] + centers[:-1]) * 0.5
                bin_edges[jj] = np.r_[col_min, bin_edges[jj], col_max]

            # Remove redundant bins (i.e., bins whose width <= 0)
            if self.strategy in ('quantile', 'kmeans'):
                mask = np.ediff1d(bin_edges[jj], to_begin=np.inf) > 1e-8
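For readers skimming the diff, here is a hedged, self-contained sketch of what this block amounts to. The helper name, signature, and warning text are invented for illustration; the real code mutates bin_edges[jj] and n_bins[jj] in place inside fit.

import warnings

import numpy as np

def trim_bin_edges(edges, n_bins, feature_idx):
    """Hypothetical helper mirroring the logic added to fit()."""
    # Keep an edge only if it lies more than 1e-8 above its predecessor;
    # to_begin=np.inf always keeps the first edge.
    mask = np.ediff1d(edges, to_begin=np.inf) > 1e-8
    edges = edges[mask]
    if len(edges) - 1 != n_bins:
        # Warning wording here is illustrative, not the library's message.
        warnings.warn('Bins whose width <= 1e-8 in feature %d are removed. '
                      'Consider decreasing the number of bins.' % feature_idx)
    return edges, len(edges) - 1

# Example: trim_bin_edges(np.array([0.0, 0.5, 0.5, 1.0]), 3, 0)
# warns and returns (array([0. , 0.5, 1. ]), 2).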
Don't we need to test this for a 32-bit system as well? It'd be nice to have a test which tests this precision level, so we can see if it works on all CI machines.
I actually think we should make this 0 and not specify a tolerance for a negligible bin. With unary or ordinal encoding, small bins won't harm the system... For one-hot we could consider a user-specified tolerance.
Sorry, fixed typo
I see what you mean, but in general, comparing floats to 0 has kinda proven itself unreliable on multiple occasions, at least in the past few months. The last one I remember is the test you commented on, where the two floats logically should have been identical, but they weren't. `math.isclose()` has the default `rel_tol=1e-9`, which kinda seems like a good value for 64-bit systems to me; not sure if it'd be as good on the 32-bit ones.
> Don't we need to test this for a 32-bit system as well?

We have 32-bit CI?

> I actually think we should make this 0 and not specify a tolerance for a negligible bin.

Is it good to do so when comparing floats? If we use 0, this PR won't solve the new issue completely (see #13165 (comment)), or do you want to ignore these extreme cases?

I chose 1e-8 because I saw it several times in the repo and it's also the default atol of np.isclose; not sure if there're better ways.
> We have 32-bit CI?

We have at least one 32-bit Windows.

> I chose 1e-8 because I saw it several times in the repo and it's also the default atol of np.isclose; not sure if there're better ways.

AFAIK, np.isclose is the way it is for backward compatibility. The new standard is the one adopted by Python >= 3.5 in PEP 485 (i.e. math.isclose), and it seems logical for new code to follow that one wherever possible/convenient.
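For reference, a small illustration of the two defaults being compared here, reusing the two percentile values from the earlier snippet:

import math

import numpy as np

a, b = 0.5675805163876734, 0.5675805163876735

print(a == b)              # False: the floats differ in the last digit
print(math.isclose(a, b))  # True under the default rel_tol=1e-9 (PEP 485)
print(np.isclose(a, b))    # True under the defaults rtol=1e-5, atol=1e-8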
Okay. Let's leave it at something like 1e-8, document it, and if there is complaint we can consider making it configurable?
I've removed the term.
@jnothman How to document it?
Say <= 1e-8 instead of <= 0?
I would've thought 32-bit data was more relevant than a 32-bit processor... But perhaps both.
I'm not sure; that's why I think having a test checking the boundary conditions here is a good idea.
This is fine by me. Re @adrinjalali's suggestion of testing platforms and datatypes, is that particularly with reference to the numpy bug in percentile?
No, not really. Regardless, I think this is an improvement on the status quo anyway, and I don't think we test these boundary cases in other similar situations.
Please open an issue if you think it's worth discussing. Honestly I'm unable to fully understand your point :)
@adrinjalali do I take it you agree to releasing this in 0.20.3?
scikit-learn#13165) * remove redundant bins * tests * what's new * issue number * numeric issue * move what's new * Joel's comment * forget something * flake8 * more doc update * Joel's comment * redundant bins * new message * comment

…scretizer (scikit-learn#13165)" This reverts commit f4ea212.
Closes #12774
Closes #13194
Closes #13195
(1) Remove bins whose width <= 0 with a warning.

(2) Tell users that KBinsDiscretizer might produce constant features (e.g., when encode = 'onehot' and certain bins do not contain any data) and that these features can be removed with feature selection algorithms (e.g., sklearn.feature_selection.VarianceThreshold). Similar to #12893, I don't think it's the duty of a preprocessor to remove redundant features.

I think this is a bug (it seems that Joel agrees with me in the issue), so we don't need to worry about backward compatibility.
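A hedged usage sketch of the documented workaround (illustrative, not code from this PR): when one-hot encoding leaves empty bins, the resulting constant columns can be dropped with VarianceThreshold.

import numpy as np

from sklearn.feature_selection import VarianceThreshold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer

# Two samples but five requested bins, so three bins receive no data.
X = np.array([[1.0], [2.0]])

pipe = make_pipeline(
    KBinsDiscretizer(n_bins=5, encode='onehot-dense'),
    VarianceThreshold(),  # drops the all-zero (constant) one-hot columns
)
print(pipe.fit_transform(X))
# [[1. 0.]
#  [0. 1.]]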