FIX Remove bins whose width <= 0 with a warning in KBinsDiscretizer #13165
Conversation
if self.strategy in ('quantile', 'kmeans'):
    bin_edges[jj] = np.unique(bin_edges[jj])
    if len(bin_edges[jj]) - 1 != n_bins[jj]:
        warnings.warn('Redundant bins (i.e., bins whose width = 0)'
This means the user would get a UserWarning without being able to fix it, or knowing how. We usually try to tell the user how to handle and remove the warnings, don't we?
We've solved the problem for the user. We've already raised a similar warning when a certain feature is constant.
Then we can add something like "Setting the number of bins for feature %d to %d."
> Then we can add something like "Setting the number of bins for feature %d to %d."

I don't think this will solve the problem when the data is skewed.
We should make sure to remove bins with width < 0. See #13194
@jnothman The PR now solves the new issue (tagged 0.20.3 by you) as well.

Yes, I'd be okay to include this in 0.20.3 if others agree that that is the right place to fix up a design flaw in KBinsDiscretizer.
bin_edges[jj] = np.array(
    [bin_edges[jj][0]] +
    [bin_edges[jj][i] for i in range(1, len(bin_edges[jj]))
     if bin_edges[jj][i] - bin_edges[jj][i - 1] > 1e-8])
We would still be solving both issues if we had > 0 here. Are we sure we want a hard-coded tolerance of 1e-8?
Is `mask = np.ediff1d(bin_edges[jj], to_begin=np.inf) > 1e-8; bin_edges[jj] = bin_edges[jj][mask]` clearer?
I think we need 1e-8 here, see e.g.,

import numpy as np

a = np.ones(279) * 0.56758051638767337
ans = []
for p in range(0, 100, 5):
    ans.append(np.percentile(a, p))

print([ans[0]] + [ans[i] for i in range(1, len(ans)) if ans[i] - ans[i - 1] > 0])
# [0.5675805163876734, 0.5675805163876735]
print([ans[0]] + [ans[i] for i in range(1, len(ans)) if ans[i] - ans[i - 1] > 1e-8])
# [0.5675805163876734]

Though we should blame numpy for this.
> Is `mask = np.ediff1d(bin_edges[jj], to_begin=np.inf) > 1e-8; bin_edges[jj] = bin_edges[jj][mask]` clearer?

At least not from my side :) But I assume this can make use of some optimization in numpy.
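A minimal standalone sketch of the np.ediff1d masking being discussed; the edges array below is made up and stands in for bin_edges[jj]:

import numpy as np

# Hypothetical sorted bin edges with a duplicate and a near-duplicate.
edges = np.array([0.0, 0.25, 0.25, 0.5, 0.5 + 1e-12, 1.0])

# to_begin=np.inf always keeps the first edge; every other edge survives only
# if it lies more than 1e-8 above its predecessor.
mask = np.ediff1d(edges, to_begin=np.inf) > 1e-8
print(edges[mask])
# [0.   0.25 0.5  1.  ]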
ping @jnothman ready for another review :)
@@ -102,6 +103,11 @@ class KBinsDiscretizer(BaseEstimator, TransformerMixin):
    :class:`sklearn.compose.ColumnTransformer` if you only want to preprocess
    part of the features.

    ``KBinsDiscretizer`` might produce constant features (e.g., when
    ``encode = 'onehot'`` and certain bins do not contain any data).
Maybe mention strategy=uniform here
> Maybe mention strategy=uniform here

Why? Other strategies also suffer from this problem.
Do they? Since the others follow the empirical distribution of the feature and remove empty bins, how can they have no data at training time?
> Do they? Since the others follow the empirical distribution of the feature and remove empty bins, how can they have no data at training time?
We only remove bins whose width <= 0, e.g.,

from sklearn.preprocessing import KBinsDiscretizer

X = [[1], [2]]
kb = KBinsDiscretizer(n_bins=5, encode='ordinal')
kb.fit_transform(X)
# array([[0.], [4.]])

Maybe I should use "bins whose width <= 0" instead of "Redundant bins (i.e., bins whose width <= 0)", since it's difficult to define redundant bins?
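For comparison, a hypothetical continuation of the snippet above (not part of the PR): with one-hot encoding, the empty middle bins become all-zero, i.e. constant, columns, which is the situation the docstring note warns about.

from sklearn.preprocessing import KBinsDiscretizer

X = [[1], [2]]
# Same toy input, but one-hot encoded: bins 1-3 receive no data, so their
# columns are constant (all zeros).
kb = KBinsDiscretizer(n_bins=5, encode='onehot-dense')
print(kb.fit_transform(X))
# [[1. 0. 0. 0. 0.]
#  [0. 0. 0. 0. 1.]]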
Ahh... Okay. A bit of an edge case since there are fewer samples than bins, which maybe we should prohibit anyway.
@@ -177,6 +183,16 @@ def fit(self, X, y=None):
                bin_edges[jj] = (centers[1:] + centers[:-1]) * 0.5
                bin_edges[jj] = np.r_[col_min, bin_edges[jj], col_max]

            # Remove redundant bins (i.e., bins whose width <= 0)
            if self.strategy in ('quantile', 'kmeans'):
                mask = np.ediff1d(bin_edges[jj], to_begin=np.inf) > 1e-8
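For readers skimming the diff, here is a hedged, self-contained sketch of what this block amounts to. The helper name, signature, and warning text are invented for illustration; the real code mutates bin_edges[jj] and n_bins[jj] in place inside fit.

import warnings

import numpy as np

def trim_bin_edges(edges, n_bins, feature_idx):
    """Hypothetical helper mirroring the logic added to fit()."""
    # Keep an edge only if it lies more than 1e-8 above its predecessor;
    # to_begin=np.inf always keeps the first edge.
    mask = np.ediff1d(edges, to_begin=np.inf) > 1e-8
    edges = edges[mask]
    if len(edges) - 1 != n_bins:
        # Warning wording here is illustrative, not the library's message.
        warnings.warn('Bins whose width <= 1e-8 in feature %d are removed. '
                      'Consider decreasing the number of bins.' % feature_idx)
    return edges, len(edges) - 1

# Example: trim_bin_edges(np.array([0.0, 0.5, 0.5, 1.0]), 3, 0)
# warns and returns (array([0. , 0.5, 1. ]), 2).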
Don't we need to test this for a 32-bit system as well? It'd be nice to have a test which tests this precision level, so we can see if it works on all CI machines.
I actually think we should make this 0 and not specify a tolerance for a negligible bin. With unary or ordinal encoding, small bins won't harm the system... For one-hot we could consider a user-specified tolerance.
Sorry, fixed typo
I see what you mean, but in general, comparing floats to 0 has kinda proven itself unreliable on multiple occasions, at least in the past few months. The last one I remember is the test you commented on, where the two floats logically should have been identical, but they weren't. `math.isclose()` has the default `rel_tol=1e-9`, which kinda seems like a good value for 64-bit systems to me; not sure if it'd be as good on the 32-bit ones.
> Don't we need to test this for a 32-bit system as well?

We have 32-bit CI?

> I actually think we should make this 0 and not specify a tolerance for a negligible bin.

Is it good to do so when comparing floats? If we use 0, this PR won't solve the new issue completely (see #13165 (comment)), or do you want to ignore these extreme cases?

I chose 1e-8 because I saw it several times in the repo and it's also the default atol of np.isclose; not sure if there're better ways.
> We have 32-bit CI?

We have at least one 32-bit Windows.

> I chose 1e-8 because I saw it several times in the repo and it's also the default atol of np.isclose; not sure if there're better ways.

AFAIK, np.isclose is the way it is for backward compatibility. The new standard is the one adopted by Python >= 3.5 in PEP 485 (i.e. math.isclose), and it seems logical for new code to follow that one wherever possible/convenient.
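For reference, a small illustration of the two defaults being compared here, reusing the two percentile values from the earlier snippet:

import math

import numpy as np

a, b = 0.5675805163876734, 0.5675805163876735

print(a == b)              # False: the floats differ in the last digit
print(math.isclose(a, b))  # True under the default rel_tol=1e-9 (PEP 485)
print(np.isclose(a, b))    # True under the defaults rtol=1e-5, atol=1e-8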
Okay. Let's leave it at something like 1e-8, document it, and if there is complaint we can consider making it configurable?
I've removed the term.
@jnothman How to document it?
Say <= 1e-8 instead of <= 0?
I would've thought 32-bit data was more relevant than a 32-bit processor... But perhaps both.
I'm not sure; that's why I think having a test checking the boundary conditions here is a good idea.
This is fine by me. Re @adrinjalali's suggestion of testing platforms and datatypes, is that particularly with reference to the numpy bug in percentile?
No, not really. Regardless, I think this is an improvement on the status quo anyway, and I don't think we test these boundary cases in other similar situations.
Please open an issue if you think it's worth discussing. Honestly I'm unable to fully understand your point :)
@adrinjalali do I take it you agree to releasing this in 0.20.3?
scikit-learn#13165) * remove redundant bins * tests * what's new * issue number * numeric issue * move what's new * Joel's comment * forget something * flake8 * more doc update * Joel's comment * redundant bins * new message * comment

…scretizer (scikit-learn#13165)" This reverts commit f4ea212.
Closes #12774
Closes #13194
Closes #13195
(1) Remove bins whose width <= 0 with a warning.

(2) Tell users that KBinsDiscretizer might produce constant features (e.g., when encode = 'onehot' and certain bins do not contain any data) and that these features can be removed with feature selection algorithms (e.g., sklearn.feature_selection.VarianceThreshold). Similar to #12893, I don't think it's the duty of a preprocessor to remove redundant features.

I think this is a bug (it seems that Joel agrees with me in the issue), so we don't need to worry about backward compatibility.
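A hedged usage sketch of the documented workaround (illustrative, not code from this PR): when one-hot encoding leaves empty bins, the resulting constant columns can be dropped with VarianceThreshold.

import numpy as np

from sklearn.feature_selection import VarianceThreshold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer

# Two samples but five requested bins, so three bins receive no data.
X = np.array([[1.0], [2.0]])

pipe = make_pipeline(
    KBinsDiscretizer(n_bins=5, encode='onehot-dense'),
    VarianceThreshold(),  # drops the all-zero (constant) one-hot columns
)
print(pipe.fit_transform(X))
# [[1. 0.]
#  [0. 1.]]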