FIX Remove bins whose width <= 0 with a warning in KBinsDiscretizer #13165
```diff
@@ -56,7 +56,8 @@ class KBinsDiscretizer(BaseEstimator, TransformerMixin):
     Attributes
     ----------
     n_bins_ : int array, shape (n_features,)
-        Number of bins per feature.
+        Number of bins per feature. Bins whose width are too small
+        (i.e., <= 1e-8) are removed with a warning.
 
     bin_edges_ : array of arrays, shape (n_features, )
         The edges of each bin. Contain arrays of varying shapes ``(n_bins_, )``
```
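As a rough sketch of what these attributes look like after ``fit`` (the data and parameter values below are invented for illustration, not taken from the PR):

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical data: two features, four samples.
X = np.array([[-2., 1.], [-1., 2.], [0., 3.], [1., 4.]])
est = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform').fit(X)

print(est.n_bins_)        # array([3, 3]) -- one entry per feature
print(est.bin_edges_[0])  # array([-2., -1., 0., 1.]) -- n_bins_[0] + 1 edges
```

With ``strategy='quantile'`` or ``'kmeans'`` and degenerate data, ``n_bins_`` can end up smaller than the requested ``n_bins``, which is what the added sentence documents.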
```diff
@@ -102,6 +103,11 @@ class KBinsDiscretizer(BaseEstimator, TransformerMixin):
     :class:`sklearn.compose.ColumnTransformer` if you only want to preprocess
     part of the features.
 
+    ``KBinsDiscretizer`` might produce constant features (e.g., when
+    ``encode = 'onehot'`` and certain bins do not contain any data).
+    These features can be removed with feature selection algorithms
+    (e.g., :class:`sklearn.feature_selection.VarianceThreshold`).
+
     See also
     --------
     sklearn.preprocessing.Binarizer : class used to bin values as ``0`` or
```
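A minimal sketch of the situation the added note describes, assuming skewed data so that some uniform bins stay empty (the values here are made up; ``VarianceThreshold`` with its default threshold of 0 drops constant columns):

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.feature_selection import VarianceThreshold

# One outlier stretches the range, so the middle uniform bins receive no data.
X = np.array([[0.0], [0.1], [0.2], [10.0]])
est = KBinsDiscretizer(n_bins=5, encode='onehot-dense', strategy='uniform')
Xt = est.fit_transform(X)                       # 5 one-hot columns, 3 of them all zero
Xt_reduced = VarianceThreshold().fit_transform(Xt)
print(Xt.shape, Xt_reduced.shape)               # (4, 5) (4, 2)
```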
```diff
@@ -177,6 +183,16 @@ def fit(self, X, y=None):
                 bin_edges[jj] = (centers[1:] + centers[:-1]) * 0.5
                 bin_edges[jj] = np.r_[col_min, bin_edges[jj], col_max]
 
+            # Remove bins whose width are too small (i.e., <= 1e-8)
+            if self.strategy in ('quantile', 'kmeans'):
+                mask = np.ediff1d(bin_edges[jj], to_begin=np.inf) > 1e-8
+                bin_edges[jj] = bin_edges[jj][mask]
+                if len(bin_edges[jj]) - 1 != n_bins[jj]:
+                    warnings.warn('Bins whose width are too small (i.e., <= '
+                                  '1e-8) in feature %d are removed. Consider '
+                                  'decreasing the number of bins.' % jj)
+                    n_bins[jj] = len(bin_edges[jj]) - 1
 
         self.bin_edges_ = bin_edges
         self.n_bins_ = n_bins
```

Review discussion on `mask = np.ediff1d(bin_edges[jj], to_begin=np.inf) > 1e-8`:

- don't we need to test this for a 32bit system as well? It'd be nice to have a test which tests this precision level and see if it works on all CI machines.
- I actually think we should make this 0 and not specify a tolerance for a negligible bin. With unary or ordinal encoding small bins won't harm the system... For one hot we could consider a user-specified tolerance.
- Sorry, fixed typo.
- I see what you mean, but in general, comparing floats to 0 has kinda proven itself unreliable on multiple occasions, at least in the past few months. The last one I remember is the test you commented on, where the two floats logically should have been identical, but they weren't.
- We have 32bit CI?
- Is it good to do so when comparing floats? If we use 0, this PR won't solve the new issue completely (see #13165 (comment)), or do you want to ignore these extreme cases? I chose 1e-8 because I saw it several times in the repo and it's also the default atol of np.isclose; not sure if there are better ways.
- we have at least one 32bit windows.
- AFAIK, …
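To make the tolerance discussion concrete, here is a standalone sketch of the masking idea with invented edge values; with a strict ``> 0`` comparison the 5e-9-wide bin below would survive, while the 1e-8 tolerance drops it:

```python
import numpy as np

# Invented bin edges: one exact duplicate and one near-duplicate edge.
edges = np.array([0.0, 0.0, 5e-9, 0.4, 1.0])
mask = np.ediff1d(edges, to_begin=np.inf) > 1e-8   # to_begin=np.inf keeps the first edge
print(mask)         # [ True False False  True  True]
print(edges[mask])  # [0.  0.4 1. ] -> 2 bins remain instead of 4
```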
Review discussion on the added docstring note about constant features:

- Maybe mention strategy=uniform here.
- Why? Other strategies also suffer from this problem.
- Do they? Since the others follow the empirical distribution of the feature, and remove empty bins, how can they have no data at training time?
- We only remove bins whose width <= 0, e.g., … Maybe I should use "bins whose width <= 0" instead of "Redundant bins (i.e., bins whose width <= 0)", since it's difficult to define redundant bins?
- Ahh... Okay. A bit of an edge case, since there are fewer samples than bins, which maybe we should prohibit anyway.
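A sketch of the edge case mentioned above, assuming the behaviour added in this PR (with fewer distinct values than requested bins, 'quantile' produces duplicated edges, so the zero-width bins are dropped with the new warning); the data is invented:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Four samples but five bins requested: most quantile edges coincide at 0.
X = np.array([[0.], [0.], [0.], [1.]])
est = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='quantile')
est.fit(X)                 # expected to warn that small-width bins were removed
print(est.n_bins_)         # e.g. array([2]) -- fewer bins than requested
print(est.bin_edges_[0])   # e.g. array([0. , 0.4, 1. ])
```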