[MRG+2] Implement two non-uniform strategies for KBinsDiscretizer (discrete branch) #11272
Conversation
Nice!
I think we should treat this as if it'll be released with KBinsDiscretizer. We should set the default strategy to 'quantile'. We should consider API changes to make this more consistent. We should update the what's new message rather than adding a new one.
I was a bit surprised to see you've only added tests, not modified any. But the only thing I'm missing there is a check that 1-column and multi-column transformations are consistent.
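For illustration, a sketch of the missing consistency check (a hypothetical test, not code from the PR): with the 'quantile' strategy, bin edges are computed per feature, so fitting on one column at a time should reproduce the corresponding column of a multi-column fit.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

X = np.random.RandomState(0).uniform(size=(20, 3))
est = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='quantile')
Xt_full = est.fit_transform(X)

for j in range(X.shape[1]):
    # A single-column fit should produce the same bins for that column.
    Xt_col = KBinsDiscretizer(n_bins=3, encode='ordinal',
                              strategy='quantile').fit_transform(X[:, [j]])
    assert np.array_equal(Xt_full[:, [j]], Xt_col)
```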
doc/modules/preprocessing.rst (outdated)

@@ -576,6 +576,18 @@ Discretization is similar to constructing histograms for continuous data.
However, histograms focus on counting features which fall into particular
bins, whereas discretization focuses on assigning feature values to these bins.

:class:`KBinsDiscretizer` implement different binning strategies, which can be
*implements
@@ -62,7 +82,9 @@ class KBinsDiscretizer(BaseEstimator, TransformerMixin):
        Number of bins per feature. An ignored feature at index ``i``
        will have ``n_bins_[i] == 0``.

-    bin_width_ : float array, shape (n_features,)
+    bin_width_ : array, shape (n_features,)
        Contain floats with 'uniform' strategy, and arrays of varying shapes
Do we now want to store widths, or edges? Do we lose a lot by storing edges in all cases?
    kmeans
        Widths are defined by a k-means on each features.

    random_state : int, RandomState instance or None (default)
Seeing as we're doing it in 1d, can't we just initialise kmeans at the midpoints of a uniform split, and make it deterministic?
n_clusters = self.n_bins_[jj]
km = KMeans(n_clusters=n_clusters,
            random_state=self.random_state).fit(col)
centers = np.sort(km.cluster_centers_[:, 0],
I don't think we'd need this sort if we initialised with linspace
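A sketch of the suggested deterministic initialisation (variable names are illustrative): seed the 1-D k-means with the midpoints of a uniform split. Since 1-D centers keep their ordering under Lloyd iterations, no random_state and no final sort should be needed.

```python
import numpy as np
from sklearn.cluster import KMeans

col = np.array([0., 1., 2., 3., 9., 10.]).reshape(-1, 1)
n_clusters = 3

# Midpoints of a uniform split over the column's range.
uniform_edges = np.linspace(col.min(), col.max(), n_clusters + 1)
init = ((uniform_edges[1:] + uniform_edges[:-1]) * 0.5).reshape(-1, 1)

# Fixed init makes the fit deterministic; n_init=1 because there is
# nothing to re-randomise.
km = KMeans(n_clusters=n_clusters, init=init, n_init=1).fit(col)
centers = km.cluster_centers_[:, 0]  # should already be in ascending order
```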
if self.strategy == 'quantile':
    n_quantiles = self.n_bins_[jj] + 1
    qt = QuantileTransformer(n_quantiles=n_quantiles).fit(col)
This seems overkill. Just use boundaries = np.percentile(col, n_quantiles + 1)
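For reference, a sketch of the percentile-based alternative (illustrative names; note that np.percentile takes the quantile levels themselves, so they are generated with np.linspace rather than passed as a count):

```python
import numpy as np

col = np.array([0., 1., 2., 3., 9., 10.])
n_bins = 3

# n_bins + 1 evenly spaced levels from 0% to 100%.
quantiles = np.linspace(0, 100, n_bins + 1)
edges = np.percentile(col, quantiles)  # bin edges, shape (n_bins + 1,)
```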
valid_strategy = ('uniform', 'quantile', 'kmeans')
if self.strategy not in valid_strategy:
    raise ValueError("Valid options for 'strategy' are {}. "
                     "Got 'strategy = {}' instead."
use {!r} for quoting
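For reference, {!r} applies repr() to the argument, so string values come out quoted in the error message:

```python
>>> "Got strategy={!r} instead.".format('warped')
"Got strategy='warped' instead."
>>> "Got strategy={} instead.".format('warped')   # without !r: no quotes
'Got strategy=warped instead.'
```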
def test_nonuniform_strategies():
    X = np.array([0, 1, 2, 3, 9, 10]).reshape(-1, 1)

    # with 2 bins
Please also test the size of bin_widths_
Thanks for the review.
And I suppose we're not worried about a slow-down in the uniform case?
Good point, if you think it's important, I can add a branch in
Great work. All nitpicks. LGTM!
X /= bin_width
bin_edges = self.bin_edges_[trans]
for jj in range(X.shape[1]):
    # Values which are a multiple of the bin width are susceptible to
"a multiple of the bin width" -> "close to the bin edge"
for jj in range(X.shape[1]):
    # Values which are a multiple of the bin width are susceptible to
    # numeric instability. Add eps to X so these values are binned
    # correctly. See documentation for numpy.isclose for an explanation
Maybe "correctly" -> "correctly with respect to their decimal truncation"???
edges = np.r_[col.min(), edges, col.max()]

if edges[0] == edges[-1] and self.n_bins_[jj] > 2:
    warnings.warn("Features %d is constant and will be "
Features -> Feature
        Widths are defined by a quantile transform, to have a uniform
        number of samples in each bin.
    kmeans
        Widths are defined by a k-means on each features.
features -> feature
Perhaps:
"Values in each bin have the same nearest center of a k-means cluster."
    uniform
        All bins in each feature have identical widths.
    quantile
        Widths are defined by a quantile transform, to have a uniform
"All bins have the same number of points"
I don't yet have reason to believe it's important... But yes, we could consider that now or in the future.
I think once this is merged, we should move the discretizer to
Would @qinhanmin2014 like to give this a second look?
I promise to give a review next weekend, and a second review from someone else before next weekend is welcome. Sorry for the delay, but I'm swamped with school work at the end of the semester.
That's fine! Just thought you might be a good person to give it a go if available.
The code LGTM. Will look at the examples and tests in a couple of hours.
edges = np.asarray(np.percentile(col, quantiles))

elif self.strategy == 'kmeans':
    from ..cluster import KMeans
Any specific reason to import here?
Import loop problem, but it seems not to happen anymore, so I moved it to the top.
doc/modules/preprocessing.rst (outdated)

@@ -565,7 +563,7 @@ example, these intervals are defined as:
Based on these bin intervals, ``X`` is transformed as follows::

  >>> est.transform(X)  # doctest: +SKIP
- array([[ 0., 2., 1.],
+ array([[ 0., 1., 1.],
If I understand correctly, the result changed because we're now using strategy='quantile', right?
If so, I think we need to modify the intervals above, otherwise they're not consistent:
- feature 1: :math:`{[-\infty, 0), [0, 3), [3, \infty)}`
- feature 2: :math:`{[-\infty, 4), [4, 5), [5, \infty)}`
- feature 3: :math:`{[-\infty, 13), [13, \infty)}`
Good catch!
from ..utils.validation import check_array
from ..utils.validation import check_is_fitted
from ..utils.validation import column_or_1d
from ..utils.fixes import np_version


class KBinsDiscretizer(BaseEstimator, TransformerMixin):
Need to update the doc below: "Bins continuous data into k equal width intervals." is no longer the case, I think.
Thanks
Without looking at the code, 'auto' in
so if we want a better default, it's not hard to implement...
LGTM. I'm much more confident to merge the discrete branch after this PR.
assert_array_equal(expected_3bins, Xt.ravel())


def test_kmeans_strategy():
Could you explain the purpose of the test (maybe in a comment)? It seems a bit strange (though I understand how it passes) and honestly seems redundant to me.
It is indeed redundant, so I removed it.
==========================================================

This example presents the different strategies implemented in KBinsDiscretizer:
- 'uniform': The discretization is uniform in each feature, which means that
Formatting issue according to Circle. I guess a blank line will solve the problem.
Further regarding default
Thanks for the review! I changed the default from
bin_edges[jj] = np.asarray(np.percentile(column, quantiles))

elif self.strategy == 'kmeans':
    from ..cluster import KMeans  # fixes import loops
So what's the final decision here? You said in #11272 (comment) that you'd move the import to the top, but you actually left it here.
Please do leave it here. I don't think importing from sklearn.preprocessing should automatically result in importing DBSCAN, etc. Better off keeping the loading as light as possible.
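A sketch of the local-import pattern being defended (the helper name is hypothetical): sklearn.cluster is imported only when the 'kmeans' strategy is actually used, so a plain import of the preprocessing module stays light.

```python
import numpy as np

def _kmeans_edges(col, n_clusters):
    # Deferred import: sklearn.cluster is loaded only if this strategy
    # is requested, not when the preprocessing module is imported.
    from sklearn.cluster import KMeans
    km = KMeans(n_clusters=n_clusters).fit(col.reshape(-1, 1))
    centers = np.sort(km.cluster_centers_[:, 0])
    # Interior edges are midpoints between consecutive cluster centers.
    return (centers[1:] + centers[:-1]) * 0.5
```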
bin_edges[jj] = (centers[1:] + centers[:-1]) * 0.5
bin_edges[jj] = np.r_[col_min, bin_edges[jj], col_max]

if bin_edges[jj][0] == bin_edges[jj][-1] and n_bins[jj] > 2:
I'm fine with handling edge cases, but:
(1) Why don't you handle situations when n_bins[jj] == 2?
(2) It seems strange that we reject n_bins = 1 and manually set n_bins_ = 1 in edge cases. Would it be reasonable to accept n_bins = 1?
It's not useful/meaningful to accept n_bins = 1...
Fine, then changing n_bins[jj] > 2 to n_bins[jj] >= 2 seems enough to me, or sorry if I'm wrong.
Also, maybe we need a test for the edge case.
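A sketch of such an edge-case test (hypothetical; the exact warning type and fallback behavior are assumed from the diff above): a constant feature should collapse to a single bin whatever n_bins requests.

```python
import numpy as np
import pytest
from sklearn.preprocessing import KBinsDiscretizer


def test_constant_feature_collapses_to_one_bin():
    X = np.array([[1.], [1.], [1.]])      # constant feature
    est = KBinsDiscretizer(n_bins=3, encode='ordinal')
    with pytest.warns(UserWarning):        # "Feature 0 is constant ..."
        est.fit(X)
    assert np.all(est.transform(X) == 0)   # every sample falls in bin 0
```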
doc/modules/preprocessing.rst (outdated)

>>> est = preprocessing.KBinsDiscretizer(n_bins=[3, 3, 2], encode='ordinal').fit(X)
>>> est.bin_width_
array([3., 1., 2.])
>>> est = preprocessing.KBinsDiscretizer(n_bins=[3, 4, 2], encode='ordinal').fit(X)
Is it good to have n_bins > number of samples in the example? I might still prefer n_bins=[3, 3, 2].
- def test_same_min_max():
+ @pytest.mark.parametrize('strategy', ['uniform', 'kmeans', 'quantile'])
+ @pytest.mark.parametrize('n_bins', [2, 3])
+ def test_same_min_max(strategy, n_bins):
Do we test n_bins_[0] = 1 somewhere? Or do you think we don't need such a test?
Also, maybe we don't need to test multiple n_bins here?
n_bins = 1 is tested in arrays (test_invalid_n_bins_array) and as an integer (test_invalid_n_bins).
Oh, I misread your comment; I'll add a line to test n_bins_.
LGTM if CIs are green. ping @jnothman for a final check.
Fixes #9338.

This PR implements two non-uniform strategies for KBinsDiscretizer (in discrete branch #9342), adding a parameter `strategy`:

- `KBinsDiscretizer(strategy='uniform')`: previous behavior, constant-width bins.
- `KBinsDiscretizer(strategy='quantile')`: new behavior (default), based on `np.percentile`.
- `KBinsDiscretizer(strategy='kmeans')`: new behavior, based on `KMeans`.

It adds the following example:
