[MRG+2] Implement two non-uniform strategies for KBinsDiscretizer (discrete branch) #11272


Merged: 13 commits into scikit-learn:discrete on Jul 9, 2018

Conversation

@TomDLT (Member) commented Jun 15, 2018

Fixes #9338.

This PR implements two non-uniform strategies for KBinsDiscretizer (in discrete branch #9342), adding a parameter strategy:

  • KBinsDiscretizer(strategy='uniform'): previous behavior, constant width bins.
  • KBinsDiscretizer(strategy='quantile'): new behavior (default), based on np.percentile.
  • KBinsDiscretizer(strategy='kmeans'): new behavior, based on KMeans.

It adds the following example:
[figure_1: example comparing the three binning strategies]
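
A minimal usage sketch of the new parameter (example data assumed; API as implemented in this PR):

import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

X = np.array([[-3., 0.], [0., 1.], [0.5, 2.], [6., 10.]])

for strategy in ('uniform', 'quantile', 'kmeans'):
    est = KBinsDiscretizer(n_bins=3, encode='ordinal',
                           strategy=strategy).fit(X)
    # bin edges differ per strategy; transform maps values to bin indices
    print(strategy, est.bin_edges_[0], est.transform(X)[:, 0])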

@TomDLT force-pushed the discrete_quantile branch 2 times, most recently from aaddc11 to 21b9356 on June 15, 2018 13:36
@TomDLT force-pushed the discrete_quantile branch from 21b9356 to 54522af on June 15, 2018 13:54
@jnothman (Member) commented Jun 16, 2018 via email

@jnothman (Member) left a comment

I think we should treat this as if it'll be released with KBinsDiscretizer. We should set the default strategy to 'quantile'. We should consider API changes to make this more consistent. We should update the existing what's new entry rather than adding a new one.

I was a bit surprised to see you've only added, and not modified tests. But the only thing I'm missing there is checking the consistency of 1-column and multi-column transformations.

@@ -576,6 +576,18 @@ Discretization is similar to constructing histograms for continuous data.
However, histograms focus on counting features which fall into particular
bins, whereas discretization focuses on assigning feature values to these bins.

:class:`KBinsDiscretizer` implement different binning strategies, which can be

*implements

@@ -62,7 +82,9 @@ class KBinsDiscretizer(BaseEstimator, TransformerMixin):
Number of bins per feature. An ignored feature at index ``i``
will have ``n_bins_[i] == 0``.

bin_width_ : float array, shape (n_features,)
bin_width_ : array, shape (n_features,)
Contain floats with 'uniform' strategy, and arrays of varying shapes

Do we now want to store widths, or edges? Do we lose a lot by storing edges in all cases?

kmeans
Widths are defined by a k-means on each features.

random_state : int, RandomState instance or None (default)

Seeing as we're doing it in 1d, can't we just initialise kmeans at the midpoints of a uniform split, and make it deterministic?

n_clusters = self.n_bins_[jj]
km = KMeans(n_clusters=n_clusters,
            random_state=self.random_state).fit(col)
centers = np.sort(km.cluster_centers_[:, 0])

I don't think we'd need this sort if we initialised with linspace
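
A minimal sketch of the deterministic initialisation suggested here (example data and variable names assumed):

import numpy as np
from sklearn.cluster import KMeans

col = np.array([0., 1., 2., 3., 9., 10.]).reshape(-1, 1)  # example column
n_clusters = 3

# midpoints of a uniform split over [col.min(), col.max()]
uniform_edges = np.linspace(col.min(), col.max(), n_clusters + 1)
init = ((uniform_edges[1:] + uniform_edges[:-1]) * 0.5)[:, None]

# a single deterministic run; in 1-D, ordered centers stay ordered
km = KMeans(n_clusters=n_clusters, init=init, n_init=1).fit(col)
centers = km.cluster_centers_[:, 0]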


if self.strategy == 'quantile':
    n_quantiles = self.n_bins_[jj] + 1
    qt = QuantileTransformer(n_quantiles=n_quantiles).fit(col)

This seems overkill. Just use boundaries = np.percentile(col, n_quantiles + 1)
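
Note that np.percentile expects percentage values rather than a count, so the form the PR eventually adopts looks roughly like this (example data assumed):

import numpy as np

col = np.array([0., 1., 2., 3., 9., 10.])  # example column
n_bins = 3

# n_bins + 1 evenly spaced percentage points between 0 and 100
quantiles = np.linspace(0, 100, n_bins + 1)
edges = np.asarray(np.percentile(col, quantiles))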

valid_strategy = ('uniform', 'quantile', 'kmeans')
if self.strategy not in valid_strategy:
    raise ValueError("Valid options for 'strategy' are {}. "
                     "Got 'strategy = {}' instead."
                     .format(valid_strategy, self.strategy))

use {!r} for quoting

def test_nonuniform_strategies():
    X = np.array([0, 1, 2, 3, 9, 10]).reshape(-1, 1)

    # with 2 bins

Please also test the size of bin_widths_

@TomDLT (Member, Author) commented Jun 18, 2018

Thanks for the review.

  • I changed to storing the edges in all cases, instead of widths. It avoids successive diff and cumsum in non-uniform strategies, and makes the API more consistent across strategies (see the sketch after this list).
  • I removed the offset attribute, since the information is now in the edges.
  • I changed the default to 'quantile'.
  • I made 'kmeans' strategy deterministic with a uniform init.
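
For context, a tiny sketch of the two representations and the diff/cumsum round-trip this avoids (values assumed):

import numpy as np

edges = np.array([0., 2.5, 5., 10.])  # bin_edges_-style: n_bins + 1 edges
widths = np.diff(edges)               # bin_width_-style: per-bin widths
offset = edges[0]

# reconstructing edges from (offset, widths) needs a cumsum every time
assert np.allclose(edges, offset + np.r_[0., np.cumsum(widths)])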

@TomDLT force-pushed the discrete_quantile branch from 520fa7b to 785c14f on June 18, 2018 12:37
@jnothman (Member)

I changed to storing the edges in all cases, instead of widths. It avoids successive diff and cumsum in non-uniform strategies, and makes the API more consistent across strategies.

And I suppose we're not worried about a slow-down in the uniform case?

@TomDLT (Member, Author) commented Jun 18, 2018

Good point, if you think it's important, I can add a branch in _transform to be faster on the uniform strategy.
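
A hedged sketch of what such a fast path could look like next to the general edges-based binning (all names and data assumed, not the PR's actual code):

import numpy as np

col = np.array([0., 1., 2., 3., 9., 10.])     # example column
edges = np.linspace(col.min(), col.max(), 5)  # 4 uniform bins

# general path: works for arbitrary (non-uniform) edges
binned = np.digitize(col, edges[1:-1])

# uniform-only fast path: one floor division instead of a search
width = (edges[-1] - edges[0]) / (len(edges) - 1)
fast = np.clip(np.floor((col - edges[0]) / width), 0, len(edges) - 2).astype(int)

assert np.array_equal(binned, fast)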

@jnothman (Member) left a comment

Great work. All nitpicks. LGTM!

X /= bin_width
bin_edges = self.bin_edges_[trans]
for jj in range(X.shape[1]):
    # Values which are a multiple of the bin width are susceptible to

"a multiple of the bin width" -> "close to the bin edge"

for jj in range(X.shape[1]):
    # Values which are a multiple of the bin width are susceptible to
    # numeric instability. Add eps to X so these values are binned
    # correctly. See documentation for numpy.isclose for an explanation

Maybe "correctly" -> "correctly with respect to their decimal truncation"???

edges = np.r_[col.min(), edges, col.max()]

if edges[0] == edges[-1] and self.n_bins_[jj] > 2:
    warnings.warn("Features %d is constant and will be "

Features -> Feature

Widths are defined by a quantile transform, to have a uniform
number of samples in each bin.
kmeans
Widths are defined by a k-means on each features.

features -> feature

Perhaps:
"Values in each bin have the same nearest center of a k-means cluster."

uniform
All bins in each feature have identical widths.
quantile
Widths are defined by a quantile transform, to have a uniform

"All bins have the same number of points"

@jnothman (Member)

Good point, if you think it's important, I can add a branch in _transform to be faster on the uniform strategy.

I don't yet have reason to believe it's important... But yes, we could consider that now or in the future.

@TomDLT force-pushed the discrete_quantile branch from e5586dc to 24a8745 on June 18, 2018 15:33
@TomDLT changed the title from [MRG] Implement two non-uniform strategies for KBinsDiscretizer (discrete branch) to [MRG+1] Implement two non-uniform strategies for KBinsDiscretizer (discrete branch) on Jun 18, 2018
@jnothman (Member)

I think once this is merged, we should move the discretizer to _encoders.py. This would resolve conflicts with master.

@jnothman (Member) commented Jul 1, 2018

Would @qinhanmin2014 like to give this a second look?

@qinhanmin2014 (Member)

I promise to give a review next weekend, and a second review from someone else before then is welcome. Sorry for the delay, but I'm swamped with school work at the end of the semester.

@jnothman (Member) commented Jul 1, 2018 via email

@qinhanmin2014 (Member) left a comment

The code LGTM. Will look at the examples and tests in a couple of hours.

edges = np.asarray(np.percentile(col, quantiles))

elif self.strategy == 'kmeans':
    from ..cluster import KMeans

Any specific reason to import here?

@TomDLT (Member, Author)

An import-loop problem, but it no longer seems to happen, so I moved the import to the top.

@@ -565,7 +563,7 @@ example, these intervals are defined as:
Based on these bin intervals, ``X`` is transformed as follows::

>>> est.transform(X) # doctest: +SKIP
array([[ 0., 2., 1.],
array([[ 0., 1., 1.],

If I understand correctly, the results change because we're now using strategy='quantile', right?
If so, I think we need to modify the intervals above; otherwise they're not consistent:

  - feature 1: :math:`{[-\infty, 0), [0, 3), [3, \infty)}`
  - feature 2: :math:`{[-\infty, 4), [4, 5), [5, \infty)}`
  - feature 3: :math:`{[-\infty, 13), [13, \infty)}`

@TomDLT (Member, Author)

Good catch!

from ..utils.validation import check_array
from ..utils.validation import check_is_fitted
from ..utils.validation import column_or_1d
from ..utils.fixes import np_version


class KBinsDiscretizer(BaseEstimator, TransformerMixin):

Need to update the doc below: "Bins continuous data into k equal width intervals." is no longer the case, I think.

@TomDLT (Member, Author)

Thanks

@qinhanmin2014 (Member)

@jnothman @TomDLT Would it be better to change the default value of n_bins to, e.g., 3 or 5? n_bins=2 seems strange to me, even though I'm aware that we're going to provide ways to automatically determine the value of n_bins.

@jnothman (Member) commented Jul 7, 2018

Without looking at the code, 'auto' in np.histogram is:

# Freedman-Diaconis: bin width h = 2 * IQR / n ** (1/3)
q25, q75 = np.percentile(X, [25, 75], axis=0)
iqr = q75 - q25
fd = 2 * iqr / X.shape[0] ** (1 / 3)
n_bins_fd = np.ceil(np.ptp(X, axis=0) / fd)
# Sturges: ceil(log2(n)) + 1 bins
n_bins_sturges = np.ceil(np.log2(X.shape[0])) + 1
# 'auto' takes the larger of the two per feature
n_bins = np.maximum(n_bins_fd, n_bins_sturges).astype(int)

so if we want a better default, it's not hard to implement....

@qinhanmin2014 (Member) left a comment

LGTM. I'm much more confident about merging the discrete branch after this PR.

assert_array_equal(expected_3bins, Xt.ravel())


def test_kmeans_strategy():

Could you explain the purpose of the test (maybe in a comment)? It seems a bit strange (though I understand how it passes) and honestly seems redundant to me.

@TomDLT (Member, Author)

It is indeed redundant so I removed it.

==========================================================

This example presents the different strategies implemented in KBinsDiscretizer:
- 'uniform': The discretization is uniform in each feature, which means that

Formatting issue according to Circle. I guess a blank line will solve the problem.

@qinhanmin2014 changed the title from [MRG+1] Implement two non-uniform strategies for KBinsDiscretizer (discrete branch) to [MRG+2] Implement two non-uniform strategies for KBinsDiscretizer (discrete branch) on Jul 7, 2018
@qinhanmin2014 added this to the 0.20 milestone on Jul 7, 2018
@jnothman (Member) commented Jul 7, 2018

Further regarding the default n_bins: showing that 'auto' is useful when not doing a histogram might be hard. I'm +1 for making it more than 2.

@TomDLT (Member, Author) commented Jul 9, 2018

Thanks for the review!

I changed the default from n_bins=2 to n_bins=5; indeed, only 2 bins is a bit weird.

@TomDLT force-pushed the discrete_quantile branch from 7c228e4 to 33dec83 on July 9, 2018 09:30
bin_edges[jj] = np.asarray(np.percentile(column, quantiles))

elif self.strategy == 'kmeans':
    from ..cluster import KMeans  # fixes import loops

So what's the final decision here? You said in #11272 (comment) that you'd move the import to the top, but you actually left it here.


Please do leave it here. I don't think importing from sklearn.preprocessing should automatically result in importing DBSCAN, etc. Better off keeping the loading as light as possible.

bin_edges[jj] = (centers[1:] + centers[:-1]) * 0.5
bin_edges[jj] = np.r_[col_min, bin_edges[jj], col_max]

if bin_edges[jj][0] == bin_edges[jj][-1] and n_bins[jj] > 2:

I'm fine with handling edge cases, but:
(1) Why don't you handle situations where n_bins[jj] == 2?
(2) It seems strange that we reject n_bins = 1 yet manually set n_bins_ = 1 in edge cases. Would it be reasonable to accept n_bins = 1?


It's not useful/meaningful to accept n_bins = 1...


Fine, then changing n_bins[jj] > 2 to n_bins[jj] >= 2 seems enough to me; sorry if I'm wrong.


Also, maybe we need a test for the edge case.
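
A hedged sketch of what such an edge-case test might look like (warning text and collapse-to-one-bin behaviour assumed from the discussion above):

import numpy as np
import pytest
from sklearn.preprocessing import KBinsDiscretizer

@pytest.mark.parametrize('strategy', ['uniform', 'quantile', 'kmeans'])
def test_constant_feature_collapses_to_one_bin(strategy):
    X = np.full((5, 1), 7.)  # a constant feature
    with pytest.warns(UserWarning, match="constant"):
        est = KBinsDiscretizer(n_bins=3, encode='ordinal',
                               strategy=strategy).fit(X)
    assert est.n_bins_[0] == 1
    assert np.all(est.transform(X) == 0.)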

>>> est = preprocessing.KBinsDiscretizer(n_bins=[3, 3, 2], encode='ordinal').fit(X)
>>> est.bin_width_
array([3., 1., 2.])
>>> est = preprocessing.KBinsDiscretizer(n_bins=[3, 4, 2], encode='ordinal').fit(X)

Is it good to have n_bins > number of samples in the example? I might still prefer n_bins=[3, 3, 2].

def test_same_min_max():
@pytest.mark.parametrize('strategy', ['uniform', 'kmeans', 'quantile'])
@pytest.mark.parametrize('n_bins', [2, 3])
def test_same_min_max(strategy, n_bins):

Do we test n_bins_[0] = 1 somewhere? Or do you think we don't need such a test?
Also, maybe we don't need to test multiple n_bins here?

@TomDLT (Member, Author)

n_bins = 1 is tested in arrays (test_invalid_n_bins_array) and as an integer (test_invalid_n_bins).

@TomDLT (Member, Author)

Oh, I misread your comment; I'll add a line to test n_bins_.

@TomDLT force-pushed the discrete_quantile branch from b5b6083 to d1d2369 on July 9, 2018 15:47
@qinhanmin2014 (Member) left a comment

LGTM if CIs are green. ping @jnothman for a final check.

@jnothman merged commit 5a61af9 into scikit-learn:discrete on Jul 9, 2018