[MRG+2] Implement two non-uniform strategies for KBinsDiscretizer (discrete branch) #11272


Merged: 13 commits into scikit-learn:discrete on Jul 9, 2018

Conversation

@TomDLT (Member) commented Jun 15, 2018

Fixes #9338.

This PR implements two non-uniform strategies for KBinsDiscretizer (in discrete branch #9342), adding a parameter strategy:

  • KBinsDiscretizer(strategy='uniform'): previous behavior, constant width bins.
  • KBinsDiscretizer(strategy='quantile'): new behavior (default), based on np.percentile.
  • KBinsDiscretizer(strategy='kmeans'): new behavior, based on KMeans.

It adds the following example:
[figure_1: example comparing the three binning strategies]
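
A minimal usage sketch of the new parameter (example data assumed; API as implemented in this PR):

import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

X = np.array([[-3., 0.], [0., 1.], [0.5, 2.], [6., 10.]])

for strategy in ('uniform', 'quantile', 'kmeans'):
    est = KBinsDiscretizer(n_bins=3, encode='ordinal',
                           strategy=strategy).fit(X)
    # bin edges differ per strategy; transform maps values to bin indices
    print(strategy, est.bin_edges_[0], est.transform(X)[:, 0])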

@TomDLT force-pushed the discrete_quantile branch 2 times, most recently from aaddc11 to 21b9356 on June 15, 2018 13:36
@TomDLT force-pushed the discrete_quantile branch from 21b9356 to 54522af on June 15, 2018 13:54
@jnothman (Member) commented Jun 16, 2018 via email

@jnothman (Member) left a comment

I think we should treat this as if it'll be released with KBinsDiscretizer. We should set the default strategy to 'quantile'. We should consider API changes to make this more consistent. We should update the existing what's new entry rather than adding a new one.

I was a bit surprised to see you've only added, and not modified tests. But the only thing I'm missing there is checking the consistency of 1-column and multi-column transformations.

@@ -576,6 +576,18 @@ Discretization is similar to constructing histograms for continuous data.
However, histograms focus on counting features which fall into particular
bins, whereas discretization focuses on assigning feature values to these bins.

:class:`KBinsDiscretizer` implement different binning strategies, which can be

*implements

@@ -62,7 +82,9 @@ class KBinsDiscretizer(BaseEstimator, TransformerMixin):
Number of bins per feature. An ignored feature at index ``i``
will have ``n_bins_[i] == 0``.

bin_width_ : float array, shape (n_features,)
bin_width_ : array, shape (n_features,)
Contain floats with 'uniform' strategy, and arrays of varying shapes

Do we now want to store widths, or edges? Do we lose a lot by storing edges in all cases?

kmeans
Widths are defined by a k-means on each features.

random_state : int, RandomState instance or None (default)

Seeing as we're doing it in 1d, can't we just initialise kmeans at the midpoints of a uniform split, and make it deterministic?

n_clusters = self.n_bins_[jj]
km = KMeans(n_clusters=n_clusters,
            random_state=self.random_state).fit(col)
centers = np.sort(km.cluster_centers_[:, 0])

I don't think we'd need this sort if we initialised with linspace
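
A minimal sketch of the deterministic initialisation suggested here (example data and variable names assumed):

import numpy as np
from sklearn.cluster import KMeans

col = np.array([0., 1., 2., 3., 9., 10.]).reshape(-1, 1)  # example column
n_clusters = 3

# midpoints of a uniform split over [col.min(), col.max()]
uniform_edges = np.linspace(col.min(), col.max(), n_clusters + 1)
init = ((uniform_edges[1:] + uniform_edges[:-1]) * 0.5)[:, None]

# a single deterministic run; in 1-D, ordered centers stay ordered
km = KMeans(n_clusters=n_clusters, init=init, n_init=1).fit(col)
centers = km.cluster_centers_[:, 0]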


if self.strategy == 'quantile':
    n_quantiles = self.n_bins_[jj] + 1
    qt = QuantileTransformer(n_quantiles=n_quantiles).fit(col)

This seems overkill. Just use boundaries = np.percentile(col, n_quantiles + 1)
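
Note that np.percentile expects percentage values rather than a count, so the form the PR eventually adopts looks roughly like this (example data assumed):

import numpy as np

col = np.array([0., 1., 2., 3., 9., 10.])  # example column
n_bins = 3

# n_bins + 1 evenly spaced percentage points between 0 and 100
quantiles = np.linspace(0, 100, n_bins + 1)
edges = np.asarray(np.percentile(col, quantiles))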

valid_strategy = ('uniform', 'quantile', 'kmeans')
if self.strategy not in valid_strategy:
    raise ValueError("Valid options for 'strategy' are {}. "
                     "Got 'strategy = {}' instead."
                     .format(valid_strategy, self.strategy))

use {!r} for quoting

def test_nonuniform_strategies():
    X = np.array([0, 1, 2, 3, 9, 10]).reshape(-1, 1)

    # with 2 bins

Please also test the size of bin_widths_

@TomDLT (Member, Author) commented Jun 18, 2018

Thanks for the review.

  • I changed to storing the edges in all cases, instead of widths. It avoids successive diff and cumsum in non-uniform strategies, and makes the API more consistent across strategies (see the sketch after this list).
  • I removed the offset attribute, since the information is now in the edges.
  • I changed the default to 'quantile'.
  • I made 'kmeans' strategy deterministic with a uniform init.
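
For context, a tiny sketch of the two representations and the diff/cumsum round-trip this avoids (values assumed):

import numpy as np

edges = np.array([0., 2.5, 5., 10.])  # bin_edges_-style: n_bins + 1 edges
widths = np.diff(edges)               # bin_width_-style: per-bin widths
offset = edges[0]

# reconstructing edges from (offset, widths) needs a cumsum every time
assert np.allclose(edges, offset + np.r_[0., np.cumsum(widths)])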

@TomDLT force-pushed the discrete_quantile branch from 520fa7b to 785c14f on June 18, 2018 12:37
@jnothman (Member)

I changed to storing the edges in all cases, instead of widths. It avoids successive diff and cumsum in non-uniform strategies, and makes the API more consistent across strategies.

And I suppose we're not worried about a slow-down in the uniform case?

@TomDLT (Member, Author) commented Jun 18, 2018

Good point, if you think it's important, I can add a branch in _transform to be faster on the uniform strategy.
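
A hedged sketch of what such a fast path could look like next to the general edges-based binning (all names and data assumed, not the PR's actual code):

import numpy as np

col = np.array([0., 1., 2., 3., 9., 10.])     # example column
edges = np.linspace(col.min(), col.max(), 5)  # 4 uniform bins

# general path: works for arbitrary (non-uniform) edges
binned = np.digitize(col, edges[1:-1])

# uniform-only fast path: one floor division instead of a search
width = (edges[-1] - edges[0]) / (len(edges) - 1)
fast = np.clip(np.floor((col - edges[0]) / width), 0, len(edges) - 2).astype(int)

assert np.array_equal(binned, fast)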

@jnothman (Member) left a comment

Great work. All nitpicks. LGTM!

X /= bin_width
bin_edges = self.bin_edges_[trans]
for jj in range(X.shape[1]):
    # Values which are a multiple of the bin width are susceptible to

"a multiple of the bin width" -> "close to the bin edge"

for jj in range(X.shape[1]):
    # Values which are a multiple of the bin width are susceptible to
    # numeric instability. Add eps to X so these values are binned
    # correctly. See documentation for numpy.isclose for an explanation

Maybe "correctly" -> "correctly with respect to their decimal truncation"???

edges = np.r_[col.min(), edges, col.max()]

if edges[0] == edges[-1] and self.n_bins_[jj] > 2:
    warnings.warn("Features %d is constant and will be "

Features -> Feature

Widths are defined by a quantile transform, to have a uniform
number of samples in each bin.
kmeans
Widths are defined by a k-means on each features.

features -> feature

Perhaps:
"Values in each bin have the same nearest center of a k-means cluster."

uniform
All bins in each feature have identical widths.
quantile
Widths are defined by a quantile transform, to have a uniform

"All bins have the same number of points"

@jnothman (Member)

Good point, if you think it's important, I can add a branch in _transform to be faster on the uniform strategy.

I don't yet have reason to believe it's important... But yes, we could consider that now or in the future.

@TomDLT force-pushed the discrete_quantile branch from e5586dc to 24a8745 on June 18, 2018 15:33
@TomDLT changed the title from [MRG] Implement two non-uniform strategies for KBinsDiscretizer (discrete branch) to [MRG+1] Implement two non-uniform strategies for KBinsDiscretizer (discrete branch) on Jun 18, 2018
@jnothman (Member)

I think once this is merged, we should move the discretizer to _encoders.py. This would resolve conflicts with master.

@jnothman (Member) commented Jul 1, 2018

Would @qinhanmin2014 like to give this a second look?

@qinhanmin2014 (Member)

I promise to give a review next weekend, and a second review from someone else before then is welcome. Sorry for the delay, but I'm swamped with school work at the end of the semester.

@jnothman (Member) commented Jul 1, 2018 via email

@qinhanmin2014 (Member) left a comment

The code LGTM. Will look at the examples and tests in a couple of hours.

edges = np.asarray(np.percentile(col, quantiles))

elif self.strategy == 'kmeans':
    from ..cluster import KMeans

Any specific reason to import here?

@TomDLT (Member, Author)

An import-loop problem, but it no longer seems to happen, so I moved the import to the top.

@@ -565,7 +563,7 @@ example, these intervals are defined as:
Based on these bin intervals, ``X`` is transformed as follows::

>>> est.transform(X) # doctest: +SKIP
array([[ 0., 2., 1.],
array([[ 0., 1., 1.],

If I understand correctly, the results change because we're now using strategy='quantile', right?
If so, I think we need to modify the intervals above; otherwise they're not consistent:

  - feature 1: :math:`{[-\infty, 0), [0, 3), [3, \infty)}`
  - feature 2: :math:`{[-\infty, 4), [4, 5), [5, \infty)}`
  - feature 3: :math:`{[-\infty, 13), [13, \infty)}`

@TomDLT (Member, Author)

Good catch!

from ..utils.validation import check_array
from ..utils.validation import check_is_fitted
from ..utils.validation import column_or_1d
from ..utils.fixes import np_version


class KBinsDiscretizer(BaseEstimator, TransformerMixin):

Need to update the doc below: "Bins continuous data into k equal width intervals." is no longer the case, I think.

@TomDLT (Member, Author)

Thanks

@qinhanmin2014 (Member)

@jnothman @TomDLT Would it be better to change the default value of n_bins to, e.g., 3 or 5? n_bins=2 seems strange to me, even though I'm aware that we're going to provide ways to automatically determine the value of n_bins.

@jnothman (Member) commented Jul 7, 2018

Without looking at the code, 'auto' in np.histogram is:

# Freedman-Diaconis: bin width h = 2 * IQR / n ** (1/3)
q25, q75 = np.percentile(X, [25, 75], axis=0)
iqr = q75 - q25
fd = 2 * iqr / X.shape[0] ** (1 / 3)
n_bins_fd = np.ceil(np.ptp(X, axis=0) / fd)
# Sturges: ceil(log2(n)) + 1 bins
n_bins_sturges = np.ceil(np.log2(X.shape[0])) + 1
# 'auto' takes the larger of the two per feature
n_bins = np.maximum(n_bins_fd, n_bins_sturges).astype(int)

so if we want a better default, it's not hard to implement....

@qinhanmin2014 (Member) left a comment

LGTM. I'm much more confident about merging the discrete branch after this PR.

assert_array_equal(expected_3bins, Xt.ravel())


def test_kmeans_strategy():

Could you explain the purpose of the test (maybe in a comment)? It seems a bit strange (though I understand how it passes) and honestly seems redundant to me.

@TomDLT (Member, Author)

It is indeed redundant so I removed it.

==========================================================

This example presents the different strategies implemented in KBinsDiscretizer:
- 'uniform': The discretization is uniform in each feature, which means that

Formatting issue according to Circle. I guess a blank line will solve the problem.

@qinhanmin2014 changed the title from [MRG+1] Implement two non-uniform strategies for KBinsDiscretizer (discrete branch) to [MRG+2] Implement two non-uniform strategies for KBinsDiscretizer (discrete branch) on Jul 7, 2018
@qinhanmin2014 added this to the 0.20 milestone on Jul 7, 2018
@jnothman (Member) commented Jul 7, 2018

Further regarding the default n_bins: showing that 'auto' is useful when not doing a histogram might be hard. I'm +1 for making it more than 2.

@TomDLT (Member, Author) commented Jul 9, 2018

Thanks for the review!

I changed the default from n_bins=2 to n_bins=5; indeed, only 2 bins is a bit weird.

@TomDLT force-pushed the discrete_quantile branch from 7c228e4 to 33dec83 on July 9, 2018 09:30
bin_edges[jj] = np.asarray(np.percentile(column, quantiles))

elif self.strategy == 'kmeans':
    from ..cluster import KMeans  # fixes import loops

So what's the final decision here? You said in #11272 (comment) that you'd move the import to the top, but you actually left it here.


Please do leave it here. I don't think importing from sklearn.preprocessing should automatically result in importing DBSCAN, etc. Better off keeping the loading as light as possible.

bin_edges[jj] = (centers[1:] + centers[:-1]) * 0.5
bin_edges[jj] = np.r_[col_min, bin_edges[jj], col_max]

if bin_edges[jj][0] == bin_edges[jj][-1] and n_bins[jj] > 2:

I'm fine with handling edge cases, but:
(1) Why don't you handle situations where n_bins[jj] == 2?
(2) It seems strange that we reject n_bins = 1 yet manually set n_bins_ = 1 in edge cases. Would it be reasonable to accept n_bins = 1?


It's not useful/meaningful to accept n_bins = 1...


Fine, then changing n_bins[jj] > 2 to n_bins[jj] >= 2 seems enough to me; sorry if I'm wrong.


Also, maybe we need a test for the edge case.
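
A hedged sketch of what such an edge-case test might look like (warning text and collapse-to-one-bin behaviour assumed from the discussion above):

import numpy as np
import pytest
from sklearn.preprocessing import KBinsDiscretizer

@pytest.mark.parametrize('strategy', ['uniform', 'quantile', 'kmeans'])
def test_constant_feature_collapses_to_one_bin(strategy):
    X = np.full((5, 1), 7.)  # a constant feature
    with pytest.warns(UserWarning, match="constant"):
        est = KBinsDiscretizer(n_bins=3, encode='ordinal',
                               strategy=strategy).fit(X)
    assert est.n_bins_[0] == 1
    assert np.all(est.transform(X) == 0.)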

>>> est = preprocessing.KBinsDiscretizer(n_bins=[3, 3, 2], encode='ordinal').fit(X)
>>> est.bin_width_
array([3., 1., 2.])
>>> est = preprocessing.KBinsDiscretizer(n_bins=[3, 4, 2], encode='ordinal').fit(X)

Is it good to have n_bins > number of samples in the example? I might still prefer n_bins=[3, 3, 2].

def test_same_min_max():
@pytest.mark.parametrize('strategy', ['uniform', 'kmeans', 'quantile'])
@pytest.mark.parametrize('n_bins', [2, 3])
def test_same_min_max(strategy, n_bins):

Do we test n_bins_[0] = 1 somewhere? Or do you think we don't need such a test?
Also, maybe we don't need to test multiple n_bins here?

@TomDLT (Member, Author)

n_bins = 1 is tested in arrays (test_invalid_n_bins_array) and as an integer (test_invalid_n_bins).

@TomDLT (Member, Author)

Oh, I misread your comment; I'll add a line to test n_bins_.

@TomDLT force-pushed the discrete_quantile branch from b5b6083 to d1d2369 on July 9, 2018 15:47
@qinhanmin2014 (Member) left a comment

LGTM if CIs are green. ping @jnothman for a final check.

@jnothman merged commit 5a61af9 into scikit-learn:discrete on Jul 9, 2018