[MRG+2] Add fixed width discretization to scikit-learn #7668
Conversation
# Generate output
discretized = np.floor(X_trans)

# Addressing corner case issues
Should definitely use discretized.min(axis=0) and discretized.max(axis=0) here. I also owe the reader an explanation about this corner case.
The user should be able to provide a mask or indices on what features to bin, right? Also, shouldn't we do uniform bins and percentile bins in the same estimator? It's such a minor change. Any other obvious binning methods we might want to have? What would the estimator be called?

Have you looked at the code, @amueller? I'm not sure this scaling approach to fixed width discretisation easily transfers to quantiles.

Yes, that was one feature I didn't include from #5825. I didn't add it into this PR yet - I wanted to keep the PR small, so I could get easy feedback from people on whether this is a feasible route to go. I'm trying to keep it simple, as per @mblondel's request here. This PR should only contain fixed width binning for now.
# Shift data
epsilon = np.abs(X_trans.min() - np.ceil(X_trans.min()))
X_trans[X_trans != 0] += epsilon
@amueller You mentioned that you wanted this discretizer to mask out categorical features. My previous implementation in #5825 moved categorical features to the last columns of X. (For example, if there were 3 categorical features, after the transformation they would appear as the last 3 columns of X_trans.) I'm trying to avoid that in this implementation.

But the above lines in my code are built to work with sparse matrices. I'm thinking of implementing the masking by specifying which continuous features to bin, using the code below:
>>> from scipy.sparse import csr_matrix
>>> import numpy as np
>>>
>>> matrix = csr_matrix([[0, 1, 2, 3],
...                      [0, 0, 4, 0],
...                      [5, 0, 0, 0]], dtype=float)
>>> continuous = [0, 3]
>>> epsilon = 0.5
>>>
>>> submatrix = matrix[:, continuous]     # Get only the continuous features
>>> submatrix[submatrix != 0] += epsilon  # The shift step
>>>
>>> matrix[:, continuous] = submatrix
which looks reasonable, except that scipy 0.11.0 raises a NotImplementedError on this line:

submatrix[submatrix != 0] += epsilon  # The shift step

Do you have workarounds for "masking out" categorical features? (And allowing manipulation on only continuous features.)
Actually, this would work:

>>> from scipy.sparse import csr_matrix, issparse
>>> import numpy as np
>>>
>>> matrix = csr_matrix([[0, 1, 2, 3],
...                      [0, 0, 4, 0],
...                      [5, 0, 0, 0]], dtype=float)
>>> continuous = [0, 3]
>>> epsilon = 0.5
>>>
>>> submatrix = matrix[:, continuous]  # Get only the continuous features
>>> if issparse(submatrix):
...     submatrix.data += epsilon      # shift only the stored entries
... else:
...     submatrix[submatrix != 0] += epsilon
>>>
>>> matrix[:, continuous] = submatrix
But then I run into this error:
>>> matrix[:, continuous] = submatrix
NotImplementedError: Fancy indexing in assignment not supported for csr matrices.
Seeking suggestions, @amueller.
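One workaround worth considering (just a sketch, not necessarily how the PR should do it): work on the CSC representation and shift only the stored entries of the selected columns in place, which avoids fancy-index assignment entirely:

>>> import numpy as np
>>> from scipy.sparse import csr_matrix
>>>
>>> matrix = csr_matrix([[0, 1, 2, 3],
...                      [0, 0, 4, 0],
...                      [5, 0, 0, 0]], dtype=float)
>>> continuous = [0, 3]
>>> epsilon = 0.5
>>>
>>> csc = matrix.tocsc()
>>> for j in continuous:
...     start, end = csc.indptr[j], csc.indptr[j + 1]
...     csc.data[start:end] += epsilon  # shift only the stored (non-zero) entries
>>> matrix = csc.tocsr()

The implicit zeros are never touched, so the sparsity structure is preserved at the cost of one format conversion each way.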
X = X.copy()

# Clip outliers from data (TODO)
# X[X > self.max_] = self.max_
I need to do something that helps me achieve the following:

>>> import numpy as np
>>>
>>> array = np.array([[ 1, 2, 3],
...                   [ 0, 5, 2],
...                   [-1, 1, 2],
...                   [-2, 6, 4]])
>>> max_ = [0, 3, 2]
>>> array[array > max_] = max_  # NOTE: Syntax actually doesn't work
>>> array
array([[ 0,  2,  2],
       [ 0,  3,  2],
       [-1,  1,  2],
       [-2,  3,  2]])

But this is invalid syntax. Is there any nice numpy syntax to clip the maxes across each column? (The method needs to work on scipy sparse matrices.)
Well I'll be damned. np.clip exists.
This doesn't solve how we would adapt this solution to sparse matrices though.
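For reference, a minimal sketch of both cases (assuming the per-column maxima live in max_): np.clip broadcasts the per-column maxima in the dense case, and for sparse input only the stored entries of each column need clipping, which can be done through the CSC column pointers:

>>> import numpy as np
>>> from scipy.sparse import csr_matrix
>>>
>>> array = np.array([[ 1, 2, 3],
...                   [ 0, 5, 2],
...                   [-1, 1, 2],
...                   [-2, 6, 4]])
>>> max_ = np.array([0, 3, 2])
>>>
>>> # Dense case: broadcast the per-column maxima across the rows.
>>> np.clip(array, None, max_)
array([[ 0,  2,  2],
       [ 0,  3,  2],
       [-1,  1,  2],
       [-2,  3,  2]])
>>>
>>> # Sparse case: clip only the stored entries, column by column; the
>>> # implicit zeros are left alone, so sparsity is preserved.
>>> X = csr_matrix(array, dtype=float).tocsc()
>>> for j in range(X.shape[1]):
...     start, end = X.indptr[j], X.indptr[j + 1]
...     X.data[start:end] = np.minimum(X.data[start:end], max_[j])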
Some design questions about percentiles. What do we want the API to look like if we do percentile / quantile binning? Do we want the user to pass in an array of values between 1 and 100 which express where the percentile "cutpoints" are?
shouldn't …
I didn't look at the implementation, but I'm wondering if this relatively complex implementation is actually better than what I would have thought would be the naive one, which is iterating over columns and doing

    mins = np.min(X, axis=0)
    maxs = np.max(X, axis=0)
    self.bins = [np.linspace(mn, mx, n_bins + 1) for mn, mx in zip(mins, maxs)]

and

    out = [np.digitize(col, bins) for col, bins in zip(X.T, self.bins)]
    return OneHotEncoder().fit_transform(np.c_[out])

That seems much more readable to me.
The API for percentiles would also be to just provide a number of bins, and instead of a linspace we use percentiles.
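As a rough sketch of how the two strategies could share one code path (hypothetical names, dense input only): the per-feature bin edges come either from np.linspace or from np.percentile, and transform is a per-column np.digitize against the inner edges in both cases:

>>> import numpy as np
>>>
>>> rng = np.random.RandomState(0)
>>> X = rng.lognormal(size=(100, 2))   # skewed toy data
>>> n_bins = 4
>>>
>>> # Fixed width: evenly spaced edges between each column's min and max.
>>> uniform_edges = [np.linspace(col.min(), col.max(), n_bins + 1) for col in X.T]
>>>
>>> # Quantile based: edges at evenly spaced percentiles of each column.
>>> quantile_edges = [np.percentile(col, np.linspace(0, 100, n_bins + 1))
...                   for col in X.T]
>>>
>>> # Transform: digitize each column against its inner edges.
>>> binned = np.column_stack([np.digitize(col, e[1:-1])
...                           for col, e in zip(X.T, quantile_edges)])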
I have to admit I didn't think about the sparse case yet.

One of the reasons we were trying to go for a more complex implementation is so we could handle sparse matrices. We wanted any zero in the sparse matrix to map to zero in the discretized matrix. In the case of using the simpler approach above, that property wouldn't be preserved.

If you consider sparse matrices, @amueller, can you give me a second opinion on my code?
sklearn/preprocessing/__init__.py
@@ -35,6 +36,7 @@
    'Binarizer',
    'FunctionTransformer',
    'Imputer',
    'KBinsDiscretizer'
Looks like I need to add a comma here.
I understand that this "more complex" method allows for a linear time solution. Is it not straightforward to renumber the bins such that bin 0 is always the bin containing 0 in the input?
You can do that. But ultimately what that means is the algorithm will transform a sparse array to a dense array (before another transform to turn it back into a sparse array). This is what we were trying to avoid in #5825 (which is what led to the long discussion in that PR). My solution would preserve the sparsity of the input matrix, which is ideal. It's complex, but I don't think it's unreadable.
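To make the renumbering idea concrete, here is a tiny sketch (purely hypothetical, assuming one fixed range shared by the binned features): compute each stored value's bin relative to the bin that zero falls into, so the implicit zeros never need explicit entries and the input stays sparse:

>>> import numpy as np
>>> from scipy.sparse import csr_matrix
>>>
>>> X = csr_matrix([[0.0, 2.5], [1.2, 0.0], [3.9, 0.4]])
>>> n_bins, lo, hi = 4, 0.0, 4.0          # assumed common range for both features
>>> width = (hi - lo) / n_bins
>>>
>>> zero_bin = np.floor((0.0 - lo) / width)
>>> X.data = np.floor((X.data - lo) / width) - zero_bin
>>> X.eliminate_zeros()                   # drop stored values that landed in zero's bin

Whether this stays readable once per-feature ranges and corner cases enter the picture is exactly the trade-off being discussed here.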
On another note, I don't think …
# Corner case arises when maximum values across each column of X
# get shifted to the next integer.
for col in discretized.T:
np.clip here for dense matrices. Still need to find a solution for sparse matrices though.
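For the dense case that could be as small as the following sketch (hypothetical names; discretized holds the floored per-column bin indices):

>>> import numpy as np
>>>
>>> discretized = np.array([[0., 1., 3.],
...                         [2., 3., 1.],
...                         [1., 0., 2.]])
>>> n_bins = np.array([3, 4, 3])           # assumed number of bins per feature
>>>
>>> # Cap each column so its maximum stays in the last bin instead of
>>> # spilling into an extra one.
>>> discretized = np.clip(discretized, 0, n_bins - 1)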
Looks like my standpoint is the opposite of @mblondel's ^^ I'd rather have quantiles and no sparse support and a simple implementation. If we don't do the one-hot-encoding here, this estimator must always be followed by a OneHotEncoder - which I find rather strange. Why do we burden the user with that? And if we one-hot-encode the data, what does that mean for sparse data?

@mblondel can you maybe give an example of your use case? I think I don't understand it. I thought the standard use-case of this would be some continuous features that you want to discretize and then learn a linear model on. That would work well with fixed width and quantiles and (at least with newer numpy) could be done in a couple of lines. What is the sparse matrix use-case, what data representation do you want to get out, and what models do you want to use?

I re-read #5778 and that is actually exactly what I would have thought of. I'm not sure how that relates to sparse input. If we one-hot-encode sparse input, we will have one non-zero entry in the one-hot-encoded version for every entry in the input, including the zero ones. We could add zero as the "reference class" and do a one-hot-encoding where zero is represented by none of the entries set to one. Was that the plan? Currently OneHotEncoder doesn't support that. Also, our OneHotEncoder doesn't support any sparse input, right?
If you design your bin numbering such that 0 always ends up transforming to 0…
That's right, I'm pretty sure OHE doesn't currently support sparse input.
I do agree with you that returning the one-hot encoding is the main use case, as I originally intended in #5778. Returning the ordinal value could be potentially useful for "denoising" the features but I have no evidence of that.
My original use case was to feed the one-hot encoding to models that use feature combinations such as SVMs with a polynomial kernel or factorization machines. However, this was months ago and I don't remember it well.
I would map zeros to a zero vector. Even if you decide to focus on the dense case for now, it would be nice to keep in mind the sparse case when designing the discretization strategy.
I think it's worth returning ordinals at least initially. Certainly, other…

I think this could be counter-intuitive when our one-hot encoder and…
I'll start the dense case soon. But as has been alluded to, it'll be important to be mindful of the sparse case later.

PR is ready for review. I added unit tests to this PR. I did not establish the behavior for 1d arrays though; I'm leaving that open for now.

Change title to MRG?

Of course, someone else could merge too. I've been too involved with it to be careful and impartial, though I'm sure you have been, @glemaitre.
from sklearn.preprocessing.data import _transform_selected
from sklearn.utils.validation import check_array
from sklearn.utils.validation import check_is_fitted
from sklearn.utils.validation import column_or_1d
these should be relative imports
Also, could this be illustrated in an example showing a gain in training time without performance loss, or a gain in performance due to a form of feature normalization or removal of extreme/noisy samples?
Hmm. Do you have anything in mind @agramfort? I would usually consider this useful when coupled with something like a UnaryEncoder.
I don't see how this kind of transform can be useful. To me binning is not what is implemented in this PR. Instead, each bin should be expanded to a new dimension (using a sparse matrix as output to avoid materializing zeros in memory). For instance, if you have a dataset with 3 continuous input features, each feature would be expanded into one column per bin. This way, a linear model combined with such a binning preprocessor could model univariate non-linear dependencies between the output variable and each input variable. (It can be made even more expressive by chaining binning with polynomial feature extraction.) I think @pprett has significant practical experience with using binning as a preprocessing strategy. I would love to get his input on this PR.
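To illustrate the kind of output being described, here is a sketch with assumed numbers (3 continuous features, 5 fixed width bins each, hence 15 sparse indicator columns), chaining per-column binning with a one-hot encoding (using OneHotEncoder's categories parameter from current scikit-learn, not the API as of this PR):

>>> import numpy as np
>>> from sklearn.preprocessing import OneHotEncoder
>>>
>>> rng = np.random.RandomState(0)
>>> X = rng.uniform(size=(100, 3))          # 3 continuous features
>>> n_bins = 5
>>>
>>> # Per-feature fixed width inner edges, then ordinal bin indices.
>>> edges = [np.linspace(col.min(), col.max(), n_bins + 1)[1:-1] for col in X.T]
>>> ordinal = np.column_stack([np.digitize(col, e) for col, e in zip(X.T, edges)])
>>>
>>> # Expand every (feature, bin) pair into its own sparse indicator column.
>>> enc = OneHotEncoder(categories=[list(range(n_bins))] * X.shape[1])
>>> expanded = enc.fit_transform(ordinal)   # scipy sparse output
>>> expanded.shape
(100, 15)

A linear model fit on expanded can then give each bin of each feature its own weight, which is what makes the univariate non-linearities learnable.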
I think the feature that you mentioned was the original one that @mblondel was striving for. However, the initial idea moved toward the current proposal, which mirrors the R feature named "cut". If we pipeline a OHE or the Categorical Transformer, do we get something similar to what you're proposing?

I was thinking that we could enforce the output dtype to be integer. Right now this is not the case because we want to let the user pass some columns through unchanged. However, I think that we should use the Column Transformer for this purpose and restrict the output dtype of the proposed transformer.
Do you mean 3 continuous features or am I missing something? I agree this is the common desirable output; I was thinking, just like @glemaitre, that it can be obtained by piping through, e.g., OneHotEncoder. While I can see the arguments for directly outputting the one-hot encoded format, here are a few arguments (albeit kinda weak) in favor of separating the concerns: …

Yes, 3 input features, sorry :)
Ideally, I would like to get the option to do linear interpolation inside the bins and output sparse yet continuous values between 0 and 1. E.g. assume an input continuous feature that is uniformly distributed between 1 and 6; instead of a hard 0/1 bin indicator, a value would activate its neighbouring bins fractionally, in proportion to where it falls inside the bin. This kind of transformer could not be implemented by pipelining the transformer in this PR with an integer categorical encoder.
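For what it's worth, a minimal sketch of such fractional, interpolated bin activations for one feature (purely hypothetical, with assumed bin edges on [1, 6]): each value activates its two nearest bin centres with weights in [0, 1] that sum to 1 instead of a single hard indicator:

>>> import numpy as np
>>>
>>> x = np.array([1.0, 2.3, 5.9])
>>> edges = np.linspace(1, 6, 6)            # 5 bins of width 1 on [1, 6]
>>> centers = (edges[:-1] + edges[1:]) / 2
>>>
>>> # Fractional position of each value on the grid of bin centres.
>>> pos = np.clip((x - centers[0]) / (centers[1] - centers[0]), 0, len(centers) - 1)
>>> lower = np.floor(pos).astype(int)
>>> upper = np.minimum(lower + 1, len(centers) - 1)
>>> frac = pos - lower
>>>
>>> out = np.zeros((len(x), len(centers)))
>>> out[np.arange(len(x)), lower] = 1 - frac
>>> out[np.arange(len(x)), upper] += frac

For x = 2.3 this yields the row [0.2, 0.8, 0, 0, 0] rather than the one-hot [0, 1, 0, 0, 0], which is the kind of output a pipeline of this PR plus an integer categorical encoder cannot produce.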
Indeed, incremental binning is another interesting transform.
I would like to see this have an encode={ordinal, onehot, unary} parameter. But the UnaryEncoder is not merged and this is usable as is. The question from @agramfort was to prove that it is useful as is...
And good work, Vlad, on coming up with cases where this may be useful as is.
Is incremental binning what it's called these days? I think the core idea is the same. These are all extensions we should farm out to contributors for the next version, no?
I don't see any case where this preprocessing would perform better than …

This is probably the only valid use case I see for this transformer in its current state.

To me it smells YAGNI as long as I don't see another concrete use case. In my opinion, binning is mostly useful to help linear models work on non-linearly separable tasks (possibly as a complement to polynomial feature expansions). I would not be in favor of merging this PR if it cannot tackle that use case by default. The default behavior of our models should nudge our users into using them in a way that reflects commonly acknowledged best practices.
I don't know. Any Kaggle master of feature engineering around here? I did a bit of Kaggle forum scouting, and another motivation for binning is to remove noise when you assume that small feature value variations are not important to the output. Still, I would like to see a dataset that demonstrates the benefits of this univariate noise removal alone. While googling I found another interesting binning strategy: Head Tail Breaks binning for features whose ranked values follow a long tail distribution: https://youtu.be/HS7mObQttxU?t=22m13s At the end of the presentation the presenter states that a linear regression model can benefit from this because the resulting one-hot encoding makes the transformation non-linear, although the normalization effect of the transform can also be responsible; it would be interesting to benchmark against Linear Regression on features preprocessed by QuantileTransformer. The Head Tail Breaks binning method is still interesting because the number of bins is determined adaptively depending on the distribution of the data.
Also, I question the choice of using fixed width bins as the default strategy. I think quantile based binning or even 1D K-Means based binning would be more robust to features with non-uniform distributions. We could keep fixed width binning as an option, but I would rather be in favor of using quantile based binning by default.
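For reference, a quick sketch of what 1D k-means based bin edges could look like on a single long-tailed feature (hypothetical, just to contrast with fixed width edges):

>>> import numpy as np
>>> from sklearn.cluster import KMeans
>>>
>>> x = np.random.RandomState(0).exponential(size=(1000, 1))   # long-tailed feature
>>>
>>> km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(x)
>>> centers = np.sort(km.cluster_centers_.ravel())
>>>
>>> # Bin edges halfway between consecutive cluster centres.
>>> edges = (centers[:-1] + centers[1:]) / 2
>>> binned = np.digitize(x.ravel(), edges)

The narrow bins end up where the data is dense and the wide ones in the tail, which is the robustness argument being made here.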
Here is my solution. The code has been reviewed and found good in terms of usability, reliability, maintainability, etc. It falls short on demonstrated usefulness (although the output is flexible). I have squashed this work into https://github.com/scikit-learn/scikit-learn/tree/discrete and propose that we: …

My motivation for this is simply that this PR has grown long and unwieldy, is otherwise approved, and the enhancements need not be forced onto @hlin117.
So are we okay for me to close this and create new issues calling for PRs to the discrete branch?

+1
Both quantile and KMeans discretization will use bin edges rather than bin widths in transform, so the fixed width case can be implemented more efficiently and has a different set of attributes. Things like automatic determination of fixed bin width (as per np.histogram) are also inappropriate to those methods. So either we need to reject this, or we need to accept that fixed width and those belong in different estimators. In which case, perhaps this needs to change back to being FixedWidthDiscretizer. The output encoding stuff should be more-or-less (not sure about the interpolation stuff) shared across the discretizers.
Congratulations and thanks to @hlin117 for his work!

(And maybe it wouldn't be terrible to include quantile or kmeans discretization in the same class. It would be nice to know how useful or otherwise fixed width discretization is in practice relative to those.)

Thanks @jnothman and @glemaitre for helping to review my code!

A little insight I had today thinking about the KBinsDiscretizer: it is related to a hard version of kernel expansion with Gaussian kernels. Is this explained anywhere? Maybe it would be interesting to add this to some example?
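A small sketch of that analogy (assumed toy numbers): with Gaussian (RBF) features centred at the bin centres, shrinking the bandwidth turns the soft activations into the hard one-hot bin indicators:

>>> import numpy as np
>>>
>>> x = np.array([0.3, 1.7, 2.2])
>>> centers = np.array([0.5, 1.5, 2.5])     # assumed bin centres
>>>
>>> # Soft Gaussian kernel expansion of the feature.
>>> gamma = 50.0
>>> soft = np.exp(-gamma * (x[:, None] - centers[None, :]) ** 2)
>>>
>>> # In the narrow-kernel limit only the nearest centre survives, which is
>>> # exactly the hard one-hot bin assignment.
>>> hard = (soft == soft.max(axis=1, keepdims=True)).astype(float)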
Summary
This PR addresses #5778. It adds the KBinsDiscretizer class to scikit-learn. This PR closes #5825 as a much more straightforward discretization algorithm. It is a WIP; please feel free to leave comments about the current code before the code gets hairy and hard to read 😄
Roadmap
Permission
Features and code
- inverse_transform function

Documentation
- doc/modules/classes.rst
- doc/modules/preprocessing.rst