[MRG+2] Add fixed width discretization to scikit-learn #7668
Conversation
# Generate output
discretized = np.floor(X_trans)

# Addressing corner case issues
Should definitely use discretized.min(axis=0) and discretized.max(axis=0) here. I also owe the reader an explanation about this corner case.
The user should be able to provide a mask or indices on what features to bin, right? Also, shouldn't we do uniform bins and percentile bins in the same estimator? It's such a minor change. Any other obvious binning methods we might want to have? What would the estimator be called?

Have you looked at the code, @amueller? I'm not sure this scaling approach to fixed width discretisation easily transfers to quantiles.

Yes, that was one feature I didn't include from #5825. I didn't add it into this PR yet - I wanted to keep the PR small, so I could get easy feedback from people on whether this is a feasible route to go. I'm trying to keep it simple, as per @mblondel's request here. This PR should only contain fixed width binning for now.
# Shift data
epsilon = np.abs(X_trans.min() - np.ceil(X_trans.min()))
X_trans[X_trans != 0] += epsilon
@amueller You mentioned that you wanted this discretizer to mask out categorical features. My previous implementation in #5825 moved categorical features to the last columns of X. (For example, if there were 3 categorical features, after the transformation they would appear as the last 3 columns of X_trans.) I'm trying to avoid that in this implementation.

But the above lines in my code are built to work with sparse matrices. I'm thinking of implementing the masking by specifying which continuous features to bin, using the code below:
>>> from scipy.sparse import csr_matrix
>>> import numpy as np
>>>
>>> matrix = csr_matrix([[0, 1, 2, 3],
...                      [0, 0, 4, 0],
...                      [5, 0, 0, 0]], dtype=float)
>>> continuous = [0, 3]
>>> epsilon = 0.5
>>>
>>> submatrix = matrix[:, continuous]     # Get only the continuous features
>>> submatrix[submatrix != 0] += epsilon  # The shift step
>>>
>>> matrix[:, continuous] = submatrix
which looks reasonable, except that scipy 0.11.0 raises a NotImplementedError on this line:

submatrix[submatrix != 0] += epsilon  # The shift step

Do you have workarounds for "masking out" categorical features? (And allowing manipulation on only continuous features.)
Actually, this would work:

>>> from scipy.sparse import csr_matrix, issparse
>>> import numpy as np
>>>
>>> matrix = csr_matrix([[0, 1, 2, 3],
...                      [0, 0, 4, 0],
...                      [5, 0, 0, 0]], dtype=float)
>>> continuous = [0, 3]
>>> epsilon = 0.5
>>>
>>> submatrix = matrix[:, continuous]  # Get only the continuous features
>>> if issparse(submatrix):
...     submatrix.data += epsilon      # shift only the stored entries
... else:
...     submatrix[submatrix != 0] += epsilon
>>>
>>> matrix[:, continuous] = submatrix
But then I run into this error:
>>> matrix[:, continuous] = submatrix
NotImplementedError: Fancy indexing in assignment not supported for csr matrices.
Seeking suggestions, @amueller.
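One workaround worth considering (just a sketch, not necessarily how the PR should do it): work on the CSC representation and shift only the stored entries of the selected columns in place, which avoids fancy-index assignment entirely:

>>> import numpy as np
>>> from scipy.sparse import csr_matrix
>>>
>>> matrix = csr_matrix([[0, 1, 2, 3],
...                      [0, 0, 4, 0],
...                      [5, 0, 0, 0]], dtype=float)
>>> continuous = [0, 3]
>>> epsilon = 0.5
>>>
>>> csc = matrix.tocsc()
>>> for j in continuous:
...     start, end = csc.indptr[j], csc.indptr[j + 1]
...     csc.data[start:end] += epsilon  # shift only the stored (non-zero) entries
>>> matrix = csc.tocsr()

The implicit zeros are never touched, so the sparsity structure is preserved at the cost of one format conversion each way.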
X = X.copy()

# Clip outliers from data (TODO)
# X[X > self.max_] = self.max_
I need to do something that helps me achieve the following:

>>> import numpy as np
>>>
>>> array = np.array([[ 1, 2, 3],
...                   [ 0, 5, 2],
...                   [-1, 1, 2],
...                   [-2, 6, 4]])
>>> max_ = [0, 3, 2]
>>> array[array > max_] = max_  # NOTE: Syntax actually doesn't work
>>> array
array([[ 0,  2,  2],
       [ 0,  3,  2],
       [-1,  1,  2],
       [-2,  3,  2]])

But this is invalid syntax. Is there any nice numpy syntax to clip the maxes across each column? (The method needs to work on scipy sparse matrices.)
Well I'll be damned. np.clip exists.
This doesn't solve how we would adapt this solution to sparse matrices though.
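For reference, a minimal sketch of both cases (assuming the per-column maxima live in max_): np.clip broadcasts the per-column maxima in the dense case, and for sparse input only the stored entries of each column need clipping, which can be done through the CSC column pointers:

>>> import numpy as np
>>> from scipy.sparse import csr_matrix
>>>
>>> array = np.array([[ 1, 2, 3],
...                   [ 0, 5, 2],
...                   [-1, 1, 2],
...                   [-2, 6, 4]])
>>> max_ = np.array([0, 3, 2])
>>>
>>> # Dense case: broadcast the per-column maxima across the rows.
>>> np.clip(array, None, max_)
array([[ 0,  2,  2],
       [ 0,  3,  2],
       [-1,  1,  2],
       [-2,  3,  2]])
>>>
>>> # Sparse case: clip only the stored entries, column by column; the
>>> # implicit zeros are left alone, so sparsity is preserved.
>>> X = csr_matrix(array, dtype=float).tocsc()
>>> for j in range(X.shape[1]):
...     start, end = X.indptr[j], X.indptr[j + 1]
...     X.data[start:end] = np.minimum(X.data[start:end], max_[j])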
Some design questions about percentiles. What do we want the API to look like if we do percentile / quantile binning? Do we want the user to pass in an array of values between 1 and 100 which express where the percentile "cutpoints" are?
shouldn't …
I didn't look at the implementation, but I'm wondering if this relatively complex implementation is actually better than what I would have thought would be the naive one, which is iterating over columns and doing

    mins = np.min(X, axis=0)
    maxs = np.max(X, axis=0)
    self.bins = [np.linspace(mn, mx, n_bins + 1) for mn, mx in zip(mins, maxs)]

and

    out = [np.digitize(col, bins) for col, bins in zip(X.T, self.bins)]
    return OneHotEncoder().fit_transform(np.c_[out])

That seems much more readable to me.
The API for percentiles would also be to just provide a number of bins, and instead of a linspace we use percentiles.
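As a rough sketch of how the two strategies could share one code path (hypothetical names, dense input only): the per-feature bin edges come either from np.linspace or from np.percentile, and transform is a per-column np.digitize against the inner edges in both cases:

>>> import numpy as np
>>>
>>> rng = np.random.RandomState(0)
>>> X = rng.lognormal(size=(100, 2))   # skewed toy data
>>> n_bins = 4
>>>
>>> # Fixed width: evenly spaced edges between each column's min and max.
>>> uniform_edges = [np.linspace(col.min(), col.max(), n_bins + 1) for col in X.T]
>>>
>>> # Quantile based: edges at evenly spaced percentiles of each column.
>>> quantile_edges = [np.percentile(col, np.linspace(0, 100, n_bins + 1))
...                   for col in X.T]
>>>
>>> # Transform: digitize each column against its inner edges.
>>> binned = np.column_stack([np.digitize(col, e[1:-1])
...                           for col, e in zip(X.T, quantile_edges)])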
I have to admit I didn't think about the sparse case yet.

One of the reasons we were trying to go for a more complex implementation is so we could handle sparse matrices. We wanted any zero in the sparse matrix to map to zero in the discretized matrix. In the case of using the simpler approach above, that property wouldn't be preserved.

If you consider sparse matrices, @amueller, can you give me a second opinion on my code?
sklearn/preprocessing/__init__.py
@@ -35,6 +36,7 @@
    'Binarizer',
    'FunctionTransformer',
    'Imputer',
    'KBinsDiscretizer'
Looks like I need to add a comma here.
I understand that this "more complex" method allows for a linear time solution. Is it not straightforward to renumber the bins such that bin 0 is always the bin containing 0 in the input?
You can do that. But ultimately what that means is the algorithm will transform a sparse array to a dense array (before another transform to turn it back into a sparse array). This is what we were trying to avoid in #5825 (which is what led to the long discussion in that PR). My solution would preserve the sparsity of the input matrix, which is ideal. It's complex, but I don't think it's unreadable.
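To make the renumbering idea concrete, here is a tiny sketch (purely hypothetical, assuming one fixed range shared by the binned features): compute each stored value's bin relative to the bin that zero falls into, so the implicit zeros never need explicit entries and the input stays sparse:

>>> import numpy as np
>>> from scipy.sparse import csr_matrix
>>>
>>> X = csr_matrix([[0.0, 2.5], [1.2, 0.0], [3.9, 0.4]])
>>> n_bins, lo, hi = 4, 0.0, 4.0          # assumed common range for both features
>>> width = (hi - lo) / n_bins
>>>
>>> zero_bin = np.floor((0.0 - lo) / width)
>>> X.data = np.floor((X.data - lo) / width) - zero_bin
>>> X.eliminate_zeros()                   # drop stored values that landed in zero's bin

Whether this stays readable once per-feature ranges and corner cases enter the picture is exactly the trade-off being discussed here.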
On another note, I don't think …
# Corner case arises when maximum values across each column of X
# get shifted to the next integer.
for col in discretized.T:
np.clip here for dense matrices. Still need to find a solution for sparse matrices though.
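For the dense case that could be as small as the following sketch (hypothetical names; discretized holds the floored per-column bin indices):

>>> import numpy as np
>>>
>>> discretized = np.array([[0., 1., 3.],
...                         [2., 3., 1.],
...                         [1., 0., 2.]])
>>> n_bins = np.array([3, 4, 3])           # assumed number of bins per feature
>>>
>>> # Cap each column so its maximum stays in the last bin instead of
>>> # spilling into an extra one.
>>> discretized = np.clip(discretized, 0, n_bins - 1)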
Looks like my standpoint is the opposite of @mblondel's ^^ I'd rather have quantiles and no sparse support and a simple implementation. If we don't do the one-hot-encoding here, this estimator must always be followed by a OneHotEncoder - which I find rather strange. Why do we burden the user with that? And if we one-hot-encode the data, what does that mean for sparse data?

@mblondel can you maybe give an example of your use case? I think I don't understand it. I thought the standard use-case of this would be some continuous features that you want to discretize and then learn a linear model on. That would work well with fixed width and quantiles and (at least with newer numpy) could be done in a couple of lines. What is the sparse matrix use-case, what data representation do you want to get out, and what models do you want to use?

I re-read #5778 and that is actually exactly what I would have thought of. I'm not sure how that relates to sparse input. If we one-hot-encode sparse input, we will have one non-zero entry in the one-hot-encoded version for every entry in the input, including the zero ones. We could add zero as the "reference class" and do a one-hot-encoding where zero is represented by none of the entries set to one. Was that the plan? Currently OneHotEncoder doesn't support that. Also, our OneHotEncoder doesn't support any sparse input, right?
If you design your bin numbering such that 0 always ends up transforming to 0…
That's right, I'm pretty sure OHE doesn't currently support sparse input.
I do agree with you that returning the one-hot encoding is the main use case, as I originally intended in #5778. Returning the ordinal value could be potentially useful for "denoising" the features but I have no evidence of that.
My original use case was to feed the one-hot encoding to models that use feature combinations such as SVMs with a polynomial kernel or factorization machines. However, this was months ago and I don't remember it well.
I would map zeros to a zero vector. Even if you decide to focus on the dense case for now, it would be nice to keep in mind the sparse case when designing the discretization strategy.
I think it's worth returning ordinals at least initially. Certainly, other…

I think this could be counter-intuitive when our one-hot encoder and…
I'll start the dense case soon. But as has been alluded to, it'll be important to be mindful of the sparse case later.

PR is ready for review. I added unit tests to this PR. I did not establish the behavior for 1d arrays though; I'm leaving that open for now.

Change title to MRG?

Of course, someone else could merge too. I've been too involved with it to be careful and impartial, though I'm sure you have been, @glemaitre.
from sklearn.preprocessing.data import _transform_selected
from sklearn.utils.validation import check_array
from sklearn.utils.validation import check_is_fitted
from sklearn.utils.validation import column_or_1d
these should be relative imports
Also, could this be illustrated in an example showing a gain in training time without performance loss, or a gain in performance due to a form of feature normalization or removal of extreme/noisy samples?
Hmm. Do you have anything in mind @agramfort? I would usually consider this useful when coupled with something like a UnaryEncoder.
I don't see how this kind of transform can be useful. To me binning is not what is implemented in this PR. Instead, each bin should be expanded to a new dimension (using a sparse matrix as output to avoid materializing zeros in memory). For instance, if you have a dataset with 3 continuous input features, each feature would be expanded into one column per bin. This way, a linear model combined with such a binning preprocessor could model univariate non-linear dependencies between the output variable and each input variable. (It can be made even more expressive by chaining binning with polynomial feature extraction.) I think @pprett has significant practical experience with using binning as a preprocessing strategy. I would love to get his input on this PR.
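To illustrate the kind of output being described, here is a sketch with assumed numbers (3 continuous features, 5 fixed width bins each, hence 15 sparse indicator columns), chaining per-column binning with a one-hot encoding (using OneHotEncoder's categories parameter from current scikit-learn, not the API as of this PR):

>>> import numpy as np
>>> from sklearn.preprocessing import OneHotEncoder
>>>
>>> rng = np.random.RandomState(0)
>>> X = rng.uniform(size=(100, 3))          # 3 continuous features
>>> n_bins = 5
>>>
>>> # Per-feature fixed width inner edges, then ordinal bin indices.
>>> edges = [np.linspace(col.min(), col.max(), n_bins + 1)[1:-1] for col in X.T]
>>> ordinal = np.column_stack([np.digitize(col, e) for col, e in zip(X.T, edges)])
>>>
>>> # Expand every (feature, bin) pair into its own sparse indicator column.
>>> enc = OneHotEncoder(categories=[list(range(n_bins))] * X.shape[1])
>>> expanded = enc.fit_transform(ordinal)   # scipy sparse output
>>> expanded.shape
(100, 15)

A linear model fit on expanded can then give each bin of each feature its own weight, which is what makes the univariate non-linearities learnable.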
I think the feature that you mentioned was the original one that @mblondel was striving for. However, the initial idea moved toward the current proposal, which mirrors the R feature named "cut". If we pipeline a OHE or the Categorical Transformer, do we get something similar to what you're proposing?

I was thinking that we could enforce the output dtype to be integer. Right now this is not the case because we want to let the user pass some columns through unchanged. However, I think that we should use the Column Transformer for this purpose and restrict the output dtype of the proposed transformer.
Do you mean 3 continuous features or am I missing something? I agree this is the common desirable output; I was thinking, just like @glemaitre, that it can be obtained by piping through, e.g., OneHotEncoder. While I can see the arguments for directly outputting the one-hot encoded format, here are a few arguments (albeit kinda weak) in favor of separating the concerns: …

Yes, 3 input features, sorry :)
Ideally, I would like to get the option to do linear interpolation inside the bins and output sparse yet continuous values between 0 and 1. E.g. assume an input continuous feature that is uniformly distributed between 1 and 6; instead of a hard 0/1 bin indicator, a value would activate its neighbouring bins fractionally, in proportion to where it falls inside the bin. This kind of transformer could not be implemented by pipelining the transformer in this PR with an integer categorical encoder.
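For what it's worth, a minimal sketch of such fractional, interpolated bin activations for one feature (purely hypothetical, with assumed bin edges on [1, 6]): each value activates its two nearest bin centres with weights in [0, 1] that sum to 1 instead of a single hard indicator:

>>> import numpy as np
>>>
>>> x = np.array([1.0, 2.3, 5.9])
>>> edges = np.linspace(1, 6, 6)            # 5 bins of width 1 on [1, 6]
>>> centers = (edges[:-1] + edges[1:]) / 2
>>>
>>> # Fractional position of each value on the grid of bin centres.
>>> pos = np.clip((x - centers[0]) / (centers[1] - centers[0]), 0, len(centers) - 1)
>>> lower = np.floor(pos).astype(int)
>>> upper = np.minimum(lower + 1, len(centers) - 1)
>>> frac = pos - lower
>>>
>>> out = np.zeros((len(x), len(centers)))
>>> out[np.arange(len(x)), lower] = 1 - frac
>>> out[np.arange(len(x)), upper] += frac

For x = 2.3 this yields the row [0.2, 0.8, 0, 0, 0] rather than the one-hot [0, 1, 0, 0, 0], which is the kind of output a pipeline of this PR plus an integer categorical encoder cannot produce.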
Indeed, incremental binning is another interesting transform.
I would like to see this have an encode={ordinal, onehot, unary} parameter. But the UnaryEncoder is not merged and this is usable as is. The question from @agramfort was to prove that it is useful as is...
And good work, Vlad, on coming up with cases where this may be useful as is.
Is incremental binning what it's called these days? I think the core idea is the same. These are all extensions we should farm out to contributors for the next version, no?
I don't see any case where this preprocessing would perform better than …

This is probably the only valid use case I see for this transformer in its current state.

To me it smells YAGNI as long as I don't see another concrete use case. In my opinion, binning is mostly useful to help linear models work on non-linearly separable tasks (possibly as a complement to polynomial feature expansions). I would not be in favor of merging this PR if it cannot tackle that use case by default. The default behavior of our models should nudge our users into using them in a way that reflects commonly acknowledged best practices.
I don't know. Any Kaggle master of feature engineering around here? I did a bit of Kaggle forum scouting, and another motivation for binning is to remove noise when you assume that small feature value variations are not important to the output. Still, I would like to see a dataset that demonstrates the benefits of this univariate noise removal alone. While googling I found another interesting binning strategy: Head Tail Breaks binning for features whose ranked values follow a long tail distribution: https://youtu.be/HS7mObQttxU?t=22m13s At the end of the presentation the presenter states that a linear regression model can benefit from this because the resulting one-hot encoding makes the transformation non-linear, although the normalization effect of the transform can also be responsible; it would be interesting to benchmark against Linear Regression on features preprocessed by QuantileTransformer. The Head Tail Breaks binning method is still interesting because the number of bins is determined adaptively depending on the distribution of the data.
Also, I question the choice of using fixed width bins as the default strategy. I think quantile based binning or even 1D K-Means based binning would be more robust to features with non-uniform distributions. We could keep fixed width binning as an option, but I would rather be in favor of using quantile based binning by default.
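For reference, a quick sketch of what 1D k-means based bin edges could look like on a single long-tailed feature (hypothetical, just to contrast with fixed width edges):

>>> import numpy as np
>>> from sklearn.cluster import KMeans
>>>
>>> x = np.random.RandomState(0).exponential(size=(1000, 1))   # long-tailed feature
>>>
>>> km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(x)
>>> centers = np.sort(km.cluster_centers_.ravel())
>>>
>>> # Bin edges halfway between consecutive cluster centres.
>>> edges = (centers[:-1] + centers[1:]) / 2
>>> binned = np.digitize(x.ravel(), edges)

The narrow bins end up where the data is dense and the wide ones in the tail, which is the robustness argument being made here.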
Here is my solution. The code has been reviewed and found good in terms of usability, reliability, maintainability, etc. It falls short on demonstrated usefulness (although the output is flexible). I have squashed this work into https://github.com/scikit-learn/scikit-learn/tree/discrete and propose that we: …

My motivation for this is simply that this PR has grown long and unwieldy, is otherwise approved, and the enhancements need not be forced onto @hlin117.
So are we okay for me to close this and create new issues calling for PRs to the discrete branch?

+1
Both quantile and KMeans discretization will use bin edges rather than bin widths in transform, so the fixed width case can be implemented more efficiently and has a different set of attributes. Things like automatic determination of fixed bin width (as per np.histogram) are also inappropriate to those methods. So either we need to reject this, or we need to accept that fixed width and those belong in different estimators. In which case, perhaps this needs to change back to being FixedWidthDiscretizer. The output encoding stuff should be more-or-less (not sure about the interpolation stuff) shared across the discretizers.
Congratulations and thanks to @hlin117 for his work!

(And maybe it wouldn't be terrible to include quantile or kmeans discretization in the same class. It would be nice to know how useful or otherwise fixed width discretization is in practice relative to those.)

Thanks @jnothman and @glemaitre for helping to review my code!

A little insight I had today thinking about the KBinsDiscretizer: it is related to a hard version of kernel expansion with Gaussian kernels. Is this explained anywhere? Maybe it would be interesting to add this to some example?
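A small sketch of that analogy (assumed toy numbers): with Gaussian (RBF) features centred at the bin centres, shrinking the bandwidth turns the soft activations into the hard one-hot bin indicators:

>>> import numpy as np
>>>
>>> x = np.array([0.3, 1.7, 2.2])
>>> centers = np.array([0.5, 1.5, 2.5])     # assumed bin centres
>>>
>>> # Soft Gaussian kernel expansion of the feature.
>>> gamma = 50.0
>>> soft = np.exp(-gamma * (x[:, None] - centers[None, :]) ** 2)
>>>
>>> # In the narrow-kernel limit only the nearest centre survives, which is
>>> # exactly the hard one-hot bin assignment.
>>> hard = (soft == soft.max(axis=1, keepdims=True)).astype(float)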
Summary
This PR addresses #5778. It adds the KBinsDiscretizer class to scikit-learn. This PR closes #5825 as a much more straightforward discretization algorithm. It is a WIP; please feel free to leave comments about the current code before the code gets hairy and hard to read 😄
Roadmap
Permission
Features and code
- inverse_transform function

Documentation
- doc/modules/classes.rst
- doc/modules/preprocessing.rst