
[MRG+2] Add fixed width discretization to scikit-learn #7668


Closed
wants to merge 42 commits

Conversation

hlin117
Contributor

@hlin117 hlin117 commented Oct 14, 2016

Summary

This PR addresses #5778. It adds the KBinsDiscretizer class to scikit-learn.

This PR supersedes #5825 with a much more straightforward discretization algorithm. It is a WIP; please feel free to leave comments on the current code before it gets hairy and hard to read 😄

Roadmap

Permission

  • Package owner approval of algorithm

Features and code

  • Uniform discretization strategy for 2D arrays
  • inverse_transform function
  • Unit tests (duh)

Documentation

  • User guide
  • Docs in doc/modules/classes.rst
  • Docs in doc/modules/preprocessing.rst
  • Docstring tests

# Generate output
discretized = np.floor(X_trans)

# Addressing corner case issues
Contributor Author

Should definitely use discretized.min(axis=0) and discretized.max(axis=0) here. I also owe the reader an explanation about this corner case.

@amueller
Member

The user should be able to provide a mask or indices on what features to bin, right? Also, shouldn't we do uniform bins and percentile bins in the same estimator? It's such a minor change. Any other obvious binning methods we might want to have? What would the estimator be called?

@jnothman
Member

Have you looked at the code, @amueller? I'm not sure this scaling approach to fixed width discretisation easily transfers to quantiles.

@hlin117
Contributor Author

hlin117 commented Oct 15, 2016

@amueller

The user should be able to provide a mask or indices on what features to bin, right?

Yes, that was one feature from #5825 that I didn't include. I haven't added it to this PR yet - I wanted to keep the PR small, so I could get early feedback on whether this is a feasible route to take.

Also, shouldn't we do uniform bins and percentile bins in the same estimator? It's such a minor change. Any other obvious binning methods we might want to have? What would the estimator be called?

I'm trying to keep it simple like @mblondel's request here. This PR should only contain binning for k fixed bins. I can definitely extend the work more later on.


# Shift data
epsilon = np.abs(X_trans.min() - np.ceil(X_trans.min()))
X_trans[X_trans != 0] += epsilon
Contributor Author

@hlin117 hlin117 Oct 15, 2016

@amueller You mentioned that you wanted this discretizer to mask out categorical features. My previous implementation in #5825 moved categorical features to the last columns of X. (For example, if there were 3 categorical features, they would appear as the last 3 columns of X_trans after the transformation.) I'm trying to avoid that in this implementation.

But the above lines in my code are built to work with sparse matrices. I'm thinking of implementing the masking by letting the user specify which continuous features to bin, as in the code below:

>>> from scipy.sparse import csr_matrix
>>> import numpy as np
>>>
>>> matrix = csr_matrix([[0, 1, 2, 3],
...                      [0, 0, 4, 0],
...                      [5, 0, 0, 0]], dtype=float)
>>> continuous = [0, 3]
>>> epsilon = 0.5
>>>
>>> submatrix = matrix[:, continuous]  # Get only the continuous features
>>> submatrix[submatrix != 0] += epsilon  # The shift step
>>>
>>> matrix[:, continuous] = submatrix

which looks reasonable. Except scipy 0.11.0 raises a NotImplementedError on this line:

submatrix[submatrix != 0] += epsilon  # The shift step

Do you have work arounds for "masking out" categorical features? (And allowing manipulation on only continuous features.)

Contributor Author

@hlin117 hlin117 Oct 15, 2016

Actually, this would work:

>>> from scipy.sparse import csr_matrix, issparse
>>> import numpy as np
>>>
>>> matrix = csr_matrix([[0, 1, 2, 3],
...                      [0, 0, 4, 0],
...                      [5, 0, 0, 0]], dtype=float)
>>> continuous = [0, 3]
>>> epsilon = 0.5
>>>
>>> submatrix = matrix[:, continuous]  # Get only the continuous features
>>> if issparse(submatrix):
...     submatrix.data += epsilon
... else:
...     submatrix[submatrix != 0] += epsilon
>>>
>>> matrix[:, continuous] = submatrix

But I run into this error:

>>> matrix[:, continuous] = submatrix
NotImplementedError: Fancy indexing in assignment not supported for csr matrices.

Seeking suggestions, @amueller.
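
One possible workaround (a sketch of my own, not from this PR): convert to CSC, where each column's nonzeros sit contiguously in .data, and shift them in place instead of assigning through fancy indexing:

from scipy.sparse import csr_matrix

matrix = csr_matrix([[0, 1, 2, 3],
                     [0, 0, 4, 0],
                     [5, 0, 0, 0]], dtype=float)
continuous = [0, 3]
epsilon = 0.5

csc = matrix.tocsc()
for j in continuous:
    start, end = csc.indptr[j], csc.indptr[j + 1]
    csc.data[start:end] += epsilon  # shift only the stored nonzeros
matrix = csc.tocsr()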

X = X.copy()

# Clip outliers from data (TODO)
#X[X > self.max_] = self.max_
Contributor Author

@hlin117 hlin117 Oct 15, 2016

I need to do something that helps me achieve the following:

>>> import numpy as np
>>>
>>> array = np.array([[ 1, 2, 3],
...                   [ 0, 5, 2],
...                   [-1, 1, 2],
...                   [-2, 6, 4]])
>>> max_ = [0, 3, 2]
>>> array[array > max_] = max_  # NOTE: Syntax actually doesn't work
>>> array
array([[ 0, 2, 2],
       [ 0, 3, 2],
       [-1, 1, 2],
       [-2, 3, 2]])

But this is invalid syntax. Is there any nice numpy syntax to clip the maxes across each column? (The method needs to work on scipy sparse matrices.)

Contributor Author

Well I'll be damned. np.clip exists.

This doesn't solve how we would adapt this solution to sparse matrices though.
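
For reference, a sketch covering both cases (my own code, not from the PR): np.clip broadcasts per-column maxima for dense arrays, and for sparse input each column's stored nonzeros can be clipped through the CSC structure, assuming the maxima are non-negative so the implicit zeros are unaffected:

import numpy as np
from scipy.sparse import csc_matrix

X = np.array([[ 1., 2., 3.],
              [ 0., 5., 2.],
              [-1., 1., 2.],
              [-2., 6., 4.]])
max_ = np.array([0., 3., 2.])

# Dense: broadcasting clips each column at its own maximum.
X_dense = np.clip(X, None, max_)

# Sparse (assumes max_ >= 0, so implicit zeros never need clipping):
Xs = csc_matrix(X)
for j in range(Xs.shape[1]):
    start, end = Xs.indptr[j], Xs.indptr[j + 1]
    Xs.data[start:end] = np.minimum(Xs.data[start:end], max_[j])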

@hlin117
Contributor Author

hlin117 commented Oct 16, 2016

@amueller

Also, shouldn't we do uniform bins and percentile bins in the same estimator? It's such a minor change.

Some design questions about percentiles: what do we want the API to look like if we do percentile/quantile binning? Do we want the user to pass in an array of values between 0 and 100 expressing where the percentile "cutpoints" are?

@amueller
Member

shouldn't transform just be a call to np.digitize?

@amueller
Member

amueller commented Oct 17, 2016

I didn't look at the implementation, but I'm wondering whether this relatively complex implementation is actually better than the naive one I would have expected: iterating over columns and calling np.digitize.
fit would then be

mins = np.min(X, axis=0)
maxs = np.max(X, axis=0)
self.bins = [np.linspace(lo, hi, n_bins + 1) for lo, hi in zip(mins, maxs)]

and transform would be

out = [np.digitize(col, bins) for col, bins in zip(X.T, self.bins)]
return OneHotEncoder().fit_transform(np.column_stack(out))

that seems much more readable to me.

@amueller
Member

amueller commented Oct 17, 2016

The API for percentiles would also just take a number of bins; instead of a linspace we would use np.percentile(X[:, i], np.linspace(0, 100, n_bins + 1)).
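
As a quick illustration of that suggestion (my own sketch, with a synthetic skewed feature):

import numpy as np

rng = np.random.RandomState(0)
x = rng.lognormal(size=1000)    # skewed synthetic feature
n_bins = 5

# Quantile-based edges give roughly equal-count bins.
edges = np.percentile(x, np.linspace(0, 100, n_bins + 1))
binned = np.digitize(x, edges[1:-1])  # bin indices 0 .. n_bins - 1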

@amueller
Member

I have to admit I didn't think about the sparse case yet.

@hlin117
Contributor Author

hlin117 commented Oct 17, 2016

One of the reasons we were trying to go for a more complex implementation is so we could handle sparse matrices. We wanted any zero in the sparse matrix to map to zero in the discretized matrix.

In the case of using np.digitize, we'll have trouble mapping zeros to zeros for negative values. I would otherwise go for a simple implementation.
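
To illustrate the issue (my own toy example, not code from the PR): with a negative minimum, the value 0 does not land in bin 0, so a sparse matrix's implicit zeros could not simply be left untouched:

import numpy as np

x = np.array([-3.0, 0.0, 2.0])
edges = np.linspace(-3, 3, 4)   # three fixed-width bins on [-3, 3]
np.digitize(x, edges[1:-1])     # array([0, 1, 2]): 0.0 falls in bin 1, not bin 0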

@hlin117
Contributor Author

hlin117 commented Oct 17, 2016

If you're considering sparse matrices, @amueller, can you give me a second opinion on my code?

@@ -35,6 +36,7 @@
'Binarizer',
'FunctionTransformer',
'Imputer',
'KBinsDiscretizer'
Contributor Author

Looks like I need to add a comma here.

@jnothman
Member

I understand that this "more complex" method allows for a linear time solution.

Is it not straightforward to renumber the bins such that 0 is always in the bin containing 0 in input?

@hlin117
Contributor Author

hlin117 commented Oct 18, 2016

Is it not straightforward to renumber the bins such that 0 is always in the bin containing 0 in input?

You can do that. But ultimately that means the algorithm would transform a sparse array into a dense array (before another transform converts it back to a sparse array). This is what we were trying to avoid in #5825. (Which is what led to the long discussion in that PR.)

My solution would preserve the sparsity of the input matrix, which is ideal. It's complex, but I don't think it's unreadable.

@hlin117
Contributor Author

hlin117 commented Oct 18, 2016

On another note, I don't think np.digitize is what we're looking for here: prior to numpy 1.10.0, np.digitize used a linear search, and it wouldn't work for sparse matrices anyway.


# Corner case arises when maximum values across each column of X
# get shifted to the next integer.
for col in discretized.T:
Contributor Author

np.clip here for dense matrices. Still need to find a solution for sparse matrices though.

@amueller
Member

Looks like my standpoint is the opposite of @mblondel's ^^ I'd rather have quantiles, no sparse support, and a simple implementation.
Actually, I have to look at this and the previous implementation. I would have expected this transformation to do a one-hot encoding. What's the point of discretizing without encoding the result as a categorical variable? (Another thing I appear to disagree with @mblondel about: #5825 (comment))

If we don't do the one-hot-encoding here, this estimator must always be followed by a OneHotEncoder - which I find rather strange. Why do we burden the user with that?

And if we one-hot-encode the data, what does that mean for sparse data?

@mblondel can you maybe give an example of your use case? I think I don't understand it.

I thought the standard use case for this would be some continuous features that you want to discretize and then learn a linear model on. That would work well with fixed width and quantiles, and (at least with newer numpy) could be done in a couple of lines.

What is the sparse matrix use-case, what data representation do you want to get out and what models do you want to use?

@amueller
Member

I re-read #5778 and that is actually exactly what I would have thought of. I'm not sure how that relates to sparse input. If we one-hot-encode sparse input, we will have one non-zero entry in the one-hot-encoded version for every entry in the input, also the zero ones. We could add zero as the "reference class" and do a one-hot-encoding where zero is represented by none of the entries set to one. Was that the plan? Currently OneHotEncoder doesn't support that. Also, our OneHotEncoder doesn't support any sparse input, right?
So even if we have sparse matrix support here, there is nothing the user can do with the output without making it dense first.

@jnothman
Member

If you design your bin numbering such that 0 always ends up transformed to 0, I don't see how you need to change sparsity: you only need to discretize the nonzeros. But I suspect that should be special functionality to support sparse matrices, and we should worry about the dense case first.
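
A sketch of that renumbering (my own illustration, using a single shared set of edges for brevity; in practice the edges would be per column): shift the bin indices so the bin containing 0 maps to 0, and then only the stored nonzeros need transforming.

import numpy as np
from scipy.sparse import csr_matrix

X = csr_matrix([[0., -2., 3.],
                [0.,  0., 4.],
                [5.,  0., 0.]])
edges = np.linspace(-5, 5, 6)            # shared fixed-width edges
inner = edges[1:-1]
zero_bin = np.digitize([0.0], inner)[0]  # index of the bin containing 0

out = X.copy()
out.data = (np.digitize(out.data, inner) - zero_bin).astype(float)
# implicit zeros stay zero; nonzeros become signed bin offsets from 0's bin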


@jnothman
Member

That's right, I'm pretty sure OHE doesn't currently support sparse input.


@mblondel
Member

If we don't do the one-hot-encoding here, this estimator must always be followed by a OneHotEncoder - which I find rather strange. Why do we burden the user with that?

I do agree with you that returning the one-hot encoding is the main use case, as I originally intended in #5778. Returning the ordinal value could be potentially useful for "denoising" the features but I have no evidence of that.

@mblondel can you maybe give an example of your use case? I think I don't understand it.

My original use case was to feed the one-hot encoding to models that use feature combinations such as SVMs with a polynomial kernel or factorization machines. However, this was months ago and I don't remember it well.

If we one-hot-encode sparse input, we will have one non-zero entry in the one-hot-encoded version for every entry in the input, also the zero ones.

I would map zeros to a zero vector. Even if you decide to focus on the dense case for now, it would be nice to keep in mind the sparse case when designing the discretization strategy.

@jnothman
Member

I think it's worth returning ordinals at least initially. Certainly, other binary encodings, such as [[2], [3]] -> [[1, 1, 0], [1, 1, 1]], are very useful in practice. It's easy to make a pipeline, but hard if we only offer the output one-hot encoded.

I would map zeros to a zero vector. Even if you decide to focus on the dense case for now, it would be nice to keep in mind the sparse case when designing the discretization strategy.

I think this could be counter-intuitive when our one-hot encoder and others' are inclined to assume bins range from 0 to n_bins - 1. So I'd rather this be switched on by a keep_zeros_zero=False flag.
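
For concreteness, a sketch of that encoding (the helper name unary_encode is mine):

import numpy as np

def unary_encode(ordinals, n_bins):
    # "Thermometer" encoding: bin k becomes k leading ones.
    ordinals = np.asarray(ordinals).ravel()
    return (np.arange(n_bins) < ordinals[:, None]).astype(int)

unary_encode([2, 3], n_bins=3)
# array([[1, 1, 0],
#        [1, 1, 1]])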


@amueller
Member

We could make it optional and then we can discuss the default?

@hlin117
Contributor Author

hlin117 commented Oct 20, 2016

I'll start on the dense case soon. But as has been alluded to, it'll be important to be mindful of the sparse case later.

@hlin117
Contributor Author

hlin117 commented Oct 23, 2016

PR is ready for review. I added unit tests to this PR.

I did not establish the behavior for 1D arrays, though; I'm leaving that open for discussion.

@jnothman
Member

Change title to MRG?

@hlin117 hlin117 changed the title [WIP] Add fixed width discretization to scikit-learn [MRG] Add fixed width discretization to scikit-learn Oct 27, 2016
@jnothman
Member

jnothman commented Jul 10, 2017

Well, I said I would mark it as MRG+2 on the basis of your review, but then forgot to. I also said that I would leave it to @amueller or @mblondel to merge.

@jnothman
Member

Of course, someone else could merge too. I've been too involved with it to be careful and impartial, though I'm sure you have been @gle.

from sklearn.preprocessing.data import _transform_selected
from sklearn.utils.validation import check_array
from sklearn.utils.validation import check_is_fitted
from sklearn.utils.validation import column_or_1d
Member

these should be relative imports

@agramfort
Member

Also, could this be illustrated in an example showing a gain in training time without a loss of performance, or a gain in performance due to a form of feature normalization or the removal of extreme/noisy samples?

@jnothman
Member

jnothman commented Jul 11, 2017 via email

@ogrisel
Member

ogrisel commented Jul 11, 2017

I don't see how this kind of transform can be useful. To me binning is not what is implemented in this PR. Instead each bin should be expanded to a new dimension (using a sparse matrix as an output to avoid materializing zeros in memory).

For instance, if you have a dataset with 5 continuous features as input and want to do binning with n_bins=5, I would expect 15 dimensions as output (using histograms as implemented in this PR) with only 3 non-zero values per output record.

This way, a linear model combined with such a binning preprocessor could model univariate non-linear dependencies between the output variable and each input variable. (It can be even more expressive by chaining binning with polynomial feature extraction).
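
For illustration, a sketch of that kind of output (my own code; the bin edges are inferred per column, and OneHotEncoder infers categories from the data, so empty bins would be dropped):

import numpy as np
from sklearn.preprocessing import OneHotEncoder

rng = np.random.RandomState(0)
X = rng.uniform(size=(10, 3))   # 3 continuous input features
n_bins = 5

edges = [np.linspace(col.min(), col.max(), n_bins + 1) for col in X.T]
ordinal = np.column_stack(
    [np.digitize(col, e[1:-1]) for col, e in zip(X.T, edges)])

# One-hot expansion: up to 3 * 5 = 15 output columns, 3 nonzeros per row,
# returned as a scipy sparse matrix by default.
X_binned = OneHotEncoder().fit_transform(ordinal)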

I think @pprett has significant practical experience with using binning as a preprocessing strategy. I would love to get his input on this PR.

@glemaitre
Member

glemaitre commented Jul 11, 2017 via email

@vene
Member

vene commented Jul 11, 2017

For instance if you have a dataset with 5 continuous features as input and want to do binning with n_bins=5 I would expect 15 dimensions as output (using histograms as implemented in this PR) with only 3 non-zero values per output record.

Do you mean 3 continuous features or am I missing something?

I agree this is the common desirable output; I was thinking, just like @glemaitre, that it can be obtained by piping into, e.g., OneHotEncoder.

While I can see the arguments for directly outputting the one-hot encoded format, here are a few arguments (albeit kinda weak) in favor of separating the concerns:

  • input for tree-based estimators
  • transforming y from regression to multi-class classification
  • tool for people writing custom algorithms that deal with categories directly.

@ogrisel
Member

ogrisel commented Jul 11, 2017

Do you mean 3 continuous features or am I missing something?

Yes 3 input features, sorry :)

@ogrisel
Member

ogrisel commented Jul 11, 2017

Ideally, I would like to get the option to do linear interpolation inside the bins and output sparse yet continuous values between 0 and 1 so as to be able to perform inverse_transform and approximately reconstruct the original data without losing too much signal (no strong discretization).

E.g., assume an input continuous feature that is uniformly distributed between 1 and 6, with n_bins=5.

If the input is [2.7], then I would assume to get it transformed to [0, 0.7, 0, 0, 0].

This kind of transformer could not be implemented by pipelining the transformer in this PR with an integer categorical encoder.
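
A sketch of how that interpolation could look for the example above (my own code; uniform bins on [1, 6] with n_bins=5):

import numpy as np

x = np.array([2.7])
edges = np.linspace(1, 6, 6)     # 5 unit-width bins on [1, 6]
k = np.digitize(x, edges[1:-1])  # bin index of each value, here [1]

out = np.zeros((len(x), 5))
# Fractional position of x inside its bin: (2.7 - 2) / 1 = 0.7
out[np.arange(len(x)), k] = (x - edges[k]) / (edges[k + 1] - edges[k])
# out -> [[0. , 0.7, 0. , 0. , 0. ]]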

@vene
Member

vene commented Jul 11, 2017

Good point.

Another use case I would like is incremental binning, as mentioned by @jnothman here:

for instance [2.7] -> [1, 1, 0, 0, 0], or with @ogrisel's suggestion for an approximate inverse transform, [1, 1, 0.7, 0, 0] (which can always be floor'd)
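
One way to express that example (a sketch of mine, assuming 5 unit-width bins on [0, 5] so the numbers match):

import numpy as np

x = 2.7
edges = np.arange(6.0)   # 5 unit-width bins on [0, 5]
# Fraction of each bin that x has "filled" so far.
filled = np.clip((x - edges[:-1]) / np.diff(edges), 0.0, 1.0)
# filled -> [1. , 1. , 0.7, 0. , 0. ]; np.floor(filled) recovers [1, 1, 0, 0, 0]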

@ogrisel
Member

ogrisel commented Jul 11, 2017

Indeed, incremental binning is another interesting transform.

@jnothman
Member

jnothman commented Jul 11, 2017 via email

@jnothman
Member

jnothman commented Jul 11, 2017 via email

@jnothman
Member

jnothman commented Jul 11, 2017 via email

@ogrisel
Member

ogrisel commented Jul 12, 2017

input for tree-based estimators

I don't see any case where this preprocessing would perform better than QuantileTransformer for decision-tree-based models (or even just using the raw features).

transforming y from regression to multi-class classification

This is probably the only valid use case I see for this transformer in its current state.

tool for people writing custom algorithms that deal with categories directly.

To me it smells of YAGNI as long as I don't see another concrete use case.

In my opinion, binning is mostly useful to help linear models work on non-linearly separable tasks (possibly as a complement to polynomial feature expansions). I would not be in favor of merging this PR if it cannot tackle that use case by default. The default behavior of our models should nudge our users into using them in a way that reflects commonly acknowledged best practices.

@ogrisel
Member

ogrisel commented Jul 12, 2017

is incremental binning what it's called these days?

I don't know. Any Kaggle master of feature engineering around here?

I did a bit of Kaggle forum scouting and another motivation of binning is to remove noise when you assume that small feature value variations are not important to the output. Still I would like to see a dataset that demonstrates the benefits of this univariate noise removal alone.

While googling I found another interesting binning strategy: Head Tail Breaks binning for features whose ranked values follow a long tail distribution:

https://youtu.be/HS7mObQttxU?t=22m13s

At the end of the presentation, the presenter states that a linear regression model can benefit from this because the resulting one-hot encoding makes the transformation non-linear, although the normalization effect of the transform could also be responsible; it would be interesting to benchmark against linear regression on features preprocessed by QuantileTransformer.

The Head Tail Breaks binning method is still interesting because the number of bins is determined adaptively depending on the distribution of the data.

@ogrisel
Member

ogrisel commented Jul 12, 2017

Also, I question the choice of fixed-width bins as the default strategy. I think quantile-based binning or even 1D k-means-based binning would be more robust to features with non-uniform distributions. We could keep fixed-width binning as an option, but I would be in favor of using quantile-based binning by default.

@jnothman
Member

Here is my solution. The code has been reviewed and found good in terms of usability, reliability, maintainability, etc. It falls short on demonstrated usefulness (although the output is flexible).

I have squashed this work into https://github.com/scikit-learn/scikit-learn/tree/discrete and propose that we:

  • seek an enhancement to that branch (from @hlin117 or someone else) which implements a default one-hot encoding, with the option to switch to ordinal or the other encodings proposed here in the future
  • merge to master only when we're happy with the features available in that branch
  • perhaps seek other enhancements to that branch
  • Now that quantile transformer is available, having an inbuilt quantile transformation as an option would be good.

My motivation for this is simply that this PR has grown long and unwieldy, is otherwise approved, and the enhancements need not be forced onto @hlin117.

@jnothman
Member

So are we okay for me to close this and create new issues calling for PRs to the discrete branch?

@ogrisel
Member

ogrisel commented Jul 12, 2017

So are we okay for me to close this and create new issues calling for PRs to the discrete branch?

+1

@jnothman
Member

jnothman commented Jul 12, 2017 via email

@jnothman
Member

Congratulations and thanks to @hlin117 for his work!

@jnothman jnothman closed this Jul 12, 2017
@jnothman
Member

(And maybe it wouldn't be terrible to include quantile or kmeans discretization in the same class. It would be nice to know how useful or otherwise fixed width discretization is in practice relative to those.)
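
For reference, a sketch of 1D k-means binning (my own code; bin edges taken as midpoints between adjacent sorted cluster centers):

import numpy as np
from sklearn.cluster import KMeans

x = np.random.RandomState(0).lognormal(size=200)
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(x.reshape(-1, 1))
centers = np.sort(km.cluster_centers_.ravel())
edges = (centers[:-1] + centers[1:]) / 2  # midpoints between adjacent centers
binned = np.digitize(x, edges)            # bin indices 0 .. 4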

@jnothman
Member

@hlin117 If you'd like to take on one or more of the new issues (#9336, #9337, #9338, #9339), you should get priority.

@hlin117
Contributor Author

hlin117 commented Jul 13, 2017

Thanks @jnothman and @glemaitre for helping to review my code!

@GaelVaroquaux
Member

A little insight that I had today thinking about the KBinsDiscretizer: it is related to a hard version of kernel expansion with Gaussian kernels. Is this explained anywhere? Maybe it would be interesting to add this to an example?
