
[WIP] Adding Fixed Width Discretization #5825


Closed
hlin117 wants to merge 28 commits

Conversation

hlin117
Contributor

@hlin117 hlin117 commented Nov 16, 2015

Update

Please look at #7668 for the current PR.

Summary

This PR addresses #5778 by adding a Discretizer class to scikit-learn. It includes the following:

  • Uniform discretization strategy for 1D arrays
  • Uniform discretization strategy for 2D arrays
    • Ability to specify which columns are categorical in a matrix
    • Prevention from passing in invalid categorical column indices
    • Tests for basic discretization capabilities
  • Ability to pass in sparse matrices (WIP; needs to support older numpy and scipy versions)
    • Modify the sparsefuncs_fast.pyx module
    • Use sparsefuncs.py for discretization
    • Tests for sparse matrix discretization.
    • Code review

Documentation

  • Parameter and Attribute documentation, in the Sphinx comment format
  • Docstring tests

Doctest Example:

>>> # X has two examples, with X[:, 2] as categorical
>>> X = [[0, 1, 0, 4],
...      [6, 7, 1, 5]]
>>> from sklearn.preprocessing import Discretizer
>>> discretizer = Discretizer(n_bins=3, categorical_features=[2])
>>> discretizer.fit(X) # doctest: +ELLIPSIS, +NORMALIZE_WHITESPACE
Discretizer(...)
>>> discretizer.cut_points_ # doctest: +ELLIPSIS, +NORMALIZE_WHITESPACE
array([[ 2. , 3. , 4.3333...],
       [ 4. , 5. , 4.6666...]])
>>> # Transforming X will move the categorical features to the last indices
>>> discretizer.transform(X)
array([[0, 0, 0, 0],
       [2, 2, 2, 1]])

Original post:

The discretizer should be used as follows. (Just copying from the docstring test.)

>>> from sklearn.preprocessing import FixedWidthDiscretizer
>>> X = [0, 1, 2, 3, 4, 5, 6]
>>> discretizer = FixedWidthDiscretizer(n_bins=3)
>>> discretizer.fit(X)
FixedWidthDiscretizer(n_bins=3)
>>> discretizer.cut_points_
array([2., 4.])
>>> discretizer.transform(X)
array([0, 0, 0, 1, 1, 2, 2])

A couple of comments and questions:

  • This PR is currently failing the build because the tests assume that you can pass a 2D array into this discretizer. However, I'm limiting the discretizer to only accept 1D arrays. I'm modeling this class off of LabelEncoder, which also requires 1D arrays. Does it sound fine to limit the inputs to 1D arrays only?
  • I'm not using a one hot encoding, as @mblondel requested in his original issue in Discretizer #5778. How do you feel about this, @mblondel?
  • FixedWidthDiscretizer is a mouthful. Do we have opinions about what else we could name the class? (I'm thinking KBinsDiscretizer.)
  • Unit tests need to be added.

@hlin117 hlin117 mentioned this pull request Nov 16, 2015
@agramfort
Member

agramfort commented Nov 16, 2015 via email



class FixedWidthDiscretizer(BaseEstimator, TransformerMixin):
Member

I would just use Discretizer and later have a strategy option in the constructor.

@mblondel
Member

I'm not using a one hot encoding, as @mblondel requested in his original issue in #5778

As suggested by @jnothman, this can be handled easily by a OneHotEncoder in a pipeline. So maybe, for separation-of-concerns reasons, it's better not to support one-hot encoding in Discretizer, unless we find out that it is much more efficient to do both in one step.
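
For illustration, that pipeline could look roughly like this (just a sketch; Discretizer is the class proposed in this PR, make_pipeline and OneHotEncoder already exist in scikit-learn):

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import Discretizer  # the class proposed in this PR

# Discretize first, then one-hot encode the resulting bin indices.
X = np.array([[0, 1, 4], [6, 7, 5]])
pipeline = make_pipeline(Discretizer(n_bins=3), OneHotEncoder())
X_binary = pipeline.fit_transform(X)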


self.min_ = y.min()
self.max_ = y.max()
self.spacing_ = (self.max_ - self.min_) / self.n_bins
Member

It seems like it's not necessary to save spacing as an attribute: it's not used in transform and can be inferred from cut_points.

@hlin117
Contributor Author

hlin117 commented Nov 18, 2015

Alright @mblondel, the code is ready to be reviewed.

Couple of comments:

  • strategy is a lonely parameter right now, because there's only one discretization method. What other discretization methods would you like to see? (One idea is having cut points as inputs, as discussed in Discretizer #5778.)
  • I did my best to avoid for-loops, but alas I have two, on lines 138 and 171. np.linspace and np.searchsorted can only accept vectors as inputs, not matrices. Do you have suggestions for writing more idiomatic numpy code?

@mblondel
Member

I suggest that in this PR you focus on the uniform strategy only. I prefer to have only one strategy but good support for sparse X. So for now you can remove the strategy option, we'll add it back later. Please make sure to add tests for sparse data.

In another PR we can add a quantile based approach. Basically you sort the feature values across all samples and choose the cutting points within the sorted values. This will be harder to implement for sparse data and thus needs to be addressed in a separate PR.
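
A rough sketch of what I mean for a single dense feature column (not code for this PR, just to illustrate the idea):

import numpy as np

def quantile_cut_points(column, n_bins):
    # Pick cut points from the sorted feature values so that each bin
    # receives roughly the same number of samples.
    quantiles = np.linspace(0, 100, n_bins + 1)[1:-1]
    return np.percentile(column, quantiles)

column = np.array([0.0, 1.0, 2.0, 3.0, 100.0, 101.0])
cut_points = quantile_cut_points(column, n_bins=3)  # two interior cut points
bins = np.searchsorted(cut_points, column)          # bin index per sample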

Thanks for the good work, you're getting there.

@hlin117
Contributor Author

hlin117 commented Nov 18, 2015

Hi @mblondel, thank you for supporting this PR =]

My latest two commits removed the strategy parameter, and I added support for csc and csr sparse matrices. I also added a test case for sparse data.

Please help suggest how I can improve my code. Thanks!

def _check_sparse(self, X, ravel=True):
    if ravel:
        return X.toarray().ravel() if sp.issparse(X) else X
    return X.toarray() if sp.issparse(X) else X
Member

Don't densify the sparse matrix. Otherwise there's no point using sparse structures in the first place.

Contributor Author

Thanks for checking this, @mblondel. There are only a few places where I call this function, and that's when the passed-in array X is actually a vector and I need it to be dense (for example, when setting self.min_).

The only place where I densify a matrix is on line 176, and I don't know how to avoid that call. discretized on line 175 is a dense matrix, and you can't use np.hstack on a mix of dense and sparse matrices.

This function might get borked at the end of the day. scipy 0.9.0 does not have csr_matrix.min(), so that's why the current build is failing.

Member

How about CSC format? Does it have a min() method? It's ok to support CSC only for now.

If even CSC doesn't have a min() method in 0.9, you can add a small utility function here:
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/utils/sparsefuncs_fast.pyx
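
For example, a plain NumPy sketch of such a helper for CSC input could look like the following (the Cython version would follow the same logic; note that a column with fewer stored entries than rows implicitly contains zeros):

import numpy as np
import scipy.sparse as sp

def csc_min_axis0(X):
    # Column-wise minimum of a CSC matrix without densifying it.
    X = sp.csc_matrix(X)
    n_rows, n_cols = X.shape
    mins = np.zeros(n_cols)
    for j in range(n_cols):
        start, end = X.indptr[j], X.indptr[j + 1]
        col_min = X.data[start:end].min() if end > start else 0.0
        if end - start < n_rows:  # the column has implicit zeros
            col_min = min(col_min, 0.0)
        mins[j] = col_min
    return mins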

Member

BTW, it's ok to have two distinct implementations (one for dense and one for CSC) if it makes the implementation clearer.

Contributor Author

Neither csr_matrix nor csc_matrix has a min() or max() method in scipy 0.9.0.

I might take your suggestion and add a utility to sparsefuncs_fast.pyx. Thank you!

@hlin117 hlin117 force-pushed the kbins branch 2 times, most recently from 97385f8 to 5328821 on November 21, 2015 at 17:00
col_indices = X.indices
np.ndarray[DOUBLE, ndim=1, mode="c"] minimum = np.zeros(n_features)
np.ndarray[DOUBLE, ndim=1, mode="c"] data = X.data
#DOUBLE[::1] data = X.data
Contributor Author

When I used DOUBLE[::1] data = X.data instead of np.ndarray[DOUBLE, ndim=1, mode="c"], I ended up getting segmentation faults when running nosetests test_sparsefuncs.py. Can someone help explain why?

@hlin117
Contributor Author

hlin117 commented Nov 22, 2015

Hi @mblondel, I wrote the code to find the axis=0 min/max of csr/csc matrices. Can you help review my code?


# TODO: I don't know how to avoid this call to `astype()`;
# my test cases don't pass with integers otherwise.
X = X.astype(float)
Contributor Author

Is there any way to avoid calling this? I actually run into the following error if I don't convert my data into floats:

ValueError: Buffer dtype mismatch, expected 'DOUBLE' but got 'long long'

I usually trigger this error on line 57 of sparsefuncs_fast.pyx.

Member

This type of conversion is necessary because the Cython code doesn't do the conversion for us. I would use np.float64 instead of float to be more explicit; that is the type that corresponds to a C double.

@hlin117 hlin117 changed the title from Adding Fixed Width Discretization to [WIP] Adding Fixed Width Discretization on Nov 23, 2015
@mblondel
Member

The use of searchsorted in transform is problematic, since it seems difficult to avoid a loop.
Maybe we will have to use Cython for transform as well.

An algorithm possibly worth investigating:

  1. first scale features to [0, 1] range
  2. round using np.round
  3. map to integers

For step 1 I thought we could use MinMaxScaler but it doesn't seem to work on sparse data.
Also, not sure if it's going to be possible to avoid a loop for step 3.

In any case, we need to design a discretization scheme that preserves sparsity (zeros are mapped to zeros). This would mean that features smaller than 0 are mapped to negative numbers.

@jnothman and @amueller might want to pitch in :)

@mblondel
Member

In any case, we need to design a discretization scheme that preserves sparsity

This should be automatically the case for non-negative sparse data. For sparse data including negative values, this might need special care.

@hlin117
Contributor Author

hlin117 commented Dec 15, 2015

@mblondel: How about this idea?

First, let's assume that when a user creates the discretizer, the user will not add more intervals to the class. (The interface doesn't allow this anyways.) Also, let's suppose that we have intervals

[-infty, x1], [x1, x2], ..., [xK, x(K+1)] (where xK <= 0 < x(K+1)), ..., [xL, infty]

Let's do a modified binary search on our set of intervals.

Upon creation

  1. Instantiate a sorted array such that each cell holds both a key and value with these three rules:
    • The array is sorted by key
    • xK is excluded from being a key, and each value must be at least 1.
    • np.inf and -np.inf are also keys.

When discretizing x

  1. If x is within [xK, x(K+1)], return 0.
  2. Otherwise, just perform a binary search on this array for some key=xi such that xi <= x < x(i + 1). The returned value is the value whose key is xi.
    • Observe that the sentinel -np.inf and np.inf values help us map values between [-infty, x1] and [xL, infty].

Notice that step 1 takes advantage of the fact that sparse matrices are mostly 0's. Also, this isn't too difficult to implement.
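
To make the lookup concrete, here is a rough sketch with np.searchsorted standing in for the hand-rolled binary search (the keys, values, and interval below are made-up numbers, not code from this branch):

import numpy as np

# Keys are interval lower bounds (with a -np.inf sentinel, and xK excluded);
# values are the bin labels, all >= 1, so that 0 stays reserved for the
# interval that contains zero.
keys = np.array([-np.inf, -3.0, 5.0, 8.0])
values = np.array([1, 2, 3, 4])
zero_interval = (-1.0, 5.0)  # [xK, x(K+1)], the interval containing 0

def discretize_value(x):
    # Step 1: anything in the zero-containing interval maps to 0,
    # so zeros in a sparse matrix stay zeros.
    if zero_interval[0] <= x < zero_interval[1]:
        return 0
    # Step 2: binary search for the largest key <= x; return its bin label.
    i = np.searchsorted(keys, x, side="right") - 1
    return values[i]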

I would probably uproot a lot of my code for this change so that this adaptation works for both sparse and dense matrices; the same pipeline would then handle both types.

I'd like to take a stab at this because I'm still learning Cython, and this would be an easy win for me. I just want package-owner approval before I start. Thanks!

@hlin117
Contributor Author

hlin117 commented Dec 22, 2015

@mblondel: I rewrote a lot of the code to accommodate the binary search implemented in cython. Can you take a look and tell me what you think?

@vene
Member

vene commented Feb 6, 2016

Transforming X will move the categorical features to the last indices

I'm not sure why this should be like this. Do you mean the last index overall, among all the features?

If the output is fed into (1) a one-hot encoder or (2) a tree-based model, it would make no difference, right?

If we could do away with this feature, this transformer could just assume that no features are categorical. This would simplify the code and interface. Since this is a transformer and not a classifier, users can easily work around the issue with FeatureUnions. (The continuous and non-continuous data probably come from different sources at some point, so just discretize sooner rather than later.)

What do you think? (ping @jnothman, @mblondel)

(P.S. @mblondel, if this works well with polynomial features, does it mean that fuzzy logic is making a comeback? 😛)

max_ : float
The maximum value of the input data.

cut_points_ : array, shape [numBins - 1, n_continuous_features]
Member

n_bins

@vene
Member

vene commented Feb 8, 2016

Regarding rounding, I assume @mblondel's idea went something like:

(for dense data)

  • scale X into [0, k]
  • round

which seems like it would give you k + 1 bins. I'm not sure if this is right, nor how to make it work for sparse data, but it does make sense to me.

Regarding the discrete columns: I don't understand why you would need to change anything in the discrete columns. And if we don't change the discrete columns, we can remove all the code that deals with them, simplifying this greatly (including the 🐱 method :) )

If the user wants a one-hot encoding or get_dummies on the other columns, they could do it themselves using a FeatureUnion or something. I am leaning towards encouraging this, but I'd love to hear other opinions. (Hi @jnothman :) )

@hlin117
Contributor Author

hlin117 commented Feb 8, 2016

Regarding rounding, I assume @mblondel's idea went something like:

(for dense data)

  • scale X into [0, k]
  • round

which seems like it would give you k + 1 bins. I'm not sure if this is right, nor how to make it work for sparse data, but it does make sense to me.

Ah, that's clever. However, I think there are cases when this wouldn't work.

For example, if we have [-1, 0, 1] and we want k=3 bins, then [-1, 0, 1] becomes [0, 1.5, 3] (by scaling into [0, k]), and after rounding we would have [0, 2, 3]. Values can then land in any of the k + 1 bins 0 through 3, so we unintentionally add another class in there.

I think scaling into [0, k-1] is the correct solution, but, like you mentioned, it won't work for sparse data.

If we scale into [0, k-1], we would have [0, 1, 2]. The zeros in the sparse matrix won't be mapped to zeros in the transformed matrix.

@hlin117
Contributor Author

hlin117 commented Feb 8, 2016

Regarding the discrete columns: I don't understand why you would need to change anything in the discrete columns. And if we don't change the discrete columns, we can remove all the code that deals with them, simplifying this greatly (including the 🐱 method :) )

I just shifted the discrete (categorical) columns to the end of the matrix because, implementation-wise, it simplified my code. I'd imagine this is the case with OneHotEncoder too (just simplifying the implementation).

On the other hand, OneHotEncoder and pd.get_dummies add more features to the feature matrix. In my case, it shouldn't be terribly difficult to keep things in the same columns.

@hlin117
Contributor Author

hlin117 commented Feb 8, 2016

It looks like Travis's Python 3 build is failing because it can't find the _discretization module. I can replicate this error on my computer, but I don't know how to fix it.

I reckon my configuration in setup.py is incorrect, but I don't understand enough Cython to know. If someone can diagnose why the build is failing, that would be great.

@hlin117
Contributor Author

hlin117 commented Feb 17, 2016

Alright, I think I understand the scaling scheme @mblondel was mentioning earlier.

I imagine the scaling to go like this. Given some feature X and our chosen number of bins, k:

  1. Let X_trans = X * k / (X.max() - X.min()). Observe that X_trans is a dataset of width k.
  2. epsilon = abs(X_trans.min() - math.ceil(X_trans.min())).
  3. X_trans[X_trans != 0] += epsilon. We do this operation because it works with sparse matrices.
  4. X_trans = np.floor(X_trans)
  5. X_trans.eliminate_zeros(), for sparse matrices
  6. X_trans now contains all of the discretized features.
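
A rough dense-only sketch of those steps as I picture them (for sparse input, steps 3-5 would operate on X_trans.data and finish with eliminate_zeros()):

import math
import numpy as np

def fixed_width_discretize(X, n_bins):
    X = np.asarray(X, dtype=np.float64)
    X_trans = X * n_bins / (X.max() - X.min())               # step 1: data of width k
    epsilon = abs(X_trans.min() - math.ceil(X_trans.min()))  # step 2
    X_trans[X_trans != 0] += epsilon                         # step 3: shift non-zeros only
    return np.floor(X_trans)                                 # step 4: bin indices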

Like @vene said, we need to encourage code maintainability, and I think this 6-step process would be faster and require much less code than the current implementation. It also works for sparse matrices, which is essentially our blocker here.

Please let me know if there is any interest in continuing this PR.

(Edit 10/14/16: Added clarifications about ceilings and floors.)

@vene
Member

vene commented Mar 24, 2016

@hlin117 I'm guessing you need the epsilon to make sure that the smallest non-zero values get discretized to non-zero? If so, you could just add 1 in the first step instead, right? (Basically, rescale all non-zeros to lie in (1, 1+k) instead of (0, k).)

(Later edit: I think what I said above works for non-negatives. Negative values should lie in (-k-1, -1) then?)

Indeed this was my understanding of @mblondel's suggestion. We need to check two things now:

  • how different this is from the explicit discretization (ideally both in approximation error and in downstream utility when running some classifier on the result)
  • how it compares in terms of runtime.

Also if the rounding approach does work well and is fast, we will need to document the code really clearly, because it's pretty counterintuitive.

@amueller
Member

Any news on this?

@hlin117
Contributor Author

hlin117 commented Oct 11, 2016

I'll be happy to continue this PR, if there's support for it.

@jnothman
Member

I think there is interest.

In the case of sparse matrices, you're only going to be able to maintain sparsity if you engineer 0 to remain 0. I'm confused whether that's what you're doing with your epsilon.

I think you're also missing a clip stage in your processing, as the transformed data may have outliers relative to the fit ranges.
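
For dense data that stage could be as simple as the sketch below, where min_ and max_ stand for the range attributes stored during fit (as in the diff above):

import numpy as np

X_new_clipped = np.clip(X_new, min_, max_)  # X_new: data passed to transform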

@hlin117
Contributor Author

hlin117 commented Oct 14, 2016

I think if we were to move forward with this, I would be okay with abandoning my code here, and starting afresh.

@hlin117
Contributor Author

hlin117 commented Oct 14, 2016

@jnothman

I think you're also missing a clip stage in your processing, as the transformed data may have outliers relative to the fit ranges.

I think this data preprocessing should be up to the user. Some users may choose not to clip off these outliers, and it's not really clear how to define what is an outlier, and what is not.

@jnothman
Member

Outliers are things out of range of the training data and therefore its bins. These should be handled by the transformer.

@hlin117
Contributor Author

hlin117 commented Oct 14, 2016

Outliers are things out of range of the training data and therefore its bins. These should be handled by the transformer.

Ah true. I wasn't thinking about the testing set, only the training set. Thank you.

@hlin117
Contributor Author

hlin117 commented Oct 14, 2016

Please look at #7668. It contains a much more straightforward algorithm (and much, much less code). Thanks!

@hlin117 hlin117 closed this Oct 14, 2016