[WIP] Adding Fixed Width Discretization #5825
Conversation
X must be 2D and discretization should be done for each column / feature
```python
class FixedWidthDiscretizer(BaseEstimator, TransformerMixin):
```
I would just use `Discretizer` and later have a `strategy` option in the constructor.
As suggested by @jnothman, this can be handled easily by a OneHotEncoder in a pipeline. So for separation-of-concern reasons, it may be better not to support one-hot encoding in Discretizer, unless we find that doing both in one step is much more efficient.
```python
self.min_ = y.min()
self.max_ = y.max()
self.spacing_ = (self.max_ - self.min_) / self.n_bins
```
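For illustration, the uniform-spacing fit could be sketched as below; `uniform_cut_points` is a hypothetical helper (not the PR's code), and it also shows why `spacing_` is redundant once the cut points are stored:

```python
import numpy as np

def uniform_cut_points(y, n_bins):
    # Hypothetical helper: interior cut points of a fixed-width binning.
    lo, hi = y.min(), y.max()
    spacing = (hi - lo) / n_bins
    # spacing can always be recovered later as cut_points[1] - cut_points[0],
    # so it need not be stored as a separate attribute.
    return lo + spacing * np.arange(1, n_bins)

cuts = uniform_cut_points(np.array([0.0, 2.0, 4.0, 6.0, 8.0]), 4)
print(cuts)  # [2. 4. 6.]
```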
It seems like it's not necessary to save `spacing_` as an attribute: it's not used in `transform` and can be inferred from `cut_points_`.
Alright @mblondel, the code is ready to be reviewed. Couple of comments:
I suggest that in this PR you focus on the uniform strategy only. I prefer to have only one strategy but good support for sparse data. In another PR we can add a quantile-based approach: basically, you sort the feature values across all samples and choose the cutting points within the sorted values. This will be harder to implement for sparse data and thus needs to be addressed in a separate PR. Thanks for the good work, you're getting there.
Hi @mblondel, thank you for supporting this PR =] My latest two commits removed the … Please suggest how I can improve my code. Thanks!
```python
def _check_sparse(self, X, ravel=True):
    if ravel:
        return X.toarray().ravel() if sp.issparse(X) else X
    return X.toarray() if sp.issparse(X) else X
```
Don't densify the sparse matrix. Otherwise there's no point using sparse structures in the first place.
Thanks for checking this, @mblondel. There are only a few places where I call this function, and that's when the passed-in array `X` is actually a vector and I need it to be dense (for example, when setting `self.min_`). The only situation where I densify a matrix is on line 176, and I don't know how to avoid that call: `discretized` on line 175 is a dense matrix, and you can't use `np.hstack` on a mix of dense and sparse matrices.
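As a side note on the `np.hstack` limitation mentioned above, `scipy.sparse.hstack` can stack mixed blocks without densifying the result; a small sketch (the array values are made up):

```python
import numpy as np
import scipy.sparse as sp

dense = np.array([[1.0, 0.0],
                  [0.0, 2.0]])
sparse = sp.csr_matrix([[0.0, 3.0],
                        [4.0, 0.0]])

# np.hstack cannot mix dense and sparse blocks, but scipy.sparse.hstack
# can; converting the dense block first keeps the result sparse throughout.
stacked = sp.hstack([sp.csr_matrix(dense), sparse]).tocsr()
print(stacked.toarray())
```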
This function might get borked at the end of the day: scipy 0.9.0 does not have `csr_matrix.min()`, which is why the current build is failing.
How about CSC format? Does it have a min() method? It's ok to support CSC only for now.
If even CSC doesn't have a min() method in 0.9, you can add a small utility function here:
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/utils/sparsefuncs_fast.pyx
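Such a utility could look something like this pure-NumPy sketch (the eventual Cython version in `sparsefuncs_fast.pyx` would be faster; the helper name and exact behavior here are assumptions):

```python
import numpy as np
import scipy.sparse as sp

def csc_col_min(X):
    """Column-wise minimum of a CSC matrix, counting implicit zeros.
    Hypothetical pure-Python stand-in for the suggested Cython utility."""
    n_samples, n_features = X.shape
    mins = np.zeros(n_features)
    for j in range(n_features):
        col = X.data[X.indptr[j]:X.indptr[j + 1]]
        m = col.min() if col.size else 0.0
        # A column with fewer stored entries than rows contains an
        # implicit zero, which may be the true minimum.
        if col.size < n_samples:
            m = min(m, 0.0)
        mins[j] = m
    return mins

X = sp.csc_matrix(np.array([[1.0, -2.0], [3.0, 0.0]]))
print(csc_col_min(X))  # [ 1. -2.]
```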
BTW, it's ok to have two distinct implementations (one for dense and one for CSC) if it makes the implementation clearer.
Neither `csr_matrix` nor `csc_matrix` has `min` or `max` implemented in scipy 0.9.0. I might take your suggestion of adding a utility in `sparsefuncs_fast.pyx`. Thank you!
Force-pushed from 97385f8 to 5328821.
```cython
col_indices = X.indices
np.ndarray[DOUBLE, ndim=1, mode="c"] minimum = np.zeros(n_features)
np.ndarray[DOUBLE, ndim=1, mode="c"] data = X.data
#DOUBLE[::1] data = X.data
```
When I used `DOUBLE[::1] data = X.data` instead of `np.ndarray[DOUBLE, ndim=1, mode="c"]`, I got segmentation faults while running `nosetests test_sparsefuncs.py`. Can someone help explain why?
Hi @mblondel, I wrote the code to find the …
```python
# TODO: I don't know how to avoid this call to `astype()`;
# my test cases don't pass with integers otherwise.
X = X.astype(float)
```
Is there any way to avoid calling this? I run into the following error if I don't convert my data to floats:

`ValueError: Buffer dtype mismatch, expected 'DOUBLE' but got 'long long'`

I usually trigger this error on line 57 of `sparsefuncs_fast.pyx`.
This type of conversion is necessary because the Cython code doesn't do it for us. I would use `np.float64` instead of `float` to be more explicit; that is the type that corresponds to `double`.
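A minimal illustration of the dtype issue (generic NumPy, not the PR's code):

```python
import numpy as np

X = np.array([[1, 2], [3, 4]])   # integer dtype ('long long' on many platforms)
# A Cython buffer typed as DOUBLE (C double) will reject this integer
# array, producing the buffer-dtype-mismatch error quoted above.
# Converting explicitly with np.float64, as suggested, avoids it:
X = np.asarray(X, dtype=np.float64)
print(X.dtype)  # float64
```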
The use of searchsorted in transform is problematic, since it seems difficult to avoid a loop. An algorithm possibly worth investigating:
For step 1 I thought we could use MinMaxScaler, but it doesn't seem to work on sparse data. In any case, we need to design a discretization scheme that preserves sparsity (zeros are mapped to zeros). This would mean that features smaller than 0 are mapped to negative numbers.
This should automatically be the case for non-negative sparse data. For sparse data including negative values, this might need special care.
@mblondel: How about this idea? First, let's assume that when a user creates the discretizer, the user will not add more intervals to the class. (The interface doesn't allow this anyway.) Also, let's suppose that we have intervals …

Let's do a modified binary search on our set of intervals. Upon creation …

When discretizing …
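For dense columns, the binary-search idea above can lean on NumPy directly: `np.searchsorted` performs a vectorized binary search against the sorted cut points, avoiding a Python-level loop over values. A sketch (the function name and the cut points are illustrative):

```python
import numpy as np

def discretize_column(values, cut_points):
    """Map each value to a bin index via a vectorized binary search
    over the sorted interior cut points (a sketch, not the PR's code)."""
    return np.searchsorted(cut_points, values, side="right")

cuts = np.array([2.0, 4.0, 6.0])    # 3 interior cut points -> 4 bins
vals = np.array([1.0, 2.0, 5.0, 7.0])
print(discretize_column(vals, cuts))  # [0 1 2 3]
```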
@mblondel: I rewrote a lot of the code to accommodate the binary search implemented in Cython. Can you take a look and tell me what you think?
I'm not sure why this should be like this. Do you mean the last index overall, among all the features? If the output is fed into (1) a one-hot encoder or (2) a tree-based model, it would make no difference, right? If we could do away with this feature, this transformer could just assume that no features are numerical. This would simplify the code and interface. Since this is a transformer and not a classifier, users can easily work around the issue with … What do you think? (ping @jnothman, @mblondel) (P.S. @mblondel, if this works well with polynomial features, does it mean that fuzzy logic is making a comeback? 😛)
```python
max_ : float
    The maximum value of the input data.

cut_points_ : array, shape [numBins - 1, n_continuous_features]
```
`n_bins` …_ and `searched_points_`
Regarding rounding, I assume @mblondel's idea went something like: (for dense data)
which seems like it would give you k + 1 bins. I'm not sure if this is right, nor how to make it work for sparse data, but it does make sense to me. Regarding the discrete columns: I don't understand why you would need to change anything in them. And if we don't change the discrete columns, we can remove all the code that deals with them, simplifying this greatly (including the 🐱 method :) ). If the user wants a one-hot encoding or get_dummies on the other columns, they could do it themselves using a FeatureUnion or something. I am leaning towards encouraging this, but I'd love to hear other opinions. (Hi @jnothman :) )
Ah, that's clever. However, I think there are cases when this wouldn't work. For example, if we have … I think scaling into … If we scale into …
I just shifted the discrete (categorical) columns to the end of the matrix because, implementation-wise, this simplified my code. I'd imagine that this is the case with OneHotEncoder too (just simplifying implementation). But on the other hand, the OneHotEncoder and …
It looks like Travis's python3 build is failing because it can't find the … I reckon that my configuration in setup.py is incorrect, but I don't understand enough Cython to know. If someone can diagnose why the build is failing, that would be great.
Alright, I think I understand the scaling scheme @mblondel was mentioning earlier. I imagine the scaling to go like this. Given some feature …
Like @vene said, we need to encourage code maintainability, and I think this 6-step process would be faster and require much less code than the current implementation. It also works for sparse matrices, which is essentially our blocker here. Please let me know if there is any interest in continuing this PR. (Edit 10/14/16: Added clarifications about ceilings and floors.)
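A rough sketch of such a sparsity-preserving scale-and-round scheme (my own reading of the idea, not the PR's code; the helper name and the exact bin ranges are assumptions):

```python
import numpy as np
import scipy.sparse as sp

def discretize_sparse_uniform(X, n_bins):
    """Hypothetical sketch: only the stored entries X.data are touched,
    so implicit zeros stay zero and sparsity is preserved. Positive
    values map to bins 1..n_bins, negatives to -n_bins..-1."""
    X = sp.csr_matrix(X, copy=True)
    d = X.data.astype(np.float64)
    pos_max = max(d.max(), 0.0) or 1.0   # guard against division by zero
    neg_min = min(d.min(), 0.0) or -1.0
    out = np.empty_like(d)
    pos = d > 0
    out[pos] = np.ceil(d[pos] / pos_max * n_bins)      # in 1..n_bins
    out[~pos] = -np.ceil(d[~pos] / neg_min * n_bins)   # in -n_bins..-1
    X.data = out
    return X
```

Using a ceiling (rather than rounding) on the scaled values keeps the smallest non-zero entries from collapsing into bin 0, which is the sparsity concern raised above.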
@hlin117 I'm guessing you need the epsilon to make sure that the smallest non-zero values get discretized to non-zero? I'm guessing you could just add 1 in the first step instead, right? (Basically rescale all non-zeros to lie in (1, 1 + k) instead of (0, k).) (Later edit: I think what I said above works for non-negatives. Negative values should lie in (-k - 1, -1) then?) Indeed this was my understanding of @mblondel's suggestion. We need to check two things now:
Also, if the rounding approach does work well and is fast, we will need to document the code really clearly, because it's pretty counterintuitive.
Any news on this?
I'll be happy to continue this PR, if there's support for it.
I think there is interest. In the case of sparse matrices, you're only going to be able to maintain sparsity if you engineer 0 to remain 0. I'm confused whether that's what you're doing with your … I think you're also missing a …
I think if we were to move forward with this, I would be okay with abandoning my code here, and starting afresh.
I think this data preprocessing should be up to the user. Some users may choose not to clip off these outliers, and it's not really clear how to define what is an outlier and what is not.
Outliers are things out of range of the training data and therefore its …

> On 14 October 2016 at 17:53, Henry Lin notifications@github.com wrote: …
Ah true. I wasn't thinking about the testing set, only the training set. Thank you.
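To make the out-of-range point concrete, a fitted transformer could clip transform-time values into the training range before binning; `min_`, `max_`, and the cut points below are hypothetical fitted attributes, not the PR's actual ones:

```python
import numpy as np

# Hypothetical attributes learned from the training data
min_, max_ = 0.0, 8.0
cut_points = np.array([2.0, 4.0, 6.0])

# Test-time data can fall outside the training range; clipping before
# binning guarantees every sample lands in a valid bin.
X_new = np.array([-3.0, 5.0, 11.0])
X_clipped = np.clip(X_new, min_, max_)
print(X_clipped)  # [0. 5. 8.]
```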
Please look at #7668. It contains a much more straightforward algorithm (and much, much less code). Thanks!
Update
Please look at #7668 for the current PR.
Summary
This PR addresses #5778. It adds the Discretizer class to scikit-learn. This PR adds the following:

- … `sparsefuncs_fast.pyx` … class
- `sparsefuncs.py` … for discretization
- Documentation

Doctest example:
Original post:
The discretizer should be used as follows. (Just copying from the docstring test.)
A couple of comments and questions:
- … `LabelEncoder`, which also requires 1D arrays. Does it sound fine that we limit the inputs to only 1D arrays?
- `FixedWidthDiscretizer` is a mouthful. Do we have opinions about what else we could name the class? (I'm thinking `KBinsDiscretizer`.)