[WIP] Adding Fixed Width Discretization #5825
Conversation
X must be 2D and discretization should be done for each column / feature
```python
class FixedWidthDiscretizer(BaseEstimator, TransformerMixin):
```
I would just use `Discretizer` and later have a `strategy` option in the constructor.
As suggested by @jnothman, this can be handled easily by a OneHotEncoder in a pipeline. So for separation-of-concern reasons, it may be better not to support one-hot encoding in Discretizer, unless we find that doing both in one step is much more efficient.
```python
self.min_ = y.min()
self.max_ = y.max()
self.spacing_ = (self.max_ - self.min_) / self.n_bins
```
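For illustration, the uniform-spacing fit could be sketched as below; `uniform_cut_points` is a hypothetical helper (not the PR's code), and it also shows why `spacing_` is redundant once the cut points are stored:

```python
import numpy as np

def uniform_cut_points(y, n_bins):
    # Hypothetical helper: interior cut points of a fixed-width binning.
    lo, hi = y.min(), y.max()
    spacing = (hi - lo) / n_bins
    # spacing can always be recovered later as cut_points[1] - cut_points[0],
    # so it need not be stored as a separate attribute.
    return lo + spacing * np.arange(1, n_bins)

cuts = uniform_cut_points(np.array([0.0, 2.0, 4.0, 6.0, 8.0]), 4)
print(cuts)  # [2. 4. 6.]
```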
It seems like it's not necessary to save `spacing_` as an attribute: it's not used in `transform` and can be inferred from `cut_points_`.
Alright @mblondel, the code is ready to be reviewed. Couple of comments:
I suggest that in this PR you focus on the uniform strategy only. I prefer to have only one strategy but good support for sparse data. In another PR we can add a quantile-based approach: basically, you sort the feature values across all samples and choose the cutting points within the sorted values. This will be harder to implement for sparse data and thus needs to be addressed in a separate PR. Thanks for the good work, you're getting there.
Hi @mblondel, thank you for supporting this PR =] My latest two commits removed the … Please suggest how I can improve my code. Thanks!
```python
def _check_sparse(self, X, ravel=True):
    if ravel:
        return X.toarray().ravel() if sp.issparse(X) else X
    return X.toarray() if sp.issparse(X) else X
```
Don't densify the sparse matrix. Otherwise there's no point using sparse structures in the first place.
Thanks for checking this, @mblondel. There are only a few places where I call this function, and that's when the passed-in array `X` is actually a vector and I need it to be dense (for example, when setting `self.min_`). The only situation where I densify a matrix is on line 176, and I don't know how to avoid that call: `discretized` on line 175 is a dense matrix, and you can't use `np.hstack` on a mix of dense and sparse matrices.
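As a side note on the `np.hstack` limitation mentioned above, `scipy.sparse.hstack` can stack mixed blocks without densifying the result; a small sketch (the array values are made up):

```python
import numpy as np
import scipy.sparse as sp

dense = np.array([[1.0, 0.0],
                  [0.0, 2.0]])
sparse = sp.csr_matrix([[0.0, 3.0],
                        [4.0, 0.0]])

# np.hstack cannot mix dense and sparse blocks, but scipy.sparse.hstack
# can; converting the dense block first keeps the result sparse throughout.
stacked = sp.hstack([sp.csr_matrix(dense), sparse]).tocsr()
print(stacked.toarray())
```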
This function might get borked at the end of the day: scipy 0.9.0 does not have `csr_matrix.min()`, which is why the current build is failing.
How about CSC format? Does it have a min() method? It's ok to support CSC only for now.
If even CSC doesn't have a min() method in 0.9, you can add a small utility function here:
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/utils/sparsefuncs_fast.pyx
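Such a utility could look something like this pure-NumPy sketch (the eventual Cython version in `sparsefuncs_fast.pyx` would be faster; the helper name and exact behavior here are assumptions):

```python
import numpy as np
import scipy.sparse as sp

def csc_col_min(X):
    """Column-wise minimum of a CSC matrix, counting implicit zeros.
    Hypothetical pure-Python stand-in for the suggested Cython utility."""
    n_samples, n_features = X.shape
    mins = np.zeros(n_features)
    for j in range(n_features):
        col = X.data[X.indptr[j]:X.indptr[j + 1]]
        m = col.min() if col.size else 0.0
        # A column with fewer stored entries than rows contains an
        # implicit zero, which may be the true minimum.
        if col.size < n_samples:
            m = min(m, 0.0)
        mins[j] = m
    return mins

X = sp.csc_matrix(np.array([[1.0, -2.0], [3.0, 0.0]]))
print(csc_col_min(X))  # [ 1. -2.]
```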
BTW, it's ok to have two distinct implementations (one for dense and one for CSC) if it makes the implementation clearer.
Neither `csr_matrix` nor `csc_matrix` has `min` or `max` implemented in scipy 0.9.0. I might take your suggestion of adding a utility in `sparsefuncs_fast.pyx`. Thank you!
Force-pushed from 97385f8 to 5328821.
```cython
col_indices = X.indices
np.ndarray[DOUBLE, ndim=1, mode="c"] minimum = np.zeros(n_features)
np.ndarray[DOUBLE, ndim=1, mode="c"] data = X.data
#DOUBLE[::1] data = X.data
```
When I used `DOUBLE[::1] data = X.data` instead of `np.ndarray[DOUBLE, ndim=1, mode="c"]`, I got segmentation faults while running `nosetests test_sparsefuncs.py`. Can someone help explain why?
Hi @mblondel, I wrote the code to find the …
```python
# TODO: I don't know how to avoid this call to `astype()`;
# my test cases don't pass with integers otherwise.
X = X.astype(float)
```
Is there any way to avoid calling this? I run into the following error if I don't convert my data to floats:

`ValueError: Buffer dtype mismatch, expected 'DOUBLE' but got 'long long'`

I usually trigger this error on line 57 of `sparsefuncs_fast.pyx`.
This type of conversion is necessary because the Cython code doesn't do it for us. I would use `np.float64` instead of `float` to be more explicit; that is the type that corresponds to `double`.
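A minimal illustration of the dtype issue (generic NumPy, not the PR's code):

```python
import numpy as np

X = np.array([[1, 2], [3, 4]])   # integer dtype ('long long' on many platforms)
# A Cython buffer typed as DOUBLE (C double) will reject this integer
# array, producing the buffer-dtype-mismatch error quoted above.
# Converting explicitly with np.float64, as suggested, avoids it:
X = np.asarray(X, dtype=np.float64)
print(X.dtype)  # float64
```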
The use of searchsorted in transform is problematic, since it seems difficult to avoid a loop. An algorithm possibly worth investigating:
For step 1 I thought we could use MinMaxScaler, but it doesn't seem to work on sparse data. In any case, we need to design a discretization scheme that preserves sparsity (zeros are mapped to zeros). This would mean that features smaller than 0 are mapped to negative numbers.
This should automatically be the case for non-negative sparse data. For sparse data including negative values, this might need special care.
@mblondel: How about this idea? First, let's assume that when a user creates the discretizer, the user will not add more intervals to the class. (The interface doesn't allow this anyway.) Also, let's suppose that we have intervals …

Let's do a modified binary search on our set of intervals. Upon creation …

When discretizing …
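For dense columns, the binary-search idea above can lean on NumPy directly: `np.searchsorted` performs a vectorized binary search against the sorted cut points, avoiding a Python-level loop over values. A sketch (the function name and the cut points are illustrative):

```python
import numpy as np

def discretize_column(values, cut_points):
    """Map each value to a bin index via a vectorized binary search
    over the sorted interior cut points (a sketch, not the PR's code)."""
    return np.searchsorted(cut_points, values, side="right")

cuts = np.array([2.0, 4.0, 6.0])    # 3 interior cut points -> 4 bins
vals = np.array([1.0, 2.0, 5.0, 7.0])
print(discretize_column(vals, cuts))  # [0 1 2 3]
```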
@mblondel: I rewrote a lot of the code to accommodate the binary search implemented in Cython. Can you take a look and tell me what you think?
I'm not sure why this should be like this. Do you mean the last index overall, among all the features? If the output is fed into (1) a one-hot encoder or (2) a tree-based model, it would make no difference, right? If we could do away with this feature, this transformer could just assume that no features are numerical. This would simplify the code and interface. Since this is a transformer and not a classifier, users can easily work around the issue with … What do you think? (ping @jnothman, @mblondel) (P.S. @mblondel, if this works well with polynomial features, does it mean that fuzzy logic is making a comeback? 😛)
```python
max_ : float
    The maximum value of the input data.

cut_points_ : array, shape [numBins - 1, n_continuous_features]
```
`n_bins` …_ and `searched_points_`
Regarding rounding, I assume @mblondel's idea went something like: (for dense data)
which seems like it would give you k + 1 bins. I'm not sure if this is right, nor how to make it work for sparse data, but it does make sense to me. Regarding the discrete columns: I don't understand why you would need to change anything in them. And if we don't change the discrete columns, we can remove all the code that deals with them, simplifying this greatly (including the 🐱 method :) ). If the user wants a one-hot encoding or get_dummies on the other columns, they could do it themselves using a FeatureUnion or something. I am leaning towards encouraging this, but I'd love to hear other opinions. (Hi @jnothman :) )
Ah, that's clever. However, I think there are cases when this wouldn't work. For example, if we have … I think scaling into … If we scale into …
I just shifted the discrete (categorical) columns to the end of the matrix because, implementation-wise, this simplified my code. I'd imagine that this is the case with OneHotEncoder too (just simplifying implementation). But on the other hand, the OneHotEncoder and …
It looks like Travis's python3 build is failing because it can't find the … I reckon that my configuration in setup.py is incorrect, but I don't understand enough Cython to know. If someone can diagnose why the build is failing, that would be great.
Alright, I think I understand the scaling scheme @mblondel was mentioning earlier. I imagine the scaling to go like this. Given some feature …
Like @vene said, we need to encourage code maintainability, and I think this 6-step process would be faster and require much less code than the current implementation. It also works for sparse matrices, which is essentially our blocker here. Please let me know if there is any interest in continuing this PR. (Edit 10/14/16: Added clarifications about ceilings and floors.)
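A rough sketch of such a sparsity-preserving scale-and-round scheme (my own reading of the idea, not the PR's code; the helper name and the exact bin ranges are assumptions):

```python
import numpy as np
import scipy.sparse as sp

def discretize_sparse_uniform(X, n_bins):
    """Hypothetical sketch: only the stored entries X.data are touched,
    so implicit zeros stay zero and sparsity is preserved. Positive
    values map to bins 1..n_bins, negatives to -n_bins..-1."""
    X = sp.csr_matrix(X, copy=True)
    d = X.data.astype(np.float64)
    pos_max = max(d.max(), 0.0) or 1.0   # guard against division by zero
    neg_min = min(d.min(), 0.0) or -1.0
    out = np.empty_like(d)
    pos = d > 0
    out[pos] = np.ceil(d[pos] / pos_max * n_bins)      # in 1..n_bins
    out[~pos] = -np.ceil(d[~pos] / neg_min * n_bins)   # in -n_bins..-1
    X.data = out
    return X
```

Using a ceiling (rather than rounding) on the scaled values keeps the smallest non-zero entries from collapsing into bin 0, which is the sparsity concern raised above.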
@hlin117 I'm guessing you need the epsilon to make sure that the smallest non-zero values get discretized to non-zero? I'm guessing you could just add 1 in the first step instead, right? (Basically rescale all non-zeros to lie in (1, 1 + k) instead of (0, k).) (Later edit: I think what I said above works for non-negatives. Negative values should lie in (-k - 1, -1) then?) Indeed this was my understanding of @mblondel's suggestion. We need to check two things now:
Also, if the rounding approach does work well and is fast, we will need to document the code really clearly, because it's pretty counterintuitive.
Any news on this?
I'll be happy to continue this PR, if there's support for it.
I think there is interest. In the case of sparse matrices, you're only going to be able to maintain sparsity if you engineer 0 to remain 0. I'm confused whether that's what you're doing with your … I think you're also missing a …
I think if we were to move forward with this, I would be okay with abandoning my code here, and starting afresh.
I think this data preprocessing should be up to the user. Some users may choose not to clip off these outliers, and it's not really clear how to define what is an outlier and what is not.
Outliers are things out of range of the training data and therefore its …

> On 14 October 2016 at 17:53, Henry Lin notifications@github.com wrote: …
Ah true. I wasn't thinking about the testing set, only the training set. Thank you.
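To make the out-of-range point concrete, a fitted transformer could clip transform-time values into the training range before binning; `min_`, `max_`, and the cut points below are hypothetical fitted attributes, not the PR's actual ones:

```python
import numpy as np

# Hypothetical attributes learned from the training data
min_, max_ = 0.0, 8.0
cut_points = np.array([2.0, 4.0, 6.0])

# Test-time data can fall outside the training range; clipping before
# binning guarantees every sample lands in a valid bin.
X_new = np.array([-3.0, 5.0, 11.0])
X_clipped = np.clip(X_new, min_, max_)
print(X_clipped)  # [0. 5. 8.]
```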
Please look at #7668. It contains a much more straightforward algorithm (and much, much less code). Thanks!
Update
Please look at #7668 for the current PR.
Summary
This PR addresses #5778. It adds the Discretizer class to scikit-learn. This PR adds the following:

- … `sparsefuncs_fast.pyx` … class
- `sparsefuncs.py` … for discretization
- Documentation

Doctest example:
Original post:
The discretizer should be used as follows. (Just copying from the docstring test.)
A couple of comments and questions:
- … `LabelEncoder`, which also requires 1D arrays. Does it sound fine that we limit the inputs to only 1D arrays?
- `FixedWidthDiscretizer` is a mouthful. Do we have opinions about what else we could name the class? (I'm thinking `KBinsDiscretizer`.)