[MRG] Non-categorical variables in OneHotEncoder. #2027
Conversation
def fit_transform(self, X, y=None):
    """Fit OneHotEncoder to X, then transform X.

def _apply_transform(self, X, transform):
    if isinstance(self.categorical_features, six.string_types) and \
I don't think there's a need for the isinstance check; it's implicit in the comparison to "all".
Ok, thanks. I was worried that since ndarray overrides ==, I might run into some issues.
Well,
>>> array("all") == "all"
array(True, dtype=bool)
but since
>>> bool(array("all") == "all")
True
>>> bool(array(["any"]) == "all")
False
I don't think that'll pose any problems. Comparisons between str and non-stringlike ndarray seem to always return just False.
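The comparison behavior discussed above can be confirmed with a short, self-contained snippet (assuming NumPy; this just restates the doctest-style lines from the comment in runnable form):

```python
import numpy as np

# A 0-d string array compared to an equal string is truthy
assert bool(np.asarray("all") == "all")

# A string array holding a different value compares elementwise False
assert not bool(np.asarray(["any"]) == "all")

# So a `selected == "all"` test behaves sensibly whether `selected`
# is a plain str or a string ndarray
assert "all" == "all"
```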
I think a list of indices is actually easier from a user perspective, since a boolean mask is easily translated to indices with
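The exact expression referred to above was lost in extraction; one way to translate between a boolean mask and indices in NumPy (np.flatnonzero is my assumption, not necessarily what the commenter had in mind) is:

```python
import numpy as np

mask = np.array([True, False, True, False])

# mask -> indices
indices = np.flatnonzero(mask)

# indices -> mask (round-trip)
mask_back = np.zeros(mask.shape[0], dtype=bool)
mask_back[indices] = True
assert np.array_equal(mask, mask_back)
```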
if n_cat == 0:
    # No categorical variables.
    return X
In this case, the input validation is potentially wasteful, especially the densification.
I could use atleast2d_or_csc. This way I can do the categorical / non-categorical splitting without densification. Also, if there are only a few categorical variables, the densification done by _fit_transform and _transform will be affordable.
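The splitting-without-densification idea can be sketched like this (a toy illustration assuming scipy.sparse; the column indices are made up, and this is not the PR's actual code):

```python
import numpy as np
import scipy.sparse as sp

# A random sparse matrix in CSC format (column slicing on CSC is cheap)
X = sp.random(10, 6, density=0.3, format="csc", random_state=0)

cat_idx = np.array([1, 4])  # hypothetical categorical columns
non_cat_idx = np.setdiff1d(np.arange(X.shape[1]), cat_idx)

# Both slices stay sparse: no densification needed for the split itself
X_cat = X[:, cat_idx]
X_rest = X[:, non_cat_idx]
```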
Hum, X = check_arrays(X, sparse_format='dense', dtype=np.int)[0] doesn't do an implicit dense conversion; it raises an exception. So my idea doesn't work unless I make an explicit toarray() before calling _fit_transform / _transform.
This also means that the validation is not wasteful.
Thinking about it, if we go for a list of feature indices / names, we need to choose whether it's more convenient to list the categorical variables or the non-categorical ones. Which is best highly depends on the application...
I think it's more convenient to list the categorical ones, because otherwise you'd have a parameter called
Travis has caught a doctest to update: https://travis-ci.org/scikit-learn/scikit-learn/builds/7766345#L2752
I thought that it does the one-hot coding even for integer features but I probably misread the code. Then, great, I indeed do not need to change anything.
It looks like
However, I do think this can be pulled out as a generic util:
I thought of
Any other opinion on indices vs. mask? Another thing I like about masks is that you can use np.ones or np.zeros to generate the initial mask, then fill in the few categorical or non-categorical variables for your application of interest. This way we are not favoring applications with few categorical variables.
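The np.ones / np.zeros workflow described above might look like this (column positions are invented for illustration):

```python
import numpy as np

n_features = 100

# Few categorical columns: start from all-False and flag them
categorical = np.zeros(n_features, dtype=bool)
categorical[[0, 3, 7]] = True

# Few NON-categorical columns: start from all-True and clear them
mostly_categorical = np.ones(n_features, dtype=bool)
mostly_categorical[[0, 3, 7]] = False
```

Either direction costs one line, which is the symmetry argument being made here.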
I don't want to complicate things, but wouldn't it be possible to support both? I was thinking about converting (implicitly) masks to indices (or vice-versa depending on what you want to feed to the implementation). If the array is as long as the number of features, then it is a boolean mask and can be converted to indices; otherwise it is assumed to be an array of indices. Couldn't this work?
Indeed, it shouldn't be too hard to support both. Thanks!
That leads to ambiguity. What if I say we leave it up to the user to call
@larsmans Indeed, you are right. By the way, what are they doing in Numpy in such a case? (since you can index an array both with a mask or a list of indices)
Numpy seems to rely on the type of the array in that case. Anyway, up to you. I don't want to complicate things for nothing :)
Indeed, if we want to support both, the user has to explicitly set dtype=bool in the case of masks. That sounds fair enough to me.
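The dtype-based dispatch agreed on above could be sketched as follows (the helper name and signature are hypothetical, not the PR's actual code; the convention mirrors NumPy fancy indexing, which also distinguishes masks by dtype=bool):

```python
import numpy as np

def as_column_indices(selected, n_features):
    """Accept either a boolean mask or an array of indices,
    using dtype=bool to disambiguate (hypothetical helper)."""
    selected = np.asarray(selected)
    if selected.dtype == bool:
        if selected.shape[0] != n_features:
            raise ValueError("boolean mask has wrong length")
        return np.flatnonzero(selected)
    return selected.astype(int)
```

Note that a length-n_features integer array is still unambiguous here, which is exactly what the length-based heuristic could not guarantee.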
Instead of the
I added combined support of masks and index arrays. I kept
Indirectly related to this PR: does anyone know how to handle variables which only make sense if another variable takes some particular value? For example, a variable which only makes sense if the patient is a female (e.g., number of years since menopause).
I guess I'd handle that as a quadratic feature:
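The quadratic-feature suggestion (the concrete snippet was lost in extraction) presumably means multiplying the conditional variable by its gating indicator, something like:

```python
import numpy as np

# Hypothetical columns: a 0/1 indicator and a conditionally meaningful value
is_female = np.array([1, 0, 1])
years_since_menopause = np.array([5.0, 3.0, 12.0])  # 3.0 is garbage for the male row

# The product zeroes out the feature wherever the condition doesn't hold
interaction = is_female * years_since_menopause
```

This way the meaningless values never leak into the model for rows where the condition is false.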
I'd still suggest this belongs as a utility function, not something special to
enc = OneHotEncoder(categorical_features=cat)
X_train = enc.fit_transform(X_train)
X_test = enc.transform(X_test)

vs.

enc = OneHotEncoder()
X_train = transform_selected(X_train, enc.fit_transform, categorical_features=cat)
X_test = transform_selected(X_test, enc.transform, categorical_features=cat)

vs.

enc = OneHotEncoder()
X_cat, X_non_cat = split_selected(X_train)
X_cat = enc.fit_transform(X_cat)
X_train = sp.hstack((X_cat, X_non_cat))
X_cat, X_non_cat = split_selected(X_test)
X_cat = enc.transform(X_cat)
X_test = sp.hstack((X_cat, X_non_cat))

Any other opinions? Abstractions are nice but it's also important to have easy-to-use APIs...
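For concreteness, a minimal sketch of what such a generic utility could look like (parameter names and dense-only handling are my assumptions; the PR's actual helper also deals with sparse input and masks):

```python
import numpy as np
import scipy.sparse as sp

def transform_selected(X, transform, selected):
    """Apply `transform` to the selected columns only, and stack the
    untouched columns on the right (sketch, not the PR's code)."""
    X = np.asarray(X)
    selected = np.asarray(selected)
    not_selected = np.setdiff1d(np.arange(X.shape[1]), selected)

    X_sel = transform(X[:, selected])
    X_not_sel = X[:, not_selected]

    if sp.issparse(X_sel):
        return sp.hstack((X_sel, sp.csr_matrix(X_not_sel))).tocsr()
    return np.hstack((X_sel, X_not_sel))

# Example: double only column 0, leave column 1 untouched
X = np.array([[1.0, 10.0], [2.0, 20.0]])
out = transform_selected(X, lambda A: A * 2, [0])
```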
If scikit-learn was based on abstractions like this, we wouldn't need OVR support directly in SGDClassifier and would rely on OneVsRestClassifier instead.
I should clarify, because you're getting me completely wrong here and making a farce: I don't think you should require the user to use the abstraction. Of course the user can build such an abstraction out of appropriate transformers. Still, you can create a helper function and use it only in
I am sorry for not making this clearer in the first place; but surely your interpretation of my suggestion is so patently ridiculous (not to mention even messier if providing a class API with parameters, etc.) that it must be a misinterpretation.
I was not trying to make a farce of you. From the above comment, I thought you were suggesting to let the user call the utility, not to use it in OneHotEncoder directly. If you get angry so easily you should try to be clearer in the first place. I will try to implement your suggestion tonight. If I create such an abstraction, while I am at it, I can also create some kind of FeatureSplit transformer.
I didn't get angry until I felt my suggestion was being ridiculed. Unless you can come up with some convincingly simple API, I don't think
Alternatively, you could just add a parameter to
(And that suggestion of adding a parameter to
In case of possible misunderstanding, my view is to always believe that
I think that this suggestion could have been applied both ways here :)
@jnothman I agree the
In the meantime, +1 for splitting the
Might as well make an issue so this is remembered: #2034. (And it's looking unlikely that I'll be at the sprint.)
rather than a method.
Done.
-------
X : array or sparse matrix, shape=(n_samples, n_features_new)
"""
if len(selected) == 0:
This exploits the fact that we're using a string and not, say, None to indicate the "all" case. So I find it a bit weird here. It's also an uncommon use-case, and so should be tested after the "all" case.
Using selected=None as default value to mean "select all" felt strange to me. Hence the default value of "all". None can be used as synonym for [] though.
My concern here was using len("all") != 0. Better to put the "all" condition first...
Can I haz merge?
Shall we rename
Otherwise, +1.
+1
""" | ||
if selected == "all": | ||
return transform(X) | ||
elif len(selected) == 0: |
Should we copy here? I am worried that there will be a strange behavior with copies happening sometimes but not always.
I don't see a reason to copy in a no-op.
I don't see a reason to copy in a no-op.

For consistency: you don't want a copy to happen or not depending on some small changes in parameters. This can lead to bugs really hard to debug.
I guess you mean:
X1hot = onehot.fit_transform(X)
X *= 2  # absurd example
then X1hot changes? I guess you're right, +1.
I guess you mean:
X1hot = onehot.fit_transform(X)
X *= 2  # absurd example
then X1hot changes?

That's exactly what I have in mind. I've been working with libraries that have inconsistent copy behavior, and I usually end up forcing the copy myself, just to avoid surprises.
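The aliasing bug being discussed is easy to reproduce (a toy identity transform standing in for the no-op branch; the `copy` flag here is illustrative, not the PR's API):

```python
import numpy as np

def identity_transform(X, copy=True):
    # Copying even in the no-op case keeps behavior consistent:
    # the output never aliases the input.
    return X.copy() if copy else X

X = np.array([1.0, 2.0])
aliased = identity_transform(X, copy=False)
copied = identity_transform(X, copy=True)

X *= 2  # in-place modification, as in the "absurd example" above
# `aliased` silently changed along with X; `copied` did not
```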
Fixed in d25f12b.
Very cool. I made a few minor comments (feel free not to address them, even though I feel that some should :). Besides that, 👍 for merge.
@larsmans I personally prefer the more explicit
I think that I prefer categorical_features, as it is more explicit.
But the statistical term might be confusing to new users. Also,
I prefer
Ok, that's 3-2 for you guys. Whoever still objects gets to submit an issue (I won't ;)
ENH Non-categorical variables in OneHotEncoder
Thanks Lars!
I added a categorical_features option to OneHotEncoder. It lets the user specify whether features / variables are categorical or not. Non-categorical features are not touched and are stacked as is at the end of the matrix. I was hesitating between using an array of indices or a boolean mask. I went for the latter as it seems easier to handle for the user but I would be glad to change if other people think otherwise.

I'm thinking of adding a similar feature to DictVectorizer but I would like us to agree on whether I should use an array of indices or a boolean mask first.

CC @amueller @larsmans