-
-
Notifications
You must be signed in to change notification settings - Fork 26k
[MRG + 2] Add Drop Option to OneHotEncoder. #12908
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[MRG + 2] Add Drop Option to OneHotEncoder. #12908
Conversation
…d is not present in the training data. Added tests to confirm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not yet looked in detail at transform
or tests. Nice work!
Thanks for the feedback @jnothman! I just got back from a vacation and am working my way through these. |
When do we need to specify which features to drop? Maybe #12884 is enough? |
@qinhanmin2014 The main appeal of specifying what feature to drop is for situations in which the coefficient interpretation is important (and done relative to the dropped category), for example in unregularized regression. |
I think it would be good to support what feature to drop. If you do this in regularized linear regression, picking which to drop will actually change the optimization problem. |
Thanks, now I agree that this one is better than #12884 |
…ving all values of some features. Updated behavior to by default drop columns with same value throughout
sklearn/preprocessing/_encoders.py
Outdated
drop_value = self.drop_idx_[i] | ||
""" | ||
Add the cells where the drop value is present to the list | ||
of those to be masked-out. Decrement the indices of the values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see three options for implementing transform
:
- ignore the added category when encoding, a bit like handle_unknown='ignore'
- remove the value once encoded as an int
- remove the columns once one-hot encoded
I think you've implemented #2. Why is this one chosen? What are its pros and cons in terms of code simplicity and computational complexity?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To do the first option, it seems like I'd have to modify _transform(), which is in the BaseEncoder class. We're not adding drop options to the other classes that inherit from BaseEncoder, so I thought that changing this in BaseEncoder might not be the best approach.
The approach I took with 2) had some advantages, since I was able to mix it in to reuse some of the logic that was already used for masking unknown values. It also avoids having to do column slicing on an output that might be a CSR matrix (as would be the case in 3)).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jnothman do you think that one of the other approaches is superior?
@jnothman any further comments? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry this is only a partial review
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for making those changes, @drewmjohnston
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Latest changes look all good, thanks!
Great--made the small documentation fix you recommended. |
@jorisvandenbossche OK, sounds good. I removed this in a moment of confusion when I though this was going in to .22, not .21. |
@drewmjohnston Can you also add back the test? (it's in here: 2eb2c16) |
OK, removed the extra code and added the tests for the drop/n_values interactions. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, looks all good now!
merging as it has two +1s and appveyor is kinda redundant now. |
You didn't merge @amueller ? |
@drewmjohnston Thanks a lot! |
Great! Thanks for the help everyone. You guess make contributing to this a breeze. |
This reverts commit 2554159.
This reverts commit 2554159.
Reference Issues/PRs
Fixes #6488 and fixes #6053. This builds upon some of the code from #12884 (thanks @NicolasHug!), but also incorporates functionality which lets the user manually specify which category in each column they would like to be dropped, so this is a more general solution along the lines of what @amueller suggested in #6053. This is useful in some cases (such as OLS regression) where the dropped group affects the interpretation of coefficients.
This should also fix #9361, which has less functionality and appears stalled.
What does this implement/fix? Explain your changes.
This code implements a new parameter (drop) in the OneHotEncoder, which can take any of three values:
None (default), which implements the existing behavior.
'first', which drops the first value in each category.
list of length n_features, which allows for the manual specification of the reference group.
Any other comments?
This new feature does not work in Legacy mode (this was discussed in #6053), and it requires the manual specification of "categories='auto'" in the case in which the input is all integers, so as not to interfere with the ongoing change in the treatment of integers in OneHotEncoder.
This code is also incompatible with "handle_missing='ignore'", since in that case it is not possible to determine which categories are 0 because they are in the reference category and which are all 0 due to missing data.
Closes #12884