Skip to content

[MRG + 2] Add Drop Option to OneHotEncoder. #12908

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

drewmjohnston
Copy link
Contributor

@drewmjohnston drewmjohnston commented Jan 3, 2019

Reference Issues/PRs

Fixes #6488 and fixes #6053. This builds upon some of the code from #12884 (thanks @NicolasHug!), but also incorporates functionality which lets the user manually specify which category in each column they would like to be dropped, so this is a more general solution along the lines of what @amueller suggested in #6053. This is useful in some cases (such as OLS regression) where the dropped group affects the interpretation of coefficients.
This should also fix #9361, which has less functionality and appears stalled.

What does this implement/fix? Explain your changes.

This code implements a new parameter (drop) in the OneHotEncoder, which can take any of three values:
None (default), which implements the existing behavior.
'first', which drops the first value in each category.
list of length n_features, which allows for the manual specification of the reference group.

Any other comments?

This new feature does not work in Legacy mode (this was discussed in #6053), and it requires the manual specification of "categories='auto'" in the case in which the input is all integers, so as not to interfere with the ongoing change in the treatment of integers in OneHotEncoder.
This code is also incompatible with "handle_missing='ignore'", since in that case it is not possible to determine which categories are 0 because they are in the reference category and which are all 0 due to missing data.

Closes #12884

Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not yet looked in detail at transform or tests. Nice work!

@drewmjohnston
Copy link
Contributor Author

Thanks for the feedback @jnothman! I just got back from a vacation and am working my way through these.

@qinhanmin2014
Copy link
Member

When do we need to specify which features to drop? Maybe #12884 is enough?

@drewmjohnston
Copy link
Contributor Author

@qinhanmin2014 The main appeal of specifying what feature to drop is for situations in which the coefficient interpretation is important (and done relative to the dropped category), for example in unregularized regression.

@amueller
Copy link
Member

I think it would be good to support what feature to drop. If you do this in regularized linear regression, picking which to drop will actually change the optimization problem.

@qinhanmin2014
Copy link
Member

@qinhanmin2014 The main appeal of specifying what feature to drop is for situations in which the coefficient interpretation is important (and done relative to the dropped category), for example in unregularized regression.

I think it would be good to support what feature to drop. If you do this in regularized linear regression, picking which to drop will actually change the optimization problem.

Thanks, now I agree that this one is better than #12884

Drew Johnston added 2 commits January 28, 2019 02:31
…ving all values of some features. Updated behavior to by default drop columns with same value throughout
drop_value = self.drop_idx_[i]
"""
Add the cells where the drop value is present to the list
of those to be masked-out. Decrement the indices of the values
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see three options for implementing transform:

  1. ignore the added category when encoding, a bit like handle_unknown='ignore'
  2. remove the value once encoded as an int
  3. remove the columns once one-hot encoded

I think you've implemented #2. Why is this one chosen? What are its pros and cons in terms of code simplicity and computational complexity?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To do the first option, it seems like I'd have to modify _transform(), which is in the BaseEncoder class. We're not adding drop options to the other classes that inherit from BaseEncoder, so I thought that changing this in BaseEncoder might not be the best approach.
The approach I took with 2) had some advantages, since I was able to mix it in to reuse some of the logic that was already used for masking unknown values. It also avoids having to do column slicing on an output that might be a CSR matrix (as would be the case in 3)).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jnothman do you think that one of the other approaches is superior?

@drewmjohnston
Copy link
Contributor Author

@jnothman any further comments?

Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry this is only a partial review

Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for making those changes, @drewmjohnston

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Latest changes look all good, thanks!

@drewmjohnston
Copy link
Contributor Author

Great--made the small documentation fix you recommended.

@drewmjohnston
Copy link
Contributor Author

@jorisvandenbossche OK, sounds good. I removed this in a moment of confusion when I though this was going in to .22, not .21.

@jorisvandenbossche
Copy link
Member

@drewmjohnston Can you also add back the test? (it's in here: 2eb2c16)

@drewmjohnston
Copy link
Contributor Author

OK, removed the extra code and added the tests for the drop/n_values interactions.

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, looks all good now!

@drewmjohnston drewmjohnston changed the title [MRG + 1] Add Drop Option to OneHotEncoder. [MRG + 2] Add Drop Option to OneHotEncoder. Feb 26, 2019
@amueller
Copy link
Member

merging as it has two +1s and appveyor is kinda redundant now.

@NicolasHug
Copy link
Member

You didn't merge @amueller ?

@jorisvandenbossche jorisvandenbossche merged commit 350cd4a into scikit-learn:master Feb 26, 2019
@jorisvandenbossche
Copy link
Member

@drewmjohnston Thanks a lot!

@jorisvandenbossche jorisvandenbossche added this to the 0.21 milestone Feb 26, 2019
@drewmjohnston
Copy link
Contributor Author

Great! Thanks for the help everyone. You guess make contributing to this a breeze.

@drewmjohnston drewmjohnston deleted the leave_one_out_ohe branch February 26, 2019 14:27
xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019
xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019
xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019
koenvandevelde pushed a commit to koenvandevelde/scikit-learn that referenced this pull request Jul 12, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

OneHotEncoder - add option for 1 of k-1 encoding OneHotEncoding - Defining a reference category
6 participants