[MRG + 2] Add Drop Option to OneHotEncoder. #12908

drewmjohnston · 2019-01-03T04:21:02Z

Reference Issues/PRs

Fixes #6488 and fixes #6053. This builds upon some of the code from #12884 (thanks @NicolasHug!), but also incorporates functionality which lets the user manually specify which category in each column they would like to be dropped, so this is a more general solution along the lines of what @amueller suggested in #6053. This is useful in some cases (such as OLS regression) where the dropped group affects the interpretation of coefficients.
This should also fix #9361, which has less functionality and appears stalled.

What does this implement/fix? Explain your changes.

This code implements a new parameter (drop) in the OneHotEncoder, which can take any of three values:
None (default), which implements the existing behavior.
'first', which drops the first category in each feature.
list of length n_features, which allows for the manual specification of the reference category for each feature.
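
A minimal pure-Python sketch of what the three drop settings mean for the output columns. The helper `one_hot_columns` and its (feature index, category) column labels are hypothetical illustrations, not the code in this PR, and the sketch assumes that the categories of each feature are the sorted unique values seen in the data.

```python
def one_hot_columns(rows, drop=None):
    """Return (column_labels, encoded_rows) for a list of categorical rows.

    drop=None     keeps every category,
    drop='first'  drops the first (sorted) category of each feature,
    drop=[...]    drops the named category of each feature.
    """
    n_features = len(rows[0])
    # Categories per feature: sorted unique values seen in that column.
    categories = [sorted({row[j] for row in rows}) for j in range(n_features)]

    if drop is None:
        dropped = [None] * n_features
    elif drop == "first":
        dropped = [cats[0] for cats in categories]
    else:
        if len(drop) != n_features:
            raise ValueError("drop must have one entry per feature")
        for j, value in enumerate(drop):
            if value not in categories[j]:
                raise ValueError(f"{value!r} is not a category of feature {j}")
        dropped = list(drop)

    # One output column per (feature, category) pair that is not dropped.
    labels = [
        (j, cat)
        for j, cats in enumerate(categories)
        for cat in cats
        if cat != dropped[j]
    ]
    encoded = [[1 if row[j] == cat else 0 for (j, cat) in labels] for row in rows]
    return labels, encoded


rows = [["male", 1], ["female", 3], ["female", 2]]
labels_all, _ = one_hot_columns(rows)                       # 5 columns
labels_first, enc_first = one_hot_columns(rows, "first")    # 3 columns
labels_manual, enc_manual = one_hot_columns(rows, ["male", 2])
```

With drop='first' the reference row is whichever category sorts first; with a list, the caller picks the reference category of each feature explicitly.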

Any other comments?

This new feature does not work in Legacy mode (this was discussed in #6053), and it requires the manual specification of "categories='auto'" in the case in which the input is all integers, so as not to interfere with the ongoing change in the treatment of integers in OneHotEncoder.
This code is also incompatible with "handle_unknown='ignore'", since in that case it is not possible to determine which rows are all 0 because they are in the reference category and which are all 0 due to an unknown category.
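
The ambiguity can be seen in a small pure-Python sketch. The helper `encode_value` is hypothetical and only mimics the idea of silently ignoring unknown categories; it is not scikit-learn code.

```python
def encode_value(value, categories, dropped):
    """One-hot encode a single value, omitting the `dropped` category.

    Values that are not in `categories` are encoded as all zeros, which
    mimics ignoring unknown categories at transform time.
    """
    kept = [cat for cat in categories if cat != dropped]
    return [1 if value == cat else 0 for cat in kept]


categories = ["blue", "green", "red"]

# "blue" is the dropped reference category; "purple" was never seen in fit.
reference_row = encode_value("blue", categories, dropped="blue")
unknown_row = encode_value("purple", categories, dropped="blue")

# Both rows come out as all zeros, so they cannot be told apart afterwards.
print(reference_row, unknown_row)
```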

Closes #12884

jnothman

Not yet looked in detail at transform or tests. Nice work!

sklearn/preprocessing/_encoders.py

drewmjohnston · 2019-01-17T04:12:40Z

Thanks for the feedback @jnothman! I just got back from a vacation and am working my way through these.

qinhanmin2014 · 2019-01-25T01:38:50Z

When do we need to specify which features to drop? Maybe #12884 is enough?

drewmjohnston · 2019-01-25T22:48:23Z

@qinhanmin2014 The main appeal of specifying what feature to drop is for situations in which the coefficient interpretation is important (and done relative to the dropped category), for example in unregularized regression.
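
As a rough illustration of that point, the sketch below uses the closed-form least-squares solution for a single categorical feature with an intercept: the intercept equals the mean of the dropped (reference) group, and each coefficient is a group mean minus the reference mean. The toy data and the helper `dummy_coefficients` are hypothetical.

```python
from statistics import mean

# Toy data: one categorical feature and a numeric target.
groups = ["a", "a", "b", "b", "c", "c"]
y = [1.0, 3.0, 4.0, 6.0, 9.0, 11.0]


def dummy_coefficients(groups, y, reference):
    """Unregularized least-squares fit of y on an intercept plus dummies,
    with the `reference` category dropped.

    For one categorical feature the fitted values are the group means, so
    the intercept is the reference-group mean and each coefficient is that
    group's mean minus the reference-group mean.
    """
    means = {
        g: mean(v for gi, v in zip(groups, y) if gi == g)
        for g in sorted(set(groups))
    }
    intercept = means[reference]
    coefs = {g: m - intercept for g, m in means.items() if g != reference}
    return intercept, coefs


# Same data and same predictions, but the coefficients are read against
# different baselines depending on which category is dropped.
fit_ref_a = dummy_coefficients(groups, y, reference="a")
fit_ref_c = dummy_coefficients(groups, y, reference="c")
print(fit_ref_a)
print(fit_ref_c)
```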

amueller · 2019-01-27T20:59:16Z

I think it would be good to support what feature to drop. If you do this in regularized linear regression, picking which to drop will actually change the optimization problem.
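
One way to see this: parametrizations that give identical predictions have different coefficient norms depending on the dropped category, so an L2 penalty scores them differently. The sketch below uses hypothetical group means and assumes the intercept is not penalized.

```python
# Hypothetical per-category means of the target.
group_means = {"a": 2.0, "b": 5.0, "c": 10.0}


def l2_penalty(group_means, reference):
    """Squared L2 norm of the dummy coefficients that reproduce the group
    means exactly when `reference` is the dropped category."""
    intercept = group_means[reference]
    return sum(
        (m - intercept) ** 2 for g, m in group_means.items() if g != reference
    )


# The unpenalized fit is the same for every choice of reference, but the
# size of the coefficients needed to express it is not.
penalties = {ref: l2_penalty(group_means, ref) for ref in group_means}
print(penalties)  # {'a': 73.0, 'b': 34.0, 'c': 89.0}
```

Because the penalty term differs across these equivalent parametrizations, a ridge-style objective has a different optimum for each choice of dropped category.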

qinhanmin2014 · 2019-01-28T02:15:23Z

> @qinhanmin2014 The main appeal of specifying what feature to drop is for situations in which the coefficient interpretation is important (and done relative to the dropped category), for example in unregularized regression.

> I think it would be good to support what feature to drop. If you do this in regularized linear regression, picking which to drop will actually change the optimization problem.

Thanks, now I agree that this one is better than #12884

sklearn/preprocessing/_encoders.py

jnothman · 2019-01-28T22:41:36Z

sklearn/preprocessing/_encoders.py

+                drop_value = self.drop_idx_[i]
+                """
+                Add the cells where the drop value is present to the list
+                of those to be masked-out. Decrement the indices of the values


I see three options for implementing transform:

1. ignore the added category when encoding, a bit like handle_unknown='ignore'
2. remove the value once encoded as an int
3. remove the columns once one-hot encoded

I think you've implemented #2. Why is this one chosen? What are its pros and cons in terms of code simplicity and computational complexity?

drewmjohnston · 2019-01-30

To do the first option, it seems like I'd have to modify _transform(), which is in the BaseEncoder class. We're not adding drop options to the other classes that inherit from BaseEncoder, so I thought that changing this in BaseEncoder might not be the best approach.
The approach I took with option 2 had some advantages, since I was able to reuse some of the logic that was already used for masking unknown values. It also avoids having to do column slicing on an output that might be a CSR matrix (as would be the case with option 3).

drewmjohnston · 2019-02-04

@jnothman do you think that one of the other approaches is superior?
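
For readers following the thread, options 2 and 3 from the review comment can be sketched in pure Python as below. Both function names are hypothetical, the input is assumed to be one feature already encoded as integer codes, and the code works on dense lists rather than the sparse matrices used in scikit-learn.

```python
def drop_after_ordinal(codes, n_categories, drop_idx):
    """Option 2: remove the dropped value while it is still an integer code.

    Cells equal to drop_idx are masked out (the row stays all zeros), and
    codes above drop_idx are decremented so they index the narrower output.
    """
    width = n_categories - 1
    out = []
    for code in codes:
        row = [0] * width
        if code != drop_idx:
            shifted = code - 1 if code > drop_idx else code
            row[shifted] = 1
        out.append(row)
    return out


def drop_after_one_hot(codes, n_categories, drop_idx):
    """Option 3: one-hot encode fully, then delete the dropped column."""
    full = [[1 if code == k else 0 for k in range(n_categories)] for code in codes]
    return [row[:drop_idx] + row[drop_idx + 1:] for row in full]


codes = [0, 2, 1, 2]  # integer codes of one feature with 3 categories
via_option_2 = drop_after_ordinal(codes, 3, drop_idx=1)
via_option_3 = drop_after_one_hot(codes, 3, drop_idx=1)
print(via_option_2 == via_option_3)  # True
```

Both routes give the same matrix. Option 2 never builds the full-width output, which is the point made above about avoiding column slicing on a CSR result.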

drewmjohnston · 2019-02-08T19:17:52Z

@jnothman any further comments?

jnothman

Sorry this is only a partial review

sklearn/preprocessing/_encoders.py

jnothman

Thanks for making those changes, @drewmjohnston

jorisvandenbossche

Latest changes look all good, thanks!

sklearn/preprocessing/_encoders.py

doc/modules/preprocessing.rst

drewmjohnston · 2019-02-26T09:06:54Z

Great--made the small documentation fix you recommended.

sklearn/preprocessing/_encoders.py

drewmjohnston · 2019-02-26T09:38:31Z

@jorisvandenbossche OK, sounds good. I removed this in a moment of confusion when I thought this was going into 0.22, not 0.21.

jorisvandenbossche · 2019-02-26T09:45:28Z

@drewmjohnston Can you also add back the test? (it's in here: 2eb2c16)

sklearn/preprocessing/_encoders.py

drewmjohnston · 2019-02-26T09:51:16Z

OK, removed the extra code and added the tests for the drop/n_values interactions.

jorisvandenbossche

OK, looks all good now!

amueller · 2019-02-26T11:51:35Z

merging as it has two +1s and appveyor is kinda redundant now.

NicolasHug · 2019-02-26T12:46:23Z

You didn't merge @amueller ?

jorisvandenbossche · 2019-02-26T12:50:31Z

@drewmjohnston Thanks a lot!

drewmjohnston · 2019-02-26T13:47:54Z

Great! Thanks for the help everyone. You guys make contributing to this a breeze.

Drew Johnston added 11 commits December 27, 2018 21:56

Added documentation for drop one OneHotEncoder

a4845f1

Added functionality for drop first and drop manually-specified.

f2f5c8a

Added tests for OneHotEncoder drop parameter

47c61bc

Fixed flake8

335406d

Resolved merge conflict in preprocessing/_encoders.py

fb83786

Added additional code to detect when a column that ought to be dropped is not present in the training data. Added tests to confirm

76a8274

Fixed docstring

a35b8b9

Fixed docstring2

bddbad9

Finally will clear pytest

e368cfe

Removed code that is not compatible with numpy 1.11.0

6ae1019

Fixed docs to match current OneHotEncoder string

89a55e1

NicolasHug mentioned this pull request Jan 9, 2019

[MRG] add drop_first option to OneHotEncoder #12884

Closed

jnothman reviewed Jan 10, 2019

Drew Johnston added 2 commits January 24, 2019 19:04

Updated error message with language that addresses new drop functionality

6c10149

Updated documentation and testing to resolve errors mentioned by jnothman

da3f989

Fixed Flake8 bug

82f8f07

Drew Johnston added 2 commits January 28, 2019 02:31

Added new way to encode features to be dropped. Added support for leaving all values of some features. Updated behavior to by default drop columns with same value throughout

cf1a425

Merge branch 'master' into leave_one_out_ohe

f09ada2

jnothman reviewed Jan 28, 2019

Drew Johnston added 4 commits January 30, 2019 00:33

Implemented basic text corrections and code optimizations recommended

aa25bdf

Merge branch 'master' into leave_one_out_ohe

d74a0f4

fixed Flake8 Literal complaint

a5c58f1

Merge branch 'master' into leave_one_out_ohe

e1942dd

jnothman reviewed Feb 12, 2019

sklearn/preprocessing/_encoders.py (outdated)

Drew Johnston added 2 commits February 25, 2019 12:25

Merge branch 'master' into leave_one_out_ohe

da73238

Reformatting, fixed flake8

e351b1f

jnothman approved these changes Feb 26, 2019

Small doc fix

6033765

jorisvandenbossche reviewed Feb 26, 2019

sklearn/preprocessing/_encoders.py (outdated)

doc/modules/preprocessing.rst (outdated)

jorisvandenbossche reviewed Feb 26, 2019

sklearn/preprocessing/_encoders.py (outdated)

Fixed n_values/drop compatibility

9597543

Removed cruft. Added in test for drop/n_values interaction

7149eac

jorisvandenbossche reviewed Feb 26, 2019

sklearn/preprocessing/_encoders.py (outdated)

jorisvandenbossche approved these changes Feb 26, 2019

drewmjohnston changed the title ~~[MRG + 1] Add Drop Option to OneHotEncoder.~~ [MRG + 2] Add Drop Option to OneHotEncoder. Feb 26, 2019

jorisvandenbossche merged commit 350cd4a into scikit-learn:master Feb 26, 2019

jorisvandenbossche added this to the 0.21 milestone Feb 26, 2019

drewmjohnston deleted the leave_one_out_ohe branch February 26, 2019 14:27

jrbourbeau mentioned this pull request Feb 28, 2019

Fix sklearn dev tests dask/dask-ml#474

Merged

jorisvandenbossche mentioned this pull request Mar 29, 2019

[WIP] Handle missing values in OrdinalEncoder #12045

Closed

xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019

ENH: Add Drop Option to OneHotEncoder. (scikit-learn#12908)

2554159

xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019

Revert "ENH: Add Drop Option to OneHotEncoder. (scikit-learn#12908)"

e7c3ff8

This reverts commit 2554159.

xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019

Revert "ENH: Add Drop Option to OneHotEncoder. (scikit-learn#12908)"

14164c2

This reverts commit 2554159.

koenvandevelde pushed a commit to koenvandevelde/scikit-learn that referenced this pull request Jul 12, 2019

ENH: Add Drop Option to OneHotEncoder. (scikit-learn#12908)

552cf89

thomasjpfan mentioned this pull request Sep 17, 2019

[MRG] FEA: Stacking estimator for classification and regression #11047

Merged

cmarmo mentioned this pull request Feb 28, 2020

OneHotEncoder drop 'if_binary' drop one column from all categorical variables #16552

Closed
