-
-
Notifications
You must be signed in to change notification settings - Fork 26.2k
[MRG] Fixes get_feature_names results when using drop functionality #13894
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[MRG] Fixes get_feature_names results when using drop functionality #13894
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree this behavior is better. LGTM.
@@ -590,6 +590,25 @@ def test_one_hot_encoder_feature_names_unicode(): | |||
assert_array_equal(['n👍me_c❤t1', 'n👍me_dat2'], feature_names) | |||
|
|||
|
|||
def test_one_hot_encoder_feature_names_drop(): | |||
# Assume that this is OK for manual drop, if OK for first |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd rather not assume this ;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suspected not 😏
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe you can parametrize this with something like
@pytest.mark.parametrize(
'drop, expected_feature_names', [
('first', [...]),
(['Female', 41, 'girl', 10], [...])
])
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that the "expected_feature_names" might get too long.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have implemented your suggestion @NicolasHug. Are you able to give your approval so we can add to the release?
expected_names = list(ohe_base.get_feature_names()) | ||
for i, cat_drop in enumerate(drop_cats): | ||
feat_drop = "x{}_{}".format(i, cat_drop) | ||
expected_names.remove(feat_drop) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a little bit involved. Seems like a good way to construct the expected_names without just reimplementing it. It's also robust to any changes in basic ordering.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think you benefit from this logic. Simpler (and more obviously correct) just to have the expected list of feature names as input to the function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair enough. I think you're probably right. I don't think I realised how much I'd simplified the test case anyway such that the expected names are now quite short.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
expected_names = list(ohe_base.get_feature_names()) | ||
for i, cat_drop in enumerate(drop_cats): | ||
feat_drop = "x{}_{}".format(i, cat_drop) | ||
expected_names.remove(feat_drop) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think you benefit from this logic. Simpler (and more obviously correct) just to have the expected list of feature names as input to the function.
Please add an entry to the change log at |
We are trying to move from using |
Yes, I need to update my boilerplate message, sorry
|
And fwiw, @jamesmyatt, I'd like to include this in the 0.21.2 release which might occur as soon as Friday to ship a fix for a severe and hidden bug in euclidean distances. |
@jamesmyatt, I'll wrap this up for you so it can be put into the release |
@jnothman , Thanks for updating this. I agree it's better like this. I don't think I'd realised how much simpler the expected names are now than they were before. Is there anything else that you need from me to get this merged? The CI failures look unrelated. |
No I'm waiting on another core developer to give their approval
|
Thank you @jamesmyatt. @jnothman I let you backport this as part of the 0.21.2 release PR. |
The new "drop" functionality in OneHotEncoder from v0.21 is not taken into account in the "get_feature_names" method. This fixes that problem.