Skip to content

[MRG] Fixes get_feature_names results when using drop functionality #13894

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
May 23, 2019
Merged

[MRG] Fixes get_feature_names results when using drop functionality #13894

merged 6 commits into from
May 23, 2019

Conversation

jamesmyatt
Copy link
Contributor

The new "drop" functionality in OneHotEncoder from v0.21 is not taken into account in the "get_feature_names" method. This fixes that problem.

Copy link
Member

@jeremiedbb jeremiedbb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree this behavior is better. LGTM.

@@ -590,6 +590,25 @@ def test_one_hot_encoder_feature_names_unicode():
assert_array_equal(['n👍me_c❤t1', 'n👍me_dat2'], feature_names)


def test_one_hot_encoder_feature_names_drop():
# Assume that this is OK for manual drop, if OK for first
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather not assume this ;)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspected not 😏

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe you can parametrize this with something like

@pytest.mark.parametrize(
    'drop, expected_feature_names', [
    ('first', [...]),
    (['Female', 41, 'girl', 10], [...])
])

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that the "expected_feature_names" might get too long.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have implemented your suggestion @NicolasHug. Are you able to give your approval so we can add to the release?

expected_names = list(ohe_base.get_feature_names())
for i, cat_drop in enumerate(drop_cats):
feat_drop = "x{}_{}".format(i, cat_drop)
expected_names.remove(feat_drop)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a little bit involved. Seems like a good way to construct the expected_names without just reimplementing it. It's also robust to any changes in basic ordering.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you benefit from this logic. Simpler (and more obviously correct) just to have the expected list of feature names as input to the function.

Copy link
Contributor Author

@jamesmyatt jamesmyatt May 23, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough. I think you're probably right. I don't think I realised how much I'd simplified the test case anyway such that the expected names are now quite short.

@jamesmyatt jamesmyatt changed the title Fixes get_feature_names results when using drop functionality [MRG] Fixes get_feature_names results when using drop functionality May 16, 2019
@jnothman jnothman added this to the 0.21.2 milestone May 21, 2019
Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

expected_names = list(ohe_base.get_feature_names())
for i, cat_drop in enumerate(drop_cats):
feat_drop = "x{}_{}".format(i, cat_drop)
expected_names.remove(feat_drop)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you benefit from this logic. Simpler (and more obviously correct) just to have the expected list of feature names as input to the function.

@jnothman
Copy link
Member

Please add an entry to the change log at doc/whats_new/v0.21.rst under Version 0.21.1. Like the other entries there, please reference this pull request with :issue: and credit yourself (and other contributors if applicable) with :user:

@jnothman jnothman added the Bug label May 21, 2019
@thomasjpfan
Copy link
Member

We are trying to move from using :issue: to :pr: (although they both work)

@jnothman
Copy link
Member

jnothman commented May 21, 2019 via email

@jnothman
Copy link
Member

And fwiw, @jamesmyatt, I'd like to include this in the 0.21.2 release which might occur as soon as Friday to ship a fix for a severe and hidden bug in euclidean distances.

@jnothman
Copy link
Member

@jamesmyatt, I'll wrap this up for you so it can be put into the release

@jamesmyatt
Copy link
Contributor Author

jamesmyatt commented May 23, 2019

@jnothman , Thanks for updating this. I agree it's better like this. I don't think I'd realised how much simpler the expected names are now than they were before.

Is there anything else that you need from me to get this merged? The CI failures look unrelated.

@jnothman
Copy link
Member

jnothman commented May 23, 2019 via email

@jnothman jnothman mentioned this pull request May 23, 2019
@ogrisel ogrisel merged commit e35f040 into scikit-learn:master May 23, 2019
@ogrisel
Copy link
Member

ogrisel commented May 23, 2019

Thank you @jamesmyatt. @jnothman I let you backport this as part of the 0.21.2 release PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants