[MRG] Add get_feature_names to OneHotEncoder #10198
Conversation
I'm getting a … i.e. line 164 in … Is this a bug or is it just my code?
Apparently there is a tab character in your additions. There should not be. Remove it, please!
This requires a test in sklearn/preprocessing/tests/test_data.py
sklearn/preprocessing/data.py (Outdated)
    feature_names = np.concatenate(feature_names)
    return feature_names
please do not leave a blank line at the end of the file
sklearn/preprocessing/data.py (Outdated)
@@ -2873,3 +2873,39 @@ def inverse_transform(self, X):
        X_tr[mask, idx] = None

        return X_tr

def get_feature_names(self, input_features=None):
this should be defined as a method within the class, not as an external function
sklearn/preprocessing/data.py (Outdated)
            temp.append(str(t))
        cats2.append(np.array(temp))
This line has spaces in it, which should be removed. It also has a tab character.
Force-pushed from 39216f0 to 4cd0e96
Force-pushed from bca6cdb to d7e6a47
    feature_names2 = enc.get_feature_names(['one', 'two',
                                            'three', 'four', 'five'])

    assert_array_equal(['one_Female', 'one_Male',
test failures related to pep8 and whitespace around here.
Works but I think the implementation is a bit needlessly complicated.
sklearn/preprocessing/data.py (Outdated)
@@ -2873,3 +2873,37 @@ def inverse_transform(self, X):
        X_tr[mask, idx] = None

        return X_tr

def get_feature_names(self, input_features=None):
    """
No newline after """
sklearn/preprocessing/data.py (Outdated)

    Returns
    -------
    output_feature_names : list of string, length n_output_features
Maybe a very short doctest example would be illustrative?
I have added it under the class declaration, please check.
great!
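For reference, the kind of short doctest being suggested might look roughly like this (a sketch only: it uses the OneHotEncoder naming from the PR title and the default 'x0', 'x1', ... prefixes and '_' separator from the diff; at this stage of the PR the class is still CategoricalEncoder in data.py, so the example actually added may differ):

```python
>>> from sklearn.preprocessing import OneHotEncoder
>>> enc = OneHotEncoder()
>>> X = [['Male', 1], ['Female', 3], ['Female', 2]]
>>> _ = enc.fit(X)
>>> list(enc.get_feature_names())
['x0_Female', 'x0_Male', 'x1_1', 'x1_2', 'x1_3']
>>> list(enc.get_feature_names(['gender', 'group']))
['gender_Female', 'gender_Male', 'group_1', 'group_2', 'group_3']
```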
sklearn/preprocessing/data.py (Outdated)
        input_features = ['x%d' % i for i in range(len(cats))]

    cats2 = []
    for i in range(len(cats)):
I don't understand why you iterate over cats twice
sklearn/preprocessing/data.py (Outdated)
        t = [input_features[i] + '_' + f for f in cats2[i]]
        feature_names.append(t)

    feature_names = np.concatenate(feature_names).tolist()
That's a weird way to join lists: you convert them to numpy arrays, concatenate them, and convert them back to lists. You could probably just do a single list comprehension for everything.
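A standalone illustration of that suggestion (the hard-coded categories_ and input_features below just stand in for the estimator attributes in the diff above):

```python
# Everything in one pass: prefix each category with its input feature name.
categories_ = [['Female', 'Male'], [1, 2, 3]]
input_features = ['x0', 'x1']

feature_names = [name + '_' + str(cat)
                 for name, cats in zip(input_features, categories_)
                 for cat in cats]
# ['x0_Female', 'x0_Male', 'x1_1', 'x1_2', 'x1_3']
```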
Force-pushed from 1b10208 to 423302d
This strategy is only applicable if encoding='onehot' or encoding='onehot-dense'
sklearn/preprocessing/data.py (Outdated)
    """
    cats = self.categories_
    feature_names = []
    if input_features is None:
If not, please validate that input_features's length corresponds to self.categories_.
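A minimal sketch of that validation, written as a hypothetical helper (_check_input_features is an illustrative name, not code from the PR; the exact error message is settled later in this thread):

```python
def _check_input_features(input_features, categories):
    # Fall back to generated names x0, x1, ... when none are given,
    # otherwise require exactly one name per encoded feature.
    if input_features is None:
        return ['x%d' % i for i in range(len(categories))]
    if len(input_features) != len(categories):
        raise ValueError("input_features should have length equal to "
                         "number of features {}, got {}".format(
                             len(categories), len(input_features)))
    return list(input_features)
```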
sklearn/preprocessing/data.py (Outdated)
        for t in cats[i]:
            temp.append(str(t))
        t = [input_features[i] + '_' + f for f in temp]
        feature_names.append(t)
why not just feature_names.extend(input_features[i] + '_' + str(t) for t in cats[i])?
I also think we should be using unicode here in python 2.
I've added the 2 changes you mentioned. Will make a PR soon. I also tried with encoding='ordinal' and it worked with that too.
I don't understand what you mean by using unicode here. Can you explain that better?
No need to make a new PR, just push more commits to this branch, please.
In Python 2, we need to use unicode, not str, etc., to allow for non-ASCII strings.
@jnothman I am not sure this conversion to unicode is needed.
If a user has unicode categories or passes unicode input features, concatenating them still works as expected in Python 2 (a mixture of str and unicode will just automatically be unicode as a result).
Not doing this conversion AFAIK only means we can end up with a list of both string and unicode values in python 2.
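An interpreter session illustrating the Python 2 behaviour both comments refer to (Python 2 only; shown as a sketch):

```python
>>> 'x0' + '_' + u'Jos\xe9'     # str + unicode: result is promoted to unicode
u'x0_Jos\xe9'
>>> str(u'Jos\xe9')             # but str() on non-ASCII unicode fails
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 3: ordinal not in range(128)
```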
As long as we test for reasonable behaviour, I'm fine with it!
OK, will do some more testing on this (I actually expect the current tests to fail on python 2, let's see)
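The kind of Python 2 / unicode test being talked about could look roughly like this (a sketch; the test name, location, and exact categories are illustrative):

```python
import numpy as np
from numpy.testing import assert_array_equal
from sklearn.preprocessing import OneHotEncoder


def test_feature_names_unicode():
    # A non-ASCII category value should survive get_feature_names
    # on both Python 2 and Python 3.
    X = np.array([[u'c\xe0t1', u'dat2']], dtype=object).T
    enc = OneHotEncoder().fit(X)
    assert_array_equal([u'x0_c\xe0t1', u'x0_dat2'], enc.get_feature_names())
```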
Force-pushed from b4f49e1 to 68e3f92
Force-pushed from 68e3f92 to a0e91d7
I've added a check for input_features and also unicode support for Python 2. Please review.
I think this is looking good, added some minor comments.
@Nirvan101 do you have time to update this in the short term? It would be nice to include this in the release.
sklearn/preprocessing/data.py (Outdated)
                         " length equal to number of features")

    def to_unicode(text):
        return text if isinstance(text, unicode) else text.encode('utf8')
I think you can use six.text_type here instead of the unicode (and redefining unicode depending on the version).
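A sketch of that suggestion (six is assumed to be importable, e.g. via the copy scikit-learn bundled at the time as sklearn.externals.six):

```python
import six

# six.text_type is unicode on Python 2 and str on Python 3, so no
# version-dependent to_unicode helper is needed.
cats_i = [u'Jos\xe9', 'Male']
names = ['x0' + '_' + six.text_type(t) for t in cats_i]
# Python 2: [u'x0_Jos\xe9', u'x0_Male']; Python 3: ['x0_José', 'x0_Male']
```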
                        'three_boy', 'three_girl',
                        'four_1', 'four_2', 'four_12', 'four_21',
                        'five_3', 'five_10', 'five_30'], feature_names2)
Can you also add a test catching the error if the passed input_features are not the correct length or not strings?
@Nirvan101 do you have time to update this in the short term?
@jorisvandenbossche Sorry for the late reply. I'm afraid I don't have the time to work on this currently. Please feel free to take it up.
Thanks for letting us know!
This pull request introduces 1 alert when merging 4e17c86 into 3b5abf7 - view on LGTM.com. New alerts: …
Comment posted by LGTM.com
Force-pushed from 4e17c86 to f1ce05b
@jnothman AFAIK the added … If that is the case, I think this PR is pretty good to go, and would be nice to include in 0.20.0, as it is really useful for the OneHotEncoder.
Ah, it seems that …
@jorisvandenbossche yes, that's the only other place where that was done. We don't really have an API for making this work with pipelines or using pandas column names, but I think the optional …
@jorisvandenbossche so you're taking this over? It would be great to have!
Yes, I can finish this. But I think it is as good as ready (so a new review round is certainly welcome). I just want to check if we have enough tests for Python 2 / unicode.
sklearn/preprocessing/_encoders.py (Outdated)

        for i in range(len(cats)):
            feature_names.extend(to_unicode(
                input_features[i] + '_' + str(t)) for t in cats[i])
Is underscore a good choice? Should we allow users to specify this?
I think it is a reasonable default (at least better than . or = IMO). Do you have other ideas? (double underscore, ...)
I think the option to specify it could be left for another PR (as long as we now choose what we would otherwise have as the default for that option).
sklearn/preprocessing/_encoders.py (Outdated)

        Returns
        -------
        output_feature_names : list of string, length n_output_features
Do we want a list of strings? I think numpy arrays might be better. I hate the fact that CountVectorizer returns lists because I always need to convert before I can slice.
So it's a question of convenience vs consistency - but I might rather want to change that in CountVectorizer for 1.0 ;)
I am fine with making this numpy arrays (of object dtype then). Since we are adding it here to a new class, it seems a good time to do this, if we consider changing it later for CountVectorizer.
I agree
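Concretely, the change being agreed on amounts to something like the following at the end of get_feature_names (a small standalone sketch; slicing is the convenience mentioned above):

```python
import numpy as np

feature_names = ['gender_Female', 'gender_Male', 'group_1']  # example output names
# Return an object-dtype array instead of a plain list so the result
# can be sliced and boolean-masked like any other numpy array.
feature_names = np.array(feature_names, dtype=object)
print(feature_names[[0, 2]])   # ['gender_Female' 'group_1']
```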
sklearn/preprocessing/_encoders.py (Outdated)
            input_features = ['x%d' % i for i in range(len(cats))]
        elif(len(input_features) != len(self.categories_)):
            raise ValueError("input_features should have"
                             " length equal to number of features")
error message should probably contain both values, right?
"input_features should have length equal to number of features {}, got {}".format(len(self.categories_), len(input_features))?
Haven't checked the unicode stuff but otherwise looks good.
Updated, and also changed to return an array.
LGTM. You could test the numbers in the error message, but that seems overkill I guess.
        feature_names = []
        for i in range(len(cats)):
            names = [
                input_features[i] + '_' + six.text_type(t) for t in cats[i]]
Would it be appropriate to use __?
I'm fine with using single underscore. Merge?
Andy asked a similar question in #10198 (comment), wondering if there could be a parameter to override the default separator. I think the best options are …
Fixed the merge conflict.
Reference Issues/PRs
Fixes #10181
What does this implement/fix? Explain your changes.
Added function get_feature_names() to the CategoricalEncoder class. This is in data.py under sklearn.preprocessing.