
Conversation

Nirvan101
Contributor

Reference Issues/PRs
Fixes #10181

What does this implement/fix? Explain your changes.
Added a get_feature_names() method to the CategoricalEncoder class in data.py under sklearn.preprocessing.

@Nirvan101
Contributor Author

I'm getting a TypeError in test_docstring_parameters.py in the tests.
The line giving the error is:

    assert '\t' not in source, ('"%s" has tabs, please remove them ', 'or add it to the ignore list' % modname)

i.e. line 164 in test_tabs() in sklearn/tests/test_docstring_parameters.py.

Is this a bug or is it just my code?

@jnothman
Copy link
Member

jnothman commented Nov 26, 2017 via email

Member

@jnothman jnothman left a comment

This requires a test in sklearn/preprocessing/tests/test_data.py


feature_names = np.concatenate(feature_names)
return feature_names

Member

please do not leave a blank line at the end of the file

@@ -2873,3 +2873,39 @@ def inverse_transform(self, X):
X_tr[mask, idx] = None

return X_tr

def get_feature_names(self, input_features=None):
Member

this should be defined as a method within the class, not as an external function

temp.append(str(t))
cats2.append(np.array(temp))


Member

This line has spaces in it, which should be removed. It also has a tab character.

@Nirvan101 Nirvan101 force-pushed the my-feature branch 2 times, most recently from 39216f0 to 4cd0e96 Compare November 26, 2017 13:21
@Nirvan101 Nirvan101 force-pushed the my-feature branch 4 times, most recently from bca6cdb to d7e6a47 Compare November 27, 2017 10:31
feature_names2 = enc.get_feature_names(['one', 'two',
'three', 'four', 'five'])

assert_array_equal(['one_Female', 'one_Male',
Member

Test failures related to PEP 8 and whitespace around here.

Member

@amueller amueller left a comment

Works but I think the implementation is a bit needlessly complicated.

@@ -2873,3 +2873,37 @@ def inverse_transform(self, X):
X_tr[mask, idx] = None

return X_tr

def get_feature_names(self, input_features=None):
    """
Member

No newline after """


Returns
-------
output_feature_names : list of string, length n_output_features
Member

Maybe a very short doctest example would be illustrative?
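
For reference, a short doctest along those lines might look roughly like this (a sketch only: it uses the class as named in this PR, and the array output assumes the object-array return value settled on later in the review; categories are listed in sorted order):

    >>> from sklearn.preprocessing import CategoricalEncoder
    >>> enc = CategoricalEncoder().fit([['Male', 1], ['Female', 3], ['Female', 2]])
    >>> enc.get_feature_names(['gender', 'group'])
    array(['gender_Female', 'gender_Male', 'group_1', 'group_2', 'group_3'],
          dtype=object)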

Contributor Author

I have added it under the class declaration, please check.

Member

great!

input_features = ['x%d' % i for i in range(len(cats))]

cats2 = []
for i in range(len(cats)):
Member

I don't understand why you iterate over cats twice

t = [input_features[i] + '_' + f for f in cats2[i]]
feature_names.append(t)

feature_names = np.concatenate(feature_names).tolist()
Member

That's a weird way to join lists: you convert them to numpy arrays, concatenate them, and convert them back to lists.

You could probably just do a single list comprehension for everything.
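
For illustration, a single comprehension over both loops might look roughly like this (variable names follow the snippet above; just a sketch, not the final code):

    feature_names = [input_features[i] + '_' + str(t)
                     for i in range(len(cats))
                     for t in cats[i]]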

@Nirvan101 Nirvan101 force-pushed the my-feature branch 2 times, most recently from 1b10208 to 423302d Compare November 28, 2017 22:21
@Nirvan101
Contributor Author

@amueller @jnothman please review. Let me know if any more modifications are needed.

Member

@jnothman jnothman left a comment

This strategy is only applicable if encoding='onehot' or encoding='onehot-dense'

"""
cats = self.categories_
feature_names = []
if input_features is None:
Member

If not, please validate that the length of input_features corresponds to that of self.categories_.
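
A minimal sketch of the kind of check being requested (names follow the diff above; a check along these lines is added later in the PR):

    if input_features is None:
        input_features = ['x%d' % i for i in range(len(self.categories_))]
    elif len(input_features) != len(self.categories_):
        raise ValueError("input_features should have length equal "
                         "to number of features")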

for t in cats[i]:
    temp.append(str(t))
t = [input_features[i] + '_' + f for f in temp]
feature_names.append(t)
Member

@jnothman jnothman Dec 4, 2017

why not just

    feature_names.extend(input_features[i] + '_' + str(t)
                         for t in cats[i])

?

Member

@jnothman jnothman Dec 4, 2017

I also think we should be using unicode here in python 2.

Contributor Author

I've added the two changes you mentioned. Will make a PR soon. I also tried with encoding='ordinal' and it worked with that too.

Contributor Author

I don't understand what you mean by using unicode here. Can you explain that better?

Member

No need to make a new PR, just push more commits to this branch, please.

Member

In Python 2, we need to use unicode rather than str, etc., to allow for non-ASCII strings.
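
A small Python 2 illustration of the concern (the accented category value is a hypothetical example):

    >>> 'x0_' + str(u'café')      # str() falls back to the ascii codec and fails
    UnicodeEncodeError: 'ascii' codec can't encode character ...
    >>> 'x0_' + unicode(u'café')  # unicode() keeps the non-ASCII character intact
    u'x0_caf\xe9'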

Member

@jnothman I am not sure this conversion to unicode is needed.
If a user has unicode categories or passes unicode input features, concatenating them still works as expected in Python 2 (a mixture of str and unicode will just automatically result in unicode).
Not doing this conversion, AFAIK, only means we can end up with a list of both str and unicode values in Python 2.

Member

As long as we test for reasonable behaviour, I'm fine with it!

Member

OK, will do some more testing on this (I actually expect the current tests to fail on python 2, let's see)

commit 11

rebased commit 8,9

commit 10
@Nirvan101
Contributor Author

I've added a check for input_features and also unicode support for Python 2. Please review.

Member

@jorisvandenbossche jorisvandenbossche left a comment

I think this is looking good, added some minor comments.

@Nirvan101 do you have time to update this (in the short term)?
It would be nice to include this in the release.

" length equal to number of features")

def to_unicode(text):
    return text if isinstance(text, unicode) else text.encode('utf8')
Member

I think you can use six.text_type here instead of the unicode (and redefining unicode depending on the version)

'three_boy', 'three_girl',
'four_1', 'four_2', 'four_12', 'four_21',
'five_3', 'five_10', 'five_30'], feature_names2)

Member

Can you also add a test catching the error when the passed input_features are not the correct length or are not strings?
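
Such a test could look roughly like this (shown with pytest.raises for illustration; enc is assumed to be the fitted encoder from the snippet above, and the match string follows the ValueError message in this PR):

    import pytest

    with pytest.raises(ValueError, match="input_features should have length"):
        enc.get_feature_names(['one', 'two'])  # wrong length: the encoder was fit on 5 features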

@jorisvandenbossche
Member

@Nirvan101 do you have time to update this in the short term?
I am also happy to make the final edits based on the comments to get this merged.

@Nirvan101
Contributor Author

@jorisvandenbossche Sorry for the late reply. I'm afraid I don't have the time to work on this currently. Please feel free to take it up.

@jorisvandenbossche
Member

@jorisvandenbossche Sorry for the late reply. I'm afraid I don't have the time to work on this currently. Please feel free to take it up.

Thanks for letting us know!
I already started by merging master to update it with the latest changes. I will also address the small remaining comments; then we can see how to proceed from there.

@jorisvandenbossche jorisvandenbossche changed the title [MRG] Added get_feature_names() to CategoricalEncoder [MRG] Add get_feature_names to OneHotEncoder Jun 27, 2018
@sklearn-lgtm

This pull request introduces 1 alert when merging 4e17c86 into 3b5abf7 - view on LGTM.com

new alerts:

  • 1 for Unused import

Comment posted by LGTM.com

@jorisvandenbossche
Member

jorisvandenbossche commented Jun 27, 2018

@jnothman AFAIK the input_features argument added here is the first time it is added to a get_feature_names. Was there already some consensus that this is a good way forward (for dealing with passing through feature names)?
EDIT: there is quite some discussion about this in #6425

If that is the case, I think this PR is pretty good to go, and it would be nice to include it in 0.20.0, as it is really useful for the OneHotEncoder.

@jorisvandenbossche
Member

Ah, it seems that PolynomialFeatures.get_feature_names has this already as well.

@amueller
Member

@jorisvandenbossche yes, that's the only other place where that was done. We don't really have an API for making this work with pipelines or using pandas column names, but I think the optional input_features provides a convenient user interface for now.

@amueller
Member

@jorisvandenbossche so you're taking this over? It would be great to have!

@jorisvandenbossche
Member

Yes, I can finish this. But I think it is as good as ready (so a new review round is certainly welcome). I just want to check whether we have enough tests for Python 2 / unicode.


for i in range(len(cats)):
    feature_names.extend(to_unicode(
        input_features[i] + '_' + str(t)) for t in cats[i])
Member

Is underscore a good choice? Should we allow users to specify this?

Member

I think it is a reasonable default (at least better than . or = IMO). Do you have other ideas? (double underscore, ..)
I think the option to specify it could be left for another PR (as long as we now choose what we would otherwise have as the default for that option).


Returns
-------
output_feature_names : list of string, length n_output_features
Member

Do we want a list of strings? I think numpy arrays might be better. I hate the fact that CountVectorizer returns lists, because I always need to convert before I can slice.
So it's a question of convenience vs consistency - but I might rather want to change that in CountVectorizer for 1.0 ;)

Member

I am fine with making this a numpy array (of object dtype then). Since we are adding it here to a new class, this seems a good time to do so, if we consider changing it later for CountVectorizer.
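
A minimal sketch of that change (assuming feature_names is the list built in the diff above):

    return np.asarray(feature_names, dtype=object)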

Member

I agree

    input_features = ['x%d' % i for i in range(len(cats))]
elif(len(input_features) != len(self.categories_)):
    raise ValueError("input_features should have"
                     " length equal to number of features")
Member

@amueller amueller Jun 28, 2018

error message should probably contain both values, right?
"input_features should have length equal to number of features {}, got {}".format(len(self.categories_), len(input_features))?

Member

@amueller amueller left a comment

Haven't checked the unicode stuff but otherwise looks good.

@jorisvandenbossche
Member

Updated, and also changed it to return an array.

@amueller
Member

LGTM. You could test the numbers in the error message, but that seems like overkill, I guess.

feature_names = []
for i in range(len(cats)):
    names = [
        input_features[i] + '_' + six.text_type(t) for t in cats[i]]
Member

Would it be appropriate to use __?

Member

I'm fine with using a single underscore. Merge?

@jorisvandenbossche
Member

Would it be appropriate to use __

Andy asked a similar question in #10198 (comment), wondering if there could be a parameter to override the default separator.

I think the best options are _ and __ (we use = in the DictVectorizer, but I would rather not use that since it always turns the result into an invalid Python identifier).
__ bears a resemblance to how parameter names are separated in pipelines, but I haven't come to a conclusion yet on whether that is good or bad :-) (it's in a certain way consistent, but the two are also not exactly the same).
I think _ is a bit nicer to look at, while __ might be more robust against confusion if the input feature name already has an underscore in it.

@jorisvandenbossche
Member

Fixed the merge conflict.

@amueller amueller added this to the 0.20 milestone Jul 16, 2018
@amueller amueller merged commit 61fa315 into scikit-learn:master Jul 16, 2018

Successfully merging this pull request may close these issues.

add get_feature_names to CategoricalEncoder