[MRG] Add get_feature_names to OneHotEncoder #10198
Conversation
I'm getting a … i.e. line 164 in … Is this a bug or is it just my code?
Apparently there is a tab character in your additions. There should not be. Remove it, please!
This requires a test in sklearn/preprocessing/tests/test_data.py
sklearn/preprocessing/data.py (Outdated)
    feature_names = np.concatenate(feature_names)
    return feature_names
please do not leave a blank line at the end of the file
sklearn/preprocessing/data.py (Outdated)
@@ -2873,3 +2873,39 @@ def inverse_transform(self, X):
        X_tr[mask, idx] = None

        return X_tr

def get_feature_names(self, input_features=None):
this should be defined as a method within the class, not as an external function
sklearn/preprocessing/data.py (Outdated)
            temp.append(str(t))
        cats2.append(np.array(temp))
This line has spaces in it, which should be removed. It also has a tab character.
Force-pushed from 39216f0 to 4cd0e96
Force-pushed from bca6cdb to d7e6a47
    feature_names2 = enc.get_feature_names(['one', 'two',
                                            'three', 'four', 'five'])

    assert_array_equal(['one_Female', 'one_Male',
test failures related to pep8 and whitespace around here.
Works but I think the implementation is a bit needlessly complicated.
sklearn/preprocessing/data.py (Outdated)
@@ -2873,3 +2873,37 @@ def inverse_transform(self, X):
        X_tr[mask, idx] = None

        return X_tr

def get_feature_names(self, input_features=None):
    """
No newline after """
sklearn/preprocessing/data.py (Outdated)

    Returns
    -------
    output_feature_names : list of string, length n_output_features
Maybe a very short doctest example would be illustrative?
I have added it under the class declaration, please check.
great!
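For reference, the kind of short doctest being suggested might look roughly like this (a sketch only: it uses the OneHotEncoder naming from the PR title and the default 'x0', 'x1', ... prefixes and '_' separator from the diff; at this stage of the PR the class is still CategoricalEncoder in data.py, so the example actually added may differ):

```python
>>> from sklearn.preprocessing import OneHotEncoder
>>> enc = OneHotEncoder()
>>> X = [['Male', 1], ['Female', 3], ['Female', 2]]
>>> _ = enc.fit(X)
>>> list(enc.get_feature_names())
['x0_Female', 'x0_Male', 'x1_1', 'x1_2', 'x1_3']
>>> list(enc.get_feature_names(['gender', 'group']))
['gender_Female', 'gender_Male', 'group_1', 'group_2', 'group_3']
```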
sklearn/preprocessing/data.py (Outdated)
        input_features = ['x%d' % i for i in range(len(cats))]

    cats2 = []
    for i in range(len(cats)):
I don't understand why you iterate over cats twice
sklearn/preprocessing/data.py (Outdated)
        t = [input_features[i] + '_' + f for f in cats2[i]]
        feature_names.append(t)

    feature_names = np.concatenate(feature_names).tolist()
That's a weird way to join lists: you convert them to numpy arrays, concatenate them, and convert them back to lists. You could probably just do a single list comprehension for everything.
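A standalone illustration of that suggestion (the hard-coded categories_ and input_features below just stand in for the estimator attributes in the diff above):

```python
# Everything in one pass: prefix each category with its input feature name.
categories_ = [['Female', 'Male'], [1, 2, 3]]
input_features = ['x0', 'x1']

feature_names = [name + '_' + str(cat)
                 for name, cats in zip(input_features, categories_)
                 for cat in cats]
# ['x0_Female', 'x0_Male', 'x1_1', 'x1_2', 'x1_3']
```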
Force-pushed from 1b10208 to 423302d
This strategy is only applicable if encoding='onehot' or encoding='onehot-dense'
sklearn/preprocessing/data.py (Outdated)
    """
    cats = self.categories_
    feature_names = []
    if input_features is None:
If not, please validate that input_features's length corresponds to self.categories_.
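A minimal sketch of that validation, written as a hypothetical helper (_check_input_features is an illustrative name, not code from the PR; the exact error message is settled later in this thread):

```python
def _check_input_features(input_features, categories):
    # Fall back to generated names x0, x1, ... when none are given,
    # otherwise require exactly one name per encoded feature.
    if input_features is None:
        return ['x%d' % i for i in range(len(categories))]
    if len(input_features) != len(categories):
        raise ValueError("input_features should have length equal to "
                         "number of features {}, got {}".format(
                             len(categories), len(input_features)))
    return list(input_features)
```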
sklearn/preprocessing/data.py (Outdated)
        for t in cats[i]:
            temp.append(str(t))
        t = [input_features[i] + '_' + f for f in temp]
        feature_names.append(t)
why not just feature_names.extend(input_features[i] + '_' + str(t) for t in cats[i])?
I also think we should be using unicode here in python 2.
I've added the 2 changes you mentioned. Will make a PR soon. I also tried with encoding='ordinal' and it worked with that too.
I don't understand what you mean by using unicode here. Can you explain that better?
No need to make a new PR, just push more commits to this branch, please.
In Python 2, we need to use unicode, not str, etc., to allow for non-ASCII strings.
@jnothman I am not sure this conversion to unicode is needed.
If a user has unicode categories or passes unicode input features, concatenating them still works as expected in Python 2 (a mixture of str and unicode will just automatically be unicode as a result).
Not doing this conversion AFAIK only means we can end up with a list of both string and unicode values in python 2.
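An interpreter session illustrating the Python 2 behaviour both comments refer to (Python 2 only; shown as a sketch):

```python
>>> 'x0' + '_' + u'Jos\xe9'     # str + unicode: result is promoted to unicode
u'x0_Jos\xe9'
>>> str(u'Jos\xe9')             # but str() on non-ASCII unicode fails
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 3: ordinal not in range(128)
```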
As long as we test for reasonable behaviour, I'm fine with it!
OK, will do some more testing on this (I actually expect the current tests to fail on python 2, let's see)
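The kind of Python 2 / unicode test being talked about could look roughly like this (a sketch; the test name, location, and exact categories are illustrative):

```python
import numpy as np
from numpy.testing import assert_array_equal
from sklearn.preprocessing import OneHotEncoder


def test_feature_names_unicode():
    # A non-ASCII category value should survive get_feature_names
    # on both Python 2 and Python 3.
    X = np.array([[u'c\xe0t1', u'dat2']], dtype=object).T
    enc = OneHotEncoder().fit(X)
    assert_array_equal([u'x0_c\xe0t1', u'x0_dat2'], enc.get_feature_names())
```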
Force-pushed from b4f49e1 to 68e3f92
Force-pushed from 68e3f92 to a0e91d7
I've added a check for input_features and also unicode support for Python 2. Please review.
I think this is looking good, added some minor comments.
@Nirvan101 do you have time to update this in the short term? It would be nice to include this in the release.
sklearn/preprocessing/data.py (Outdated)
                         " length equal to number of features")

    def to_unicode(text):
        return text if isinstance(text, unicode) else text.encode('utf8')
I think you can use six.text_type here instead of the unicode (and redefining unicode depending on the version).
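A sketch of that suggestion (six is assumed to be importable, e.g. via the copy scikit-learn bundled at the time as sklearn.externals.six):

```python
import six

# six.text_type is unicode on Python 2 and str on Python 3, so no
# version-dependent to_unicode helper is needed.
cats_i = [u'Jos\xe9', 'Male']
names = ['x0' + '_' + six.text_type(t) for t in cats_i]
# Python 2: [u'x0_Jos\xe9', u'x0_Male']; Python 3: ['x0_José', 'x0_Male']
```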
                        'three_boy', 'three_girl',
                        'four_1', 'four_2', 'four_12', 'four_21',
                        'five_3', 'five_10', 'five_30'], feature_names2)
Can you also add a test catching the error if the passed input_features are not the correct length or not strings?
@Nirvan101 do you have time to update this in the short term?
@jorisvandenbossche Sorry for the late reply. I'm afraid I don't have the time to work on this currently. Please feel free to take it up.
Thanks for letting us know!
This pull request introduces 1 alert when merging 4e17c86 into 3b5abf7 - view on LGTM.com. New alerts: …
Comment posted by LGTM.com
Force-pushed from 4e17c86 to f1ce05b
@jnothman AFAIK the added … If that is the case, I think this PR is pretty good to go, and would be nice to include in 0.20.0, as it is really useful for the OneHotEncoder.
Ah, it seems that …
@jorisvandenbossche yes, that's the only other place where that was done. We don't really have an API for making this work with pipelines or using pandas column names, but I think the optional …
@jorisvandenbossche so you're taking this over? It would be great to have!
Yes, I can finish this. But I think it is as good as ready (so a new review round is certainly welcome). I just want to check if we have enough tests for Python 2 / unicode.
sklearn/preprocessing/_encoders.py (Outdated)

        for i in range(len(cats)):
            feature_names.extend(to_unicode(
                input_features[i] + '_' + str(t)) for t in cats[i])
Is underscore a good choice? Should we allow users to specify this?
I think it is a reasonable default (at least better than . or = IMO). Do you have other ideas? (double underscore, ...)
I think the option to specify it could be left for another PR (as long as we now choose what we would otherwise have as the default for that option).
sklearn/preprocessing/_encoders.py (Outdated)

        Returns
        -------
        output_feature_names : list of string, length n_output_features
Do we want a list of strings? I think numpy arrays might be better. I hate the fact that CountVectorizer returns lists because I always need to convert before I can slice.
So it's a question of convenience vs consistency - but I might rather want to change that in CountVectorizer for 1.0 ;)
I am fine with making this numpy arrays (of object dtype then). Since we are adding it here to a new class, it seems a good time to do this, if we consider changing it later for CountVectorizer.
I agree
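Concretely, the change being agreed on amounts to something like the following at the end of get_feature_names (a small standalone sketch; slicing is the convenience mentioned above):

```python
import numpy as np

feature_names = ['gender_Female', 'gender_Male', 'group_1']  # example output names
# Return an object-dtype array instead of a plain list so the result
# can be sliced and boolean-masked like any other numpy array.
feature_names = np.array(feature_names, dtype=object)
print(feature_names[[0, 2]])   # ['gender_Female' 'group_1']
```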
sklearn/preprocessing/_encoders.py (Outdated)
            input_features = ['x%d' % i for i in range(len(cats))]
        elif(len(input_features) != len(self.categories_)):
            raise ValueError("input_features should have"
                             " length equal to number of features")
error message should probably contain both values, right?
"input_features should have length equal to number of features {}, got {}".format(len(self.categories_), len(input_features))?
Haven't checked the unicode stuff but otherwise looks good.
Updated, and also changed to return an array.
LGTM. You could test the numbers in the error message, but that seems overkill I guess.
        feature_names = []
        for i in range(len(cats)):
            names = [
                input_features[i] + '_' + six.text_type(t) for t in cats[i]]
Would it be appropriate to use __?
I'm fine with using single underscore. Merge?
Andy asked a similar question in #10198 (comment), wondering if there could be a parameter to override the default separator. I think the best options are …
Fixed the merge conflict.
Reference Issues/PRs
Fixes #10181
What does this implement/fix? Explain your changes.
Added function get_feature_names() to the CategoricalEncoder class. This is in data.py under sklearn.preprocessing.