Skip to content

DOC: change CountVectorizer(...lambda..) to OneHotEncoder() in ColumnTransformer examples #13212

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Feb 22, 2019

Conversation

maikia
Copy link
Contributor

@maikia maikia commented Feb 21, 2019

What does this implement/fix? Explain your changes.

Replaced the CountVectorizer(analyzer=lambda x: [x]) (workaround to do one-hot encoding using the CountVectorizer) with OneHotEncoder(), now this supports string features and get_feature_names.

... ('title_bow', CountVectorizer(), 'title')],
... remainder='drop')

column_trans.fit(X)
column_trans.get_feature_names()
column_trans.transform(X).toarray()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose this is a left-over that can be removed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yup. done

@jorisvandenbossche jorisvandenbossche changed the title changed CountVectorizer(...lambda..) to OneHotEncoder() in few examples of ColumnTransformer DOC: change CountVectorizer(...lambda..) to OneHotEncoder() in ColumnTransformer examples Feb 21, 2019
Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!

>>> column_trans = ColumnTransformer(
... [('city_category', CountVectorizer(analyzer=lambda x: [x]), 'city'),
... [('city_category', OneHotEncoder(dtype='int'),(['city'])),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
... [('city_category', OneHotEncoder(dtype='int'),(['city'])),
... [('city_category', OneHotEncoder(dtype='int'), ['city']),

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@maikia Could accept this suggestion as well. I forgot to add it to the previous one.

Copy link
Member

@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two small changes for PEP8 and removing unnecessary parenthesis.

input and therefore the columns were specified as a string (``'city'``).
However, other transformers generally expect 2D data, and in that case you need
input and therefore the columns were specified as a string (``'title'``).
However, :class:`preprocessing.OneHotEncoder <sklearn.preprocessing.OneHotEncoder>`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need the stuff in the angle brackets

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was there originally. @jorisvandenbossche reading those lines yesterday I did not why you used this syntax? Any reasons?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason it is written like this is because the "currentmodule" of sphinx is sklearn.pipeline, so I think we need to use the full path to refer to sklearn.preprocessing.OneHotEncoder. What this then does is shorten that in the display to preprocessing.OneHotEncoder (the functionality to link to its docstring page of course stays the same).

Whether this complexity is worth it, that's another question :)

But to be consistent with how the rest of the text is, it can be replaced here with :class:`~sklearn.preprocessing.OneHotEncode` (which will just display OneHotEncoder)

glemaitre and others added 2 commits February 22, 2019 11:31
Two small changes for PEP8 and removing unnecessary parenthesis.

Co-Authored-By: maikia <maja_ka@hotmail.com>
Copy link
Member

@qinhanmin2014 qinhanmin2014 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @maikia

@qinhanmin2014 qinhanmin2014 merged commit 2480368 into scikit-learn:master Feb 22, 2019
@glemaitre
Copy link
Member

Thanks, @maikia!!!

jnothman pushed a commit to jnothman/scikit-learn that referenced this pull request Feb 24, 2019
xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019
xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019
xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019
koenvandevelde pushed a commit to koenvandevelde/scikit-learn that referenced this pull request Jul 12, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants