DOC: change CountVectorizer(...lambda..) to OneHotEncoder() in ColumnTransformer examples #13212

Merged 5 commits on Feb 22, 2019
22 changes: 13 additions & 9 deletions doc/modules/compose.rst
@@ -408,16 +408,19 @@ preprocessing or a specific feature extraction method::
... 'user_rating': [4, 5, 4, 3]})

For this data, we might want to encode the ``'city'`` column as a categorical
-variable, but apply a :class:`feature_extraction.text.CountVectorizer
+variable using :class:`preprocessing.OneHotEncoder
+<sklearn.preprocessing.OneHotEncoder>` but apply a
+:class:`feature_extraction.text.CountVectorizer
<sklearn.feature_extraction.text.CountVectorizer>` to the ``'title'`` column.
As we might use multiple feature extraction methods on the same column, we give
each transformer a unique name, say ``'city_category'`` and ``'title_bow'``.
By default, the remaining rating columns are ignored (``remainder='drop'``)::

>>> from sklearn.compose import ColumnTransformer
>>> from sklearn.feature_extraction.text import CountVectorizer
+>>> from sklearn.preprocessing import OneHotEncoder
>>> column_trans = ColumnTransformer(
-... [('city_category', CountVectorizer(analyzer=lambda x: [x]), 'city'),
+... [('city_category', OneHotEncoder(dtype='int'), ['city']),
... ('title_bow', CountVectorizer(), 'title')],
... remainder='drop')
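
The assembled transformer can be exercised end to end. This is only a sketch: the PR's data frame is shown just partially above, so the ``'title'`` values here are invented stand-ins.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for the docs' frame; 'title' values are invented.
X = pd.DataFrame({
    'city': ['London', 'Paris'],
    'title': ['his last bow', 'a moveable feast'],
    'user_rating': [4, 5]})

column_trans = ColumnTransformer(
    [('city_category', OneHotEncoder(dtype='int'), ['city']),
     ('title_bow', CountVectorizer(), 'title')],
    remainder='drop')

# 2 one-hot city columns + 5 bag-of-words columns ('a' is dropped by the
# default token pattern); 'user_rating' is dropped by remainder='drop'.
out = column_trans.fit_transform(X)
print(out.shape)
```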

@@ -428,7 +431,7 @@ By default, the remaining rating columns are ignored (``remainder='drop'``)::

>>> column_trans.get_feature_names()
... # doctest: +NORMALIZE_WHITESPACE +ELLIPSIS
-['city_category__London', 'city_category__Paris', 'city_category__Sallisaw',
+['city_category__x0_London', 'city_category__x0_Paris', 'city_category__x0_Sallisaw',
'title_bow__bow', 'title_bow__feast', 'title_bow__grapes', 'title_bow__his',
'title_bow__how', 'title_bow__last', 'title_bow__learned', 'title_bow__moveable',
'title_bow__of', 'title_bow__the', 'title_bow__trick', 'title_bow__watson',
@@ -443,8 +446,9 @@ By default, the remaining rating columns are ignored (``remainder='drop'``)::

In the above example, the
:class:`~sklearn.feature_extraction.text.CountVectorizer` expects a 1D array as
-input and therefore the columns were specified as a string (``'city'``).
-However, other transformers generally expect 2D data, and in that case you need
+input and therefore the columns were specified as a string (``'title'``).
+However, :class:`preprocessing.OneHotEncoder <sklearn.preprocessing.OneHotEncoder>`,
Member:
You don't need the stuff in the angle brackets

Member:
It was there originally. @jorisvandenbossche, reading those lines yesterday I did not understand why you used this syntax. Any reasons?

Member:
The reason it is written like this is because the "currentmodule" of sphinx is sklearn.pipeline, so I think we need to use the full path to refer to sklearn.preprocessing.OneHotEncoder. What this then does is shorten that in the display to preprocessing.OneHotEncoder (the functionality to link to its docstring page of course stays the same).

Whether this complexity is worth it, that's another question :)

But to be consistent with how the rest of the text is, it can be replaced here with :class:`~sklearn.preprocessing.OneHotEncoder` (which will just display OneHotEncoder).
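
For reference, a sketch of the two cross-reference spellings being discussed; both resolve to the same docstring page, and only the rendered text differs:

```rst
The explicit-title form shortens the displayed path to "preprocessing.OneHotEncoder":

    :class:`preprocessing.OneHotEncoder <sklearn.preprocessing.OneHotEncoder>`

The tilde form displays only the last component, "OneHotEncoder":

    :class:`~sklearn.preprocessing.OneHotEncoder`
```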

+like most other transformers, expects 2D data, and in that case you need
to specify the column as a list of strings (``['city']``).
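
The 1D-vs-2D distinction can be made concrete. In this sketch (hypothetical one-column frame), selecting with a string yields a 1D Series, while selecting with a one-item list yields a 2D frame, which is the shape ``OneHotEncoder`` requires:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical one-column frame illustrating the 1D-vs-2D distinction.
X = pd.DataFrame({'city': ['London', 'London', 'Paris', 'Sallisaw']})

# X['city'] is a 1D Series (what CountVectorizer-style transformers accept);
# X[['city']] is a 2D (4, 1) frame (what OneHotEncoder requires).
assert X['city'].shape == (4,)
assert X[['city']].shape == (4, 1)

enc = OneHotEncoder(dtype='int')
onehot = enc.fit_transform(X[['city']]).toarray()
print(onehot)  # categories are sorted: London, Paris, Sallisaw
```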

Apart from a scalar or a single item list, the column selection can be specified
@@ -457,7 +461,7 @@ We can keep the remaining rating columns by setting
transformation::

>>> column_trans = ColumnTransformer(
-... [('city_category', CountVectorizer(analyzer=lambda x: [x]), 'city'),
+... [('city_category', OneHotEncoder(dtype='int'), ['city']),
... ('title_bow', CountVectorizer(), 'title')],
... remainder='passthrough')
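
With ``remainder='passthrough'``, the untransformed columns are appended after the transformed ones. A minimal sketch, using a hypothetical two-row frame:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical two-row frame; 'user_rating' is passed through untouched.
X = pd.DataFrame({'city': ['London', 'Paris'], 'user_rating': [4, 5]})

ct = ColumnTransformer(
    [('city_category', OneHotEncoder(dtype='int'), ['city'])],
    remainder='passthrough')

# One-hot city columns first, then the passed-through rating column.
out = ct.fit_transform(X)
print(out)
```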

@@ -474,7 +478,7 @@ the transformation::

>>> from sklearn.preprocessing import MinMaxScaler
>>> column_trans = ColumnTransformer(
-... [('city_category', CountVectorizer(analyzer=lambda x: [x]), 'city'),
+... [('city_category', OneHotEncoder(), ['city']),
... ('title_bow', CountVectorizer(), 'title')],
... remainder=MinMaxScaler())

@@ -492,14 +496,14 @@ above example would be::

>>> from sklearn.compose import make_column_transformer
>>> column_trans = make_column_transformer(
-... (CountVectorizer(analyzer=lambda x: [x]), 'city'),
+... (OneHotEncoder(), ['city']),
... (CountVectorizer(), 'title'),
... remainder=MinMaxScaler())
>>> column_trans # doctest: +NORMALIZE_WHITESPACE +ELLIPSIS
ColumnTransformer(n_jobs=None, remainder=MinMaxScaler(copy=True, ...),
sparse_threshold=0.3,
transformer_weights=None,
-transformers=[('countvectorizer-1', ...)
+transformers=[('onehotencoder', ...)
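
The ``'onehotencoder'`` name in the repr above is auto-generated: ``make_column_transformer`` lowercases each estimator's class name (appending ``-1``, ``-2``, ... only when a class appears more than once). A sketch checking the generated names:

```python
from sklearn.compose import make_column_transformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

ct = make_column_transformer(
    (OneHotEncoder(), ['city']),
    (CountVectorizer(), 'title'),
    remainder=MinMaxScaler())

# Names are derived from the lowercased class names.
print([name for name, _, _ in ct.transformers])
```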

.. topic:: Examples:
