ENH Adds infrequent categories support to OrdinalEncoder #25677

Merged
merged 29 commits into from
Mar 29, 2023
Commits
a021438
ENH Adds infrequent categories support to OrdinalEncoder
thomasjpfan Feb 23, 2023
d07eec5
DOC Adds PR number
thomasjpfan Feb 23, 2023
4753302
FIX Adds parameter validation
thomasjpfan Feb 23, 2023
2a0ee6f
DOC Adds example with unknown_value and encoded_missing_value
thomasjpfan Feb 25, 2023
84f7a9f
DOC Explain max_categories
thomasjpfan Feb 25, 2023
8d113d8
CLN Better variable name
thomasjpfan Feb 26, 2023
95f333e
CLN Better variable name
thomasjpfan Feb 26, 2023
2f602bc
FIX Fixes issue with edge case
thomasjpfan Feb 26, 2023
da45840
CLN Better variable names
thomasjpfan Feb 26, 2023
0ae6180
CLN Better parameter name
thomasjpfan Feb 27, 2023
0ff894f
CLN Simplify code paths
thomasjpfan Feb 27, 2023
75da3cb
TST Adds test for corner case
thomasjpfan Feb 27, 2023
13f40cf
Apply suggestions from code review
thomasjpfan Feb 27, 2023
38b0e6c
DOC Better wording for user guide
thomasjpfan Feb 27, 2023
410eec3
DOC Adds comment
thomasjpfan Feb 27, 2023
bc2ddc0
Merge remote-tracking branch 'upstream/main' into oridinal_encoder_ma…
thomasjpfan Mar 1, 2023
a1685e4
DOC Address comments
thomasjpfan Mar 13, 2023
399b0ff
Merge remote-tracking branch 'upstream/main' into oridinal_encoder_ma…
thomasjpfan Mar 13, 2023
19fa3f4
FIX Fixes merge issues
thomasjpfan Mar 13, 2023
7cf3dc4
DOC Fix docstrings
thomasjpfan Mar 14, 2023
4aa0754
DOC Formatting
thomasjpfan Mar 14, 2023
8743af8
Merge remote-tracking branch 'upstream/main' into oridinal_encoder_ma…
thomasjpfan Mar 22, 2023
6e1e732
DOC Reword
thomasjpfan Mar 22, 2023
fa0c7ae
Merge remote-tracking branch 'upstream/main' into oridinal_encoder_ma…
thomasjpfan Mar 23, 2023
43ce0c9
FIX Includes required import
thomasjpfan Mar 23, 2023
4bdb513
Merge remote-tracking branch 'upstream/main' into oridinal_encoder_ma…
thomasjpfan Mar 23, 2023
93b8679
FIX Fixes validation framework test
thomasjpfan Mar 23, 2023
1e8f139
Apply suggestions from code review
thomasjpfan Mar 26, 2023
d300fe9
TST Increase coverage
thomasjpfan Mar 26, 2023
49 changes: 43 additions & 6 deletions doc/modules/preprocessing.rst
@@ -729,14 +729,15 @@ separate categories::
See :ref:`dict_feature_extraction` for categorical features that are
represented as a dict, not as scalars.

-.. _one_hot_encoder_infrequent_categories:
+.. _encoder_infrequent_categories:

Infrequent categories
---------------------

-:class:`OneHotEncoder` supports aggregating infrequent categories into a single
-output for each feature. The parameters to enable the gathering of infrequent
-categories are `min_frequency` and `max_categories`.
+:class:`OneHotEncoder` and :class:`OrdinalEncoder` support aggregating
+infrequent categories into a single output for each feature. The parameters to
+enable the gathering of infrequent categories are `min_frequency` and
+`max_categories`.

1. `min_frequency` is either an integer greater than or equal to 1, or a float in
the interval `(0.0, 1.0)`. If `min_frequency` is an integer, categories with
@@ -750,11 +751,47 @@ categories are `min_frequency` and `max_categories`.
input feature. `max_categories` includes the feature that combines
infrequent categories.
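
As a rough illustration of the two selection rules described in this list, the sketch below uses only the standard library. `find_infrequent` is a hypothetical helper, not scikit-learn code. It assumes that an integer `min_frequency` is a count threshold, that a float is a fraction of the number of samples, and that `max_categories` keeps the most frequent categories while reserving one slot for the combined infrequent group. Tie-breaking and the interaction between the two parameters are simplified.

```python
from collections import Counter


def find_infrequent(values, min_frequency=None, max_categories=None):
    """Return the sorted categories that would be grouped as infrequent.

    Hypothetical helper for illustration only (not scikit-learn's code).
    """
    counts = Counter(values)
    n_samples = len(values)
    infrequent = set()

    if min_frequency is not None:
        # Assumption: a float threshold is a fraction of the sample count,
        # an integer threshold is an absolute count.
        if isinstance(min_frequency, float):
            threshold = min_frequency * n_samples
        else:
            threshold = min_frequency
        infrequent |= {cat for cat, n in counts.items() if n < threshold}

    if max_categories is not None and len(counts) > max_categories:
        # Assumption: one output slot is reserved for the infrequent group,
        # so only (max_categories - 1) categories keep their own slot.
        n_keep = max_categories - 1
        ranked = sorted(counts, key=lambda cat: (-counts[cat], cat))
        infrequent |= set(ranked[n_keep:])

    return sorted(infrequent)


pets = ["dog"] * 5 + ["cat"] * 20 + ["rabbit"] * 10 + ["snake"] * 3
by_count = find_infrequent(pets, min_frequency=6)
by_rank = find_infrequent(pets, max_categories=3)
```

With the 38 samples above, both calls select `'dog'` and `'snake'`, which matches the categories reported as infrequent in the documentation example that follows.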

-In the following example, the categories, `'dog', 'snake'` are considered
-infrequent::
+In the following example with :class:`OrdinalEncoder`, the categories `'dog'` and
+`'snake'` are considered infrequent::

>>> X = np.array([['dog'] * 5 + ['cat'] * 20 + ['rabbit'] * 10 +
... ['snake'] * 3], dtype=object).T
+>>> enc = preprocessing.OrdinalEncoder(min_frequency=6).fit(X)
+>>> enc.infrequent_categories_
+[array(['dog', 'snake'], dtype=object)]
+>>> enc.transform(np.array([['dog'], ['cat'], ['rabbit'], ['snake']]))
+array([[2.],
+       [0.],
+       [1.],
+       [2.]])
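
The code assignment visible in this output can be sketched in plain Python. `ordinal_codes` is a hypothetical helper, not the library's implementation. It assumes that frequent categories receive consecutive codes in sorted order and that all infrequent categories share the next code, which is consistent with the output shown here.

```python
def ordinal_codes(categories, infrequent):
    """Map categories to integer codes, one shared code for infrequent ones.

    Hypothetical helper for illustration only (not scikit-learn's code).
    """
    frequent = sorted(cat for cat in categories if cat not in infrequent)
    # Frequent categories get codes 0..k-1 in sorted order.
    codes = {cat: code for code, cat in enumerate(frequent)}
    # Every infrequent category shares the single code k.
    shared_code = len(frequent)
    for cat in sorted(infrequent):
        codes[cat] = shared_code
    return codes


codes = ordinal_codes(["cat", "dog", "rabbit", "snake"], {"dog", "snake"})
encoded = [codes[value] for value in ["dog", "cat", "rabbit", "snake"]]
```

Here `encoded` is `[2, 0, 1, 2]`, the same codes as the `transform` output above (which returns them as floats).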
+
+:class:`OrdinalEncoder`'s `max_categories` does **not** take into account missing
+or unknown categories. Setting `unknown_value` or `encoded_missing_value` to an
+integer will increase the number of unique integer codes by one each. This can
+result in up to `max_categories + 2` integer codes. In the following example,
+"a" and "d" are considered infrequent and grouped together into a single
+category, "b" and "c" are their own categories, unknown values are encoded as 3
+and missing values are encoded as 4.
+
+>>> X_train = np.array(
+... [["a"] * 5 + ["b"] * 20 + ["c"] * 10 + ["d"] * 3 + [np.nan]],
+... dtype=object).T
+>>> enc = preprocessing.OrdinalEncoder(
+... handle_unknown="use_encoded_value", unknown_value=3,
+... max_categories=3, encoded_missing_value=4)
+>>> _ = enc.fit(X_train)
+>>> X_test = np.array([["a"], ["b"], ["c"], ["d"], ["e"], [np.nan]], dtype=object)
+>>> enc.transform(X_test)
+array([[2.],
+       [0.],
+       [1.],
+       [2.],
+       [3.],
+       [4.]])
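
The `max_categories + 2` count can be checked with a small standard-library sketch. `encode` is a hypothetical helper and the `codes` mapping is written out by hand to mirror the example above. Neither is taken from scikit-learn.

```python
import math


def encode(value, codes, unknown_value, encoded_missing_value):
    """Encode one value, with separate codes for missing and unknown inputs.

    Hypothetical helper for illustration only (not scikit-learn's code).
    """
    # Missing values (None or NaN) get their own dedicated code.
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return encoded_missing_value
    # Categories never seen during fitting fall back to unknown_value.
    return codes.get(value, unknown_value)


max_categories = 3
# "b" and "c" keep their own codes; "a" and "d" share the infrequent code.
codes = {"b": 0, "c": 1, "a": 2, "d": 2}
test_values = ["a", "b", "c", "d", "e", float("nan")]
encoded = [encode(v, codes, unknown_value=3, encoded_missing_value=4)
           for v in test_values]
n_codes = len(set(encoded))
```

The known categories use `max_categories` codes (0, 1, 2), and the unknown and missing codes add one each, so `n_codes` is `max_categories + 2`.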
+
+Similarly, :class:`OneHotEncoder` can be configured to group together infrequent
+categories::
+
>>> enc = preprocessing.OneHotEncoder(min_frequency=6, sparse_output=False).fit(X)
>>> enc.infrequent_categories_
[array(['dog', 'snake'], dtype=object)]
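
For comparison with the ordinal case, the following sketch shows one way a grouped one-hot row could be built. `one_hot_row` is a hypothetical helper. It assumes that frequent categories occupy columns in sorted order and that the infrequent group takes the last column. This column layout is an assumption for illustration, not something shown in this excerpt.

```python
def one_hot_row(value, frequent, infrequent):
    """Build a one-hot row with a single trailing column for infrequent values.

    Hypothetical helper for illustration only (not scikit-learn's code).
    """
    columns = sorted(frequent)
    # One column per frequent category plus one shared infrequent column.
    row = [0] * (len(columns) + 1)
    if value in infrequent:
        row[-1] = 1
    elif value in columns:
        row[columns.index(value)] = 1
    else:
        raise ValueError(f"unknown category: {value!r}")
    return row


rows = [one_hot_row(v, {"cat", "rabbit"}, {"dog", "snake"})
        for v in ["dog", "cat", "rabbit", "snake"]]
```

Both `'dog'` and `'snake'` activate the same trailing column, so four input categories produce three output columns.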
5 changes: 5 additions & 0 deletions doc/whats_new/v1.3.rst
@@ -381,6 +381,11 @@ Changelog
:pr:`24935` by :user:`Seladus <seladus>`, :user:`Guillaume Lemaitre <glemaitre>`, and
:user:`Dea María Léon <deamarialeon>`, :pr:`25257` by :user:`Gleb Levitski <glevv>`.

+- |Feature| :class:`preprocessing.OrdinalEncoder` now supports grouping
+infrequent categories into a single feature. Grouping infrequent categories
+is enabled by specifying how to select infrequent categories with
+`min_frequency` or `max_categories`. :pr:`25677` by `Thomas Fan`_.
+
- |Fix| :class:`AdditiveChi2Sampler` is now stateless.
The `sample_interval_` attribute is deprecated and will be removed in 1.5.
:pr:`25190` by :user:`Vincent Maladière <Vincent-Maladiere>`.