META OHE / OrdinalEncoder: NaN support, unfrequent cat. and pd.Categorical

Three significant improvements have been proposed for `OneHotEncoder` (and to a lesser extent `OrdinalEncoder`), often with an associated PR:
 - NaN handling (issue https://github.com/scikit-learn/scikit-learn/issues/11996, PR https://github.com/scikit-learn/scikit-learn/pull/13028)
 - support of `pd.Categorical` dtype (issue https://github.com/scikit-learn/scikit-learn/issues/14953, PR https://github.com/scikit-learn/scikit-learn/pull/15396)
 - handling of infrequent categories (issue https://github.com/scikit-learn/scikit-learn/issues/12153, PR https://github.com/scikit-learn/scikit-learn/pull/13833)

The goal of this issue is to reach a high-level agreement on the desired solution, so that it is consistent and compatible across the available encoders, and across encoders that we may want to add in the near future (e.g. target encoder, https://github.com/scikit-learn/scikit-learn/issues/5853). Some of the possible solutions for the above 3 features are mutually exclusive. Also, putting aside backward compatibility constraints for a start, what default options would we ideally want?

I have not followed in detail all past discussions about encoders (in particular about ordering concerns https://github.com/scikit-learn/scikit-learn/pull/15050). The following are some of the observations / open questions I have; please add more if I missed something, or link to existing comments.

**NaN support**

 - The main question is whether we want to implement support directly in the encoders, or suggest using an imputer as a pre-processing step (as one can do now).
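
For reference, the imputer route that is possible today can be sketched as follows. This is only a minimal illustration on a toy column; the `"missing"` fill value and the `color` column are assumptions made for the example, not something proposed in the linked PRs.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data with one missing value (illustrative only).
X = pd.DataFrame({"color": ["red", np.nan, "blue", "red"]})

# Step 1: replace NaN with an explicit placeholder string.
# Step 2: one-hot encode, so the placeholder becomes its own column.
encoder = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="missing"),
    OneHotEncoder(handle_unknown="ignore"),
)
Xt = encoder.fit_transform(X)

# Categories learned by the encoder, including the placeholder.
categories = encoder[-1].categories_[0].tolist()
print(categories)
print(Xt.toarray())
```

With this approach the missing values end up as a regular category, at the cost of an extra pipeline step and a placeholder value that must not collide with a real category.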

**pd.Categorical support**
 - Do we say that `categories='dtype'` would make OHE categories match the categories in the dtype, including the ordering? But then this can only hold at fit time since, at transform time, the test set could contain unknown categories.
 - Actually, if one does a train/test split, some categories from the dtype can be missing from the train set as well. Do we then create a column of 0s in `fit_transform`, or disregard this column, breaking the assumption of conforming to the dtype categories? A solution was proposed by @thomasjpfan in https://github.com/scikit-learn/scikit-learn/pull/15396#issuecomment-552065493
 - Finally, if we do not conform to the dtype categories, and only use `pd.Categorical` for computational efficiency internally, what is the point of defining `categories='dtype'` in the first place (or of warning that the order of categories doesn't match the order of categories in the dtype, https://github.com/scikit-learn/scikit-learn/pull/15050)?
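
The train/test split concern can be illustrated with the existing API by passing the dtype categories explicitly through the `categories` parameter. This is a sketch of what "conforming to the dtype" might look like, not the behavior of the proposed `categories='dtype'` option; the `level` column and its values are made up for the example.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# An ordered categorical dtype with three declared categories.
dtype = pd.CategoricalDtype(categories=["low", "medium", "high"], ordered=True)
df = pd.DataFrame(
    {"level": pd.Series(["low", "high", "low", "medium"], dtype=dtype)}
)

# A "train" split in which the category "medium" never occurs.
train = df.iloc[:3]

# Default behavior: categories are learned from the data and sorted,
# so both the dtype ordering and the unseen category are lost.
learned = OneHotEncoder().fit(train)
learned_categories = learned.categories_[0].tolist()

# Conforming to the dtype: pass its categories (and their order) explicitly.
dtype_categories = [list(train["level"].cat.categories)]
conforming = OneHotEncoder(categories=dtype_categories).fit(train)
Xt = conforming.transform(train).toarray()

print(learned_categories)
print(conforming.categories_[0].tolist())
print(Xt)  # the "medium" column contains only zeros
```

Here the encoder that conforms to the dtype produces an all-zero column for the category absent from the split, which is exactly the trade-off raised in the second bullet.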
 
**Infrequent categories**

Overall the plan seems fairly clear in https://github.com/scikit-learn/scikit-learn/issues/12153#issuecomment-561774199

**NaN & pd.Categorical**

 - Handling `NaN` as a separate category means we are no longer using purely the categories from the dtype (even with `dtype='category'`), which can be fine as long as we agree on it.
 - Another possibility is to implement, say, a `CategoricalPreprocessor` that adds `NaN` as a new dtype category in a separate preprocessing step,
   ```py
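   # Register "NaN" as an explicit category, then map missing values to it.
   # Assignment form: `add_categories` has no `inplace` option in recent pandas.
   df[column] = df[column].cat.add_categories("NaN").fillna("NaN")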
   ```
   which can make interpretation simpler. For instance, `df[column].value_counts()` would then show NaN properly, and one might want to do this for exploratory analysis in any case. 
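
As a small, self-contained illustration of that preprocessing step, the sketch below wraps it in a helper and checks the resulting counts. The function name `add_missing_category` and the toy series are hypothetical; no such helper exists in pandas or scikit-learn.

```python
import numpy as np
import pandas as pd


def add_missing_category(series, label="NaN"):
    """Return a copy of a categorical series where missing values
    are mapped to an explicit category named `label` (hypothetical helper)."""
    return series.cat.add_categories(label).fillna(label)


s = pd.Series(["a", np.nan, "b", "a", np.nan], dtype="category")
s_filled = add_missing_category(s)

# Missing values now show up as a regular category in the counts.
counts = s_filled.value_counts()
print(counts)
print(list(s_filled.cat.categories))
```

After this step an encoder would only need to follow the dtype categories, since the missing-value category is already part of them.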
 
**NaN & infrequent categories**

- Do the infrequent-category rules apply to NaN, or is it always a separate category (even if it meets the infrequent criteria)?

**pd.Categorical and infrequent categories**

- Similarly, we could consider a preprocessor that adds `infrequent` as a category to the dtype, instead of doing that internally in the encoders. For example, to evaluate how many infrequent elements one has, I find that doing (approximately),
  ```py
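  # `categories='dtype'` is the option proposed in this issue, not an existing one.
  ohe = OneHotEncoder(categories='dtype', min_frequency=5)
  X = ohe.fit_transform(df[['col']])  # encoders expect 2D input
  # Locate the output column that gathers the infrequent categories.
  names = ohe.get_feature_names_out().tolist()
  unfrequent_idx = next(i for i, name in enumerate(names) if "infrequent" in name)
  print(X.sum(axis=0).A1[unfrequent_idx])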
  ```
  very awkward as opposed to,
  ```py
  df['col'].value_counts()["infrequent"]
  ```
  where it was properly added to `df['col'].cat.categories` previously and we are using all the nice features of `pd.Categorical`. 
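
A preprocessor of that kind could be sketched in plain pandas as below. The helper name `group_infrequent`, its `min_frequency` argument, and the toy data are assumptions for illustration only; this is not an existing pandas or scikit-learn function, and it assumes the `"infrequent"` label is not already a category.

```python
import pandas as pd


def group_infrequent(series, min_frequency, label="infrequent"):
    """Merge categories seen fewer than `min_frequency` times into `label`
    (hypothetical helper operating on a categorical series)."""
    counts = series.value_counts()
    rare = counts[counts < min_frequency].index.tolist()
    if not rare:
        return series
    # Add the new category first so that assigning it is allowed.
    out = series.cat.add_categories(label)
    out[out.isin(rare)] = label
    # Drop the now-empty rare categories from the dtype.
    return out.cat.remove_categories(rare)


s = pd.Series(list("aaaaabbbbbcdd"), dtype="category")
grouped = group_infrequent(s, min_frequency=3)

print(list(grouped.cat.categories))
print(grouped.value_counts()["infrequent"])
```

With the grouping done up front, the count of infrequent elements is available directly from `value_counts()`, and an encoder would only have to follow the dtype categories.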

So there is some tension here between adding these features to scikit-learn and keeping exploratory analysis with pandas user friendly (and not asking users to implement the same thing twice).

Given the complexity of this interaction, maybe separating "Imputer + Infrequent categories conversion with pd.Categorical support" and "OneHot, Ordinal, Target etc. encoder" into 2 or 3 estimators might be easier to understand? Not sure about usability though. The alternative would mean that we also plan to add these features (and enforce consistency) for any future encoder.

cc @thomasjpfan @NicolasHug @glemaitre @jorisvandenbossche  @amueller @jnothman @ogrisel 
