Handle unseen labels in `LabelEncoder` #13423

rragundez · 2019-03-08T17:36:57Z

Reference Issues/PRs

Several issues and PRs reference the problem this PR addresses: #8136 #3599 #9151 #6231
Nevertheless, the problem is still open: CategoricalEncoder (#9151) does not fix it, contrary to what some of the threads say. Handling unknowns is currently not supported for encoding='ordinal', which is the problem mentioned in some of these issues.

What does this implement/fix? Explain your changes.

The problem is that LabelEncoder, as part of a pipeline, only handles a single feature, so it has no way to discard the complete observation (all other features) when it encounters an unknown value. That is why CategoricalEncoder does not support this. The only solution is to impute the unknown values with some known value. As a start, I propose giving the user the option to impute the most common label seen during fitting (impute_method='most_common'). The rounded mean value could be another option later.

Any other comments?

from sklearn.preprocessing import LabelEncoder
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'training_data': ['A', 'A', 'C', 'B', 'Z', 'C', 'C'],
    'test_data': ['A', 'A', 'C', 'B', 'Z', 'C', 'SOMETHING'],
    'training_data_num': [20., 3., 4., 5., 20., 3., 4.],
    'test_data_num': [3, 3, 100, 5, 4, 4, 2]
})

# it raises ValueError as it should since it found an unseen label
lbl_encoder = LabelEncoder()
lbl_encoder.fit_transform(df['training_data'])
lbl_encoder.transform(df['test_data'])

# with the new changes the user has the option of imputing the unknown values
lbl_encoder = LabelEncoder(impute_method='most_common')
lbl_encoder.fit_transform(df['training_data'])
lbl_encoder.transform(df['test_data'])

lbl_encoder = LabelEncoder(impute_method='most_common')
lbl_encoder.fit_transform(df['training_data_num'])
lbl_encoder.transform(df['test_data_num'])

Some of the use cases involve ordinal features. In some cases I have also encountered memory concerns where I cannot or don't want to expand to one-hot encoded vectors, so keeping an ordinal feature is very useful.
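
The imputation behavior proposed above can be illustrated without scikit-learn. This is only a minimal sketch in plain Python: the class name `MostCommonLabelEncoder` is hypothetical, and it is a stand-in that mimics the idea (sorted classes mapped to integer codes, unseen labels replaced by the most frequent training label), not the code in this PR.

```python
from collections import Counter


class MostCommonLabelEncoder:
    """Hypothetical stand-in for illustration; not scikit-learn's LabelEncoder."""

    def fit(self, labels):
        labels = list(labels)
        # Sorted unique labels receive the integer codes 0..n-1.
        self.classes_ = sorted(set(labels))
        self._codes = {label: i for i, label in enumerate(self.classes_)}
        # Remember the most frequent training label for later imputation.
        self.most_common_ = Counter(labels).most_common(1)[0][0]
        return self

    def transform(self, labels):
        # Unseen labels fall back to the code of the most common label.
        fallback = self._codes[self.most_common_]
        return [self._codes.get(label, fallback) for label in labels]


encoder = MostCommonLabelEncoder().fit(['A', 'A', 'C', 'B', 'Z', 'C', 'C'])
encoded = encoder.transform(['A', 'A', 'C', 'B', 'Z', 'C', 'SOMETHING'])
print(encoded)  # [0, 0, 2, 1, 3, 2, 2]
```

Here 'C' is the most frequent training label, so the unseen 'SOMETHING' is encoded with the code for 'C'.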

jnothman · 2019-03-10T22:40:56Z

LabelEncoder is not intended for features, but for targets, which is the main reason those previous PRs didn't get seriously reviewed. CategoricalEncoder does not exist in master or any release. So I'm confused about several of your comments.

rragundez · 2019-03-11T06:24:13Z

Apologies for the confusion @jnothman. I was just following the thread of the issue and the related PRs and ended up in #9151. I see that the class has since been refactored into OneHotEncoder and OrdinalEncoder (I added a comment to that PR about that).

Nevertheless, I believe my comment and the issue still stand: neither LabelEncoder nor OrdinalEncoder handles unseen labels (using the same dataframe from the description):

from sklearn.preprocessing import OrdinalEncoder

# it raises ValueError as it should since it found an unseen label
lbl_encoder = OrdinalEncoder()
lbl_encoder.fit_transform(df[['training_data']])
lbl_encoder.transform(df[['test_data']])

I understand that LabelEncoder is meant for targets, but I still find it useful to give the user an option to do something instead of breaking when LabelEncoder is part of a preprocessing pipeline, don't you think?
I think something similar would be useful to implement in OrdinalEncoder; if you agree, I can start taking a look.

rragundez · 2019-03-11T06:31:29Z

I just saw that you raised a similar point in #11997, where you suggest that the user supplies the value to impute for missing labels. I thought about that myself for this PR, but then the user would have to calculate the imputation value outside the pipeline and only then create the pipeline with it. I found it nicer to implement something in the pipeline itself. I'm willing to do the work to change the proposed behavior, or to support both: the user gives a value to impute, or the most common value is imputed. Please let me know, thanks.
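
The alternative discussed above, where the user supplies the replacement label, can be sketched as follows. The function name `encode_with_fill` and its parameters are hypothetical and exist only for illustration. They are not part of this PR or of scikit-learn.

```python
def encode_with_fill(labels, classes, fill_label):
    """Encode labels as integer codes, mapping unseen labels to fill_label.

    Hypothetical helper: `classes` is the sorted list of labels seen during
    fitting, and `fill_label` is chosen by the user and must be one of them.
    """
    codes = {label: i for i, label in enumerate(classes)}
    if fill_label not in codes:
        raise ValueError(f"fill_label {fill_label!r} was not seen during fitting")
    fill_code = codes[fill_label]
    return [codes.get(label, fill_code) for label in labels]


train = ['A', 'A', 'C', 'B', 'Z', 'C', 'C']
test = ['A', 'A', 'C', 'B', 'Z', 'C', 'SOMETHING']

# The user decides on the replacement label before building the pipeline.
classes = sorted(set(train))
encoded_fill = encode_with_fill(test, classes, fill_label='A')
print(encoded_fill)  # [0, 0, 2, 1, 3, 2, 0]
```

The trade-off is the one described in the comment: the caller has to decide on `fill_label` before transforming, whereas a built-in most-common option derives it during fitting.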

rragundez added 5 commits March 8, 2019 17:23

Add impute value option to LabelEncoder

c906171

Update docs with impute_method and impute_value

a4e6e3b

Add test for impute_method most_common

8bcbcfe

Create imput_value None by default

60b1f3a

Update docs and __init__ pattern

af9c6a9

rragundez closed this May 6, 2019

rragundez deleted the update-lbl-encoder branch May 7, 2019 08:35
