Handle unseen labels in LabelEncoder #13423


Closed · wants to merge 5 commits

Conversation

@rragundez (Contributor) commented Mar 8, 2019

Reference Issues/PRs

Several issues reference what this PR addresses: #8136 #3599 #9151 #6231.
Nevertheless, the problem remains open: CategoricalEncoder (#9151) does not fix it, as noted in some of those threads, because handling unknowns is not supported for encoding='ordinal', which is exactly the case these issues raise.

What does this implement/fix? Explain your changes.

The problem is that LabelEncoder, as part of a pipeline, only sees a single feature, so it has no way to drop the complete observation (all other features) when it encounters an unknown value; this is also why CategoricalEncoder does not support it. The only workable option is to impute/replace unknown values with some known label. To start, I propose giving the user the option to impute the most common label seen during fitting; a rounded mean value could be added later as another strategy, for example. A sketch of the idea follows below.
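
For illustration, here is a minimal standalone sketch of the proposed most_common behaviour. The class MostCommonLabelEncoder and its internals are hypothetical, written only for this discussion; the real change lives in the PR diff:

import numpy as np

class MostCommonLabelEncoder:
    """Hypothetical sketch of impute_method='most_common'."""

    def fit(self, y):
        y = np.asarray(y)
        # np.unique returns the sorted unique labels and their counts.
        self.classes_, counts = np.unique(y, return_counts=True)
        # Remember the most frequent training label as the fallback.
        self.most_common_ = self.classes_[np.argmax(counts)]
        return self

    def transform(self, y):
        y = np.asarray(y)
        # Replace anything not seen during fit with the fallback label,
        # then map each label to its integer code in the sorted classes.
        y = np.where(np.isin(y, self.classes_), y, self.most_common_)
        return np.searchsorted(self.classes_, y)

encoder = MostCommonLabelEncoder().fit(['A', 'A', 'C', 'B', 'Z', 'C', 'C'])
encoder.transform(['A', 'B', 'SOMETHING'])  # array([0, 1, 2]); 'SOMETHING' imputed as 'C'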

Any other comments?

from sklearn.preprocessing import LabelEncoder
import pandas as pd

df = pd.DataFrame({
    'training_data': ['A', 'A', 'C', 'B', 'Z', 'C', 'C'],
    'test_data': ['A', 'A', 'C', 'B', 'Z', 'C', 'SOMETHING'],
    'training_data_num': [20., 3., 4., 5., 20., 3., 4.],
    'test_data_num': [3, 3, 100, 5, 4, 4, 2]
})

# Current behaviour: transform raises ValueError, as it should, because
# 'SOMETHING' was never seen during fitting.
lbl_encoder = LabelEncoder()
lbl_encoder.fit_transform(df['training_data'])
lbl_encoder.transform(df['test_data'])

# With the proposed change the user can opt in to imputing unknown values
# with the most common label seen during fitting.
lbl_encoder = LabelEncoder(impute_method='most_common')
lbl_encoder.fit_transform(df['training_data'])
lbl_encoder.transform(df['test_data'])

# The same works for numeric labels: values unseen during fitting (such as
# 100) are imputed with the most common training value.
lbl_encoder = LabelEncoder(impute_method='most_common')
lbl_encoder.fit_transform(df['training_data_num'])
lbl_encoder.transform(df['test_data_num'])

Some of the use cases involve ordinal features. In some cases I have also encountered that, for memory reasons, I cannot or don't want to expand into one-hot encoded vectors, so keeping a single ordinal feature is very useful.

@jnothman (Member)

LabelEncoder is not intended for features, but for targets, which is the main reason those previous PRs didn't get seriously reviewed. CategoricalEncoder does not exist in master or any release. So I'm confused about several of your comments.

@rragundez (Contributor, Author) commented Mar 11, 2019

Apologies for the confusion @jnothman. Following the thread of the issues and related PRs I ended up in #9151; I now see that the class was refactored into OneHotEncoder and OrdinalEncoder (I added a comment to that PR about this).

Nevertheless, I believe my point still stands: neither LabelEncoder nor OrdinalEncoder handles unseen labels (using the same dataframe from the description):

from sklearn.preprocessing import OrdinalEncoder

# It raises ValueError, as it should, since it found an unseen label.
ord_encoder = OrdinalEncoder()
ord_encoder.fit_transform(df[['training_data']])
ord_encoder.transform(df[['test_data']])
  • I understand that LabelEncoder is meant for targets, but I still find it useful to give the user an option to do something instead of breaking when LabelEncoder is part of a preprocessing pipeline, don't you think?

  • I think something similar would be useful to implement in OrdinalEncoder; if you agree, I can start taking a look. In the meantime, a user-side workaround is sketched below.
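
For reference, a minimal user-side workaround, assuming the df defined earlier (this helper code is hypothetical, written for this discussion, and not part of the PR): compute the most common training category at fit time and clip unseen test values to it before calling transform.

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Fit as usual, but remember the most common training category as a fallback.
ord_encoder = OrdinalEncoder()
ord_encoder.fit(df[['training_data']].values)
fallback = df['training_data'].mode().iloc[0]  # 'C'

# Replace values unseen during fit with the fallback before transforming.
test = df['test_data'].copy()
known = set(ord_encoder.categories_[0])
test = test.where(test.isin(known), fallback)

ord_encoder.transform(test.to_frame().values)  # no ValueError; 'SOMETHING' gets the code for 'C'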

@rragundez (Contributor, Author) commented Mar 11, 2019

I just saw that you raised a similar point in #11997, where you suggest that the user supplies the value to impute. I considered that myself for this PR, but then the user has to handle the calculation logic outside the pipeline and only afterwards create the pipeline with the computed imputation value; I found it nicer to implement this inside the pipeline itself. That said, I'm willing to do the work to change the proposed behavior, or to support both: the user supplies a value to impute, or the most common value is imputed. The difference between the two options is sketched below. Please let me know, thanks.
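
To make the difference concrete, a small sketch (the impute_value parameter is hypothetical, used only to illustrate the #11997-style option, so those lines are commented out; impute_method='most_common' is what this PR proposes):

import pandas as pd

train = pd.Series(['A', 'A', 'C', 'B', 'Z', 'C', 'C'])

# Option 1 (#11997 style): the user computes the fill value by hand, outside
# the pipeline, before constructing the encoder.
fill_value = train.mode().iloc[0]  # 'C'
# lbl_encoder = LabelEncoder(impute_value=fill_value)  # hypothetical parameter

# Option 2 (this PR): the encoder computes the most common label inside fit(),
# so the pipeline stays self-contained and safe to refit on new data.
# lbl_encoder = LabelEncoder(impute_method='most_common')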

@rragundez closed this May 6, 2019
@rragundez deleted the update-lbl-encoder branch May 7, 2019 08:35