Handle unseen labels in LabelEncoder
#13423
Closed
Reference Issues/PRs
Several issues reference what this PR addresses: #8136, #3599, #9151, #6231. Nevertheless, the problem is still open, because CategoricalEncoder (#9151) does not fix the issue as some of the threads claim it does. Handling unknowns is not currently supported for encoding='ordinal', which is the problem mentioned in some of these issues.

What does this implement/fix? Explain your changes.
The problem is that LabelEncoder, as part of a pipeline, handles only a single feature, so it has no way to discard the complete observation (all the other features) when it encounters an unknown value. Hence the lack of support in CategoricalEncoder. The only solution is to impute, that is, to replace the unknown values with some known value. To start, I propose giving the user the option to impute the most_common label seen during fitting. The rounded mean value could be another option later.

Any other comments?
One use case is ordinal features. I have also run into cases where, for memory reasons, I cannot or do not want to expand to one-hot encoded vectors, so keeping an ordinal feature is very useful.
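
To make the proposed most_common behaviour concrete, here is a minimal pure-Python sketch. It is not the code in this PR and not scikit-learn API: the class name MostCommonLabelEncoder and the attribute most_common_ are hypothetical names chosen for illustration. It assumes labels are sortable and hashable, and that ties for the most common label go to the label seen first during fitting.

```python
from collections import Counter


class MostCommonLabelEncoder:
    """Hypothetical encoder: unseen labels map to the most common fitted label."""

    def fit(self, labels):
        labels = list(labels)
        if not labels:
            raise ValueError("cannot fit on an empty sequence")
        # Sorted unique labels get consecutive integer codes, as LabelEncoder does.
        self.classes_ = sorted(set(labels))
        self._code = {label: i for i, label in enumerate(self.classes_)}
        # Assumption: ties are broken by first occurrence in the training data.
        self.most_common_ = Counter(labels).most_common(1)[0][0]
        return self

    def transform(self, labels):
        # Any label not seen during fit falls back to the most common label's code.
        fallback = self._code[self.most_common_]
        return [self._code.get(label, fallback) for label in labels]


encoder = MostCommonLabelEncoder().fit(["low", "high", "low", "medium"])
encoded = encoder.transform(["high", "unknown", "medium"])
```

Here "unknown" was never seen during fitting, so it receives the code of "low", the most frequent training label, and the observation stays in the dataset instead of raising an error.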