[MRG] Label Encoder Unseen Labels #3599
Conversation
I removed "fit_labels" and "new_label_mapping_" because they seem unnecessary if the updates are made directly to classes_.
It seems to me that the simplest way to get this desired behavior from LabelEncoder is to keep the new_labels parameter an int. If it is an int, we use it for any unseen label; if it is None, we use the original behavior and raise a ValueError. Finally, to get the "update" behavior, we accept the string "update" as the value.
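A minimal, self-contained sketch of those semantics (hypothetical: the NewLabelEncoder name and this implementation only illustrate the comment above, they are not scikit-learn API):

import numpy as np
from sklearn.preprocessing import LabelEncoder

class NewLabelEncoder(LabelEncoder):
    # Hypothetical sketch: a LabelEncoder with a new_labels parameter.
    def __init__(self, new_labels=None):
        self.new_labels = new_labels

    def transform(self, y):
        y = np.asarray(y)
        unseen = ~np.isin(y, self.classes_)
        if self.new_labels is None or not unseen.any():
            return super().transform(y)  # None: keep the original ValueError behavior
        if self.new_labels == "update":
            # "update": grow classes_ with the unseen labels (kept sorted here)
            self.classes_ = np.unique(np.concatenate([self.classes_, y[unseen]]))
            return super().transform(y)
        # int: encode seen labels normally, fill unseen ones with the given value
        out = np.full(y.shape, self.new_labels, dtype=int)
        out[~unseen] = super().transform(y[~unseen])
        return out

le = NewLabelEncoder(new_labels=-1).fit(["a", "b"])
print(le.transform(["a", "c"]))  # [ 0 -1]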
I have changed the insertion of the new labels into the classes_ attribute so that classes_ is always sorted. @mjbommar, can you say why you were …
But the output needs to be consistent from one call to the next! I.e. if I use:

>>> le = LabelEncoder(...)
>>> le.fit_transform([1,4])
array([0, 1])
>>> le.transform([1,2])

the result here can't change the meaning of the output 1, nor can a subsequent call with further unseen labels. However, there are runtime cost advantages to keeping classes_ sorted.
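To make the consistency concern concrete: transform is essentially an np.searchsorted lookup over a sorted classes_, so inserting a new class in sorted order shifts existing codes. A small numpy sketch:

import numpy as np

classes = np.array([1, 4])                 # after fit([1, 4])
print(np.searchsorted(classes, [1, 4]))    # [0 1]

classes = np.array([1, 2, 4])              # unseen label 2 inserted in sorted order
print(np.searchsorted(classes, [1, 4]))    # [0 2] -- the code for 4 has changed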
@mjbommar, I'm still not persuaded that "update" is needed. Needing to have an encoding of more classes than are present in the training sample makes sense. But I don't understand the use case where those classes are not known beforehand, or can't be normalised by the user. I'd rather the solution at #1643, which allows a list of classes as a parameter; I don't know why @amueller closed it, except for long-term inactivity.
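For reference, the #1643-style behavior can already be approximated by fitting on the full known vocabulary instead of on the training labels alone; a minimal sketch with made-up class names:

from sklearn.preprocessing import LabelEncoder

all_classes = ["cat", "dog", "fish"]  # full vocabulary, known beforehand
le = LabelEncoder().fit(all_classes)
print(le.transform(["dog", "fish"]))  # [1 2], even if "fish" never occurs in training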
What do you think of having an "update" option?
If it comes with a meaningful example, I have no problem with it. But I don't know whether classes_ should be allowed to change between calls to transform.
The solution with classes as a parameter seems like a reasonable way to offer support for unseen labels while not making many assumptions. If "update" needs to be left out of this PR, maybe we can revisit it when there is a justifying example. For now, though, I will implement the integer default value and the classes parameter.
I am - unsurprisingly - +1 for delaying "update" until it has a clear use case.
@jnothman, @hamsal: TBH, I am personally indifferent w.r.t. this PR, as I have so much proprietary wrapper code written around these use cases already, but it might help to go back to the mailing list, where questions of this type have been repeatedly brought up.
While I do believe there is application for this in an online learning environment, your example of the … So I think such online learning scenarios -- with extensible vocabularies and shape-expandable models -- belong to the realm of external toolkits and user code, at least for now. The solution for …
@hamsal could you rebase? I can try reviewing this...
This will be solved by #9151. I think …
This can be closed now.
CategoricalEncoder doc says: …
What CategoricalEncoder doc? CategoricalEncoder doesn't exist.
Update: CategoricalEncoder briefly existed in a dev version. It was deprecated (it is still importable, but throws an error recommending OrdinalEncoder instead).
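For anyone landing here today, OrdinalEncoder handles unseen categories directly since scikit-learn 0.24; a minimal sketch:

from sklearn.preprocessing import OrdinalEncoder

# handle_unknown/unknown_value were added to OrdinalEncoder in 0.24:
enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
enc.fit([["a"], ["b"]])                # note: OrdinalEncoder expects 2D input
print(enc.transform([["a"], ["c"]]))   # [[ 0.] [-1.]] -- "c" is unseen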
Hi guys, I'm trying to train a video classification network on the UCF101 dataset and I'm getting the same error when the network tries encoding the labels:

Traceback (most recent call last):
…

I know the file I'm providing with the label names is correct. Does anyone have any idea of what could be done to solve this problem?
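Without seeing the code, a common fix for this error is to fit the encoder on every label that can appear at transform time; a sketch with hypothetical UCF101 class names:

import numpy as np
from sklearn.preprocessing import LabelEncoder

train_labels = np.array(["ApplyEyeMakeup", "Archery"])  # hypothetical names
test_labels = np.array(["Archery", "BabyCrawling"])     # "BabyCrawling" is unseen

le = LabelEncoder().fit(train_labels)
# le.transform(test_labels) would raise a ValueError for the unseen label.
# Fitting on the union of all label sources avoids it:
le.fit(np.concatenate([train_labels, test_labels]))
print(le.transform(test_labels))  # [1 2]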
This is a pull request to adopt the work done by @mjbommar at #3483.

This PR intends to make preprocessing.LabelEncoder more friendly for production/pipeline usage by adding a new_labels constructor argument.

Instead of always raising a ValueError for unseen/new labels in transform, LabelEncoder may be initialized with new_labels as:

- None: raise a ValueError for unseen labels, as before (the default)
- an int: assign that integer value to every unseen label
- "update": update classes_ with new IDs [N, ..., N+m-1] for m new labels and assign those IDs to the unseen labels