
[MRG] Label Encoder Unseen Labels #3599


Closed
wants to merge 18 commits into from

Conversation

hamsal
Contributor

@hamsal hamsal commented Aug 28, 2014

This is a pull request to adopt the work done by @mjbommar at #3483

This PR intends to make preprocessing.LabelEncoder more friendly for production/pipeline usage by adding a new_labels constructor argument.

Instead of always raising ValueError for unseen/new labels in transform, LabelEncoder may be initialized with new_labels as:

  • None: current behavior, i.e., raise ValueError (this remains the default)
  • "update": extend classes_ with new IDs [N, ..., N+m-1] for the m new labels and assign them accordingly
  • an integer value: map newly seen labels to this fixed integer value (sketched below)

The PR also adds:

  • a classes parameter to the transform function
  • a classes parameter at initialization
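A minimal sketch of the integer-value behavior, using only the released LabelEncoder API; the transform_with_default helper and the -1 fill value are illustrative, not code from this PR.

import numpy as np
from sklearn.preprocessing import LabelEncoder

# Map labels seen during fit to their usual codes, and map anything
# unseen to a fixed fill value instead of raising ValueError.
def transform_with_default(le, y, new_label=-1):
    classes = np.asarray(le.classes_)
    y = np.asarray(y)
    codes = np.clip(np.searchsorted(classes, y), 0, len(classes) - 1)
    seen = classes[codes] == y
    return np.where(seen, codes, new_label)

le = LabelEncoder().fit(["amsterdam", "paris", "tokyo"])
transform_with_default(le, ["paris", "rome"])  # array([ 1, -1])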

@hamsal hamsal changed the title Label Encoder Unseen Labels [WIP] Label Encoder Unseen Labels Aug 28, 2014
@hamsal
Contributor Author

hamsal commented Aug 30, 2014

I removed "fit_labels" and "new_label_mapping_" because they seem unnecessary if the updates are made directly to classes_

@hamsal
Contributor Author

hamsal commented Aug 30, 2014

It seems to me that the simplest way to get the desired behavior from LabelEncoder is to keep the new_labels parameter an int. If it is an int, we use it for any unseen label; if it is None, we keep the original behavior and raise a ValueError. Finally, to get the "update" behavior, we add a partial_fit, or change fit so that classes_ is updated with new labels.

@coveralls

Coverage Status

Coverage decreased (-0.0%) when pulling 39bd9e8 on hamsal:label-encoder-unseen into d6bfe09 on scikit-learn:master.

@hamsal
Contributor Author

hamsal commented Aug 30, 2014

I have changed how new labels are inserted into the classes_ attribute so that classes_ is always sorted.

@mjbommar can you say why you were expecting the classes in line 437 of the doctest to be ['amsterdam', 'paris', 'tokyo', 'rome'] instead of ['amsterdam', 'paris', 'rome', 'tokyo']?

@coveralls

Coverage Status

Coverage decreased (-0.0%) when pulling 767bea9 on hamsal:label-encoder-unseen into d6bfe09 on scikit-learn:master.

@jnothman
Member

I have changed how new labels are inserted into the classes_ attribute so that classes_ is always sorted.

But the output needs to be consistent from one call to the next! I.e. if I use:

>>> le = LabelEncoder(...)
>>> le.fit_transform([1,4])
array([0, 1])
>>> le.transform([1,2])

the result here can't change the meaning of the output 1, nor can a subsequent call with transform([4]) change the mapping from 4 to 1. The mapping (as known so far) must remain invariant, and there should be a test to ensure this is so.

However, there are runtime cost advantages to keeping classes_ sorted, certainly under the assumption that update is not the usual use-case. But in the update case, you would need another vector of label re-mappings.
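To make the consistency concern concrete, a small numpy illustration (not code from this PR): inserting a new label into a sorted classes_ shifts previously issued codes, while appending keeps them stable at the cost of losing the sort order.

import numpy as np

classes = np.array(["amsterdam", "paris", "tokyo"])   # codes 0, 1, 2
resorted = np.sort(np.append(classes, "rome"))        # amsterdam, paris, rome, tokyo
# "tokyo" was code 2 and is now code 3 -- the earlier mapping changed.
appended = np.append(classes, "rome")                 # amsterdam, paris, tokyo, rome
# Old codes stay valid, but classes_ is no longer sorted, so transform
# could not rely on np.searchsorted alone.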

@jnothman
Member

@mjbommar, I'm still not persuaded that update is a common use-case, or that if it is, the user is applying a LabelEncoder correctly. Given that they aren't meant to function in a Pipeline anyway (see #3112/#3113), I don't understand your comment at #3483 regarding their production/pipeline usage. I get that it might be useful to grow the set of seen labels over time in an online learning environment, but I'm not convinced that this is the domain of the LabelEncoder.

Needing to encode more classes than are present in the training sample makes sense. But I don't understand the use case where those classes are not known beforehand, or can't be normalised by the user. I'd rather the solution at #1643, which allows a list of classes as a parameter; I don't know why @amueller closed it except for long-term inactivity.
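For reference, the effect of a classes parameter can be approximated today by overriding classes_ by hand; this is a workaround rather than a supported API, and it relies on the class list staying sorted.

import numpy as np
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(["paris", "tokyo"])
# Declare the full (sorted) label set up front so that labels absent from
# the training sample can still be encoded by transform.
le.classes_ = np.array(["amsterdam", "paris", "rome", "tokyo"])
le.transform(["rome"])  # array([2])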

@arjoly
Member

arjoly commented Sep 1, 2014

What do you think of having a partial_fit for the "update" case?

@jnothman
Member

jnothman commented Sep 1, 2014

If it comes with a meaningful example, I have no problem with it. But I don't know whether calls to transform need to be consistent between calls to partial_fit.

@hamsal
Contributor Author

hamsal commented Sep 2, 2014

The solution with classes as a parameter seems like a reasonable way to offer support for unseen labels while not making many assumptions. If update needs to be left out of this PR, maybe we can revisit it when there is a justifying example. For now, though, I will implement the integer default value and the classes parameter.

@jnothman
Member

jnothman commented Sep 2, 2014

I am - unsurprisingly - +1 for delaying update until it has a clear use-case.


@coveralls

Coverage Status

Coverage decreased (-0.0%) when pulling 9153fc0 on hamsal:label-encoder-unseen into d6bfe09 on scikit-learn:master.

@coveralls

Coverage Status

Coverage increased (+0.01%) when pulling aaf0425 on hamsal:label-encoder-unseen into d6bfe09 on scikit-learn:master.

@hamsal hamsal changed the title [WIP] Label Encoder Unseen Labels [MRG] Label Encoder Unseen Labels Sep 4, 2014
@mjbommar
Contributor

mjbommar commented Sep 6, 2014

@jnothman , @hamsal , the "update" use case I have is online, and, a priori, classes may not all be known (or sampled per fit). The feature_extraction.text methods are similarly complicated in online scenarios, e.g., where the N+1th document has tokens or n-grams that may be informative but not present in the 1, ..., Nth documents.

TBH, I am personally indifferent w.r.t. this PR as I have so much proprietary wrapper code written around these use cases already, but it might help to go back to the mailing list where questions of this type have been repeatedly brought up.

@jnothman
Member

jnothman commented Sep 6, 2014

While I do believe there is application for this in an online learning environment, your example of the CountVectorizer emphasises that it's not something for which the scikit-learn API is sufficiently clear yet. While FeatureHasher will readily extend to new vocab, and while one can predefine a more expansive vocabulary for CountVectorizer than will be used in fit, a real partial learning application means that successive calls to transform will have different shapes, thus affecting any downstream processes; but there is no means in the API to tell a linear model, for instance, to make a warm start but extend the shape of its coef_ matrix to account for new features or classes.

So I think such online learning scenarios -- with extensible vocabularies and shape-expandable models -- belong to the realm of external toolkits and user code, at least for now. The solution for LabelEncoder is to support at least the facility available for vocabulary expansion in CountVectorizer: pre-specifying an expanded vocabulary (i.e. a classes parameter).
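For comparison, the CountVectorizer facility referred to above looks roughly like this (illustrative vocabulary and documents):

from sklearn.feature_extraction.text import CountVectorizer

# A vocabulary fixed up front may be larger than what the training data
# contains; the output shape is then set by the vocabulary rather than by
# whichever tokens happen to appear during fit.
vocab = ["amsterdam", "paris", "rome", "tokyo"]
vec = CountVectorizer(vocabulary=vocab)
X = vec.transform(["paris tokyo", "rome rome"])  # sparse matrix of shape (2, 4)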

@raghavrv
Member

@hamsal could you rebase? I can try reviewing this...

@amueller
Member

This will be solved by #9151. I think LabelEncoder should not be used for data.

@alanyee
Contributor

alanyee commented Aug 3, 2017

@amueller what should I do in the meantime before #9151 gets merged?

@jorisvandenbossche
Member

This can be closed now that CategoricalEncoder (#9151) is merged.

@qinhanmin2014
Member

Resolved in #9151
@hamsal thanks a lot for your contribution :)

@daniilmaltsev

CategoricalEncoder doc says:
"Ignoring unknown categories is not supported for encoding='ordinal'."
So it actually doesn't support handling unknown categories in LabelEncoder's use case.
Looks like the issue should be reopened.

@jnothman
Member

jnothman commented Aug 9, 2019 via email

@daniilmaltsev

Update: CategoricalEncoder briefly existed in a dev version. It has since been deprecated (it is still importable, but throws an error recommending OrdinalEncoder instead).
OrdinalEncoder doesn't support unseen categories.
So the reason for which this issue was closed is no longer valid (and, actually, was never valid). The issue should be reopened.

@nicholasguimaraes

Hi guys, I'm trying to train a video classification network on the UCF101 dataset and I'm getting the same error when the network tries to encode the labels.

Traceback (most recent call last):
File "UCF101_ResNetCRNN.py", line 168, in
all_y_list = labels2cat(le, actions) # all video labels
File "C:\Users\Windows\Documents\video-classification-master\ResNetCRNN\functions.py", line 14, in labels2cat
return label_encoder.transform(list)
File "C:\Users\Windows\AppData\Roaming\Python\Python36\site-packages\sklearn\preprocessing\label.py", line 256, in transform
_, y = _encode(y, uniques=self.classes_, encode=True)
File "C:\Users\Windows\AppData\Roaming\Python\Python36\site-packages\sklearn\preprocessing\label.py", line 109, in _encode
return _encode_numpy(values, uniques, encode)
File "C:\Users\Windows\AppData\Roaming\Python\Python36\site-packages\sklearn\preprocessing\label.py", line 52, in _encode_numpy
raise ValueError("y contains previously unseen labels: %s"% str(diff))
ValueError: y contains previously unseen labels: ['ableTennisSho', 'abyCrawlin', 'aftin', 'aiCh', 'aircu', 'alanceBea', 'alkingWithDo', 'allPushup', 'alsaSpi', 'ammerThro', 'ammerin', 'andMarchin', 'andstandPushup', 'andstandWalkin', 'arallelBar', 'aseballPitc', 'asketbal', 'asketballDun', 'avelinThro', 'ayakin', 'ceDancin', 'eadMassag', 'enchPres', 'encin', 'ennisSwin', 'havingBear', 'hotpu', 'hrowDiscu', 'ieldHockeyPenalt', 'ighJum', 'ikin', 'ilitaryParad', 'illiard', 'ivin', 'ixin', 'izzaTossin', 'kateBoardin', 'kiin', 'kije', 'kyDivin', 'layingCell', 'layingDa', 'layingDho', 'layingFlut', 'layingGuita', 'layingPian', 'layingSita', 'layingTabl', 'layingVioli', 'leanAndJer', 'liffDivin', 'loorGymnastic', 'lowDryHai', 'lowingCandle', 'nevenBar', 'nittin', 'oY', 'occerJugglin', 'occerPenalt', 'ockClimbingIndoo', 'odyWeightSquat', 'oleVaul', 'olfSwin', 'olleyballSpikin', 'ommelHors', 'ongJum', 'opeClimbin', 'oppingFloo', 'orseRac', 'orseRidin', 'owin', 'owlin', 'oxingPunchingBa', 'oxingSpeedBa', 'pplyEyeMakeu', 'pplyLipstic', 'rampolineJumpin', 'rcher', 'reastStrok', 'ricketBowlin', 'ricketSho', 'risbeeCatc', 'ritingOnBoar', 'rontCraw', 'rummin', 'rushingTeet', 'tillRing', 'ugglingBall', 'ulaHoo', 'ullUp', 'umoWrestlin', 'umpRop', 'umpingJac', 'unc', 'unchuck', 'unge', 'urfin', 'ushUp', 'uttingInKitche', 'win', 'ypin']

I know the file I'm providing with the label names is correct.
For some reason it seems to chop some of the letters from the label names.

Does anyone have any idea of what could be done to solve this problem?
