[MRG] Label Encoder Unseen Labels #3599
Conversation
I removed "fit_labels" and "new_label_mapping_" because they seem unnecessary if the updates are made directly to classes_.
It seems to me that the simplest way to get this desired behavior from LabelEncoder is to keep the new_labels parameter an int. If it is an int, we use it for any unseen label; if it is None, we use the original behavior and raise a ValueError. Finally, to get the "update" behavior, we accept the string "update" as the value.
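A minimal, self-contained sketch of those semantics (hypothetical: the NewLabelEncoder name and this implementation only illustrate the comment above, they are not scikit-learn API):

import numpy as np
from sklearn.preprocessing import LabelEncoder

class NewLabelEncoder(LabelEncoder):
    # Hypothetical sketch: a LabelEncoder with a new_labels parameter.
    def __init__(self, new_labels=None):
        self.new_labels = new_labels

    def transform(self, y):
        y = np.asarray(y)
        unseen = ~np.isin(y, self.classes_)
        if self.new_labels is None or not unseen.any():
            return super().transform(y)  # None: keep the original ValueError behavior
        if self.new_labels == "update":
            # "update": grow classes_ with the unseen labels (kept sorted here)
            self.classes_ = np.unique(np.concatenate([self.classes_, y[unseen]]))
            return super().transform(y)
        # int: encode seen labels normally, fill unseen ones with the given value
        out = np.full(y.shape, self.new_labels, dtype=int)
        out[~unseen] = super().transform(y[~unseen])
        return out

le = NewLabelEncoder(new_labels=-1).fit(["a", "b"])
print(le.transform(["a", "c"]))  # [ 0 -1]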
I have changed the insertion of the new labels into the classes_ attribute so that classes_ is always sorted. @mjbommar, can you say why you were …
But the output needs to be consistent from one call to the next! I.e. if I use:

>>> le = LabelEncoder(...)
>>> le.fit_transform([1,4])
array([0, 1])
>>> le.transform([1,2])

the result here can't change the meaning of the output 1, nor can a subsequent call with further unseen labels. However, there are runtime cost advantages to keeping classes_ sorted.
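To make the consistency concern concrete: transform is essentially an np.searchsorted lookup over a sorted classes_, so inserting a new class in sorted order shifts existing codes. A small numpy sketch:

import numpy as np

classes = np.array([1, 4])                 # after fit([1, 4])
print(np.searchsorted(classes, [1, 4]))    # [0 1]

classes = np.array([1, 2, 4])              # unseen label 2 inserted in sorted order
print(np.searchsorted(classes, [1, 4]))    # [0 2] -- the code for 4 has changed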
@mjbommar, I'm still not persuaded that "update" is needed. Needing to have an encoding of more classes than are present in the training sample makes sense. But I don't understand the use case where those classes are not known beforehand, or can't be normalised by the user. I'd rather the solution at #1643, which allows a list of classes as a parameter; I don't know why @amueller closed it, except for long-term inactivity.
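For reference, the #1643-style behavior can already be approximated by fitting on the full known vocabulary instead of on the training labels alone; a minimal sketch with made-up class names:

from sklearn.preprocessing import LabelEncoder

all_classes = ["cat", "dog", "fish"]  # full vocabulary, known beforehand
le = LabelEncoder().fit(all_classes)
print(le.transform(["dog", "fish"]))  # [1 2], even if "fish" never occurs in training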
What do you think of having an "update" option?
If it comes with a meaningful example, I have no problem with it. But I don't know whether classes_ should be allowed to change between calls to transform.
The solution with classes as a parameter seems like a reasonable way to offer support for unseen labels while not making many assumptions. If "update" needs to be left out of this PR, maybe we can revisit it when there is a justifying example. For now, though, I will implement the integer default value and the classes parameter.
I am - unsurprisingly - +1 for delaying "update" until it has a clear use case.
@jnothman, @hamsal: TBH, I am personally indifferent w.r.t. this PR, as I have so much proprietary wrapper code written around these use cases already, but it might help to go back to the mailing list, where questions of this type have been repeatedly brought up.
While I do believe there is application for this in an online learning environment, your example of the … So I think such online learning scenarios -- with extensible vocabularies and shape-expandable models -- belong to the realm of external toolkits and user code, at least for now. The solution for …
@hamsal could you rebase? I can try reviewing this...
This will be solved by #9151. I think …
This can be closed now.
CategoricalEncoder doc says: …
What CategoricalEncoder doc? CategoricalEncoder doesn't exist.
Update: CategoricalEncoder briefly existed in a dev version. It was deprecated (it is still importable, but throws an error recommending OrdinalEncoder instead).
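For anyone landing here today, OrdinalEncoder handles unseen categories directly since scikit-learn 0.24; a minimal sketch:

from sklearn.preprocessing import OrdinalEncoder

# handle_unknown/unknown_value were added to OrdinalEncoder in 0.24:
enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
enc.fit([["a"], ["b"]])                # note: OrdinalEncoder expects 2D input
print(enc.transform([["a"], ["c"]]))   # [[ 0.] [-1.]] -- "c" is unseen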
Hi guys, I'm trying to train a video classification network on the UCF101 dataset and I'm getting the same error when the network tries encoding the labels:

Traceback (most recent call last):
…

I know the file I'm providing with the label names is correct. Does anyone have any idea of what could be done to solve this problem?
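Without seeing the code, a common fix for this error is to fit the encoder on every label that can appear at transform time; a sketch with hypothetical UCF101 class names:

import numpy as np
from sklearn.preprocessing import LabelEncoder

train_labels = np.array(["ApplyEyeMakeup", "Archery"])  # hypothetical names
test_labels = np.array(["Archery", "BabyCrawling"])     # "BabyCrawling" is unseen

le = LabelEncoder().fit(train_labels)
# le.transform(test_labels) would raise a ValueError for the unseen label.
# Fitting on the union of all label sources avoids it:
le.fit(np.concatenate([train_labels, test_labels]))
print(le.transform(test_labels))  # [1 2]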
This is a pull request to adopt the work done by @mjbommar at #3483.

This PR intends to make preprocessing.LabelEncoder more friendly for production/pipeline usage by adding a new_labels constructor argument.

Instead of always raising a ValueError for unseen/new labels in transform, LabelEncoder may be initialized with new_labels as:

- None: raise a ValueError for unseen labels, as before (the default)
- an int: assign that integer value to every unseen label
- "update": update classes_ with new IDs [N, ..., N+m-1] for m new labels and assign those IDs to the unseen labels