LabelEncoder should add flexibility to future new label #8136
Do you know at training time that there will be more labels in the test? If
so, just fit the LabelEncoder with the full set of labels...?
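That suggestion can be sketched as follows (hypothetical toy data; it assumes all labels that can appear at test time are known when the encoder is fit):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical data: "c" appears only in the test set
train = ["a", "b", "b"]
test = ["a", "c"]

le = LabelEncoder()
# Fit on the union of every label we expect to see
le.fit(np.concatenate([train, test]))
codes = le.transform(test)  # no ValueError: "c" was seen at fit time
```
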
On 30 December 2016 at 12:39, kaiwang wrote:
Description
I used LabelEncoder to transform a categorical feature into a numerical feature. But my test set contains new labels that were not fit in the training set, so LabelEncoder raises a ValueError. I think LabelEncoder should be able to deal with unknown labels, maybe by just assigning len(self.classes_) + 1 to them and updating the current LabelEncoder's self.classes_?
Steps/Code to Reproduce
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
for f in train_cat.columns:
    train_cat[f] = le.fit_transform(train_cat[f])
    test_cat[f] = le.transform(test_cat[f])
Expected Results
Actual Results
Traceback (most recent call last):
  File "insurancev3.py", line 110, in <module>
    test_cat[f] = le.transform(test_cat[f])
  File "/usr/local/python-3.4.4/lib/python3.4/site-packages/sklearn/preprocessing/label.py", line 149, in transform
    raise ValueError("y contains new labels: %s" % str(diff))
ValueError: y contains new labels: ['F']
Versions
>>> import platform; print(platform.platform())
Windows-10-10.0.14393-SP0
>>> import sys; print("Python", sys.version)
Python 3.5.2 |Anaconda 4.2.0 (64-bit)| (default, Jul 5 2016, 11:41:13) [MSC v.1900 64 bit (AMD64)]
>>> import numpy; print("NumPy", numpy.__version__)
NumPy 1.11.1
>>> import scipy; print("SciPy", scipy.__version__)
SciPy 0.18.1
>>> import sklearn; print("Scikit-Learn", sklearn.__version__)
Scikit-Learn 0.18.1
The process will be applied to unknown data in the future. Besides, I think this functionality could be added to LabelEncoder.
okay. see #3599
Hi, I want to work on this. @jnothman, what would be the desirable change for this?
Continue working on #3599, assuming @hamsal isn't working on it. I note
that there was some dispute about this and its interface in previous
incarnations.
@jnothman, I dug up the issue, the pull request, and the views of the people involved. You seemed slightly less supportive of the feature hamsal added. I personally think mjbommar's comment describes a valid use case. What are your views, and what further changes do you think we will need to resolve this?
Shrug. Again, I'd like to see a compelling example for the use of an expandable label set, and @mjbommar's comment isn't clear enough on what learning paradigm one is trying to apply over an expanding set of classes.
I think we can boil this down to one example and a two-part question.

Example: Let's say you have users coding samples, and up to T, you've only seen two categories:

Question(s): Who is responsible for dealing with this, the implementer or sklearn? Is there any "graceful" option in sklearn to recode or ignore, or do we only fail hard?

To be honest, sklearn already requires a significant effort to use in practical, online situations, so the marginal benefit relative to other headaches is small. That said, if we want to provide a better experience for users, I am still strongly in favor of allowing the user to either
@jnothman, I met this issue while doing a kaggle competition. The categorical features in the test set have unseen labels. Of course I can fit on a combined data set, but that feels neither elegant nor true to reality once the model is used in a product. When a future test set has an unseen label, I would expect the model to give me a prediction anyway, with just a new number assigned to that label.
My temporary solution for a pandas dataframe:
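The solution's code was lost in this archive; one common pandas-based workaround (a hypothetical sketch, not necessarily what this commenter did) maps every unseen test label to a single reserved code:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

train = pd.Series(["a", "b", "b", "c"])
test = pd.Series(["a", "d", "c"])  # "d" never appears in train

le = LabelEncoder()
le.fit(train)

# Build an explicit mapping and reserve one extra code for anything unseen
mapping = {label: code for code, label in enumerate(le.classes_)}
unseen_code = len(mapping)
encoded_test = test.map(lambda v: mapping.get(v, unseen_code))
```
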
@jnothman, what are your views on this?
There's no question of how to achieve it. There's a question of how it integrates with the rest of the scikit-learn ecosystem. We currently presume that anything supporting

I appreciate your use case, @mjbommar. Thanks. I still don't get how it integrates with the rest of scikit-learn's estimator provisions (which you hint at) and don't want to provide something that will confuse users by giving them a dead-end. Do you think that for classifiers supporting

I agree our current online learning support is weak. I want to know how to practically make it work.
@jnothman I think that using "hashing" can give us more flexibility. Adding an "expand_classes" method, as you mentioned, can be useful for adding further encodings. I have worked on this idea and submitted a PR. Let me know if we should go in this direction or work on another solution.
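For reference, the hashing idea can be sketched with scikit-learn's existing FeatureHasher, which needs no fitted class list and therefore never raises on unseen categories (toy example; it trades hash collisions for this flexibility):

```python
from sklearn.feature_extraction import FeatureHasher

# Each sample is an iterable of string features; no fit on labels is needed
h = FeatureHasher(n_features=8, input_type="string")
X_train = h.transform([["red"], ["blue"]])
X_test = h.transform([["green"]])  # unseen category: no error, just a hash bucket
```
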
Hello, everyone! I am currently working on an NLP project, in which I create a custom dictionary of the 15000 most common lemmatized words. Then I want to transform the dataset by converting the texts into lists of encoded labels. Obviously, I have missing categories (by design!) and I run into ValueError("y contains new labels: %s" % str(diff)).

Is there any way that I can make LabelEncoder simply treat words that are not among the 15000 most common as 0s or NaNs or whatever else? I get that I could do it "separately", but my dataset is rather big and it would be extremely cumbersome (and it defeats the purpose of having a label encoder, as I would basically have to do the same thing twice!).

Perhaps we could add a parameter to the transform method, such as missing='error' (behaves as it does now by default), where I can state what should be done with missing categories?
I think you want a CountVectorizer, not a LabelEncoder.
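For context, CountVectorizer silently drops words outside its fitted vocabulary rather than raising, which is presumably why it is suggested here (toy sketch):

```python
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()
cv.fit(["red blue red green"])
X = cv.transform(["red purple"])  # "purple" is outside the vocabulary: ignored, no error
```
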
I already have the counts. My dataframe contains, as one of the columns, lists of words for each document. I want to convert them into lists of labels (preserving the order of the words in the document!) to later feed into a keras model for embedding.

# Count the frequency of each word across all documents
freq = collections.Counter()
df["documents_as_lists_of_words"].apply(freq.update)
# Fit the encoder on the 15000 most common words
z = LabelEncoder()
z.fit([x[0] for x in freq.most_common(15000)])
# Use the encoder to convert the lists of words into lists of labels
a = z.transform(df)

This generates the aforementioned error. BTW, I am currently trying to define my own LabelEncoder2(LabelEncoder) and override the transform() method, but unfortunately simply removing the part that produces the error does not work: I get 0s only. Perhaps I would need to tinker with the underlying np.searchsorted to achieve my goal?

Again, I think you want a CountVectorizer.
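The subclassing attempt described above fails because np.searchsorted returns insertion positions, not matches, for values absent from classes_. A hypothetical sketch of a tolerant subclass (not a scikit-learn API) that maps unknowns to a reserved code:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

class TolerantLabelEncoder(LabelEncoder):
    """Hypothetical sketch: map labels unseen at fit time to a reserved code."""
    def transform(self, y):
        y = np.asarray(y)
        known = np.isin(y, self.classes_)
        out = np.full(y.shape, len(self.classes_), dtype=int)  # reserved code for unknowns
        out[known] = np.searchsorted(self.classes_, y[known])  # classes_ is sorted
        return out

le = TolerantLabelEncoder()
le.fit(["a", "b", "c"])
codes = le.transform(["a", "z", "c"])  # "z" maps to the reserved code 3
```
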
The original issue will be solved by #9151.
I'd like to argue this is an important feature to add. Imagine we're training on a large dataset that cannot be loaded into memory all at once, so we train in multiple batches. Say there's a very high-cardinality feature F that can take on thousands of possible values. In each batch we would not see all possible values of F. In this situation it would be very useful to have an "incrementally updating" version of the encoder.
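A minimal sketch of such an incrementally updating encoder (hypothetical, not a scikit-learn class): codes stay stable across batches, whereas refitting a plain LabelEncoder would re-sort classes_ and renumber previously seen labels.

```python
class IncrementalEncoder:
    """Hypothetical sketch: grow the label set batch by batch with stable codes."""
    def __init__(self):
        self.mapping = {}

    def partial_fit_transform(self, labels):
        out = []
        for v in labels:
            if v not in self.mapping:
                self.mapping[v] = len(self.mapping)  # assign the next unused code
            out.append(self.mapping[v])
        return out

enc = IncrementalEncoder()
first = enc.partial_fit_transform(["a", "b"])   # a and b get codes on first sight
second = enc.partial_fit_transform(["b", "c"])  # b keeps its code, c gets a new one
```
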
This can possibly be closed now.
Closing, since after CategoricalEncoder I suppose we are not going to do such things in LabelEncoder. Also, it seems that we are not persuaded to update the categories (see #3599).
I used
and I was able to resolve the issue. It does both fit and transform, so we don't need to worry about unknown values in the test split.
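The tool referenced above was elided from this archive. One scikit-learn facility matching the description (fits, transforms, and tolerates unknown test values) is OneHotEncoder with handle_unknown="ignore", shown here as a plausible reading rather than a confirmed one:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(np.array([["a"], ["b"]]))
X = enc.transform(np.array([["c"]]))  # unseen category: all-zero row, no error
```
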