
LabelEncoder should add flexibility to future new label #8136


Closed
kaiwang0112006 opened this issue Dec 30, 2016 · 22 comments

Comments


kaiwang0112006 commented Dec 30, 2016

Description

I used LabelEncoder to transform a categorical feature into a numerical feature, but my test set contains labels that do not appear in the training set, so LabelEncoder raises a ValueError. I think LabelEncoder should be able to deal with unknown labels, perhaps by assigning len(self.classes_) + 1 to each new label and updating the encoder's self.classes_?

Steps/Code to Reproduce

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
for i, f in enumerate(train_cat.columns):
    train_cat[f] = le.fit_transform(train_cat[f])
    test_cat[f] = le.transform(test_cat[f])  # fails on labels unseen in train

Expected Results

Actual Results

Traceback (most recent call last):
  File "insurancev3.py", line 110, in <module>
    test_cat[f] = le.transform(test_cat[f])
  File "/usr/local/python-3.4.4/lib/python3.4/site-packages/sklearn/preprocessing/label.py", line 149, in transform
    raise ValueError("y contains new labels: %s" % str(diff))
ValueError: y contains new labels: ['F']

Versions

>>> import platform; print(platform.platform())
Windows-10-10.0.14393-SP0
>>> import sys; print("Python", sys.version)
Python 3.5.2 |Anaconda 4.2.0 (64-bit)| (default, Jul  5 2016, 11:41:13) [MSC v.1900 64 bit (AMD64)]
>>> import numpy; print("NumPy", numpy.__version__)
NumPy 1.11.1
>>> import scipy; print("SciPy", scipy.__version__)
SciPy 0.18.1
>>> import sklearn; print("Scikit-Learn", sklearn.__version__)
Scikit-Learn 0.18.1
Member

jnothman commented Dec 30, 2016 via email

Author

kaiwang0112006 commented Dec 30, 2016

The pipeline will be applied to unknown data in the future. Besides, I think this functionality could be added to LabelEncoder.

Member

jnothman commented Dec 30, 2016 via email

@devanshdalal
Contributor

Hi, I want to work on this. @jnothman, what would be the desirable change for this?

Member

jnothman commented Dec 30, 2016 via email

Contributor

devanshdalal commented Jan 1, 2017

@jnothman, I dug up the issue, the pull request, and the views of the people involved. You seemed somewhat unsupportive of the feature hamsal added. I personally think mjbommar's comment describes a real use case. What are your views, and what further changes do you think we need to resolve this?

Member

jnothman commented Jan 2, 2017

Shrug. Again I'd like to see a compelling example for the use of an expandable label set, and @mjbommar's comment isn't clear enough on what learning paradigm one is trying to apply over an expanding set of classes. The feature_extraction comparison is poor, as all but feature hashing discard features unseen at fit time. Scikit-learn does not currently handle expanding feature spaces except by hashing. It does not handle expanding class sets except by pre-specifying them. Can someone please give a clear example of how you would use such a label encoder in a machine learning context?

Contributor

mjbommar commented Jan 4, 2017

I think we can boil this down to one example and a two-part question.

Example: Let's say you have users coding samples, and up to T, you've only seen two categories: ["red", "blue"]. On T+1, all of a sudden, a user observes a new encoding value: ["green"].

Question(s): Who is responsible for dealing with this - the implementer or sklearn? Is there any "graceful" option in sklearn to recode or ignore, or do we only fail hard?

To be honest, sklearn already requires a significant effort to use in practical, online situations, so the marginal benefit relative to other headaches is small.

That said, if we want to provide a better experience for users, I am still strongly in favor of allowing the user to either nan/None or update flexibly. It's such a trivial change from a software perspective relative to many other things we have done that I don't understand the opposition.
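The two "graceful" options above (recode as missing, or update flexibly) can be sketched with a plain dict-based encoder. This is a hypothetical illustration, not part of scikit-learn's API:

```python
# Hypothetical sketch of the two "graceful" options: map unseen labels
# to None, or expand the mapping on the fly.
def encode(values, mapping, unseen="none"):
    out = []
    for v in values:
        if v in mapping:
            out.append(mapping[v])
        elif unseen == "expand":
            mapping[v] = len(mapping)  # assign the next free code
            out.append(mapping[v])
        else:
            out.append(None)  # recode unseen values as missing
    return out

mapping = {"red": 0, "blue": 1}  # categories seen up to time T
print(encode(["red", "green"], dict(mapping)))            # [0, None]
print(encode(["red", "green"], dict(mapping), "expand"))  # [0, 2]
```

Either behaviour is a one-line policy decision once the encoder exposes the choice.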

@kaiwang0112006
Author

@jnothman, I met this issue while doing a kaggle competition. The categorical features in the test set have unseen labels. Of course I can fit on a combined data set, but I feel that is neither elegant nor realistic when the model is used in production. When a future test set has unseen labels, I would expect the model to give me a prediction anyway, with a new number simply assigned to the label.
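The "fit on a combined data set" workaround mentioned above might look like this (with hypothetical label values):

```python
from sklearn.preprocessing import LabelEncoder

train = ["M", "F", "M"]  # hypothetical training labels
test = ["F", "U"]        # "U" does not appear in train

le = LabelEncoder()
le.fit(train + test)     # fit on the union of train and test labels
print(le.transform(test).tolist())  # [0, 2]
```

This avoids the ValueError, but requires the test data to be available at fit time, which is exactly what is not true in production.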

Author

kaiwang0112006 commented Jan 4, 2017

My temporary solution for pandas dataframe:

import copy

class label_encoder(object):
    def fit_pd(self, df, cols=[]):
        '''
        Fit all columns in the df, or a specific list of columns.
        Generates a dict:
        {feature1: {label1: 1, label2: 2}, feature2: {label1: 1, label2: 2}, ...}
        '''
        if len(cols) == 0:
            cols = df.columns
        self.class_index = {}
        for f in cols:
            uf = df[f].unique()
            self.class_index[f] = {item: index for index, item in enumerate(uf, start=1)}

    def fit_transform_pd(self, df, cols=[]):
        '''
        Fit all columns in the df, or a specific list of columns,
        and return an updated dataframe.
        '''
        self.fit_pd(df, cols)
        return self.transform_pd(df, cols)

    def transform_pd(self, df, cols=[]):
        '''
        Transform all columns in the df, or a specific list of columns,
        from label to index; return an updated dataframe.
        '''
        newdf = copy.deepcopy(df)
        if len(cols) == 0:
            cols = df.columns
        for f in cols:
            if f in self.class_index:
                newdf[f] = df[f].apply(lambda d: self.update_label(f, d))
        return newdf

    def update_label(self, f, x):
        '''
        Map the label to its index; if not found in the dict,
        add it and update the dict.
        '''
        try:
            return self.class_index[f][x]
        except KeyError:
            self.class_index[f][x] = max(self.class_index[f].values()) + 1
            return self.class_index[f][x]

Contributor

devanshdalal commented Jan 4, 2017

@jnothman, what are your views on this?

Member

jnothman commented Jan 4, 2017

There's no question of how to achieve it. There's a question of how it integrates with the rest of the scikit-learn ecosystem. We currently presume that anything supporting partial_fit has the set of classes pre-specified in the first call to partial_fit. No provided estimators, to my knowledge, will cope if you run them in a warm start mode and present an extra class: the coef_ will remain the old width and errors will ensue.

I appreciate your use case, @mjbommar. Thanks. I still don't get how it integrates with the rest of scikit-learn's estimator provisions (which you hint at) and don't want to provide something that will confuse users by giving them a dead-end.

Do you think that for classifiers supporting partial_fit or warm_start we need an expand_classes method which adds a further encoding, extends a coef_ with random or zero initialisation, etc?

I agree our current online learning support is weak. I want to know how to practically make it work.
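The expand_classes step floated above might look roughly like this. This is a pure-NumPy sketch of the idea (zero-initialised extension of a linear model's parameters), not an existing scikit-learn API:

```python
import numpy as np

def expand_classes(coef, intercept, classes, new_class):
    """Hypothetical sketch: widen a fitted linear model for one new class."""
    coef = np.vstack([coef, np.zeros((1, coef.shape[1]))])  # zero-init row
    intercept = np.append(intercept, 0.0)
    classes = np.append(classes, new_class)
    return coef, intercept, classes

coef = np.ones((2, 3))       # 2 classes, 3 features (toy values)
intercept = np.zeros(2)
classes = np.array([0, 1])
coef, intercept, classes = expand_classes(coef, intercept, classes, 2)
print(coef.shape)  # (3, 3)
```

The hard part jnothman points to is not this mechanical step but making every partial_fit / warm_start estimator cope with the widened coef_ consistently.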

Contributor

tzano commented Jan 6, 2017

@jnothman
The scenario mentioned by @mjbommar explains the issue clearly. It's not always practical to ignore new labels, especially if you want to capture these new categories.

I think that using "hashing" can give us more flexibility. Adding an "expand_classes" method, as you mentioned, can be useful for adding a further encoding. I have worked on this idea and submitted a PR. Let me know if we should go in this direction or work on another solution.


FrugoFruit90 commented Jul 26, 2017

Hello, everyone!

I am currently working on an NLP project, in which I create a custom dictionary of 15000 most common lemmatized words. Then I want to transform the dataset by converting the texts into lists of encoded labels.

Obviously, I have missing categories (by design!) and I run into ValueError("y contains new labels: %s" % str(diff)). Is there any way that I can make LabelEncoder simply treat words that are not among the 15000 most common as 0s or NaNs or whatever else?

I get that I could do it "separately", but my dataset is rather big and that would be extremely cumbersome (and it defeats the purpose of having a label encoder, as I would basically have to do the same thing twice!). Perhaps we could add a parameter to the transform method, e.g. missing='error' (the default, behaving as it does now), where I can state what should be done with missing categories?

Member

jnothman commented Jul 26, 2017 via email


FrugoFruit90 commented Jul 26, 2017

I already have the counts. My dataframe contains, as one of the columns, lists of words for each document. I want to convert them into lists of labels (preserving the order of the words in the document!) to later feed into a keras model for embedding.

import collections
from sklearn.preprocessing import LabelEncoder

# Count the frequency of each word in all of my documents
freq = collections.Counter()
df["documents_as_lists_of_words"].apply(freq.update)

# Convert that into labels:
z = LabelEncoder()
z.fit([x[0] for x in freq.most_common(15000)])

# Use the encoder to convert the lists of words into lists of labels
a = z.transform(df)  # raises ValueError: y contains new labels

This generates the aforementioned error.

BTW, I am currently trying to define my own LabelEncoder2(LabelEncoder) and override the transform() method, but unfortunately simply removing the part that produces the error does not work - I get only 0s. Perhaps I need to tinker with the underlying np.searchsorted to achieve my goal?
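One way such an override could work, sticking with np.searchsorted and reserving code 0 for out-of-vocabulary words, is sketched below. The class name and the "shift by one" convention are hypothetical choices, not sklearn's API:

```python
import numpy as np

class TolerantLabelEncoder:
    """Hypothetical sketch: like LabelEncoder, but unseen values map to 0."""

    def fit(self, y):
        self.classes_ = np.unique(y)  # sorted vocabulary
        return self

    def transform(self, y):
        y = np.asarray(y)
        # searchsorted gives an insertion point; for unseen values it points
        # at some existing class, so we must verify the match explicitly
        idx = np.searchsorted(self.classes_, y)
        idx = np.clip(idx, 0, len(self.classes_) - 1)
        seen = self.classes_[idx] == y
        # shift known labels up by one so 0 is reserved for "unknown"
        return np.where(seen, idx + 1, 0)

enc = TolerantLabelEncoder().fit(["blue", "red"])
print(enc.transform(["red", "green", "blue"]).tolist())  # [2, 0, 1]
```

The explicit equality check is the piece that plain searchsorted lacks, which is why just deleting the error branch yields wrong codes rather than zeros for unseen words.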

Member

jnothman commented Jul 26, 2017 via email

@amueller
Member

The original issue will be solved by #9151.


pchalasani commented Sep 18, 2017

I'd like to argue this is an important feature to add. Imagine we're training, in multiple batches, on a large dataset that cannot be loaded into memory all at once. Say there's a very high-cardinality feature F that can take on thousands of possible values. In each batch we would not see all possible values of F. In this situation it would be very useful to have an "incrementally updating" version of the LabelEncoder. The reason I'm using the LabelEncoder on a feature (rather than on a label) is that I want to pass a normalized feature (i.e. one whose values are a contiguous set of integers from 0 to N) to the embedding layer in the PyTorch deep learning framework. The embedding layer accepts only such a normalized feature (and this is probably relevant to other DL frameworks as well).
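An "incrementally updating" encoder for this batch scenario could be as simple as a growing dict. A hypothetical sketch, not scikit-learn code:

```python
class IncrementalEncoder:
    """Hypothetical sketch: assigns contiguous integer codes across batches."""

    def __init__(self):
        self.mapping = {}

    def transform(self, values):
        out = []
        for v in values:
            if v not in self.mapping:
                # first time we see this value: give it the next free code
                self.mapping[v] = len(self.mapping)
            out.append(self.mapping[v])
        return out

enc = IncrementalEncoder()
print(enc.transform(["a", "b", "a"]))  # [0, 1, 0]
print(enc.transform(["c", "b"]))       # [2, 1]
```

Codes stay contiguous from 0 upward across batches, which is exactly what an embedding layer expects (provided the embedding table is also grown when new codes appear).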

@jorisvandenbossche
Member

This can possibly be closed now that CategoricalEncoder (#9151) is merged. It supports handling unseen categories by ignoring them instead of raising an error. However, it does not support updating the categories in a partial_fit manner, as the OP requested.
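For reference, the CategoricalEncoder work was later split into OneHotEncoder and OrdinalEncoder, and in scikit-learn 0.24+ OrdinalEncoder can map unseen categories to a sentinel value instead of erroring:

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
enc.fit(np.array([["red"], ["blue"]]))
# "green" was never seen at fit time; it becomes -1 instead of a ValueError
print(enc.transform(np.array([["red"], ["green"]])).ravel().tolist())  # [1.0, -1.0]
```

This covers the "ignore gracefully" half of the request; the expanding-classes half remains out of scope.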

@qinhanmin2014
Member

Closing: after CategoricalEncoder, I suppose we are not going to do such things in LabelEncoder. Also, it seems we are not persuaded to update the categories (see #3599).

@arungansi

I used

       le.fit_transform(Col)

and was able to resolve the issue. It does both fit and transform, so we don't need to worry about unknown values in the test split.
