LabelEncoder should add flexibility to future new label #8136
Do you know at training time that there will be more labels in the test? If
so, just fit the LabelEncoder with the full set of labels...?
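That suggestion can be sketched as follows (hypothetical toy data; it assumes all labels that can appear at test time are known when the encoder is fit):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical data: "c" appears only in the test set
train = ["a", "b", "b"]
test = ["a", "c"]

le = LabelEncoder()
# Fit on the union of every label we expect to see
le.fit(np.concatenate([train, test]))
codes = le.transform(test)  # no ValueError: "c" was seen at fit time
```
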
On 30 December 2016 at 12:39, kaiwang wrote:
Description
I used LabelEncoder to transform a categorical feature into a numerical feature. But my test set contains new labels that were not fit in the training set, so LabelEncoder raises a ValueError. I think LabelEncoder should be able to deal with unknown labels, maybe by just assigning len(self.classes_) + 1 to them and updating the current LabelEncoder's self.classes_?
Steps/Code to Reproduce
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
for f in train_cat.columns:
    train_cat[f] = le.fit_transform(train_cat[f])
    test_cat[f] = le.transform(test_cat[f])
Expected Results
Actual Results
Traceback (most recent call last):
  File "insurancev3.py", line 110, in <module>
    test_cat[f] = le.transform(test_cat[f])
  File "/usr/local/python-3.4.4/lib/python3.4/site-packages/sklearn/preprocessing/label.py", line 149, in transform
    raise ValueError("y contains new labels: %s" % str(diff))
ValueError: y contains new labels: ['F']
Versions
>>> import platform; print(platform.platform())
Windows-10-10.0.14393-SP0
>>> import sys; print("Python", sys.version)
Python 3.5.2 |Anaconda 4.2.0 (64-bit)| (default, Jul 5 2016, 11:41:13) [MSC v.1900 64 bit (AMD64)]
>>> import numpy; print("NumPy", numpy.__version__)
NumPy 1.11.1
>>> import scipy; print("SciPy", scipy.__version__)
SciPy 0.18.1
>>> import sklearn; print("Scikit-Learn", sklearn.__version__)
Scikit-Learn 0.18.1
The process will be applied to unknown data in the future. Besides, I think this functionality could be added to LabelEncoder.
okay. see #3599
Hi, I want to work on this. @jnothman, what would be the desirable change for this?
Continue working on #3599, assuming @hamsal isn't working on it. I note
that there was some dispute about this and its interface in previous
incarnations.
@jnothman, I dug up the issue, the pull request, and the views of the people involved. You seemed slightly less supportive of the feature hamsal added. I personally think mjbommar's comment describes a valid use case. What are your views, and what further changes do you think we will need to resolve this?
Shrug. Again, I'd like to see a compelling example for the use of an expandable label set, and @mjbommar's comment isn't clear enough on what learning paradigm one is trying to apply over an expanding set of classes.
I think we can boil this down to one example and a two-part question.

Example: Let's say you have users coding samples, and up to T, you've only seen two categories:

Question(s): Who is responsible for dealing with this, the implementer or sklearn? Is there any "graceful" option in sklearn to recode or ignore, or do we only fail hard?

To be honest, sklearn already requires a significant effort to use in practical, online situations, so the marginal benefit relative to other headaches is small. That said, if we want to provide a better experience for users, I am still strongly in favor of allowing the user to either
@jnothman, I met this issue while doing a kaggle competition. The categorical features in the test set have unseen labels. Of course I can fit on a combined data set, but that feels neither elegant nor true to reality once the model is used in a product. When a future test set has an unseen label, I would expect the model to give me a prediction anyway, with just a new number assigned to that label.
My temporary solution for a pandas dataframe:
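The solution's code was lost in this archive; one common pandas-based workaround (a hypothetical sketch, not necessarily what this commenter did) maps every unseen test label to a single reserved code:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

train = pd.Series(["a", "b", "b", "c"])
test = pd.Series(["a", "d", "c"])  # "d" never appears in train

le = LabelEncoder()
le.fit(train)

# Build an explicit mapping and reserve one extra code for anything unseen
mapping = {label: code for code, label in enumerate(le.classes_)}
unseen_code = len(mapping)
encoded_test = test.map(lambda v: mapping.get(v, unseen_code))
```
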
@jnothman, what are your views on this?
There's no question of how to achieve it. There's a question of how it integrates with the rest of the scikit-learn ecosystem. We currently presume that anything supporting

I appreciate your use case, @mjbommar. Thanks. I still don't get how it integrates with the rest of scikit-learn's estimator provisions (which you hint at) and don't want to provide something that will confuse users by giving them a dead-end. Do you think that for classifiers supporting

I agree our current online learning support is weak. I want to know how to practically make it work.
@jnothman I think that using "hashing" can give us more flexibility. Adding an "expand_classes" method, as you mentioned, can be useful for adding further encodings. I have worked on this idea and submitted a PR. Let me know if we should go in this direction or work on another solution.
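For reference, the hashing idea can be sketched with scikit-learn's existing FeatureHasher, which needs no fitted class list and therefore never raises on unseen categories (toy example; it trades hash collisions for this flexibility):

```python
from sklearn.feature_extraction import FeatureHasher

# Each sample is an iterable of string features; no fit on labels is needed
h = FeatureHasher(n_features=8, input_type="string")
X_train = h.transform([["red"], ["blue"]])
X_test = h.transform([["green"]])  # unseen category: no error, just a hash bucket
```
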
Hello, everyone! I am currently working on an NLP project, in which I create a custom dictionary of the 15000 most common lemmatized words. Then I want to transform the dataset by converting the texts into lists of encoded labels. Obviously, I have missing categories (by design!) and I run into ValueError("y contains new labels: %s" % str(diff)).

Is there any way that I can make LabelEncoder simply treat words that are not among the 15000 most common as 0s or NaNs or whatever else? I get that I could do it "separately", but my dataset is rather big and it would be extremely cumbersome (and it defeats the purpose of having a label encoder, as I would basically have to do the same thing twice!).

Perhaps we could add a parameter to the transform method, such as missing='error' (behaves as it does now by default), where I can state what should be done with missing categories?
I think you want a CountVectorizer, not a LabelEncoder.
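For context, CountVectorizer silently drops words outside its fitted vocabulary rather than raising, which is presumably why it is suggested here (toy sketch):

```python
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()
cv.fit(["red blue red green"])
X = cv.transform(["red purple"])  # "purple" is outside the vocabulary: ignored, no error
```
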
I already have the counts. My dataframe contains, as one of the columns, lists of words for each document. I want to convert them into lists of labels (preserving the order of the words in the document!) to later feed into a keras model for embedding.

# Count the frequency of each word across all documents
freq = collections.Counter()
df["documents_as_lists_of_words"].apply(freq.update)
# Fit the encoder on the 15000 most common words
z = LabelEncoder()
z.fit([x[0] for x in freq.most_common(15000)])
# Use the encoder to convert the lists of words into lists of labels
a = z.transform(df)

This generates the aforementioned error. BTW, I am currently trying to define my own LabelEncoder2(LabelEncoder) and override the transform() method, but unfortunately simply removing the part that produces the error does not work: I get 0s only. Perhaps I would need to tinker with the underlying np.searchsorted to achieve my goal?

Again, I think you want a CountVectorizer.
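The subclassing attempt described above fails because np.searchsorted returns insertion positions, not matches, for values absent from classes_. A hypothetical sketch of a tolerant subclass (not a scikit-learn API) that maps unknowns to a reserved code:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

class TolerantLabelEncoder(LabelEncoder):
    """Hypothetical sketch: map labels unseen at fit time to a reserved code."""
    def transform(self, y):
        y = np.asarray(y)
        known = np.isin(y, self.classes_)
        out = np.full(y.shape, len(self.classes_), dtype=int)  # reserved code for unknowns
        out[known] = np.searchsorted(self.classes_, y[known])  # classes_ is sorted
        return out

le = TolerantLabelEncoder()
le.fit(["a", "b", "c"])
codes = le.transform(["a", "z", "c"])  # "z" maps to the reserved code 3
```
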
The original issue will be solved by #9151.
I'd like to argue this is an important feature to add. Imagine we're training on a large dataset that cannot be loaded into memory all at once, so we train in multiple batches. Say there's a very high-cardinality feature F that can take on thousands of possible values. In each batch we would not see all possible values of F. In this situation it would be very useful to have an "incrementally updating" version of the encoder.
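A minimal sketch of such an incrementally updating encoder (hypothetical, not a scikit-learn class): codes stay stable across batches, whereas refitting a plain LabelEncoder would re-sort classes_ and renumber previously seen labels.

```python
class IncrementalEncoder:
    """Hypothetical sketch: grow the label set batch by batch with stable codes."""
    def __init__(self):
        self.mapping = {}

    def partial_fit_transform(self, labels):
        out = []
        for v in labels:
            if v not in self.mapping:
                self.mapping[v] = len(self.mapping)  # assign the next unused code
            out.append(self.mapping[v])
        return out

enc = IncrementalEncoder()
first = enc.partial_fit_transform(["a", "b"])   # a and b get codes on first sight
second = enc.partial_fit_transform(["b", "c"])  # b keeps its code, c gets a new one
```
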
This can possibly be closed now.
Closing, since after CategoricalEncoder I suppose we are not going to do such things in LabelEncoder. Also, it seems that we are not persuaded to update the categories (see #3599).
I used
and I was able to resolve the issue. It does both fit and transform, so we don't need to worry about unknown values in the test split.
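The tool referenced above was elided from this archive. One scikit-learn facility matching the description (fits, transforms, and tolerates unknown test values) is OneHotEncoder with handle_unknown="ignore", shown here as a plausible reading rather than a confirmed one:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(np.array([["a"], ["b"]]))
X = enc.transform(np.array([["c"]]))  # unseen category: all-zero row, no error
```
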