[WIP] LabelEncoder: adding flexibility to future/new labels #8169
Thanks. I am starting to work on the expand_classes method and will add the features that you mentioned. However, if we are not using a hash table (dictionary), it wouldn't be ideal to use 'np.searchsorted' in this method the way it's used in fit/transform: it returns indices based on sorted rank, so if we re-sort the classes, the existing indices become misleading. I am assuming that expanding classes will take the set of classes and append the new ones, so that a class label is determined by the index of the element in the list. Hashing, as you mentioned, doesn't work all the time: it gives probabilistic results that depend on the hashing function. Generally it's not ideal, but it works in some cases (if we know the type of the items). What do we need to consider while implementing "expand_classes"? A first attempt has been pushed. |
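To make the concern concrete, here is a minimal sketch (the toy labels are made up for illustration, not taken from the PR) of how codes derived from np.searchsorted shift once the class array is re-sorted, compared with appending new labels and keeping a dictionary:

```python
import numpy as np

# Sorted classes, as a sorted-search encoder would store them after fit.
classes = np.array(["bird", "cat", "dog"])
codes_before = np.searchsorted(classes, ["cat", "dog"])

# Option A: add a new label and re-sort. "ant" sorts first, so every
# existing label moves up by one position.
classes_resorted = np.sort(np.append(classes, "ant"))
codes_after_resort = np.searchsorted(classes_resorted, ["cat", "dog"])

# Option B: append the new label and look codes up by position.
# Existing codes stay stable, but the array is no longer sorted,
# so np.searchsorted can no longer be used on it.
classes_appended = np.append(classes, "ant").tolist()
lookup = {label: i for i, label in enumerate(classes_appended)}
codes_after_append = [lookup[label] for label in ["cat", "dog", "ant"]]

print(codes_before.tolist())        # codes for cat, dog before expansion
print(codes_after_resort.tolist())  # codes for cat, dog after re-sorting
print(codes_after_append)           # codes for cat, dog, ant after appending
```

With re-sorting, "cat" and "dog" get different codes than before the expansion; with appending, they keep their codes and only the new label receives a new one.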
No, actually, I suspect we'd maintain the invariant that classes are in alphabetic sorted order, but I'm not sure about that, really. |
In this case, labels' indexes change every time we expand classes. As an example, if we encode a set of labels
|
I realise. That's why I'm talking about a different API, not just putting new classes into |
Thanks @jnothman . As you said, adding a new method on top of |
Has there been any further development here? |
This PR is for the feature discussed in #8136, an issue about making preprocessing.LabelEncoder flexible to future/new labels.
What does this implement/fix?
Using hashing instead of a "sorted search" would make it easier to label new classes.
There are two options:
1- Add a universal hash function to transform any label, without storing data. The results would be probabilistic, since they depend on having a good universal hash function, and that is a problem we would rather not take on.
2- Use a hash table to encode labels.
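As a rough illustration of why the first option is probabilistic, here is a sketch of a stateless hashing encoder. The function name hash_encode, the use of MD5, and the bucket count are assumptions made for this example only; nothing like this is part of the PR.

```python
import hashlib

def hash_encode(label, n_buckets=8):
    """Map a label to a code without storing any fitted state (hypothetical helper)."""
    digest = hashlib.md5(str(label).encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

# Encode more labels than there are buckets: by the pigeonhole principle,
# at least two distinct labels must share a code.
labels = [f"label_{i}" for i in range(20)]
codes = [hash_encode(label) for label in labels]
n_distinct = len(set(codes))

print(f"{len(labels)} labels mapped to {n_distinct} distinct codes")
```

The mapping is deterministic and needs no stored data, but it is not invertible and distinct labels can collide, which is the drawback noted for this option.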
With the second option, we can add a dictionary classes_lookup to index any newly inserted label. The value would be incremented whenever we add a new label. (This builds upon discussions in #7432 and #7455.)
self.classes_ has been retained as it's used by other methods.
This will make it easier to add a new method, expand_classes, which takes any new label that appears in the test data and assigns a new class to it.
Error handling:
1- If we add the classes_lookup dictionary, encountering a new label while encoding no longer has to be treated as a problem.
The question here is: how do we handle errors?
In the transform method, we can distinguish two cases.
Meanwhile, we don't need to change anything in the inverse_transform method.
Let me know if we should go in this direction or work on another solution.
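For reference, the dictionary-based option described above could look roughly like the sketch below. This is not the code pushed in this PR: the class name LookupLabelEncoder is hypothetical, and raising ValueError for unseen labels in transform is an assumption made here to keep the example small.

```python
class LookupLabelEncoder:
    """Hypothetical dictionary-backed encoder; not scikit-learn's LabelEncoder."""

    def fit(self, y):
        # Start from the sorted unique labels, as a sorted-search fit would.
        self.classes_ = sorted(set(y))
        self.classes_lookup = {label: i for i, label in enumerate(self.classes_)}
        return self

    def expand_classes(self, y):
        # Append unseen labels; codes already handed out never change.
        for label in y:
            if label not in self.classes_lookup:
                self.classes_lookup[label] = len(self.classes_)
                self.classes_.append(label)
        return self

    def transform(self, y):
        # Assumption: unseen labels are an error unless expand_classes ran first.
        unseen = [label for label in y if label not in self.classes_lookup]
        if unseen:
            raise ValueError(f"y contains previously unseen labels: {unseen}")
        return [self.classes_lookup[label] for label in y]

    def inverse_transform(self, codes):
        # classes_ is kept, so decoding is a plain positional lookup.
        return [self.classes_[code] for code in codes]


enc = LookupLabelEncoder().fit(["dog", "cat", "bird"])
before = enc.transform(["cat", "dog"])
enc.expand_classes(["ant", "cat"])
after = enc.transform(["cat", "dog", "ant"])
decoded = enc.inverse_transform([3, 0])
```

Because new labels are appended rather than merged into sorted order, classes_ stops being sorted after the first expansion, which is the trade-off raised in the discussion.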