
[WIP] LabelEncoder: adding flexibility to future/new labels #8169

Closed

Conversation

@tzano (Contributor) commented Jan 7, 2017

This PR is for the feature discussed in #8136, an issue about making preprocessing.LabelEncoder flexible towards future/new labels.

What does this implement/fix?

Using hashing instead of the current "sorted search" would make it easier to assign labels to new classes.

There are two options:
1- Add a universal hash function to transform any label, without storing any data. The results would be probabilistic, though, since we'd need a good universal hash function to avoid collisions, and that's a risk we'd rather not have to deal with.

def transform(lst):
    # Hash each label directly; no state is stored, but collisions are possible.
    return [hash(x) for x in lst]

transform(["paris", "berlin"])
transform([2, 1])
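To make the probabilistic nature concrete, here is a minimal sketch that buckets raw hashes into a fixed range (the max_hash parameter is hypothetical, echoing the MAX_HASH idea discussed further down); two distinct labels can land in the same bucket:

def hashed_encode(y, max_hash=1024):
    # Illustrative only: distinct labels may collide in the same bucket.
    return [hash(label) % max_hash for label in y]

hashed_encode(["paris", "berlin", "tokyo"])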

2- Use a hash table to encode labels.
In this case, we can add a dictionary classes_lookup that indexes any newly inserted label; the index is incremented whenever we add a new label. (This builds upon the discussions in #7432 and #7455.)
self.classes_ has been retained as it's used by other methods.

# requires: from collections import defaultdict; import numpy as np
self.classes_lookup = defaultdict(int)
# Map each unique label to its rank in the sorted unique array.
for i, v in enumerate(np.unique(y)):
    self.classes_lookup[v] = i
# classes_ should hold the labels themselves, not the lookup's index values.
self.classes_ = np.asarray(list(self.classes_lookup.keys()))
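With classes_lookup in place, transform can become a plain dictionary lookup instead of np.searchsorted. A minimal sketch of what that might look like (not the PR's final code):

def transform(self, y):
    # O(1) dictionary lookup per label instead of a sorted search.
    return np.array([self.classes_lookup[label] for label in y])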

This will make it easier to add a new method, expand_classes, which takes any new label that appears in the test data and assigns a new class index to it.

def expand_classes(self, y):
    # Append each previously unseen label and give it the next free index.
    for item in y:
        if item not in self.classes_lookup:
            self.classes_ = np.append(self.classes_, [item])
            self.classes_lookup[item] = len(self.classes_) - 1

    return [self.classes_lookup[item] for item in y]
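A hypothetical usage sketch, assuming the append semantics above (new labels get the next free index rather than being kept in sorted order):

le = LabelEncoder()
le.fit(["paris", "tokyo", "berlin"])    # classes_: [berlin, paris, tokyo]
# "amsterdam" was never seen during fit; instead of raising ValueError,
# expand_classes appends it with the next free index (3).
codes = le.expand_classes(["paris", "amsterdam"])    # [1, 3]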

Error handling:
1- If we add the classes_lookup mechanism, then encountering a new label while encoding no longer needs to be treated as a problem.

The question here is: how do we handle errors?

In the transform method, we can distinguish two cases.

diff = np.setdiff1d(classes, self.classes_)
# case 1: expand the classes whenever we find a new one
# for item in diff:
#     self.expand_classes([item])
# case 2: keep treating it as an error during transform; classes can still be expanded explicitly later
# raise ValueError("y contains new labels: %s" % str(diff))
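One way to offer both behaviours would be a constructor flag; this is purely a hypothetical illustration (expand_on_transform is not part of this PR):

diff = np.setdiff1d(classes, self.classes_)
if len(diff) > 0:
    if self.expand_on_transform:    # hypothetical flag, for illustration only
        self.expand_classes(diff)
    else:
        raise ValueError("y contains new labels: %s" % str(diff))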

Meanwhile, we don't need to change anything in the inverse_transform method:

diff = np.setdiff1d(y, np.arange(len(self.classes_)))
# note: a bare `if diff:` raises for arrays with more than one element
if len(diff) > 0:
    raise ValueError("y contains new labels: %s" % str(diff))

Let me know if we should go in this direction or work on another solution.

@jnothman (Member) commented Jan 7, 2017

  • The default behaviour needs to remain as it was, for the sake of backwards compatibility.
  • I don't think the data structure (i.e. hash vs search) is the big issue here. Optimisation can come later.
  • My suggestion of expand_classes was really a way to consider the problem of online learning in a downstream classifier (or transformer), e.g. SGDClassifier, which otherwise requires the set of classes to be specified up-front (in the first call to partial_fit). It would instead support either an expanded set of classes passed to partial_fit, or a method expand_classes that handles that case more explicitly.
  • Hashing as a way of handling new classes may also be a theoretical possibility, assuming, say, we run the first partial_fit(X, hashed_encode(y, max_hash=MAX_HASH), classes=np.arange(MAX_HASH)); however, models may be much larger than necessary if MAX_HASH is chosen to minimise collisions. It also seems a bit of a hacky way of doing things (see the sketch after this list).
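A minimal sketch of the hashing workaround in the last bullet, assuming a hashed_encode helper like the one sketched in the PR description (hashed_encode and MAX_HASH are illustrative, not scikit-learn API; SGDClassifier.partial_fit is real):

import numpy as np
from sklearn.linear_model import SGDClassifier

MAX_HASH = 2 ** 10  # large enough to make collisions unlikely, at the cost of model size

def hashed_encode(y, max_hash=MAX_HASH):
    # Illustrative helper: bucket each label's hash into a fixed range.
    return np.array([hash(label) % max_hash for label in y])

X = np.random.rand(3, 2)            # dummy features
y = ["paris", "berlin", "tokyo"]    # labels not fully known up-front
clf = SGDClassifier()
# The full class range must be declared on the first call to partial_fit.
clf.partial_fit(X, hashed_encode(y), classes=np.arange(MAX_HASH))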

@tzano (Contributor, Author) commented Jan 7, 2017

Thanks.

I am starting to work on the expand_classes method and to add the features you mentioned. However, if we are not using a hash table (dictionary), it wouldn't be ideal to use np.searchsorted in this method the way it's used in fit/transform: it returns indexes based on sorting rank, so if we re-sort, the existing indexes become misleading. I am assuming that expanding classes will take the current set of classes and append the new ones, with the class label determined by the index of the element in the list.
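A quick standalone illustration of how np.searchsorted ranks shift when the class array is re-sorted after appending:

import numpy as np

classes = np.array(["berlin", "paris", "tokyo"])
print(np.searchsorted(classes, ["paris"]))    # [1]

# After appending "amsterdam" and re-sorting, the rank of "paris" shifts.
classes = np.sort(np.append(classes, ["amsterdam"]))
print(np.searchsorted(classes, ["paris"]))    # [2]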

Hashing, as you mentioned, doesn't always work. It gives probabilistic results, depending on the hash function. Generally it's not ideal, but it can work in some cases (when we know the type of the items).

What are the things we need to consider while implementing expand_classes?

A first attempt has been pushed.

@jnothman (Member) commented Jan 7, 2017

No, actually, I suspect we'd maintain the invariant that classes are in alphabetic sorted order, but I'm not sure about that, really.

@tzano (Contributor, Author) commented Jan 8, 2017

we'd maintain the invariant that classes are in alphabetic sorted order

In this case, labels' indexes change every time we expand classes. As an example, suppose we encode the set of labels {"paris", "tokyo", "berlin"} and then add a new label, "amsterdam". If we retain the alphabetically sorted order of classes, the indexes of {"paris", "tokyo", "berlin"} all have to change.

le = LabelEncoder()
# Initial test
ret = le.fit_transform(["paris", "paris", "tokyo", "berlin"])
assert [1, 1, 2, 0] == list(ret)

# Case where we expand classes
ret = le.expand_classes(["amsterdam"])

# Second test
ret = le.fit_transform(["paris", "paris", "tokyo", "berlin", "amsterdam"])
assert [2, 2, 3, 1, 0] == list(ret)

@jnothman (Member) commented Jan 8, 2017

In this case, labels' indexes change every time we expand classes.

I realise. That's why I'm talking about a different API, not just putting new classes into LabelEncoder and expecting it to work some magic for the entirety of a pipeline.

@tzano tzano closed this Jan 9, 2017
@tzano tzano deleted the patch-8164 branch January 9, 2017 17:42
@tzano tzano restored the patch-8164 branch January 9, 2017 17:48
@tzano tzano deleted the patch-8164 branch January 9, 2017 17:49
@tzano (Contributor, Author) commented Jan 9, 2017

Thanks @jnothman. As you said, adding a new method on top of LabelEncoder doesn't seem to be a good direction.
I will try to make use of the ideas discussed here and submit another PR (working on a different API).

@sauravp commented Apr 7, 2017

Has there been any further development here?
