
[WIP] LabelEncoder: adding flexibility to future/new labels #8169

Closed

Conversation

@tzano (Contributor) commented Jan 7, 2017

This PR is for the feature discussed in #8136, an issue about making preprocessing.LabelEncoder flexible towards future/new labels.

What does this implement/fix?

Using hashing instead of the current "sorted search" would make it easier to assign labels to new classes.

There are two options:
1- Add a universal hash function to transform any label, without storing any data. The results would be probabilistic, though, since we'd need a good universal hash function to avoid collisions, and that's a risk we'd rather not have to deal with.

def transform(lst):
    # Hash each label directly; no state is stored, but collisions are possible.
    return [hash(x) for x in lst]

transform(["paris", "berlin"])
transform([2, 1])
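To make the probabilistic nature concrete, here is a minimal sketch that buckets raw hashes into a fixed range (the max_hash parameter is hypothetical, echoing the MAX_HASH idea discussed further down); two distinct labels can land in the same bucket:

def hashed_encode(y, max_hash=1024):
    # Illustrative only: distinct labels may collide in the same bucket.
    return [hash(label) % max_hash for label in y]

hashed_encode(["paris", "berlin", "tokyo"])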

2- Use a hash table to encode labels.
In this case, we can add a dictionary classes_lookup that indexes any newly inserted label; the index is incremented whenever we add a new label. (This builds upon the discussions in #7432 and #7455.)
self.classes_ has been retained as it's used by other methods.

# requires: from collections import defaultdict; import numpy as np
self.classes_lookup = defaultdict(int)
# Map each unique label to its rank in the sorted unique array.
for i, v in enumerate(np.unique(y)):
    self.classes_lookup[v] = i
# classes_ should hold the labels themselves, not the lookup's index values.
self.classes_ = np.asarray(list(self.classes_lookup.keys()))
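With classes_lookup in place, transform can become a plain dictionary lookup instead of np.searchsorted. A minimal sketch of what that might look like (not the PR's final code):

def transform(self, y):
    # O(1) dictionary lookup per label instead of a sorted search.
    return np.array([self.classes_lookup[label] for label in y])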

This will make it easier to add a new method, expand_classes, which takes any new label that appears in the test data and assigns a new class index to it.

def expand_classes(self, y):
    # Append each previously unseen label and give it the next free index.
    for item in y:
        if item not in self.classes_lookup:
            self.classes_ = np.append(self.classes_, [item])
            self.classes_lookup[item] = len(self.classes_) - 1

    return [self.classes_lookup[item] for item in y]
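A hypothetical usage sketch, assuming the append semantics above (new labels get the next free index rather than being kept in sorted order):

le = LabelEncoder()
le.fit(["paris", "tokyo", "berlin"])    # classes_: [berlin, paris, tokyo]
# "amsterdam" was never seen during fit; instead of raising ValueError,
# expand_classes appends it with the next free index (3).
codes = le.expand_classes(["paris", "amsterdam"])    # [1, 3]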

Error handling:
1- If we add the classes_lookup mechanism, then encountering a new label while encoding no longer needs to be treated as a problem.

The question here is: how do we handle errors?

In the transform method, we can distinguish two cases.

diff = np.setdiff1d(classes, self.classes_)
# case 1: expand the classes whenever we find a new one
# for item in diff:
#     self.expand_classes([item])
# case 2: keep treating it as an error during transform; classes can still be expanded explicitly later
# raise ValueError("y contains new labels: %s" % str(diff))
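One way to offer both behaviours would be a constructor flag; this is purely a hypothetical illustration (expand_on_transform is not part of this PR):

diff = np.setdiff1d(classes, self.classes_)
if len(diff) > 0:
    if self.expand_on_transform:    # hypothetical flag, for illustration only
        self.expand_classes(diff)
    else:
        raise ValueError("y contains new labels: %s" % str(diff))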

Meanwhile, we don't need to change anything in the inverse_transform method:

diff = np.setdiff1d(y, np.arange(len(self.classes_)))
# note: a bare `if diff:` raises for arrays with more than one element
if len(diff) > 0:
    raise ValueError("y contains new labels: %s" % str(diff))

Let me know if we should go in this direction or work on another solution.

@jnothman (Member) commented Jan 7, 2017

  • The default behaviour needs to remain as it was, for the sake of backwards compatibility.
  • I don't think the data structure (i.e. hash vs search) is the big issue here. Optimisation can come later.
  • My suggestion of expand_classes was really a way to consider the problem of online learning in a downstream classifier (or transformer), e.g. SGDClassifier, which otherwise requires the set of classes to be specified up-front (in the first call to partial_fit). It would instead support either an expanded set of classes passed to partial_fit, or a method expand_classes that handles that case more explicitly.
  • Hashing as a way of handling new classes may also be a theoretical possibility, assuming, say, we run the first partial_fit(X, hashed_encode(y, max_hash=MAX_HASH), classes=np.arange(MAX_HASH)); however, models may be much larger than necessary if MAX_HASH is chosen to minimise collisions. It also seems a bit of a hacky way of doing things (see the sketch after this list).
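A minimal sketch of the hashing workaround in the last bullet, assuming a hashed_encode helper like the one sketched in the PR description (hashed_encode and MAX_HASH are illustrative, not scikit-learn API; SGDClassifier.partial_fit is real):

import numpy as np
from sklearn.linear_model import SGDClassifier

MAX_HASH = 2 ** 10  # large enough to make collisions unlikely, at the cost of model size

def hashed_encode(y, max_hash=MAX_HASH):
    # Illustrative helper: bucket each label's hash into a fixed range.
    return np.array([hash(label) % max_hash for label in y])

X = np.random.rand(3, 2)            # dummy features
y = ["paris", "berlin", "tokyo"]    # labels not fully known up-front
clf = SGDClassifier()
# The full class range must be declared on the first call to partial_fit.
clf.partial_fit(X, hashed_encode(y), classes=np.arange(MAX_HASH))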

@tzano (Contributor, Author) commented Jan 7, 2017

Thanks.

I am starting to work on the expand_classes method and to add the features you mentioned. However, if we are not using a hash table (dictionary), it wouldn't be ideal to use np.searchsorted in this method the way it's used in fit/transform: it returns indexes based on sorting rank, so if we re-sort, the existing indexes become misleading. I am assuming that expanding classes will take the current set of classes and append the new ones, with the class label determined by the index of the element in the list.
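A quick standalone illustration of how np.searchsorted ranks shift when the class array is re-sorted after appending:

import numpy as np

classes = np.array(["berlin", "paris", "tokyo"])
print(np.searchsorted(classes, ["paris"]))    # [1]

# After appending "amsterdam" and re-sorting, the rank of "paris" shifts.
classes = np.sort(np.append(classes, ["amsterdam"]))
print(np.searchsorted(classes, ["paris"]))    # [2]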

Hashing, as you mentioned, doesn't always work. It gives probabilistic results, depending on the hash function. Generally it's not ideal, but it can work in some cases (when we know the type of the items).

What are the things we need to consider while implementing expand_classes?

A first attempt has been pushed.

@jnothman (Member) commented Jan 7, 2017

No, actually, I suspect we'd maintain the invariant that classes are in alphabetic sorted order, but I'm not sure about that, really.

@tzano (Contributor, Author) commented Jan 8, 2017

we'd maintain the invariant that classes are in alphabetic sorted order

In this case, labels' indexes change every time we expand classes. As an example, suppose we encode the set of labels {"paris", "tokyo", "berlin"} and then add a new label, "amsterdam". If we retain the alphabetically sorted order of classes, the indexes of {"paris", "tokyo", "berlin"} all have to change.

le = LabelEncoder()
# Initial test
ret = le.fit_transform(["paris", "paris", "tokyo", "berlin"])
assert [1, 1, 2, 0] == list(ret)

# Case where we expand classes
ret = le.expand_classes(["amsterdam"])

# Second test
ret = le.fit_transform(["paris", "paris", "tokyo", "berlin", "amsterdam"])
assert [2, 2, 3, 1, 0] == list(ret)

@jnothman (Member) commented Jan 8, 2017

In this case, labels' indexes change every time we expand classes.

I realise. That's why I'm talking about a different API, not just putting new classes into LabelEncoder and expecting it to work some magic for the entirety of a pipeline.

@tzano tzano closed this Jan 9, 2017
@tzano tzano deleted the patch-8164 branch January 9, 2017 17:42
@tzano tzano restored the patch-8164 branch January 9, 2017 17:48
@tzano tzano deleted the patch-8164 branch January 9, 2017 17:49
@tzano (Contributor, Author) commented Jan 9, 2017

Thanks @jnothman. As you said, adding a new method on top of LabelEncoder doesn't seem to be a good direction.
I will try to make use of the ideas discussed here and submit another PR (working on a different API).

@sauravp commented Apr 7, 2017

Has there been any further development here?
