[MRG + 1] ENH: new CategoricalEncoder class #9151

Merged: 37 commits merged into scikit-learn:master, Nov 21, 2017

Conversation

@jorisvandenbossche (Member) commented Jun 18, 2017

This is currently simply a rebase of PR #6559 by @vighneshbirodkar .

Some context: PR #6559 was the first of a series of related PRs; it added a CategoricalEncoder. It was then decided, instead of adding a new class, to add this functionality to the existing OneHotEncoder (#8793 and #7327), an effort recently taken up by @stephen-hoover in #8793.

At the sprint we discussed this, and @amueller put a summary of that in #8793 (comment).
The main reason not to add this to OneHotEncoder is that the behaviour is fundamentally different: OneHotEncoder determines the categories from the range of the positive integer values passed, while the new CategoricalEncoder would determine them from the unique values. Moreover, almost all keywords, attributes and behaviour of the current OneHotEncoder would have to be deprecated, which makes implementing both the deprecated and the new behaviour in a single class overly complex.
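To illustrate that difference, a rough sketch (assuming the pre-0.20 OneHotEncoder with its n_values keyword, and the CategoricalEncoder proposed in this PR):

import numpy as np
from sklearn.preprocessing import OneHotEncoder, CategoricalEncoder

X = np.array([[0], [1], [4]])

# old-style OneHotEncoder: categories are the integer range 0..n_values-1,
# so the never-seen values 2 and 3 still get columns
ohe = OneHotEncoder(n_values=5, sparse=False)
print(ohe.fit_transform(X).shape)   # (3, 5)

# CategoricalEncoder: categories are the unique values seen during fit,
# so only 0, 1 and 4 get a column (string-valued columns work as well)
ce = CategoricalEncoder(sparse=False)
print(ce.fit_transform(X).shape)    # (3, 3)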

The basics already work nicely with the rebased PR:

In [30]: cat = CategoricalEncoder()

In [31]: X = np.array([['a', 'b', 'a', 'c'], [0, 1, 0, 1]], dtype=object).T

In [32]: cat.fit_transform(X).toarray()
Out[32]: 
array([[ 1.,  0.,  0.,  1.,  0.],
       [ 0.,  1.,  0.,  0.,  1.],
       [ 1.,  0.,  0.,  1.,  0.],
       [ 0.,  0.,  1.,  0.,  1.]])

Some changes I would like to make to the current PR:

  • add some more tests
  • rename 'classes' keyword to 'categories' (IMO this is more in line with the name of the class 'CategoricalEncoder')
  • possibly remove the categorical_features keyword (the ability to select certain columns) for now to keep it simpler, as this can always be achieved in combination with the ColumnTransformer
  • I would like to add a categories_ attribute that is a dict of {column number/name : [categories]}. And maybe the underlying LabelEncoders can be hidden from the users (currently stored in the label_encoders_ attribute)
  • add support for pandas DataFrames (they already work, but would be nice to keep column -> categories information, see previous)
  • don't deprecate OneHotEncoder for now (we can leave this for a separate discussion)
  • move to sklearn.experimental (if we keep to this for the ColumnTransformer)
  • add a get_feature_names() method?

But before doing that, I wanted to check if we agree on this way forward (separate CategoricalEncoder class). @jnothman are you OK with this or still in favor of doing the changes in the OneHotEncoder?

If we want to keep the name OneHotEncoder, another option would be to implement the 'new' OneHotEncoder in e.g. a 'sklearn.future' module, so people can still move to it gradually and the current one can be deprecated, while keeping the implementations separate.

Closes #7375, closes #7327, closes #4920, closes #3956

Related issues that can possibly be closed as well: #3599, #8136

@@ -1197,7 +1197,7 @@ See the :ref:`metrics` section of the user guide for further details.
preprocessing.MaxAbsScaler
preprocessing.MinMaxScaler
preprocessing.Normalizer
preprocessing.OneHotEncoder
Member:

This indicated deprecation of the old class? I'm not sure we want to do that yet.

Member Author:

Yeah, for now I just rebased the old PR; see my to-do list in the top post. I would for now just leave OneHotEncoder as is. We can always decide to deprecate it later if we want.

Member:

Looked at the code before I looked at your description. I think your description is a good summary.

@amueller (Member)

I'm generally good with the to-dos, though the categories_ attribute is lower priority to me than the rest.

@amueller (Member) commented Jun 18, 2017

And obviously I would prefer not to move this to a different module; I'd be fine with adding a note to the docs that this is experimental and might change, but I won't press that point so we can move forward.

@amueller modified the milestone: 0.19 Jun 19, 2017
@jnothman (Member) left a comment

At least for the record, could you please remind me why this is superior to DictVectorizer().fit(frame.to_json(orient='records')) and to ColumnTransformer([(f, LabelEncoder(), f) for f in fields])?

I appreciate the difference of this from OHE, and that it provides a more definitive interface for this kind of operation. We should at the same time clarify what OHE is for (ordinals; #8628 should get a similar interface) and what LabelEncoder is not for.


Parameters
----------
classes : 'auto', 2D array of ints or strings or both.
Member:

not 2d. A list of lists of values.

- 'auto' : Determine classes automatically from the training data.
- array: ``classes[i]`` holds the classes expected in the ith column.

categorical_features : 'all' or array of indices or mask
Member:

I'd support getting rid of this as unnecessary complexity.

Values per feature.

- 'auto' : Determine classes automatically from the training data.
- array: ``classes[i]`` holds the classes expected in the ith column.
Member:

Does this specify the order of these values in the output, or just the set? Must be clear.

dictionary items (also handles string-valued features).
sklearn.feature_extraction.FeatureHasher : performs an approximate one-hot
encoding of dictionary items or strings.
"""
Member:

Surely we need See Also to also describe the relationship to / distinction from OHE

@amueller (Member) commented Jun 19, 2017

ColumnTransformer([(f, LabelEncoder(), f) for f in fields])

Followed by some version of the one-hot encoder, right?
Or I guess with LabelBinarizer() it would be fine? If that's a correct implementation that allows for an inverse_transform and get_feature_names, I'm all for it.

DictVectorizer().fit(frame.to_json(orient='records'))

Tbh I'm not familiar enough with how DictVectorizer treats integers and floats for that. Maybe a good argument would be the possibility of an exact inverse_transform which we decided is out-of-scope?

There is also somewhere a hack that uses CountVectorizer(analyzer=lambda x: x) or something like that, and that also works for a single column.

If we actually decide (which was the consensus at the sprint between @GaelVaroquaux, @jorisvandenbossche and me) that we always want to transform all columns, then maybe one of these implementations could actually work.

I would like something discoverable and with good feature names and the possibility to have some feature provenance in the future.

Maybe someone can write a blog post about all the subtle differences between these, lol.
I think that DictVectorizer().fit(frame.to_json(orient='records')) is a bit obscure, and it throws away the dtype of the columns, right?

@jnothman (Member) commented Jun 19, 2017 via email

@jorisvandenbossche (Member Author)

ColumnTransformer([(f, LabelBinarizer(), f) for f in fields])

If that's a correct implementation that allows for an inverse_transform and get_feature_names. I'm all for it.

Problem with the LabelBinarizer is that it currently only works on strings, not numerical values (which could maybe be fixed), and that it doesn't play nicely with the ColumnTransformer due to both X and y being passed to the transformers (but there is a PR to fix this?).
It also doesn't give us a get_feature_names out of the box.

DictVectorizer().fit(frame.to_dict(orient='records'))

Tbh I'm not familiar enough with how DictVectorizer treats integers and floats for that. Maybe a good argument would be the possibility of an exact inverse_transform which we decided is out-of-scope?

DictVectorizer seems to work and gives us get_feature_names, but it treats string values and numerical values differently: strings are dummy encoded, while integers are just passed through. So not fully the behaviour we want.
I also think the conversion to dicts (instead of working on the columns as arrays) can become quite costly for larger datasets.
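For reference, a small sketch of that DictVectorizer behaviour (string values get one column per value, numeric values pass through unchanged):

from sklearn.feature_extraction import DictVectorizer

records = [{'city': 'London', 'temp': 12},
           {'city': 'Paris', 'temp': 18}]

vec = DictVectorizer(sparse=False)
print(vec.fit_transform(records))
# [[ 1.  0. 12.]
#  [ 0.  1. 18.]]
print(vec.get_feature_names())
# ['city=London', 'city=Paris', 'temp']  -> 'temp' is not dummy encoded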

There is also somewhere a hack that uses CountVectorizer(analyzer=lambda x: x) or something like that, and that also works for a single column.

Indeed, it is CountVectorizer(analyzer=lambda x: [x]) that gives us more or less exactly what we want. It also gives us get_feature_names (we only have to fix it to be able to deal with mixed strings and numerical values).
So this could be used under the hood instead of LabelEncoder/OneHotEncoder, but given the quite different original use case, I am not sure this is a good way to go.
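For reference, a sketch of that CountVectorizer trick on a single string-valued column (each cell is treated as one "token", so the vocabulary becomes the set of unique values):

from sklearn.feature_extraction.text import CountVectorizer

col = ['a', 'b', 'a', 'c']   # one categorical column

vec = CountVectorizer(analyzer=lambda x: [x])
print(vec.fit_transform(col).toarray())
# [[1 0 0]
#  [0 1 0]
#  [1 0 0]
#  [0 0 1]]
print(vec.get_feature_names())   # ['a', 'b', 'c']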

Full experimentation of the different possibilities: http://nbviewer.ipython.org/d6a79e96b490872905e74202d0818ab2

@jnothman (Member) commented Jun 19, 2017 via email

@jorisvandenbossche (Member Author)

LabelEncoder just does np.unique(y) to determine the different classes in fit, np.unique(y, return_inverse=True) for the conversion to integer codes in fit_transform, and np.searchsorted in transform.
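In other words, a minimal sketch of that logic in plain NumPy:

import numpy as np

y = np.array(['b', 'a', 'c', 'a'])

# fit: the classes are simply the sorted unique values
classes_ = np.unique(y)                                 # ['a' 'b' 'c']

# fit_transform: np.unique can return the integer codes directly
classes_, codes = np.unique(y, return_inverse=True)
print(codes)                                            # [1 0 2 0]

# transform on new data: look the values up in the sorted classes
print(np.searchsorted(classes_, np.array(['c', 'b'])))  # [2 1]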

@jnothman (Member) commented Jun 19, 2017 via email

@amueller modified the milestones: 0.20, 0.19 Jun 19, 2017
- remove compat code for numpy < 1.8
- remove categorical_features keyword
- make label_encoders_ private
- rename classes to categories
@Trion129 (Contributor)

Hi Sci-kittens! :D I recently suggested on the mailing list adding a drop_one parameter to the OneHotEncoder, so that one of the columns at the end or beginning of the encoded array is dropped (keeping all of them is redundant), which benefits some models like LinearRegression. I was pointed to this PR; could this be added to the new CategoricalEncoder?

- check that it works on pandas frames
- fix doctests
- un-deprecate OneHotEncoder
- undo changes in _transform_selected (as we no longer need those changes for CategoricalEncoder)
- add see also to OneHotEncoder and vice versa
- for now remove the self.feature_indices_ attribute
@raghavrv self-requested a review June 27, 2017 18:50
@jorisvandenbossche (Member Author)

OK, I cleaned the code a bit further and added some more tests; I think this is ready for more detailed review.
The current PR is basically the simplest version of a CategoricalEncoder, which just does what you need in most cases and should be rather uncontroversial, but without many additional features/attributes (so e.g. no attributes yet to inspect the categories, no get_feature_names, no inverse_transform, ...).

@Trion129 That's indeed a possible extension of the current PR. If this is desired, we could add a keyword that determines this behaviour, with the default being to not drop any column (the current behaviour). As a reference, pandas' get_dummies uses a drop_first keyword for this.
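For reference, the pandas behaviour looks like this (drop_first simply drops the first dummy column per feature):

import pandas as pd

s = pd.Series(['a', 'b', 'a', 'c'])

print(pd.get_dummies(s).columns.tolist())                   # ['a', 'b', 'c']
print(pd.get_dummies(s, drop_first=True).columns.tolist())  # ['b', 'c']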

@jorisvandenbossche (Member Author)

(the docs should probably be further updated, as now it just changes the example using OneHotEncoder to CategoricalEncoder. But we might want to keep an example with OneHotEncoder as well, if there is a good example)

@amueller (Member)

Btw, in our tutorial @agramfort mentioned using integer encodings for categoricals, as that works reasonably well for trees. Should we implement that? If so, in this estimator? Probably not for now, though.

@amueller (Member) left a comment

Mostly looks good. Needs attribute documentation and then we can talk about what a good way to expose the per-feature categories is. Maybe also get_feature_names?

sklearn.preprocessing.OneHotEncoder : handles nominal/categorical features
encoded as columns of integers.
sklearn.preprocessing.CategoricalEncoder : handles nominal/categorical
features encoded as columns of arbitraty data types.
Member:

is it? or strings or integers? What happens with pandas categorical?

Member:

arbitraty -> arbitrary

@@ -1730,13 +1728,14 @@ def _transform_selected(X, transform, selected="all", copy=True):


class OneHotEncoder(BaseEstimator, TransformerMixin):
"""Encode categorical integer features using a one-hot aka one-of-K scheme.
"""Encode ordinal integer features using a one-hot aka one-of-K scheme.
Member:

I don't think ordinal is right as they are not ordered.

Member Author:

I used "ordinal" to indicate that the actual set of values does mean something (e.g. the values [1, 4, 5] imply that there are also categories 2 and 3 which are not present in the data), but I agree that this is not the same as being ordered.
That said, I also find 'categorical' misleading given the above behaviour.

Member:

I would say categorical and in the next line clarify they need to be 0 to (n_categories - 1).


handle_unknown : str, 'error' or 'ignore'
Whether to raise an error or ignore if a unknown categorical feature is
present during transform.
Member:

we always raise an error when categories is not auto and an unknown category is encountered during training, right? Maybe we should make that explicit?

Member:

We should probably say what happens with the unknown value: all columns will be zero.

Member Author:

we always raise an error when categories is not auto and an unknown category is encountered during training, right?

Shouldn't we also let that depend on this keyword? So also when you specify the categories yourself, you can ignore such errors about unknown values with this keyword (the default is to raise anyhow).

Member:

Ok, I'm fine with that.

Examples
--------
Given a dataset with three features and two samples, we let the encoder
find the maximum value per feature and transform the data to a binary
Member:

maximum value? Unique values, right? Maybe add in a 999? or just a 5?

Member Author:

Indeed, a leftover from the OneHotEncoder.

Will add another number. Ideally a string as well, but that's not possible when just using simple lists for the example...

Member:

still says maximum ;)

See also
--------
sklearn.preprocessing.OneHotEncoder : performs a one-hot encoding of
integer ordinal features. This transformer assumes that input features
Member:

I don't think ordinal is right. "This" -> OneHotEncoder (otherwise I feel it's ambiguous)

enc = CategoricalEncoder(sparse=False)
Xtr3 = enc.fit_transform(X)

assert_allclose(Xtr1.toarray(), Xtr2.toarray())
Member:

I don't understand this test. Determinism?

Member Author:

Good catch; before, they did something different (a different selection of the columns, but I removed that feature), so this is not needed anymore.

assert_allclose(Xtr, [[1, 0, 1, 0], [0, 1, 0, 1]])

Xtr = CategoricalEncoder().fit_transform(X)
assert_allclose(Xtr.toarray(), [[1, 0, 1, 0, 1], [0, 1, 0, 1, 1]])
Member:

We should probably explain the handling of constant features. Does this make the most sense? Not sure what else....

Member Author:

So you mean the case when there is just one category, and now this gets one column with ones?
Also not sure what else we could do for that.

Member:

yes. We could also remove it, or make it zeros. But ones is maybe fine? But we should be explicit on what we do.

enc.fit(X)

X[0][0] = -1
msg = re.escape('Unknown feature(s) [-1] in column 0')
Member:

Shouldn't it be feature value? Unknown category?

Member Author:

indeed. Took 'category'

assert_array_equal(enc.fit_transform(X).toarray(), exp)

# don't follow order of passed categories, but sort them
enc = CategoricalEncoder(categories=[['c', 'b', 'a']])
Member:

That's not documented, is it?

Member Author:

See last bullet point in #9151 (but indeed, this is not yet documented)

Member Author:

For now, documented it as it currently behaves.

enc = CategoricalEncoder(categories=[['a', 'b']])
assert_raises(ValueError, enc.fit, X)
enc = CategoricalEncoder(categories=[['a', 'b']], handle_unknown='ignore')
enc.fit(X)
Member:

We should check the outcome for that. We only get two columns, right?

Member Author:

Currently this gives a row of all zeros (with indeed only two columns). Is this the desired outcome? (not sure what to do otherwise)
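For concreteness, a small sketch of the behaviour discussed here, based on the test above (an unknown value is encoded as an all-zero row when handle_unknown='ignore'):

import numpy as np
from sklearn.preprocessing import CategoricalEncoder

X = np.array([['a'], ['b'], ['c']], dtype=object)

enc = CategoricalEncoder(categories=[['a', 'b']], handle_unknown='ignore')
enc.fit(X)
print(enc.transform(X).toarray())
# [[1. 0.]
#  [0. 1.]
#  [0. 0.]]   <- the unknown 'c' becomes a row of zeros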

@jnothman (Member) commented Jul 13, 2017 via email

@amueller (Member)

Any news on this? ;)

@amueller (Member)

Sorry for the delay. The question is whether we want dataframe in -> dataframe out? That might be nice, but I'd rather merge without that and possibly add that later.

@amueller (Member)

We're gonna do get_feature_names in a follow-up PR, right?

@jorisvandenbossche (Member Author)

We're gonna do get_feature_names in a follow-up PR, right?

Yes, that is on my list of follow-ups (#9151 (comment)), although there are some questions about what it should exactly do (#9151 (comment))

The question is whether we want dataframe in -> dataframe out?

I didn't consider 'dataframe out' (although that can also be a consideration, it is a much bigger one, I think). Here it was more about having some special code inside the transformer to prevent converting a dataframe with different dtypes to an 'object'-dtyped array (check_array). This conversion is not really needed, as the transformer encodes the input column by column anyway, so it would be rather easy to preserve the dtypes per column. The transformed output will always have a uniform dtype anyway, so it is fine for that to be an array.
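To illustrate the dtype point (a sketch, not code from this PR):

import numpy as np
import pandas as pd

df = pd.DataFrame({'city': ['London', 'Paris'], 'temp': [12.0, 18.0]})

# converting the whole frame to a single ndarray upcasts everything to object ...
print(np.asarray(df).dtype)            # object

# ... whereas handling it column by column keeps the per-column dtypes
for col in df.columns:
    print(col, df[col].values.dtype)   # city: object, temp: float64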

@amueller (Member)

@jorisvandenbossche ah, that makes more sense. I would leave it as-is. Merge?

@amueller (Member)

I think the proposal with something like ['0__female', '0__male', '1__0', '1__1'] is good. I would do it as in PolynomialFeatures, which uses x0, x1, etc. (maybe), with an option to pass in input feature names to transform them. That would allow preserving the semantics more easily.

@jorisvandenbossche (Member Author)

Merge?

Yes, I think somebody can merge it!

@jnothman (Member)

Sure. Let's see what this thing does in the wild!

@jnothman merged commit a2ebb8c into scikit-learn:master Nov 21, 2017
@jnothman (Member)

Congratulations! And thanks. Please feel free to make follow-up issues.

@jorisvandenbossche (Member Author)

And thanks a lot for all the review!

@jorisvandenbossche deleted the pr/6559 branch November 21, 2017 12:26
@amueller (Member)

My excitement about this is pretty much through the roof lol ;)

@amueller (Member)

I would appreciate it if you could focus on #9012, maybe we can leave the get_feature_names for someone else, it shouldn't be too tricky.

@vighneshbirodkar (Contributor)

Congratulations and thanks @jorisvandenbossche
It is nice to see this finally merged.

@austinmw

Hi, any chance you could add a drop_first parameter like in pandas.get_dummies()? It'd make it easier to put this in a pipeline without requiring something additional to drop a column.

@jnothman (Member) commented Jan 11, 2018 via email

@jorisvandenbossche (Member Author)

For people who are subscribed here: I opened a new issue with some questions on the API and naming: #10521

@rragundez (Contributor) commented Mar 11, 2019

CategoricalEncoder is now refactored into OneHotEncoder and OrdinalEncoder. #10523

Successfully merging this pull request may close these issues.

  • OneHotEncoder should accept string values
  • Pipeline doesn't work with Label Encoder