[MRG + 1] ENH: new CategoricalEncoder class #9151

Merged: 37 commits merged into scikit-learn:master, Nov 21, 2017

Conversation

@jorisvandenbossche (Member) commented Jun 18, 2017

This is currently simply a rebase of PR #6559 by @vighneshbirodkar .

Some context: PR #6559 was the first of a series of related PRs; it added a CategoricalEncoder. It was then decided, instead of adding a new class, to add this functionality to the existing OneHotEncoder (#8793 and #7327), an effort recently taken up by @stephen-hoover in #8793.

At the sprint we discussed this, and @amueller put a summary of that in #8793 (comment).
The main reason not to add this to OneHotEncoder is that the behaviour is fundamentally different: OneHotEncoder determines the categories from the range of the positive integer values passed, while the new CategoricalEncoder would determine them from the unique values. Moreover, almost all keywords, attributes and behaviour of the current OneHotEncoder would have to be deprecated, which makes implementing both the deprecated and the new behaviour in a single class overly complex.
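To illustrate that difference, a rough sketch (assuming the pre-0.20 OneHotEncoder with its n_values keyword, and the CategoricalEncoder proposed in this PR):

import numpy as np
from sklearn.preprocessing import OneHotEncoder, CategoricalEncoder

X = np.array([[0], [1], [4]])

# old-style OneHotEncoder: categories are the integer range 0..n_values-1,
# so the never-seen values 2 and 3 still get columns
ohe = OneHotEncoder(n_values=5, sparse=False)
print(ohe.fit_transform(X).shape)   # (3, 5)

# CategoricalEncoder: categories are the unique values seen during fit,
# so only 0, 1 and 4 get a column (string-valued columns work as well)
ce = CategoricalEncoder(sparse=False)
print(ce.fit_transform(X).shape)    # (3, 3)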

The basics already work nicely with the rebased PR:

In [30]: cat = CategoricalEncoder()

In [31]: X = np.array([['a', 'b', 'a', 'c'], [0, 1, 0, 1]], dtype=object).T

In [32]: cat.fit_transform(X).toarray()
Out[32]: 
array([[ 1.,  0.,  0.,  1.,  0.],
       [ 0.,  1.,  0.,  0.,  1.],
       [ 1.,  0.,  0.,  1.,  0.],
       [ 0.,  0.,  1.,  0.,  1.]])

Some changes I would like to make to the current PR:

  • add some more tests
  • rename 'classes' keyword to 'categories' (IMO this is more in line with the name of the class 'CategoricalEncoder')
  • possibly remove the categorical_features keyword (the ability to select certain columns) for now to keep it simpler, as this can always be achieved in combination with the ColumnTransformer
  • I would like to add a categories_ attribute that is a dict of {column number/name : [categories]}. And maybe the underlying LabelEncoders can be hidden from the users (currently stored in the label_encoders_ attribute)
  • add support for pandas DataFrames (they already work, but would be nice to keep column -> categories information, see previous)
  • don't deprecate OneHotEncoder for now (we can leave this for a separate discussion)
  • move to sklearn.experimental (if we keep to this for the ColumnTransformer)
  • add a get_feature_names() method?

But before doing that, I wanted to check if we agree on this way forward (separate CategoricalEncoder class). @jnothman are you OK with this or still in favor of doing the changes in the OneHotEncoder?

If we want to keep the name OneHotEncoder, another option would be to implement the 'new' OneHotEncoder in e.g. a 'sklearn.future' module, so people can still move to it gradually and the current one can be deprecated, while keeping the implementations separate.

Closes #7375, closes #7327, closes #4920, closes #3956

Related issues that can possibly be closed as well: #3599, #8136

@@ -1197,7 +1197,7 @@ See the :ref:`metrics` section of the user guide for further details.
preprocessing.MaxAbsScaler
preprocessing.MinMaxScaler
preprocessing.Normalizer
preprocessing.OneHotEncoder
Member:

This indicated deprecation of the old class? I'm not sure we want to do that yet.

Member Author:

Yeah, for now I just rebased the old PR; see my to-do list in the top post. I would for now just leave OneHotEncoder as is. We can always decide to deprecate it later if we want.

Member:

Looked at the code before I looked at your description. I think your description is a good summary.

@amueller (Member)

I'm generally good with the to-dos, though the categories_ attribute is lower priority to me than the rest.

@amueller (Member) commented Jun 18, 2017

And obviously I would prefer not to move this to a different module; I'd be fine with adding a note to the docs that this is experimental and might change, but I won't press that point so we can move forward.

@amueller modified the milestone: 0.19 Jun 19, 2017
@jnothman (Member) left a comment

At least for the record, could you please remind me why this is superior to DictVectorizer().fit(frame.to_json(orient='records')) and to ColumnTransformer([(f, LabelEncoder(), f) for f in fields])?

I appreciate the difference of this from OHE, and that it provides a more definitive interface for this kind of operation. We should at the same time clarify what OHE is for (ordinals; #8628 should get a similar interface) and what LabelEncoder is not for.


Parameters
----------
classes : 'auto', 2D array of ints or strings or both.
Member:

not 2d. A list of lists of values.

- 'auto' : Determine classes automatically from the training data.
- array: ``classes[i]`` holds the classes expected in the ith column.

categorical_features : 'all' or array of indices or mask
Member:

I'd support getting rid of this as unnecessary complexity.

Values per feature.

- 'auto' : Determine classes automatically from the training data.
- array: ``classes[i]`` holds the classes expected in the ith column.
Member:

Does this specify the order of these values in the output, or just the set? Must be clear.

dictionary items (also handles string-valued features).
sklearn.feature_extraction.FeatureHasher : performs an approximate one-hot
encoding of dictionary items or strings.
"""
Member:

Surely we need See Also to also describe the relationship to / distinction from OHE

@amueller (Member) commented Jun 19, 2017

ColumnTransformer([(f, LabelEncoder(), f) for f in fields])

Followed by some version of the one-hot encoder, right?
Or I guess with LabelBinarizer() it would be fine? If that's a correct implementation that allows for an inverse_transform and get_feature_names, I'm all for it.

DictVectorizer().fit(frame.to_json(orient='records'))

Tbh I'm not familiar enough with how DictVectorizer treats integers and floats for that. Maybe a good argument would be the possibility of an exact inverse_transform which we decided is out-of-scope?

There is also somewhere a hack that uses CountVectorizer(analyzer=lambda x: x) or something like that, and that also works for a single column.

If we actually decide (which was the consensus at the sprint between @GaelVaroquaux, @jorisvandenbossche and me) that we always want to transform all columns, then maybe one of these implementations could actually work.

I would like something discoverable and with good feature names and the possibility to have some feature provenance in the future.

Maybe someone can write a blog post about all the subtle differences between these, lol.
I think that DictVectorizer().fit(frame.to_json(orient='records')) is a bit obscure, and it throws away the dtype of the columns, right?

@jnothman (Member) commented Jun 19, 2017 via email

@jorisvandenbossche (Member Author)

ColumnTransformer([(f, LabelBinarizer(), f) for f in fields])

If that's a correct implementation that allows for an inverse_transform and get_feature_names. I'm all for it.

Problem with the LabelBinarizer is that it currently only works on strings, not numerical values (which could maybe be fixed), and that it doesn't play nicely with the ColumnTransformer due to both X and y being passed to the transformers (but there is a PR to fix this?).
It also doesn't give us a get_feature_names out of the box.

DictVectorizer().fit(frame.to_dict(orient='records'))

Tbh I'm not familiar enough with how DictVectorizer treats integers and floats for that. Maybe a good argument would be the possibility of an exact inverse_transform which we decided is out-of-scope?

DictVectorizer seems to work and gives us get_feature_names, but it treats string values and numerical values differently: strings are dummy encoded, while integers are just passed through. So not fully the behaviour we want.
I also think the conversion to dicts (instead of working on the columns as arrays) can become quite costly for larger datasets.
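For reference, a small sketch of that DictVectorizer behaviour (string values get one column per value, numeric values pass through unchanged):

from sklearn.feature_extraction import DictVectorizer

records = [{'city': 'London', 'temp': 12},
           {'city': 'Paris', 'temp': 18}]

vec = DictVectorizer(sparse=False)
print(vec.fit_transform(records))
# [[ 1.  0. 12.]
#  [ 0.  1. 18.]]
print(vec.get_feature_names())
# ['city=London', 'city=Paris', 'temp']  -> 'temp' is not dummy encoded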

There is also somewhere a hack that uses CountVectorizer(analyzer=lambda x: x) or something like that, and that also works for a single column.

Indeed, it is CountVectorizer(analyzer=lambda x: [x]) that gives us more or less exactly what we want. It also gives us get_feature_names (we only have to fix it to be able to deal with mixed strings and numerical values).
So this could be used under the hood instead of LabelEncoder/OneHotEncoder, but given the quite different original use case, I am not sure this is a good way to go.
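For reference, a sketch of that CountVectorizer trick on a single string-valued column (each cell is treated as one "token", so the vocabulary becomes the set of unique values):

from sklearn.feature_extraction.text import CountVectorizer

col = ['a', 'b', 'a', 'c']   # one categorical column

vec = CountVectorizer(analyzer=lambda x: [x])
print(vec.fit_transform(col).toarray())
# [[1 0 0]
#  [0 1 0]
#  [1 0 0]
#  [0 0 1]]
print(vec.get_feature_names())   # ['a', 'b', 'c']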

Full experimentation of the different possibilities: http://nbviewer.ipython.org/d6a79e96b490872905e74202d0818ab2

@jnothman (Member) commented Jun 19, 2017 via email

@jorisvandenbossche (Member Author)

LabelEncoder just does np.unique(y) to determine the different classes in fit, np.unique(y, return_inverse=True) for the conversion to integer codes in fit_transform, and np.searchsorted in transform.
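In other words, a minimal sketch of that logic in plain NumPy:

import numpy as np

y = np.array(['b', 'a', 'c', 'a'])

# fit: the classes are simply the sorted unique values
classes_ = np.unique(y)                                 # ['a' 'b' 'c']

# fit_transform: np.unique can return the integer codes directly
classes_, codes = np.unique(y, return_inverse=True)
print(codes)                                            # [1 0 2 0]

# transform on new data: look the values up in the sorted classes
print(np.searchsorted(classes_, np.array(['c', 'b'])))  # [2 1]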

@jnothman (Member) commented Jun 19, 2017 via email

@amueller modified the milestones: 0.20, 0.19 Jun 19, 2017
- remove compat code for numpy < 1.8
- remove categorical_features keyword
- make label_encoders_ private
- rename classes to categories
@Trion129 (Contributor)

Hi Sci-kittens! :D I recently suggested on the mailing list adding a drop_one parameter to the OneHotEncoder, so that one of the columns at the end or beginning of the encoded array is dropped (keeping all of them is redundant), which benefits some models like LinearRegression. I was pointed to this PR; could this be added to the new CategoricalEncoder?

- check that it works on pandas frames
- fix doctests
- un-deprecate OneHotEncoder
- undo changes in _transform_selected (as we no longer need those changes for CategoricalEncoder)
- add see also to OneHotEncoder and vice versa
- for now remove the self.feature_indices_ attribute
@raghavrv self-requested a review June 27, 2017 18:50
@jorisvandenbossche (Member Author)

OK, I cleaned the code a bit further and added some more tests; I think this is ready for more detailed review.
The current PR is basically the simplest version of a CategoricalEncoder, which just does what you need in most cases and should be rather uncontroversial, but without many additional features/attributes (so e.g. no attributes yet to inspect the categories, no get_feature_names, no inverse_transform, ...).

@Trion129 That's indeed a possible extension of the current PR. If this is desired, we could add a keyword that determines this behaviour, with the default being to not drop any column (the current behaviour). As a reference, pandas' get_dummies uses a drop_first keyword for this.
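For reference, the pandas behaviour looks like this (drop_first simply drops the first dummy column per feature):

import pandas as pd

s = pd.Series(['a', 'b', 'a', 'c'])

print(pd.get_dummies(s).columns.tolist())                   # ['a', 'b', 'c']
print(pd.get_dummies(s, drop_first=True).columns.tolist())  # ['b', 'c']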

@jorisvandenbossche (Member Author)

(the docs should probably be further updated, as now it just changes the example using OneHotEncoder to CategoricalEncoder. But we might want to keep an example with OneHotEncoder as well, if there is a good example)

@amueller (Member)

Btw, in our tutorial @agramfort mentioned using integer encodings for categoricals, as that works reasonably well for trees. Should we implement that? If so, in this estimator? Probably not for now, though.

@amueller (Member) left a comment

Mostly looks good. Needs attribute documentation and then we can talk about what a good way to expose the per-feature categories is. Maybe also get_feature_names?

sklearn.preprocessing.OneHotEncoder : handles nominal/categorical features
encoded as columns of integers.
sklearn.preprocessing.CategoricalEncoder : handles nominal/categorical
features encoded as columns of arbitraty data types.
Member:

is it? or strings or integers? What happens with pandas categorical?

Member:

arbitraty -> arbitrary

@@ -1730,13 +1728,14 @@ def _transform_selected(X, transform, selected="all", copy=True):


class OneHotEncoder(BaseEstimator, TransformerMixin):
"""Encode categorical integer features using a one-hot aka one-of-K scheme.
"""Encode ordinal integer features using a one-hot aka one-of-K scheme.
Member:

I don't think ordinal is right as they are not ordered.

Member Author:

I used "ordinal" to indicate that the actual set of values does mean something (e.g. the values [1, 4, 5] imply that there are also categories 2 and 3 which are not present in the data), but I agree that this is not the same as being ordered.
That said, I also find 'categorical' misleading given the above behaviour.

Member:

I would say categorical and in the next line clarify they need to be 0 to (n_categories - 1).


handle_unknown : str, 'error' or 'ignore'
Whether to raise an error or ignore if a unknown categorical feature is
present during transform.
Member:

we always raise an error when categories is not auto and an unknown category is encountered during training, right? Maybe we should make that explicit?

Member:

We should probably say what happens with the unknown value: all columns will be zero.

Member Author:

we always raise an error when categories is not auto and an unknown category is encountered during training, right?

Shouldn't we also let that depend on this keyword? So also when you specify the categories yourself, you can ignore such errors about unknown values with this keyword (the default is to raise anyhow).

Member:

Ok, I'm fine with that.

Examples
--------
Given a dataset with three features and two samples, we let the encoder
find the maximum value per feature and transform the data to a binary
Member:

maximum value? Unique values, right? Maybe add in a 999? or just a 5?

Member Author:

Indeed, a leftover from the OneHotEncoder.

Will add another number. Ideally a string as well, but that's not possible when just using simple lists for the example...

Member:

still says maximum ;)

See also
--------
sklearn.preprocessing.OneHotEncoder : performs a one-hot encoding of
integer ordinal features. This transformer assumes that input features
Member:

I don't think ordinal is right. "This" -> OneHotEncoder (otherwise I feel it's ambiguous)

enc = CategoricalEncoder(sparse=False)
Xtr3 = enc.fit_transform(X)

assert_allclose(Xtr1.toarray(), Xtr2.toarray())
Member:

I don't understand this test. Determinism?

Member Author:

Good catch; before, they did something different (a different selection of the columns, but I removed that feature), so this is not needed anymore.

assert_allclose(Xtr, [[1, 0, 1, 0], [0, 1, 0, 1]])

Xtr = CategoricalEncoder().fit_transform(X)
assert_allclose(Xtr.toarray(), [[1, 0, 1, 0, 1], [0, 1, 0, 1, 1]])
Member:

We should probably explain the handling of constant features. Does this make the most sense? Not sure what else....

Member Author:

So you mean the case when there is just one category, and now this gets one column with ones?
Also not sure what else we could do for that.

Member:

yes. We could also remove it, or make it zeros. But ones is maybe fine? But we should be explicit on what we do.

enc.fit(X)

X[0][0] = -1
msg = re.escape('Unknown feature(s) [-1] in column 0')
Member:

Shouldn't it be feature value? Unknown category?

Member Author:

indeed. Took 'category'

assert_array_equal(enc.fit_transform(X).toarray(), exp)

# don't follow order of passed categories, but sort them
enc = CategoricalEncoder(categories=[['c', 'b', 'a']])
Member:

That's not documented, is it?

Member Author:

See last bullet point in #9151 (but indeed, this is not yet documented)

Member Author:

For now, documented it as it currently behaves.

enc = CategoricalEncoder(categories=[['a', 'b']])
assert_raises(ValueError, enc.fit, X)
enc = CategoricalEncoder(categories=[['a', 'b']], handle_unknown='ignore')
enc.fit(X)
Member:

We should check the outcome for that. We only get two columns, right?

Member Author:

Currently this gives a row of all zeros (with indeed only two columns). Is this the desired outcome? (not sure what to do otherwise)
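For concreteness, a small sketch of the behaviour discussed here, based on the test above (an unknown value is encoded as an all-zero row when handle_unknown='ignore'):

import numpy as np
from sklearn.preprocessing import CategoricalEncoder

X = np.array([['a'], ['b'], ['c']], dtype=object)

enc = CategoricalEncoder(categories=[['a', 'b']], handle_unknown='ignore')
enc.fit(X)
print(enc.transform(X).toarray())
# [[1. 0.]
#  [0. 1.]
#  [0. 0.]]   <- the unknown 'c' becomes a row of zeros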

@jnothman (Member) commented Jul 13, 2017 via email

@amueller (Member)

Any news on this? ;)

@amueller (Member)

Sorry for the delay. The question is whether we want dataframe in -> dataframe out? That might be nice, but I'd rather merge without that and possibly add that later.

@amueller (Member)

We're gonna do get_feature_names in a follow-up PR, right?

@jorisvandenbossche (Member Author)

We're gonna do get_feature_names in a follow-up PR, right?

Yes, that is on my list of follow-ups (#9151 (comment)), although there are some questions about what it should exactly do (#9151 (comment))

The question is whether we want dataframe in -> dataframe out?

I didn't consider 'dataframe out' (although that can also be a consideration, it is a much bigger one, I think). Here it was more about having some special code inside the transformer to prevent converting a dataframe with different dtypes to an 'object'-dtyped array (check_array). This conversion is not really needed, as the transformer encodes the input column by column anyway, so it would be rather easy to preserve the dtypes per column. The transformed output will always have a uniform dtype anyway, so it is fine for that to be an array.
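To illustrate the dtype point (a sketch, not code from this PR):

import numpy as np
import pandas as pd

df = pd.DataFrame({'city': ['London', 'Paris'], 'temp': [12.0, 18.0]})

# converting the whole frame to a single ndarray upcasts everything to object ...
print(np.asarray(df).dtype)            # object

# ... whereas handling it column by column keeps the per-column dtypes
for col in df.columns:
    print(col, df[col].values.dtype)   # city: object, temp: float64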

@amueller (Member)

@jorisvandenbossche ah, that makes more sense. I would leave it as-is. Merge?

@amueller (Member)

I think the proposal with something like ['0__female', '0__male', '1__0', '1__1'] is good. I would do it as in PolynomialFeatures, which uses x0, x1, etc. (maybe), with an option to pass in input feature names to transform them. That would allow preserving the semantics more easily.

@jorisvandenbossche (Member Author)

Merge?

Yes, I think somebody can merge it!

@jnothman (Member)

Sure. Let's see what this thing does in the wild!

@jnothman merged commit a2ebb8c into scikit-learn:master Nov 21, 2017
@jnothman (Member)

Congratulations! And thanks. Please feel free to make follow-up issues.

@jorisvandenbossche (Member Author)

And thanks a lot for all the review!

@jorisvandenbossche deleted the pr/6559 branch November 21, 2017 12:26
@amueller (Member)

My excitement about this is pretty much through the roof lol ;)

@amueller (Member)

I would appreciate it if you could focus on #9012, maybe we can leave the get_feature_names for someone else, it shouldn't be too tricky.

@vighneshbirodkar (Contributor)

Congratulations and thanks @jorisvandenbossche
It is nice to see this finally merged.

@austinmw

Hi, any chance you could add a drop_first parameter like in pandas.get_dummies()? It'd make it easier to put this in a pipeline without requiring something additional to drop a column.

@jnothman (Member) commented Jan 11, 2018 via email

@jorisvandenbossche (Member Author)

For people who are subscribed here: I opened a new issue with some questions on the API and naming: #10521

@rragundez (Contributor) commented Mar 11, 2019

CategoricalEncoder is now refactored into OneHotEncoder and OrdinalEncoder. #10523

Successfully merging this pull request may close these issues.

  • OneHotEncoder should accept string values
  • Pipeline doesn't work with Label Encoder