
[MRG] Support for strings in OneHotEncoder #8793


Closed

Conversation

@stephen-hoover (Contributor) commented Apr 25, 2017

Reference Issue

Continues PR from #7327.

What does this implement/fix? Explain your changes.

This PR modifies the OneHotEncoder to use the LabelEncoder for encoding each column. This change allows the OneHotEncoder to expand arrays of dtype np.object, which may contain string values as well as numerical values.
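For illustration, a minimal usage sketch of the kind of input this change is meant to support (hypothetical usage, not taken from the PR's tests; the exact output layout depends on the implementation):

import numpy as np
from sklearn.preprocessing import OneHotEncoder

# A single object-dtype column holding string categories.
X = np.array([['cat'], ['dog'], ['cat'], ['fish']], dtype=object)
enc = OneHotEncoder()
Xt = enc.fit_transform(X)  # one dummy column per observed category: 'cat', 'dog', 'fish'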

Any other comments?

This PR continues from #7327. Apart from some refactoring, the functional changes from that PR are: removal of the forced copies (and of the copy parameter on the OneHotEncoder), restricting range checking to the numerical columns, raising an error if numerical columns have values < 0, and modifying active_features_, feature_indices_, and n_values_ to reflect the way the code now operates.

CC @vighneshbirodkar . Thanks for putting this together!

TODO

  • Review tests
  • Review documentation
  • Add entry to What's New
  • Add attribute to hold categorical level at each column of the output

@stephen-hoover (Contributor, Author)

@jnothman , here's a continuation of #7327 . My big question is whether or not a new transformer object would be acceptable. There's a lot of complexity in here which is wrapped up with the range-checking behavior of the existing OneHotEncoder.

How do you feel about moving this functionality to a new class, perhaps named CategoricalEncoder, and leaving the existing OneHotEncoder the way it is?

@jnothman (Member) commented Apr 25, 2017 via email

@vighneshbirodkar (Contributor)

There already was a failed attempt at a new class at #6559

@jnothman (Member)

My time is very limited at the moment due to teaching. You've marked this WIP. What do you feel remains to be changed before this is in a state to be reviewed-for-merge?

Stephen Hoover added 2 commits April 26, 2017 14:16
Also restore `n_values_` and `active_features_` attributes.
@stephen-hoover (Contributor, Author)

I did a bit more work on the code. I think the changes to OneHotEncoder are in good shape now, but I still want to look over the tests and see if there are any cases to add. This also needs an entry in What's New, and I haven't looked over the documentation yet. I put up a checklist for these items. I'll be able to work on this more this week.

Changes with this latest update:

  • Removed 'error-strict' option for handle_unknown and added 'auto-strict' option for values. I think it's easier to write the code for the options set up this way, and arguably the original OHE behavior is a choice about how to fit the data. "Auto" is the original behavior, while "auto-strict" only uses categories from the input data.
  • Restored active_features_, since it's the easiest way to maintain the original class behavior with values='auto'.
  • Restored n_values_. It still seems potentially of use, but I don't feel strongly about it.
  • Removed the max value check. I'm handling this with the LabelEncoders now.
  • Removed the cast to np.object. This converted inputs of e.g. integers into object arrays, which I'd rather not do. The downside(?) is that lists of mixed string and numerical types get converted into string types, which should only matter with values='auto'.
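For context, a quick numpy illustration (not code from this PR) of the casting behaviour mentioned in the last point:

import numpy as np

# A plain list mixing strings and numbers is promoted to a string dtype...
X_str = np.array([['a', 1], ['b', 2]])      # dtype is unicode; 1 and 2 become '1' and '2'
# ...whereas an explicit object dtype preserves the original Python types:
X_obj = np.array([['a', 1], ['b', 2]], dtype=object)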

@stephen-hoover (Contributor, Author)

Sorry, looks like I have some test failures to deal with as well.

@stephen-hoover changed the title from "[WIP] Support for strings in OneHotEncoder" to "[MRG] Support for strings in OneHotEncoder" on Apr 27, 2017
@stephen-hoover (Contributor, Author)

@jnothman , this is ready to be reviewed for merge now.

# 0 and column max. The transform will still only
# return dummy columns for integers present in training data.
if (not isinstance(X[0, i], six.string_types) and
        int(X[0, i]) == X[0, i]):
Contributor Author:

I think this is the weakest part of this PR. I'm type-checking the first row of the column to see if it looks like an integer and, if it does, assuming that the entire column is integers. I don't think we can avoid type-checking, since the encoder needs to treat integers differently from everything else in "auto" mode. I don't want to do type checking on every entry in the column, though. That seems too slow.

The np.unique call will error if the column has mixed string and numerical types, but we could still run into problems if there's mixed floats and integers.
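A toy illustration of that failure mode (values made up for this sketch):

import numpy as np

col = np.array([2.0, 1.5])               # the first entry happens to look like an integer
looks_integral = int(col[0]) == col[0]   # True, so the whole column gets treated as integers
# ...but 1.5 is not integral, so the range-based 'auto' handling would be applied incorrectly.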

@jnothman (Member) left a comment:

Some comments on API so far. I'd be interested to know what you think about what OHE should look like in a couple of versions' time.

handle_unknown='error', n_values=[2, 3, 4], sparse=True)
>>> enc.transform([[1, 0, 0]]).toarray()
array([[ 0., 1., 1., 0., 0., 1., 0., 0., 0.]])
>>> browsers = ['uses Internet Explorer', 'uses Chrome' , 'uses Safari', 'uses Firefox']
Member:

please try to limit lines to 80 chars. "uses " here seems a gratuitous waste of horizontal space

@@ -171,6 +171,16 @@ Enhancements
removed by setting it to `None`.
:issue:`7674` by:user:`Yichuan Liu <yl565>`.

- :class:`preprocessing.OneHotEncoder` now fits and transforms inputs of
Member:

I would split this between enhancements (the new stuff) and api changes (deprecations).

in ``range(n_values[i])``
values : 'auto', 'auto-strict', int, List[int], or List[List[objects]]
- 'auto' (default) : Determine set of values from training data.
If values are integers, then allowed values will be between
Member:

we have a distinction between "allowed" and "transformed" values for "auto" with ints. Maybe specify "errors will only be raised in transform for values out of the input range, when the input is integer". Really, though, this is a weird API, which is why we had designed handle_unknown="error-strict". Perhaps it's easier to change handle_unknown="error-strict" to become the default in a future version than values="auto-strict". WDYT?

n_values_ : array of shape (n_features,)
Maximum number of values per feature.
Number of categories per feature. Has value `0` for
Member:

Unclear how this works in the auto integer case

Contributor Author:

I changed to "Number of encoded categories" and moved active_features_ back to deprecation. Hopefully that will make it clearer.


one_hot_feature_index_ : array, shape [n_features_new]
``one_hot_feature_index_[i]`` specifies which feature of the input
is encoded by column `i` in the one-hot encoded array.
Member:

need double-backticks


Attributes
----------
feature_index_range_ : array, shape [n_feature, 2]
``feature_index_range_[i]`` specifies the range of column indices
occupied by the input feature `i` in the one-hot encoded array.
Member:

Is this the range prior to or after masking by active_features_?

Contributor Author:

After masking. Users never see output before masking, so this gives column indices in the actual output data.

``feature_index_range_[i]`` specifies the range of column indices
occupied by the input feature `i` in the one-hot encoded array.

one_hot_feature_index_ : array, shape [n_features_new]
Member:

Why do we need this as well as feature_index_range?

Don't we also need an attribute which indicates what values correspond to each feature of the output?

Contributor Author:

feature_index_range_ maps from inputs to outputs. one_hot_feature_index_ maps from outputs to inputs. Maybe @vighneshbirodkar could comment more about the motivation for having both.
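For concreteness, a small made-up example of how the two attributes relate (values are illustrative only):

import numpy as np

# Two input features with 2 and 3 categories respectively -> 5 output columns.
feature_index_range_ = np.array([[0, 2],    # input feature 0 -> output columns 0..1
                                 [2, 5]])   # input feature 1 -> output columns 2..4
one_hot_feature_index_ = np.array([0, 0, 1, 1, 1])  # output column -> input feature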

That's a good point about having an attribute to show the levels. I'll add a 1D object array whose values are the category encoded in each column.

Contributor Author:

Added categories_.

# Record which column of input data corresponds
# to each column of output data
n_expanded_cols = end + n_features - num_cat
self.one_hot_feature_index_ = np.empty(n_expanded_cols, dtype=np.int)
Member:

I think this can be filled with a one-liner (or something like it) using np.repeat and np.diff
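Presumably something along these lines (a sketch of that suggestion, assuming feature_index_range_ holds contiguous [start, end) pairs per input feature):

import numpy as np

# Width of each feature's block of output columns, then repeat each feature
# index once per output column it occupies:
widths = np.diff(self.feature_index_range_, axis=1).ravel()
self.one_hot_feature_index_ = np.repeat(np.arange(n_features), widths)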


error_msg = re.escape("Column 0 contains new labels: [2]")
assert_raises_regex(ValueError, error_msg,
                    OneHotEncoder(n_values=2).fit_transform, X)
Member:

Surely this will also raise a DeprecationWarning, which should be caught/ignored.

Contributor Author:

Added tests that deprecated parameters raise warnings.
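For reference, such a test might look roughly like this (a sketch using the assert_warns helper from sklearn.utils.testing; not the exact test added in the PR):

import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.utils.testing import assert_warns

X = np.array([[0], [1]])
# Passing the deprecated n_values parameter should emit a DeprecationWarning on fit.
assert_warns(DeprecationWarning, OneHotEncoder(n_values=2).fit, X)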

@@ -1500,13 +1506,37 @@ def test_one_hot_encoder_sparse():
# test exception on wrong init param
assert_raises(TypeError, OneHotEncoder(n_values=np.int).fit, X)


def test_one_hot_encoder_error_on_negative():
# Negative numerical values in inputs should raise an exception
Member:

I don't like this constraint unless we keep OHE as only dealing with ordinal inputs. Why do we need it? If we're raising an error when it happens, it's not like we're keeping things any more efficient than using a LabelEncoder in this case.

Contributor Author:

Erroring on negative integers is a part of the current OHE. I added this test to make sure that I preserved that behavior. The fit stage has an explicit check, but the transform stage relies on the LabelEncoder to error for unknown values.

Member:

That behaviour derives from the assumption that inputs are a 0-based ordinal representation of categories (like the output from LabelEncoder). It no longer applies...?

Contributor Author:

Maybe I should say that the explicit check in fit is only when values="auto" and only on integer features. That mode is still assuming that the feature is a 0-based ordinal. I'm not sure what better way there is to not break the way the OHE used to work.

Member:

Okay. Do we have a test that checks negative integers are fine in the auto-strict case?

Contributor Author:

I added a line to this test case to check that.

@stephen-hoover (Contributor, Author)

API thoughts:

I think the cleanest interface would be to have values accept either "auto" or a sequence of sequences specifying the expected levels for each categorical feature. handle_unknown should be either "ignore" or "error". (Possibly there could be an extra option to encode an "unknown values" column, but adding that would be out of scope for this PR.)

That wouldn't let users completely reproduce the current behavior of ignoring unseen integers inside a 0-to-max-fit range but erroring on integers outside of that range. I'm curious to know how many people rely on that behavior, but I don't know how I'd find out.

If we're going to keep the current OHE "auto" behavior, I'd add a new option for handle_unknown, perhaps "error-out-of-range", and ask users to switch to that if they want the old-style behavior.
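To make this concrete, a purely hypothetical usage sketch of the options described above (parameter values are illustrative, not an agreed API):

# The cleanest interface: explicit per-column levels, with unseen values ignored.
enc = OneHotEncoder(values=[['a', 'b', 'c'], [0, 1, 2]],
                    handle_unknown='ignore')
# A possible opt-in to the old range-based behaviour for integer columns:
enc = OneHotEncoder(values='auto', handle_unknown='error-out-of-range')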

How to get there:

Writing this, I switched back and forth between thinking of the current OHE behavior in "auto" mode as a different way of fitting versus a different way of raising errors. Now I'm back to thinking that it's about raising errors. I'm inclined to get rid of "auto-strict" and use "error-strict" again. I'd do that without significant code changes, by checking the handle_unknown value at fit time. If we don't keep the current "auto" behavior, I'd also deprecate active_features_.

@jnothman (Member) commented May 3, 2017

I'm inclined towards error-strict from a user orientation. The auto-strict definition makes the current auto behaviour confusing vis-a-vis active_features_ (even if active_features_ should be more of an implementation detail than it is because of the current error behaviour).

I think you should always assume that someone has taken advantage of the current behaviour, and aim to maintain/deprecate it.

(Supporting strings here makes this very close in behaviour to DictVectorizer (especially since DictVectorizer currently assumes numeric features or single-valued categorical features). I suspect it won't be faster than DictVectorizer. What do we gain by this alternative interface, other than a reduction in user upset?)

@stephen-hoover (Contributor, Author)

I switched back to "error-strict" and got rid of "auto-strict". I was also more aggressive with deprecations. I moved active_features_ back to deprecated, which will hopefully remove some confusion about whether fit attributes are masked. There are a lot of deprecations now, but the OneHotEncoder would need much less code in another two versions.

Could you explain more about how someone would use a DictVectorizer for one-hot-encoding categoricals? It seems like there'd need to be a lot of extra preprocessing to make that work. With the OneHotEncoder in this PR, you can read in a CSV and send it straight into a Pipeline that starts with a OneHotEncoder. That's valuable to me.

@jnothman (Member) commented May 4, 2017

>>> df = pd.DataFrame({'c1': ['a', 'b', 'c', 'a'], 'c2': [.6, .5, .7, 1]})  # or read_csv
>>> DictVectorizer().fit_transform(df.to_dict('records')).toarray()
array([[ 1. ,  0. ,  0. ,  0.6],
       [ 0. ,  1. ,  0. ,  0.5],
       [ 0. ,  0. ,  1. ,  0.7],
       [ 1. ,  0. ,  0. ,  1. ]])

@jnothman (Member) commented May 4, 2017

from sklearn.feature_extraction import DictVectorizer


class PandasVectorizer(DictVectorizer):
    def fit(self, X, y=None):
        return super(PandasVectorizer, self).fit(X.to_dict('records'), y)
    def fit_transform(self, X, y=None):
        return super(PandasVectorizer, self).fit_transform(X.to_dict('records'), y)
    def transform(self, X):
        return super(PandasVectorizer, self).transform(X.to_dict('records'))
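
A brief usage sketch, reusing the df from the previous comment (the result should match the DictVectorizer output shown above):

vec = PandasVectorizer()
Xt = vec.fit_transform(df).toarray()  # one-hot expansion of 'c1' alongside the numeric 'c2'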

are we wasting our time?

@jnothman (Member) commented May 4, 2017

What are your thoughts on whether we need this functionality beyond Pandas/DictVectorizer? Should we start by putting an example of the above somewhere, with the possibility that we'll leave OHE alone?

@stephen-hoover (Contributor, Author)

That's a nice implementation. I'd forgotten about to_dict; I don't often have reason to use it.

I think there's a number of simple ways to write a transformer which one-hot-encodes categorical features. Most of the terrible complexity of this PR is so that the OneHotEncoder can do both the old thing and the new thing at the same time. That PandasVectorizer may be the simplest implementation, assuming you're using pandas, but I'd need to poke at speed and memory use before I trusted it for anything big.

Personally, I'd like to see an official, off-the-shelf solution for categorical expansion. My own use case is that I'm writing code to automate model fitting and prediction. It involves sending Pipeline objects to AWS and retrieving them afterwards. Right now I'm handling categorical expansion outside of the Pipeline (I've been trying to avoid custom code inside it so that users don't need to install anything extra), but it would be nice to be able to include all of the steps together.

FWIW, I think people are often surprised that scikit-learn doesn't already handle one-hot-encoding of string categoricals. I was surprised, and spent a while digging through the documentation to confirm it for myself. Coworkers I've talked to have also been surprised.

Whether this functionality comes through a modified OneHotEncoder or a new class, I do think that handling string categorical features is a common enough use case that it should be supported through code in scikit-learn.

@jnothman (Member)

Part of the confusion here is that numpy really doesn't do a neat job of managing strings and mixed data types. I mean, it has struct arrays / recarrays, but we're not even handling them here.

Perhaps what we need these days is a DataFrameVectorizer. In particular, this would record input column structures (perhaps even when MultiIndex is involved) and dtypes to enable inverse transformation. It would one-hot encode strings. An initial implementation could build upon DictVectorizer although it may be better to operate on a DataFrame column-wise. WDYT? @amueller?

@stephen-hoover (Contributor, Author)

I agree that numpy arrays are really the wrong data structure for mixed string/numerical data. I think that a transformer which operates column-wise on a DataFrame would be a useful addition. If this is going to be in scikit-learn itself, then my intuition is that column-wise operation would be more efficient (although a few more lines of code) than your PandasVectorizer suggestion.

I would happily work on such a contribution, if that's something you'd accept. Would you be willing to have a DataFrame-specific transformer? I don't think it would require a pandas dependency (except for testing), but it would assume that its input was a DataFrame.

@jnothman (Member) commented May 26, 2017 via email

@jnothman (Member)

ColumnTransformer is quite close to this, but we would need to allow using something like LabelEncoder. The trouble I foresee is that if we want to allow encoding non-textual data, we couldn't do it through the same framework, simply because those transformers (e.g. OneHotEncoder) require rectangular arrays, not vectors.

@amueller (Member)

I hadn't seen these last comments. I would like to also support arrays, maybe only the simple cases. I think @GaelVaroquaux would be ok with having a transformer that is mostly intended to work on dataframes.
I'm really not sure if keeping the OneHotEncoder is worth it.
We could add a new class to the proposed experimental module.
The main question then is which kinds of ndarray inputs we would want to support. I would REALLY REALLY like to move forward on this....

@amueller (Member)

Discussion with @GaelVaroquaux and @jorisvandenbossche concluded with: we want a new estimator; it will work on numpy arrays; it will one-hot encode all columns, with no way to subset them; people should use ColumnTransformer for that, and we'll make sure that use-case is very simple.
Basically ColumnTransformer should be the one-stop-shop for mixed data. If we do a selection of columns in OneHotEncoder, then we haven't solved the problem of scaling and imputation, and people would need to use the ColumnTransformer anyhow.

@amueller (Member) commented Jun 10, 2017

From #9012 the code I'd like to write:

data = pd.read_csv("stuff")
categorical = data.dtypes == "object"
preprocess = ColumnTransformer([
    ('categorical',
     make_pipeline(Imputer(strategy='most_frequent'), CategoricalEncoder()),
     categorical),
    ('continuous',
     make_pipeline(Imputer(strategy='mean'), StandardScaler()),
     ~categorical),
])
X = preprocess.fit_transform(data)

@stephen-hoover (Contributor, Author)

Closing, since this isn't going to merge and has been superseded by #9151.

@jnothman (Member) commented Jun 22, 2017 via email

@jorisvandenbossche (Member)

Yes, sorry, @stephen-hoover. Feedback on the new (or old ..) API is always welcome!
