
[MRG] Support for strings in OneHotEncoder #8793


Closed

Conversation

@stephen-hoover (Contributor) commented Apr 25, 2017

Reference Issue

Continues PR from #7327.

What does this implement/fix? Explain your changes.

This PR modifies the OneHotEncoder to use the LabelEncoder for encoding each column. This change allows the OneHotEncoder to expand arrays of dtype np.object, which may contain string values as well as numerical values.
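For illustration, a minimal usage sketch of the kind of input this change is meant to support (hypothetical usage, not taken from the PR's tests; the exact output layout depends on the implementation):

import numpy as np
from sklearn.preprocessing import OneHotEncoder

# A single object-dtype column holding string categories.
X = np.array([['cat'], ['dog'], ['cat'], ['fish']], dtype=object)
enc = OneHotEncoder()
Xt = enc.fit_transform(X)  # one dummy column per observed category: 'cat', 'dog', 'fish'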

Any other comments?

This PR continues from #7327. Apart from some refactoring, the functional changes from that PR are: removal of the forced copies (and of the copy parameter on the OneHotEncoder), restricting range checking to the numerical columns, raising an error if numerical columns have values < 0, and modifying active_features_, feature_indices_, and n_values_ to reflect the way the code now operates.

CC @vighneshbirodkar . Thanks for putting this together!

TODO

  • Review tests
  • Review documentation
  • Add entry to What's New
  • Add attribute to hold categorical level at each column of the output

@stephen-hoover (Contributor, Author)

@jnothman , here's a continuation of #7327 . My big question is whether or not a new transformer object would be acceptable. There's a lot of complexity in here which is wrapped up with the range-checking behavior of the existing OneHotEncoder.

How do you feel about moving this functionality to a new class, perhaps named CategoricalEncoder, and leaving the existing OneHotEncoder the way it is?

@jnothman (Member) commented Apr 25, 2017 via email

@vighneshbirodkar (Contributor)

There already was a failed attempt at a new class at #6559

@jnothman (Member)

My time is very limited at the moment due to teaching. You've marked this WIP. What do you feel remains to be changed before this is in a state to be reviewed-for-merge?

Stephen Hoover added 2 commits April 26, 2017 14:16
Also restore `n_values_` and `active_features_` attributes.
@stephen-hoover (Contributor, Author)

I did a bit more work on the code. I think the changes to OneHotEncoder are in good shape now, but I still want to look over the tests and see if there are any cases to add. This also needs an entry in What's New, and I haven't looked over the documentation yet. I put up a checklist for these items. I'll be able to work on this more this week.

Changes with this latest update:

  • Removed 'error-strict' option for handle_unknown and added 'auto-strict' option for values. I think it's easier to write the code for the options set up this way, and arguably the original OHE behavior is a choice about how to fit the data. "Auto" is the original behavior, while "auto-strict" only uses categories from the input data.
  • Restored active_features_, since it's the easiest way to maintain the original class behavior with values='auto'.
  • Restored n_values_. It still seems potentially of use, but I don't feel strongly about it.
  • Removed the max value check. I'm handling this with the LabelEncoders now.
  • Removed the cast to np.object. This converted inputs of e.g. integers into object arrays, which I'd rather not do. The downside(?) is that lists of mixed string and numerical types get converted into string types, which should only matter with values='auto'.
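For context, a quick numpy illustration (not code from this PR) of the casting behaviour mentioned in the last point:

import numpy as np

# A plain list mixing strings and numbers is promoted to a string dtype...
X_str = np.array([['a', 1], ['b', 2]])      # dtype is unicode; 1 and 2 become '1' and '2'
# ...whereas an explicit object dtype preserves the original Python types:
X_obj = np.array([['a', 1], ['b', 2]], dtype=object)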

@stephen-hoover (Contributor, Author)

Sorry, looks like I have some test failures to deal with as well.

@stephen-hoover changed the title from "[WIP] Support for strings in OneHotEncoder" to "[MRG] Support for strings in OneHotEncoder" on Apr 27, 2017
@stephen-hoover (Contributor, Author)

@jnothman , this is ready to be reviewed for merge now.

# 0 and column max. The transform will still only
# return dummy columns for integers present in training data.
if (not isinstance(X[0, i], six.string_types) and
        int(X[0, i]) == X[0, i]):
Contributor Author:

I think this is the weakest part of this PR. I'm type-checking the first row of the column to see if it looks like an integer and, if it does, assuming that the entire column is integers. I don't think we can avoid type-checking, since the encoder needs to treat integers differently from everything else in "auto" mode. I don't want to do type checking on every entry in the column, though. That seems too slow.

The np.unique call will error if the column has mixed string and numerical types, but we could still run into problems if there's mixed floats and integers.
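A toy illustration of that failure mode (values made up for this sketch):

import numpy as np

col = np.array([2.0, 1.5])               # the first entry happens to look like an integer
looks_integral = int(col[0]) == col[0]   # True, so the whole column gets treated as integers
# ...but 1.5 is not integral, so the range-based 'auto' handling would be applied incorrectly.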

@jnothman (Member) left a comment:

Some comments on API so far. I'd be interested to know what you think about what OHE should look like in a couple of versions' time.

handle_unknown='error', n_values=[2, 3, 4], sparse=True)
>>> enc.transform([[1, 0, 0]]).toarray()
array([[ 0., 1., 1., 0., 0., 1., 0., 0., 0.]])
>>> browsers = ['uses Internet Explorer', 'uses Chrome' , 'uses Safari', 'uses Firefox']
Member:

please try to limit lines to 80 chars. "uses " here seems a gratuitous waste of horizontal space

@@ -171,6 +171,16 @@ Enhancements
removed by setting it to `None`.
:issue:`7674` by:user:`Yichuan Liu <yl565>`.

- :class:`preprocessing.OneHotEncoder` now fits and transforms inputs of
Member:

I would split this between enhancements (the new stuff) and api changes (deprecations).

in ``range(n_values[i])``
values : 'auto', 'auto-strict', int, List[int], or List[List[objects]]
- 'auto' (default) : Determine set of values from training data.
If values are integers, then allowed values will be between
Member:

we have a distinction between "allowed" and "transformed" values for "auto" with ints. Maybe specify "errors will only be raised in transform for values out of the input range, when the input is integer". Really, though, this is a weird API, which is why we had designed handle_unknown="error-strict". Perhaps it's easier to change handle_unknown="error-strict" to become the default in a future version than values="auto-strict". WDYT?

n_values_ : array of shape (n_features,)
Maximum number of values per feature.
Number of categories per feature. Has value `0` for
Member:

Unclear how this works in the auto integer case

Contributor Author:

I changed to "Number of encoded categories" and moved active_features_ back to deprecation. Hopefully that will make it clearer.


one_hot_feature_index_ : array, shape [n_features_new]
``one_hot_feature_index_[i]`` specifies which feature of the input
is encoded by column `i` in the one-hot encoded array.
Member:

need double-backticks


Attributes
----------
feature_index_range_ : array, shape [n_feature, 2]
``feature_index_range_[i]`` specifies the range of column indices
occupied by the input feature `i` in the one-hot encoded array.
Member:

Is this the range prior to or after masking by active_features_?

Contributor Author:

After masking. Users never see output before masking, so this gives column indices in the actual output data.

``feature_index_range_[i]`` specifies the range of column indices
occupied by the input feature `i` in the one-hot encoded array.

one_hot_feature_index_ : array, shape [n_features_new]
Member:

Why do we need this as well as feature_index_range?

Don't we also need an attribute which indicates what values correspond to each feature of the output?

Contributor Author:

feature_index_range_ maps from inputs to outputs. one_hot_feature_index_ maps from outputs to inputs. Maybe @vighneshbirodkar could comment more about the motivation for having both.
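For concreteness, a small made-up example of how the two attributes relate (values are illustrative only):

import numpy as np

# Two input features with 2 and 3 categories respectively -> 5 output columns.
feature_index_range_ = np.array([[0, 2],    # input feature 0 -> output columns 0..1
                                 [2, 5]])   # input feature 1 -> output columns 2..4
one_hot_feature_index_ = np.array([0, 0, 1, 1, 1])  # output column -> input feature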

That's a good point about having an attribute to show the levels. I'll add a 1D object array whose values are the category encoded in each column.

Contributor Author:

Added categories_.

# Record which column of input data corresponds
# to each column of output data
n_expanded_cols = end + n_features - num_cat
self.one_hot_feature_index_ = np.empty(n_expanded_cols, dtype=np.int)
Member:

I think this can be filled with a one-liner (or something like it) using np.repeat and np.diff
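Presumably something along these lines (a sketch of that suggestion, assuming feature_index_range_ holds contiguous [start, end) pairs per input feature):

import numpy as np

# Width of each feature's block of output columns, then repeat each feature
# index once per output column it occupies:
widths = np.diff(self.feature_index_range_, axis=1).ravel()
self.one_hot_feature_index_ = np.repeat(np.arange(n_features), widths)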


error_msg = re.escape("Column 0 contains new labels: [2]")
assert_raises_regex(ValueError, error_msg,
                    OneHotEncoder(n_values=2).fit_transform, X)
Member:

Surely this will also raise a DeprecationWarning, which should be caught/ignored.

Contributor Author:

Added tests that deprecated parameters raise warnings.
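For reference, such a test might look roughly like this (a sketch using the assert_warns helper from sklearn.utils.testing; not the exact test added in the PR):

import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.utils.testing import assert_warns

X = np.array([[0], [1]])
# Passing the deprecated n_values parameter should emit a DeprecationWarning on fit.
assert_warns(DeprecationWarning, OneHotEncoder(n_values=2).fit, X)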

@@ -1500,13 +1506,37 @@ def test_one_hot_encoder_sparse():
# test exception on wrong init param
assert_raises(TypeError, OneHotEncoder(n_values=np.int).fit, X)


def test_one_hot_encoder_error_on_negative():
# Negative numerical values in inputs should raise an exception
Member:

I don't like this constraint unless we keep OHE as only dealing with ordinal inputs. Why do we need it? If we're raising an error when it happens, it's not like we're keeping things any more efficient than using a LabelEncoder in this case.

Contributor Author:

Erroring on negative integers is a part of the current OHE. I added this test to make sure that I preserved that behavior. The fit stage has an explicit check, but the transform stage relies on the LabelEncoder to error for unknown values.

Member:

That behaviour derives from the assumption that inputs are a 0-based ordinal representation of categories (like the output from LabelEncoder). It no longer applies...?

Contributor Author:

Maybe I should say that the explicit check in fit is only when values="auto" and only on integer features. That mode is still assuming that the feature is a 0-based ordinal. I'm not sure what better way there is to not break the way the OHE used to work.

Member:

Okay. Do we have a test that checks negative integers are fine in the auto-strict case?

Contributor Author:

I added a line to this test case to check that.

@stephen-hoover (Contributor, Author)

API thoughts:

I think the cleanest interface would be to have values accept either "auto" or a sequence of sequences specifying the expected levels for each categorical feature. handle_unknown should be either "ignore" or "error". (Possibly there could be an extra option to encode an "unknown values" column, but adding that would be out of scope for this PR.)

That wouldn't let users completely reproduce the current behavior of ignoring unseen integers inside a 0-to-max-fit range but erroring on integers outside of that range. I'm curious to know how many people rely on that behavior, but I don't know how I'd find out.

If we're going to keep the current OHE "auto" behavior, I'd add a new option for handle_unknown, perhaps "error-out-of-range", and ask users to switch to that if they want the old-style behavior.
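To make this concrete, a purely hypothetical usage sketch of the options described above (parameter values are illustrative, not an agreed API):

# The cleanest interface: explicit per-column levels, with unseen values ignored.
enc = OneHotEncoder(values=[['a', 'b', 'c'], [0, 1, 2]],
                    handle_unknown='ignore')
# A possible opt-in to the old range-based behaviour for integer columns:
enc = OneHotEncoder(values='auto', handle_unknown='error-out-of-range')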

How to get there:

Writing this, I switched back and forth between thinking of the current OHE behavior in "auto" mode as a different way of fitting versus a different way of raising errors. Now I'm back to thinking that it's about raising errors. I'm inclined to get rid of "auto-strict" and use "error-strict" again. I'd do that without significant code changes, by checking the handle_unknown value at fit time. If we don't keep the current "auto" behavior, I'd also deprecate active_features_.

@jnothman (Member) commented May 3, 2017

I'm inclined towards error-strict from a user orientation. The auto-strict definition makes the current auto behaviour confusing vis-a-vis active_features_ (even if active_features_ should be more of an implementation detail than it is because of the current error behaviour).

I think you should always assume that someone has taken advantage of the current behaviour, and aim to maintain/deprecate it.

(Supporting strings here makes this very close in behaviour to DictVectorizer (especially since DictVectorizer currently assumes numeric features or single-valued categorical features). I suspect it won't be faster than DictVectorizer. What do we gain by this alternative interface, other than a reduction in user upset?)

@stephen-hoover (Contributor, Author)

I switched back to "error-strict" and got rid of "auto-strict". I was also more aggressive with deprecations. I moved active_features_ back to deprecated, which will hopefully remove some confusion about whether fit attributes are masked. There are a lot of deprecations now, but the OneHotEncoder would need much less code in another two versions.

Could you explain more about how someone would use a DictVectorizer for one-hot-encoding categoricals? It seems like there'd need to be a lot of extra preprocessing to make that work. With the OneHotEncoder in this PR, you can read in a CSV and send it straight into a Pipeline that starts with a OneHotEncoder. That's valuable to me.

@jnothman (Member) commented May 4, 2017

>>> df = pd.DataFrame({'c1': ['a', 'b', 'c', 'a'], 'c2': [.6, .5, .7, 1]})  # or read_csv
>>> DictVectorizer().fit_transform(df.to_dict('records')).toarray()
array([[ 1. ,  0. ,  0. ,  0.6],
       [ 0. ,  1. ,  0. ,  0.5],
       [ 0. ,  0. ,  1. ,  0.7],
       [ 1. ,  0. ,  0. ,  1. ]])

@jnothman (Member) commented May 4, 2017

from sklearn.feature_extraction import DictVectorizer


class PandasVectorizer(DictVectorizer):
    def fit(self, X, y=None):
        return super(PandasVectorizer, self).fit(X.to_dict('records'), y)
    def fit_transform(self, X, y=None):
        return super(PandasVectorizer, self).fit_transform(X.to_dict('records'), y)
    def transform(self, X):
        return super(PandasVectorizer, self).transform(X.to_dict('records'))
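
A brief usage sketch, reusing the df from the previous comment (the result should match the DictVectorizer output shown above):

vec = PandasVectorizer()
Xt = vec.fit_transform(df).toarray()  # one-hot expansion of 'c1' alongside the numeric 'c2'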

are we wasting our time?

@jnothman (Member) commented May 4, 2017

What are your thoughts on whether we need this functionality beyond Pandas/DictVectorizer? Should we start by putting an example of the above somewhere, with the possibility that we'll leave OHE alone?

@stephen-hoover (Contributor, Author)

That's a nice implementation. I'd forgotten about to_dict; I don't often have reason to use it.

I think there's a number of simple ways to write a transformer which one-hot-encodes categorical features. Most of the terrible complexity of this PR is so that the OneHotEncoder can do both the old thing and the new thing at the same time. That PandasVectorizer may be the simplest implementation, assuming you're using pandas, but I'd need to poke at speed and memory use before I trusted it for anything big.

Personally, I'd like to see an official, off-the-shelf solution for categorical expansion. My own use case is that I'm writing code to automate model fitting and prediction. It involves sending Pipeline objects to AWS and retrieving them afterwards. Right now I'm handling categorical expansion outside of the Pipeline (I've been trying to avoid custom code inside it so that users don't need to install anything extra), but it would be nice to be able to include all of the steps together.

FWIW, I think people are often surprised that scikit-learn doesn't already handle one-hot-encoding of string categoricals. I was surprised, and spent a while digging through the documentation to confirm it for myself. Coworkers I've talked to have also been surprised.

Whether this functionality comes through a modified OneHotEncoder or a new class, I do think that handling string categorical features is a common enough use case that it should be supported through code in scikit-learn.

@jnothman (Member)

Part of the confusion here is that numpy really doesn't do a neat job of managing strings and mixed data types. I mean, it has struct arrays / recarrays, but we're not even handling them here.

Perhaps what we need these days is a DataFrameVectorizer. In particular, this would record input column structures (perhaps even when MultiIndex is involved) and dtypes to enable inverse transformation. It would one-hot encode strings. An initial implementation could build upon DictVectorizer although it may be better to operate on a DataFrame column-wise. WDYT? @amueller?

@stephen-hoover (Contributor, Author)

I agree that numpy arrays are really the wrong data structure for mixed string/numerical data. I think that a transformer which operates column-wise on a DataFrame would be a useful addition. If this is going to be in scikit-learn itself, then my intuition is that column-wise operation would be more efficient (although a few more lines of code) than your PandasVectorizer suggestion.

I would happily work on such a contribution, if that's something you'd accept. Would you be willing to have a DataFrame-specific transformer? I don't think it would require a pandas dependency (except for testing), but it would assume that its input was a DataFrame.

@jnothman (Member) commented May 26, 2017 via email

@jnothman (Member)

ColumnTransformer is quite close to this, but we would need to allow using something like LabelEncoder. The trouble I foresee is that if we want to allow encoding non-textual data, we couldn't do it through the same framework, simply because those transformers (e.g. OneHotEncoder) require rectangular arrays, not vectors.

@amueller (Member)

I hadn't seen these last comments. I would like to also support arrays, maybe only the simple cases. I think @GaelVaroquaux would be ok with having a transformer that is mostly intended to work on dataframes.
I'm really not sure if keeping the OneHotEncoder is worth it.
We could add a new class to the proposed experimental module.
The main question then is which kinds of ndarray inputs we would want to support. I would REALLY REALLY like to move forward on this....

@amueller (Member)

Discussion with @GaelVaroquaux and @jorisvandenbossche concluded with: we want a new estimator; it will work on numpy arrays; it will one-hot encode all columns, with no way to subset them; people should use ColumnTransformer for that, and we'll make sure that use-case is very simple.
Basically ColumnTransformer should be the one-stop-shop for mixed data. If we do a selection of columns in OneHotEncoder, then we haven't solved the problem of scaling and imputation, and people would need to use the ColumnTransformer anyhow.

@amueller (Member) commented Jun 10, 2017

From #9012 the code I'd like to write:

data = pd.read_csv("stuff")
categorical = data.dtypes == "object"
preprocess = ColumnTransformer([
    ('categorical',
     make_pipeline(Imputer(strategy='most_frequent'), CategoricalEncoder()),
     categorical),
    ('continuous',
     make_pipeline(Imputer(strategy='mean'), StandardScaler()),
     ~categorical),
])
X = preprocess.fit_transform(data)

@stephen-hoover (Contributor, Author)

Closing, since this isn't going to merge and has been superseded by #9151.

@jnothman (Member) commented Jun 22, 2017 via email

@jorisvandenbossche (Member)

Yes, sorry, @stephen-hoover. Feedback on the new (or old ..) API is always welcome!
