[MRG] Support for strings in OneHotEncoder #8793
Conversation
Refactor the OneHotEncoder for easier reading. Avoid mandatory copies of input data in both the `fit` and `transform` steps. Add a test that the input data aren't modified after fitting or transforming.
@jnothman, here's a continuation of #7327. My big question is whether or not a new transformer object would be acceptable. There's a lot of complexity in here which is wrapped up with the range-checking behavior of the existing OneHotEncoder. How do you feel about moving this functionality to a new class, perhaps named CategoricalEncoder, and leaving the existing OneHotEncoder the way it is?
I'm a bit torn, but: I don't think we should create a new class for the sake of a simplified implementation. If we need to have two separate code paths in implementation, so be it. I would be persuaded more by having a cleaner API due to separate classes.
There already was a failed attempt at a new class at #6559.
My time is very limited at the moment due to teaching. You've marked this WIP. What do you feel remains to be changed before this is in a state to be reviewed for merge?
Also restore `n_values_` and `active_features_` attributes.
I did a bit more work on the code. Changes with this latest update:
Sorry, looks like I have some test failures to deal with as well.
@jnothman, this is ready to be reviewed for merge now.
# 0 and column max. The transform will still only
# return dummy columns for integers present in training data.
if (not isinstance(X[0, i], six.string_types) and
        int(X[0, i]) == X[0, i]):
I think this is the weakest part of this PR. I'm type-checking the first row of the column and seeing if it looks like an integer, then if it is, assuming that the entire column is integers. I don't think we can avoid type-checking, since the encoder needs to treat integers differently than everything else in "auto" mode. I don't want to do type checking on every entry in the column, though. That seems too slow.
The `np.unique` call will error if the column has mixed string and numerical types, but we could still run into problems if there are mixed floats and integers.
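For illustration, a minimal sketch of the first-row heuristic being discussed (the helper name is hypothetical, not from the diff):
import numpy as np
import six

def column_looks_integral(X, i):
    # Inspect only the first row of column i and assume the rest of the
    # column has the same type. The `and` short-circuits on strings,
    # so int() is never called on a non-numeric value.
    v = X[0, i]
    return (not isinstance(v, six.string_types)) and int(v) == v

X = np.array([[1, 'a'], [3, 'b']], dtype=object)
column_looks_integral(X, 0)  # True
column_looks_integral(X, 1)  # False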
Some comments on API so far. I'd be interested to know what you think about what OHE should look like in a couple of versions' time.
doc/modules/preprocessing.rst
Outdated
handle_unknown='error', n_values=[2, 3, 4], sparse=True)
>>> enc.transform([[1, 0, 0]]).toarray()
array([[ 0., 1., 1., 0., 0., 1., 0., 0., 0.]])
>>> browsers = ['uses Internet Explorer', 'uses Chrome' , 'uses Safari', 'uses Firefox']
please try to limit lines to 80 chars. "uses " here seems a gratuitous waste of horizontal space
@@ -171,6 +171,16 @@ Enhancements
removed by setting it to `None`.
:issue:`7674` by :user:`Yichuan Liu <yl565>`.

- :class:`preprocessing.OneHotEncoder` now fits and transforms inputs of
I would split this between enhancements (the new stuff) and API changes (deprecations).
sklearn/preprocessing/data.py
Outdated
in ``range(n_values[i])``
values : 'auto', 'auto-strict', int, List[int], or List[List[objects]]
- 'auto' (default) : Determine set of values from training data.
    If values are integers, then allowed values will be between
we have a distinction between "allowed" and "transformed" values for "auto" with ints. Maybe specify "errors will only be raised in transform for values out of the input range, when the input is integer". Really, though, this is a weird API, which is why we had designed `handle_unknown="error-strict"`. Perhaps it's easier to change `handle_unknown="error-strict"` to become the default in a future version than `values="auto-strict"`. WDYT?
sklearn/preprocessing/data.py
Outdated
n_values_ : array of shape (n_features,)
    Maximum number of values per feature.
    Number of categories per feature. Has value `0` for
Unclear how this works in the auto integer case
I changed to "Number of encoded categories" and moved `active_features_` back to deprecation. Hopefully that will make it clearer.
sklearn/preprocessing/data.py
Outdated
one_hot_feature_index_ : array, shape [n_features_new]
    ``one_hot_feature_index_[i]`` specifies which feature of the input
    is encoded by column `i` in the one-hot encoded array.
need double-backticks
sklearn/preprocessing/data.py
Outdated
Attributes
----------
feature_index_range_ : array, shape [n_feature, 2]
    ``feature_index_range_[i]`` specifies the range of column indices
    occupied by the input feature `i` in the one-hot encoded array.
Is this the range prior to or after masking by `active_features_`?
After masking. The users never see output before masking, so this is giving column indices in the actual output data.
sklearn/preprocessing/data.py
Outdated
``feature_index_range_[i]`` specifies the range of column indices
occupied by the input feature `i` in the one-hot encoded array.

one_hot_feature_index_ : array, shape [n_features_new]
Why do we need this as well as `feature_index_range_`?
Don't we also need an attribute which indicates what values correspond to each feature of the output?
`feature_index_range_` maps from inputs to outputs. `one_hot_feature_index_` maps from outputs to inputs. Maybe @vighneshbirodkar could comment more about the motivation for having both.
That's a good point about having an attribute to show the levels. I'll add a 1D object array whose values are the category encoded in each column.
Added `categories_`.
sklearn/preprocessing/data.py
Outdated
# Record which column of input data corresponds
# to each column of output data
n_expanded_cols = end + n_features - num_cat
self.one_hot_feature_index_ = np.empty(n_expanded_cols, dtype=np.int)
I think this can be filled with a one-liner (or something like it) using `np.repeat` and `np.diff`.
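For example, a minimal sketch of the reviewer's suggestion, assuming `feature_index_range_` holds contiguous `[start, end)` ranges (the sample values here are made up):
import numpy as np

# Say three input features occupy output columns [0, 2), [2, 5), [5, 6).
feature_index_range = np.array([[0, 2], [2, 5], [5, 6]])

# Width of each feature's block of output columns: [2, 3, 1].
widths = np.diff(feature_index_range, axis=1).ravel()

# Repeat each input-feature index once per output column it produces.
one_hot_feature_index = np.repeat(np.arange(len(widths)), widths)
# -> array([0, 0, 1, 1, 1, 2])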
error_msg = re.escape("Column 0 contains new labels: [2]")
assert_raises_regex(ValueError, error_msg,
                    OneHotEncoder(n_values=2).fit_transform, X)
surely this will also raise a DeprecationWarning which should be caught/ignored
Added tests that deprecated parameters raise warnings.
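A minimal sketch of such a check, using the stdlib `warnings` machinery rather than scikit-learn's test helpers, and assuming `n_values` is among the parameters this PR deprecates:
import warnings
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X = np.array([[0], [1]])
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    OneHotEncoder(n_values=2).fit(X)
# At least one recorded warning should be the deprecation.
assert any(issubclass(w.category, DeprecationWarning) for w in caught)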
@@ -1500,13 +1506,37 @@ def test_one_hot_encoder_sparse():
# test exception on wrong init param
assert_raises(TypeError, OneHotEncoder(n_values=np.int).fit, X)


def test_one_hot_encoder_error_on_negative():
    # Negative numerical values in inputs should raise an exception
I don't like this constraint unless we keep OHE as only dealing with ordinal inputs. Why do we need it? If we're raising an error when it happens, it's not like we're keeping things any more efficient than using a `LabelEncoder` in this case.
Erroring on negative integers is a part of the current OHE. I added this test to make sure that I preserved that behavior. The `fit` stage has an explicit check, but the `transform` stage relies on the `LabelEncoder` to error for unknown values.
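For context, the `LabelEncoder` behaviour being relied on here, sketched minimally (the exact error message wording varies between scikit-learn versions):
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder().fit([0, 1, 2])
le.transform([1, 2])  # fine: values seen during fit
le.transform([3])     # raises ValueError for the unseen label 3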
That behaviour derives from the assumption that inputs are a 0-based ordinal representation of categories (like output from LabelEncoder). It no longer applies...?
Maybe I should say that the explicit check in `fit` is only when `values="auto"` and only on integer features. That mode is still assuming that the feature is a 0-based ordinal. I'm not sure what better way there is to not break the way the OHE used to work.
Okay. Do we have a test that checks negative integers are fine in the `auto-strict` case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a line to this test case to check that.
API thoughts:
I think the cleanest interface would be to have …. That wouldn't let users completely reproduce the current behavior of ignoring unseen integers inside a 0-to-max-fit range but erroring on integers outside of that range. I'm curious to know how many people rely on that behavior, but I don't know how I'd find out. If we're going to keep the current OHE "auto" behavior, I'd add a new option for ….
How to get there:
Writing this, I switched back and forth between thinking of the current OHE behavior on "auto" mode as being a different way of fitting and a different way of raising errors. Now I'm back to thinking that it's about raising errors. I'm inclined to get rid of "auto-strict" and use "error-strict" again. I'd do that without significant code changes, by checking the …
I'm inclined towards "error-strict". I think you should always assume that someone has taken advantage of the current behaviour, and aim to maintain/deprecate it. (Supporting strings here makes this very close in behaviour to DictVectorizer.)
I switched back to "error-strict" and got rid of "auto-strict". I was also more aggressive with deprecations. I moved …
Could you explain more about how someone would use a DictVectorizer for this?
>>> import pandas as pd
>>> from sklearn.feature_extraction import DictVectorizer
>>> df = pd.DataFrame({'c1': ['a', 'b', 'c', 'a'], 'c2': [.6, .5, .7, 1]})  # or read_csv
>>> DictVectorizer().fit_transform(df.to_dict('records')).toarray()
array([[ 1. ,  0. ,  0. ,  0.6],
       [ 0. ,  1. ,  0. ,  0.5],
       [ 0. ,  0. ,  1. ,  0.7],
       [ 1. ,  0. ,  0. ,  1. ]])
class PandasVectorizer(DictVectorizer):
    def fit(self, X, y=None):
        return super(PandasVectorizer, self).fit(X.to_dict('records'), y)
    def fit_transform(self, X, y=None):
        return super(PandasVectorizer, self).fit_transform(X.to_dict('records'), y)
    def transform(self, X):
        return super(PandasVectorizer, self).transform(X.to_dict('records'))
are we wasting our time?
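A quick usage sketch (not from the original thread), reusing `df` and the `PandasVectorizer` class above:
>>> pv = PandasVectorizer()
>>> pv.fit_transform(df).toarray()
array([[ 1. ,  0. ,  0. ,  0.6],
       [ 0. ,  1. ,  0. ,  0.5],
       [ 0. ,  0. ,  1. ,  0.7],
       [ 1. ,  0. ,  0. ,  1. ]])
>>> pv.get_feature_names()
['c1=a', 'c1=b', 'c1=c', 'c2']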
What are your thoughts on whether we need this functionality beyond Pandas/DictVectorizer? Should we start by putting an example of the above somewhere, with the possibility that we'll leave OHE alone?
That's a nice implementation. I'd forgotten about `to_dict('records')`. I think there are a number of simple ways to write a transformer which one-hot-encodes categorical features. Most of the terrible complexity of this PR is so that the existing `OneHotEncoder` keeps working the way it used to.
Personally, I'd like to see an official, off-the-shelf solution for categorical expansion. My own use case is that I'm writing code to automate model fitting and prediction. It involves sending …
FWIW, I think people are often surprised that scikit-learn doesn't already handle one-hot-encoding of string categoricals. I was surprised, and spent a while digging through the documentation to confirm it for myself. Coworkers I've talked to have also been surprised. Whether this functionality comes through a modified `OneHotEncoder` …
Part of the confusion here is that numpy really doesn't do a neat job of managing strings and mixed data types. I mean, it has struct arrays / recarrays, but we're not even handling them here. Perhaps what we need these days is a …
I agree that numpy arrays are really the wrong data structure for mixed string/numerical data. I think that a transformer which operates column-wise on a DataFrame would be a useful addition. If this is going to be in scikit-learn itself, then my intuition is that column-wise operation would be more efficient (although a few more lines of code) than your PandasVectorizer suggestion. I would happily work on such a contribution, if that's something you'd accept. Would you be willing to have a DataFrame-specific transformer? I don't think it would require a pandas dependency (except for testing), but it would assume that its input was a DataFrame.
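As a very rough sketch of what such a column-wise DataFrame transformer could look like (the class name and behaviour are purely illustrative, not part of this PR):
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class FrameOneHotEncoder(BaseEstimator, TransformerMixin):
    # Hypothetical: one-hot encode the object columns of a DataFrame,
    # passing numeric columns through unchanged.
    def fit(self, X, y=None):
        # Remember the expanded column set seen during training.
        self.columns_ = pd.get_dummies(X).columns
        return self

    def transform(self, X):
        out = pd.get_dummies(X)
        # Align to training columns: unseen categories are dropped,
        # missing ones filled with zeros (like handle_unknown='ignore').
        return out.reindex(columns=self.columns_, fill_value=0)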
IMO this is the right way forward (but @GaelVaroquaux may disagree) although there are questions of how this relates to something like the proposed ColumnTransformer (#3886). I'm sure they should be merged. This does need some more thinking through; it doesn't immediately solve questions of handling unknown labels. But I think the basic idea is that the default handling for strings would be like DictVectorizer's.
ColumnTransformer is quite close to this, but we would need to allow using something like …
I hadn't seen these last comments. I would like to also support arrays, maybe only the simple cases. I think @GaelVaroquaux would be ok with having a transformer that is mostly intended to work on dataframes.
Discussion with @GaelVaroquaux and @jorisvandenbossche concluded with: we want a new estimator, and it will work on numpy arrays, and we will one-hot encode all columns, and there's no way to subset the columns, and people should use something like ColumnTransformer to subset columns.
From #9012, the code I'd like to write:
data = pd.read_csv("stuff")
categorical = data.dtypes == "object"
preprocess = ColumnTransformer(
    ('categorical', make_pipeline(Imputer(strategy='mode'), CategoricalEncoder()), categorical),
    ('continuous', make_pipeline(Imputer(strategy='mean'), StandardScaler()), ~categorical))
X = preprocess.fit_transform(data)
Closing, since this isn't going to merge and has been superseded by #9151.
Thanks for some great work, Stephen. Sorry there's been so much flip-flopping. It's hard.
Yes, sorry, @stephen-hoover. Feedback on the new (or old...) API is always welcome!
Reference Issue
Continues the PR from #7327.
What does this implement/fix? Explain your changes.
This PR modifies the `OneHotEncoder` to use the `LabelEncoder` for encoding each column. This change allows the `OneHotEncoder` to expand arrays with type `np.object`, which may include string values as well as numerical values.
Any other comments?
This PR is continuing from #7327. Apart from some refactoring, functional changes from that PR are the removal of forced copies (and the `copy` parameter on the `OneHotEncoder`), restricting range checking to the numerical columns, erroring if numerical columns have values < 0, and modifying `active_features_`, `feature_indices_`, and `n_values_` to reflect the way the code now operates.
CC @vighneshbirodkar. Thanks for putting this together!
TODO