
[MRG] Non-categorical variables in OneHotEncoder. #2027


Merged
merged 6 commits on Jul 5, 2013

Conversation

@mblondel
Member

mblondel commented Jun 4, 2013

I added a categorical_features option to OneHotEncoder. It lets the user specify which features / variables are categorical and which are not. Non-categorical features are not touched and are stacked as-is at the end of the matrix. I was hesitating between an array of indices and a boolean mask. I went for the latter as it seems easier for the user to handle, but I would be glad to change it if other people think otherwise.

I'm thinking of adding a similar feature to DictVectorizer but I would like us to agree on whether I should use an array of indices or a boolean mask first.

CC @amueller @larsmans
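A minimal usage sketch of the proposed option, assuming the boolean-mask form of categorical_features described above (the data and mask here are purely illustrative):

import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Columns 0 and 1 hold categorical codes; column 2 is an ordinary numeric feature.
X = np.array([[0, 1, 23],
              [1, 0, 45],
              [0, 2, 31]])

# Boolean mask: True marks the columns to one-hot encode.
mask = np.array([True, True, False])

enc = OneHotEncoder(categorical_features=mask)
X_new = enc.fit_transform(X)
# The encoded categorical columns come first; the untouched third column is
# stacked as-is at the end of the output matrix.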

def fit_transform(self, X, y=None):
    """Fit OneHotEncoder to X, then transform X.

def _apply_transform(self, X, transform):
    if isinstance(self.categorical_features, six.string_types) and \
Member

I don't think there's a need for the isinstance check; it's implicit in the comparison to "all".

Member Author

Ok, thanks. I was worried that since ndarray overrides ==, I might run into some issues.

Member

Well,

>>> array("all") == "all"
array(True, dtype=bool)

but since

>>> bool(array("all") == "all")
True
>>> bool(array(["any"]) == "all")
False

I don't think that'll pose any problems. Comparisons between str and non-stringlike ndarray seem to always return just False.

@larsmans
Member

larsmans commented Jun 4, 2013

I think a list of indices is actually easier from a user perspective, since a boolean mask is easily translated to indices with np.where but the other way around takes more work.
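For illustration, converting between the two representations (the mask-to-indices direction is the trivial one):

import numpy as np

mask = np.array([True, False, True, False])

# Mask -> indices: a one-liner with np.where.
indices = np.where(mask)[0]            # array([0, 2])

# Indices -> mask: needs the number of features and an extra step.
n_features = 4
mask_back = np.zeros(n_features, dtype=bool)
mask_back[indices] = True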


if n_cat == 0:
    # No categorical variables.
    return X
Member

In this case, the input validation is potentially wasteful, especially the densification.

Member Author

I could use atleast2d_or_csc. That way I can do the categorical / non-categorical splitting without densification. Also, if there are only a few categorical variables, the densification done by _fit_transform and _transform remains affordable.

Member Author

Hmm, X = check_arrays(X, sparse_format='dense', dtype=np.int)[0] doesn't do an implicit dense conversion; it raises an exception. So my idea doesn't work unless I do an explicit toarray() before calling _fit_transform / _transform.

This also means that the validation is not wasteful.

@mblondel
Member Author

mblondel commented Jun 4, 2013

Thinking about it, DictVectorizer will probably take a list of feature names, so a list of feature indices would be closer to that. I would like the API to be as close as possible in OneHotEncoder and DictVectorizer.

If we go for a list of feature indices / names, we need to choose whether it's more convenient to list the categorical variables or the non-categorical ones. Which is best depends highly on the application...

For DictVectorizer, I could also use a dict feature-name => True / False.

@larsmans
Member

larsmans commented Jun 4, 2013

DictVectorizer does one-hot coding when fed (str, str) pairs, which is not applicable here (right?).

I think it's more convenient to list the categorical ones, because otherwise you'd have a parameter called noncategorical_features, which is more typing :)
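For context, a small example of the existing DictVectorizer behavior referred to here: string-valued features get one-hot encoded, while numeric values pass through unchanged (the example data is made up):

from sklearn.feature_extraction import DictVectorizer

data = [{"city": "London", "temperature": 20},
        {"city": "Paris", "temperature": 25}]

vec = DictVectorizer(sparse=False)
print(vec.fit_transform(data))
# [[ 1.  0. 20.]
#  [ 0.  1. 25.]]
print(vec.get_feature_names())
# ['city=London', 'city=Paris', 'temperature']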

@ogrisel
Member

ogrisel commented Jun 4, 2013

Travis has caught a doctest to update: https://travis-ci.org/scikit-learn/scikit-learn/builds/7766345#L2752

@mblondel
Member Author

mblondel commented Jun 4, 2013

DictVectorizer does one-hot coding when fed (str, str) pairs, which is not applicable here (right?).

I thought it did one-hot coding even for integer features, but I probably misread the code. Great, then I indeed do not need to change anything.

@jnothman
Member

jnothman commented Jun 4, 2013

It looks like DictVectorizer considers something categorical iff its value is a string, and I don't understand how categorical_features would help.

noncategorical_features, which is more typing

but ignore_mask or ignored_features or held_out_features takes less typing than either, and might apply more broadly to transformers...? I'm not sure about it.

However, I do think this can be pulled out as a generic util: transform_selected(transform, X, selected) := np.hstack(transform(X[:, selected]), X[:, ~selected]) with sparse support, etc...
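A dense-only sketch of that generic utility, following the pseudo-definition above (the name and signature are jnothman's suggestion, not an existing function; sparse support and index-array handling are omitted):

import numpy as np

def transform_selected(transform, X, selected):
    # Apply transform to the selected columns and stack the remaining
    # columns, untouched, to the right of the result.
    selected = np.asarray(selected, dtype=bool)
    return np.hstack((transform(X[:, selected]), X[:, ~selected]))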

@mblondel
Member Author

mblondel commented Jun 4, 2013

I thought of ignore_features, but it is ambiguous: one might think the features are removed, whereas I just leave them as-is.

@larsmans
Member

larsmans commented Jun 4, 2013

categorical_features is the only obvious choice that I can think of.

@mblondel
Member Author

mblondel commented Jun 4, 2013

Any other opinion on indices vs. mask? Another thing I like about masks is that you can use np.ones or np.zeros to generate the initial mask and then fill in the few categorical or non-categorical variables for your application of interest. This way we are not favoring applications with few categorical variables.

@glouppe
Contributor

glouppe commented Jun 4, 2013

Any other opinion on indices vs. mask?

I don't want to complicate things, but wouldn't it be possible to support both? I was thinking of converting masks (implicitly) to indices (or vice versa, depending on what you want to feed to the implementation). If the array is as long as the number of features, then it is a boolean mask and can be converted to indices; otherwise, it is assumed to be an array of indices. Couldn't this work?

@mblondel
Member Author

mblondel commented Jun 4, 2013

Indeed, it shouldn't be too hard to support both. Thanks!

@larsmans
Member

larsmans commented Jun 4, 2013

That leads to ambiguity. What if n_features=2 and categorical_features=array([0, 1])?

I say we leave it up to the user to call np.where and maybe put that in an example.

@glouppe
Contributor

glouppe commented Jun 4, 2013

@larsmans Indeed, you are right. By the way, what does NumPy do in such a case? (since you can index an array both with a mask and with a list of indices)

@glouppe
Contributor

glouppe commented Jun 4, 2013

NumPy seems to rely on the dtype of the array in that case.

In [2]: a = np.arange(2)

In [5]: a[np.array([0,1], dtype=np.bool)]
Out[5]: array([1])

In [6]: a[np.array([0,1])]
Out[6]: array([0, 1])

Anyway, up to you. I don't want to complicate things for nothing :)

@mblondel
Member Author

mblondel commented Jun 4, 2013

Indeed, if we want to support both, the user has to explicitly set dtype=bool in the case of masks. That sounds fair enough to me.
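Resolving the ambiguity by dtype rather than by length could look roughly like this (an illustrative sketch with a made-up helper name, not the merged implementation):

import numpy as np

def _as_indices(categorical_features, n_features):
    # Accept either a boolean mask or an array of column indices.
    sel = np.asarray(categorical_features)
    if sel.dtype == bool:
        if sel.shape[0] != n_features:
            raise ValueError("Boolean mask must have length n_features.")
        return np.where(sel)[0]
    return sel.astype(int)

# With n_features=2, array([0, 1]) is no longer ambiguous:
_as_indices(np.array([0, 1]), n_features=2)          # -> array([0, 1])
_as_indices(np.array([True, False]), n_features=2)   # -> array([0])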

@ogrisel
Member

ogrisel commented Jun 4, 2013

categorical_features is the only obvious choice that I can think of.

Instead of ignored_features or noncategorical_features, one could use passthrough_features, which is both explicit and positively formulated. But I am also ok with the current categorical_features.

@jnothman
Member

jnothman commented Jun 4, 2013

passthrough_features would remain clear if, for instance, the same option were available on a feature selector or PCA.

@mblondel
Member Author

mblondel commented Jun 4, 2013

I added combined support for masks and index arrays. I kept categorical_features, which I find to the point and directly related to what OneHotEncoder does. If there are no further comments, I will update the user guide tomorrow.

@mblondel
Member Author

mblondel commented Jun 4, 2013

Indirectly related to this PR: does anyone know how to handle variables which only make sense if another variable takes some particular value? For example, a variable which only makes sense if the patient is female (e.g., the number of years since menopause).

@larsmans
Member

larsmans commented Jun 4, 2013

I guess I'd handle that as a quadratic feature: female × years_past_menopause. Then delete the original years_past_menopause feature. Or, using DictVectorizer,

female = patient.sex == FEMALE
features['female'] = female
if female:
    features['years_past_menopause'] = patient.menopause  # default of 0 when missing
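A small sketch of the interaction-feature alternative mentioned first: multiply the indicator by the value so the term vanishes for non-female patients (the column names and data are made up):

import numpy as np

# Hypothetical raw columns: sex indicator (1 = female) and years past menopause.
female = np.array([1, 0, 1])
years_past_menopause = np.array([12, 0, 3])

# Interaction (quadratic) feature: non-zero only where the patient is female.
interaction = female * years_past_menopause

# Keep the indicator and the interaction; drop the raw years_past_menopause column.
X = np.column_stack([female, interaction])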

@jnothman
Member

jnothman commented Jun 4, 2013

I'd still suggest this belongs as a utility function, not something special to OneHotEncoder.

@mblondel
Member Author

mblondel commented Jun 4, 2013

enc = OneHotEncoder(categorical_features=cat)
X_train = enc.fit_transform(X_train)
X_test = enc.transform(X_test)

vs.

enc = OneHotEncoder()
X_train = transform_selected(X_train, enc.fit_transform, categorical_features=cat)
X_test = transform_selected(X_test, enc.transform, categorical_features=cat)

vs.

enc = OneHotEncoder()

X_cat, X_non_cat = split_selected(X_train, cat)
X_cat = enc.fit_transform(X_cat)
X_train = sp.hstack((X_cat, X_non_cat))

X_cat, X_non_cat = split_selected(X_test, cat)
X_cat = enc.transform(X_cat)
X_test = sp.hstack((X_cat, X_non_cat))

Any other opinions? Abstractions are nice but it's also important to have easy-to-use APIs...

@mblondel
Member Author

mblondel commented Jun 4, 2013

If scikit-learn were based on abstractions like this, we wouldn't need OVR support directly in SGDClassifier and would rely on OneVsRestClassifier instead.

@jnothman
Member

jnothman commented Jun 5, 2013

I should clarify, because you're getting me completely wrong here and making a farce: I don't think you should require the user to use the abstraction. Of course the user can build such an abstraction out of appropriate transformers, Pipelines and FeatureUnions, but that's absolutely mad.

Still, you can create a helper function and use it only in OneHotEncoder for now. Doing so:

  • gives a name to what you're doing rather than nesting the OneHotEncoder code with some obscure _apply_transform to handle an edge-case parameter ("explicit is better than implicit");
  • allows you to test its functionality as a unit;
  • enables other transformer implementations to adopt the same functionality without having to reinvent the wheel (to handle the boolean vs integer index, the sparse vs dense cases, etc).

I am sorry for not making this clearer in the first place; but surely your interpretation of my suggestion is so patently ridiculous (not to mention even messier if providing a class API with parameters, etc.) that it must be a misinterpretation.

@mblondel
Member Author

mblondel commented Jun 5, 2013

However, I do think this can be pulled out as a generic util: transform_selected(transform, X, selected) := np.hstack(transform(X[:, selected]), X[:, ~selected]) with sparse support, etc...

I was not trying to make a farce of you. From the above comment, I thought you were suggesting to let the user call the utility, not to use it in OneHotEncoder directly. If you get angry so easily you should try to be clearer in the first place.

I will try to implement your suggestion tonight. If I create such an abstraction, while I am at it, I can also create some kind of FeatureSplit transformer.

@jnothman
Member

jnothman commented Jun 5, 2013

I didn't get angry until I felt my suggestion was being ridiculed.

Unless you can come up with some convincingly simple API, I don't think FeatureSplit is worthwhile (nor is its name especially intuitive). If the user wants such abstracted functionality, it should come by way of some SelectByIndex(mask_or_indices, negate=False) that acts like other feature selectors, and can indeed be combined with FeatureUnion and Pipeline to create a monster that does exactly what it's told to (and can be controlled by set_params in a fairly logical manner).

@jnothman
Member

jnothman commented Jun 5, 2013

Alternatively, you could just add a parameter to FeatureUnion that allows you to mask the input for each transformer (but then one requires an implementation of the IdentityTransformer to get the same expressiveness). I think that's actually what you mean by FeatureSplit.

@jnothman
Member

jnothman commented Jun 5, 2013

(And that suggestion of adding a parameter to FeatureUnion is very similar to one I made elsewhere, except that there I did not assume X was a 2d array, but possibly a dict or an object.)

@GaelVaroquaux
Member

I didn't get angry until I felt my suggestion was being ridiculed.

In case of possible misunderstanding, my approach is to always assume that the person I am discussing with is being nice and that I have simply not understood them. I find that this is true in at least 50% of cases, so I am doing at least as well as chance, but I get the double benefit that I feel better, and so does the other person.

I think that this suggestion could have been applied both ways here :)

@ogrisel
Member

ogrisel commented Jun 5, 2013

@jnothman I agree the FeatureUnion could be made more generic to better deal with heterogeneous data (e.g. a sequence of dicts with free-text fields, dates, numeric values and string-encoded category names). Ideally we would want a simple way to pass some sort of "feature router". Maybe we could discuss strategies to implement this during the sprint in Paris. If so, we first need to come up with a set of common use cases (I have a couple in mind).

In the meantime, +1 for splitting the OneHotEncoder categorical feature un-masking code out into a utility function (private for now, so as not to have to handle backward compat in case we change our minds on how best to deal with this situation).

@jnothman
Member

jnothman commented Jun 5, 2013

I agree the FeatureUnion could be made more generic to better deal with heterogeneous data

Might as well make an issue so this is remembered: #2034. (And it's looking unlikely that I'll be at the sprint.)

@mblondel
Member Author

mblondel commented Jun 5, 2013

In the meantime, +1 for splitting the OneHotEncoder categorical feature un-masking code out into a utility function (private for now, so as not to have to handle backward compat in case we change our minds on how best to deal with this situation).

Done.
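For reference, a private helper along these lines might look roughly as follows; this is a sketch consistent with the diff excerpts quoted in the review comments below, not the exact merged code:

import numpy as np
from scipy import sparse

def _transform_selected(X, transform, selected="all", copy=True):
    # Apply transform only to the selected features of X.
    # selected can be "all", an array of column indices, or a boolean mask.
    if isinstance(selected, str) and selected == "all":
        return transform(X)

    X = np.array(X, copy=copy)

    if len(selected) == 0:
        # No categorical features: return X untouched (a copy if requested).
        return X

    n_features = X.shape[1]
    sel = np.zeros(n_features, dtype=bool)
    sel[np.asarray(selected)] = True   # works for both indices and masks

    if np.all(sel):
        return transform(X)

    X_sel = transform(X[:, sel])       # encode the selected columns
    X_not_sel = X[:, ~sel]             # pass the rest through unchanged
    if sparse.issparse(X_sel):
        return sparse.hstack((X_sel, X_not_sel)).tocsr()
    return np.hstack((X_sel, X_not_sel))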

    -------
    X : array or sparse matrix, shape=(n_samples, n_features_new)
    """
    if len(selected) == 0:
Member

This exploits the fact that we're using a string and not, say, None to indicate the "all" case. So I find it a bit weird here. It's also an uncommon use-case, and so should be tested after the "all" case.

Member Author

Using selected=None as the default value to mean "select all" felt strange to me, hence the default value of "all". None can be used as a synonym for [] though.

Member

My concern here was using len("all") != 0. Better to put the "all" condition first...

@mblondel
Member Author

mblondel commented Jul 4, 2013

Can I haz merge?

@larsmans
Member

larsmans commented Jul 4, 2013

Shall we rename categorical_features to just encode? It's shorter, it's imperative, and maybe the user actually wants to encode some categorical variables, but not others.

Otherwise, +1.

@ogrisel
Member

ogrisel commented Jul 4, 2013

Shall we rename categorical_features to just encode? It's shorter, it's imperative, and maybe the user actually wants to encode some categorical variables, but not others.

+1

"""
if selected == "all":
return transform(X)
elif len(selected) == 0:
Member

Should we copy here? I am worried that there will be strange behavior, with copies happening sometimes but not always.

Member

I don't see a reason to copy in a no-op.

Member

I don't see a reason to copy in a no-op.

For consistency: you don't want a copy to happen or not depending on some small change in parameters. This can lead to bugs that are really hard to debug.

Member

I guess you mean:

X1hot = onehot.fit_transform(X)
X *= 2  # absurd example

then X1hot changes? I guess you're right, +1.

Member

I guess you mean:

X1hot = onehot.fit_transform(X)
X *= 2 # absurd example

then X1hot changes?

That's exactly what I have in mind. I've been working with libraries that have inconsistent copy behavior, and I usually end up forcing the copy myself, just to avoid surprises.

Member

Fixed in d25f12b.

@GaelVaroquaux
Member

Very cool. I made a few minor comments (feel free not to address them, even though I feel that some should be :).

Besides that, 👍 for merge.

@mblondel
Member Author

mblondel commented Jul 5, 2013

@larsmans I personally prefer the more explicit categorical_features but I am glad to change if more people prefer encode.

@GaelVaroquaux
Member

@larsmans I personally prefer the more explicit categorical_features but I am
glad to change if more people prefer encode.

I think that I prefer categorical_features, as it is more explicit.

@larsmans
Member

larsmans commented Jul 5, 2013

But the statistical term might be confusing to new users. Also, encode is equally explicit :)

@larsmans
Member

larsmans commented Jul 5, 2013

We have a stalemate. Anyone for either categorical_features or encode? @glouppe? @jnothman? When that's resolved, I'm up for merging and fixing the little things in master.

@glouppe
Contributor

glouppe commented Jul 5, 2013

Anyone for either categorical_features or encode?

I prefer categorical_features.

@larsmans
Member

larsmans commented Jul 5, 2013

Ok, that's 3-2 for you guys. Whoever still objects gets to submit an issue (I won't ;)

larsmans added a commit that referenced this pull request Jul 5, 2013
ENH Non-categorical variables in OneHotEncoder
larsmans merged commit 4fe51ba into scikit-learn:master on Jul 5, 2013
@mblondel
Member Author

mblondel commented Jul 5, 2013

Thanks Lars!
