
[MRG] Non-categorical variables in OneHotEncoder. #2027


Merged
merged 6 commits on Jul 5, 2013

Conversation

@mblondel
Member

mblondel commented Jun 4, 2013

I added a categorical_features option to OneHotEncoder. It lets the user specify which features / variables are categorical and which are not. Non-categorical features are not touched and are stacked as-is at the end of the matrix. I was hesitating between an array of indices and a boolean mask. I went for the latter as it seems easier for the user to handle, but I would be glad to change it if other people think otherwise.

I'm thinking of adding a similar feature to DictVectorizer but I would like us to agree on whether I should use an array of indices or a boolean mask first.

CC @amueller @larsmans
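A minimal usage sketch of the proposed option, assuming the boolean-mask form of categorical_features described above (the data and mask here are purely illustrative):

import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Columns 0 and 1 hold categorical codes; column 2 is an ordinary numeric feature.
X = np.array([[0, 1, 23],
              [1, 0, 45],
              [0, 2, 31]])

# Boolean mask: True marks the columns to one-hot encode.
mask = np.array([True, True, False])

enc = OneHotEncoder(categorical_features=mask)
X_new = enc.fit_transform(X)
# The encoded categorical columns come first; the untouched third column is
# stacked as-is at the end of the output matrix.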

def fit_transform(self, X, y=None):
    """Fit OneHotEncoder to X, then transform X.

def _apply_transform(self, X, transform):
    if isinstance(self.categorical_features, six.string_types) and \
Member

I don't think there's a need for the isinstance check; it's implicit in the comparison to "all".

Member Author

Ok, thanks. I was worried that since ndarray overrides ==, I might run into some issues.

Member

Well,

>>> array("all") == "all"
array(True, dtype=bool)

but since

>>> bool(array("all") == "all")
True
>>> bool(array(["any"]) == "all")
False

I don't think that'll pose any problems. Comparisons between str and non-stringlike ndarray seem to always return just False.

@larsmans
Member

larsmans commented Jun 4, 2013

I think a list of indices is actually easier from a user perspective, since a boolean mask is easily translated to indices with np.where but the other way around takes more work.
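For illustration, converting between the two representations (the mask-to-indices direction is the trivial one):

import numpy as np

mask = np.array([True, False, True, False])

# Mask -> indices: a one-liner with np.where.
indices = np.where(mask)[0]            # array([0, 2])

# Indices -> mask: needs the number of features and an extra step.
n_features = 4
mask_back = np.zeros(n_features, dtype=bool)
mask_back[indices] = True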


if n_cat == 0:
    # No categorical variables.
    return X
Member

In this case, the input validation is potentially wasteful, especially the densification.

Member Author

I could use atleast2d_or_csc. That way I can do the categorical / non-categorical splitting without densification. Also, if there are only a few categorical variables, the densification done by _fit_transform and _transform remains affordable.

Member Author

Hmm, X = check_arrays(X, sparse_format='dense', dtype=np.int)[0] doesn't do an implicit dense conversion; it raises an exception. So my idea doesn't work unless I do an explicit toarray() before calling _fit_transform / _transform.

This also means that the validation is not wasteful.

@mblondel
Member Author

mblondel commented Jun 4, 2013

Thinking about it, DictVectorizer will probably take a list of feature names, so a list of feature indices would be closer to that. I would like the API to be as close as possible in OneHotEncoder and DictVectorizer.

If we go for a list of feature indices / names, we need to choose whether it's more convenient to list the categorical variables or the non-categorical ones. Which is best depends highly on the application...

For DictVectorizer, I could also use a dict feature-name => True / False.

@larsmans
Member

larsmans commented Jun 4, 2013

DictVectorizer does one-hot coding when fed (str, str) pairs, which is not applicable here (right?).

I think it's more convenient to list the categorical ones, because otherwise you'd have a parameter called noncategorical_features, which is more typing :)
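For context, a small example of the existing DictVectorizer behavior referred to here: string-valued features get one-hot encoded, while numeric values pass through unchanged (the example data is made up):

from sklearn.feature_extraction import DictVectorizer

data = [{"city": "London", "temperature": 20},
        {"city": "Paris", "temperature": 25}]

vec = DictVectorizer(sparse=False)
print(vec.fit_transform(data))
# [[ 1.  0. 20.]
#  [ 0.  1. 25.]]
print(vec.get_feature_names())
# ['city=London', 'city=Paris', 'temperature']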

@ogrisel
Member

ogrisel commented Jun 4, 2013

Travis has caught a doctest to update: https://travis-ci.org/scikit-learn/scikit-learn/builds/7766345#L2752

@mblondel
Member Author

mblondel commented Jun 4, 2013

DictVectorizer does one-hot coding when fed (str, str) pairs, which is not applicable here (right?).

I thought it did one-hot coding even for integer features, but I probably misread the code. Great, then I indeed do not need to change anything.

@jnothman
Member

jnothman commented Jun 4, 2013

It looks like DictVectorizer considers something categorical iff its value is a string, and I don't understand how categorical_features would help.

noncategorical_features, which is more typing

but ignore_mask or ignored_features or held_out_features takes less typing than either, and might apply more broadly to transformers...? I'm not sure about it.

However, I do think this can be pulled out as a generic util: transform_selected(transform, X, selected) := np.hstack(transform(X[:, selected]), X[:, ~selected]) with sparse support, etc...
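A dense-only sketch of that generic utility, following the pseudo-definition above (the name and signature are jnothman's suggestion, not an existing function; sparse support and index-array handling are omitted):

import numpy as np

def transform_selected(transform, X, selected):
    # Apply transform to the selected columns and stack the remaining
    # columns, untouched, to the right of the result.
    selected = np.asarray(selected, dtype=bool)
    return np.hstack((transform(X[:, selected]), X[:, ~selected]))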

@mblondel
Member Author

mblondel commented Jun 4, 2013

I thought of ignore_features, but it is ambiguous: one might think the features are removed, whereas I just leave them as-is.

@larsmans
Member

larsmans commented Jun 4, 2013

categorical_features is the only obvious choice that I can think of.

@mblondel
Member Author

mblondel commented Jun 4, 2013

Any other opinion on indices vs. mask? Another thing I like about masks is that you can use np.ones or np.zeros to generate the initial mask and then fill in the few categorical or non-categorical variables for your application of interest. This way we are not favoring applications with few categorical variables.

@glouppe
Contributor

glouppe commented Jun 4, 2013

Any other opinion on indices vs. mask?

I don't want to complicate things, but wouldn't it be possible to support both? I was thinking of converting masks (implicitly) to indices (or vice versa, depending on what you want to feed to the implementation). If the array is as long as the number of features, then it is a boolean mask and can be converted to indices; otherwise, it is assumed to be an array of indices. Couldn't this work?

@mblondel
Member Author

mblondel commented Jun 4, 2013

Indeed, it shouldn't be too hard to support both. Thanks!

@larsmans
Member

larsmans commented Jun 4, 2013

That leads to ambiguity. What if n_features=2 and categorical_features=array([0, 1])?

I say we leave it up to the user to call np.where and maybe put that in an example.

@glouppe
Contributor

glouppe commented Jun 4, 2013

@larsmans Indeed, you are right. By the way, what does NumPy do in such a case? (since you can index an array both with a mask and with a list of indices)

@glouppe
Contributor

glouppe commented Jun 4, 2013

NumPy seems to rely on the dtype of the array in that case.

In [2]: a = np.arange(2)

In [5]: a[np.array([0,1], dtype=np.bool)]
Out[5]: array([1])

In [6]: a[np.array([0,1])]
Out[6]: array([0, 1])

Anyway, up to you. I don't want to complicate things for nothing :)

@mblondel
Member Author

mblondel commented Jun 4, 2013

Indeed, if we want to support both, the user has to explicitly set dtype=bool in the case of masks. That sounds fair enough to me.
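Resolving the ambiguity by dtype rather than by length could look roughly like this (an illustrative sketch with a made-up helper name, not the merged implementation):

import numpy as np

def _as_indices(categorical_features, n_features):
    # Accept either a boolean mask or an array of column indices.
    sel = np.asarray(categorical_features)
    if sel.dtype == bool:
        if sel.shape[0] != n_features:
            raise ValueError("Boolean mask must have length n_features.")
        return np.where(sel)[0]
    return sel.astype(int)

# With n_features=2, array([0, 1]) is no longer ambiguous:
_as_indices(np.array([0, 1]), n_features=2)          # -> array([0, 1])
_as_indices(np.array([True, False]), n_features=2)   # -> array([0])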

@ogrisel
Member

ogrisel commented Jun 4, 2013

categorical_features is the only obvious choice that I can think of.

Instead of ignored_features or noncategorical_features, one could use passthrough_features, which is both explicit and positively formulated. But I am also ok with the current categorical_features.

@jnothman
Member

jnothman commented Jun 4, 2013

passthrough_features would remain clear if, for instance, the same option were available on a feature selector or PCA.

@mblondel
Member Author

mblondel commented Jun 4, 2013

I added combined support for masks and index arrays. I kept categorical_features, which I find to the point and directly related to what OneHotEncoder does. If there are no further comments, I will update the user guide tomorrow.

@mblondel
Member Author

mblondel commented Jun 4, 2013

Indirectly related to this PR: does anyone know how to handle variables which only make sense if another variable takes some particular value? For example, a variable which only makes sense if the patient is female (e.g., the number of years since menopause).

@larsmans
Member

larsmans commented Jun 4, 2013

I guess I'd handle that as a quadratic feature: female × years_past_menopause. Then delete the original years_past_menopause feature. Or, using DictVectorizer,

female = patient.sex == FEMALE
features['female'] = female
if female:
    features['years_past_menopause'] = patient.menopause  # default of 0 when missing
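A small sketch of the interaction-feature alternative mentioned first: multiply the indicator by the value so the term vanishes for non-female patients (the column names and data are made up):

import numpy as np

# Hypothetical raw columns: sex indicator (1 = female) and years past menopause.
female = np.array([1, 0, 1])
years_past_menopause = np.array([12, 0, 3])

# Interaction (quadratic) feature: non-zero only where the patient is female.
interaction = female * years_past_menopause

# Keep the indicator and the interaction; drop the raw years_past_menopause column.
X = np.column_stack([female, interaction])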

@jnothman
Member

jnothman commented Jun 4, 2013

I'd still suggest this belongs as a utility function, not something special to OneHotEncoder.

@mblondel
Member Author

mblondel commented Jun 4, 2013

enc = OneHotEncoder(categorical_features=cat)
X_train = enc.fit_transform(X_train)
X_test = enc.transform(X_test)

vs.

enc = OneHotEncoder()
X_train = transform_selected(X_train, enc.fit_transform, categorical_features=cat)
X_test = transform_selected(X_test, enc.transform, categorical_features=cat)

vs.

enc = OneHotEncoder()

X_cat, X_non_cat = split_selected(X_train, cat)
X_cat = enc.fit_transform(X_cat)
X_train = sp.hstack((X_cat, X_non_cat))

X_cat, X_non_cat = split_selected(X_test, cat)
X_cat = enc.transform(X_cat)
X_test = sp.hstack((X_cat, X_non_cat))

Any other opinions? Abstractions are nice but it's also important to have easy-to-use APIs...

@mblondel
Member Author

mblondel commented Jun 4, 2013

If scikit-learn were based on abstractions like this, we wouldn't need OVR support directly in SGDClassifier and would rely on OneVsRestClassifier instead.

@jnothman
Member

jnothman commented Jun 5, 2013

I should clarify, because you're getting me completely wrong here and making a farce: I don't think you should require the user to use the abstraction. Of course the user can build such an abstraction out of appropriate transformers, Pipelines and FeatureUnions, but that's absolutely mad.

Still, you can create a helper function and use it only in OneHotEncoder for now. Doing so:

  • gives a name to what you're doing rather than nesting the OneHotEncoder code with some obscure _apply_transform to handle an edge-case parameter ("explicit is better than implicit");
  • allows you to test its functionality as a unit;
  • enables other transformer implementations to adopt the same functionality without having to reinvent the wheel (to handle the boolean vs integer index, the sparse vs dense cases, etc).

I am sorry for not making this clearer in the first place; but surely your interpretation of my suggestion is so patently ridiculous (not to mention even messier if providing a class API with parameters, etc.) that it must be a misinterpretation.

@mblondel
Member Author

mblondel commented Jun 5, 2013

However, I do think this can be pulled out as a generic util: transform_selected(transform, X, selected) := np.hstack(transform(X[:, selected]), X[:, ~selected]) with sparse support, etc...

I was not trying to make a farce of you. From the above comment, I thought you were suggesting to let the user call the utility, not to use it in OneHotEncoder directly. If you get angry so easily you should try to be clearer in the first place.

I will try to implement your suggestion tonight. If I create such an abstraction, while I am at it, I can also create some kind of FeatureSplit transformer.

@jnothman
Member

jnothman commented Jun 5, 2013

I didn't get angry until I felt my suggestion was being ridiculed.

Unless you can come up with some convincingly simple API, I don't think FeatureSplit is worthwhile (nor is its name especially intuitive). If the user wants such abstracted functionality, it should come by way of some SelectByIndex(mask_or_indices, negate=False) that acts like other feature selectors, and can indeed be combined with FeatureUnion and Pipeline to create a monster that does exactly what it's told to (and can be controlled by set_params in a fairly logical manner).

@jnothman
Member

jnothman commented Jun 5, 2013

Alternatively, you could just add a parameter to FeatureUnion that allows you to mask the input for each transformer (but then one requires an implementation of the IdentityTransformer to get the same expressiveness). I think that's actually what you mean by FeatureSplit.

@jnothman
Member

jnothman commented Jun 5, 2013

(And that suggestion of adding a parameter to FeatureUnion is very similar to one I made elsewhere, except that there I did not assume X was a 2d array, but possibly a dict or an object.)

@GaelVaroquaux
Member

I didn't get angry until I felt my suggestion was being ridiculed.

In case of possible misunderstanding, my approach is to always assume that the person I am discussing with is being nice and that I have simply not understood them. I find that this is true in at least 50% of cases, so I am doing at least as well as chance, but I get the double benefit that I feel better, and so does the other person.

I think that this suggestion could have been applied both ways here :)

@ogrisel
Member

ogrisel commented Jun 5, 2013

@jnothman I agree the FeatureUnion could be made more generic to better deal with heterogeneous data (e.g. a sequence of dicts with free-text fields, dates, numeric values and string-encoded category names). Ideally we would want a simple way to pass some sort of "feature router". Maybe we could discuss strategies to implement this during the sprint in Paris. If so, we first need to come up with a set of common use cases (I have a couple in mind).

In the meantime, +1 for splitting the OneHotEncoder categorical feature un-masking code out into a utility function (private for now, so as not to have to handle backward compat in case we change our minds on how best to deal with this situation).

@jnothman
Member

jnothman commented Jun 5, 2013

I agree the FeatureUnion could be made more generic to better deal with heterogeneous data

Might as well make an issue so this is remembered: #2034. (And it's looking unlikely that I'll be at the sprint.)

@mblondel
Member Author

mblondel commented Jun 5, 2013

In the meantime, +1 for splitting the OneHotEncoder categorical feature un-masking code out into a utility function (private for now, so as not to have to handle backward compat in case we change our minds on how best to deal with this situation).

Done.
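For reference, a private helper along these lines might look roughly as follows; this is a sketch consistent with the diff excerpts quoted in the review comments below, not the exact merged code:

import numpy as np
from scipy import sparse

def _transform_selected(X, transform, selected="all", copy=True):
    # Apply transform only to the selected features of X.
    # selected can be "all", an array of column indices, or a boolean mask.
    if isinstance(selected, str) and selected == "all":
        return transform(X)

    X = np.array(X, copy=copy)

    if len(selected) == 0:
        # No categorical features: return X untouched (a copy if requested).
        return X

    n_features = X.shape[1]
    sel = np.zeros(n_features, dtype=bool)
    sel[np.asarray(selected)] = True   # works for both indices and masks

    if np.all(sel):
        return transform(X)

    X_sel = transform(X[:, sel])       # encode the selected columns
    X_not_sel = X[:, ~sel]             # pass the rest through unchanged
    if sparse.issparse(X_sel):
        return sparse.hstack((X_sel, X_not_sel)).tocsr()
    return np.hstack((X_sel, X_not_sel))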

    -------
    X : array or sparse matrix, shape=(n_samples, n_features_new)
    """
    if len(selected) == 0:
Member

This exploits the fact that we're using a string and not, say, None to indicate the "all" case. So I find it a bit weird here. It's also an uncommon use-case, and so should be tested after the "all" case.

Member Author

Using selected=None as the default value to mean "select all" felt strange to me, hence the default value of "all". None can be used as a synonym for [] though.

Member

My concern here was using len("all") != 0. Better to put the "all" condition first...

@mblondel
Member Author

mblondel commented Jul 4, 2013

Can I haz merge?

@larsmans
Member

larsmans commented Jul 4, 2013

Shall we rename categorical_features to just encode? It's shorter, it's imperative, and maybe the user actually wants to encode some categorical variables, but not others.

Otherwise, +1.

@ogrisel
Member

ogrisel commented Jul 4, 2013

Shall we rename categorical_features to just encode? It's shorter, it's imperative, and maybe the user actually wants to encode some categorical variables, but not others.

+1

"""
if selected == "all":
return transform(X)
elif len(selected) == 0:
Member

Should we copy here? I am worried that there will be strange behavior, with copies happening sometimes but not always.

Member

I don't see a reason to copy in a no-op.

Member

I don't see a reason to copy in a no-op.

For consistency: you don't want a copy to happen or not depending on some small change in parameters. This can lead to bugs that are really hard to debug.

Member

I guess you mean:

X1hot = onehot.fit_transform(X)
X *= 2  # absurd example

then X1hot changes? I guess you're right, +1.

Member

I guess you mean:

X1hot = onehot.fit_transform(X)
X *= 2 # absurd example

then X1hot changes?

That's exactly what I have in mind. I've been working with libraries that have inconsistent copy behavior, and I usually end up forcing the copy myself, just to avoid surprises.

Member

Fixed in d25f12b.

@GaelVaroquaux
Member

Very cool. I made a few minor comments (feel free not to address them, even though I feel that some should be :).

Besides that, 👍 for merge.

@mblondel
Member Author

mblondel commented Jul 5, 2013

@larsmans I personally prefer the more explicit categorical_features but I am glad to change if more people prefer encode.

@GaelVaroquaux
Member

@larsmans I personally prefer the more explicit categorical_features but I am
glad to change if more people prefer encode.

I think that I prefer categorical_features, as it is more explicit.

@larsmans
Member

larsmans commented Jul 5, 2013

But the statistical term might be confusing to new users. Also, encode is equally explicit :)

@larsmans
Member

larsmans commented Jul 5, 2013

We have a stalemate. Anyone for either categorical_features or encode? @glouppe? @jnothman? When that's resolved, I'm up for merging and fixing the little things in master.

@glouppe
Contributor

glouppe commented Jul 5, 2013

Anyone for either categorical_features or encode?

I prefer categorical_features.

@larsmans
Member

larsmans commented Jul 5, 2013

Ok, that's 3-2 for you guys. Whoever still objects gets to submit an issue (I won't ;)

larsmans added a commit that referenced this pull request Jul 5, 2013
ENH Non-categorical variables in OneHotEncoder
larsmans merged commit 4fe51ba into scikit-learn:master on Jul 5, 2013
@mblondel
Member Author

mblondel commented Jul 5, 2013

Thanks Lars!
