Added Pipeline friendly LabelBinarizer #7375

Closed
wants to merge 3 commits

Conversation


@hesenp hesenp commented Sep 9, 2016

This change would enable LabelBinarizer to work with categorical features in pipelines. This is a long sought-after feature; see issues #3956, #4920, and #4883.

@amueller (Member) commented Sep 9, 2016

This is not the right fix. LabelBinarizer is supposed to work on labels, not the data.
The right approach is to allow strings in OneHotEncoder, which @vighneshbirodkar is working on (see #7327). This turns out to be a bit tricky, though.
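For context, a minimal sketch of the interface point being made here (assuming a standard scikit-learn install): LabelBinarizer is fit on a single 1-d array of labels, which is why it does not slot into a Pipeline's X-transforming steps.

```python
from sklearn.preprocessing import LabelBinarizer

# LabelBinarizer is fit on a single 1-d array of labels (y), not on a
# 2-d feature matrix X -- its signature is fit(y), while Pipeline steps
# are called as fit(X, y) / transform(X).
lb = LabelBinarizer()
Y = lb.fit_transform(["cat", "dog", "cat", "bird"])
print(lb.classes_)  # ['bird' 'cat' 'dog']
print(Y.shape)      # (4, 3): one indicator column per class
```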

@hesenp (Author) commented Sep 9, 2016

That depends on how you interpret it. "Label" to me means a categorical variable. You don't want to reinvent every method just for the response variable, right?

@amueller (Member) commented Sep 9, 2016

But the interface is quite different. And the way you implemented it right now, it only works for a single column, right?
And I think a label is quite distinct from a feature.

@hesenp (Author) commented Sep 9, 2016

Which part of the interface is different? Please elaborate.

Whatever the functionality or implementation, I'm fine as long as the fix you're rooting for goes into a future release soon. The three issues I mentioned at the beginning of this request date all the way back to 2014 (#3956) and resurfaced twice in 2015 (#4920 and #4883).

I've marked my calendar and will check back on this thread within 30 days. I hope your colleagues have fixed the problem by then.

@amueller (Member) commented Sep 9, 2016

This is an issue that is important to me. Feel free to check back in 30 days. Whether anything has happened by then will depend on whether @vighneshbirodkar @jnothman and I had the time to work on it.
Feel free to contribute reviews and suggestions to the thread I linked to.
I am also unsatisfied with the state of this, but that doesn't solve the issue.
We need to fix the OneHotEncoder which currently has quite odd behavior.

So really we don't want to duplicate functionality. LabelEncoder is for labels. OneHotEncoder is for features. You should use OneHotEncoder for features. It works on multiple features at once, because data has multiple features. LabelEncoder works on a single array of labels.
How would you use the class you implemented to apply it to data?
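To illustrate the contrast, a small sketch of the feature-side encoder (assuming a recent scikit-learn; direct string support, the feature under discussion in this thread, landed later, in version 0.20):

```python
from sklearn.preprocessing import OneHotEncoder

# OneHotEncoder fits on a 2-d X and encodes every column at once.
# (At the time of this thread it only accepted integer-coded categories;
# string input like this became possible in scikit-learn 0.20.)
X = [["red", "small"],
     ["blue", "large"],
     ["red", "large"]]
enc = OneHotEncoder()
Xt = enc.fit_transform(X).toarray()
print(enc.categories_)  # one category array per input column
print(Xt.shape)         # (3, 4): 2 colors + 2 sizes
```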

@hesenp (Author) commented Sep 9, 2016

Okay, fair enough: LabelBinarizer stays as it is.

From what I can see in the PR you mentioned, we are trying to accomplish too many things in one go, and the result might just be a headache.

How about leaving OneHotEncoder as it is and adding another utility function/class to wrap its functionality? The wrapper class would then be able to handle categorical features that possibly come in multiple columns. I wish I had the bandwidth to take a stab at this ...
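A hypothetical version of that wrapper (the class name and design are mine, not scikit-learn's): fit one binarizer per column of X and stack the results, giving a feature-style fit/transform interface.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import LabelBinarizer

class MultiColumnBinarizer(BaseEstimator, TransformerMixin):
    """Hypothetical sketch: one LabelBinarizer per column of X,
    exposing the fit(X, y)/transform(X) interface a Pipeline expects."""

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=object)
        # One binarizer per feature column.
        self.binarizers_ = [LabelBinarizer().fit(col) for col in X.T]
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=object)
        # Encode each column independently and concatenate horizontally.
        return np.hstack([lb.transform(col)
                          for lb, col in zip(self.binarizers_, X.T)])
```

One behavioral difference to be aware of: LabelBinarizer collapses a two-class column into a single 0/1 column, whereas OneHotEncoder emits one column per category.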

@amueller (Member) commented Sep 9, 2016

@hesenp we actually thought about adding a new class and deprecating OneHotEncoder, but @jnothman didn't like that ;) And I agree it's a bit odd to replace the class that has been around for so long.

If you can come up with an easier way to make OneHotEncoder more stable and do the things we want, without doing too many things at once, I'm happy to take your suggestions.

@hesenp (Author) commented Oct 17, 2016

Hey @amueller, I just came back from a long trip, and there seems to have been no movement on this or on the related PR we talked about. I gave these problems some thought and summarized some pain points below, with possible solutions farther down.

Pain points:

  • At its current stage, scikit-learn mostly expects features as homogeneous numeric arrays. Adding features of different types from, say, a pandas DataFrame involves extracting features by column, doing the preprocessing, and merging them back together in a pipeline afterwards. This gets cumbersome when we have many features from different sources.
  • Scikit-learn doesn't offer many choices for transforming the response variable (y) yet, and those transformations don't fit into the fit/transform/predict pattern.

To address these pain points, I sketched the following components in my toy projects:

  • An ItemSelector (adapted from a scikit-learn tutorial) to select features (please let me know if something like this is already in the code base).
  • A transform_y hook on the pipeline, giving us fit/transform/transform_y/predict. This would let us reuse existing transformer modules on response variables.

Please let me know what you think.
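For reference, the ItemSelector meant here is essentially the one from the scikit-learn "feature union with heterogeneous data" tutorial example; it is not part of the library API (ColumnTransformer later covered this ground).

```python
from sklearn.base import BaseEstimator, TransformerMixin

class ItemSelector(BaseEstimator, TransformerMixin):
    """Select one column/key from a dict-like or DataFrame X, so that
    downstream transformers in a FeatureUnion see only that field.
    Adapted from a scikit-learn example, not a library class."""

    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        # Stateless: nothing to learn.
        return self

    def transform(self, X):
        return X[self.key]
```

For example, `ItemSelector("body")` applied to `{"body": texts, "meta": other}` passes just `texts` downstream.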

@amueller (Member) commented

@hesenp I think the ItemSelector is probably a good solution. The transform_y is not as clear-cut to me. I haven't found many applications for that apart from resampling the data. It doesn't really work for processing continuous y, for example, because you would need to transform them back for scoring.
What applications for transform_y do you have in mind?

imbalanced-learn hacked the pipeline to allow subsampling:
https://github.com/scikit-learn-contrib/imbalanced-learn

For that application, you want to resample during training but not during prediction. I have talked to people who have use cases where you want to resample both during training and prediction. Your transform_y would not allow resampling only during training, right?
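The asymmetry can be made concrete with a toy under-sampler (a hypothetical minimal sketch, not imbalanced-learn's actual class): it needs y and changes the number of rows, so it must run through a dedicated fit-time hook (imbalanced-learn's pipeline uses a fit_resample-style method for this) rather than through transform, which also runs at predict time and never sees y.

```python
import numpy as np

class ToyUnderSampler:
    """Hypothetical minimal under-sampler: keep the same number of rows
    from every class. It must receive y and it changes len(X), which is
    why it cannot be an ordinary transform step in a Pipeline."""

    def fit_resample(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        # Size of the smallest class determines how many rows to keep.
        n = np.unique(y, return_counts=True)[1].min()
        # Keep the first n rows of each class.
        idx = np.concatenate(
            [np.flatnonzero(y == c)[:n] for c in np.unique(y)])
        return X[idx], y[idx]
```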

@jnothman (Member) commented

Regarding ItemSelector, see #3886, #2034

@jorisvandenbossche (Member) commented

@amueller @jnothman I think this can be closed given the work on the CategoricalEncoder in #9151 (or I can add this issue number in the list of issues to close by that PR)

@amueller (Member) commented Aug 9, 2017

@jorisvandenbossche yeah we should add it to the issues closed, I think.

@jorisvandenbossche (Member) commented

Added there.
