Added Pipeline friendly LabelBinarizer #7375

Closed
wants to merge 3 commits

Conversation


@hesenp hesenp commented Sep 9, 2016

This change would enable LabelBinarizer to work with categorical features in pipelines. This is a long sought-after feature; see issues #3956, #4920, and #4883.

@amueller (Member) commented Sep 9, 2016

This is not the right fix. LabelBinarizer is supposed to work on labels, not the data.
The right approach is to allow strings in OneHotEncoder, which @vighneshbirodkar is working on (see #7327). This turns out to be a bit tricky, though.
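For context, a minimal sketch of the interface point being made here (assuming a standard scikit-learn install): LabelBinarizer is fit on a single 1-d array of labels, which is why it does not slot into a Pipeline's X-transforming steps.

```python
from sklearn.preprocessing import LabelBinarizer

# LabelBinarizer is fit on a single 1-d array of labels (y), not on a
# 2-d feature matrix X -- its signature is fit(y), while Pipeline steps
# are called as fit(X, y) / transform(X).
lb = LabelBinarizer()
Y = lb.fit_transform(["cat", "dog", "cat", "bird"])
print(lb.classes_)  # ['bird' 'cat' 'dog']
print(Y.shape)      # (4, 3): one indicator column per class
```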

@hesenp (Author) commented Sep 9, 2016

That depends on how you interpret it. "Label" to me means a categorical variable. You don't want to reinvent every method just for the response variable, right?

@amueller (Member) commented Sep 9, 2016

But the interface is quite different. And the way you implemented it right now, it only works for a single column, right?
And I think a label is quite distinct from a feature.

@hesenp (Author) commented Sep 9, 2016

Which part of the interface is different? Please elaborate.

Whatever the functionality or implementation, I'm fine as long as the fix you're rooting for goes into a future release soon. The three issues I mentioned at the beginning of this request date all the way back to 2014 (#3956) and resurfaced twice in 2015 (#4920 and #4883).

I've marked my calendar and will check back on this thread within 30 days. I hope your colleagues have fixed the problem by then.

@amueller (Member) commented Sep 9, 2016

This is an issue that is important to me. Feel free to check back in 30 days. Whether anything has happened by then will depend on whether @vighneshbirodkar @jnothman and I had the time to work on it.
Feel free to contribute reviews and suggestions to the thread I linked to.
I am also unsatisfied with the state of this, but that doesn't solve the issue.
We need to fix the OneHotEncoder which currently has quite odd behavior.

So really we don't want to duplicate functionality. LabelEncoder is for labels. OneHotEncoder is for features. You should use OneHotEncoder for features. It works on multiple features at once, because data has multiple features. LabelEncoder works on a single array of labels.
How would you use the class you implemented to apply it to data?
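To illustrate the contrast, a small sketch of the feature-side encoder (assuming a recent scikit-learn; direct string support, the feature under discussion in this thread, landed later, in version 0.20):

```python
from sklearn.preprocessing import OneHotEncoder

# OneHotEncoder fits on a 2-d X and encodes every column at once.
# (At the time of this thread it only accepted integer-coded categories;
# string input like this became possible in scikit-learn 0.20.)
X = [["red", "small"],
     ["blue", "large"],
     ["red", "large"]]
enc = OneHotEncoder()
Xt = enc.fit_transform(X).toarray()
print(enc.categories_)  # one category array per input column
print(Xt.shape)         # (3, 4): 2 colors + 2 sizes
```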

@hesenp (Author) commented Sep 9, 2016

Okay, fair enough: LabelBinarizer stays as it is.

From what I can see in the PR you mentioned, we are trying to accomplish too many things in one go, and the result might just be a headache.

How about leaving OneHotEncoder as it is and adding another utility function/class to wrap its functionality? The wrapper class would then be able to handle categorical features that possibly come in multiple columns. I wish I had the bandwidth to take a stab at this ...
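A hypothetical version of that wrapper (the class name and design are mine, not scikit-learn's): fit one binarizer per column of X and stack the results, giving a feature-style fit/transform interface.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import LabelBinarizer

class MultiColumnBinarizer(BaseEstimator, TransformerMixin):
    """Hypothetical sketch: one LabelBinarizer per column of X,
    exposing the fit(X, y)/transform(X) interface a Pipeline expects."""

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=object)
        # One binarizer per feature column.
        self.binarizers_ = [LabelBinarizer().fit(col) for col in X.T]
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=object)
        # Encode each column independently and concatenate horizontally.
        return np.hstack([lb.transform(col)
                          for lb, col in zip(self.binarizers_, X.T)])
```

One behavioral difference to be aware of: LabelBinarizer collapses a two-class column into a single 0/1 column, whereas OneHotEncoder emits one column per category.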

@amueller (Member) commented Sep 9, 2016

@hesenp we actually thought about adding a new class and deprecating OneHotEncoder, but @jnothman didn't like that ;) And I agree it's a bit odd to replace the class that has been around for so long.

If you can come up with an easier way to make OneHotEncoder more stable and do the things we want, without doing too many things at once, I'm happy to take your suggestions.

@hesenp (Author) commented Oct 17, 2016

Hey @amueller, I just came back from a long trip, and there seems to have been no movement on this or on the related PR we talked about. I gave these problems some thought and summarized some pain points below, with possible solutions farther down.

Pain points:

  • At its current stage, scikit-learn mostly expects features as homogeneous numeric arrays. Adding features of different types from, say, a pandas DataFrame involves extracting features by column, doing the preprocessing, and merging them back together in a pipeline afterwards. This gets cumbersome when we have many features from different sources.
  • Scikit-learn doesn't offer many choices for transforming the response variable (y) yet, and those transformations don't fit into the fit/transform/predict pattern.

To address these pain points, I sketched the following components in my toy projects:

  • An ItemSelector (adapted from a scikit-learn tutorial) to select features (please let me know if something like this is already in the code base).
  • A transform_y hook on the pipeline, giving us fit/transform/transform_y/predict. This would let us reuse existing transformer modules on response variables.

Please let me know what you think.
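For reference, the ItemSelector meant here is essentially the one from the scikit-learn "feature union with heterogeneous data" tutorial example; it is not part of the library API (ColumnTransformer later covered this ground).

```python
from sklearn.base import BaseEstimator, TransformerMixin

class ItemSelector(BaseEstimator, TransformerMixin):
    """Select one column/key from a dict-like or DataFrame X, so that
    downstream transformers in a FeatureUnion see only that field.
    Adapted from a scikit-learn example, not a library class."""

    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        # Stateless: nothing to learn.
        return self

    def transform(self, X):
        return X[self.key]
```

For example, `ItemSelector("body")` applied to `{"body": texts, "meta": other}` passes just `texts` downstream.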

@amueller (Member) commented

@hesenp I think the ItemSelector is probably a good solution. The transform_y is not as clear-cut to me. I haven't found many applications for that apart from resampling the data. It doesn't really work for processing continuous y, for example, because you would need to transform them back for scoring.
What applications for transform_y do you have in mind?

imbalanced-learn hacked the pipeline to allow subsampling:
https://github.com/scikit-learn-contrib/imbalanced-learn

For that application, you want to resample during training but not during prediction. I have talked to people who have use cases where you want to resample both during training and prediction. Your transform_y would not allow resampling only during training, right?
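The asymmetry can be made concrete with a toy under-sampler (a hypothetical minimal sketch, not imbalanced-learn's actual class): it needs y and changes the number of rows, so it must run through a dedicated fit-time hook (imbalanced-learn's pipeline uses a fit_resample-style method for this) rather than through transform, which also runs at predict time and never sees y.

```python
import numpy as np

class ToyUnderSampler:
    """Hypothetical minimal under-sampler: keep the same number of rows
    from every class. It must receive y and it changes len(X), which is
    why it cannot be an ordinary transform step in a Pipeline."""

    def fit_resample(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        # Size of the smallest class determines how many rows to keep.
        n = np.unique(y, return_counts=True)[1].min()
        # Keep the first n rows of each class.
        idx = np.concatenate(
            [np.flatnonzero(y == c)[:n] for c in np.unique(y)])
        return X[idx], y[idx]
```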

@jnothman (Member) commented

Regarding ItemSelector, see #3886, #2034

@jorisvandenbossche (Member) commented

@amueller @jnothman I think this can be closed given the work on the CategoricalEncoder in #9151 (or I can add this issue number in the list of issues to close by that PR)

@amueller (Member) commented Aug 9, 2017

@jorisvandenbossche yeah we should add it to the issues closed, I think.

@jorisvandenbossche (Member) commented

Added there.
