Added Pipeline friendly LabelBinarizer #7375
This is not the right fix. LabelBinarizer is supposed to work on labels, not the data.
That depends on how you interpret it. To me, "label" means categorical variables. You don't want to reinvent every method just for the response variable, right?
But the interface is quite different. And the way you implemented it right now, it only works for a single column, right?
Which part of the interface is different? Please elaborate. Whatever the functionality or implementation, I'm fine as long as the fix that you rooted for goes into a future release soon. The three issues that I mentioned at the beginning of this request date all the way back to 2014 (issue 3956) and resurfaced twice in 2015 (issue 4920 and issue 4883). I've marked my calendar and will check back on this thread within 30 days. I hope your colleagues have fixed the problem by then.
This is an issue that is important to me. Feel free to check back in 30 days. Whether anything has happened by then will depend on whether @vighneshbirodkar, @jnothman and I have had the time to work on it. We really don't want to duplicate functionality. LabelEncoder is for labels. OneHotEncoder is for features. You should use OneHotEncoder for features. It works on multiple features at once, because data has multiple features. LabelEncoder works on a single array of labels.
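To make the interface difference concrete, here is a minimal sketch. It assumes a recent scikit-learn release (0.20 or later) in which OneHotEncoder accepts string categories directly; the toy arrays are invented for illustration and do not come from this thread.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer, OneHotEncoder

# Labels: a single 1-D array with one entry per sample.
y = np.array(["cat", "dog", "bird", "dog"])
label_matrix = LabelBinarizer().fit_transform(y)

# Features: a 2-D array with one column per categorical feature.
X = np.array([
    ["red", "small"],
    ["blue", "large"],
    ["red", "large"],
    ["green", "small"],
])
encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default, so densify it for display.
feature_matrix = encoder.fit_transform(X).toarray()

print(label_matrix.shape)    # one indicator column per class in y
print(feature_matrix.shape)  # one indicator column per category, per feature
print(encoder.categories_)   # categories learned separately for each column
```

LabelBinarizer's `fit_transform` takes only the label array, while OneHotEncoder's takes the feature matrix `X` and learns a separate set of categories for each column, which is why the latter fits the transformer role in a pipeline.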
Okay, Sir, you get LabelBinarizer as it is. From what I see in the PR you mentioned, we are trying to accomplish too many things in one go, and the result might just be brain pain. How about leaving OneHotEncoder as it is and adding another utility function or class to wrap its functionality, so that the wrapper class can handle categorical features that may come in multiple columns? I wish I had the bandwidth to take a stab at this ...
@hesenp we actually thought about adding a new class and deprecating If you can come up with an easier way to make
Hey @amueller, I just came back from a long trip, and there seems to be no change on this or on the related PR that we talked about. I gave these problems some thought and summarized some pain points below, with some possible solutions farther down. Please let me know what you think. Pain points:
To resolve these pain points, I doodled in my toy projects the following component:
Please let me know what you think.
@hesenp I think the ItemSelector is probably a good solution. imbalanced-learn hacked the pipeline to allow subsampling: For that application, you want to do that during training but not during prediction. I have talked to people who have use cases where you want to resample during both training and prediction. Your
@jorisvandenbossche yeah, we should add it to the issues closed, I think.
Added there.
This change would enable LabelBinarizer to work with categorical features in pipelines. This is a long sought-after feature, per issue 3956, issue 4920, and issue 4883.