Fix LabelBinarizer and LabelEncoder fit and transform signatures to work with Pipeline #3113
Conversation
Hmm... Now that I see this, I think I was a bit muddled when looking at your question on #3112. These processors deal with targets, not data. Pipelines are about transforming and fitting data. Yes, one could conceive of using a pipeline to transform labels, but really pipelines are not that sophisticated; they are necessitated by the model selection framework, etc. So we have to keep the first parameter as y. |
I was just puzzling over this. At a conceptual level, why is there a distinction between transforming data and transforming targets? It seems like you could do a lot of the same transformations on targets that you would do on data in the process of creating and fitting a learning model. |
From a programmatic perspective, the duplication isn't great.
|
They shouldn't, but for historical reasons they do. @GaelVaroquaux once suggested that they should be a different type with a separate method for transforming targets. |
It's a design mistake. I completely hate myself for doing it. If we can find a way forward from here to have objects modifying y, I would welcome it. |
I don't think this PR is going to solve the issue. It allows the creation of a Pipeline for labels, but not a pipeline that can modify both labels and data together. |
Absolutely not. Actually, I think that I am -1 on having some transforms modify y. |
Closing, then, but let's keep #3112 open as a reminder. Maybe we should just decouple the label transformers from the standard transformer interface. |
I think I understand what you guys mean in terms of the ideal Pipeline structure, but I would like to confirm. I can see the logic of having a different type with its own methods for transforming labels. |
I'm not sure. If it has both, then how do you control what it does inside a pipeline? E.g. you can bin a real-valued y, but you could just as well bin a feature. |
I would imagine you tell it during the setup of the Pipeline. |
Then the question is, if there is an estimator at the end of the Pipeline, where do you put it? Or would it be another argument? For me at least, I've found that I frequently need to transform my y as well. |
That would make pipelines much heavier and harder to use. What is your typical transformation to y? |
Yes, it is true that it would make pipelines more complex. Actually, manually setting up a separate pipeline for y is manageable. The case that I have at the moment is binarizing a categorical y. The binning example that you mention could be a better example -- if I have a real-valued y, I might want to bin it and treat the problem as classification. |
Most estimators are limited to single-column y. Is yours a multi-label task? |
Sorry, I meant estimator as including classifiers. Yes, it is a multi-label classification task. |
Then you should try the meta-estimators in sklearn.multiclass. Your use case can be summarized in pseudocode as shown below.
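A sketch of that pseudocode (the exact original is not preserved; MultiLabelBinarizer and a one-vs-rest classifier are illustrative choices):

import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

X_train, X_test = np.random.rand(8, 2), np.random.rand(2, 2)
y_train = [['a'], ['a', 'b'], ['b'], ['a'], ['b'], ['a', 'b'], ['a'], ['b']]

# The label transform wraps the estimator on both sides,
# rather than sitting in a linear chain of transformations.
binarizer = MultiLabelBinarizer()
clf = OneVsRestClassifier(LogisticRegression())
clf.fit(X_train, binarizer.fit_transform(y_train))          # binarize y, then fit
y_pred = binarizer.inverse_transform(clf.predict(X_test))   # invert at predict time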
Conceptually, that's not a pipeline. A pipeline applies independent transformations one by one. In this case, the label transformation and the estimator are intertwined: the transform has to be inverted again at prediction time, which is why a meta-estimator is the better fit. (Sorry for the long story, but this is also for my own reference and for the project.) |
Ok thanks, I'll have a look at the meta-estimators. |
I don't think that a discussion on what the right API is can usefully happen without code. Anyhow, I think that we need to sort out the structure of 'y' first (see the open API-labelled issues: https://github.com/scikit-learn/scikit-learn/issues?labels=API&page=1&state=open). As far as I am concerned, the current priority would be releasing. |
For regression, it's pretty common to normalize targets:

y_tr = y_scaler.fit_transform(y_tr)
reg.fit(X_tr, y_tr)
y_pred = reg.predict(X_te)
y_pred = y_scaler.inverse_transform(y_pred)

I think y transformers are useful, but we need a way to differentiate them from X transformers. |
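A runnable version of that sketch (the scaler and regressor choices here are mine, not the commenter's):

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

X_tr, X_te = np.random.rand(80, 3), np.random.rand(20, 3)
y_tr = np.random.rand(80)

y_scaler = StandardScaler()
# StandardScaler expects 2-D arrays, so reshape the 1-D target
y_tr_scaled = y_scaler.fit_transform(y_tr.reshape(-1, 1)).ravel()

reg = Ridge().fit(X_tr, y_tr_scaled)
# Undo the scaling on the predictions
y_pred = y_scaler.inverse_transform(reg.predict(X_te).reshape(-1, 1)).ravel()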
A possible way would be to have a distinct interface for them. |
I need this to apply multi-label binarization to a list of labels in each column. Using FeatureUnion would be the most common-sense way to add these binarized labels to the data, but I can't do that because of the incompatible signature. I don't understand the distinction being drawn between targets and data (labels and features?), but I need functionality like this all the time for my features. Data often comes as a list of categories that you would like to one-hot encode, and sometimes those features are all in the same column as well. For now, I replace MultiLabelBinarizer in my pipeline with this:
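The snippet itself did not survive; a sketch of what such a replacement typically looks like (the class name is mine):

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import MultiLabelBinarizer

class PipelineFriendlyMultiLabelBinarizer(BaseEstimator, TransformerMixin):
    # Wraps MultiLabelBinarizer behind the (X, y) signature that
    # Pipeline and FeatureUnion expect from a transformer.
    def fit(self, X, y=None):
        self.mlb_ = MultiLabelBinarizer().fit(X)  # y is deliberately ignored
        return self

    def transform(self, X):
        return self.mlb_.transform(X)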
|
Your use case is not clear to me. Example? Would a meta-estimator from sklearn.multiclass be appropriate?
|
My use case is this: my data is food inspections, the outcome being pass/fail. One column has the codes which were violated, all stored in a single column of varying format. I parse out these numbers in one pipeline of my feature union. This results in an array of varying size for each row in that column. I then convert this into a binarized matrix: 1 for each violation code present in that inspection, and 0 for each code that is not. Essentially, any time a single column contains a list of values, this situation would come up. I would guess that's relatively common. No, I don't believe that would solve the same problem. This is about processing the data rather than fitting it for predictions. |
Isn't this what y = MultiLabelBinarizer().fit_transform(data.violation_codes) would do for you?
|
You would hope, but for some reason it does not work as part of a pipeline. Specifically, it says the number of arguments does not match, so I had to write the above code to use in my pipeline. |
No, it does not work in a pipeline. Why do you need it in a pipeline?
|
Are we talking about feature space, or classification targets?
|
Feature space. The pass/fail in this case is not the same as the violations. The violations are a feature. Am I misunderstanding what a pipeline is used for? |
No, that's right, but that's not what a MultiLabelBinarizer is used for; it's what a OneHotEncoder is used for.
|
Or a DictVectorizer. Or a CountVectorizer. Or a custom vectorizer.
|
OneHotEncoder doesn't work on an array of arrays though, right? After parsing via a pipeline, the result (before the feature union) would look like this:

[
[26,32,56],
[67,32],
[],
...
]

Where each row is a training instance, and each value is a code violated in that instance. I think a one-hot encoder would expect each instance to have only one value, which ends up making the multi-label binarizer look like the correct choice. |
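For what it's worth, MultiLabelBinarizer does handle exactly this ragged structure when called directly:

from sklearn.preprocessing import MultiLabelBinarizer

codes = [[26, 32, 56], [67, 32], []]
print(MultiLabelBinarizer().fit_transform(codes))
# Columns are the sorted codes [26, 32, 56, 67]:
# [[1 1 1 0]
#  [0 1 0 1]
#  [0 0 0 0]]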
You're right that OneHotEncoder isn't the right choice. The point is that we can't build in every possible feature representation you may happen to have. We mostly deal with data that is rectangular in the first place, feature_extraction being the primary exception. It would be nice if DictVectorizer handled this kind of case more smoothly, but at the end of the day you need to be able to write your own transformers for custom transformations.
|
Pull request to include a multi-label transformer ¯\_( ͡° ͜ʖ ͡°)_/¯? I don't think I've ever contributed to a useful project before lol... |
There are existing proposals to support multi-valued data with DictVectorizer, though I can't recall any that I've been satisfied with yet.
|
I think they're two different problems, but I could be misunderstanding. If a dict vectorizer handled multi-valued data, would it be handling maps where the values are arrays? Or simply being extended to handle arrays as a type of map (index keys)? |
Each record would be something like {'violations': ['111', '555', '676'], 'other_feature': 5.2}.
|
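A sketch of how such records could be fed to DictVectorizer today, assuming a small flattening helper (the helper name and key format are my own):

from sklearn.feature_extraction import DictVectorizer

def flatten_record(record):
    # Expand list-valued fields into one indicator key per value,
    # since DictVectorizer does not accept list values directly.
    flat = {}
    for key, value in record.items():
        if isinstance(value, list):
            for item in value:
                flat['%s=%s' % (key, item)] = 1
        else:
            flat[key] = value
    return flat

records = [
    {'violations': ['111', '555', '676'], 'other_feature': 5.2},
    {'violations': ['555'], 'other_feature': 1.0},
]
X = DictVectorizer().fit_transform([flatten_record(r) for r in records])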
I can't help but wonder if there's some way to generalize this problem to handle many data structures. However, it would be trivial for me to update my pipeline to have a map with a single key, so this kind of change would solve my problem and integrate with the current framework. Is there an issue I can associate with a pull request if I try to make this change? |
fit, transform, and fit_transform previously only accepted a single argument, which breaks when used within a Pipeline, since Pipeline always calls fit_transform with a second y argument, even if it is None.

Fixes #3112
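A minimal reproduction of the mismatch described above (illustrative data):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer

pipe = Pipeline([('binarize', MultiLabelBinarizer())])
# Pipeline passes a second y argument (even when it is None), but
# MultiLabelBinarizer.fit_transform accepts only the labels themselves,
# so this raises a TypeError about the number of arguments.
pipe.fit_transform([[26, 32], [67], []])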