Fix LabelBinarizer and LabelEncoder fit and transform signatures to work with Pipeline #3113

hxu · 2014-04-27T08:05:06Z

fit, transform, and fit_transform previously only accepted a single argument, which breaks when used within a Pipeline, since Pipeline always calls fit_transform with a second y argument, even if it is None.

Fixes #3112

jnothman · 2014-04-27T10:30:10Z

Hmm... Now that I see this, I think I was a bit muddled when looking at your question on #3112. These processors deal with targets, not data. Pipelines are about transforming and fitting data.

Yes, one could conceive of using a pipeline to transform labels, but really pipelines are not that sophisticated, but are necessitated by the model selection framework, etc.

So we have to keep the first parameter as y. Adding another, arbitrarily-named parameter is a possible, but strange solution. Recommending that Pipelines not be used for label processors is another, but if so, why should label processors look like Transformers at all?

hxu · 2014-04-27T10:43:02Z

I was just puzzling over why OneHotEncoder and LabelBinarizer seem to do very similar things, yet are two separate classes. Would it be accurate to say that OneHotEncoder is intended for use with data, while LabelBinarizer and LabelEncoder are intended for use with targets?

Still, the inverse_transform feature of the LabelBinarizer is useful in some cases, so I actually prefer to use this over OneHotEncoder.

At a conceptual level, why is it that there is a distinction between transforming data and transforming targets? It seems like that you could do a lot of the same transformations on targets that you would do for data in the process of creating and fitting a learning model.

jnothman · 2014-04-27T12:20:40Z

From a programmatic perspective, the duplication isn't great; from the
perspective of usability, it might be helpful. If you think
OneHotEncoder.inverse_transform is useful, then maybe that's what we need
a PR for...

On 27 April 2014 20:42, hxu notifications@github.com wrote:

I was just puzzling over why OneHotEncoder and LabelBinarizer seem to do
very similar things, yet are two separate classes. Would it be accurate to
say that OneHotEncoder is intended for use with data, while LabelBinarizerand
LabelEncoder are intended for use with targets?

Still, the inverse_transform feature of the LabelBinarizer is useful in
some cases, so I actually prefer to use this over OneHotEncoder.

At a conceptual level, why is it that there is a distinction between
transforming data and transforming targets? It seems like that you could do
a lot of the same transformations on targets that you would do for data in
the process of creating and fitting a learning model.

—
Reply to this email directly or view it on GitHubhttps://github.com//pull/3113#issuecomment-41493353
.

larsmans · 2014-04-27T15:20:18Z

why should label processors look like Transformers at all?

They shouldn't, but for historical reasons they do. @GaelVaroquaux once suggested that they should be a different type with a transform_y method, but currently we're just stuck with this old API decision.

GaelVaroquaux · 2014-04-27T15:24:01Z

why should label processors look like Transformers at all?
They shouldn't, but for historical reasons they do.

It's a design mistake. I completely hate myself for doing it. To me it
also shows why design should be done slowly, and based as much as
possible on usecases.

If we can find a way forward from here to have objects modifying y, I
think that it would be great. If we didn't have a userbase to support, I
would immediatly change transform to return y.

larsmans · 2014-04-27T15:29:02Z

I don't think this PR is going to solve the issue. It allows the creation of a Pipeline for labels, but not a pipeline that can modify both labels and data (say y → log(y), feature selection, regression model), which would be the killer feature.

GaelVaroquaux · 2014-04-27T15:30:55Z

I don't think this PR is going to solve the issue.

Absolutely not. Actually, I think that I am -1 on having some transforms
that work on y and some that work on X. I think that it will just make
confusing and error prone.

larsmans · 2014-04-27T15:39:38Z

Closing, then, but let's keep #3112 open as a reminder. Maybe we should just decouple the Label* from the transformer API.

hxu · 2014-04-27T15:45:27Z

I think I understand what you guys mean in terms of the ideal Pipeline structure, but I would like to confirm. Currently, if I were to do pipeline.fit(X, y), it would execute a bunch of transforms on X and then fit an estimator. It assumes that y is already in a state that the estimator can use. Is it true that the ideal would be where pipeline.fit(X, y) would execute a set of transforms on X, AND execute a separate set of transforms on y, then fit the estimator with the final result of both?

I can see the logic of having a different type with transform_y, but could you have a transform that can be applied to both X and y? Would it have both transform and transform_y defined on the class?

larsmans · 2014-04-27T15:50:30Z

I'm not sure. If it has both, then how do you control what it does inside a pipeline? E.g. you can bin y to use a classifier for regression, but that doesn't mean you also want to bin X, so you need to tell it what to transform, X or y.

hxu · 2014-04-27T15:56:26Z

I would imagine you tell it during the setup of the Pipeline. For example:

Pipeline(
    x_transforms=[
        // transforms for X here
    ],
    y_transforms = [
        // transforms for y here
    ],
)

Then the questions is, if there is an estimator is at the end of the Pipeline, where do you put it? Or would it be another argument?

For me at least, I've found that I frequently need to transform my ys before I can use it in an estimator, so I'll end up creating a separate "pipeline" that I have to execute separately before passing it into the main pipeline. A feature like this would be pretty useful.

larsmans · 2014-04-27T15:58:36Z

That would make pipelines much heavier and harder to use.

What is your typical transformation to y?

hxu · 2014-04-27T16:06:40Z

Yes, it is true that it would make pipelines more complex. Actually manually setting a separately pipeline for y is not that bad, but the fact that I cannot use the LabelBinarizer means I have to write a bit more code.

The case that I have at the moment is binarizing a categorical y. But it occurs to me that not all estimators can even take the result of the binarization, since some estimators only work on ys with a single column.

The binning example that you mention could be a better example -- if I have a real valued y, but if I don't want to predict the exact value, I just want to predict which range it is in. Or if I wanted to predict whether it is higher or lower than threshold.

larsmans · 2014-04-27T16:11:42Z

Most estimators are limited to single-column y, but all classifiers can deal with categorical y--that's pretty much the definition of classification. I don't really get what you're trying to achieve. Is it multi-label stuff?

hxu · 2014-04-27T16:12:59Z

Sorry, I meant estimator as including classifiers. Yes, it is a multi-label classification task.

larsmans · 2014-04-27T16:39:13Z

Then you should try the meta-estimators in sklearn.multiclass. If they don't work, then writing a new meta-estimator is the way to, I think, because:

Your usecase can be summarized in pseudocode as

y' = f(y)
    fit(X, y')
    ŷ' = predict(Xtest)
ŷ = f⁻¹(ŷ')

where f is binarize (but it could just as well be log for some regression tasks).

Conceptually, that's not a pipeline. A pipeline applies independent transformations one by one. In this case, f needs to be fit to make f⁻¹ (inverse_transform) work. So Pipeline is not the right place for this functionality; structurally, it wraps around the entire fit process, which is exactly what meta-estimators do.

(Sorry for the long story, but this is also for my own reference and for the project.)

hxu · 2014-04-27T16:54:24Z

Ok thanks, I'll have a look at the meta-estimators.

GaelVaroquaux · 2014-04-27T19:45:48Z

I can see the logic of having a different type with transform_y, but could you
have a transform that can be applied to both X and y? Would it have both
transform and transform_y defined on the class?

I don't think that a discussion on what the right API and code without
examples and usescase is going to enable to see the emerging patterns.

Anyhow, I think that we need to sort out the structure of 'y' before (I
personnally now think that they should optionally be dictionnary of
arrays or pandas dataframes), and this plugs in discussion lead in #2904,
#2374, and other API-related issues:

https://github.com/scikit-learn/scikit-learn/issues?labels=API&page=1&state=open

Anyhow, as far as I am concerned, the current priority would be releasing
rather than changing APIs.

mblondel · 2015-01-24T07:53:46Z

For regression, it's pretty common to normalize targets.

y_tr = y_scaler.fit_transform(y_tr)
reg.fit(X_tr, y_tr)
y_pred = reg.predict(X_te)
y_pred = y_scaler.inverse_transform(y_pred)

I think y transformers are useful but we need a way to differentiate them from x transformers.

jnothman · 2015-01-24T10:34:32Z

I suspect you intended your comment on #4143, @mblondel

amueller · 2015-01-24T17:52:35Z

A possible way would be to have transform_y

KthProg · 2017-05-12T18:59:38Z

I need this to apply mutli-binarization to a list of labels for each column, using FeatureUnion would be the most common sense way to add these binarized labels to the data, but I can't do that because of the incompatible signature. I don't understand the conversation between targets and data (labels and features?) but I need functionality like this all the time for my features. Data often comes in a list of categories that you would like to one-hot encode, and sometimes those features are all in the same column as well.

For now, I replace MultiLabelBinarizer in my pipeline with this:

class MultiLabelTransformer(BaseEstimator, TransformerMixin):
  def __init__(self):
    pass
  def fit(self, X, y=None):
    return self
  def transform(self, X):
    mlb = MultiLabelBinarizer()
    return mlb.fit_transform(X)

jnothman · 2017-05-13T23:58:28Z

your use case is not clear to me. example? would a metaestimator like from sklearn.multiclass be appropriate?

…

On 13 May 2017 4:59 am, "Kyle" ***@***.***> wrote: I need this to apply mutli-binarization to a list of labels for each column, using FeatureUnion would be the most common sense way to add these binarized labels to the data, but I can't do that because of the incompatible signature. I don't understand the conversation between targets and data (labels and features?) but I need functionality like this all the time for my features. Data often comes in a list of categories that you would like to one-hot encode, and sometimes those features are all in the same column as well. — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#3113 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAEz6wOqiXtZMDhQx72NkUOczSvtrgP7ks5r5KwegaJpZM4B10iM> .

KthProg · 2017-05-14T03:05:46Z

My use case is this:

My data is food inspections, the outcome being pass/fail. One column has the codes which were violated. They are all stored in one column of varying format. I parse out these numbers in one pipeline of my feature union. This results in an array of varying sizes for each row in that column. I then convert this into a binarized matrix: 1 for each violation code that was failed, and 0 for violation codes that were passed.

Essentially, any time a single column contains a list of values, this situation would come up. I would guess that's relatively common.

No, I don't believe that would solve the same problem. This is about processing the data rather than fitting it for predictions.

jnothman · 2017-05-14T12:32:12Z

Isn't this what y = MultiLabelBinarizer().fit_transform(data.violation_codes) would do for you?

…

On 14 May 2017 at 13:05, Kyle ***@***.***> wrote: My use case is this: My data is food inspections, the outcome being pass/fail. One column has the codes which were violated. They are all stored in one column of varying format. I parse out these numbers in one pipeline of my feature union. This results in an array of varying sizes for each row in that column. I then convert this into a binarized matrix: 1 for each violation code that was failed, and 0 for violation codes that were not passed. Essentially, any time a single column contains a list of values, this situation would come up. I would guess that's relatively common. No, I don't believe that would solve the same problem. This is about processing the data rather than fitting it for predictions. — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#3113 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAEz6wJdNP3xOpAmkiG3pGuIc1Hp2qZ3ks5r5m-LgaJpZM4B10iM> .

KthProg · 2017-05-14T13:48:51Z

You would hope, but for some reason it does not work as part of a pipeline. Specifically it says the number of arguments do not match, so I had to write the above code to use in my pipeline.

jnothman · 2017-05-14T21:33:26Z

no it does not work in a pipeline. why do you need it in a pipeline?

…

On 14 May 2017 11:48 pm, "Kyle" ***@***.***> wrote: You would hope, but for some reason it does not work as part of a pipeline. Specifically it says the number of arguments do not match, so I had to write the above code to use in my pipeline. — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#3113 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAEz6yMtLmRd8qKgZU7JZEBdF-GZGfykks5r5wZFgaJpZM4B10iM> .

jnothman · 2017-05-14T21:34:06Z

are we talking about feature space, or classification targets?

…

On 15 May 2017 7:33 am, "Joel Nothman" ***@***.***> wrote: no it does not work in a pipeline. why do you need it in a pipeline? On 14 May 2017 11:48 pm, "Kyle" ***@***.***> wrote: > You would hope, but for some reason it does not work as part of a > pipeline. Specifically it says the number of arguments do not match, so I > had to write the above code to use in my pipeline. > > — > You are receiving this because you commented. > Reply to this email directly, view it on GitHub > <#3113 (comment)>, > or mute the thread > <https://github.com/notifications/unsubscribe-auth/AAEz6yMtLmRd8qKgZU7JZEBdF-GZGfykks5r5wZFgaJpZM4B10iM> > . >

KthProg · 2017-05-14T21:35:53Z

Feature space. The pass/fail in this case is not the same as the violations. The violations are a feature. Am I misunderstanding what a pipeline is used for?

jnothman · 2017-05-14T21:46:42Z

no, that's right, but that's not what a multilabelbinarizer is used for. it's what a OneHotEncoder is used for.

…

On 15 May 2017 7:35 am, "Kyle" ***@***.***> wrote: Feature space. Am I misunderstanding what a pipeline is used for? — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#3113 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAEz6yldMJt8npyCfC9-MMKKVCqW3FCEks5r53O6gaJpZM4B10iM> .

jnothman · 2017-05-14T21:47:38Z

or a dictvectorizer. Or a countvectorizer. Or a custom vectorizer

…

On 15 May 2017 7:46 am, "Joel Nothman" ***@***.***> wrote: no, that's right, but that's not what a multilabelbinarizer is used for. it's what a OneHotEncoder is used for. On 15 May 2017 7:35 am, "Kyle" ***@***.***> wrote: > Feature space. Am I misunderstanding what a pipeline is used for? > > — > You are receiving this because you commented. > Reply to this email directly, view it on GitHub > <#3113 (comment)>, > or mute the thread > <https://github.com/notifications/unsubscribe-auth/AAEz6yldMJt8npyCfC9-MMKKVCqW3FCEks5r53O6gaJpZM4B10iM> > . >

KthProg · 2017-05-14T21:50:23Z

OneHotEncoder doesn't work on an array of arrays though right? After parsing via a pipeline, the result (before the feature union) would look like this:

[
  [26,32,56],
  [67,32],
  [],
  ...
]

Where each row is a training instance, and each value is the codes in that instance. I think a one-hot encoder would expect each instance to have only one value, which ends up making the multi-label binarizer look like the correct choice.

jnothman · 2017-05-14T21:58:54Z

you're right that ohe isn't the right choice. The point is that we can't build in every possible feature representation you may happen to have. we mostly deal with data that is rectangular in the first place, feature_extraction being the primary exception. It would be nice if dictvectorizer handled this kind of case more smoothly, but at the end of the day tour need to be able to write your own Transformers for custom transformations On 15 May 2017 7:50 am, "Kyle" <notifications@github.com> wrote: OneHotEncoder doesn't work on an array of arrays though right? After parsing via a pipeline, the result (before the feature union) would look like this: [ [26,32,56], [67,32], [], ... ] Where each row is a training instance, and each value is the codes in that instance. I think a one-hot encoder would expect each instance to have only one value, which ends up making the multi-label binarizer look like the correct choice. — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#3113 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAEz6yUI2kkGZ4lgmFHeub2SHSVfMo_uks5r53cggaJpZM4B10iM> .

KthProg · 2017-05-14T22:02:28Z

Pull request to include multi-label transformer ¯_( ͡° ͜ʖ ͡°)_/¯? I don't think I've ever contributed to a useful project before lol...

jnothman · 2017-05-14T23:12:25Z

There are existing proposals to support multi valued data with Dictvectorizer, though I can't recall any that I've been satisfied with yet On 15 May 2017 8:02 am, "Kyle" <notifications@github.com> wrote: Pull request to include multi-label transformer ¯_( ͡° ͜ʖ ͡°)_/¯? I don't think I've ever contributed to a useful project before lol... — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#3113 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAEz6-DG_Aqu2jnF8PuOoGwwTMtc8Eltks5r53n2gaJpZM4B10iM> .

KthProg · 2017-05-14T23:22:20Z

I think they're two different problems, but I could be misunderstanding. If a dict vectorizer handled multi-valued data, would it be handling maps where the values are arrays? Or simply being extended to handle arrays as a type of map (index keys)?

jnothman · 2017-05-15T00:37:14Z

Each record would be something like {'violations': ['111', '555', '676'], 'other_feature': 5.2}

…

On 15 May 2017 at 09:22, Kyle ***@***.***> wrote: I think they're two different problems, but I could be misunderstanding. If a dict vectorizer handled multi-valued data, would it be handling maps where the values are arrays? Or simply being extended to handle arrays as a type of map (index keys)? — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#3113 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAEz68Ge4ntwotY2-PgB4D5sAsTHmxWZks5r54yugaJpZM4B10iM> .

KthProg · 2017-05-15T00:48:45Z

I can't help but wonder if there's some way to generalize this problem to handle many data structures. However, it would be trivial for me to update my pipeline to have a map with a single key, so this kind of change would solve my problem and integrate with the current framework.

Is there an issue I can associate with a pull request if I try to make this change?

hxu added 2 commits April 27, 2014 00:57

failing tests

928abaf

Make LabelBinarizer and LabelEncoder work with Pipeline

61406d9

larsmans closed this Apr 27, 2014

This was referenced Aug 31, 2014

LabelBinarizer and LabelEncoder fit and transform signatures not compatible with Pipeline #3112

Closed

[MRG] Label Encoder Unseen Labels #3599

Closed

cancan101 mentioned this pull request Jan 22, 2015

Allow for Transformers on y #4143

Closed

Uh oh!

Fix LabelBinarizer and LabelEncoder fit and transform signatures to work with Pipeline #3113

Fix LabelBinarizer and LabelEncoder fit and transform signatures to work with Pipeline #3113

Uh oh!

Conversation

hxu commented Apr 27, 2014

Uh oh!

jnothman commented Apr 27, 2014

Uh oh!

hxu commented Apr 27, 2014

Uh oh!

jnothman commented Apr 27, 2014

Uh oh!

larsmans commented Apr 27, 2014

Uh oh!

GaelVaroquaux commented Apr 27, 2014

Uh oh!

larsmans commented Apr 27, 2014

Uh oh!

GaelVaroquaux commented Apr 27, 2014

Uh oh!

larsmans commented Apr 27, 2014

Uh oh!

hxu commented Apr 27, 2014

Uh oh!

larsmans commented Apr 27, 2014

Uh oh!

hxu commented Apr 27, 2014

Uh oh!

larsmans commented Apr 27, 2014

Uh oh!

hxu commented Apr 27, 2014

Uh oh!

larsmans commented Apr 27, 2014

Uh oh!

hxu commented Apr 27, 2014

Uh oh!

larsmans commented Apr 27, 2014

Uh oh!

hxu commented Apr 27, 2014

Uh oh!

GaelVaroquaux commented Apr 27, 2014

Uh oh!

mblondel commented Jan 24, 2015

Uh oh!

jnothman commented Jan 24, 2015

Uh oh!

amueller commented Jan 24, 2015

Uh oh!

KthProg commented May 12, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jnothman commented May 13, 2017 via email

Uh oh!

KthProg commented May 14, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jnothman commented May 14, 2017 via email

Uh oh!

KthProg commented May 14, 2017

Uh oh!

jnothman commented May 14, 2017 via email

Uh oh!

jnothman commented May 14, 2017 via email

Uh oh!

KthProg commented May 14, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jnothman commented May 14, 2017 via email

Uh oh!

jnothman commented May 14, 2017 via email

Uh oh!

KthProg commented May 14, 2017

Uh oh!

jnothman commented May 14, 2017 via email

Uh oh!

KthProg commented May 14, 2017

Uh oh!

jnothman commented May 14, 2017 via email

Uh oh!

KthProg commented May 12, 2017 •

edited

Loading

KthProg commented May 14, 2017 •

edited

Loading

KthProg commented May 14, 2017 •

edited

Loading