
Fix LabelBinarizer and LabelEncoder fit and transform signatures to work with Pipeline #3113


Closed
wants to merge 2 commits

Conversation

hxu

@hxu hxu commented Apr 27, 2014

fit, transform, and fit_transform previously only accepted a single argument, which breaks when used within a Pipeline, since Pipeline always calls fit_transform with a second y argument, even if it is None.

Fixes #3112
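
For reference, the incompatibility looks roughly like this (an illustrative snippet, not taken from the linked issue; the toy data and step names are made up):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelBinarizer
from sklearn.tree import DecisionTreeClassifier

# Pipeline calls each intermediate step as fit_transform(X, y), but
# LabelBinarizer.fit_transform accepts only a single argument, so this raises a TypeError.
pipe = Pipeline([('binarize', LabelBinarizer()),
                 ('clf', DecisionTreeClassifier())])
pipe.fit(['spam', 'ham', 'spam'], [1, 0, 1])  # TypeError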

@jnothman
Member

Hmm... Now that I see this, I think I was a bit muddled when looking at your question on #3112. These processors deal with targets, not data. Pipelines are about transforming and fitting data.

Yes, one could conceive of using a pipeline to transform labels, but really pipelines are not that sophisticated; they are necessitated by the model selection framework, etc.

So we have to keep the first parameter as y. Adding another, arbitrarily-named parameter is a possible, but strange, solution. Recommending that Pipelines not be used for label processors is another, but if so, why should label processors look like Transformers at all?

@hxu
Author

hxu commented Apr 27, 2014

I was just puzzling over why OneHotEncoder and LabelBinarizer seem to do very similar things, yet are two separate classes. Would it be accurate to say that OneHotEncoder is intended for use with data, while LabelBinarizer and LabelEncoder are intended for use with targets?

Still, the inverse_transform feature of LabelBinarizer is useful in some cases, so I actually prefer to use it over OneHotEncoder.

At a conceptual level, why is there a distinction between transforming data and transforming targets? It seems like you could do a lot of the same transformations on targets that you would do on data in the process of creating and fitting a learning model.
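
(As a small illustration of the inverse_transform round trip mentioned above; the toy labels are made up:)

from sklearn.preprocessing import LabelBinarizer

lb = LabelBinarizer()
Y = lb.fit_transform(['cat', 'dog', 'cat', 'bird'])
# Y is an indicator matrix with one column per class (classes_ = ['bird', 'cat', 'dog']):
# [[0 1 0]
#  [0 0 1]
#  [0 1 0]
#  [1 0 0]]
lb.inverse_transform(Y)  # -> ['cat' 'dog' 'cat' 'bird']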

@jnothman
Member

From a programmatic perspective, the duplication isn't great; from the perspective of usability, it might be helpful. If you think OneHotEncoder.inverse_transform is useful, then maybe that's what we need a PR for...


@larsmans
Member

why should label processors look like Transformers at all?

They shouldn't, but for historical reasons they do. @GaelVaroquaux once suggested that they should be a different type with a transform_y method, but currently we're just stuck with this old API decision.

@GaelVaroquaux
Member

why should label processors look like Transformers at all?

They shouldn't, but for historical reasons they do.

It's a design mistake. I completely hate myself for doing it. To me it also shows why design should be done slowly, and based as much as possible on use cases.

If we can find a way forward from here to have objects modifying y, I think that it would be great. If we didn't have a user base to support, I would immediately change transform to return y.

@larsmans
Member

I don't think this PR is going to solve the issue. It allows the creation of a Pipeline for labels, but not a pipeline that can modify both labels and data (say y → log(y), feature selection, regression model), which would be the killer feature.
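
For concreteness, that workflow currently has to be split: the X-side steps fit in a Pipeline, while the y transform is applied by hand (an illustrative sketch on synthetic data):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline

X, y = make_regression(n_samples=200, n_features=20, noise=1.0, random_state=0)
y = y - y.min() + 1.0                        # shift so log is defined

pipe = Pipeline([('select', SelectKBest(f_regression, k=10)), ('reg', Ridge())])
pipe.fit(X, np.log(y))                       # y -> log(y) applied manually
y_pred = np.exp(pipe.predict(X))             # ...and inverted manually on the predictions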

@GaelVaroquaux
Member

I don't think this PR is going to solve the issue.

Absolutely not. Actually, I think that I am -1 on having some transforms that work on y and some that work on X. I think that it will just make things confusing and error prone.

@larsmans
Member

Closing, then, but let's keep #3112 open as a reminder. Maybe we should just decouple the Label* from the transformer API.

@larsmans larsmans closed this Apr 27, 2014
@hxu
Author

hxu commented Apr 27, 2014

I think I understand what you mean in terms of the ideal Pipeline structure, but I would like to confirm. Currently, if I were to do pipeline.fit(X, y), it would execute a series of transforms on X and then fit an estimator; it assumes that y is already in a state that the estimator can use. Is it true that the ideal would be for pipeline.fit(X, y) to execute a set of transforms on X, AND execute a separate set of transforms on y, then fit the estimator with the final result of both?

I can see the logic of having a different type with transform_y, but could you have a transform that can be applied to both X and y? Would it have both transform and transform_y defined on the class?

@larsmans
Member

I'm not sure. If it has both, then how do you control what it does inside a pipeline? E.g. you can bin y to use a classifier for regression, but that doesn't mean you also want to bin X, so you need to tell it what to transform, X or y.
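
As a quick illustration of the binning case (synthetic data, purely to show the idea; X itself is left untouched):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X = rng.rand(100, 4)
y = X @ np.array([1.0, 2.0, 0.5, 0.0]) + 0.1 * rng.rand(100)   # continuous target

# Bin y into three ranges and fit a classifier on the bin labels
edges = np.percentile(y, [33.3, 66.7])
y_binned = np.digitize(y, edges)             # 0 = low, 1 = mid, 2 = high
clf = RandomForestClassifier(random_state=0).fit(X, y_binned)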

@hxu
Author

hxu commented Apr 27, 2014

I would imagine you tell it during the setup of the Pipeline. For example:

Pipeline(
    x_transforms=[
        # transforms for X here
    ],
    y_transforms=[
        # transforms for y here
    ],
)

Then the question is: if there is an estimator at the end of the Pipeline, where do you put it? Or would it be another argument?

For me at least, I've found that I frequently need to transform my ys before I can use them in an estimator, so I end up creating a separate "pipeline" that I have to execute before passing the result into the main pipeline. A feature like this would be pretty useful.

@larsmans
Member

That would make pipelines much heavier and harder to use.

What is your typical transformation to y?

@hxu
Author

hxu commented Apr 27, 2014

Yes, it is true that it would make pipelines more complex. Actually, manually setting up a separate pipeline for y is not that bad, but the fact that I cannot use LabelBinarizer means I have to write a bit more code.

The case that I have at the moment is binarizing a categorical y. But it occurs to me that not all estimators can even take the result of the binarization, since some estimators only work on ys with a single column.

The binning example that you mention could be a better one: if I have a real-valued y but don't want to predict the exact value, I just want to predict which range it falls in, or whether it is higher or lower than some threshold.

@larsmans
Member

Most estimators are limited to single-column y, but all classifiers can deal with categorical y--that's pretty much the definition of classification. I don't really get what you're trying to achieve. Is it multi-label stuff?

@hxu
Author

hxu commented Apr 27, 2014

Sorry, I meant estimator as including classifiers. Yes, it is a multi-label classification task.

@larsmans
Member

Then you should try the meta-estimators in sklearn.multiclass. If they don't work, then writing a new meta-estimator is the way to go, I think, because:

Your use case can be summarized in pseudocode as

y' = f(y)
fit(X, y')
ŷ' = predict(Xtest)
ŷ = f⁻¹(ŷ')

where f is binarize (but it could just as well be log for some regression tasks).

Conceptually, that's not a pipeline. A pipeline applies independent transformations one by one. In this case, f needs to be fit for f⁻¹ (inverse_transform) to work on the predictions. So Pipeline is not the right place for this functionality; structurally, it has to wrap around the entire fit/predict process, which is exactly what meta-estimators do.
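
A minimal sketch of such a meta-estimator might look like the following (the class name and structure are illustrative only, not an existing scikit-learn API):

from sklearn.base import BaseEstimator, clone

class TransformedTargetClassifier(BaseEstimator):
    """Fit `estimator` on f(y) and apply f⁻¹ to its predictions."""
    def __init__(self, estimator, transformer):
        self.estimator = estimator
        self.transformer = transformer

    def fit(self, X, y):
        self.transformer_ = clone(self.transformer)
        y_trans = self.transformer_.fit_transform(y)            # y' = f(y)
        self.estimator_ = clone(self.estimator).fit(X, y_trans)
        return self

    def predict(self, X):
        y_pred = self.estimator_.predict(X)                     # ŷ'
        return self.transformer_.inverse_transform(y_pred)      # ŷ = f⁻¹(ŷ')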

(Sorry for the long story, but this is also for my own reference and for the project.)

@hxu
Author

hxu commented Apr 27, 2014

Ok thanks, I'll have a look at the meta-estimators.

@GaelVaroquaux
Member

I can see the logic of having a different type with transform_y, but could you have a transform that can be applied to both X and y? Would it have both transform and transform_y defined on the class?

I don't think that a discussion on what the right API and code should look like, without examples and use cases, is going to let us see the emerging patterns.

Anyhow, I think that we need to sort out the structure of 'y' first (I personally now think that it should optionally be a dictionary of arrays or a pandas dataframe), and this plugs into the discussions led in #2904, #2374, and other API-related issues:

https://github.com/scikit-learn/scikit-learn/issues?labels=API&page=1&state=open

Anyhow, as far as I am concerned, the current priority would be releasing rather than changing APIs.

@mblondel
Member

For regression, it's pretty common to normalize targets.

y_tr = y_scaler.fit_transform(y_tr)
reg.fit(X_tr, y_tr)
y_pred = reg.predict(X_te)
y_pred = y_scaler.inverse_transform(y_pred)

I think y transformers are useful but we need a way to differentiate them from x transformers.
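
A self-contained version of that pattern might look like this (StandardScaler and Ridge are just placeholder choices, and the data is synthetic):

from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=5, noise=1.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Scale the targets, fit on the scaled targets, then undo the scaling on the predictions
y_scaler = StandardScaler()
y_tr_scaled = y_scaler.fit_transform(y_tr.reshape(-1, 1)).ravel()
reg = Ridge().fit(X_tr, y_tr_scaled)
y_pred = y_scaler.inverse_transform(reg.predict(X_te).reshape(-1, 1)).ravel()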

@jnothman
Member

I suspect you intended your comment on #4143, @mblondel

@amueller
Member

A possible way would be to have transform_y

@KthProg

KthProg commented May 12, 2017

I need this to apply multi-label binarization to a list of labels in each column. Using FeatureUnion would be the most common-sense way to add these binarized labels to the data, but I can't do that because of the incompatible signature. I don't fully follow the discussion above about targets versus data (labels and features?), but I need functionality like this all the time for my features. Data often comes as a list of categories that you would like to one-hot encode, and sometimes those values are all in the same column as well.

For now, I replace MultiLabelBinarizer in my pipeline with this:

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import MultiLabelBinarizer

class MultiLabelTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        # Fit the binarizer during fit so train and test produce the same columns
        self.mlb_ = MultiLabelBinarizer()
        self.mlb_.fit(X)
        return self

    def transform(self, X):
        return self.mlb_.transform(X)
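
which can then sit inside a Pipeline or FeatureUnion like any other transformer (a rough usage sketch; the step name and data are made up):

from sklearn.pipeline import Pipeline

pipe = Pipeline([('codes', MultiLabelTransformer())])
X_bin = pipe.fit_transform([[26, 32], [67], []])   # one indicator column per code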

@jnothman
Member

jnothman commented May 13, 2017 via email

@KthProg

KthProg commented May 14, 2017

My use case is this:

My data is food inspections, the outcome being pass/fail. One column has the codes that were violated; they are all stored in that single column in varying formats. I parse out these numbers in one pipeline of my feature union, which results in an array of varying size for each row in that column. I then convert this into a binarized matrix: 1 for each violation code that was failed, and 0 for the violation codes that were passed.

Essentially, any time a single column contains a list of values, this situation would come up. I would guess that's relatively common.

No, I don't believe that would solve the same problem. This is about processing the data rather than fitting it for predictions.

@jnothman
Member

jnothman commented May 14, 2017 via email

@KthProg

KthProg commented May 14, 2017

You would hope, but for some reason it does not work as part of a pipeline. Specifically, it says the number of arguments does not match, so I had to write the above code to use in my pipeline.

@jnothman
Member

jnothman commented May 14, 2017 via email

@jnothman
Member

jnothman commented May 14, 2017 via email

@KthProg

KthProg commented May 14, 2017

Feature space. The pass/fail in this case is not the same as the violations. The violations are a feature. Am I misunderstanding what a pipeline is used for?

@jnothman
Member

jnothman commented May 14, 2017 via email

@jnothman
Member

jnothman commented May 14, 2017 via email

@KthProg

KthProg commented May 14, 2017

OneHotEncoder doesn't work on an array of arrays though right? After parsing via a pipeline, the result (before the feature union) would look like this:

[
  [26,32,56],
  [67,32],
  [],
  ...
]

Each row is a training instance, and its values are the codes present in that instance. I think a one-hot encoder would expect each instance to have only one value, which ends up making the multi-label binarizer look like the correct choice.
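
For what it's worth, MultiLabelBinarizer does handle exactly that shape (a quick illustrative check):

from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
mlb.fit_transform([[26, 32, 56], [67, 32], []])
# array([[1, 1, 1, 0],
#        [0, 1, 0, 1],
#        [0, 0, 0, 0]])
mlb.classes_   # array([26, 32, 56, 67])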

@jnothman
Member

jnothman commented May 14, 2017 via email

@KthProg

KthProg commented May 14, 2017

Pull request to include a multi-label transformer ¯\_( ͡° ͜ʖ ͡°)_/¯? I don't think I've ever contributed to a useful project before lol...

@jnothman
Member

jnothman commented May 14, 2017 via email

@KthProg

KthProg commented May 14, 2017

I think they're two different problems, but I could be misunderstanding. If a dict vectorizer handled multi-valued data, would it be handling maps where the values are arrays? Or simply being extended to handle arrays as a type of map (index keys)?

@jnothman
Member

jnothman commented May 15, 2017 via email

@KthProg

KthProg commented May 15, 2017

I can't help but wonder if there's some way to generalize this problem to handle many data structures. However, it would be trivial for me to update my pipeline to have a map with a single key, so this kind of change would solve my problem and integrate with the current framework.

Is there an issue I can associate with a pull request if I try to make this change?

Development

Successfully merging this pull request may close these issues.

LabelBinarizer and LabelEncoder fit and transform signatures not compatible with Pipeline
7 participants