
transform should not accept y #8174


Closed
jnothman opened this issue Jan 8, 2017 · 21 comments

Comments

@jnothman
Member

jnothman commented Jan 8, 2017

I suspect that whenever y=None is included in the transform method signature (after mandatory X), this is done in error. The y parameter should be deprecated and eventually removed. It only confuses the user about the isolation of y from transformation.

@jnothman
Member Author

jnothman commented Jan 8, 2017

$ git grep def.transform.*y=None
sklearn/cluster/birch.py:    def transform(self, X, y=None):
sklearn/cluster/k_means_.py:    def transform(self, X, y=None):
sklearn/decomposition/base.py:    def transform(self, X, y=None):
sklearn/decomposition/dict_learning.py:    def transform(self, X, y=None):
sklearn/decomposition/fastica_.py:    def transform(self, X, y=None, copy=True):
sklearn/decomposition/pca.py:    def transform(self, X, y=None):
sklearn/feature_extraction/dict_vectorizer.py:    def transform(self, X, y=None):
sklearn/feature_extraction/hashing.py:    def transform(self, raw_X, y=None):
sklearn/feature_extraction/text.py:    def transform(self, X, y=None):
sklearn/kernel_approximation.py:    def transform(self, X, y=None):
sklearn/kernel_approximation.py:    def transform(self, X, y=None):
sklearn/kernel_approximation.py:    def transform(self, X, y=None):
sklearn/neighbors/approximate.py:    def transform(self, X, y=None):
sklearn/preprocessing/_function_transformer.py:    def transform(self, X, y=None):
sklearn/preprocessing/data.py:    def transform(self, X, y=None, copy=None):
sklearn/preprocessing/data.py:    def transform(self, X, y=None):
sklearn/preprocessing/data.py:    def transform(self, X, y=None):
sklearn/preprocessing/data.py:    def transform(self, X, y=None):
sklearn/preprocessing/data.py:    def transform(self, X, y=None, copy=None):
sklearn/preprocessing/data.py:    def transform(self, X, y=None, copy=None):
sklearn/preprocessing/data.py:    def transform(self, K, y=None, copy=True):
sklearn/preprocessing/data.py:    def transform(self, X, y=None):
sklearn/random_projection.py:    def transform(self, X, y=None):
sklearn/tests/test_base.py:        def transform(self, X, y=None):
sklearn/tests/test_pipeline.py:    def transform(self, X, y=None):
sklearn/tests/test_pipeline.py:    def transform(self, X, y=None):

@Akshay0724
Contributor

Akshay0724 commented Jan 8, 2017

Hello @jnothman, I'd like to work on this issue.

Are you talking about a modification like this?

def transform(self, X, y=None):
    if y is not None:
        warnings.warn("Parameter y is deprecated for this method, as it is not "
                      "required, and will be removed in a future version.",
                      DeprecationWarning)
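A runnable sketch of the deprecation pattern under discussion (DemoTransformer and its doubling transform are hypothetical stand-ins, not scikit-learn code):

```python
import warnings

class DemoTransformer:
    # Hypothetical transformer: y is accepted only to warn, never used.
    def transform(self, X, y=None):
        if y is not None:
            warnings.warn(
                "Passing y to transform is deprecated; it is ignored "
                "and will be removed in a future version.",
                DeprecationWarning,
            )
        return [x * 2 for x in X]  # stand-in transformation

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = DemoTransformer().transform([1, 2, 3], y=[0, 1, 0])

print(result)       # [2, 4, 6]
print(len(caught))  # 1: the DeprecationWarning fired
```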

@jnothman
Member Author

jnothman commented Jan 8, 2017 via email

@jnothman
Member Author

jnothman commented Jan 8, 2017 via email

@GaelVaroquaux
Member

GaelVaroquaux commented Jan 8, 2017 via email

@jnothman
Member Author

jnothman commented Jan 8, 2017

@GaelVaroquaux please clarify examples. IMO:

  • I can't think of any case where y is used in transform atm
  • It is certainly not forwarded by Pipeline or FeatureUnion
  • We have inconsistency as to whether y=None is included in transform's signature
  • But I think including it in transform's signature when it is generally unused confuses users and people writing their own transformers.
  • The documentation has transform with a single positional arg as the prototype.
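A minimal sketch of the second point, that Pipeline never forwards y to transform (RecordingTransformer is a hypothetical helper; inside the pipeline, fit receives y but transform is called with X alone):

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.dummy import DummyClassifier
from sklearn.pipeline import Pipeline

class RecordingTransformer(BaseEstimator, TransformerMixin):
    # Hypothetical transformer that records whether y reached fit.
    def fit(self, X, y=None):
        self.fit_saw_y_ = y is not None
        return self

    def transform(self, X):  # no y parameter: Pipeline never passes one
        return X

pipe = Pipeline([("t", RecordingTransformer()),
                 ("clf", DummyClassifier(strategy="most_frequent"))])
pipe.fit([[0], [1], [2]], [0, 1, 1])
print(pipe.named_steps["t"].fit_saw_y_)  # True: y reaches fit...
# ...while transform ran with X only; a y argument there would be unused.
print(pipe.predict([[3]]))               # [1]
```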

@jnothman
Member Author

jnothman commented Jan 8, 2017 via email

@Akshay0724
Contributor

Hello @jnothman, I think the right thing to do before starting work on a PR is to ask whether someone else is already working on it. I had started working on this issue and was going to make a PR, but in the meantime @tzano directly made a PR.

@jnothman
Member Author

jnothman commented Jan 9, 2017 via email

@Akshay0724
Contributor

Is there still work to be done on this issue?

@jnothman
Member Author

jnothman commented Feb 19, 2017 via email

@GaelVaroquaux
Member

GaelVaroquaux commented Feb 19, 2017 via email

@jnothman
Member Author

jnothman commented Feb 19, 2017 via email

@amueller
Member

amueller commented Mar 3, 2017

I argued for this before, but @GaelVaroquaux wanted to wait on a ruling on what we're gonna do in terms of a new interface before deprecating anything. I'm still totally for it.

@jnothman
Member Author

jnothman commented Mar 4, 2017 via email

@jnothman
Member Author

Fixed

@matlotpib

matlotpib commented May 9, 2019

@jnothman
@GaelVaroquaux

"I can't think of any case y is used in transform atm"

"So, I agree that transform cannot accept y."

I have a use case for a transform accepting 'y'.

During the training phase:

fit(X, y):

  • for a given feature X['feature'], fit model that predicts that feature based on 'y' (and maybe other features)

transform(X, y):

  • if an instance of the given feature is determined to be an outlier, replace it with the value predicted by the model created in 'fit' (which requires 'y').

...Then, during the predict phase:

transform(X, y=None):

  • there is no 'y', so do nothing

To summarise: during training, you use a model based on 'y' to predict 'x', and use it to replace outliers in X.
This may allow you to train a more reasonable classifier.

Then, during predict, you do nothing in the transform, since 'y' is what you are predicting.

For this, you want the option of being able to pass in 'y' during the training phase, but not passing it in during the predict phase.

Edit: but maybe there is a better way of doing this? I just wanted everything to be in the pipeline, rather than doing this step separately.

@jnothman
Member Author

@matlotpib isn't this what fit_transform allows for? Or am I missing something?
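A hedged sketch of what that suggestion looks like in practice (OutlierReplacer and its z-score rule are hypothetical, not the commenter's actual code): fit_transform carries y for train-time-only behaviour, while transform stays a no-op for the predict path.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class OutlierReplacer(BaseEstimator, TransformerMixin):
    # Hypothetical: at train time (fit_transform, y available), replace
    # outliers in column 0 with the mean of that sample's class;
    # at predict time (transform, no y), do nothing.
    def __init__(self, z_thresh=1.5):
        self.z_thresh = z_thresh

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        col = X[:, 0]
        self.mean_, self.std_ = col.mean(), col.std() or 1.0
        if y is not None:
            y = np.asarray(y)
            self.class_means_ = {c: col[y == c].mean() for c in np.unique(y)}
        return self

    def fit_transform(self, X, y=None):
        self.fit(X, y)
        X = np.asarray(X, dtype=float).copy()
        if y is not None:
            y = np.asarray(y)
            z = np.abs(X[:, 0] - self.mean_) / self.std_
            for i in np.where(z > self.z_thresh)[0]:
                X[i, 0] = self.class_means_[y[i]]
        return X

    def transform(self, X):  # predict path: y unavailable, so a no-op
        return np.asarray(X, dtype=float)

X = [[1.0], [1.0], [1.0], [1.0], [100.0]]
y = [0, 0, 0, 1, 1]
Xt = OutlierReplacer().fit_transform(X, y)
print(Xt[:, 0])  # the 100.0 outlier is replaced; the rest untouched
```

The caveat is that Pipeline only calls fit_transform on intermediate steps during fit, so this only covers train-time replacement, which is exactly the use case described above.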

@DavidMertz

I am using FunctionTransformer for the first time, and I absolutely need y for my purposes. What I wish to do is remove "non-modal" (i.e. atypical) target values. I have a training set where I can assume that these uncommon target values are bad data. Only one of the mode values should be permitted for a particular combination of features (by default the identity of a person, but the interface allows selecting other fields to define "equivalent" persons).

from functools import partial
from sklearn.preprocessing import FunctionTransformer

def _modal_filter(X, y, fields='_member_key'):
    df = X.copy()
    df['TARGET'] = y
    df = modal_filter(df, fields)
    df.drop([c for c in df.columns if c.startswith(('TARGET', '_'))],
            axis=1, inplace=True)
    return df

def ModalFilter(fields):
    fn = partial(_modal_filter, fields=fields)
    return FunctionTransformer(fn, validate=False, pass_y=True)

If I cannot pass in y, there does not seem to be any way to accommodate this in a pipeline. You can see that the underlying function modal_filter() works fine, but it accepts a DataFrame that has a TARGET column in it. This is a good function, but it cannot be pipelined directly (as far as I can see).

@jnothman
Member Author

jnothman commented Jun 20, 2019 via email

@mhorbach-tibco

mhorbach-tibco commented Jun 9, 2022

y is needed in this case: you want to make a FunctionTransformer that calls either CountEncoder or TargetEncoder, with the choice passed in as method='target_encoder' (the default). Both encoders are from category_encoders. TargetEncoder encodes X but not y; it just uses y to do the encoding.
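For reference, the reason target encoding still fits the standard API: y is consumed at fit time only, so transform keeps the plain transform(X) signature. A hypothetical minimal version (MiniTargetEncoder is illustrative only, not the category_encoders implementation):

```python
from collections import defaultdict

class MiniTargetEncoder:
    # Hypothetical target encoder: y is consumed entirely at fit time,
    # so transform needs only X.
    def fit(self, X, y):
        sums = defaultdict(float)
        counts = defaultdict(int)
        for cat, target in zip(X, y):
            sums[cat] += target
            counts[cat] += 1
        self.mapping_ = {c: sums[c] / counts[c] for c in sums}
        self.global_mean_ = sum(y) / len(y)  # fallback for unseen categories
        return self

    def transform(self, X):  # no y parameter needed here
        return [self.mapping_.get(cat, self.global_mean_) for cat in X]

enc = MiniTargetEncoder().fit(["a", "a", "b"], [1, 0, 1])
print(enc.transform(["a", "b", "c"]))  # [0.5, 1.0, 0.666...]
```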


7 participants