Skip to content

ENH Makes ColumnTransformer more flexible by only checking for non-dropped columns #19263

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

thomasjpfan
Copy link
Member

@thomasjpfan thomasjpfan commented Jan 24, 2021

Reference Issues/PRs

Fixes #14251

What does this implement/fix? Explain your changes.

This PR enables transform to only require non-dropped columns to exist in the input, X, regardless of the order. Also, dropped columns are not required in transform.

@thomasjpfan thomasjpfan marked this pull request as draft January 24, 2021 23:52
@thomasjpfan thomasjpfan marked this pull request as ready for review January 25, 2021 00:33
@thomasjpfan thomasjpfan marked this pull request as draft January 25, 2021 01:32
@thomasjpfan thomasjpfan marked this pull request as ready for review January 25, 2021 01:51
@adrinjalali
Copy link
Member

So think about this pipeline:

clf = make_pipeline(ColumnTransformer(...., remaining=SequenceAnalyzer()), SGDClassifier())

Assume all those remaining columns form a time series or a sort of a sequence which is understood by SequenceAnalyzer, and SequenceAnalyzer can handle different lengths of those series (in the form of columns). Should ColumnTransformer then accept different lengths? This could happen if the upstream script returns a DataFrame which has the columns supporting maximum length of the series, which could differ between fit and transform here.

If we want ColumnTransformer to be very flexible, a user could argue this should also be supported I guess. (to be clear, I would rather not handle any of these in ColumnTransformer :D )

But if I'm the only one who doesn't like this to be in ColumnTransformer, I'm happy for it to be merged :)

Copy link
Member

@lorentzenchr lorentzenchr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've reviewed the test cases so far. Are there more edge cases we might want to test?

Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
Copy link
Member

@lorentzenchr lorentzenchr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another round.

"data given during fit."
)
Xs = self._fit_transform(X, None, _transform_one, fitted=True)
# ndarray was used for training
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At this stage, the if-else is not enough not know that we trained/fitted on an ndarray. I think the error for the invalid combination "fit df and transform ndarray" will be thrown in self._fit_transform further below.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the comment to:

ndarray was used for fitting or transforming, thus we only check that n_features is consistent

In one of my later commits, this PR started to allow for fitting on dataframe and transforming on numpy arrays and vice vesa. This was to ensure backward compatibility. The new feature enabled by this PR is only active if fitting and transform are being done on dataframes.

Copy link
Member

@lorentzenchr lorentzenchr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.
It remains to adapt the docstring and maybe the user guide. When exactly is the order important, when is it not?
(And maybe also correct the docstring sentence "A callable is passed the input data X and can return any of the above.":eyes:)

Comment on lines +531 to +532
and the dataframe only has string column names, then transforming a dataframe
will use the column names to select the columns::
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
and the dataframe only has string column names, then transforming a dataframe
will use the column names to select the columns::
and the dataframe has only string column names, then transforming a dataframe
will use the column names to select the columns, no matter in which order they
are::

The dataframe in transform also needs string column names, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR doesn't check for the dataframe in transform to only string columns. The only requirement was for the dataframe in transform to have all the (str) column names that were required by fit.

I lean slightly toward the current behavior, but I am open to requiring all strings column in transform as well.

@rth
Copy link
Member

rth commented Feb 1, 2021

Thanks for proposing this PR!

@lorentzenchr lorentzenchr added this to the 1.0 milestone Feb 22, 2021
@lorentzenchr
Copy link
Member

@thomasjpfan Could you resolve merge conflicts? Then we could label it as waiting for reviewers😏

@rth rth self-requested a review March 27, 2021 10:09
Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just tiny sugggestions for further improvement. Feel free to ignore if you believe they are not needed/helpful.

@lorentzenchr lorentzenchr merged commit 9c3b402 into scikit-learn:main Apr 29, 2021
@lorentzenchr
Copy link
Member

@thomasjpfan Thanks for taking care of this issue and making it happen for the upcoming 1.0 release.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

RFC ColumnTransformer input validation and requirements
6 participants