-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
ENH Makes ColumnTransformer more flexible by only checking for non-dropped columns #19263
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH Makes ColumnTransformer more flexible by only checking for non-dropped columns #19263
Conversation
So think about this pipeline: clf = make_pipeline(ColumnTransformer(...., remaining=SequenceAnalyzer()), SGDClassifier()) Assume all those remaining columns form a time series or a sort of a sequence which is understood by If we want But if I'm the only one who doesn't like this to be in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've reviewed the test cases so far. Are there more edge cases we might want to test?
Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another round.
"data given during fit." | ||
) | ||
Xs = self._fit_transform(X, None, _transform_one, fitted=True) | ||
# ndarray was used for training |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At this stage, the if-else is not enough not know that we trained/fitted on an ndarray. I think the error for the invalid combination "fit df and transform ndarray" will be thrown in self._fit_transform
further below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated the comment to:
ndarray was used for fitting or transforming, thus we only check that n_features is consistent
In one of my later commits, this PR started to allow for fitting on dataframe and transforming on numpy arrays and vice vesa. This was to ensure backward compatibility. The new feature enabled by this PR is only active if fitting and transform are being done on dataframes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
It remains to adapt the docstring and maybe the user guide. When exactly is the order important, when is it not?
(And maybe also correct the docstring sentence "A callable is passed the input data X
and can return any of the above.":eyes:)
and the dataframe only has string column names, then transforming a dataframe | ||
will use the column names to select the columns:: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and the dataframe only has string column names, then transforming a dataframe | |
will use the column names to select the columns:: | |
and the dataframe has only string column names, then transforming a dataframe | |
will use the column names to select the columns, no matter in which order they | |
are:: |
The dataframe in transform also needs string column names, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR doesn't check for the dataframe in transform
to only string columns. The only requirement was for the dataframe in transform
to have all the (str) column names that were required by fit
.
I lean slightly toward the current behavior, but I am open to requiring all strings column in transform
as well.
Thanks for proposing this PR! |
@thomasjpfan Could you resolve merge conflicts? Then we could label it as waiting for reviewers😏 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, just tiny sugggestions for further improvement. Feel free to ignore if you believe they are not needed/helpful.
@thomasjpfan Thanks for taking care of this issue and making it happen for the upcoming 1.0 release. |
Reference Issues/PRs
Fixes #14251
What does this implement/fix? Explain your changes.
This PR enables
transform
to only require non-dropped columns to exist in the input,X
, regardless of the order. Also, dropped columns are not required intransform
.