Skip to content

FIX Ensure that index is correct when global transform_output is pandas #26454

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Jun 14, 2023

Conversation

thomasjpfan
Copy link
Member

Reference Issues/PRs

Fixes #26443

What does this implement/fix? Explain your changes.

This PR updates the common test to ensure that the output dataframe from transform is consistent with the input. I updated Isomap, IterativeImputer and PowerTransformer so they pass the updated common test.

@@ -627,7 +627,7 @@ def _initial_imputation(self, X, in_fit=False):
strategy=self.initial_strategy,
fill_value=self.fill_value,
keep_empty_features=self.keep_empty_features,
)
).set_output(transform="default")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't love this, cause it seems it could be quite buggy, as in, in our codebase I don't know where else we have such instances where we're not explicitly setting the output type.

I would suggest having a default which is to raise and be like: you haven't explicitly set this value for this estimator, please do so, and have that mode for scikit-learn at least.

This way we'd know we should always set it explicitly to: default, pandas, or the global config.

WDYT?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can get behind a new "raise_if_not_configured" configuration flag. Implementation-wise we'll need a standard place to check that configuration at fit. That can be in _validate_data or _fit_context. _fit_context is a more natural place but it is not used everywhere yet.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_fit_context is a more natural place but it is not used everywhere yet.

It will be the case when #26473 is merged

Copy link
Member

@adrinjalali adrinjalali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really don't like this solution, but rather have in the release than not.

@thomasjpfan
Copy link
Member Author

Even with the suggestion in #26454 (comment), we'll end up with the same solution as this PR. The suggestion would improve our coverage for catching this type of bug.

Copy link
Member

@jeremiedbb jeremiedbb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@jeremiedbb jeremiedbb merged commit 96878ba into scikit-learn:main Jun 14, 2023
REDVM pushed a commit to REDVM/scikit-learn that referenced this pull request Nov 16, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

PowerTransformer returns inconsistent index when transform output is set globally
3 participants