Skip to content

MinMaxScaler output datatype #25845

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

ducatte
Copy link

@ducatte ducatte commented Mar 14, 2023

Reference Issues/PRs

Fixes #18443

What does this implement/fix? Explain your changes.

Modifies MinMaxScaler to give users the option to choose the output_dtype

Any other comments?

We may want to add this feature to other scalers
Still need to modify the changelog. Version 1.3, Enhancement?

ducatte added 3 commits March 13, 2023 23:08
modifications:
__init__
partial_fit
transform
minmax_scale

_parameter_constraints
docstring

fix tests
@ducatte ducatte changed the title MinMaxScaler output datatype #18443 MinMaxScaler output datatype Mar 14, 2023
Copy link
Member

@betatim betatim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for picking up this issue and working on it. I've left a few comments

@ducatte ducatte requested a review from betatim March 15, 2023 02:08
@glemaitre glemaitre self-requested a review March 16, 2023 14:13
@ducatte
Copy link
Author

ducatte commented Mar 20, 2023

@glemaitre @betatim can you help me with the review? I just corrected the merge conflict in the changelog.

@glemaitre
Copy link
Member

glemaitre commented Mar 20, 2023

We discuss this PR IRL with @jeremiedbb and @ogrisel to know if we really wanted this feature.

In the end, the feature will be useful if we succeed to avoid some memory copy. Otherwise, we could instead have a FunctionTransformer whose job is to call astype before or after the scaler in a Pipeline.

So here, the fix is not the right one since we actually do a copy when calling _validate_data. Putting fit aside, at transform, we should keep X dtype untouched and instead pre-allocate X_trans to be returned in the output dtype. Then, all operations could be done using the out from numpy. In this case, we never modify X and thus avoid an additional memory copy.

Where it becomes more cumbersome is handling the sparse case since the API offered by scipy.sparse is not the same. It means that we will have diverging code between dense and sparse which is not the case for the moment.

This PR could be useful to actually evaluate how much cumbersome and additional maintenance effort is required if we want to go this path.

@glemaitre glemaitre removed their request for review September 8, 2023 18:32
@lucyleeow lucyleeow added Needs Decision Requires decision and removed Needs Decision Requires decision labels Nov 1, 2023
@lucyleeow
Copy link
Member

lucyleeow commented Nov 1, 2023

Closing this PR as we are not sure we want this feature (and would need different implementation) and @ducatte indicated they were not interested in continuing on this but please re-open if desired.

I've copied @glemaitre 's useful repy to the issue #18443

@lucyleeow lucyleeow closed this Nov 1, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

MinMaxScaler output datatype
4 participants