MinMaxScaler output datatype #25845

ducatte · 2023-03-14T01:51:30Z

Reference Issues/PRs

Fixes #18443

What does this implement/fix? Explain your changes.

Modifies MinMaxScaler to give users the option to choose the output dtype via a new output_dtype parameter.

Any other comments?

We may want to add this feature to other scalers.
Still need to modify the changelog. Version 1.3, Enhancement?

modifications: __init__ partial_fit transform minmax_scale _parameter_constraints docstring fix tests

betatim

Thanks for picking up this issue and working on it. I've left a few comments.

sklearn/preprocessing/_data.py

ducatte · 2023-03-20T07:49:50Z

@glemaitre @betatim can you help me with the review? I just corrected the merge conflict in the changelog.

glemaitre · 2023-03-20T09:03:18Z

We discussed this PR IRL with @jeremiedbb and @ogrisel to decide whether we really want this feature.

In the end, the feature will only be useful if we manage to avoid a memory copy. Otherwise, we could instead use a FunctionTransformer whose job is to call astype before or after the scaler in a Pipeline.
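As a rough illustration of that alternative, the sketch below chains MinMaxScaler with a FunctionTransformer that calls astype. The helper name to_float32 and the toy data are assumptions made for the example, not part of this PR. Note that astype allocates a new array, so this approach does not save a copy.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, MinMaxScaler


def to_float32(X):
    # Hypothetical helper: cast the scaled output to float32.
    # astype allocates a new array, so this is an extra copy.
    return X.astype(np.float32)


# Scale first, then cast the result in a second pipeline step.
pipe = make_pipeline(MinMaxScaler(), FunctionTransformer(to_float32))

X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_scaled = pipe.fit_transform(X)
print(X_scaled.dtype)
```

Placing the cast before the scaler instead would make the scaler operate on the lower-precision data, which may or may not be desirable depending on the use case.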

So here, the fix is not the right one, since we actually make a copy when calling _validate_data. Putting fit aside, at transform we should leave the dtype of X untouched and instead pre-allocate X_trans, the array to be returned, in the output dtype. All operations could then be done using the out parameter of the NumPy functions. This way, we never modify X and thus avoid an additional memory copy.
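A minimal dense-only sketch of this idea, using plain NumPy rather than the scikit-learn internals: the function name minmax_transform_out and its arguments are hypothetical, and scale and min_ are computed by hand here to play the role of the fitted scale_ and min_ attributes.

```python
import numpy as np


def minmax_transform_out(X, scale, min_, output_dtype=np.float32):
    # Hypothetical helper, not the scikit-learn implementation.
    # Pre-allocate the result in the requested dtype; X is never modified.
    X_trans = np.empty(X.shape, dtype=output_dtype)
    # Write X * scale directly into X_trans (cast happens on output).
    np.multiply(X, scale, out=X_trans)
    # Add the offset in place in the pre-allocated array.
    np.add(X_trans, min_, out=X_trans)
    return X_trans


X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_before = X.copy()

# Per-feature statistics for a (0, 1) feature range.
data_min = X.min(axis=0)
data_max = X.max(axis=0)
scale = 1.0 / (data_max - data_min)
min_ = -data_min * scale

X_trans = minmax_transform_out(X, scale, min_)
print(X_trans.dtype, X.dtype)
```

The only full-size allocation is X_trans itself; the input keeps its dtype and its values.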

Where it becomes more cumbersome is the sparse case, since the API offered by scipy.sparse is not the same. It means that we would have diverging code paths between dense and sparse, which is not the case at the moment.
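To give a feel for the divergence, here is one possible way to get a similar effect for a CSR matrix. This is only a sketch under assumptions: scale_csr_columns is a hypothetical helper, it handles multiplicative column scaling only (an additive offset would destroy sparsity), and it is not code from this PR. Since scipy.sparse operations take no out argument, the sketch works on the underlying data array and rebuilds the matrix.

```python
import numpy as np
import scipy.sparse as sp


def scale_csr_columns(X, scale, output_dtype=np.float32):
    # Hypothetical helper: X is assumed to already be in CSR format.
    # Pre-allocate only the non-zero values in the requested dtype.
    data = np.empty(X.data.shape, dtype=output_dtype)
    # X.indices holds the column index of each stored value, so
    # scale[X.indices] is the per-value factor (a temporary of size nnz).
    np.multiply(X.data, scale[X.indices], out=data)
    # Rebuild a CSR matrix with the same sparsity structure.
    return sp.csr_matrix((data, X.indices, X.indptr), shape=X.shape)


X = sp.csr_matrix(np.array([[1.0, 0.0], [0.0, 4.0], [2.0, 0.0]]))
scale = np.array([0.5, 0.25])

X_trans = scale_csr_columns(X, scale)
print(X_trans.dtype, X.dtype)
```

The dense version needs two ufunc calls with out; the sparse one needs to reach into the matrix internals, which is the kind of separate code path the comment above refers to.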

This PR could be useful to evaluate how cumbersome this is and how much additional maintenance effort is required if we want to go down this path.

lucyleeow · 2023-11-01T03:27:55Z

Closing this PR as we are not sure we want this feature (and it would need a different implementation), and @ducatte indicated they were not interested in continuing with it. Please re-open if desired.

I've copied @glemaitre's useful reply to the issue #18443.

ducatte added 3 commits March 13, 2023 23:08

add passing test with current behaviour

507d73a

modified test with desired behavior

ddc1251

added output_dtype to MinMaxScaler

5808404

modifications: __init__ partial_fit transform minmax_scale _parameter_constraints docstring fix tests

github-actions bot added the module:preprocessing label Mar 14, 2023

ducatte changed the title ~~MinMaxScaler output datatype #18443~~ MinMaxScaler output datatype Mar 14, 2023

Merge branch 'main' into FixIssue18443-MinMaxScaler-output-datatype

560b6bb

betatim reviewed Mar 14, 2023

sklearn/preprocessing/_data.py (outdated, resolved)

sklearn/preprocessing/_data.py (outdated, resolved)

ducatte added 3 commits March 14, 2023 20:58

modifications and changelog

99a0d83

minmax_scale, put output_dtype at the end

84b6ca2

Merge branch 'main' into FixIssue18443-MinMaxScaler-output-datatype

fe4ba2e

ducatte requested a review from betatim March 15, 2023 02:08

ducatte added 2 commits March 15, 2023 10:23

Merge branch 'main' into FixIssue18443-MinMaxScaler-output-datatype

757be24

Merge branch 'main' into FixIssue18443-MinMaxScaler-output-datatype

de49125

glemaitre self-requested a review March 16, 2023 14:13

Merge branch 'main' into FixIssue18443-MinMaxScaler-output-datatype

7d6c9c1

ogrisel mentioned this pull request Apr 6, 2023

Use np.uint8 as default dtype for OneHotEncoder instead of np.float64 #26063

Open

ducatte mentioned this pull request May 7, 2023

MinMaxScaler output datatype #18443

Open

glemaitre removed their request for review September 8, 2023 18:32

lucyleeow added Needs Decision and removed Needs Decision labels Nov 1, 2023

lucyleeow closed this Nov 1, 2023
