Make standard scaler compatible to Array API #27113
base: main
Conversation
sklearn/utils/extmath.py (outdated)

```python
result = op(x, *args, **kwargs, dtype=np.float64)
from ..utils._array_api import isdtype, get_namespace
xp, _ = get_namespace(x)
if isdtype(x.dtype, "real floating", xp=xp) and x.dtype in (xp.float32, xp.float64):  # what about int, etc.?
```
There is actually an error here: the second condition should probably be `x.dtype in (xp.float16, xp.float32)`, since this branch is meant to upcast lower-precision floats to float64 before accumulating.
Since `x.dtype.itemsize < 8` is not an Array API compatible way to assess the precision level of a dtype, we could instead use `xp.finfo(x.dtype).bits < 64`.
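A minimal sketch of that check, using NumPy as the namespace (any Array API namespace exposing `finfo` would behave the same; the helper name is illustrative):

```python
import numpy as np

def needs_float64_upcast(x, xp):
    # finfo(dtype).bits is defined by the Array API standard, unlike the
    # NumPy-specific x.dtype.itemsize, so this works across namespaces.
    return xp.finfo(x.dtype).bits < 64

print(needs_float64_upcast(np.ones(3, dtype=np.float32), np))  # True (32 bits)
print(needs_float64_upcast(np.ones(3, dtype=np.float64), np))  # False (already 64)
```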
This should be solved now.
```python
def test_standard_scaler_array_api_compliance(array_namespace, device, dtype):
    xp, device, dtype = _array_api_for_tests(array_namespace, device, dtype)

    from sklearn.datasets import make_classification
```
This import should either be at the top of the file, or we can just use something like `random_state.randn(n_samples, n_features)`.
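A sketch of that suggestion (the seed and the shape values are illustrative):

```python
import numpy as np

# Generate the test data with a seeded RandomState instead of importing
# make_classification inside the test body.
random_state = np.random.RandomState(42)
n_samples, n_features = 100, 5
X = random_state.randn(n_samples, n_features)
```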
Hi @AlexanderFabisch, I'm happy to continue this if it cannot wait until October. Waiting to see what the maintainers think. :) Here are a few things I learned while working on my PR that might be helpful if you decide to keep working on it.
Sure, I could also give you write access to my fork if needed. That way we could collaborate better.
Force-pushed from 3d9293a to fe6409c
Hi @AlexanderFabisch, thank you for sharing the fork :) There are still a couple of TODOs.
Another thing to bear in mind is that …
That looks a lot better, @EdAbati. Thanks for continuing this PR.
It used to be that `_check_sample_weight` would coerce sample weights to the maximum-precision float dtype for non-floating-point data; it has since been changed to use the default float type instead.
Also, fix up formatting.
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
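A small illustration of the coercion being discussed (assuming current scikit-learn behavior; `_check_sample_weight` is a private helper in `sklearn.utils.validation`):

```python
import numpy as np
from sklearn.utils.validation import _check_sample_weight

# For integer input data and no explicit sample_weight, the helper returns
# float sample weights; whether that float is the maximum-precision float64
# or the namespace's default float type is exactly the change noted above.
X = np.asarray([[1], [2], [3]], dtype=np.int64)
sample_weight = _check_sample_weight(None, X)
print(sample_weight.dtype)  # float64 with NumPy inputs
```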
I started to rebase on the current main and will compile a list of TODOs. Unfortunately, this PR breaks a unit test that was introduced recently. I hope to have a clear picture of what is left to do at the beginning of next week.
Force-pushed from dac8e3b to d965610
I rebased and cleaned up the PR a bit. I believe there is only one open discussion at the moment, about using try/except vs. inspection of a function's signature.
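For context, a hypothetical sketch of the two options being weighed; `op` stands in for a namespace function such as `xp.sum` that may or may not accept a `dtype` keyword:

```python
import inspect

def call_with_dtype_try(op, x, dtype):
    # Option 1: just try the keyword and fall back on TypeError.
    try:
        return op(x, dtype=dtype)
    except TypeError:
        return op(x)

def call_with_dtype_inspect(op, x, dtype):
    # Option 2: inspect the signature first; this avoids masking unrelated
    # TypeErrors raised inside op itself, but inspect.signature can fail
    # for some builtins.
    if "dtype" in inspect.signature(op).parameters:
        return op(x, dtype=dtype)
    return op(x)
```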
Here's my contribution from the EuroSciPy 2023 sprint. It's still a work in progress, and I won't have time to continue before October, so if anyone else wants to take it from here, feel free to do so.
Reference Issues/PRs
See also #26024
What does this implement/fix? Explain your changes.
Make StandardScaler compatible with the Array API.
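Sketched below is how this would be used, assuming it follows the existing array API opt-in from #26024 (PyTorch and scikit-learn's array API dispatch requirements must be installed for this to run):

```python
import torch
import sklearn
from sklearn.preprocessing import StandardScaler

# Enable scikit-learn's array API dispatch, then fit on a torch tensor;
# fitted attributes and transform outputs stay in the input namespace.
sklearn.set_config(array_api_dispatch=True)

X = torch.tensor([[1.0, 2.0], [3.0, 4.0]], dtype=torch.float32)
scaler = StandardScaler().fit(X)
print(type(scaler.transform(X)))  # <class 'torch.Tensor'>
```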
Any other comments?
Unfortunately, the current implementation breaks some unit tests of the standard scaler that are related to dtypes. That's because I wanted to make it work for torch.float16, but maybe that is not necessary and we should just support float32 and float64.
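A minimal sketch of that simpler alternative, restricting support to float32/float64 and upcasting everything else (the helper name is hypothetical; `xp.astype` is the Array API spelling of dtype conversion):

```python
def as_supported_float(x, xp):
    # Keep float32/float64 untouched; upcast anything else (e.g. float16)
    # to float64 instead of special-casing low-precision dtypes.
    if x.dtype in (xp.float32, xp.float64):
        return x
    return xp.astype(x, xp.float64)
```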
I'll also add some comments to the diff. See below.