ENH: Add Dask Array API support #28588
Conversation
sklearn/discriminant_analysis.py
Outdated
```diff
@@ -113,6 +113,14 @@ def _class_means(X, y):
     """
     xp, is_array_api_compliant = get_namespace(X)
     classes, y = xp.unique_inverse(y)
+    # Force lazy array api backends to call compute
+    if hasattr(classes, "persist"):
```
I'm planning on extracting the calls here to a `compute` function to realize lazy arrays. Not sure if it will live in scikit-learn, though. I think I'd want it to live in array-api-compat, but I haven't discussed this with the folks there yet.
Based on usage here, the new `compute` function could have an option to e.g. compute only the shape if that's what's needed, or compute the full array.
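For illustration, a minimal sketch of what such a `compute` helper might look like (hypothetical; it duck-types on the dask-style `compute` method, mirroring the `persist` check in the diff above):

```python
def compute(x):
    # Duck-type on the dask-style ``compute`` method instead of importing
    # dask, so arrays from eager libraries pass through unchanged.
    return x.compute() if hasattr(x, "compute") else x
```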
I think this kind of code needs to go somewhere else. Having the "laziness" leak through is a bit silly. I'm not sure where it should be, though.
I don't know how to resolve this issue. In the past I've suggested that accessing something like `.shape` should just trigger the computation for you, or that everything needs to grow a `.compute`, even eager libs. Both would allow array consumers like scikit-learn to write code that does not care whether the implementation is lazy or not. However, there doesn't seem to be much support for this within the Array API community. The alternative is having to place code that checks "is this lazy? if yes, trigger compute" in consumers like scikit-learn, which I think is not great.
So this is maybe not a task for you alone to solve @lithomas1, but we need to find some kind of solution.
Thinking about this some more, would scikit-learn be happy with a `shape` helper (as opposed to my earlier suggestion of a `compute` helper) from array-api-compat?
This would handle the laziness of the various array libraries out there and call `compute` (or whatever the equivalent is) on the array if the shape needs to be materialized, before returning the shape of the array. (In the event that array-api-compat is not installed, we can just define this to return e.g. `x.shape`, like scikit-learn already does.)
The advantage of this approach would be that scikit-learn doesn't have to think about laziness anymore - that work would be outsourced to array-api-compat.
The only downsides of this approach are that:
- We indiscriminately materialize (for lazy arrays) even when it's strictly not necessary (e.g. when we just access the shape to pass it to an array constructor like `np.zeros`). I don't think we'll lose too much (if any) performance here, though.
- scikit-learn devs need to remember that accessing `.shape` directly is banned, and existing usages have to be migrated.
  - This can be mitigated with a pre-commit hook to automatically detect `.shape` accesses. I think with something like ruff, one can write a custom rule to automatically rewrite `.shape` accesses to `shape(x)`.
  - For existing usages, this is something that can be done as part of the Array API migration process, so it shouldn't cause too much churn on its own.

How does this sound to you?
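For concreteness, a minimal sketch of what the proposed helper could look like (hypothetical: it assumes lazy arrays expose a dask-style `compute()` method and report unknown dimensions as NaN, as dask does):

```python
import math


def shape(x):
    """Return the fully materialized shape of ``x``.

    Eager arrays return ``x.shape`` directly; lazy arrays whose shape
    contains unknown (NaN) dimensions are computed first.
    """
    if any(isinstance(dim, float) and math.isnan(dim) for dim in x.shape):
        # Materializing the array resolves the unknown dimensions.
        x = x.compute()
    return x.shape
```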
@betatim any thoughts on the above?
Replying here and linking @ogrisel's comment #28588 (comment).
I think having a helper like `shape()` is the only(?) way forward for now. I'd not add it to array-api-compat but instead add it to scikit-learn in `utils/_array_api.py` - we already have a few helpers there for things that feel "scikit-learn specific".
More adventurous: I wonder if we can even wrap the dask namespace (and via that its arrays) to make it so that `.shape` access triggers the computation. That way people who edit scikit-learn's code base don't need to know anything about this issue.
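For what it's worth, a rough sketch of what such a wrapper could look like (purely hypothetical; a real implementation would also have to forward operators, `__array_namespace__`, and so on):

```python
import math


class ShapeComputingArray:
    """Hypothetical proxy around a dask array whose ``.shape`` access
    resolves unknown chunk sizes by triggering computation."""

    def __init__(self, array):
        self._array = array

    @property
    def shape(self):
        if any(isinstance(d, float) and math.isnan(d) for d in self._array.shape):
            # compute_chunk_sizes() materializes the unknown chunk sizes.
            self._array = self._array.compute_chunk_sizes()
        return self._array.shape

    def __getattr__(self, name):
        # Delegate everything else to the wrapped dask array.
        return getattr(self._array, name)
```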
Note that a `compute` helper could also help deal with the boolean-masked assignment problem in `r2_score` described in more detail in this comment: #28588 (comment).
This is also a lazy evaluation problem, but one not related to shape values.
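To illustrate the pattern being discussed (this is not the actual `r2_score` code): in-place boolean-mask assignment such as `scores[mask] = 0.0` is unavailable on lazy arrays, but `where` can express the same result as a new lazy expression:

```python
# Standalone demo of the general pattern, using numpy through array-api-compat.
import numpy as np
from array_api_compat import array_namespace

numerator = np.array([0.0, 2.0, 3.0])
denominator = np.array([0.0, 4.0, 6.0])
xp = array_namespace(numerator)

nonzero = denominator != 0.0
# Avoid 0/0 in the branch that will be masked out anyway.
safe_denominator = xp.where(nonzero, denominator, xp.ones_like(denominator))
scores = xp.where(nonzero, 1.0 - numerator / safe_denominator, xp.zeros_like(numerator))
```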
> I think having a helper like shape() is the only(?) way forward for now. I'd not add it to array-api-compat but instead add it to scikit-learn in utils/_array_api.py - we already have a few helpers there for things that feel "scikit-learn specific".

We could also do both: have a public helper in array-api-compat, and a private scikit-learn specific helper in scikit-learn that does nothing for libraries that are not accessed via array-api-compat (e.g. array-api-strict), as long as the spec does not provide a standard way to deal with this.
> More adventurous: I wonder if we can even wrap the dask namespace (and via that its arrays) to make it so that .shape access triggers the computation. That way people who edit scikit-learn's code base don't need to know anything about this issue.

Not sure how feasible this is, and whether it's desirable or not to trigger computation implicitly when using lazy libraries.
sklearn/discriminant_analysis.py
Outdated
```diff
+    # compute right now
+    # Probably a dask bug
+    # (the error is also kinda flaky)
+    y = y.compute()
```
To shed more light on what this is: I think this bug happens when scikit-learn calls `unique_inverse`. Something seems to be going wrong somewhere in dask, where the results of intermediate operations get corrupted.
Exactly when the error occurs during computation is somewhat flaky, but it happens more often than not without the `compute` here.
Did you investigate the cause of the corruption? It might be worth reporting a minimal reproducer upstream.
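Something along these lines might be a starting point for such a reproducer (hypothetical: the exact input that triggers the corruption is not captured in this thread):

```python
# Hypothetical reproducer sketch: run unique_inverse on a chunked dask
# array repeatedly; flaky corruption would show up as differing results.
import numpy as np
import dask.array as da
from array_api_compat import array_namespace

y = da.from_array(np.array([0, 1, 1, 2, 0, 2]), chunks=2)
xp = array_namespace(y)  # the array-api-compat dask namespace

for _ in range(10):
    classes, y_inv = xp.unique_inverse(y)
    assert np.array_equal(y_inv.compute(), [0, 1, 1, 2, 0, 2])
```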
While LDA was a bit hard to port over to dask, PCA worked perfectly out of the box! The other preprocessing/metrics/tools also worked out of the box, I think (at least judging by the tests).
Re: performance - testing out of core for dask, with data generated by dask-ml, on a Gitpod machine with 2 cores and 8 GB RAM, I get 14m51s for a 100,000,000 by 20 LDA (14.90 GB of data), which I think is pretty decent scaling. For distributed computation, I am still measuring.
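The generation parameters weren't preserved above, but a setup along these lines would match the quoted sizes (100,000,000 float64 samples with 20 features is roughly 14.90 GiB); the chunking is an assumption:

```python
# Assumed reconstruction of the benchmark setup; the actual parameters
# and chunking were not preserved in the comment above.
import sklearn
from dask_ml.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

sklearn.set_config(array_api_dispatch=True)

X, y = make_classification(
    n_samples=100_000_000,  # ~14.90 GiB of float64 features in total
    n_features=20,
    chunks=1_000_000,  # assumed: ~160 MB per block, processed out of core
)
LinearDiscriminantAnalysis().fit(X, y)
```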
FYI, array-api-compat fixes are ongoing here: data-apis/array-api-compat#110
Since array-api-compat 1.5.1 came out and CI is green here, I'm going to mark this PR as ready for review. The only other change I'm planning right now is splitting out the LDA changes, since that requires a patch to dask itself. The correct way to handle laziness is also something that might be good to think about.
The score method is implicitly used by other tools. It might also be the case that the score method of the estimator itself can convert arrays to numpy if the namespace does not provide the necessary operations.
Co-authored-by: Samir Nasibli <samir.nasibli@intel.com>
sklearn/linear_model/_ridge.py
Outdated
```diff
@@ -288,11 +288,14 @@ def _solve_cholesky_kernel(K, y, alpha, sample_weight=None, copy=False):
 def _solve_svd(X, y, alpha, xp=None):
     xp, _ = get_namespace(X, xp=xp)
     U, s, Vt = xp.linalg.svd(X, full_matrices=False)
-    idx = s > 1e-15  # same default value as scipy.linalg.pinv
-    s_nnz = s[idx][:, None]
+    idx = s > 1e-15[:, None]  # same default value as scipy.linalg.pinv
```
`>` does not have precedence over `[]`, right? Trying this pattern locally yields:

`TypeError: 'float' object is not subscriptable`

I think you meant the following instead:

```diff
-idx = s > 1e-15[:, None]  # same default value as scipy.linalg.pinv
+# scipy.linalg.pinv also thresholds at 1e-15 by default.
+idx = (s > 1e-15)[:, None]
```
But then the following `s = s[:, None]` seems redundant, and similarly for `idx[:, None]` in the call to `where`.
Also, we should probably rename `idx` to something more correct and explicit, such as `strictly_positive_mask`.
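Putting the two comments together, the relevant lines might end up looking something like this (a sketch of the suggested direction, not the actual diff; `strictly_positive_mask` and `s_col` are names introduced here for illustration):

```python
# Sketch only: combines the precedence fix, the rename, and a single
# broadcast, while keeping the lazy-friendly ``where`` instead of
# boolean-mask assignment.
U, s, Vt = xp.linalg.svd(X, full_matrices=False)
strictly_positive_mask = (s > 1e-15)[:, None]  # same threshold as scipy.linalg.pinv
s_col = s[:, None]
d = xp.where(strictly_positive_mask, s_col / (s_col**2 + alpha), xp.zeros_like(s_col))
```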
Thanks! I think I had this working locally and passing tests, but I must have messed something up in the merge.
I'm planning on circling back to this towards the end of summer, so feel free to take over in the meantime if you're interested.
Not holding my breath, but also hoping that dask.array support improves in the meantime as well.
(The recent Array API updates suggest that sort is the most pressing thing missing in dask. Linalg-wise, the eig family of methods is probably the next biggest missing feature in dask; we just haven't seen it come up yet since not a lot of estimators have been ported.)
Let's convert to draft for the time being then.
I merged `main` to retrigger a round of CI with the current state of scikit-learn's array API support, so that we have a better understanding of how much of a problem value-dependent shapes / assignments are, and can maybe compare the test results to the failures observed when attempting to run the same tests with jax: #29647.
I am changing my review state to "request changes" to avoid implying that this draft PR is ready to merge as is. We first need to agree on whether we accept partial dask support, or whether we want to iron out all the problems induced by the lazy evaluation semantics before considering a merge of this PR.
Thanks for updating. I'm back to working on this now that my summer is over. Last I remember, I think I had ridge regression working locally (after wrangling some issues with data-dependent output shapes, which I think I worked around with `where`). I'm going to try to fix some more issues upstream in dask, like the lack of sorting (which seems to account for a majority of the failures) and also linalg. Re: dynamic shapes:
No, I have not tried.
For the record, #30340 was recently merged.
Thanks, I'll rebase this this week and let you know how it goes.
Made a bit of progress.
I haven't touched this PR in a while (besides rebasing). Hopefully I can make the PRs into array-api-extra and/or finish debugging LDA next week.
There was a PR in progress for this at data-apis/array-api-extra#124; feel free to comment over there if that is blocking you.
Reference Issues/PRs
#26724
What does this implement/fix? Explain your changes.
Any other comments?
This depends on unmerged/unreleased changes in array-api-compat.