
WIP Enabling different array types (CuPy) in PCA with NEP 37 #16574


Closed

Conversation

thomasjpfan
Member

@thomasjpfan commented Feb 27, 2020

Reference Issues/PRs

After discussing with @seberg, it was concluded that a prototype of using CuPy with sklearn can help with seeing how libraries would use NEP 37.

What does this implement/fix? Explain your changes.

For reference, this is how this will look for the user:

https://gist.github.com/thomasjpfan/da54f874f3d434eb4ae360b5e6fa4c12

With numpy

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(42)
X = rng.rand(15, 5)

pca = PCA(n_components=3, random_state=0)
X_trans = pca.fit_transform(X)
type(X_trans)
# numpy.ndarray

With cupy

import cupy
import sklearn

X_cp = cupy.asarray(X)
type(X_cp)
# cupy.core.core.ndarray

X_cp_trans = pca.fit_transform(X_cp)
# ValueError

with sklearn.config_context(enable_duck_array=True):
    X_cp_trans = pca.fit_transform(X_cp)
type(X_cp_trans)
# cupy.core.core.ndarray

Any other comments?

This is a quick prototype of how we can use NEP 37 in PCA. (There is still work to do for the other solvers.)
Note that get_array_module is a hack for now; it would come from numpy itself if NEP 37 gets accepted.
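For reference, a minimal sketch of what such a get_array_module could look like under NEP 37's proposed __array_module__ protocol (names and fallback behaviour here are illustrative, not this PR's exact code):

import numpy as np

def get_array_module(*arrays):
    # Stand-in for the np.get_array_module proposed by NEP 37: ask each
    # argument for the module implementing the NumPy API through the
    # proposed __array_module__ protocol, falling back to plain numpy.
    types = tuple(type(a) for a in arrays)
    for a in arrays:
        make_module = getattr(a, '__array_module__', None)
        if make_module is not None:
            module = make_module(types)
            if module is not NotImplemented:
                return module
    return np

# Inside PCA the computations then go through the returned module, e.g.
# xp = get_array_module(X); mean_ = xp.mean(X, axis=0)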

CC: @amueller @seberg

@amueller
Member

In some discussions this week (maybe with @rgommers?) the concern was raised that it will be awkward if each library using this interface adds its own config flag, and that it might be nicer if the opt-in happened at the NumPy level. But I'm not entirely sure if this is possible.

@rgommers
Contributor

Yes, we did discuss this sklearn context manager idea, also with one or both of @jakirkham and @jni I believe. This PR is an interesting experiment. It would be useful to step back and figure out how this will look to the end user once more libraries start doing this. Options:

  1. every library has its own per-package switch for this (as in this PR)
  2. NumPy provides a switch
  3. no switch, just accept a backwards incompatibility in a next major (or minor) release somewhere

Perhaps there are more options?

Dask and CuPy have already done (3) for __array_function__ without too many problems. However, other libraries are understandably attached to their backwards compatibility. At first glance, (2) seems more appealing than (1) in the long term. It may make introduction a little harder, since global state then has a wider reach, but having to set many of those flags in larger programs will be quite clumsy a few years down the line.

@rgommers
Contributor

Another thing to discuss is having a context manager in the first place; e.g. @shoyer was very much against that when it was proposed with unumpy. If you do want to use one, think about how well things compose, and whether it's robust under multi-threading/processing.
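For what it's worth, the usual way to keep such a switch robust under multi-threading is thread-local state; a sketch with made-up names, not sklearn's actual implementation:

import threading
from contextlib import contextmanager

_local = threading.local()

def get_config():
    # Each thread sees only its own flags, so a context entered in one
    # thread cannot leak into another.
    return getattr(_local, 'config', {'enable_duck_array': False})

@contextmanager
def config_context(**new_config):
    old_config = get_config()
    _local.config = {**old_config, **new_config}
    try:
        yield
    finally:
        _local.config = old_config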

@seberg
Contributor

seberg commented Jun 22, 2020

Since I see some activity: a while ago we decided that we should explore both options, if easier in an (as private as possible) way in the master branch, with the understanding that it will by default be removed before the next release unless we agree otherwise.
There is still the reasonable option of trying to extend __array_function__ enough to make it work well here (that probably means some kind of like= argument, which seems not a big issue to add in general, and possibly allowing easier fallback for __array_function__ implementers, or similar).

EDIT: Sorry, the main point was this: I am not aware of anyone actively trying to do this or experimenting with it, so picking it up is welcome.

@ogrisel
Member

ogrisel commented Jun 22, 2020

@seberg I wanted to give this PR a closer look and I found it needed a couple of fixes + a merge from master to take into account recent changes to get the CI up and running.

I tested it locally with jax and found an issue with read-only intermediate data when copy=True so I decided first to add a new test before pushing any further "improvement" to this PR.

There is still the reasonable option of trying to extend __array_function__ enough to make it work well here (that means some kind of like= argument probably, which seems not a big issue to add in general, and possibly allowing easier fallback for __array_function__ implementers, or similar).

Can you give a bit more details?

Another problem will be dealing with np.random.RandomState, which is used internally by sklearn.decomposition.PCA(svd_solver='randomized'). This specific case has not been addressed in the current prototype.

AFAIK jax.numpy does not try to implement the numpy.random API at the moment; I am not sure how we should deal with partial implementations of the numpy API.
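One possible way to cope with a partial implementation is feature detection with an explicit NumPy fallback; a rough sketch (make_rng is a made-up helper, not part of this PR):

import numpy as np

def make_rng(xp, seed=0):
    # jax.numpy has no numpy.random equivalent, so fall back to the
    # host-side NumPy generator; the fallback may imply device/host
    # copies for GPU-backed arrays.
    if hasattr(xp, 'random') and hasattr(xp.random, 'RandomState'):
        return xp.random.RandomState(seed)
    return np.random.RandomState(seed)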

@seberg
Contributor

seberg commented Jun 22, 2020

@ogrisel I'll have to go into details later, but basically: I think we have not reached the point where we are sure that np.get_array_module() is really necessary compared to making __array_function__ more powerful while fixing some of its issues.
How exactly that would look is unclear, and I think we actually need this to be explored at this time, i.e. try out both approaches and see how far we get. Then we can make a call.

np.random is a bigger issue, yes. Using __array_function__ it is much harder to wrap, especially the generators. We would need to dispatch generators based on a like= argument or so as well (basically giving the array-like the option to swap them out), as sketched below.
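Concretely, dispatching a generator on a like= argument could look roughly like this (entirely hypothetical: no such hook exists in NumPy, and it reuses the get_array_module sketch above):

import numpy as np

def dispatched_random_state(seed=None, like=None):
    # Hypothetical: let the module of the array-like supply the
    # generator, so that PCA(svd_solver='randomized') would draw its
    # random test matrix on the GPU for CuPy inputs instead of on the
    # host.
    if like is not None:
        xp = get_array_module(like)
        if hasattr(xp, 'random'):
            return xp.random.RandomState(seed)
    return np.random.RandomState(seed)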

You could check here a bit: https://github.com/numpy/archive/blob/master/other_meetings/2020-04-21-array-protocols_discussion_and_notes.md. Mostly, what is under "Agenda" is what we discussed; what follows is mainly thoughts compiled by me and Hameer before the meeting...

Basically, we had decided that we would be willing to very actively explore both options (if helpful, even in the master branch). At the time the 1.19 branching was still pending; we have the release now, so now would be the time to attack such ideas!

@thomasjpfan
Member Author

thomasjpfan commented Jun 22, 2020

I tested it locally with jax and found an issue with read-only intermediate data when copy=True so I decided first to add a new test before pushing any further "improvement" to this PR.

For jax support we would have to do away with any in-place operations, such as:

X -= self.mean_

Also, we would not be able to use the out= keyword in numpy operations to work in place.
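For illustration, an out-of-place equivalent looks like this (a sketch, not this PR's diff):

import numpy as np

def center(X, mean_):
    # Allocates a new array instead of mutating X, which is what
    # immutable jax arrays require; no -= and no out= keyword.
    return X - mean_

X = np.random.rand(15, 5)
X = center(X, X.mean(axis=0))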

Edit: I disabled some instances to be nicer to the CI while there is a sprint going on.

Edit 2: I do not think __array_function__ is supported in jax yet: jax-ml/jax#1565. I do not know of a non-GPU array library that implements a good majority of the numpy API.

@ogrisel
Member

ogrisel commented Jun 23, 2020

The problem for the test failing with jax is that a.copy() returns a (read-only) numpy array instead of a newly allocated jax array. The fact that it's a numpy array seems to be on purpose, which is quite unfortunate, but so be it. I do not know why it's read-only, as the original jax array is mutable and X -= self.mean_ would have worked on it (and indeed the same test passes with copy=False).

@ogrisel
Member

ogrisel commented Jun 23, 2020

To answer @seberg's comment on NEP 37 vs __array_function__ with NEP 35 / like=:

@thomasjpfan have you already started such experiments on your side? If not I think it would be interesting to open a sister PR to try to experiment with that approach on the PCA class as well.

@ogrisel
Member

ogrisel commented Jun 23, 2020

The __array_function__ / like= approach is being discussed in #16196 on the PolynomialFeatures estimator.

@thomasjpfan
Member Author

have you already started such experiments on your side? If not I think it would be interesting to open a sister PR to try to experiment with that approach on the PCA class as well.

I remember experimenting with __array_function__, and it worked as long as we did not use scipy.linalg and allowed the user to pass a cupy.random.RandomState.

Feel free to open another PR to explore the __array_function__ approach with PCA.
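For the record, such an experiment would look something like this (hypothetical usage: scikit-learn does not officially accept a CuPy RandomState here):

import cupy
from sklearn.decomposition import PCA

# Hypothetical: hand the estimator a CuPy RandomState so the randomized
# solver draws its random projections on the GPU.
pca = PCA(n_components=3, svd_solver='randomized',
          random_state=cupy.random.RandomState(0))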

@seberg
Contributor

seberg commented Jun 23, 2020

It's great that you are exploring this. I will note again that previously we agreed it would be OK to experiment a bit within NumPy master (e.g. add like= arguments or similar, if necessary to see how far we can get; although I suppose a fork of the numpy repo where we include both would be just as good).

There are a few issues with __array_function__ (hopefully I am not forgetting something important):

  1. All-or-nothing override, i.e. no clean fallback to a non-dispatched version.
  2. Backwards compatibility: an array-like implementing it "opts in" its users.
    • There may be no solution to this, transitions are hard, but it is unclear right now that this actually is a huge show-stopper in the long run.
    • get_array_module is explicit, so it is a bit better, but in the end, if sklearn uses it by default, a similar transition issue occurs.
  3. We need a like= argument or some other mechanism for array creation (see the sketch after this list).
  4. It would be nice to have a story for libraries, i.e. how a library could end up allowing something like __array_function__ itself.
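For reference on point 3, NEP 35's like= argument (which later shipped for array-creation functions in NumPy 1.20) dispatches creation to the foreign array's library:

import numpy as np
import cupy

X_cp = cupy.arange(5)
# like= routes the creation through X_cp's __array_function__, so a
# CuPy array comes back instead of a NumPy one (requires NumPy >= 1.20).
Y = np.zeros(5, like=X_cp)
type(Y)  # cupy.ndarray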

@thomasjpfan
Member Author

I'll put together another PR to see how we can use __array_function__ for a simpler estimator, such as one of the scalers.

@shoyer
Contributor

shoyer commented Jun 23, 2020

The problem for the test failing with jax is that a.copy() returns a (read-only) numpy array instead of a newly allocated jax array.

I think there is a good chance this will be changed in JAX:
jax-ml/jax#3473

@mblondel
Member

SciPy has the same issue; most modules have C, C++, Fortran or Cython code that will be very hard (or impossible) to rewrite in Python/NumPy. I think we want to make that a module-by-module rather than a function-by-function decision; otherwise it becomes completely impossible to reason about.

If a function looks like this

def func(X):
    X = some_pre_processing(X)      # pure NumPy
    X = algorithm_core(X)           # Cython / Fortran / C / C++
    return some_post_processing(X)  # pure NumPy

there could potentially be some value, if the cost of the pre- and post-processing outweighs the cost of the CPU/GPU switch. Not sure if that's often the case in scikit-learn though.

@thomasjpfan
Member Author

With the work on the Array API standard, I think this PR is not needed anymore.

@jakirkham
Contributor

Thank you Thomas for your work here! 🙏

It is nice to see how array support has evolved.
