
WIP Enabling different array types (CuPy) in PCA with NEP 37 #16574


Closed

Conversation

thomasjpfan
Member

@thomasjpfan commented Feb 27, 2020

Reference Issues/PRs

After discussing with @seberg, it was concluded that a prototype of using CuPy with sklearn can help with seeing how libraries would use NEP 37.

What does this implement/fix? Explain your changes.

For reference, this is how this will look for the user:

https://gist.github.com/thomasjpfan/da54f874f3d434eb4ae360b5e6fa4c12

With numpy

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(42)
X = rng.rand(15, 5)

pca = PCA(n_components=3, random_state=0)
X_trans = pca.fit_transform(X)
type(X_trans)
# numpy.ndarray

With cupy

import cupy
import sklearn

X_cp = cupy.asarray(X)
type(X_cp)
# cupy.core.core.ndarray

X_cp_trans = pca.fit_transform(X_cp)
# ValueError

with sklearn.config_context(enable_duck_array=True):
    X_cp_trans = pca.fit_transform(X_cp)
type(X_cp_trans)
# cupy.core.core.ndarray

Any other comments?

This is a quick prototype of how we can use NEP 37 in PCA. (There is still work to do for the other solvers.)
Note that get_array_module is a hack for now; it would come from numpy itself if NEP 37 gets accepted.
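For reference, a minimal sketch of what such a get_array_module could look like under NEP 37's proposed __array_module__ protocol (names and fallback behaviour here are illustrative, not this PR's exact code):

import numpy as np

def get_array_module(*arrays):
    # Stand-in for the np.get_array_module proposed by NEP 37: ask each
    # argument for the module implementing the NumPy API through the
    # proposed __array_module__ protocol, falling back to plain numpy.
    types = tuple(type(a) for a in arrays)
    for a in arrays:
        make_module = getattr(a, '__array_module__', None)
        if make_module is not None:
            module = make_module(types)
            if module is not NotImplemented:
                return module
    return np

# Inside PCA the computations then go through the returned module, e.g.
# xp = get_array_module(X); mean_ = xp.mean(X, axis=0)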

CC: @amueller @seberg

@amueller
Member

In some discussions this week (maybe with @rgommers?) the concern was raised that it will be awkward if each library using this interface adds its own config flag, and that it might be nicer if the opt-in happened at the NumPy level. But I'm not entirely sure if this is possible.

@rgommers
Contributor

Yes, we did discuss this sklearn context manager idea, also with one or both of @jakirkham and @jni I believe. This PR is an interesting experiment. It would be useful to step back and figure out how this will look to the end user once more libraries start doing this. Options:

  1. every library has its own per-package switch for this (as in this PR)
  2. NumPy provides a switch
  3. no switch, just accept a backwards incompatibility in a next major (or minor) release somewhere

Perhaps there are more options?

Dask and CuPy have already done (3) for __array_function__ without too many problems. However, other libraries are understandably attached to their backwards compatibility. At first glance, (2) seems more appealing than (1) in the long term. It may make introduction a little harder, since global state then has a wider reach, but having to set many of those flags in larger programs will be quite clumsy a few years down the line.

@rgommers
Contributor

Another thing to discuss is having a context manager in the first place; e.g. @shoyer was very much against that when it was proposed with unumpy. If you do want to use one, think about how well things compose, and whether it's robust under multi-threading/processing.
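For what it's worth, the usual way to keep such a switch robust under multi-threading is thread-local state; a sketch with made-up names, not sklearn's actual implementation:

import threading
from contextlib import contextmanager

_local = threading.local()

def get_config():
    # Each thread sees only its own flags, so a context entered in one
    # thread cannot leak into another.
    return getattr(_local, 'config', {'enable_duck_array': False})

@contextmanager
def config_context(**new_config):
    old_config = get_config()
    _local.config = {**old_config, **new_config}
    try:
        yield
    finally:
        _local.config = old_config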

@seberg
Contributor

seberg commented Jun 22, 2020

Since I see some activity: a while ago we decided that we should explore both options, if easier in an (as private as possible) way in the master branch, with the understanding that it will by default be removed before the next release unless we agree otherwise.
There is still the reasonable option of trying to extend __array_function__ enough to make it work well here (that probably means some kind of like= argument, which seems not a big issue to add in general, and possibly allowing easier fallback for __array_function__ implementers, or similar).

EDIT: Sorry, the main point was this: I am not aware of anyone actively trying to do this or experimenting with it, so picking it up is welcome.

@ogrisel
Member

ogrisel commented Jun 22, 2020

@seberg I wanted to give this PR a closer look and I found it needed a couple of fixes + a merge from master to take into account recent changes to get the CI up and running.

I tested it locally with jax and found an issue with read-only intermediate data when copy=True so I decided first to add a new test before pushing any further "improvement" to this PR.

There is still the reasonable option of trying to extend __array_function__ enough to make it work well here (that means some kind of like= argument probably, which seems not a big issue to add in general, and possibly allowing easier fallback for __array_function__ implementers, or similar).

Can you give a bit more details?

Another problem will be dealing with np.random.RandomState, which is used internally by sklearn.decomposition.PCA(svd_solver='randomized'). This specific case has not been addressed in the current prototype.

AFAIK jax.numpy does not try to implement the numpy.random API at the moment; I am not sure how we should deal with partial implementations of the numpy API.
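One possible way to cope with a partial implementation is feature detection with an explicit NumPy fallback; a rough sketch (make_rng is a made-up helper, not part of this PR):

import numpy as np

def make_rng(xp, seed=0):
    # jax.numpy has no numpy.random equivalent, so fall back to the
    # host-side NumPy generator; the fallback may imply device/host
    # copies for GPU-backed arrays.
    if hasattr(xp, 'random') and hasattr(xp.random, 'RandomState'):
        return xp.random.RandomState(seed)
    return np.random.RandomState(seed)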

@seberg
Contributor

seberg commented Jun 22, 2020

@ogrisel I'll have to go into details later, but basically: I think we have not reached the point where we are sure that np.get_array_module() is really necessary compared to making __array_function__ more powerful while fixing some of its issues.
How exactly that would look is unclear, and I think we actually need this to be explored at this time, i.e. try out both approaches and see how far we get. Then we can make a call.

np.random is a bigger issue, yes. Using __array_function__ it is much harder to wrap, especially the generators. We would need to dispatch generators based on a like= argument or so as well (basically giving the array-like the option to swap them out), as sketched below.
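Concretely, dispatching a generator on a like= argument could look roughly like this (entirely hypothetical: no such hook exists in NumPy, and it reuses the get_array_module sketch above):

import numpy as np

def dispatched_random_state(seed=None, like=None):
    # Hypothetical: let the module of the array-like supply the
    # generator, so that PCA(svd_solver='randomized') would draw its
    # random test matrix on the GPU for CuPy inputs instead of on the
    # host.
    if like is not None:
        xp = get_array_module(like)
        if hasattr(xp, 'random'):
            return xp.random.RandomState(seed)
    return np.random.RandomState(seed)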

You could check here a bit: https://github.com/numpy/archive/blob/master/other_meetings/2020-04-21-array-protocols_discussion_and_notes.md. Mostly, what is under "Agenda" is what we discussed; what follows is mainly thoughts compiled by me and Hameer before the meeting...

Basically, we had decided that we would be willing to very actively explore both options (if helpful, even in the master branch). At the time the 1.19 branching was still pending; we have the release now, so now would be the time to attack such ideas!

@thomasjpfan
Member Author

thomasjpfan commented Jun 22, 2020

I tested it locally with jax and found an issue with read-only intermediate data when copy=True so I decided first to add a new test before pushing any further "improvement" to this PR.

For jax support we would have to do away with any in-place operations, such as:

X -= self.mean_

Also, we would not be able to use the out= keyword in numpy operations to work in place.
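For illustration, an out-of-place equivalent looks like this (a sketch, not this PR's diff):

import numpy as np

def center(X, mean_):
    # Allocates a new array instead of mutating X, which is what
    # immutable jax arrays require; no -= and no out= keyword.
    return X - mean_

X = np.random.rand(15, 5)
X = center(X, X.mean(axis=0))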

Edit: I disabled some instances to be nicer to the CI while there is a sprint going on.

Edit 2: I do not think __array_function__ is supported in jax yet: jax-ml/jax#1565. I do not know of a non-GPU array library that implements a good majority of the numpy API.

@ogrisel
Member

ogrisel commented Jun 23, 2020

The problem for the test failing with jax is that a.copy() returns a (read-only) numpy array instead of a newly allocated jax array. The fact that it's a numpy array seems to be on purpose, which is quite unfortunate, but so be it. I do not know why it's read-only, as the original jax array is mutable and X -= self.mean_ would have worked on it (and indeed the same test passes with copy=False).

@ogrisel
Member

ogrisel commented Jun 23, 2020

To answer @seberg's comment on NEP 37 vs __array_function__ with NEP 35 / like=:

@thomasjpfan have you already started such experiments on your side? If not I think it would be interesting to open a sister PR to try to experiment with that approach on the PCA class as well.

@ogrisel
Member

ogrisel commented Jun 23, 2020

The __array_function__ / like= approach is being discussed in #16196 on the PolynomialFeatures estimator.

@thomasjpfan
Member Author

have you already started such experiments on your side? If not I think it would be interesting to open a sister PR to try to experiment with that approach on the PCA class as well.

I remember experimenting with __array_function__, and it worked as long as we did not use scipy.linalg and allowed the user to pass a cupy.random.RandomState.

Feel free to open another PR to explore the __array_function__ approach with PCA.
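For the record, such an experiment would look something like this (hypothetical usage: scikit-learn does not officially accept a CuPy RandomState here):

import cupy
from sklearn.decomposition import PCA

# Hypothetical: hand the estimator a CuPy RandomState so the randomized
# solver draws its random projections on the GPU.
pca = PCA(n_components=3, svd_solver='randomized',
          random_state=cupy.random.RandomState(0))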

@seberg
Contributor

seberg commented Jun 23, 2020

It's great that you are exploring this. I will note again that previously we agreed it would be OK to experiment a bit within NumPy master (e.g. add like= arguments or similar, if necessary to see how far we can get; although I suppose a fork of the numpy repo where we include both would be just as good).

There are a few issues with __array_function__ (hopefully I am not forgetting something important):

  1. All-or-nothing override, i.e. no clean fallback to a non-dispatched version.
  2. Backwards compatibility: an array-like implementing it "opts in" its users.
    • There may be no solution to this, transitions are hard, but it is unclear right now that this actually is a huge show-stopper in the long run.
    • get_array_module is explicit, so it is a bit better, but in the end, if sklearn uses it by default, a similar transition issue occurs.
  3. We need a like= argument or some other mechanism for array creation (see the sketch after this list).
  4. It would be nice to have a story for libraries, i.e. how a library could end up allowing something like __array_function__ itself.
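For reference on point 3, NEP 35's like= argument (which later shipped for array-creation functions in NumPy 1.20) dispatches creation to the foreign array's library:

import numpy as np
import cupy

X_cp = cupy.arange(5)
# like= routes the creation through X_cp's __array_function__, so a
# CuPy array comes back instead of a NumPy one (requires NumPy >= 1.20).
Y = np.zeros(5, like=X_cp)
type(Y)  # cupy.ndarray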

@thomasjpfan
Member Author

I'll put together another PR to see how we can use __array_function__ for a simpler estimator, such as one of the scalers.

@shoyer
Contributor

shoyer commented Jun 23, 2020

The problem for the test failing with jax is that a.copy() returns a (read-only) numpy array instead of a newly allocated jax array.

I think there is a good chance this will be changed in JAX:
jax-ml/jax#3473

@mblondel
Member

SciPy has the same issue; most modules have C, C++, Fortran or Cython code that will be very hard (or impossible) to rewrite in Python/NumPy. I think we want to make that a module-by-module rather than a function-by-function decision; otherwise it becomes completely impossible to reason about.

If a function looks like this

def func(X):
    X = some_pre_processing(X)      # pure NumPy
    X = algorithm_core(X)           # Cython / Fortran / C / C++
    return some_post_processing(X)  # pure NumPy

there could potentially be some value, if the cost of the pre- and post-processing outweighs the cost of the CPU/GPU switch. Not sure if that's often the case in scikit-learn though.

@thomasjpfan
Member Author

With the work on the Array API standard, I think this PR is not needed anymore.

@jakirkham
Contributor

Thank you Thomas for your work here! 🙏

It is nice to see how array support has evolved.
