
[DRAFT] Engine plugin API for whole class dispatching #30250


Draft · wants to merge 1 commit into main

Conversation

@betatim (Member) commented Nov 8, 2024

Putting this PR up so we can look at it, not because it is anywhere near being done or because it is what we should be doing.

The idea here is to explore what "whole class dispatching" could look like. This is a bit different from what we explored in #25535, which took the approach of dispatching fine-grained, well-defined functions to a backend.

I've made a simple "backend" to get a feel for things: https://github.com/betatim/scikit-learn-phony-backend


The big idea, compared to the previous PR, is that a backend gets to handle "everything" an estimator does. This might make the dispatching mechanism simpler and easier to use for backend implementors. It also means there is more work needed to make a compliant backend because it has to do everything (algorithm implementation, input validation, etc).

When fit (or a fit_*) is called we use find_backend(estimator_instance, fit_args) to loop through the installed backends to find out if one of them wants to handle this case. A backend has to provide a function that we can call with these arguments that returns a True/False decision. If the backend wants to handle the case we instantiate its implementation and call its .fit() method. The backend is responsible for setting all the fitted attributes on the original estimator instance. It also has to perform all the input and parameter validation that currently happens in fit itself.

If no suitable backend is found, we execute the normal fit code.
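As a rough sketch of the mechanism described above (the helper names installed_backends, can_handle and get_implementation are illustrative, not necessarily what this PR uses):

def installed_backends():
    # Placeholder: in practice backends would be discovered, e.g. via entry points.
    return []


def find_backend(estimator, *args, **kwargs):
    # Ask each installed backend whether it wants to handle this call.
    # A backend returns a True/False decision based on the estimator's
    # parameters and the input data; if it accepts, we instantiate its
    # implementation for this estimator instance.
    for backend in installed_backends():
        if backend.can_handle(estimator, *args, **kwargs):
            return backend.get_implementation(estimator)
    return None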

In all methods that are not fit* we inspect the _backend attribute to see whether a backend was used during fitting. If one was used, it will be called again to do the work for predict, score, etc. If for whatever reason the backend can't do it (say the input is in a format it doesn't support) it has to raise an exception telling the user. In the past we settled on backends having to provide a method for converting an estimator from/to NumPy arrays that users can call. I think this is an idea to keep (and add here). Explicit failure with conversion instructions seems better than trying to make it work via some magic (maybe in a later version?).
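Sketched out, a non-fit method would then look roughly like this (illustrative only):

    def predict(self, X):
        if getattr(self, "_backend", None) is not None:
            # The backend that handled fit also handles predict.  If it
            # cannot deal with this input it should raise an informative
            # error, e.g. telling the user to convert the estimator back
            # to NumPy arrays first.
            return self._backend.predict(X)
        # No backend was used during fit: run the normal implementation.
        ...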

I've not hooked into the config system of scikit-learn, mostly because I think we should just do what #25535 did. Actually, for most things that are missing here I think "let's do what #25535 did".

WDYT @ogrisel @thomasjpfan ? Interested to hear thoughts


@betatim (Member, Author) commented Nov 8, 2024

Something @dantegd pointed out is: what should a backend store in the trees_ attribute of something like a random forest? Until now we have only talked about Python scalars and arrays in attributes. This is where the idea of "backend native type" came from. For trees it is maybe a bit more tricky? However, given that we want a backend to provide a function to convert back to "normal sklearn types", I think it doesn't matter so much?

Comment on lines +1065 to +1067
self._backend = find_backend(self, X, y=y, sample_weight=sample_weight)
if self._backend is not None:
self._backend.fit_predict(X, y=y, sample_weight=sample_weight)
betatim (Member, Author):

We can probably find a nicer solution for this, instead of having it explicitly in the method. Maybe a decorator? Maybe via a meta-class?

For now I thought it would be simpler to just have it explicitly in the class instead of reaching into the Python box of tricks.
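For example, the decorator variant could look something like this (just a sketch; find_backend refers to the helper added in this PR, the rest of the names are illustrative):

from functools import wraps


def dispatch_fit_to_backend(fit_method):
    # Hypothetical decorator so the backend lookup does not have to be
    # spelled out inside every estimator's fit.
    @wraps(fit_method)
    def wrapper(self, X, y=None, **kwargs):
        self._backend = find_backend(self, X, y=y, **kwargs)  # helper from this PR
        if self._backend is not None:
            self._backend.fit(X, y=y, **kwargs)
            return self
        return fit_method(self, X, y=y, **kwargs)

    return wrapper

fit_predict and friends would need a small variation that returns the backend's result instead of self.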

return backends


def dispatching_disabled():
betatim (Member, Author):
Using this instead of the normal config system mostly because it was easier. I think we should use the normal config system of scikit-learn.

@adrinjalali (Member) left a comment:

I'm -1 on this idea.

I don't think it's a good idea for people to get a class which is not sklearn.cluster.KMeans when they do

from sklearn.cluster import KMeans
est = KMeans().fit(X, y)

est should always be the same class as the one the user instantiated.

I'd be more okay with allowing a whole solver to be exposed as a plugin. Something like:

from sklearn_nvidia_plugins import plugins
from sklearn.cluster import KMeans

plugins.enable()
est = KMeans(init="nvidia-kmeans++", algorithm="nvidia-fancy-algo").fit(X, y)

That's explicit, and easy to implement.

@betatim (Member, Author) commented Nov 8, 2024

> est should always be the same class as the one the user instantiated.

This is what they get in this PR, like in the previous one (#25535).

I am not sure I understand your second proposal :-/ At least not in the context of a plugin system; it looks more like how scikit-learn-intelex or cudf.pandas work.

For now we've declared adding functionality/alternative algorithms as something that is "maybe in the future" when it comes to the plugin system.

@thomasjpfan (Member) commented:

> what should a backend store in the trees_ attribute of something like a random forest? Until now we have only talked about Python scalars and arrays in attributes.

I had this exact discussion with NVIDIA folk at PyData NYC when going down the "dispatch on fit" idea. My take is to not enforce any requirements on the attributes, because different algorithm implementations may have their own attributes that do not match our own.

The only requirement I would enforce is that the estimator has the predict_* methods that were already defined on the original scikit-learn class.

@adrinjalali (Member) commented Nov 8, 2024

> This is what they get in this PR, like in the previous one (#25535).

Yes. My bad. I had misread this part:

        if self._backend is not None:
            self._backend.fit(X, y=y, sample_weight=sample_weight)
            return self

This seems like wrapping another estimator inside the estimator, and forcing the other estimator to have the same hyperparameters as the one we have; both of those things seem kind of off to me. Imagine users running code that they think does something, on a deployment environment where they don't know what's installed, and their code runs something completely different. Also, at the point where people implement algorithms as a whole, they could very well have different hyperparameters.

I think if a third party wants to completely rewrite a whole estimator, it's not our job to pretend it is a scikit-learn estimator. Also, they're usually terrible at actually keeping up with our development: cuML is far off from our BaseEstimator, scikit-learn-intelex is almost always pretending to be the same as sklearn but isn't, and the others are presumably the same.

I also really don't like that in this version of the story (this PR) the user code shows no indication of which backend is in use. The user code should show exactly what's being run.

I think the reason we're having this discussion is that hardware providers like to be able to inject their own versions of compute-heavy algorithms where they think they are better or faster than what we have, and I'm okay with that. I don't think what we have here is the solution though.

We should go estimator by estimator, find the compute heavy portions, and accept an implementation from the third party. And this can also be funded by those same hardware providers.

I'll try to explain what I meant by my suggestion. In my code above:

from sklearn_nvidia_plugins import plugins
from sklearn.cluster import KMeans

plugins.enable()
est = KMeans(init="nvidia-kmeans++", algorithm="nvidia-fancy-algo").fit(X, y)

plugins.enable() registers implementations for the init and algorithm args of KMeans, and the user can explicitly specify those algorithms via strings when constructing KMeans, i.e. "nvidia-kmeans++" is only available after the call to sklearn_nvidia_plugins.plugins.enable().
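A minimal sketch of the kind of registration this implies (all names here are hypothetical):

# Hypothetical registry on the scikit-learn side; KMeans would look up
# string values of init/algorithm here before raising an "unknown init" error.
_KMEANS_INIT_REGISTRY = {}


def register_kmeans_init(name, init_func):
    _KMEANS_INIT_REGISTRY[name] = init_func


# What a plugin's plugins.enable() might do: register its implementations
# under the string names that users then pass to the KMeans constructor.
def enable():
    def nvidia_kmeanspp(X, n_clusters, random_state):
        ...  # GPU implementation returning the initial cluster centers

    register_kmeans_init("nvidia-kmeans++", nvidia_kmeanspp)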

Alternatively, users could also do things like

from sklearn_nvidia_plugins.kmeans import Init
from sklearn.cluster import KMeans

est = KMeans(init=Init(**nvidia's init args)).fit(X, y)

where we extend our API to accept objects, like solvers, with a pre-defined input/output API.
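The pre-defined input/output API could be as small as a callable contract, for example (a sketch only; the exact signature would need to be specified):

class Init:
    # Sketch of what a plugin-provided init object might have to satisfy:
    # a callable that takes the data and returns the initial cluster
    # centers with shape (n_clusters, n_features).
    def __init__(self, **kwargs):
        self.kwargs = kwargs

    def __call__(self, X, n_clusters, random_state):
        raise NotImplementedError("implemented by the plugin")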

This has the benefit that the third party developers have a much smaller API surface to maintain. They won't have to worry about feature name consistency between calls, estimator tags, or other things that are not related to the computationally intensive parts.

@betatim (Member, Author) commented Nov 11, 2024

> I had this exact discussion with NVIDIA folk at PyData NYC when going down the "dispatch on fit" idea. My take is to not enforce any requirements on the attributes, because different algorithm implementations may have their own attributes that do not match our own.
>
> The only requirement I would enforce is that the estimator has the predict_* methods that were already defined on the original scikit-learn class.

I think from a user's point of view you want the estimator fitted by a plugin or by "native" scikit-learn to be as similar as possible. This means I'd not go for "set whatever attrs you want or don't want" but "you aspire to set the same attributes and they should as much as possible contain the same thing" (even if the type is different). My thinking is that there are tools out there that use the attributes to do useful things, people's code inspects the attributes, etc. If you can't rely on them being similar (say cupy arrays instead of numpy arrays) then your downstream code needs to know about "plugin" vs "not plugin". Which I think is not great.

The reason I think it is not great is that I think a reason (or the reason?) to have a plugin system (and array API support) is that you can run the same code on your laptop without a GPU and on a machine with a GPU. So if downstream code interacts with the attributes they need to be similar, otherwise the idea of running the same code on a machine with and without fancy hardware doesn't work.

In which case you could already switch to using import cuml with some if statements to branch where needed. I think in practice that isn't a great workflow. For a while I've been asking myself why don't people "just" rewrite their code to use cuml or cudf when they want to deploy it? I think now I've figured out an answer: because they want to be able to switch back and forth between "with" and "without" fancy hardware.

This is also why I think offering new algorithms/hyper-parameters is not the first thing I'd allow in a plugin system. Maybe at a later stage yes, but first let's enable the "I want to develop on my laptop and do large scale runs on machines with a GPU" use-case. People who want ultimate performance or always run on fancy hardware can already use cuml.

> I also really don't like that in this version of the story (this PR) the user code shows no indication of which backend is in use. The user code should show exactly what's being run.

I am undecided on this requirement. In practice (and one of the things I left out for now under the "let's just use what we did in #25535" caveat) at first users will have to enable the plugin system via a sklearn.set_config() - as well as installing a backend. In a related PR exploring backends/dispatching/plugins (the lingo differs a bit from project to project) in scikit-image there was the same worry about the fact that the same code should not work differently (no matter which scikit-image version you install). So they have even stricter backwards compatibility constraints than scikit-learn. The idea we landed on there is to emit a warning when something was being dispatched to a backend. Eliminating the "same code silently working differently" worry. In this PR I included the same idea of a warning being emitted. I'm not sure I love it, or what combination of warning and opt-in mechanism is my favourite.
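For reference, the warning could be as simple as something like this (wording and helper name illustrative, not necessarily what this PR emits):

import warnings


def _warn_dispatched(estimator, backend):
    # Make the dispatch visible so the same user code never silently runs
    # a different implementation depending on which backends are installed.
    warnings.warn(
        f"{estimator.__class__.__name__}.fit was dispatched to backend "
        f"{backend!r}.",
        UserWarning,
    )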

Fine-grained vs wholesale dispatching

We explored the fine-grained approach in #25535; this PR is a first draft looking at wholesale dispatching. I agree wholesale dispatching puts a lot more burden on the plugin developers. In particular, looking at the input validation mechanism, I think it is quite a bit of work to reach the level of scikit-learn. However, I am not sure how much easier this would be in the fine-grained approach: a lot of the input validation has to understand the type of the input :-/ At the moment I am thinking the amount of work to make a compliant plugin is approximately the same. The difference is whether you ask people to "fill in these five small blanks" or "fill in this big blank".

One thing I've learnt from #25535 is that the dispatching system itself got quite complex as the PR went on. This is why I want to explore this approach. It feels like it is less complex, for now. The question is whether it will grow just as complex or remain simpler as we flesh it out. It would also be interesting to start a new attempt at #25535 with what we learnt from it. But realistically I only have so much brainpower, so I won't be trying to do that in parallel :D

> We should go estimator by estimator, find the compute heavy portions, and accept an implementation from the third party. And this can also be funded by those same hardware providers.

With "accept implementation from third party" you mean add it to the scikit-learn code base proper?

> plugins.enable() registers implementations for the init and algorithm args of KMeans, and the user can explicitly specify those algorithms via strings when constructing KMeans, i.e. "nvidia-kmeans++" is only available after the call to sklearn_nvidia_plugins.plugins.enable().

Ah ok. I'm less excited about this because I think the plugin system is a way to solve the problem of "I sometimes use fancy hardware and sometimes I don't, I want my code to work in both cases". Similar to PyTorch where putting your data on a particular device influences where the computations happen. There is one place where you need to make a change/if statement and that is it.

@adrinjalali (Member) commented:

> The reason I think it is not great is that I think a reason (or the reason?) to have a plugin system (and array API support) is that you can run the same code on your laptop without a GPU and on a machine with a GPU. So if downstream code interacts with the attributes they need to be similar, otherwise the idea of running the same code on a machine with and without fancy hardware doesn't work.

It's not really the same code though. It's a very different backend used for different data namespaces, and even the supported dtypes are different sometimes.

I see the plugin API as extending the list of "solvers" we have, and even our own solvers don't necessarily expose the same attributes. For instance, sometimes it's not easy to get n_iter_ out of a solver, and some other solvers easily expose that. So to me, it's okay if certain things become solver/plugin specific.

> This is also why I think offering new algorithms/hyper-parameters is not the first thing I'd allow in a plugin system. Maybe at a later stage yes, but first let's enable the "I want to develop on my laptop and do large scale runs on machines with a GPU" use-case. People who want ultimate performance or always run on fancy hardware can already use cuml.

Pretty sure we already have solver-specific hyperparameters anyway. So it's not like we have a "definite" set of hyperparameters per estimator. Which means it's okay for third parties to also have their own hyperparameters.

> In which case you could already switch to using import cuml with some if statements to branch where needed. I think in practice that isn't a great workflow. For a while I've been asking myself why don't people "just" rewrite their code to use cuml or cudf when they want to deploy it? I think now I've figured out an answer: because they want to be able to switch back and forth between "with" and "without" fancy hardware.

> This is also why I think offering new algorithms/hyper-parameters is not the first thing I'd allow in a plugin system. Maybe at a later stage yes, but first let's enable the "I want to develop on my laptop and do large scale runs on machines with a GPU" use-case. People who want ultimate performance or always run on fancy hardware can already use cuml.

With the caveat that cuml exposes a loose scikit-learn-like API with a very limited subset of what's available in scikit-learn. For instance, cuml.model_selection only exposes __all__ = ["train_test_split", "GridSearchCV", "StratifiedKFold"], and they don't support / have the metadata routing API / infrastructure. So the issue is not just hyperparameters; cuml merely looks like scikit-learn to give a somewhat familiar feeling to users.

> We explored the fine-grained approach in #25535; this PR is a first draft looking at wholesale dispatching. I agree wholesale dispatching puts a lot more burden on the plugin developers. In particular, looking at the input validation mechanism, I think it is quite a bit of work to reach the level of scikit-learn. However, I am not sure how much easier this would be in the fine-grained approach: a lot of the input validation has to understand the type of the input :-/ At the moment I am thinking the amount of work to make a compliant plugin is approximately the same. The difference is whether you ask people to "fill in these five small blanks" or "fill in this big blank".

> One thing I've learnt from #25535 is that the dispatching system itself got quite complex as the PR went on. This is why I want to explore this approach. It feels like it is less complex, for now. The question is whether it will grow just as complex or remain simpler as we flesh it out. It would also be interesting to start a new attempt at #25535 with what we learnt from it. But realistically I only have so much brainpower, so I won't be trying to do that in parallel :D

#25535 is indeed quite a complicated approach. When I say "something in between", I mean not having all the machinery introduced in that PR. We can simply accept a callable as the solver for init and algorithm in KMeans, and expect it to return what we want it to return. The footprint of that approach in sklearn is very minimal, and people can implement their algorithms and expose a callable for users to use very simply. We can also allow users to use a default dispatched callable by passing a string, but that's not even necessary to start enabling third parties to develop their own solvers.
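Note that KMeans already accepts a callable for init (taking X, the number of clusters and a random state, and returning the initial centers), so this would essentially generalise that existing mechanism to the compute-heavy parts. For illustration:

import numpy as np
from sklearn.cluster import KMeans


def my_init(X, n_clusters, random_state):
    # A user- or plugin-provided init: pick n_clusters points as the
    # initial centers.  A GPU plugin could do the heavy lifting here.
    rng = np.random.default_rng(0)
    idx = rng.choice(X.shape[0], size=n_clusters, replace=False)
    return X[idx]


X = np.random.default_rng(0).normal(size=(100, 2))
est = KMeans(n_clusters=3, init=my_init, n_init=1).fit(X)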

With "accept implementation from third party" you mean add it to the scikit-learn code base proper?

I mean accepting a callable which we call and get the results we expect. Also, note that this makes it quite trivial not to change the estimator attributes, which you desire.

> Ah ok. I'm less excited about this because I think the plugin system is a way to solve the problem of "I sometimes use fancy hardware and sometimes I don't, I want my code to work in both cases". Similar to PyTorch where putting your data on a particular device influences where the computations happen. There is one place where you need to make a change/if statement and that is it.

You'd still need to change your code to put your data on the right device.
