[DRAFT] Engine plugin API and engine entry point for Lloyd's KMeans #25535
base: main
Conversation
I forgot to squash, let me do that now. Actually, with the previous merges, it sounds complicated to do without making a mistake. Let's keep that for later. I will just update the current
I fixed the linter-related failures on the CI, but there remain problems when running the common tests for
TY for the follow-up and the new branch on scikit-learn. Will bump to this commit at sklearn_numba_dpex and report here.
I have a few questions about the intention regarding compatibility with the framework that enables a plugin to easily re-use the scikit-learn unit tests. Currently we use this, where the option

To ensure compatibility with the scikit-learn unit tests, the following requirements must be verified:
I think the first two requirements can be forced by setting

Now, about the output type: in the case

If this behavior is OK we can directly add

Also, for a given plugin we want to skip the tests for parameters and input types that the engine doesn't accept. We had designed an exception for it (this one). What do you think about it? An alternative could be that
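For illustration, here is a minimal sketch of the skip-on-rejection idea. The exception name `NotSupportedByEngineError` and the helper are hypothetical placeholders, not the actual exception designed in the plugin:

```python
# Minimal sketch, assuming a hypothetical NotSupportedByEngineError that an
# engine raises to reject parameters or input types it does not support.
import contextlib

import pytest


class NotSupportedByEngineError(Exception):
    """Hypothetical: raised by an engine to reject an unsupported input."""


@contextlib.contextmanager
def skip_if_engine_rejects():
    # Turn an engine rejection into a pytest skip instead of a failure,
    # so common tests stay green for engines with partial support.
    try:
        yield
    except NotSupportedByEngineError as exc:
        pytest.skip(f"engine does not support this configuration: {exc}")


# Hypothetical usage inside a common test:
def test_fit_sparse(engine_estimator, X_sparse):
    with skip_if_engine_rejects():
        engine_estimator.fit(X_sparse)
```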
The latest discussions suggested removing the

But some tests can't pass unless the config context also applies

See soda-inria/sklearn-numba-dpex#91 for an example of what currently goes wrong.
Real life is tricky :-/ I'm not sure what the "right thing" is here.

The test that fails checks that you can fit with sparse data and then predict on dense, and vice versa. For "native" scikit-learn that seems like a reasonable thing to check. For a plugin situation where we allow an engine to reject sparse input, which seems like a reasonable thing to allow, it isn't surprising that the test fails. In some sense it is a feature that the test fails, not a bug. From that point of view, the test is "not relevant", so it shouldn't be run for the engine.

Looking at the error message the user sees, "TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.", makes me wonder if we could provide a better one by re-running the acceptance selection. Right now the error isn't bad, but it comes from a place quite far away from the source of the problem. If we re-run the acceptance selection, the engine can raise a more specific error, as it knows that sparse input is not supported for this engine.
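To make that concrete, here is a sketch of what re-running the acceptance selection at predict time could look like; `engine_predict`, `accepts`, and the error text are illustrative, not the API of this PR:

```python
# Sketch: re-run the acceptance check before prediction so the error is
# raised close to the source of the problem, by the engine that rejected
# the input. All names here are illustrative.
def engine_predict(engine, X):
    if not engine.accepts(X):  # same check used to select the engine at fit time
        raise TypeError(
            f"Engine {engine.engine_name!r} does not support this input type "
            "for prediction. Convert the input (e.g. X.toarray() for sparse "
            "matrices) or run without this engine."
        )
    return engine.predict(X)
```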
# when they set the config (ad-hoc class or string naming
# a provider).
engine_class = provider_name
if getattr(engine_class, "engine_name", None) != engine_name:
WDYT of raising an error in this case? Or making "engine_name" optional? (I just realized my configuration wasn't working properly because the class I was passing to the config context was missing the "engine_name" attribute.)
I think all classes that want to be an engine should have an engine_name attribute. Otherwise it is unclear which engine the class implements. Imagine you wrap a pipeline that has several steps that each have a plugin.

So along those lines, maybe we should use if engine_class.engine_name != engine_name: here, which would lead to an exception if the class doesn't define an engine_name.
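The behavioral difference between the two checks, as a small illustration (`AdHocEngine` is a made-up class):

```python
class AdHocEngine:  # deliberately missing the engine_name attribute
    pass

engine_class = AdHocEngine

# Current check: a missing attribute silently counts as "no match".
print(getattr(engine_class, "engine_name", None) != "kmeans")  # True, no error

# Suggested check: fails loudly when the attribute is missing.
try:
    engine_class.engine_name != "kmeans"
except AttributeError as exc:
    print(f"misconfigured engine class: {exc}")
```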
+1 for letting it fail if the attribute is not defined.

The default Cython engine for kmeans doesn't define this attribute. Even if it's useless here, it would be helpful to define it, for the same reason that accepts is?
> The default cython engine for kmeans doesn't define this attribute.

That is true. I forgot that for the default engine and all engines that aren't "ad-hoc" engines (is this a good name for classes passed in directly, as opposed to those registered via an entry point?) the lookup has an additional level of indirection via an engine spec. The engine spec has a name attribute that contains the name of the engine. Just name seemed too generic as the attribute name for the "ad-hoc" classes, but maybe we should unify them?

I'm unsure about also defining it for the non ad-hoc engine classes. It adds an additional location where this information is stored (attribute of the class and name of the entry point), and they could become out of sync. However, I also like having symmetry/no differences between the entry-point and ad-hoc engines :-/
Maybe the engine name could only be defined on the class itself and we could update the engine specs in the importlib metadata to only register the class?
But that means that we will need to import the class to discover its name, which is a bit sad, because it could trigger import overhead from plugins unrelated to the specific algorithm of interest.
But maybe this is not a big deal.
We could add an engine_name attribute to the engine class when it is loaded via EngineSpec.get_engine_class?
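A sketch of that suggestion; `EngineSpec` is simplified here and may not match the class in this branch:

```python
# Simplified sketch: the entry-point name stays the single source of truth
# and is stamped onto the class when it is loaded.
class EngineSpec:
    def __init__(self, name, entry_point):
        self.name = name  # name registered in the entry point
        self.entry_point = entry_point  # importlib.metadata.EntryPoint

    def get_engine_class(self):
        engine_class = self.entry_point.load()  # plugin import happens here
        # Unify entry-point engines with ad-hoc classes: both expose the
        # same engine_name attribute after loading.
        engine_class.engine_name = self.name
        return engine_class
```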
Then I'd think that we either need:

I think the third point is what we had discussed initially. I'm a bit concerned that it could be more difficult to maintain: maintainers would need to be notified whenever a test that was registered as expected to fail is edited upstream, so we would need pytest to provide a hash of each test, which sounds complicated. I'd be more inclined to explore one of the first two solutions first.
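For what it's worth, a rough sketch of how a plugin could approximate the "hash of each test" idea with standard library tools (the registry layout is made up):

```python
import hashlib
import inspect


def test_source_hash(test_func):
    # Hash the test function's source; if the recorded hash no longer
    # matches, the upstream test was edited and the plugin's
    # expected-failure entry should be reviewed.
    source = inspect.getsource(test_func)
    return hashlib.sha256(source.encode()).hexdigest()


# Hypothetical registry a plugin could maintain:
# EXPECTED_FAILURES = {"test_kmeans_results": "<hash recorded at pin time>"}
```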
At the last call, I seem to recall that we discussed the possibility of using the engine-native types in the tests and checking if the scikit-learn tests pass.

It might be worth experimenting to see how many tests would pass, and how many would need to be adapted to be less strict about type checks and instead rely more on duck typing.
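As an example of the kind of relaxation this would require, here is a duck-typed assertion helper that a common test could use instead of a strict isinstance check (a sketch, not code from this branch):

```python
import numpy as np


def assert_allclose_ducktyped(actual, expected, rtol=1e-7):
    # np.asarray accepts any object exposing __array__, so engine-native
    # containers that provide a NumPy conversion path pass through without
    # an explicit isinstance(result, np.ndarray) check.
    np.testing.assert_allclose(np.asarray(actual), np.asarray(expected), rtol=rtol)
```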
I looked at it some more:
Co-authored-by: Tim Head <betatim@gmail.com> Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org> Co-authored-by: Franck Charras <29153872+fcharras@users.noreply.github.com>
@betatim @ogrisel, the feature/engine-api branch is in a bad state: conflicts are all over the place in random files, and I'm not sure I understand why (did I use a starting commit that was desynced locally the last time I did the pull --no-rebase PR?).

We could squash all commits and force-push to get back to a cleaner state. The resulting branch is here: https://github.com/fcharras/scikit-learn/tree/feature/engine-api, the single squashed commit is 2ccfc8c, and the diff with the current latest commit on scikit-learn main can be seen here: main...fcharras:scikit-learn:feature/engine-api . WDYT of force-pushing that onto scikit-learn:feature/engine-api? (I haven't altered the initial code from here; it should conform to the latest state we had settled on, except that it's rebased.)

To ensure that nothing was lost in the process, and assuming I might have made a mistake with this unfortunate 40K-line PR last time, I re-did the merge of current main onto the latest commit before the last merge, which was a897a34, after double-checking that everything was well synced in my local dev environment. Then I squashed everything (using git diff / git apply patch) into a single commit. I've tested the commit 2ccfc8c using this new plugin that is about to be merged.

Also, I'm about to start implementing an API for the KNN (+ a corresponding pytorch implementation in this same plugin).
Let me force push the new commit.
Force-pushed from 1915869 to 2ccfc8c.
This is a draft pull-request to allow third-party packages such as https://github.com/soda-inria/sklearn-numba-dpex to contribute alternative implementations of core computational routines of CPU-intensive scikit-learn estimators.
This would be particularly useful to experiment with GPU-optimized alternatives to our CPU-optimized Cython code.
This Draft PR serves as a long-running design experiment (feature branch) to tackle #22438.
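For illustration, user code might opt in to a third-party engine roughly like this; the `engine_provider` config key and the provider string are assumptions based on the discussion in this PR, not a finalized public API:

```python
import numpy as np

import sklearn
from sklearn.cluster import KMeans

X = np.random.default_rng(0).standard_normal((1000, 10))

# Hypothetical opt-in: route KMeans computations to a plugin engine for
# the duration of the context block.
with sklearn.config_context(engine_provider="sklearn_numba_dpex"):
    km = KMeans(n_clusters=10).fit(X)
```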
This is a takeover of the old PR (#24497) from the ogrisel/scikit-learn:wip-engines branch, so as to make it easier to get the CI to work when doing sub-PRs targeting scikit-learn/scikit-learn:feature/engine-api.