What is preventing sklearn from achieving true model persistence? #30609
-
Basically, it is more of a maintenance burden: with the team, we estimate that we could not maintain it. However, we had a recent discussion in which we think we could have a trimmed inference estimator for each estimator, reducing the impact of private changes when updating scikit-learn versions in this setting. Basically, it would make life easier for downstream packages, as it would be possible to work on persistence against such a trimmed estimator.
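As a purely illustrative sketch (this is not an existing scikit-learn API), a "trimmed inference estimator" could be a small object that keeps only the fitted state needed for `predict()`, decoupled from the training-time class and its private attributes. The class and method names below are hypothetical:

```python
# Hypothetical sketch, not an existing scikit-learn API: an inference-only
# object that keeps just the fitted state needed for predict(), so that
# private training-time attributes never need to be persisted.
from dataclasses import dataclass

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression


@dataclass
class LinearClassifierInference:
    """Trimmed inference counterpart of a fitted linear classifier."""

    coef_: np.ndarray
    intercept_: np.ndarray
    classes_: np.ndarray

    @classmethod
    def from_fitted(cls, model):
        # Copy only the public fitted state out of the full estimator.
        return cls(model.coef_.copy(), model.intercept_.copy(), model.classes_.copy())

    def predict(self, X):
        scores = np.asarray(X) @ self.coef_.T + self.intercept_
        if scores.shape[1] == 1:
            # Binary case: a single column of decision scores.
            indices = (scores.ravel() > 0).astype(int)
        else:
            # Multiclass: pick the class with the highest score.
            indices = scores.argmax(axis=1)
        return self.classes_[indices]


X, y = load_iris(return_X_y=True)
full = LogisticRegression(max_iter=1000).fit(X, y)
slim = LinearClassifierInference.from_fitted(full)
assert np.array_equal(full.predict(X), slim.predict(X))
```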
-
I concur with you, Pierre-Bartet: it should be feasible to implement model persistence as a community effort. Issue #31143 is relevant for this discussion. There is no need to decide on a persistence format; the only requirement is that parameters/state can be retrieved from a model as either numpy or Python native data structures, and conversely, that a model can consume the same as input for initialisation.
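A minimal sketch of that idea, assuming one is willing to enumerate the relevant fitted attributes per estimator (here for `LogisticRegression`): the state is exported as plain numpy/Python data, and an equivalent estimator is rebuilt from it.

```python
# Minimal sketch of the comment above: export a fitted model's parameters and
# state as plain numpy/Python data, then rebuild an equivalent estimator.
# The list of fitted attributes is specific to LogisticRegression and would
# have to be defined per estimator in a real implementation.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Everything needed to reconstruct the model, as plain data structures.
state = {
    "params": model.get_params(),
    "attributes": {
        "classes_": model.classes_,
        "coef_": model.coef_,
        "intercept_": model.intercept_,
        "n_features_in_": model.n_features_in_,
    },
}

# Rebuild a fresh estimator from the exported state.
restored = LogisticRegression(**state["params"])
for name, value in state["attributes"].items():
    setattr(restored, name, value)

assert np.array_equal(model.predict(X), restored.predict(X))
```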
-
What is preventing `sklearn` from achieving true model persistence? For example, `model.dump(..)` + `LogisticRegression.load(...)`? All the existing solutions are brittle or force users to use exactly the same `sklearn` version for training and inference: https://scikit-learn.org/1.6/model_persistence.html

I understand that this is a deliberate choice because of the sklearn team's lack of resources, but offloading serialization logic to external libraries can only end up in a much worse maintenance, communication, and interdependence nightmare. For example, `sklearn-onnx` accesses private `sklearn` components to be able to serialize them (such as `PolynomialFeatures`'s `_min_degree`, or gradient boosting's `_predictors`). Covering all of the `sklearn` components would be a tremendous task, but it could be done step by step, and it is also somewhat parallelizable by assigning a few models to anyone who would be happy to help.
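For contrast, this is roughly what the joblib route documented on the linked model_persistence page looks like; the resulting artifact is tied to the scikit-learn version used at training time, which is exactly the coupling a `model.dump(..)` / `LogisticRegression.load(...)` pair would aim to remove:

```python
# For context: the route currently documented on the linked model_persistence
# page is pickle/joblib, which ties the saved artifact to the scikit-learn
# version (and often Python/numpy versions) used at training time.
import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Training side: serialize the whole Python object graph.
joblib.dump(model, "model.joblib")

# Inference side: must run a compatible scikit-learn version, otherwise
# loading may fail or silently misbehave.
restored = joblib.load("model.joblib")
print(restored.predict(X[:5]))
```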