DOC revamp model persistence documentation #18046
Conversation
thanks @cmarmo, gave a quick first pass
doc/getting_started.rst
Outdated
@@ -212,6 +214,34 @@ the best set of parameters. Read more in the :ref:`User Guide
Using a pipeline for cross-validation and searching will largely keep
you from this common pitfall.

Model persistence
I think we should keep the getting started guide as concise as possible and with a limited amount of info.
IMO model persistence is more of a next step, sort of like model inspection, and thus doesn't really belong here.
It also doesn't really fit into the premise of the guide:
The purpose of this guide is to illustrate some of the main features that scikit-learn provides
So I'd suggest writing this in the User Guide instead.
I understand your point: I have removed model persistence from the getting started guide.
But I need a way to give more visibility to model persistence, which is not a subsection of the model selection and evaluation section, but the final step of the process before deployment in production (if I understand this comment correctly). Let me know if you agree more with 2d2581d. Thanks.
doc/modules/model_persistence.rst
Outdated
PMML is an extension of the `XML
<https://fr.wikipedia.org/wiki/Extensible_Markup_Language>`_ document standard
defined to represent data mining and models. Beeing human and machine readable,
- defined to represent data mining and models. Beeing human and machine readable,
+ defined to represent data mining and models. Being human and machine readable,
Also:
to represent data mining and models
I'm not sure what "to represent data mining" means?
Is that more understandable now? Thanks.
doc/modules/model_persistence.rst
Outdated
Interoperable formats
---------------------

For production and quality control needs, exporting the model in `Predictive
I think we could be explicit that, once exported, the model can only be used for prediction and cannot be refitted.
I have added a note at the beginning of the section: is that ok with you?
Actually it might make more sense to move this note here as you do not have this limitation when using pickle (you can refit the model if you wish as all the hyperparams are shipped in the pickled model).
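The refitting point above can be sketched in a short snippet (illustrative, not from the PR; assumes scikit-learn is installed): a pickled estimator carries both its fitted state and its hyperparameters, so nothing stops you from refitting it after loading.

```python
# Sketch: a pickled scikit-learn estimator ships its hyperparameters,
# so it can predict *and* be refitted after loading (unlike a PMML or
# ONNX export, which only supports inference).
import pickle

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000).fit(X, y)

blob = pickle.dumps(clf)        # serialize the fitted estimator
restored = pickle.loads(blob)   # load it back in the same environment

# Predictions survive the round trip...
assert (restored.predict(X) == clf.predict(X)).all()
# ...and so do the hyperparameters, so refitting on new data works:
print(restored.get_params()["max_iter"])
restored.fit(X, y)
```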
Sorry, still puzzled about the need to retrain from a serialized model... using the original Python script seems a more transparent option to me, especially if you are tracking architecture or environment dependencies...
I've modified the note as you suggested... but I'm still not convinced it should move to the interoperable section...
Thanks @cmarmo , minor comments, otherwise LGTM.
doc/modules/model_persistence.rst
Outdated
Model Markup Language (PMML)
<http://dmg.org/pmml/v4-4-1/GeneralStructure.html>`_ or `Open Neural Network
Exchange <https://onnx.ai/>`_ format
would be a better approach than using `pickle`. |
Phrasing "For production and quality control needs... PMML / ONNX would be a better approach than using pickle": I think that most people who use scikit-learn models in production typically deploy pickled models in docker containers. That's perfectly fine as long as you are aware of the limitations of the pickle format, namely that the pickled model must be treated as a piece of executable software (as is typically the rest of the contents of the docker image) rather than a piece of data or self-describing source code. For organizations that are used to deploying Python and docker in production this is probably the best approach.
To me the pros of using interoperable formats are:
- make it possible to deploy to platforms that do not have Python installed for any reason (e.g. java-oriented sysadmin culture and JVM-centric production & diagnosis tools),
- decouple the trained model from the specific runtime type and version used to train the model.
E.g. some organizations might decide to allow their data scientists to develop models using any technology (e.g. Python, R, Spark) but then only deploy PMML or ONNX models on the inference servers, for operational reasons.
I've added a line about pickle "containerisation" and reformulated a bit. Your two points are part of the definition of interoperability to me... should I go into this level of detail? Thanks.
A few more comments, otherwise LGTM, thanks @cmarmo!
doc/modules/model_persistence.rst
Outdated
@@ -86,7 +87,40 @@ same range as before.
Since a model internal representation may be different on two different
architectures, dumping a model on one architecture and loading it on
another architecture is not supported.
Actually I'm not sure this sentence is correct, see #17644 (comment). As long as the Python version and the versions of dependencies are the same it should work. When it doesn't, it could be a bug on our side, particularly given that pickle (+ docker) is still the primary method of deployment.
Also (sentence below) containers are architecture specific. They are a solution to reliable deployments and fixing the environment including the dependencies, but not to architecture portability.
> Actually I'm not sure this sentence is correct #17644 (comment) As long as the Python version and the versions of dependencies are the same it should work. When it doesn't it could be a bug on our side, particularly given that pickle (+ docker) is still the primary method of deployment.
I've tried to soften the sentence, but the point here is more whether scikit-learn developers are willing to guarantee pickle portability than pickle portability itself.
> Also (sentence below) containers are architecture specific. They are a solution to reliable deployments and fixing the environment including the dependencies, but not to architecture portability.
Again, I've tried to soften the sentence, but IMO any solution enforcing portability and interoperability is architecture dependent; the goal is to make this dependence invisible to the user. The "pickle + docker" option gives the user a way to run predictions with a model trained on a different architecture... obviously the producer of the model should provide the right "packaging" to the user.
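One way to make the environment dependence discussed above explicit is to ship version metadata next to the pickled model, so the loading side can refuse a mismatched environment. The helpers below are hypothetical, not scikit-learn API, just a sketch of the idea.

```python
# Hypothetical helpers (not scikit-learn API): record the training
# environment alongside the pickled model so the loading environment
# can check that it matches before unpickling the estimator.
import pickle
import platform

import sklearn


def dump_with_env(model, path):
    payload = {
        "model": pickle.dumps(model),
        "python": platform.python_version(),
        "sklearn": sklearn.__version__,
    }
    with open(path, "wb") as f:
        pickle.dump(payload, f)


def load_with_env(path):
    with open(path, "rb") as f:
        payload = pickle.load(f)
    if payload["sklearn"] != sklearn.__version__:
        raise RuntimeError(
            f"model pickled with scikit-learn {payload['sklearn']}, "
            f"but this environment runs {sklearn.__version__}"
        )
    return pickle.loads(payload["model"])
```

In the same environment the round trip succeeds; on a mismatched scikit-learn version `load_with_env` raises instead of silently loading a possibly incompatible model.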
Co-authored-by: Roman Yurchak <rth.yurchak@gmail.com>
@NicolasHug, towards #18257, I have removed the model persistence section from the tutorial. It was a duplicate of the same entry in the User Guide.
Otherwise lgtm
Thanks @cmarmo!
Thank you @jnothman!
Reference Issues/PRs
Following @jnothman's comment in #16875.
Closes #2801
What does this implement/fix? Explain your changes.
Add details in the model persistence documentation section.
Not sure this was the suggested direction... LMK