Skip to content

DOC revamp model persistence documentation #18046

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 31 commits into from
Sep 23, 2020

Conversation

cmarmo
Copy link
Contributor

@cmarmo cmarmo commented Jul 31, 2020

Reference Issues/PRs

Following @jnothman comment in #16875.

Closes #2801

What does this implement/fix? Explain your changes.

Add details in the model persistence documentation section.

Not sure this was the suggested direction... LMK

Copy link
Member

@NicolasHug NicolasHug left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @cmarmo, gave a quick first pass

@@ -212,6 +214,34 @@ the best set of parameters. Read more in the :ref:`User Guide
Using a pipeline for cross-validation and searching will largely keep
you from this common pitfall.

Model persistence
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should keep the getting started guide as concise as possible and with a limited amount of info.

IMO model persistence is more of a next step, sort of like model inspection, and thus doesn't really belong here.

It also doesn't really fit into the premise of the guide:

The purpose of this guide is to illustrate some of the main features that scikit-learn provides

So I'd suggest to write this instead in the UG.

Copy link
Contributor Author

@cmarmo cmarmo Jul 31, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand your point: I have removed model persistence from the getting started guide.
But I need a way to give more visibility to model persistence, which is not a subsection of the model selection and evaluation section, but the final step of the process before deployment in production (if I understand correctly this comment). Let me know if you agree more with 2d2581d. Thanks.


PMML is an extension of the `XML
<https://fr.wikipedia.org/wiki/Extensible_Markup_Language>`_ document standard
defined to represent data mining and models. Beeing human and machine readable,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
defined to represent data mining and models. Beeing human and machine readable,
defined to represent data mining and models. Being human and machine readable,

Also:

to represent data mining and models

I'm not sure what "to represent data mining" means?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is that more understandable now? Thanks.

Interoperable formats
---------------------

For production and quality control needs, exporting the model in `Predictive
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am thinking that we could be explicit that the model will only use the prediction part without the possibility to be refitted once exported.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added a note at the beginning of the section: is that ok with you?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually it might make more sense to move this note here as you do not have this limitation when using pickle (you can refit the model if you wish as all the hyperparams are shipped in the pickled model).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, still puzzled about the necessity of retraining using a serialized model ... using the original python script seems a more transparent option to me especially if you look for architecture or environment dependencies...
I've modified the note as you suggested... but still not convinced to move it in the interoperable section....

Copy link
Member

@rth rth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @cmarmo , minor comments, otherwise LGTM.

Interoperable formats
---------------------

For production and quality control needs, exporting the model in `Predictive
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually it might make more sense to move this note here as you do not have this limitation when using pickle (you can refit the model if you wish as all the hyperparams are shipped in the pickled model).

Model Markup Language (PMML)
<http://dmg.org/pmml/v4-4-1/GeneralStructure.html>`_ or `Open Neural Network
Exchange <https://onnx.ai/>`_ format
would be a better approach than using `pickle`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Phrasing "For production and quality control needs... PMML / ONNX would be a better approach than using pickle": I think that most people who use scikit-learn models in production typically deploy pickled models in docker containers. It's perfectly fine as long as you are aware of the limitations of the pickle format, namely treat the pickled model as a piece of executable software (as is typically the rest of the contents of the docker image) rather than a piece of data or self-decribing source code. For organizations that are used to deploying Python and docker in production this is probably the best approach.

To me the pros of using interoperable formats are:

  • make it possible to deploy to platforms that do not have Python installed for any reason (e.g. java-oriented sysadmin culture and JVM-centric production & diagnosis tools),
  • decouple the trained model from the specific runtime type and version used to train the model.

E.g. some organizations might decide to allow their datascientist to develop models using any technology (e.g. Python, R, spark) but then only deploy PMML or ONNX models on the inference servers of operational reasons.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a line about pickle "containerisation". and reformulate a bit. Your two point are part of the definition of interoperability to me... should I go to this level of detail? Thanks.

Copy link
Member

@rth rth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few more comment otherwise LGTM, thanks @cmarmo !

@@ -86,7 +87,40 @@ same range as before.
Since a model internal representation may be different on two different
architectures, dumping a model on one architecture and loading it on
another architecture is not supported.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I'm not sure this sentence is correct #17644 (comment) As long as the Python version and the versions of dependencies are the same it should work. When it doesn't it could be a bug on our side, particularly given that pickle (+ docker) is still the primary method of deployment.

Also (sentence below) containers are architecture specific. They are a solution to reliable deployments and fixing the environment including the dependencies, but not to architecture portability.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I'm not sure this sentence is correct #17644 (comment) As long as the Python version and the versions of dependencies are the same it should work. When it doesn't it could be a bug on our side, particularly given that pickle (+ docker) is still the primary method of deployment.

I've tried to mitigate the sentence: but the point here is more if scikit-learn developers are willing to guarantee pickle portability than about pickle portability itself.

Also (sentence below) containers are architecture specific. They are a solution to reliable deployments and fixing the environment including the dependencies, but not to architecture portability.

Again, I've tried to mitigate the sentence, but IMO any solution enforcing portability and interoperability is architecture dependent, the goal is to make this dependence invisible to the user. The "pickle + docker" option offers to the user a way to run predictions with a model trained on a different architecture... obviously the producer of the model should provide the right "packaging" to the user.

@cmarmo
Copy link
Contributor Author

cmarmo commented Aug 26, 2020

@NicolasHug towards #18257 I have removed the model persistence section from the tutorial. It was a duplicate of the same entry in the User Guide.

Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise lgtm

@jnothman jnothman merged commit 54ce422 into scikit-learn:master Sep 23, 2020
@jnothman
Copy link
Member

Thanks @cmarmo!

@cmarmo cmarmo deleted the doc_model_persistence branch September 23, 2020 13:04
@cmarmo
Copy link
Contributor Author

cmarmo commented Sep 23, 2020

Thank you @jnothman!

jayzed82 pushed a commit to jayzed82/scikit-learn that referenced this pull request Oct 22, 2020
Co-authored-by: Roman Yurchak <rth.yurchak@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

DOC More docs needed on model and data persistence
6 participants