DOC revamp model persistence documentation #18046

Merged
merged 31 commits into from
Sep 23, 2020
d8c0ae6
Add model persistence in Getting Started.
cmarmo Jul 23, 2020
fdf0b43
Merge branch 'master' into doc_model_persistence
cmarmo Jul 31, 2020
49ca688
Describe PMML and ONNX.
cmarmo Jul 31, 2020
2d2581d
Make model persistence an independent chapter of the User Guide.
cmarmo Jul 31, 2020
cabbf02
Clarifying.
cmarmo Jul 31, 2020
331ac00
Merge branch 'master' into doc_model_persistence
cmarmo Aug 1, 2020
acbf0ea
Add a note about refitting.
cmarmo Aug 1, 2020
d8b7f26
Sync with upstream.
cmarmo Aug 3, 2020
bd9ef73
Reformulate recommendation.
cmarmo Aug 3, 2020
1c2cbf0
Merge branch 'master' into doc_model_persistence
cmarmo Aug 5, 2020
3667b51
Address comments.
cmarmo Aug 5, 2020
563fce0
Address comments.
cmarmo Aug 5, 2020
2a224e0
Update doc/modules/model_persistence.rst
cmarmo Aug 6, 2020
709c502
Merge branch 'master' into doc_model_persistence
cmarmo Aug 6, 2020
8b2e393
Some clarifications.
cmarmo Aug 6, 2020
6d4ba59
Merge branch 'doc_model_persistence' of https://github.com/cmarmo/sci…
cmarmo Aug 6, 2020
2c9dec7
Merge branch 'master' into doc_model_persistence
cmarmo Aug 23, 2020
dc18f09
Merge branch 'doc_model_persistence' of https://github.com/cmarmo/sci…
cmarmo Aug 23, 2020
dad6ceb
Merge branch 'master' into doc_model_persistence
cmarmo Aug 26, 2020
2ae86d8
Remove model persistence from tutorial.
cmarmo Aug 26, 2020
3aef493
Merge branch 'master' into doc_model_persistence
cmarmo Sep 1, 2020
05abc39
Merge branch 'master' into doc_model_persistence
cmarmo Sep 4, 2020
ea07a01
Merge branch 'master' into doc_model_persistence
cmarmo Sep 7, 2020
1854c82
Merge branch 'doc_model_persistence' of https://github.com/cmarmo/sci…
cmarmo Sep 7, 2020
a3799a0
Merge branch 'master' into doc_model_persistence
cmarmo Sep 10, 2020
e13adc1
Merge branch 'master' into doc_model_persistence
cmarmo Sep 14, 2020
85b288b
Merge branch 'master' into doc_model_persistence
cmarmo Sep 18, 2020
ccf6f59
Merge branch 'master' into doc_model_persistence
cmarmo Sep 21, 2020
7b7ff28
Merge branch 'master' into doc_model_persistence
cmarmo Sep 22, 2020
653efb5
Address comments.
cmarmo Sep 22, 2020
853577f
Merge branch 'master' into doc_model_persistence
cmarmo Sep 23, 2020
1 change: 0 additions & 1 deletion doc/model_selection.rst
@@ -15,5 +15,4 @@ Model selection and evaluation
modules/cross_validation
modules/grid_search
modules/model_evaluation
modules/model_persistence
modules/learning_curve
64 changes: 47 additions & 17 deletions doc/modules/model_persistence.rst
@@ -5,22 +5,15 @@ Model persistence
=================

After training a scikit-learn model, it is desirable to have a way to persist
the model for future use without having to retrain. The following section gives
you an example of how to persist a model with pickle. We'll also review a few
security and maintainability issues when working with pickle serialization.
the model for future use without having to retrain. The following sections give
you some hints on how to persist a scikit-learn model.

An alternative to pickling is to export the model to another format using one
of the model export tools listed under :ref:`related_projects`. Unlike
pickling, once exported you cannot recover the full Scikit-learn estimator
object, but you can deploy the model for prediction, usually by using tools
supporting open model interchange formats such as `ONNX <https://onnx.ai/>`_ or
`PMML <http://dmg.org/pmml/v4-4/GeneralStructure.html>`_.

Persistence example
-------------------
Python-specific serialization
-----------------------------

It is possible to save a model in scikit-learn by using Python's built-in
persistence model, namely `pickle <https://docs.python.org/3/library/pickle.html>`_::
persistence model, namely `pickle
<https://docs.python.org/3/library/pickle.html>`_::

>>> from sklearn import svm
>>> from sklearn import datasets
@@ -55,12 +48,13 @@ with::

``dump`` and ``load`` functions also accept file-like objects
instead of filenames. More information on data persistence with Joblib is
available `here <https://joblib.readthedocs.io/en/latest/persistence.html>`_.
available `here
<https://joblib.readthedocs.io/en/latest/persistence.html>`_.

.. _persistence_limitations:

Security & maintainability limitations
--------------------------------------
......................................

pickle (and joblib by extension) has some issues regarding maintainability
and security. Because of this,
@@ -85,8 +79,44 @@ same range as before.

Since a model's internal representation may be different on two different
architectures, dumping a model on one architecture and loading it on
another architecture is not supported.
another architecture is not a supported behaviour, even if it might work
in some cases.
To overcome the issue of portability, pickled models are often deployed in
production using containers, such as Docker.
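
The portability concern above can be made visible at load time. The following is a minimal sketch, not part of scikit-learn or of this pull request: the helper names ``dump_with_env`` and ``load_with_env_check`` are hypothetical, and a plain dictionary stands in for a fitted estimator. It stores the Python version and machine architecture next to the pickled object and reports any mismatch when loading.

```python
import io
import pickle
import platform
import sys


def dump_with_env(obj, fileobj):
    """Pickle obj together with a small record of the current environment.

    Hypothetical helper for illustration only.
    """
    env = {
        "python": sys.version_info[:3],
        "machine": platform.machine(),
    }
    pickle.dump({"env": env, "obj": obj}, fileobj)


def load_with_env_check(fileobj):
    """Load a bundle written by dump_with_env and list environment mismatches.

    Only call this on data from a trusted source: it uses pickle.load.
    """
    bundle = pickle.load(fileobj)
    saved = bundle["env"]
    mismatches = []
    if tuple(saved["python"]) != sys.version_info[:3]:
        mismatches.append("python")
    if saved["machine"] != platform.machine():
        mismatches.append("machine")
    return bundle["obj"], mismatches


# A plain dict stands in for a fitted estimator (assumption for this sketch).
buffer = io.BytesIO()
dump_with_env({"coef": [0.5, -1.2]}, buffer)
buffer.seek(0)
restored, mismatches = load_with_env_check(buffer)
```

An empty ``mismatches`` list only says the recorded environment matches; it does not guarantee that the loaded object behaves identically.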

If you want to know more about these issues and explore other possible
serialization methods, please refer to this
`talk by Alex Gaynor <https://pyvideo.org/video/2566/pickles-are-for-delis-not-software>`_.
`talk by Alex Gaynor
<https://pyvideo.org/video/2566/pickles-are-for-delis-not-software>`_.
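
To illustrate the security issue with a harmless example: a pickle can instruct the loader to call an arbitrary importable callable. The sketch below uses only the standard library; the class name ``Payload`` is hypothetical, and ``sorted`` stands in for whatever function a malicious pickle might name.

```python
import pickle


class Payload:
    """Hypothetical class whose pickle asks the loader to call a function."""

    def __reduce__(self):
        # pickle records "call sorted([3, 1, 2])" instead of the object's state.
        # A malicious pickle could name any importable callable here.
        return (sorted, ([3, 1, 2],))


data = pickle.dumps(Payload())
# Loading runs the recorded call; no Payload instance is rebuilt.
result = pickle.loads(data)
```

Here ``result`` is the return value of the recorded call rather than a ``Payload`` object, which is why pickled data should only be loaded from sources you trust.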

Interoperable formats
---------------------

When different architectures and environments must be taken into account,
for example for reproducibility or quality control, exporting the model in
`Open Neural Network
Exchange (ONNX) <https://onnx.ai/>`_ format or `Predictive Model Markup Language
(PMML) <http://dmg.org/pmml/v4-4-1/GeneralStructure.html>`_ format
might be a better approach than using ``pickle`` alone.
These formats are helpful when you want to use your model for prediction in
an environment different from the one where the model was trained.

ONNX is a binary serialization of the model. It was developed to make
interoperable representations of data models easier to use.
It aims to facilitate the conversion of data
models between different machine learning frameworks, and to improve their
portability across computing architectures. More details are available
from the `ONNX tutorial <https://onnx.ai/get-started.html>`_.
To convert a scikit-learn model to ONNX, a dedicated tool, `sklearn-onnx
<http://onnx.ai/sklearn-onnx/>`_, has been developed.

PMML is an implementation of the `XML
<https://en.wikipedia.org/wiki/XML>`_ document standard,
defined to represent data models together with the data used to generate them.
Being both human- and machine-readable,
PMML is a good option for model validation on different platforms and for
long-term archiving. On the other hand, as with XML in general, its verbosity
does not help in production when performance is critical.
To convert a scikit-learn model to PMML you can use, for example, `sklearn2pmml
<https://github.com/jpmml/sklearn2pmml>`_, which is distributed under the
Affero GPLv3 license.
46 changes: 0 additions & 46 deletions doc/tutorial/basic/tutorial.rst
@@ -204,52 +204,6 @@ A complete example of this classification problem is available as an
example that you can run and study:
:ref:`sphx_glr_auto_examples_classification_plot_digits_classification.py`.


Model persistence
-----------------

It is possible to save a model in scikit-learn by using Python's built-in
persistence model, `pickle <https://docs.python.org/2/library/pickle.html>`_::

>>> from sklearn import svm
>>> from sklearn import datasets
>>> clf = svm.SVC()
>>> X, y = datasets.load_iris(return_X_y=True)
>>> clf.fit(X, y)
SVC()

>>> import pickle
>>> s = pickle.dumps(clf)
>>> clf2 = pickle.loads(s)
>>> clf2.predict(X[0:1])
array([0])
>>> y[0]
0

In the specific case of scikit-learn, it may be more interesting to use
joblib's replacement for pickle (``joblib.dump`` & ``joblib.load``),
which is more efficient on big data but it can only pickle to the disk
and not to a string::

>>> from joblib import dump, load
>>> dump(clf, 'filename.joblib') # doctest: +SKIP

Later, you can reload the pickled model (possibly in another Python process)
with::

>>> clf = load('filename.joblib') # doctest:+SKIP

.. note::

``joblib.dump`` and ``joblib.load`` functions also accept file-like object
instead of filenames. More information on data persistence with Joblib is
available `here <https://joblib.readthedocs.io/en/latest/persistence.html>`_.

Note that pickle has some security and maintainability issues. Please refer to
section :ref:`model_persistence` for more detailed information about model
persistence with scikit-learn.


Conventions
-----------

1 change: 1 addition & 0 deletions doc/user_guide.rst
@@ -28,3 +28,4 @@ User Guide
data_transforms.rst
datasets.rst
computing.rst
modules/model_persistence.rst