From d8c0ae652b28e16f3fa3d9868a953e33d182e204 Mon Sep 17 00:00:00 2001 From: Chiara Marmo Date: Thu, 23 Jul 2020 11:02:25 +0200 Subject: [PATCH 01/12] Add model persistence in Getting Started. --- doc/getting_started.rst | 30 ++++++++++++++++++++++++++++++ 1 file changed, 30 insertions(+) diff --git a/doc/getting_started.rst b/doc/getting_started.rst index 09453cf22845b..4950233656673 100644 --- a/doc/getting_started.rst +++ b/doc/getting_started.rst @@ -78,6 +78,8 @@ Sometimes, you want to apply different transformations to different features: the :ref:`ColumnTransformer` is designed for these use-cases. +.. _pipelines: + Pipelines: chaining pre-processors and estimators -------------------------------------------------- @@ -212,6 +214,34 @@ the best set of parameters. Read more in the :ref:`User Guide Using a pipeline for cross-validation and searching will largely keep you from this common pitfall. +Model persistence +----------------- + +Once the model has been trained, it could be appropriate to store it in a +persistent way in order to reuse it for future predictions. +The first natural pick for python libraries is to binary serialize the Python +object structure using +`pickle `_. +For example the pipeline seen in the :ref:`pipeline section ` can be +recalled in the following way: + + >>> import pickle + >>> X, y = load_iris(return_X_y=True) + >>> s = pickle.dumps(pipe) + >>> pipe2 = pickle.loads(s) + >>> pipe2.predict(X[0:1]) + array([0]) + +However pickle should not completely ensure interoperability between +different architectures. +For production and quality control needs exporting the model in `Predictive +Model Markup Language (PMML) +`_ or `Open Neural Network +Exchange `_ +would be a better approach. + +The :ref:`model persistence ` section of the User Guide +gives more details about pros and cons of each of these representations. Next steps ---------- From 49ca6886a29b685063bc4cdbeef7c30b359e977d Mon Sep 17 00:00:00 2001 From: Chiara Marmo Date: Fri, 31 Jul 2020 18:10:24 +0200 Subject: [PATCH 02/12] Describe PMML and ONNX. --- doc/getting_started.rst | 4 +-- doc/modules/model_persistence.rst | 49 +++++++++++++++++++++---------- 2 files changed, 35 insertions(+), 18 deletions(-) diff --git a/doc/getting_started.rst b/doc/getting_started.rst index 4950233656673..441d9775ef7fb 100644 --- a/doc/getting_started.rst +++ b/doc/getting_started.rst @@ -234,10 +234,10 @@ recalled in the following way: However pickle should not completely ensure interoperability between different architectures. -For production and quality control needs exporting the model in `Predictive +For production and quality control needs, exporting the model in `Predictive Model Markup Language (PMML) `_ or `Open Neural Network -Exchange `_ +Exchange `_ format would be a better approach. The :ref:`model persistence ` section of the User Guide diff --git a/doc/modules/model_persistence.rst b/doc/modules/model_persistence.rst index 3c59b7a313a17..e7348e6c95b9e 100644 --- a/doc/modules/model_persistence.rst +++ b/doc/modules/model_persistence.rst @@ -5,22 +5,15 @@ Model persistence ================= After training a scikit-learn model, it is desirable to have a way to persist -the model for future use without having to retrain. The following section gives -you an example of how to persist a model with pickle. We'll also review a few -security and maintainability issues when working with pickle serialization. +the model for future use without having to retrain. The following sections give +you some hints on how to persist a scikit-learn model. -An alternative to pickling is to export the model to another format using one -of the model export tools listed under :ref:`related_projects`. Unlike -pickling, once exported you cannot recover the full Scikit-learn estimator -object, but you can deploy the model for prediction, usually by using tools -supporting open model interchange formats such as `ONNX `_ or -`PMML `_. - -Persistence example -------------------- +Binary serialization +-------------------- It is possible to save a model in scikit-learn by using Python's built-in -persistence model, namely `pickle `_:: +persistence model, namely `pickle +`_:: >>> from sklearn import svm >>> from sklearn import datasets @@ -55,12 +48,13 @@ with:: ``dump`` and ``load`` functions also accept file-like object instead of filenames. More information on data persistence with Joblib is - available `here `_. + available `here + `_. .. _persistence_limitations: Security & maintainability limitations --------------------------------------- +...................................... pickle (and joblib by extension), has some issues regarding maintainability and security. Because of this, @@ -89,4 +83,27 @@ another architecture is not supported. If you want to know more about these issues and explore other possible serialization methods, please refer to this -`talk by Alex Gaynor `_. +`talk by Alex Gaynor +`_. + +Interoperable formats +--------------------- + +For production and quality control needs, exporting the model in `Predictive +Model Markup Language (PMML) +`_ or `Open Neural Network +Exchange `_ format +would be a better approach. + +PMML is an extension of the `XML +`_ document standard +defined to represent data mining and models. Beeing human and machine readable, +PMML is a good option for model validation on different platforms and +long term archiving. On the other hand, as XML in general, its verbosity does +not help in production when performances are the issue. + +ONNX has been developed to improve the usability of the interoperable +representation of data models. It aims to facilitate the conversion of the data +models between different machine learning frameworks, and to improve their +portability on different computing architectures. More details are available +from the `ONNX tutorial `_. From 2d2581d4744135730b93b4b5a250bbb92e9f7386 Mon Sep 17 00:00:00 2001 From: Chiara Marmo Date: Fri, 31 Jul 2020 21:31:45 +0200 Subject: [PATCH 03/12] Make model persistence an independent chapter of the User Guide. --- doc/getting_started.rst | 30 ------------------------------ doc/model_selection.rst | 1 - doc/modules/model_persistence.rst | 6 +++--- doc/user_guide.rst | 1 + 4 files changed, 4 insertions(+), 34 deletions(-) diff --git a/doc/getting_started.rst b/doc/getting_started.rst index 441d9775ef7fb..09453cf22845b 100644 --- a/doc/getting_started.rst +++ b/doc/getting_started.rst @@ -78,8 +78,6 @@ Sometimes, you want to apply different transformations to different features: the :ref:`ColumnTransformer` is designed for these use-cases. -.. _pipelines: - Pipelines: chaining pre-processors and estimators -------------------------------------------------- @@ -214,34 +212,6 @@ the best set of parameters. Read more in the :ref:`User Guide Using a pipeline for cross-validation and searching will largely keep you from this common pitfall. -Model persistence ------------------ - -Once the model has been trained, it could be appropriate to store it in a -persistent way in order to reuse it for future predictions. -The first natural pick for python libraries is to binary serialize the Python -object structure using -`pickle `_. -For example the pipeline seen in the :ref:`pipeline section ` can be -recalled in the following way: - - >>> import pickle - >>> X, y = load_iris(return_X_y=True) - >>> s = pickle.dumps(pipe) - >>> pipe2 = pickle.loads(s) - >>> pipe2.predict(X[0:1]) - array([0]) - -However pickle should not completely ensure interoperability between -different architectures. -For production and quality control needs, exporting the model in `Predictive -Model Markup Language (PMML) -`_ or `Open Neural Network -Exchange `_ format -would be a better approach. - -The :ref:`model persistence ` section of the User Guide -gives more details about pros and cons of each of these representations. Next steps ---------- diff --git a/doc/model_selection.rst b/doc/model_selection.rst index 7b540072c15e5..4582f4f33db05 100644 --- a/doc/model_selection.rst +++ b/doc/model_selection.rst @@ -11,5 +11,4 @@ Model selection and evaluation modules/cross_validation modules/grid_search modules/model_evaluation - modules/model_persistence modules/learning_curve diff --git a/doc/modules/model_persistence.rst b/doc/modules/model_persistence.rst index e7348e6c95b9e..ed2dcd1105e93 100644 --- a/doc/modules/model_persistence.rst +++ b/doc/modules/model_persistence.rst @@ -93,14 +93,14 @@ For production and quality control needs, exporting the model in `Predictive Model Markup Language (PMML) `_ or `Open Neural Network Exchange `_ format -would be a better approach. +would be a better approach than using `pickle`. PMML is an extension of the `XML `_ document standard -defined to represent data mining and models. Beeing human and machine readable, +defined to represent data mining and models. Being human and machine readable, PMML is a good option for model validation on different platforms and long term archiving. On the other hand, as XML in general, its verbosity does -not help in production when performances are the issue. +not help in production when performance is critical. ONNX has been developed to improve the usability of the interoperable representation of data models. It aims to facilitate the conversion of the data diff --git a/doc/user_guide.rst b/doc/user_guide.rst index 48679aa961782..f37a87494bea8 100644 --- a/doc/user_guide.rst +++ b/doc/user_guide.rst @@ -28,3 +28,4 @@ User Guide data_transforms.rst Dataset loading utilities modules/computing.rst + modules/model_persistence.rst From cabbf02c9cffda7d3df084b508b74aaa92286f87 Mon Sep 17 00:00:00 2001 From: Chiara Marmo Date: Fri, 31 Jul 2020 21:46:05 +0200 Subject: [PATCH 04/12] Clarifying. --- doc/modules/model_persistence.rst | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/doc/modules/model_persistence.rst b/doc/modules/model_persistence.rst index ed2dcd1105e93..5abe3dffe61b1 100644 --- a/doc/modules/model_persistence.rst +++ b/doc/modules/model_persistence.rst @@ -97,7 +97,8 @@ would be a better approach than using `pickle`. PMML is an extension of the `XML `_ document standard -defined to represent data mining and models. Being human and machine readable, +defined to represent data models together with the data used to generate them. +Being human and machine readable, PMML is a good option for model validation on different platforms and long term archiving. On the other hand, as XML in general, its verbosity does not help in production when performance is critical. From acbf0eadca128580ce9222ecb8889c80208a6bfe Mon Sep 17 00:00:00 2001 From: Chiara Marmo Date: Sat, 1 Aug 2020 23:15:33 +0200 Subject: [PATCH 05/12] Add a note about refitting. --- doc/modules/model_persistence.rst | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/doc/modules/model_persistence.rst b/doc/modules/model_persistence.rst index 5abe3dffe61b1..3c865c77d5fae 100644 --- a/doc/modules/model_persistence.rst +++ b/doc/modules/model_persistence.rst @@ -8,6 +8,11 @@ After training a scikit-learn model, it is desirable to have a way to persist the model for future use without having to retrain. The following sections give you some hints on how to persist a scikit-learn model. +.. note:: + + Remember that, once exported in a persistent format, the model could only be + used for predictions and it cannot be refitted. + Binary serialization -------------------- From bd9ef73fa93995ffbd5078ad2ced97a97017d6fe Mon Sep 17 00:00:00 2001 From: Chiara Marmo Date: Mon, 3 Aug 2020 15:18:41 +0200 Subject: [PATCH 06/12] Reformulate recommendation. --- doc/modules/model_persistence.rst | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/doc/modules/model_persistence.rst b/doc/modules/model_persistence.rst index 3c865c77d5fae..b9ad0bc56da11 100644 --- a/doc/modules/model_persistence.rst +++ b/doc/modules/model_persistence.rst @@ -10,8 +10,8 @@ you some hints on how to persist a scikit-learn model. .. note:: - Remember that, once exported in a persistent format, the model could only be - used for predictions and it cannot be refitted. + Remember that, once exported in a persistent format, the model should only + be used for predictions and it shouldn not be refitted. Binary serialization -------------------- From 3667b51ff908921bf2e4818d55509df003cf34d6 Mon Sep 17 00:00:00 2001 From: Chiara Marmo Date: Thu, 6 Aug 2020 01:19:33 +0200 Subject: [PATCH 07/12] Address comments. --- doc/modules/model_persistence.rst | 36 +++++++++++++++++++------------ 1 file changed, 22 insertions(+), 14 deletions(-) diff --git a/doc/modules/model_persistence.rst b/doc/modules/model_persistence.rst index b9ad0bc56da11..bcd24073c24f0 100644 --- a/doc/modules/model_persistence.rst +++ b/doc/modules/model_persistence.rst @@ -11,10 +11,10 @@ you some hints on how to persist a scikit-learn model. .. note:: Remember that, once exported in a persistent format, the model should only - be used for predictions and it shouldn not be refitted. + be used for predictions and it should not be refitted. -Binary serialization --------------------- +Python specific serialization +----------------------------- It is possible to save a model in scikit-learn by using Python's built-in persistence model, namely `pickle @@ -85,6 +85,8 @@ same range as before. Since a model internal representation may be different on two different architectures, dumping a model on one architecture and loading it on another architecture is not supported. +To overcome the issue of portability between different architectures, pickle +models might be deployed in production using containers, like docker. If you want to know more about these issues and explore other possible serialization methods, please refer to this @@ -94,11 +96,20 @@ serialization methods, please refer to this Interoperable formats --------------------- -For production and quality control needs, exporting the model in `Predictive -Model Markup Language (PMML) -`_ or `Open Neural Network -Exchange `_ format -would be a better approach than using `pickle`. +For reproducibility and quality control needs, exporting the model in +`Open Neural Network +Exchange `_ format or `Predictive Model Markup Language +(PMML) `_ format +might be a better approach than using `pickle` alone. + +ONNX is a binary serialization of the model. It has been developed to improve +the usability of the interoperable representation of data models. +It aims to facilitate the conversion of the data +models between different machine learning frameworks, and to improve their +portability on different computing architectures. More details are available +from the `ONNX tutorial `_. +To convert scikit-learn model to ONNX a specific tool `sklearn-onnx +`_ has been developed. PMML is an extension of the `XML `_ document standard @@ -107,9 +118,6 @@ Being human and machine readable, PMML is a good option for model validation on different platforms and long term archiving. On the other hand, as XML in general, its verbosity does not help in production when performance is critical. - -ONNX has been developed to improve the usability of the interoperable -representation of data models. It aims to facilitate the conversion of the data -models between different machine learning frameworks, and to improve their -portability on different computing architectures. More details are available -from the `ONNX tutorial `_. +To convert scikit-learn model to PMML you can use for example `sklearn2pmml +`_ distributed under the GNU Affero +License version3. From 563fce045be9bc6856ff0ad6f9225bb85552d8c5 Mon Sep 17 00:00:00 2001 From: Chiara Marmo Date: Thu, 6 Aug 2020 01:40:23 +0200 Subject: [PATCH 08/12] Address comments. --- doc/modules/model_persistence.rst | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/doc/modules/model_persistence.rst b/doc/modules/model_persistence.rst index bcd24073c24f0..aac91074ae177 100644 --- a/doc/modules/model_persistence.rst +++ b/doc/modules/model_persistence.rst @@ -11,7 +11,9 @@ you some hints on how to persist a scikit-learn model. .. note:: Remember that, once exported in a persistent format, the model should only - be used for predictions and it should not be refitted. + be used for predictions and cannot necessarily be refitted as information + about the original hyper-parameters might not have been exported + (depending on the serialization method). Python specific serialization ----------------------------- @@ -96,7 +98,8 @@ serialization methods, please refer to this Interoperable formats --------------------- -For reproducibility and quality control needs, exporting the model in +For reproducibility and quality control needs, when different architectures +and environments should be taken into account, exporting the model in `Open Neural Network Exchange `_ format or `Predictive Model Markup Language (PMML) `_ format From 2a224e031b02047d78efb04066da4d10a3abdf98 Mon Sep 17 00:00:00 2001 From: Chiara Marmo Date: Thu, 6 Aug 2020 12:41:50 +0200 Subject: [PATCH 09/12] Update doc/modules/model_persistence.rst Co-authored-by: Roman Yurchak --- doc/modules/model_persistence.rst | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/doc/modules/model_persistence.rst b/doc/modules/model_persistence.rst index aac91074ae177..fcf2f55886cda 100644 --- a/doc/modules/model_persistence.rst +++ b/doc/modules/model_persistence.rst @@ -122,5 +122,5 @@ PMML is a good option for model validation on different platforms and long term archiving. On the other hand, as XML in general, its verbosity does not help in production when performance is critical. To convert scikit-learn model to PMML you can use for example `sklearn2pmml -`_ distributed under the GNU Affero -License version3. +`_ distributed under the Affero GPLv3 +license. From 8b2e3933a47f9aab4f23a0e2c3e0485eef5ab499 Mon Sep 17 00:00:00 2001 From: Chiara Marmo Date: Thu, 6 Aug 2020 13:04:08 +0200 Subject: [PATCH 10/12] Some clarifications. --- doc/modules/model_persistence.rst | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/doc/modules/model_persistence.rst b/doc/modules/model_persistence.rst index aac91074ae177..a9f7963bffce0 100644 --- a/doc/modules/model_persistence.rst +++ b/doc/modules/model_persistence.rst @@ -86,9 +86,10 @@ same range as before. Since a model internal representation may be different on two different architectures, dumping a model on one architecture and loading it on -another architecture is not supported. -To overcome the issue of portability between different architectures, pickle -models might be deployed in production using containers, like docker. +another architecture is not a supported behaviour, even if it might work +on some cases. +To overcome the issue of portability, pickle models are often deployed in +production using containers, like docker. If you want to know more about these issues and explore other possible serialization methods, please refer to this From 2ae86d8aa2cf877a847c0c321ce3a3ad2e6ccee3 Mon Sep 17 00:00:00 2001 From: Chiara Marmo Date: Wed, 26 Aug 2020 12:29:15 +0200 Subject: [PATCH 11/12] Remove model persistence from tutorial. --- doc/tutorial/basic/tutorial.rst | 46 --------------------------------- 1 file changed, 46 deletions(-) diff --git a/doc/tutorial/basic/tutorial.rst b/doc/tutorial/basic/tutorial.rst index 28e965bd925a5..5199ad226243f 100644 --- a/doc/tutorial/basic/tutorial.rst +++ b/doc/tutorial/basic/tutorial.rst @@ -204,52 +204,6 @@ A complete example of this classification problem is available as an example that you can run and study: :ref:`sphx_glr_auto_examples_classification_plot_digits_classification.py`. - -Model persistence ------------------ - -It is possible to save a model in scikit-learn by using Python's built-in -persistence model, `pickle `_:: - - >>> from sklearn import svm - >>> from sklearn import datasets - >>> clf = svm.SVC() - >>> X, y = datasets.load_iris(return_X_y=True) - >>> clf.fit(X, y) - SVC() - - >>> import pickle - >>> s = pickle.dumps(clf) - >>> clf2 = pickle.loads(s) - >>> clf2.predict(X[0:1]) - array([0]) - >>> y[0] - 0 - -In the specific case of scikit-learn, it may be more interesting to use -joblib's replacement for pickle (``joblib.dump`` & ``joblib.load``), -which is more efficient on big data but it can only pickle to the disk -and not to a string:: - - >>> from joblib import dump, load - >>> dump(clf, 'filename.joblib') # doctest: +SKIP - -Later, you can reload the pickled model (possibly in another Python process) -with:: - - >>> clf = load('filename.joblib') # doctest:+SKIP - -.. note:: - - ``joblib.dump`` and ``joblib.load`` functions also accept file-like object - instead of filenames. More information on data persistence with Joblib is - available `here `_. - -Note that pickle has some security and maintainability issues. Please refer to -section :ref:`model_persistence` for more detailed information about model -persistence with scikit-learn. - - Conventions ----------- From 653efb54b2285f994b793a285f1155605d6bc8e8 Mon Sep 17 00:00:00 2001 From: Chiara Marmo Date: Tue, 22 Sep 2020 08:54:26 +0200 Subject: [PATCH 12/12] Address comments. --- doc/modules/model_persistence.rst | 13 ++++--------- 1 file changed, 4 insertions(+), 9 deletions(-) diff --git a/doc/modules/model_persistence.rst b/doc/modules/model_persistence.rst index f7ecbbfc15975..19d3e12205c12 100644 --- a/doc/modules/model_persistence.rst +++ b/doc/modules/model_persistence.rst @@ -8,13 +8,6 @@ After training a scikit-learn model, it is desirable to have a way to persist the model for future use without having to retrain. The following sections give you some hints on how to persist a scikit-learn model. -.. note:: - - Remember that, once exported in a persistent format, the model should only - be used for predictions and cannot necessarily be refitted as information - about the original hyper-parameters might not have been exported - (depending on the serialization method). - Python specific serialization ----------------------------- @@ -105,6 +98,8 @@ and environments should be taken into account, exporting the model in Exchange `_ format or `Predictive Model Markup Language (PMML) `_ format might be a better approach than using `pickle` alone. +These are helpful where you may want to use your model for prediction in a +different environment from where the model was trained. ONNX is a binary serialization of the model. It has been developed to improve the usability of the interoperable representation of data models. @@ -115,8 +110,8 @@ from the `ONNX tutorial `_. To convert scikit-learn model to ONNX a specific tool `sklearn-onnx `_ has been developed. -PMML is an extension of the `XML -`_ document standard +PMML is an implementation of the `XML +`_ document standard defined to represent data models together with the data used to generate them. Being human and machine readable, PMML is a good option for model validation on different platforms and