diff --git a/doc/model_selection.rst b/doc/model_selection.rst
index 04e41c454419e..25cd2b655ccc5 100644
--- a/doc/model_selection.rst
+++ b/doc/model_selection.rst
@@ -15,5 +15,4 @@ Model selection and evaluation
     modules/cross_validation
     modules/grid_search
     modules/model_evaluation
-    modules/model_persistence
     modules/learning_curve
diff --git a/doc/modules/model_persistence.rst b/doc/modules/model_persistence.rst
index 3c59b7a313a17..19d3e12205c12 100644
--- a/doc/modules/model_persistence.rst
+++ b/doc/modules/model_persistence.rst
@@ -5,22 +5,15 @@ Model persistence
 =================
 
 After training a scikit-learn model, it is desirable to have a way to persist
-the model for future use without having to retrain. The following section gives
-you an example of how to persist a model with pickle. We'll also review a few
-security and maintainability issues when working with pickle serialization.
+the model for future use without having to retrain. The following sections give
+you some hints on how to persist a scikit-learn model.
 
-An alternative to pickling is to export the model to another format using one
-of the model export tools listed under :ref:`related_projects`. Unlike
-pickling, once exported you cannot recover the full Scikit-learn estimator
-object, but you can deploy the model for prediction, usually by using tools
-supporting open model interchange formats such as `ONNX `_ or
-`PMML `_.
-
-Persistence example
--------------------
+Python specific serialization
+-----------------------------
 
 It is possible to save a model in scikit-learn by using Python's built-in
-persistence model, namely `pickle `_::
+persistence model, namely `pickle
+`_::
 
   >>> from sklearn import svm
   >>> from sklearn import datasets
@@ -55,12 +48,13 @@ with::
 
    ``dump`` and ``load`` functions also accept file-like object
    instead of filenames. More information on data persistence with Joblib is
-   available `here `_.
+   available `here
+   `_.
 
 .. _persistence_limitations:
 
 Security & maintainability limitations
---------------------------------------
+......................................
 
 pickle (and joblib by extension), has some issues regarding maintainability
 and security. Because of this,
@@ -85,8 +79,44 @@ same range as before.
 
 Since a model internal representation may be different on two different
 architectures, dumping a model on one architecture and loading it on
-another architecture is not supported.
+another architecture is not a supported behaviour, even if it might work
+on some cases.
+To overcome the issue of portability, pickle models are often deployed in
+production using containers, like docker.
 
 If you want to know more about these issues and explore other possible
 serialization methods, please refer to this
-`talk by Alex Gaynor `_.
+`talk by Alex Gaynor
+`_.
+
+Interoperable formats
+---------------------
+
+For reproducibility and quality control needs, when different architectures
+and environments should be taken into account, exporting the model in
+`Open Neural Network
+Exchange `_ format or `Predictive Model Markup Language
+(PMML) `_ format
+might be a better approach than using `pickle` alone.
+These are helpful where you may want to use your model for prediction in a
+different environment from where the model was trained.
+
+ONNX is a binary serialization of the model. It has been developed to improve
+the usability of the interoperable representation of data models.
+It aims to facilitate the conversion of the data
+models between different machine learning frameworks, and to improve their
+portability on different computing architectures. More details are available
+from the `ONNX tutorial `_.
+To convert scikit-learn model to ONNX a specific tool `sklearn-onnx
+`_ has been developed.
+
+PMML is an implementation of the `XML
+`_ document standard
+defined to represent data models together with the data used to generate them.
+Being human and machine readable,
+PMML is a good option for model validation on different platforms and
+long term archiving. On the other hand, as XML in general, its verbosity does
+not help in production when performance is critical.
+To convert scikit-learn model to PMML you can use for example `sklearn2pmml
+`_ distributed under the Affero GPLv3
+license.
diff --git a/doc/tutorial/basic/tutorial.rst b/doc/tutorial/basic/tutorial.rst
index 28e965bd925a5..5199ad226243f 100644
--- a/doc/tutorial/basic/tutorial.rst
+++ b/doc/tutorial/basic/tutorial.rst
@@ -204,52 +204,6 @@ A complete example of this classification problem is available as an
 example that you can run and study:
 :ref:`sphx_glr_auto_examples_classification_plot_digits_classification.py`.
 
-
-Model persistence
------------------
-
-It is possible to save a model in scikit-learn by using Python's built-in
-persistence model, `pickle `_::
-
-  >>> from sklearn import svm
-  >>> from sklearn import datasets
-  >>> clf = svm.SVC()
-  >>> X, y = datasets.load_iris(return_X_y=True)
-  >>> clf.fit(X, y)
-  SVC()
-
-  >>> import pickle
-  >>> s = pickle.dumps(clf)
-  >>> clf2 = pickle.loads(s)
-  >>> clf2.predict(X[0:1])
-  array([0])
-  >>> y[0]
-  0
-
-In the specific case of scikit-learn, it may be more interesting to use
-joblib's replacement for pickle (``joblib.dump`` & ``joblib.load``),
-which is more efficient on big data but it can only pickle to the disk
-and not to a string::
-
-  >>> from joblib import dump, load
-  >>> dump(clf, 'filename.joblib') # doctest: +SKIP
-
-Later, you can reload the pickled model (possibly in another Python process)
-with::
-
-  >>> clf = load('filename.joblib') # doctest:+SKIP
-
-.. note::
-
-   ``joblib.dump`` and ``joblib.load`` functions also accept file-like object
-   instead of filenames. More information on data persistence with Joblib is
-   available `here `_.
-
-Note that pickle has some security and maintainability issues. Please refer to
-section :ref:`model_persistence` for more detailed information about model
-persistence with scikit-learn.
-
-
 Conventions
 -----------
 
diff --git a/doc/user_guide.rst b/doc/user_guide.rst
index 464b7918d7ba5..9c9804c702c06 100644
--- a/doc/user_guide.rst
+++ b/doc/user_guide.rst
@@ -28,3 +28,4 @@ User Guide
    data_transforms.rst
    datasets.rst
    computing.rst
+   modules/model_persistence.rst
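
For readers of this patch, the pickle round trip that the persistence section describes can be sketched with the standard library alone. This is a minimal illustration, not the documented scikit-learn workflow: the `fit` and `predict` helpers and the dictionary of class means are hypothetical stand-ins for a fitted estimator, chosen so the snippet runs without scikit-learn installed. It shows both the in-memory form (`pickle.dumps` / `pickle.loads`) and the file-like form that the note about ``dump`` and ``load`` refers to.

```python
import io
import pickle


def fit(X, y):
    """Hypothetical stand-in for training: learn the mean of a 1-D feature per class."""
    sums, counts = {}, {}
    for value, label in zip(X, y):
        sums[label] = sums.get(label, 0.0) + value
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


def predict(params, X):
    """Assign each value to the class whose learned mean is closest."""
    return [min(params, key=lambda label: abs(value - params[label])) for value in X]


# "Train" once; params plays the role of the fitted model state.
params = fit([0.1, 0.2, 0.9, 1.1], [0, 0, 1, 1])

# In-memory round trip, mirroring pickle.dumps / pickle.loads in the docs.
payload = pickle.dumps(params)
restored = pickle.loads(payload)

# File-like round trip: dump to a buffer, rewind, then load.
buffer = io.BytesIO()
pickle.dump(params, buffer)
buffer.seek(0)
restored_from_buffer = pickle.load(buffer)

# The restored state predicts exactly like the original, with no retraining.
print(predict(restored, [0.0, 1.0]))
```

The same caveat the patch raises applies to this sketch: `pickle.loads` can execute arbitrary code, so only payloads from a trusted source should be loaded.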