
Commit 5319994

Merge pull request #2765 from AlexanderFabisch/validation_curves
[MRG] Validation curves
2 parents 5b1f20a + 5314720

9 files changed: +492 additions, -65 deletions

doc/model_selection.rst

Lines changed: 1 addition & 0 deletions

@@ -11,3 +11,4 @@ Model selection and evaluation
   modules/grid_search
   modules/pipeline
   modules/model_evaluation
+  modules/learning_curve

doc/modules/classes.rst

Lines changed: 2 additions & 0 deletions

@@ -624,6 +624,7 @@ From text
   :template: function.rst

   learning_curve.learning_curve
+  learning_curve.validation_curve

.. _linear_model_ref:

@@ -1059,6 +1060,7 @@ Pairwise metrics
   preprocessing.Normalizer
   preprocessing.OneHotEncoder
   preprocessing.StandardScaler
+  preprocessing.PolynomialFeatures

.. autosummary::
   :toctree: generated/

doc/modules/learning_curve.rst

Lines changed: 158 additions & 0 deletions

@@ -0,0 +1,158 @@
.. _learning_curves:

=====================================================
Validation curves: plotting scores to evaluate models
=====================================================

.. currentmodule:: sklearn.learning_curve

Every estimator has its advantages and drawbacks. Its generalization error
can be decomposed in terms of bias, variance and noise. The **bias** of an
estimator is its average error for different training sets. The **variance**
of an estimator indicates how sensitive it is to varying training sets. Noise
is a property of the data.
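For intuition, bias and variance can be approximated empirically by refitting an
estimator on many resampled training sets and comparing its averaged predictions
with the true function. The following is only a rough sketch (it is not part of
the documentation added in this commit) and reuses the cosine target of the
example below::

    # Rough sketch: empirical bias/variance estimate for a degree-1 fit.
    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.RandomState(0)
    true_fun = lambda X: np.cos(1.5 * np.pi * X)
    X_test = np.linspace(0, 1, 100)

    predictions = []
    for _ in range(200):
        # draw a fresh noisy training set from the same distribution
        X_train = np.sort(rng.rand(30))
        y_train = true_fun(X_train) + rng.randn(30) * 0.1
        model = LinearRegression().fit(X_train[:, np.newaxis], y_train)
        predictions.append(model.predict(X_test[:, np.newaxis]))

    predictions = np.array(predictions)
    # squared bias: error of the average prediction; variance: spread of fits
    squared_bias = np.mean((predictions.mean(axis=0) - true_fun(X_test)) ** 2)
    variance = np.mean(predictions.var(axis=0))
    print(squared_bias, variance)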
In the following plot, we see a function :math:`f(x) = \cos (\frac{3}{2} \pi x)`
and some noisy samples from that function. We use three different estimators
to fit the function: linear regression with polynomial features of degree 1,
4 and 15. We see that the first estimator can at best provide only a poor fit
to the samples and the true function because it is too simple (high bias),
the second estimator approximates it almost perfectly, and the last estimator
approximates the training data perfectly but does not fit the true function
very well, i.e. it is very sensitive to varying training data (high variance).

.. figure:: ../auto_examples/images/plot_polynomial_regression_1.png
   :target: ../auto_examples/plot_polynomial_regression.html
   :align: center
   :scale: 50%
Bias and variance are inherent properties of estimators and we usually have to
select learning algorithms and hyperparameters so that both bias and variance
are as low as possible (see `Bias-variance dilemma
<http://en.wikipedia.org/wiki/Bias-variance_dilemma>`_). Another way to reduce
the variance of a model is to use more training data. However, you should only
collect more training data if the true function is too complex to be
approximated by an estimator with a lower variance.

In the simple one-dimensional problem that we have seen in the example it is
easy to see whether the estimator suffers from bias or variance. However, in
high-dimensional spaces, models can become very difficult to visualize. For
this reason, it is often helpful to use the tools described below.

.. topic:: Examples:

   * :ref:`example_plot_polynomial_regression.py`
   * :ref:`example_plot_validation_curve.py`
   * :ref:`example_plot_learning_curve.py`
.. _validation_curve:

Validation curve
================

To validate a model we need a scoring function (see :ref:`model_evaluation`),
for example accuracy for classifiers. The proper way of choosing multiple
hyperparameters of an estimator is of course grid search or a similar method
(see :ref:`grid_search`) that selects the hyperparameters with the maximum score
on a validation set or multiple validation sets. Note that if we optimize
the hyperparameters based on a validation score, the validation score is biased
and no longer a good estimate of the generalization error. To get a proper
estimate of the generalization error we have to compute the score on another
test set.

However, it is sometimes helpful to plot the influence of a single
hyperparameter on the training score and the validation score to find out
whether the estimator is overfitting or underfitting for some hyperparameter
values.

The function :func:`validation_curve` can help in this case::
    >>> import numpy as np
    >>> from sklearn.learning_curve import validation_curve
    >>> from sklearn.datasets import load_iris
    >>> from sklearn.linear_model import Ridge

    >>> np.random.seed(0)
    >>> iris = load_iris()
    >>> X, y = iris.data, iris.target
    >>> indices = np.arange(y.shape[0])
    >>> np.random.shuffle(indices)
    >>> X, y = X[indices], y[indices]

    >>> train_scores, valid_scores = validation_curve(Ridge(), X, y, "alpha",
    ...                                               np.logspace(-7, 3, 3))
    >>> train_scores            # doctest: +ELLIPSIS, +NORMALIZE_WHITESPACE
    array([[ 0.94..., 0.92..., 0.92...],
           [ 0.94..., 0.92..., 0.92...],
           [ 0.47..., 0.45..., 0.42...]])
    >>> valid_scores            # doctest: +ELLIPSIS, +NORMALIZE_WHITESPACE
    array([[ 0.90..., 0.92..., 0.94...],
           [ 0.90..., 0.92..., 0.94...],
           [ 0.44..., 0.39..., 0.45...]])
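The hyperparameter value with the best mean validation score can be read
directly off these arrays. A minimal sketch (not part of the documentation file
itself), continuing the session above::

    >>> alphas = np.logspace(-7, 3, 3)          # the param_range used above
    >>> mean_valid_scores = np.mean(valid_scores, axis=1)
    >>> best_alpha = alphas[np.argmax(mean_valid_scores)]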
If the training score and the validation score are both low, the estimator is
underfitting. If the training score is high and the validation score is low,
the estimator is overfitting; otherwise it is working very well. A low
training score and a high validation score is usually not possible. All three
cases can be found in the plot below, where we vary the parameter
:math:`\gamma` of an SVM on the digits dataset.

.. figure:: ../auto_examples/images/plot_validation_curve_1.png
   :target: ../auto_examples/plot_validation_curve.html
   :align: center
   :scale: 50%
.. _learning_curve:

Learning curve
==============

A learning curve shows the validation and training score of an estimator
for varying numbers of training samples. It is a tool to find out how much
we benefit from adding more training data and whether the estimator suffers
more from a variance error or a bias error. If both the validation score and
the training score converge to a value that is too low with increasing
size of the training set, we will not benefit much from more training data.
In the following plot you can see an example: naive Bayes roughly converges
to a low score.

.. figure:: ../auto_examples/images/plot_learning_curve_1.png
   :target: ../auto_examples/plot_learning_curve.html
   :align: center
   :scale: 50%

We will probably have to use an estimator or a parametrization of the
current estimator that can learn more complex concepts (i.e. has a lower
bias). If the training score is much greater than the validation score for
the maximum number of training samples, adding more training samples will
most likely increase generalization. In the following plot you can see that
the SVM could benefit from more training examples.

.. figure:: ../auto_examples/images/plot_learning_curve_2.png
   :target: ../auto_examples/plot_learning_curve.html
   :align: center
   :scale: 50%

We can use the function :func:`learning_curve` to generate the values
that are required to plot such a learning curve (number of samples
that have been used, the average scores on the training sets and the
average scores on the validation sets)::
    >>> from sklearn.learning_curve import learning_curve
    >>> from sklearn.svm import SVC

    >>> train_sizes, train_scores, valid_scores = learning_curve(
    ...     SVC(kernel='linear'), X, y, train_sizes=[50, 80, 110], cv=5)
    >>> train_sizes            # doctest: +NORMALIZE_WHITESPACE
    array([ 50, 80, 110])
    >>> train_scores           # doctest: +ELLIPSIS, +NORMALIZE_WHITESPACE
    array([[ 0.98..., 0.98 , 0.98..., 0.98..., 0.98...],
           [ 0.98..., 1. , 0.98..., 0.98..., 0.98...],
           [ 0.98..., 1. , 0.98..., 0.98..., 0.99...]])
    >>> valid_scores           # doctest: +ELLIPSIS, +NORMALIZE_WHITESPACE
    array([[ 1. , 0.93..., 1. , 1. , 0.96...],
           [ 1. , 0.96..., 1. , 1. , 0.96...],
           [ 1. , 0.96..., 1. , 1. , 0.96...]])
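The same arrays can also be used to quantify the diagnosis described above: a
large gap between the mean training score and the mean validation score at the
largest training size hints at high variance, while two low scores hint at high
bias. A minimal sketch (illustrative only, not part of the added file),
continuing the session::

    >>> train_mean = np.mean(train_scores, axis=1)
    >>> valid_mean = np.mean(valid_scores, axis=1)
    >>> # train/validation gap at the largest training size; a large value
    >>> # suggests high variance, two low means suggest high bias
    >>> gap = train_mean[-1] - valid_mean[-1]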

examples/plot_learning_curve.py

Lines changed: 33 additions & 23 deletions

@@ -3,27 +3,19 @@
Plotting Learning Curves
========================

-A learning curve shows the validation and training score of a learning
-algorithm for varying numbers of training samples. It is a tool to
-find out how much we benefit from adding more training data. If both
-the validation score and the training score converge to a value that is
-too low, we will not benefit much from more training data and we will
-probably have to use a learning algorithm or a parametrization of the
-current learning algorithm that can learn more complex concepts (i.e.
-has a lower bias).
-
-In this example, on the left side the learning curve of a naive Bayes
-classifier is shown for the digits dataset. Note that the training score
-and the cross-validation score are both not very good at the end. However,
-the shape of the curve can be found in more complex datasets very often:
-the training score is very high at the beginning and decreases and the
-cross-validation score is very low at the beginning and increases. On the
-right side we see the learning curve of an SVM with RBF kernel. We can
-see clearly that the training score is still around the maximum and the
-validation score could be increased with more training samples.
+On the left side the learning curve of a naive Bayes classifier is shown for
+the digits dataset. Note that the training score and the cross-validation score
+are both not very good at the end. However, the shape of the curve can be found
+in more complex datasets very often: the training score is very high at the
+beginning and decreases and the cross-validation score is very low at the
+beginning and increases. On the right side we see the learning curve of an SVM
+with RBF kernel. We can see clearly that the training score is still around
+the maximum and the validation score could be increased with more training
+samples.
"""
print(__doc__)

+import numpy as np
import matplotlib.pyplot as plt
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

@@ -40,18 +32,36 @@
plt.ylabel("Score")
train_sizes, train_scores, test_scores = learning_curve(
    GaussianNB(), X, y, cv=10, n_jobs=1)
-plt.plot(train_sizes, train_scores, label="Training score")
-plt.plot(train_sizes, test_scores, label="Cross-validation score")
+train_scores_mean = np.mean(train_scores, axis=1)
+train_scores_std = np.std(train_scores, axis=1)
+test_scores_mean = np.mean(test_scores, axis=1)
+test_scores_std = np.std(test_scores, axis=1)
+plt.plot(train_sizes, train_scores_mean, label="Training score", color="r")
+plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
+                 train_scores_mean + train_scores_std, alpha=0.2, color="r")
+plt.plot(train_sizes, test_scores_mean, label="Cross-validation score",
+         color="g")
+plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
+                 test_scores_mean + test_scores_std, alpha=0.2, color="g")
plt.legend(loc="best")

plt.figure()
plt.title("Learning Curve (SVM, RBF kernel, $\gamma=0.001$)")
plt.xlabel("Training examples")
plt.ylabel("Score")
train_sizes, train_scores, test_scores = learning_curve(
-    SVC(gamma=0.001), X, y, cv=10, n_jobs=1)
-plt.plot(train_sizes, train_scores, label="Training score")
-plt.plot(train_sizes, test_scores, label="Cross-validation score")
+    SVC(gamma=0.001), X, y, cv=10, n_jobs=4)
+train_scores_mean = np.mean(train_scores, axis=1)
+train_scores_std = np.std(train_scores, axis=1)
+test_scores_mean = np.mean(test_scores, axis=1)
+test_scores_std = np.std(test_scores, axis=1)
+plt.plot(train_sizes, train_scores_mean, label="Training score", color="r")
+plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
+                 train_scores_mean + train_scores_std, alpha=0.2, color="r")
+plt.plot(train_sizes, test_scores_mean, label="Cross-validation score",
+         color="g")
+plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
+                 test_scores_mean + test_scores_std, alpha=0.2, color="g")
plt.legend(loc="best")

plt.show()
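Both figures repeat the same mean/std band plotting pattern. A small helper along
the following lines could factor it out; this is only a sketch with an invented
name, not part of the committed example:

    import numpy as np
    import matplotlib.pyplot as plt

    def plot_score_band(x, scores, label, color):
        """Plot the mean score across CV folds with a +/- one std band."""
        mean = np.mean(scores, axis=1)
        std = np.std(scores, axis=1)
        plt.plot(x, mean, label=label, color=color)
        plt.fill_between(x, mean - std, mean + std, alpha=0.2, color=color)

    # usage, reusing the arrays computed by learning_curve above:
    # plot_score_band(train_sizes, train_scores, "Training score", "r")
    # plot_score_band(train_sizes, test_scores, "Cross-validation score", "g")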
examples/plot_polynomial_regression.py

Lines changed: 57 additions & 0 deletions

@@ -0,0 +1,57 @@
"""
=====================
Polynomial Regression
=====================

This example demonstrates how we can use linear regression with polynomial
features to approximate nonlinear functions. The plot shows the function that
we want to approximate, which is a part of the cosine function. In addition,
the samples from the real function and the approximations of different models
are displayed. The models have polynomial features of different degrees. We
can see that a linear function (polynomial with degree 1) is not sufficient
to fit the training samples. This is called **underfitting**. A polynomial of
degree 4 approximates the true function almost perfectly. However, for higher
degrees the model will **overfit** the training data, i.e. it learns the
noise of the training data.
"""
print(__doc__)

import numpy as np
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression


np.random.seed(0)

n_samples = 30
degrees = [1, 4, 15]

true_fun = lambda X: np.cos(1.5 * np.pi * X)
X = np.sort(np.random.rand(n_samples))
y = true_fun(X) + np.random.randn(n_samples) * 0.1

plt.figure(figsize=(14, 4))
for i in range(len(degrees)):
    ax = plt.subplot(1, len(degrees), i+1)
    plt.setp(ax, xticks=(), yticks=())

    polynomial_features = PolynomialFeatures(degree=degrees[i],
                                             include_bias=False)
    linear_regression = LinearRegression()
    pipeline = Pipeline([("polynomial_features", polynomial_features),
                         ("linear_regression", linear_regression)])
    pipeline.fit(X[:, np.newaxis], y)

    X_test = np.linspace(0, 1, 100)
    plt.plot(X_test, pipeline.predict(X_test[:, np.newaxis]), label="Model")
    plt.plot(X_test, true_fun(X_test), label="True function")
    plt.scatter(X, y, label="Samples")
    plt.xlabel("x")
    plt.ylabel("y")
    plt.xlim((0, 1))
    plt.ylim((-2, 2))
    plt.legend(loc="best")
    plt.title("Degree %d" % degrees[i])
plt.show()
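To attach numbers to the under-/overfitting behaviour described in the docstring,
each degree can additionally be scored with cross-validation. A rough sketch (not
part of the committed example), continuing with the X, y and degrees defined
above and using the sklearn.cross_validation API of that release:

    from sklearn.cross_validation import cross_val_score

    for degree in degrees:
        model = Pipeline([
            ("polynomial_features", PolynomialFeatures(degree=degree,
                                                       include_bias=False)),
            ("linear_regression", LinearRegression()),
        ])
        # the "mean_squared_error" scorer negates the error so greater is better
        scores = cross_val_score(model, X[:, np.newaxis], y,
                                 scoring="mean_squared_error", cv=10)
        print("degree %d: MSE = %.4f (+/- %.4f)"
              % (degree, -scores.mean(), scores.std()))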

examples/plot_validation_curve.py

Lines changed: 46 additions & 0 deletions

@@ -0,0 +1,46 @@
"""
==========================
Plotting Validation Curves
==========================

In this plot you can see the training scores and validation scores of an SVM
for different values of the kernel parameter gamma. For very low values of
gamma, you can see that both the training score and the validation score are
low. This is called underfitting. Medium values of gamma will result in high
values for both scores, i.e. the classifier is performing fairly well. If gamma
is too high, the classifier will overfit, which means that the training score
is good but the validation score is poor.
"""
print(__doc__)

import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_digits
from sklearn.svm import SVC
from sklearn.learning_curve import validation_curve

digits = load_digits()
X, y = digits.data, digits.target

param_range = np.logspace(-6, -1, 5)
train_scores, test_scores = validation_curve(
    SVC(), X, y, param_name="gamma", param_range=param_range,
    cv=10, scoring="accuracy", n_jobs=1)
train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)

plt.title("Validation Curve with SVM")
plt.xlabel("$\gamma$")
plt.ylabel("Score")
plt.ylim(0.0, 1.1)
plt.semilogx(param_range, train_scores_mean, label="Training score", color="r")
plt.fill_between(param_range, train_scores_mean - train_scores_std,
                 train_scores_mean + train_scores_std, alpha=0.2, color="r")
plt.semilogx(param_range, test_scores_mean, label="Cross-validation score",
             color="g")
plt.fill_between(param_range, test_scores_mean - test_scores_std,
                 test_scores_mean + test_scores_std, alpha=0.2, color="g")
plt.legend(loc="best")
plt.show()
