
Commit 5319994

Merge pull request #2765 from AlexanderFabisch/validation_curves
[MRG] Validation curves
2 parents 5b1f20a + 5314720

9 files changed: +492 additions, -65 deletions

doc/model_selection.rst

Lines changed: 1 addition & 0 deletions

@@ -11,3 +11,4 @@ Model selection and evaluation
   modules/grid_search
   modules/pipeline
   modules/model_evaluation
+  modules/learning_curve

doc/modules/classes.rst

Lines changed: 2 additions & 0 deletions

@@ -624,6 +624,7 @@ From text
   :template: function.rst

   learning_curve.learning_curve
+  learning_curve.validation_curve

.. _linear_model_ref:

@@ -1059,6 +1060,7 @@ Pairwise metrics
   preprocessing.Normalizer
   preprocessing.OneHotEncoder
   preprocessing.StandardScaler
+  preprocessing.PolynomialFeatures

.. autosummary::
   :toctree: generated/

doc/modules/learning_curve.rst

Lines changed: 158 additions & 0 deletions

@@ -0,0 +1,158 @@
.. _learning_curves:

=====================================================
Validation curves: plotting scores to evaluate models
=====================================================

.. currentmodule:: sklearn.learning_curve

Every estimator has its advantages and drawbacks. Its generalization error
can be decomposed in terms of bias, variance and noise. The **bias** of an
estimator is its average error for different training sets. The **variance**
of an estimator indicates how sensitive it is to varying training sets. Noise
is a property of the data.
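For intuition, bias and variance can be approximated empirically by refitting an
estimator on many resampled training sets and comparing its averaged predictions
with the true function. The following is only a rough sketch (it is not part of
the documentation added in this commit) and reuses the cosine target of the
example below::

    # Rough sketch: empirical bias/variance estimate for a degree-1 fit.
    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.RandomState(0)
    true_fun = lambda X: np.cos(1.5 * np.pi * X)
    X_test = np.linspace(0, 1, 100)

    predictions = []
    for _ in range(200):
        # draw a fresh noisy training set from the same distribution
        X_train = np.sort(rng.rand(30))
        y_train = true_fun(X_train) + rng.randn(30) * 0.1
        model = LinearRegression().fit(X_train[:, np.newaxis], y_train)
        predictions.append(model.predict(X_test[:, np.newaxis]))

    predictions = np.array(predictions)
    # squared bias: error of the average prediction; variance: spread of fits
    squared_bias = np.mean((predictions.mean(axis=0) - true_fun(X_test)) ** 2)
    variance = np.mean(predictions.var(axis=0))
    print(squared_bias, variance)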
In the following plot, we see a function :math:`f(x) = \cos (\frac{3}{2} \pi x)`
and some noisy samples from that function. We use three different estimators
to fit the function: linear regression with polynomial features of degree 1,
4 and 15. We see that the first estimator can at best provide only a poor fit
to the samples and the true function because it is too simple (high bias),
the second estimator approximates it almost perfectly, and the last estimator
approximates the training data perfectly but does not fit the true function
very well, i.e. it is very sensitive to varying training data (high variance).

.. figure:: ../auto_examples/images/plot_polynomial_regression_1.png
   :target: ../auto_examples/plot_polynomial_regression.html
   :align: center
   :scale: 50%
Bias and variance are inherent properties of estimators and we usually have to
select learning algorithms and hyperparameters so that both bias and variance
are as low as possible (see `Bias-variance dilemma
<http://en.wikipedia.org/wiki/Bias-variance_dilemma>`_). Another way to reduce
the variance of a model is to use more training data. However, you should only
collect more training data if the true function is too complex to be
approximated by an estimator with a lower variance.

In the simple one-dimensional problem that we have seen in the example it is
easy to see whether the estimator suffers from bias or variance. However, in
high-dimensional spaces, models can become very difficult to visualize. For
this reason, it is often helpful to use the tools described below.

.. topic:: Examples:

   * :ref:`example_plot_polynomial_regression.py`
   * :ref:`example_plot_validation_curve.py`
   * :ref:`example_plot_learning_curve.py`
.. _validation_curve:

Validation curve
================

To validate a model we need a scoring function (see :ref:`model_evaluation`),
for example accuracy for classifiers. The proper way of choosing multiple
hyperparameters of an estimator is of course grid search or a similar method
(see :ref:`grid_search`) that selects the hyperparameters with the maximum score
on a validation set or multiple validation sets. Note that if we optimize
the hyperparameters based on a validation score, the validation score is biased
and no longer a good estimate of the generalization error. To get a proper
estimate of the generalization error we have to compute the score on another
test set.

However, it is sometimes helpful to plot the influence of a single
hyperparameter on the training score and the validation score to find out
whether the estimator is overfitting or underfitting for some hyperparameter
values.

The function :func:`validation_curve` can help in this case::
    >>> import numpy as np
    >>> from sklearn.learning_curve import validation_curve
    >>> from sklearn.datasets import load_iris
    >>> from sklearn.linear_model import Ridge

    >>> np.random.seed(0)
    >>> iris = load_iris()
    >>> X, y = iris.data, iris.target
    >>> indices = np.arange(y.shape[0])
    >>> np.random.shuffle(indices)
    >>> X, y = X[indices], y[indices]

    >>> train_scores, valid_scores = validation_curve(Ridge(), X, y, "alpha",
    ...                                               np.logspace(-7, 3, 3))
    >>> train_scores            # doctest: +ELLIPSIS, +NORMALIZE_WHITESPACE
    array([[ 0.94..., 0.92..., 0.92...],
           [ 0.94..., 0.92..., 0.92...],
           [ 0.47..., 0.45..., 0.42...]])
    >>> valid_scores            # doctest: +ELLIPSIS, +NORMALIZE_WHITESPACE
    array([[ 0.90..., 0.92..., 0.94...],
           [ 0.90..., 0.92..., 0.94...],
           [ 0.44..., 0.39..., 0.45...]])
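The hyperparameter value with the best mean validation score can be read
directly off these arrays. A minimal sketch (not part of the documentation file
itself), continuing the session above::

    >>> alphas = np.logspace(-7, 3, 3)          # the param_range used above
    >>> mean_valid_scores = np.mean(valid_scores, axis=1)
    >>> best_alpha = alphas[np.argmax(mean_valid_scores)]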
If the training score and the validation score are both low, the estimator is
underfitting. If the training score is high and the validation score is low,
the estimator is overfitting; otherwise it is working very well. A low
training score and a high validation score is usually not possible. All three
cases can be found in the plot below, where we vary the parameter
:math:`\gamma` of an SVM on the digits dataset.

.. figure:: ../auto_examples/images/plot_validation_curve_1.png
   :target: ../auto_examples/plot_validation_curve.html
   :align: center
   :scale: 50%
.. _learning_curve:

Learning curve
==============

A learning curve shows the validation and training score of an estimator
for varying numbers of training samples. It is a tool to find out how much
we benefit from adding more training data and whether the estimator suffers
more from a variance error or a bias error. If both the validation score and
the training score converge to a value that is too low with increasing
size of the training set, we will not benefit much from more training data.
In the following plot you can see an example: naive Bayes roughly converges
to a low score.

.. figure:: ../auto_examples/images/plot_learning_curve_1.png
   :target: ../auto_examples/plot_learning_curve.html
   :align: center
   :scale: 50%

We will probably have to use an estimator or a parametrization of the
current estimator that can learn more complex concepts (i.e. has a lower
bias). If the training score is much greater than the validation score for
the maximum number of training samples, adding more training samples will
most likely increase generalization. In the following plot you can see that
the SVM could benefit from more training examples.

.. figure:: ../auto_examples/images/plot_learning_curve_2.png
   :target: ../auto_examples/plot_learning_curve.html
   :align: center
   :scale: 50%

We can use the function :func:`learning_curve` to generate the values
that are required to plot such a learning curve (number of samples
that have been used, the average scores on the training sets and the
average scores on the validation sets)::
    >>> from sklearn.learning_curve import learning_curve
    >>> from sklearn.svm import SVC

    >>> train_sizes, train_scores, valid_scores = learning_curve(
    ...     SVC(kernel='linear'), X, y, train_sizes=[50, 80, 110], cv=5)
    >>> train_sizes            # doctest: +NORMALIZE_WHITESPACE
    array([ 50, 80, 110])
    >>> train_scores           # doctest: +ELLIPSIS, +NORMALIZE_WHITESPACE
    array([[ 0.98..., 0.98 , 0.98..., 0.98..., 0.98...],
           [ 0.98..., 1. , 0.98..., 0.98..., 0.98...],
           [ 0.98..., 1. , 0.98..., 0.98..., 0.99...]])
    >>> valid_scores           # doctest: +ELLIPSIS, +NORMALIZE_WHITESPACE
    array([[ 1. , 0.93..., 1. , 1. , 0.96...],
           [ 1. , 0.96..., 1. , 1. , 0.96...],
           [ 1. , 0.96..., 1. , 1. , 0.96...]])
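The same arrays can also be used to quantify the diagnosis described above: a
large gap between the mean training score and the mean validation score at the
largest training size hints at high variance, while two low scores hint at high
bias. A minimal sketch (illustrative only, not part of the added file),
continuing the session::

    >>> train_mean = np.mean(train_scores, axis=1)
    >>> valid_mean = np.mean(valid_scores, axis=1)
    >>> # train/validation gap at the largest training size; a large value
    >>> # suggests high variance, two low means suggest high bias
    >>> gap = train_mean[-1] - valid_mean[-1]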

examples/plot_learning_curve.py

Lines changed: 33 additions & 23 deletions

@@ -3,27 +3,19 @@
Plotting Learning Curves
========================

-A learning curve shows the validation and training score of a learning
-algorithm for varying numbers of training samples. It is a tool to
-find out how much we benefit from adding more training data. If both
-the validation score and the training score converge to a value that is
-too low, we will not benefit much from more training data and we will
-probably have to use a learning algorithm or a parametrization of the
-current learning algorithm that can learn more complex concepts (i.e.
-has a lower bias).
-
-In this example, on the left side the learning curve of a naive Bayes
-classifier is shown for the digits dataset. Note that the training score
-and the cross-validation score are both not very good at the end. However,
-the shape of the curve can be found in more complex datasets very often:
-the training score is very high at the beginning and decreases and the
-cross-validation score is very low at the beginning and increases. On the
-right side we see the learning curve of an SVM with RBF kernel. We can
-see clearly that the training score is still around the maximum and the
-validation score could be increased with more training samples.
+On the left side the learning curve of a naive Bayes classifier is shown for
+the digits dataset. Note that the training score and the cross-validation score
+are both not very good at the end. However, the shape of the curve can be found
+in more complex datasets very often: the training score is very high at the
+beginning and decreases and the cross-validation score is very low at the
+beginning and increases. On the right side we see the learning curve of an SVM
+with RBF kernel. We can see clearly that the training score is still around
+the maximum and the validation score could be increased with more training
+samples.
"""
print(__doc__)

+import numpy as np
import matplotlib.pyplot as plt
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

@@ -40,18 +32,36 @@
plt.ylabel("Score")
train_sizes, train_scores, test_scores = learning_curve(
    GaussianNB(), X, y, cv=10, n_jobs=1)
-plt.plot(train_sizes, train_scores, label="Training score")
-plt.plot(train_sizes, test_scores, label="Cross-validation score")
+train_scores_mean = np.mean(train_scores, axis=1)
+train_scores_std = np.std(train_scores, axis=1)
+test_scores_mean = np.mean(test_scores, axis=1)
+test_scores_std = np.std(test_scores, axis=1)
+plt.plot(train_sizes, train_scores_mean, label="Training score", color="r")
+plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
+                 train_scores_mean + train_scores_std, alpha=0.2, color="r")
+plt.plot(train_sizes, test_scores_mean, label="Cross-validation score",
+         color="g")
+plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
+                 test_scores_mean + test_scores_std, alpha=0.2, color="g")
plt.legend(loc="best")

plt.figure()
plt.title("Learning Curve (SVM, RBF kernel, $\gamma=0.001$)")
plt.xlabel("Training examples")
plt.ylabel("Score")
train_sizes, train_scores, test_scores = learning_curve(
-    SVC(gamma=0.001), X, y, cv=10, n_jobs=1)
-plt.plot(train_sizes, train_scores, label="Training score")
-plt.plot(train_sizes, test_scores, label="Cross-validation score")
+    SVC(gamma=0.001), X, y, cv=10, n_jobs=4)
+train_scores_mean = np.mean(train_scores, axis=1)
+train_scores_std = np.std(train_scores, axis=1)
+test_scores_mean = np.mean(test_scores, axis=1)
+test_scores_std = np.std(test_scores, axis=1)
+plt.plot(train_sizes, train_scores_mean, label="Training score", color="r")
+plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
+                 train_scores_mean + train_scores_std, alpha=0.2, color="r")
+plt.plot(train_sizes, test_scores_mean, label="Cross-validation score",
+         color="g")
+plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
+                 test_scores_mean + test_scores_std, alpha=0.2, color="g")
plt.legend(loc="best")

plt.show()
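Both figures repeat the same mean/std band plotting pattern. A small helper along
the following lines could factor it out; this is only a sketch with an invented
name, not part of the committed example:

    import numpy as np
    import matplotlib.pyplot as plt

    def plot_score_band(x, scores, label, color):
        """Plot the mean score across CV folds with a +/- one std band."""
        mean = np.mean(scores, axis=1)
        std = np.std(scores, axis=1)
        plt.plot(x, mean, label=label, color=color)
        plt.fill_between(x, mean - std, mean + std, alpha=0.2, color=color)

    # usage, reusing the arrays computed by learning_curve above:
    # plot_score_band(train_sizes, train_scores, "Training score", "r")
    # plot_score_band(train_sizes, test_scores, "Cross-validation score", "g")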
examples/plot_polynomial_regression.py

Lines changed: 57 additions & 0 deletions

@@ -0,0 +1,57 @@
"""
=====================
Polynomial Regression
=====================

This example demonstrates how we can use linear regression with polynomial
features to approximate nonlinear functions. The plot shows the function that
we want to approximate, which is a part of the cosine function. In addition,
the samples from the real function and the approximations of different models
are displayed. The models have polynomial features of different degrees. We
can see that a linear function (polynomial with degree 1) is not sufficient
to fit the training samples. This is called **underfitting**. A polynomial of
degree 4 approximates the true function almost perfectly. However, for higher
degrees the model will **overfit** the training data, i.e. it learns the
noise of the training data.
"""
print(__doc__)

import numpy as np
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression


np.random.seed(0)

n_samples = 30
degrees = [1, 4, 15]

true_fun = lambda X: np.cos(1.5 * np.pi * X)
X = np.sort(np.random.rand(n_samples))
y = true_fun(X) + np.random.randn(n_samples) * 0.1

plt.figure(figsize=(14, 4))
for i in range(len(degrees)):
    ax = plt.subplot(1, len(degrees), i+1)
    plt.setp(ax, xticks=(), yticks=())

    polynomial_features = PolynomialFeatures(degree=degrees[i],
                                             include_bias=False)
    linear_regression = LinearRegression()
    pipeline = Pipeline([("polynomial_features", polynomial_features),
                         ("linear_regression", linear_regression)])
    pipeline.fit(X[:, np.newaxis], y)

    X_test = np.linspace(0, 1, 100)
    plt.plot(X_test, pipeline.predict(X_test[:, np.newaxis]), label="Model")
    plt.plot(X_test, true_fun(X_test), label="True function")
    plt.scatter(X, y, label="Samples")
    plt.xlabel("x")
    plt.ylabel("y")
    plt.xlim((0, 1))
    plt.ylim((-2, 2))
    plt.legend(loc="best")
    plt.title("Degree %d" % degrees[i])
plt.show()
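To attach numbers to the under-/overfitting behaviour described in the docstring,
each degree can additionally be scored with cross-validation. A rough sketch (not
part of the committed example), continuing with the X, y and degrees defined
above and using the sklearn.cross_validation API of that release:

    from sklearn.cross_validation import cross_val_score

    for degree in degrees:
        model = Pipeline([
            ("polynomial_features", PolynomialFeatures(degree=degree,
                                                       include_bias=False)),
            ("linear_regression", LinearRegression()),
        ])
        # the "mean_squared_error" scorer negates the error so greater is better
        scores = cross_val_score(model, X[:, np.newaxis], y,
                                 scoring="mean_squared_error", cv=10)
        print("degree %d: MSE = %.4f (+/- %.4f)"
              % (degree, -scores.mean(), scores.std()))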

examples/plot_validation_curve.py

Lines changed: 46 additions & 0 deletions

@@ -0,0 +1,46 @@
"""
==========================
Plotting Validation Curves
==========================

In this plot you can see the training scores and validation scores of an SVM
for different values of the kernel parameter gamma. For very low values of
gamma, you can see that both the training score and the validation score are
low. This is called underfitting. Medium values of gamma will result in high
values for both scores, i.e. the classifier is performing fairly well. If gamma
is too high, the classifier will overfit, which means that the training score
is good but the validation score is poor.
"""
print(__doc__)

import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_digits
from sklearn.svm import SVC
from sklearn.learning_curve import validation_curve

digits = load_digits()
X, y = digits.data, digits.target

param_range = np.logspace(-6, -1, 5)
train_scores, test_scores = validation_curve(
    SVC(), X, y, param_name="gamma", param_range=param_range,
    cv=10, scoring="accuracy", n_jobs=1)
train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)

plt.title("Validation Curve with SVM")
plt.xlabel("$\gamma$")
plt.ylabel("Score")
plt.ylim(0.0, 1.1)
plt.semilogx(param_range, train_scores_mean, label="Training score", color="r")
plt.fill_between(param_range, train_scores_mean - train_scores_std,
                 train_scores_mean + train_scores_std, alpha=0.2, color="r")
plt.semilogx(param_range, test_scores_mean, label="Cross-validation score",
             color="g")
plt.fill_between(param_range, test_scores_mean - test_scores_std,
                 test_scores_mean + test_scores_std, alpha=0.2, color="g")
plt.legend(loc="best")
plt.show()
