Description
I hope this is not a duplicate issue, but it's something I've been thinking about for a while. It would be nice to have a utility function for generating learning curves: both hyperparameter value vs. score/error, and number of training samples vs. score/error. It's something I do by hand very often. I think an interface similar to grid search would work well; it could look like this:
Some setup code:
import numpy as np
from sklearn.linear_model import RidgeClassifier
from sklearn.datasets import load_digits

digits = load_digits()
X, y = digits.data, digits.target
clf = RidgeClassifier()
Here's the first function we should create (arguments have the same meaning as in GridSearchCV):
alpha_range = np.logspace(-2, 2, 50)
train_score, test_score = validation_curve(clf, X, y,
                                           param_grid={'alpha': alpha_range},
                                           scoring='accuracy', cv=10)
import matplotlib.pyplot as plt

plt.semilogx(alpha_range, train_score, label='train')
plt.semilogx(alpha_range, test_score, label='test')
plt.legend()
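For concreteness, here's a rough sketch of what `validation_curve` might do internally, built only from pieces that already exist (`clone`, `get_scorer`, `KFold`). The function name, signature, and the choice to average per-fold scores are all illustrative, not a committed design:

```python
import numpy as np
from sklearn.base import clone
from sklearn.metrics import get_scorer
from sklearn.model_selection import KFold

def validation_curve_sketch(estimator, X, y, param_grid,
                            scoring='accuracy', cv=10):
    """Illustrative only: mean train/test scores over one parameter's range."""
    (param_name, param_range), = param_grid.items()  # single-parameter grid
    scorer = get_scorer(scoring)
    folds = list(KFold(n_splits=cv, shuffle=True, random_state=0).split(X))
    train_scores, test_scores = [], []
    for value in param_range:
        fold_train, fold_test = [], []
        for train_idx, test_idx in folds:
            est = clone(estimator).set_params(**{param_name: value})
            est.fit(X[train_idx], y[train_idx])
            # Score on the data the model saw and on the held-out fold.
            fold_train.append(scorer(est, X[train_idx], y[train_idx]))
            fold_test.append(scorer(est, X[test_idx], y[test_idx]))
        train_scores.append(np.mean(fold_train))
        test_scores.append(np.mean(fold_test))
    return np.array(train_scores), np.array(test_scores)
```

A real implementation would probably also return the raw per-fold scores rather than just the means, but this is the basic loop.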
Here's the second function we should create (same argument meanings):
N_range = np.arange(10, int(0.8 * X.shape[0]), 10)
train_score, test_score = learning_curve(clf, X, y, N_range,
                                         scoring='accuracy', cv=10)
plt.plot(N_range, train_score, label='train')
plt.plot(N_range, test_score, label='test')
plt.legend()
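In the same spirit, here's a minimal sketch of what `learning_curve` could look like, assuming it cross-validates as usual but fits each model on only the first n samples of each training fold; again, every name here is illustrative:

```python
import numpy as np
from sklearn.base import clone
from sklearn.metrics import get_scorer
from sklearn.model_selection import KFold

def learning_curve_sketch(estimator, X, y, train_sizes,
                          scoring='accuracy', cv=10):
    """Illustrative only: mean train/test scores as the training set grows."""
    scorer = get_scorer(scoring)
    folds = list(KFold(n_splits=cv, shuffle=True, random_state=0).split(X))
    train_scores, test_scores = [], []
    for n in train_sizes:
        fold_train, fold_test = [], []
        for train_idx, test_idx in folds:
            subset = train_idx[:n]  # first n samples of this training fold
            est = clone(estimator).fit(X[subset], y[subset])
            fold_train.append(scorer(est, X[subset], y[subset]))
            # Test on the full held-out fold regardless of training size.
            fold_test.append(scorer(est, X[test_idx], y[test_idx]))
        train_scores.append(np.mean(fold_train))
        test_scores.append(np.mean(fold_test))
    return np.array(train_scores), np.array(test_scores)
```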
I use this sort of thing all the time, both in tutorials and in practice; it would be nice to have it available as a convenience routine. Any thoughts on this?