[MRG] Learning curves #2701
Conversation
from .metrics.scorer import _deprecate_loss_and_score_funcs


def learning_curve(estimator, X, y, n_samples_range=None, step_size=1,
                   n_cv_folds=10, loss_func=None, scoring=None,
Rather than n_cv_folds, use cv with check_cv(...) as in cross_validation.cross_val_score. This allows different cv strategies.
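A rough sketch of what this suggestion could look like (illustrative only; it uses the old `sklearn.cross_validation` module that this PR targets, and `check_cv` was later moved to `sklearn.model_selection` with a different signature):

```python
from sklearn.base import is_classifier
from sklearn.cross_validation import check_cv  # old module path from the era of this PR
from sklearn.datasets import load_iris
from sklearn.svm import SVC

iris = load_iris()
X, y = iris.data, iris.target
estimator = SVC(kernel="linear")

# check_cv accepts an int, None, or any CV iterator and returns a concrete
# CV strategy, using StratifiedKFold for classifiers and KFold otherwise.
cv = check_cv(3, X, y, classifier=is_classifier(estimator))
for train_indices, test_indices in cv:
    print(len(train_indices), len(test_indices))
```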
I was looking for something like that. Great tip, thanks!
I'm not sure exactly what you're asking here. There is an outstanding issue to rename
Thanks for the tips. The learning curve with naive Bayes on the digits data set looks really nice now. There are two sections in the code that overlap a little bit with the code in BaseGridSearch (because they are copied from it). There could be a way to refactor the code so that we don't have clones.
"but is within [%f, %f]." | ||
% (n_min_required_samples, | ||
n_max_required_samples)) | ||
n_samples_range = (n_samples_range * |
You should probably apply np.unique to this, as there's no point evaluating the same size twice if they've provided a higher resolution than is meaningful.
However, maybe this will break users' expectations in terms of output shape.
I will document that behavior.
Or you can fit as few times as possible, but report duplicate points in the output, with something like:
ticks, inverse = np.unique(ticks, return_inverse=True)
...
return np.take(result, inverse)
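A self-contained toy version of that trick, with hypothetical tick values and the expensive fit/score step replaced by a cheap stand-in:

```python
import numpy as np

# Requested training-set sizes, possibly containing duplicates after rounding.
ticks = np.array([10, 10, 25, 50, 50, 50])

# Evaluate each unique size only once...
unique_ticks, inverse = np.unique(ticks, return_inverse=True)
result = np.array([t ** 2 for t in unique_ticks])  # stand-in for one fit/score per size

# ...then map the results back onto the original (duplicated) positions,
# so the output shape still matches what the caller asked for.
expanded = np.take(result, inverse)
print(expanded)  # [ 100  100  625 2500 2500 2500]
```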
The learning curves for exploit_incremental_learning=True and =False (comparison plots not preserved here).
Actually I set n_iter=1. I will take a look, but not before Saturday.
There's a test for equivalence with other learning rates at https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/tests/test_sgd.py#L570, but as far as I can tell, the same is not tested for partial_fit.
Oh, I found the error. I assumed that the CV always generates train/test splits of equal sizes, which is obviously not true. In this case it is 1612 / 185.
I hope that is understandable now.
Awesome. Thanks a lot. Apart from my minor comment, +1 for merge.
I'm really happy with the code now. Thanks for the great reviews and advice.
Just wanted to say, this is a great contribution @AlexanderFabisch! Thanks for your patience :)
Thank you. :)
I did a final read-through of the code. 👍 for merge! Thanks for all the work on this! It's going to be extremely useful for my sklearn tutorials.
@AlexanderFabisch thanks for the great contribution :) I'm really glad that you are also happy with the code now! OT: @ogrisel @larsmans @GaelVaroquaux what is the current merging policy? Rebase? Squash to a single commit and rebase? Green button? I think I would vote for squashing, in particular as this went to and fro a bit.
I think this one should ideally be squashed. A belated thought: if we also had timing in here, it would be really simple to use as a generic benchmarking script. WDYT? This could go in a separate PR, but if it's done by default, it'll change the output signature.
I would prefer making that an optional part. Usually you make a learning curve to see how good the performance (score) is and not to measure training times. Anyway, we should collect these ideas in the corresponding issue.
I'm certainly fine with that. I just don't see the need to duplicate this.
Pushed as a876682. Now thinking about the PR I have to rebase against this ;) Thanks again @AlexanderFabisch
Some documentation was forgotten here: the feature's addition to whats_new and modules/classes. A nice-to-have would be a mention in the narrative doc. I'll push a whats_new and classes entry (unless someone tells me not to): 7de3d96
There is a case where the training subset might only contain a single class and may require stratified sampling. Consider the example below:

from sklearn.learning_curve import learning_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, n_features=5, n_informative=3,
                           n_classes=3, weights=[0.15, 0.5, 0.35], shuffle=False)
clf = LogisticRegression()
train_sizes, train_score, test_score = learning_curve(clf, X, y)

This results in an exception when fitting on the smallest training subsets. While StratifiedKFold with k=3 has been used here by default, the further splitting at line 211 (where the growing training subsets are sliced from each training fold) is not stratified. In this example, shuffle=False keeps the samples ordered by class, so the smallest training subsets contain samples from only one class. This poses a problem for estimators such as LogisticRegression that cannot be fit on a single class. Of course, one could significantly reduce the probability of having a single class in the training subset, for example by shuffling the samples beforehand, but that does not remove the problem entirely.
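A possible interim workaround, sketched under the assumption that the training fold is sliced in order (it reuses the old `sklearn.learning_curve` module path from the era of this PR; shuffling only reduces, and does not eliminate, the chance of a single-class training subset):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.learning_curve import learning_curve
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, n_features=5, n_informative=3,
                           n_classes=3, weights=[0.15, 0.5, 0.35], shuffle=False)

# Shuffle the samples once before computing the learning curve so that the
# ordered-by-class layout produced by shuffle=False cannot dominate the
# smallest training subsets.
rng = np.random.RandomState(0)
permutation = rng.permutation(len(y))
X, y = X[permutation], y[permutation]

train_sizes, train_scores, test_scores = learning_curve(LogisticRegression(), X, y)
```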
That is on my todo list. My first idea would be to add an additional argument.
Nice catch, Louis. I think shuffling the train indices by default is a good idea.
@AlexanderFabisch can I help on this in any way? Separately, I am wondering if learning_curve should also return the scores of all CV runs.
Of course you can help me. I'm still busy with #2736 at the moment. So you could start with it already and open a pull request with a solution. Send me a message as soon as it is open. Returning the score of all runs seems to be a good idea. I'm wondering if it should be the default or if we should introduce another function that returns all scores.
        if scorer is not None:
            this_score = scorer(clf, X_test, y_test)


def _fit(fit_function, X_train, y_train, **fit_params):
I agree with @amueller that it is not great style to have a helper function that is a 4-liner just to avoid an "if" statement in the main code.
The reason that I dislike such style is that when reading the calling code, you don't know what this function is doing, especially given the fact that '_fit' as a name doesn't tell what this does differently from calling the fit_function. All in all, the code is riddled with those small helper functions, and I find this hard to read.
This can be reconsidered in #2736. I'm not a big fan of these mini-helpers either.
+1 as well.
That has been fixed in #2736 already.
Sorry for the late comment. I only noticed this now because I'm resolving merge conflicts between this code and PR #2759.
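For readers without the full diff, a hypothetical reconstruction of the kind of four-line helper being discussed (the actual body is not shown in this hunk), next to the inlined form the reviewers prefer:

```python
# Hypothetical reconstruction, not the PR's actual code: a tiny wrapper whose
# only job is to hide an `if` that handles unsupervised estimators (y is None).
def _fit(fit_function, X_train, y_train, **fit_params):
    if y_train is None:
        return fit_function(X_train, **fit_params)
    else:
        return fit_function(X_train, y_train, **fit_params)

# The inlined alternative keeps the branch visible at the call site:
#
#     if y_train is None:
#         estimator.fit(X_train, **fit_params)
#     else:
#         estimator.fit(X_train, y_train, **fit_params)
```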
You are right, we should document that. I would like to keep that option because it is sometimes interesting to see how the estimator performs online for different training set sizes if you have a large amount of data. In addition, there are incremental learning algorithms that do not need multiple iterations, e.g. some incremental variants of Gaussian processes. We should not limit the functionality only because there is no good use case in sklearn (yet).
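For illustration, here is roughly how the incremental path could be exercised with an estimator that implements partial_fit (a sketch only; the parameter names follow the API as eventually merged, e.g. train_sizes and exploit_incremental_learning, and may differ from the draft signature in this PR):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.learning_curve import learning_curve
from sklearn.naive_bayes import MultinomialNB

digits = load_digits()
X, y = digits.data, digits.target

# MultinomialNB implements partial_fit, so each fold's model can be updated
# with the additional samples instead of being refit from scratch at every size.
train_sizes, train_scores, test_scores = learning_curve(
    MultinomialNB(), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=10,
    exploit_incremental_learning=True)
```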
After thinking about it, the
... or maybe
This is the pull request for the function learning_curve that has been proposed in issue #2584. I would like to get feedback on the interface and code. Here is an example with naive Bayes on the digits dataset:
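The original plot is not preserved in this export; a sketch of how such a figure could be produced (assuming the API as it was eventually merged, so the exact call may differ from this PR's draft):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.learning_curve import learning_curve
from sklearn.naive_bayes import GaussianNB

digits = load_digits()
X, y = digits.data, digits.target

# Compute training and test scores for 10 training-set sizes with 10-fold CV.
train_sizes, train_scores, test_scores = learning_curve(
    GaussianNB(), X, y, train_sizes=np.linspace(0.1, 1.0, 10), cv=10)

# Average over the CV folds and plot both curves against the training size.
plt.plot(train_sizes, np.mean(train_scores, axis=1), label="training score")
plt.plot(train_sizes, np.mean(test_scores, axis=1), label="test score")
plt.xlabel("number of training samples")
plt.ylabel("accuracy")
plt.legend(loc="best")
plt.show()
```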