[MRG] Learning curves #2701

Closed
wants to merge 30 commits into from

Conversation

AlexanderFabisch
Member

This is the pull request for the function learning_curve that has been proposed in issue #2584. I would like to get feedback on the interface and code.

Here is an example with naive Bayes on the digits dataset:

[image: learning curve]

from .metrics.scorer import _deprecate_loss_and_score_funcs

def learning_curve(estimator, X, y, n_samples_range=None, step_size=1,
                   n_cv_folds=10, loss_func=None, scoring=None,
Member

Rather than n_cv_folds, use cv with check_cv(...) as in cross_validation.cross_val_score. This allows different cv strategies.
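A minimal sketch of that pattern, written against the current sklearn.model_selection.check_cv signature (in 2014 the helper lived in sklearn.cross_validation); the estimator and data here are only placeholders:

from sklearn.datasets import load_digits
from sklearn.model_selection import check_cv
from sklearn.naive_bayes import GaussianNB

X, y = load_digits(return_X_y=True)

# check_cv accepts what cross_val_score accepts: an int, a CV splitter
# object, or an iterable of (train, test) index arrays.
cv = check_cv(cv=10, y=y, classifier=True)

for train, test in cv.split(X, y):
    score = GaussianNB().fit(X[train], y[train]).score(X[test], y[test])
    print(score)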

Member Author

I was looking for something like that. Great tip, thanks!

@jnothman
Member

jnothman commented Jan 1, 2014

> Where can we put the stuff that occurs in *SearchCV and in learning_curve?

I'm not sure exactly what you're asking here. There is an outstanding issue to rename grid_search to model_selection or similar, but what else belongs in that module/package is unclear.

@AlexanderFabisch
Member Author

Thanks for the tips. The learning curve with naive Bayes on the digits data set looks really nice now:

[image: learning curve]

There are two sections in the code that overlap a little bit with the code in BaseGridSearch (because they are copied from it). There could be a way to refactor the code so that we don't have clones.

"but is within [%f, %f]."
% (n_min_required_samples,
n_max_required_samples))
n_samples_range = (n_samples_range *
Member

You should probably apply np.unique to this, as there's no point evaluating the same size twice if they've provided a higher resolution than is meaningful.

Member

However maybe this will break users' expectations in terms of output shape.

Member Author

I will document that behavior.

Member

Or you can fit as few times as possible, but report duplicate points in the output, with something like:

ticks, inverse = np.unique(ticks, return_inverse=True)
...
return np.take(result, inverse)
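
To spell out the suggestion (a sketch with made-up numbers, not the PR's actual code): fit only for the unique sizes, then expand the results with the inverse mapping so the output still has one entry per requested size.

import numpy as np

# Requested training-set sizes, possibly containing duplicates after rounding.
requested = np.array([10, 20, 20, 30, 30, 30])

# Fit only once per unique size ...
sizes, inverse = np.unique(requested, return_inverse=True)
scores_for_unique = np.array([0.70, 0.80, 0.85])  # pretend results, one per unique size

# ... but report one (possibly duplicated) score per requested size.
scores = np.take(scores_for_unique, inverse)
# scores == [0.70, 0.80, 0.80, 0.85, 0.85, 0.85]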

@AlexanderFabisch
Member Author

The learning curves for exploit_incremental_learning=True and =False with a PassiveAggressiveClassifier are not exactly the same. I am not completely sure how that can happen. Is there any reason why they should differ?

@jnothman
Member

jnothman commented Jan 2, 2014

fit makes multiple (n_iter) passes over the data; partial_fit makes one pass (I think, but am not certain, that this is the only difference). This should be better documented, and you're welcome to submit a PR to fix it.
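
For illustration, a sketch of that difference with SGDClassifier (parameter names follow current sklearn, where n_iter has become max_iter; fit also shuffles between passes, so the two models need not be bit-identical):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=200, random_state=0)
classes = np.unique(y)

# fit() makes max_iter passes over the data ...
clf_multi_pass = SGDClassifier(max_iter=5, random_state=0).fit(X, y)

# ... while a single partial_fit() call makes exactly one pass, so several
# calls are needed to see a comparable amount of data.
clf_one_pass_each = SGDClassifier(random_state=0)
for _ in range(5):
    clf_one_pass_each.partial_fit(X, y, classes=classes)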

@AlexanderFabisch
Member Author

Actually I set n_iter=1. I will take a look, but not before Saturday.

@jnothman
Member

jnothman commented Jan 3, 2014

There's a test for equivalence with other learning rates at https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/tests/test_sgd.py#L570, but as far as I can tell, the same is not tested for passive_aggressive. But I still can't see a reason they should differ.

@AlexanderFabisch
Member Author

Oh, I found the error. I assumed that the CV always generates train/test splits of equal sizes, which is obviously not true. In this case the splits are:

1612 / 185
1614 / 183
1616 / 181
1617 / 180
1618 / 179
1618 / 179
1618 / 179
1619 / 178
1620 / 177
1621 / 176
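
The unequal splits are easy to reproduce (a sketch using current class names; the exact sizes above depend on the CV strategy and sklearn version used in the PR):

from sklearn.datasets import load_digits
from sklearn.model_selection import StratifiedKFold

X, y = load_digits(return_X_y=True)  # 1797 samples, not divisible by 10

for train, test in StratifiedKFold(n_splits=10).split(X, y):
    print(len(train), "/", len(test))  # train/test sizes differ between folds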

@AlexanderFabisch
Member Author

I hope that is understandable now.

@amueller
Member

amueller commented Jan 7, 2014

Awesome, thanks a lot. Apart from my minor comment, +1 for merge.

@AlexanderFabisch
Member Author

I'm really happy with the code now. Thanks for the great reviews and advice.

@glouppe
Contributor

glouppe commented Jan 8, 2014

Just wanted to say, this is a great contribution @AlexanderFabisch! Thanks for your patience :)

@AlexanderFabisch
Member Author

Thank you. :)

@jakevdp
Member

jakevdp commented Jan 8, 2014

I did a final read-through of the code. 👍 for merge! Thanks for all the work on this! It's going to be extremely useful for my sklearn tutorials.

@amueller
Member

amueller commented Jan 8, 2014

@AlexanderFabisch thanks for the great contribution :) I'm really glad that you are also happy with the code now!

OT @ogrisel @larsmans @GaelVaroquaux what is the current merging policy? Rebase? Squash to a single commit and rebase? Green button? I think I would vote for squashing, in particular as this went to and fro a bit.

@jnothman
Member

jnothman commented Jan 8, 2014

I think this one should ideally be squashed.

A belated thought: if we also had timing in here, it would be really simple to use as a generic benchmarking script. WDYT? This could go in a separate PR, but if it's done by default, it'll change the output signature.

@AlexanderFabisch
Member Author

I would prefer making that an optional part. Usually you make a learning curve to see how good the performance (score) is, not to measure training times. Anyway, we should collect these ideas in the corresponding issue.

@jnothman
Member

jnothman commented Jan 8, 2014

I'm certainly fine with that. I just don't see the need to duplicate this procedure to do timing assessment, and current benchmarks like bench_tree.py would give more stable results if they used this CV procedure.

@amueller
Member

amueller commented Jan 8, 2014

Pushed as a876682, after thinking about the PR I have to rebase against this ;)
Still, I would be interested in the opinion of the others.

Thanks again @AlexanderFabisch

@amueller closed this Jan 8, 2014
@jnothman
Member

jnothman commented Jan 9, 2014

Some documentation was forgotten here: the feature's addition to whats_new and modules/classes. A nice-to-have would be a mention in the narrative doc. I'll push a whats_new and classes entry (unless someone tells me not to): 7de3d96

@ltiao
Contributor

ltiao commented Jan 10, 2014

There is a case where the training subset might contain only a single class, which may require stratified sampling.

Consider the example below:

from sklearn.learning_curve import learning_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, n_features=5, n_informative=3, n_classes=3, weights=[0.15, 0.5, 0.35], shuffle=False)

clf = LogisticRegression()

train_sizes, train_score, test_score = learning_curve(clf, X, y)

This results in the following exception:

Traceback (most recent call last):
  File "evaluation.py", line 33, in <module>
    train_sizes, train_score, test_score = learning_curve(LogisticRegression(), X, y)
  File "/Users/louistiao/.virtualenvs/numerical/lib/python2.7/site-packages/sklearn/learning_curve.py", line 132, in learning_curve
    for train, test in cv for n_train_samples in train_sizes_abs)
  File "/Users/louistiao/.virtualenvs/numerical/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 595, in __call__
    self.dispatch(function, args, kwargs)
  File "/Users/louistiao/.virtualenvs/numerical/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 364, in dispatch
    job = ImmediateApply(func, args, kwargs)
  File "/Users/louistiao/.virtualenvs/numerical/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 129, in __init__
    self.results = func(*args, **kwargs)
  File "/Users/louistiao/.virtualenvs/numerical/lib/python2.7/site-packages/sklearn/learning_curve.py", line 209, in _fit_estimator
    _fit(estimator.fit, X_train, y_train)
  File "/Users/louistiao/.virtualenvs/numerical/lib/python2.7/site-packages/sklearn/grid_search.py", line 295, in _fit
    fit_function(X_train, y_train, **fit_params)
  File "/Users/louistiao/.virtualenvs/numerical/lib/python2.7/site-packages/sklearn/svm/base.py", line 659, in fit
    raise ValueError("The number of classes has to be greater than"
ValueError: The number of classes has to be greater than one.

While StratifiedKFold with k=3 is used here by default, the further splitting at line 211 of learning_curve.py is obviously not guaranteed to include all classes, or even more than one class, in the training subsets.

In this example, y_train=[0 0 0 0 0 0] after executing the _split() in the above line.

This poses a problem for estimators such as LogisticRegression when the class frequency distribution is somewhat skewed.

Of course, one could significantly reduce the probability of having only a single class in the training subset resulting from _split() by shuffling the dataset (e.g. setting shuffle=True in make_classification above), but this would still not guarantee the inclusion of all classes.

@AlexanderFabisch
Member Author

That is on my todo list. My first idea would be to add an additional argument stratify which would result in an order like

class 1
class 2
...
class n
class 1
class 2
...
class n
...
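
A hypothetical sketch of that ordering (interleave_by_class is not part of sklearn, just an illustration): reorder the training indices class by class so that every prefix of the training set contains samples from as many classes as possible.

import numpy as np

def interleave_by_class(y):
    # Order: class 1, class 2, ..., class n, class 1, class 2, ...
    per_class = [np.flatnonzero(y == c) for c in np.unique(y)]
    longest = max(len(idx) for idx in per_class)
    order = [idx[i] for i in range(longest) for idx in per_class if i < len(idx)]
    return np.array(order)

y = np.array([0, 0, 0, 0, 1, 1, 2])
print(y[interleave_by_class(y)])  # [0 1 2 0 1 0 0]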

@jnothman
Member

Nice catch, Louis. I think shuffling the train indices by default is a reasonable approach.

@ltiao
Contributor

ltiao commented Jan 13, 2014

@AlexanderFabisch can I help on this in any way?

Separately, I am wondering if learning_curve() ought to have a return value similar to sklearn.cross_validation.cross_val_score(), i.e. array of float, shape=(n_ticks, len(list(cv))) (or its transpose) so that the standard deviation may also be obtained?
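
For illustration, a sketch of what such per-fold scores would enable (the numbers are made up, with the proposed shape (n_ticks, n_folds)):

import numpy as np

# One row per training-set size (tick), one column per CV fold.
test_scores = np.array([[0.81, 0.79, 0.83],
                        [0.88, 0.86, 0.90],
                        [0.91, 0.92, 0.90]])

mean_curve = test_scores.mean(axis=1)  # the learning curve itself
std_curve = test_scores.std(axis=1)    # per-tick spread, e.g. for error bars
print(mean_curve, std_curve)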

@AlexanderFabisch
Member Author

Of course you can help me. I'm still busy with #2736 at the moment, so you could already start on it and open a pull request with a solution. Send me a message as soon as it is open.

Returning the scores of all runs seems to be a good idea. I'm wondering whether it should be the default or whether we should introduce another function that returns all scores; learning_curve could then call this function and average over all CV runs.


if scorer is not None:
    this_score = scorer(clf, X_test, y_test)

def _fit(fit_function, X_train, y_train, **fit_params):
Member

I agree with @amueller that it is not great style to have a helper function that is a 4-liner just to avoid an "if" statement in the main code.

The reason that I dislike such style is that when reading the calling code, you don't know what this function is doing, especially given that '_fit' as a name doesn't tell you what it does differently from calling fit_function directly. All in all, the code is riddled with these small helper functions, and I find this hard to read.

Member

This can be reconsidered in #2736. I'm not a big fan of these mini-helpers either.

Member

+1 as well.

Member Author

That has been fixed in #2736 already.

@mblondel
Member

mblondel commented Feb 7, 2014

Sorry for the late comment. I only noticed this now because I'm resolving merge conflicts between this code and PR #2759.

learning_curve with exploit_incremental_learning=True is only equal to learning_curve with exploit_incremental_learning=False if the user sets n_iter=1 in the estimator, since partial_fit uses each instance only once. I think this is not clear at all from the documentation. Also, the point of doing a learning curve is to judge whether we would benefit from adding more training data, but using n_iter=1 will return very pessimistic results. So I would personally not trust the results when using exploit_incremental_learning=True.

@AlexanderFabisch
Member Author

You are right, we should document that.

I would like to keep that option because, if you have a large amount of data, it is sometimes interesting to see how the estimator performs online for different training set sizes. In addition, there are incremental learning algorithms that do not need multiple iterations, e.g. some incremental variants of Gaussian processes. We should not limit the functionality only because there is no good use case in sklearn (yet).

@mblondel
Member

mblondel commented Feb 9, 2014

After thinking about it, exploit_incremental_learning=True is a good idea for MultinomialNB, since the results should be exactly the same as with exploit_incremental_learning=False. BTW, the option name is really long. I think incremental_learning or incremental_fit would be shorter yet explicit enough.

@AlexanderFabisch
Member Author

... or maybe partial_fit?
