[MRG] Learning curves #2701
Conversation
from .metrics.scorer import _deprecate_loss_and_score_funcs


def learning_curve(estimator, X, y, n_samples_range=None, step_size=1,
                   n_cv_folds=10, loss_func=None, scoring=None,
Rather than n_cv_folds, use cv with check_cv(...) as in cross_validation.cross_val_score. This allows different cv strategies.
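A rough sketch of what this suggestion could look like (illustrative only; it uses the old `sklearn.cross_validation` module that this PR targets, and `check_cv` was later moved to `sklearn.model_selection` with a different signature):

```python
from sklearn.base import is_classifier
from sklearn.cross_validation import check_cv  # old module path from the era of this PR
from sklearn.datasets import load_iris
from sklearn.svm import SVC

iris = load_iris()
X, y = iris.data, iris.target
estimator = SVC(kernel="linear")

# check_cv accepts an int, None, or any CV iterator and returns a concrete
# CV strategy, using StratifiedKFold for classifiers and KFold otherwise.
cv = check_cv(3, X, y, classifier=is_classifier(estimator))
for train_indices, test_indices in cv:
    print(len(train_indices), len(test_indices))
```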
I was looking for something like that. Great tip, thanks!
I'm not sure exactly what you're asking here. There is an outstanding issue to rename
Thanks for the tips. The learning curve with naive Bayes on the digits data set looks really nice now. There are two sections in the code that overlap a little bit with the code in BaseGridSearch (because they are copied from it). There could be a way to refactor the code so that we don't have clones.
"but is within [%f, %f]." | ||
% (n_min_required_samples, | ||
n_max_required_samples)) | ||
n_samples_range = (n_samples_range * |
You should probably apply np.unique to this, as there's no point evaluating the same size twice if they've provided a higher resolution than is meaningful.
However, maybe this will break users' expectations in terms of output shape.
I will document that behavior.
Or you can fit as few times as possible, but report duplicate points in the output, with something like:
ticks, inverse = np.unique(ticks, return_inverse=True)
...
return np.take(result, inverse)
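A self-contained toy version of that trick, with hypothetical tick values and the expensive fit/score step replaced by a cheap stand-in:

```python
import numpy as np

# Requested training-set sizes, possibly containing duplicates after rounding.
ticks = np.array([10, 10, 25, 50, 50, 50])

# Evaluate each unique size only once...
unique_ticks, inverse = np.unique(ticks, return_inverse=True)
result = np.array([t ** 2 for t in unique_ticks])  # stand-in for one fit/score per size

# ...then map the results back onto the original (duplicated) positions,
# so the output shape still matches what the caller asked for.
expanded = np.take(result, inverse)
print(expanded)  # [ 100  100  625 2500 2500 2500]
```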
The learning curves for exploit_incremental_learning=True and =False (comparison plots not preserved here).
Actually I set n_iter=1. I will take a look, but not before Saturday.
There's a test for equivalence with other learning rates at https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/tests/test_sgd.py#L570, but as far as I can tell, the same is not tested for partial_fit.
Oh, I found the error. I assumed that the CV always generates train/test splits of equal sizes, which is obviously not true. In this case it is 1612 / 185.
I hope that is understandable now.
Awesome. Thanks a lot. Apart from my minor comment, +1 for merge.
I'm really happy with the code now. Thanks for the great reviews and advice.
Just wanted to say, this is a great contribution @AlexanderFabisch! Thanks for your patience :)
Thank you. :)
I did a final read-through of the code. 👍 for merge! Thanks for all the work on this! It's going to be extremely useful for my sklearn tutorials.
@AlexanderFabisch thanks for the great contribution :) I'm really glad that you are also happy with the code now! OT: @ogrisel @larsmans @GaelVaroquaux what is the current merging policy? Rebase? Squash to a single commit and rebase? Green button? I think I would vote for squashing, in particular as this went to and fro a bit.
I think this one should ideally be squashed. A belated thought: if we also had timing in here, it would be really simple to use as a generic benchmarking script. WDYT? This could go in a separate PR, but if it's done by default, it'll change the output signature.
I would prefer making that an optional part. Usually you make a learning curve to see how good the performance (score) is and not to measure training times. Anyway, we should collect these ideas in the corresponding issue.
I'm certainly fine with that. I just don't see the need to duplicate this.
Pushed as a876682. Now thinking about the PR I have to rebase against this ;) Thanks again @AlexanderFabisch
Some documentation was forgotten here: the feature's addition to whats_new and modules/classes. A nice-to-have would be a mention in the narrative doc. I'll push a whats_new and classes entry (unless someone tells me not to): 7de3d96
There is a case where the training subset might only contain a single class and may require stratified sampling. Consider the example below:

from sklearn.learning_curve import learning_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, n_features=5, n_informative=3,
                           n_classes=3, weights=[0.15, 0.5, 0.35], shuffle=False)
clf = LogisticRegression()
train_sizes, train_score, test_score = learning_curve(clf, X, y)

This results in an exception when fitting on the smallest training subsets. While StratifiedKFold with k=3 has been used here by default, the further splitting at line 211 (where the growing training subsets are sliced from each training fold) is not stratified. In this example, shuffle=False keeps the samples ordered by class, so the smallest training subsets contain samples from only one class. This poses a problem for estimators such as LogisticRegression that cannot be fit on a single class. Of course, one could significantly reduce the probability of having a single class in the training subset, for example by shuffling the samples beforehand, but that does not remove the problem entirely.
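A possible interim workaround, sketched under the assumption that the training fold is sliced in order (it reuses the old `sklearn.learning_curve` module path from the era of this PR; shuffling only reduces, and does not eliminate, the chance of a single-class training subset):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.learning_curve import learning_curve
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, n_features=5, n_informative=3,
                           n_classes=3, weights=[0.15, 0.5, 0.35], shuffle=False)

# Shuffle the samples once before computing the learning curve so that the
# ordered-by-class layout produced by shuffle=False cannot dominate the
# smallest training subsets.
rng = np.random.RandomState(0)
permutation = rng.permutation(len(y))
X, y = X[permutation], y[permutation]

train_sizes, train_scores, test_scores = learning_curve(LogisticRegression(), X, y)
```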
That is on my todo list. My first idea would be to add an additional argument.
Nice catch, Louis. I think shuffling the train indices by default is a good idea.
@AlexanderFabisch can I help on this in any way? Separately, I am wondering if learning_curve should also return the scores of all CV runs.
Of course you can help me. I'm still busy with #2736 at the moment. So you could start with it already and open a pull request with a solution. Send me a message as soon as it is open. Returning the score of all runs seems to be a good idea. I'm wondering if it should be the default or if we should introduce another function that returns all scores.
        if scorer is not None:
            this_score = scorer(clf, X_test, y_test)


def _fit(fit_function, X_train, y_train, **fit_params):
I agree with @amueller that it is not great style to have a helper function that is a 4-liner just to avoid an "if" statement in the main code.
The reason that I dislike such style is that when reading the calling code, you don't know what this function is doing, especially given the fact that '_fit' as a name doesn't tell what this does differently from calling the fit_function. All in all, the code is riddled with those small helper functions, and I find this hard to read.
This can be reconsidered in #2736. I'm not a big fan of these mini-helpers either.
+1 as well.
That has been fixed in #2736 already.
Sorry for the late comment. I only noticed this now because I'm resolving merge conflicts between this code and PR #2759.
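For readers without the full diff, a hypothetical reconstruction of the kind of four-line helper being discussed (the actual body is not shown in this hunk), next to the inlined form the reviewers prefer:

```python
# Hypothetical reconstruction, not the PR's actual code: a tiny wrapper whose
# only job is to hide an `if` that handles unsupervised estimators (y is None).
def _fit(fit_function, X_train, y_train, **fit_params):
    if y_train is None:
        return fit_function(X_train, **fit_params)
    else:
        return fit_function(X_train, y_train, **fit_params)

# The inlined alternative keeps the branch visible at the call site:
#
#     if y_train is None:
#         estimator.fit(X_train, **fit_params)
#     else:
#         estimator.fit(X_train, y_train, **fit_params)
```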
You are right, we should document that. I would like to keep that option because it is sometimes interesting to see how the estimator performs online for different training set sizes if you have a large amount of data. In addition, there are incremental learning algorithms that do not need multiple iterations, e.g. some incremental variants of Gaussian processes. We should not limit the functionality only because there is no good use case in sklearn (yet).
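For illustration, here is roughly how the incremental path could be exercised with an estimator that implements partial_fit (a sketch only; the parameter names follow the API as eventually merged, e.g. train_sizes and exploit_incremental_learning, and may differ from the draft signature in this PR):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.learning_curve import learning_curve
from sklearn.naive_bayes import MultinomialNB

digits = load_digits()
X, y = digits.data, digits.target

# MultinomialNB implements partial_fit, so each fold's model can be updated
# with the additional samples instead of being refit from scratch at every size.
train_sizes, train_scores, test_scores = learning_curve(
    MultinomialNB(), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=10,
    exploit_incremental_learning=True)
```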
After thinking about it, the
... or maybe
This is the pull request for the function learning_curve that has been proposed in issue #2584. I would like to get feedback on the interface and code. Here is an example with naive Bayes on the digits dataset:
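The original plot is not preserved in this export; a sketch of how such a figure could be produced (assuming the API as it was eventually merged, so the exact call may differ from this PR's draft):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.learning_curve import learning_curve
from sklearn.naive_bayes import GaussianNB

digits = load_digits()
X, y = digits.data, digits.target

# Compute training and test scores for 10 training-set sizes with 10-fold CV.
train_sizes, train_scores, test_scores = learning_curve(
    GaussianNB(), X, y, train_sizes=np.linspace(0.1, 1.0, 10), cv=10)

# Average over the CV folds and plot both curves against the training size.
plt.plot(train_sizes, np.mean(train_scores, axis=1), label="training score")
plt.plot(train_sizes, np.mean(test_scores, axis=1), label="test score")
plt.xlabel("number of training samples")
plt.ylabel("accuracy")
plt.legend(loc="best")
plt.show()
```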