
[MRG+1] DOC: Added Nested Cross Validation Example #7111


Merged
merged 1 commit into scikit-learn:master on Sep 25, 2016

Conversation

mlliou112
Contributor

Reference Issue

Fixes #5589

What does this implement/fix? Explain your changes.

Example of nested cross-validation using the new model selection module: SVC on the digits dataset.

Any other comments?

This is my first contribution to scikit-learn, so I am by no means an expert on nested cross-validation, but I read as much as I could and tried to follow all the contribution guidelines.

I happily welcome any constructive comments, large or small!

@amueller
Member

I think the idea was more to use cross_val_score(GridSearchCV(...)).
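For readers following along, here is a minimal sketch of that pattern (the SVC parameter grid and fold counts are illustrative, not the final example's values): the inner GridSearchCV does the hyperparameter selection, and the outer cross_val_score evaluates that whole selection procedure.

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

iris = load_iris()
X, y = iris.data, iris.target
param_grid = {"C": [1, 10, 100], "gamma": [0.01, 0.1]}

# Inner loop: GridSearchCV picks hyperparameters by cross-validation.
inner_cv = KFold(n_splits=4, shuffle=True, random_state=0)
clf = GridSearchCV(SVC(kernel="rbf"), param_grid=param_grid, cv=inner_cv)

# Outer loop: cross_val_score scores the entire search procedure.
outer_cv = KFold(n_splits=4, shuffle=True, random_state=1)
nested_scores = cross_val_score(clf, X, y, cv=outer_cv)
print(nested_scores.mean())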

@mlliou112
Contributor Author

Okay! There's a comment at the bottom of the code that shows the cross_val_score(GridSearchCV(...)) way. I thought it might be helpful to show the long way as well, to demonstrate how to get the optimized hyperparameters on each inner iteration. Would it be preferred to remove that? I'm afraid the example would then be too short; is there anything else I can add?

@raghavrv
Member

raghavrv commented Aug 1, 2016

Thanks for the PR :)

Yes, as Andy suggests, conveying that nested CV has become simpler thanks to our new CV iterators would be more helpful.

I think using LOLO-CV would be a good idea, as you can show that we can pass labels for nested CV... Nested CV using KFold can be done with v0.17 itself.

I remember speaking with @ogrisel IRL at PyData Paris. He suggested a good toy problem would be to predict the sepal/petal length/width using the other 3 features, with the iris classes as group labels. With this you can show that it is useful to use the LOLO cross-validator, as the measurements depend on the iris class.

@amueller
Member

amueller commented Aug 1, 2016

I feel that task would be confusing to people new to ML / sklearn.

@raghavrv
Member

raghavrv commented Aug 2, 2016

Alright. In that case @mlliou112, you could additionally show how the scores without nested CV are not reliable, and how nested CV reveals that inconsistency in scores for different parameter settings. That would be a nice way to highlight the importance of nested CV...

@mlliou112
Contributor Author

mlliou112 commented Aug 2, 2016

Great, sounds good. I'll work on it and push changes when I have them.

On a separate note, do you have any idea why the test is failing on CircleCI? Also, is this something I should be worried about, and how should I approach fixing it?

@mlliou112
Contributor Author

Alright, sorry for the delay. In this change, I've tried to illustrate the slight optimistic bias of non-nested CV versus nested CV, especially when the splits are made on a small dataset such as iris. I also tried to narrow down how many parameter values GridSearchCV optimizes over, to cut down on running time.

@jnothman
Member

Don't worry about the delay. This looks okay, though I'm not sure we should give the impression that those differences are significant. The text is also a bit verbose. The thing to emphasise is that taking the max over multiple parameter settings in grid search is liable to over-fit, often yielding an over-estimate of generalisation error.

@mlliou112
Contributor Author

I trimmed the text to make it more concise. Let me know what you think!

@jnothman
Member

I'll take a look at this later, but it looks like your example is failing tests.

@amueller
Member

I don't want to make the text too long, but I think there should be a connection between doing a "train/test" split and doing a "train/validation/test" split.

@amueller
Member

Also, it might be important to point out what the result of nested cross-validation is. It doesn't yield a model or even a best parameter setting, so you don't get a model that you could use on new data.

It approximates the generalization error of the GridSearchCV estimator. I'm not sure there is a better way to say that. Can someone come up with a good use-case of this?
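One possible use-case, sketched under illustrative parameter choices (not the merged example's exact values): the nested scores quantify the GridSearchCV procedure itself, and a separate refit on all the data yields the model you would actually deploy.

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

iris = load_iris()
p_grid = {"C": [1, 10, 100], "gamma": [0.01, 0.1]}

clf = GridSearchCV(SVC(kernel="rbf"), param_grid=p_grid,
                   cv=KFold(n_splits=4, shuffle=True, random_state=0))

# Nested CV: scores only -- an estimate of how well the *search procedure*
# generalizes. No model and no single "best" parameter setting comes out.
nested_scores = cross_val_score(clf, iris.data, iris.target,
                                cv=KFold(n_splits=4, shuffle=True, random_state=1))
print("estimated generalization score: %0.3f" % nested_scores.mean())

# To get a model usable on new data, refit the search on all of the data;
# the nested scores above are the honest estimate of how it will perform.
clf.fit(iris.data, iris.target)
print(clf.best_params_)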

@betatim
Member

betatim commented Sep 6, 2016

I was looking for a demo like this, so I like it! I then wanted to take it one step further and shuffled the labels, expecting that nested_scores.mean() would be close to 1/3. Is my expectation wrong?

The original use-case that made me think about this is:

  1. set aside a test set,
  2. do *SearchCV on the rest
  3. predict your leaderboard score using the test set; if it is higher than your current best, submit to Kaggle

Later on, return and repeat steps 2 and 3.

However that setup is subtly different from the one in this example. So yeah.
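A rough sketch of that leaderboard workflow (iris stands in for the competition data; the parameter grid and the current_best threshold are placeholders):

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

iris = load_iris()  # stand-in for the competition data
X, y = iris.data, iris.target

# 1. set aside a test set once
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# 2. do *SearchCV on the rest
search = GridSearchCV(SVC(kernel="rbf"), {"C": [1, 10, 100]}, cv=5)
search.fit(X_rest, y_rest)

# 3. predict your leaderboard score using the held-out test set;
#    only submit if it beats the current best
current_best = 0.90  # placeholder: your last submitted score
if search.score(X_test, y_test) > current_best:
    print("submit to Kaggle")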

@mlliou112
Contributor Author

@betatim Your expectation is correct. What numbers are you getting? I added a few lines of np.random.shuffle(y_iris) after loading the dataset, and the mean is reasonably around 1/3. nested_scores.mean() will have high variance since the dataset is relatively small and has only 3 labels.
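For reference, a hedged reconstruction of that shuffled-label sanity check (the grid values are illustrative; the shuffle lines are the only addition being discussed):

import numpy as np

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

iris = load_iris()
X_iris, y_iris = iris.data, iris.target

# Break the relationship between features and labels.
rng = np.random.RandomState(0)
rng.shuffle(y_iris)

p_grid = {"C": [1, 10, 100], "gamma": [0.01, 0.1]}
inner_cv = KFold(n_splits=4, shuffle=True, random_state=0)
outer_cv = KFold(n_splits=4, shuffle=True, random_state=1)

clf = GridSearchCV(SVC(kernel="rbf"), param_grid=p_grid, cv=inner_cv)
nested_scores = cross_val_score(clf, X_iris, y_iris, cv=outer_cv)

# With shuffled labels the nested score should hover around chance (~1/3),
# with noticeable variance because the dataset is small.
print(nested_scores.mean())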

@amueller I added two sentences with your suggestions. Let me know if they are appropriately placed!

@betatim
Member

betatim commented Sep 9, 2016

Trying various seeds, it fluctuates up and down. I should have tried that first. What I learnt from this is that sem(nested_scores) is not representative of the uncertainty of the mean, which is interesting.

@raghavrv
Member

I think this would be nice to have in 0.18. Could @jnothman mark this so?

@jnothman
Member

This will likely miss the RC, @raghavrv, but we might be able to throw it in for final.

@jnothman jnothman added this to the 0.18 milestone Sep 12, 2016
@raghavrv
Member

@mlliou112 you with us? :)

@mlliou112
Contributor Author

Yes! Following the 0.18 release has certainly been exciting. :) Is there anything else I need to do for this, or to help with in general?

@jnothman
Member

You should refer to this example from the narrative docs on model selection.

performance of non-nested and nested CV strategies by taking the difference
between their scores.

See Also
Member

These headings are too big. Can we use .. topic instead?

This example compares non-nested and nested cross-validation strategies on a
classifier of the iris data set. Nested cross-validation (CV) is often used to
train a model in which hyperparameters also need to be optimized. Nested CV
approximates the generalization error of the resulting estimator of the
Member

"approximates" -> "estimates"

How about "estimates the generalization error of the underlying model and its (hyper)parameter search"?

Contributor Author

👍 I also changed other occurrences of "hyperparameter" -> (hyper)parameter.

Member

We're inconsistent on this terminology. "Hyperparameter" is a Bayesian term that has recently become more popular, for clarity, in the rest of the ML community. But scikit-learn uses "get_params" and "set_params" to operate on these things, and not on the model parameters in the Bayesian sense, so I find it a bit strange to call them hyperparameters.

train a model in which hyperparameters also need to be optimized. Nested CV
approximates the generalization error of the resulting estimator of the
hyperparameter search. It is generally best practice to use nested CV as
non-nested CV will sometimes provide a slightly more biased and optimistic
Member

I don't think we need "slightly" and "sometimes".

Choosing the parameters that maximise non-nested CV biases it to the dataset, yielding an overly-optimistic score.

This notion is repeated multiple times here and I think you need to work on making it succinct.

Contributor Author

👍 I have specified "it" -> "the model" and chose "maximize" over "maximise"... (sorry UK/Canada, just to be consistent with the rest of the example)

Member

Rereading it, "them" (i.e. the parameters) would have worked instead of "it". I didn't write that with the intention that you would necessarily copy it verbatim, so your edits to my paraphrases are welcome. Even when they arbitrarily side with Webster.

non-nested CV will sometimes provide a slightly more biased and optimistic
score.

In contrast to non-nested CV, an inner CV loop is introduced that partitions
Member

How about:

Model selection without nesting CV involves evaluating the model's performance on data that is also used to tune the model. Information may thus "leak" into the model and overfit the data. The magnitude of this effect is primarily dependent on the size of the dataset and the stability of the model. See Cawley and Talbot [1]_ for an analysis of these issues.

Nested CV effectively uses a series of train/validation/test set splits. Score is approximately maximised in fitting a model to each training set, and then directly maximised in selecting hyperparameters over the validation set. Assessing performance on a held-out test set avoids evaluating a model on data that has been used to tune it. Generalization error is estimated by averaging test set scores over several dataset splits.

Contributor Author

👍 .

First paragraph: I altered first sentence to emphasize the data part.
"Model selection without nested CV uses the same data to tune model parameters and evaluate model performance."

Second paragraph: I would really like to emphasize where the test splits are relative to the inner and outer CV loops, so I added prepositional phrases. I also removed the sentence "Assessing performance..." and added a transitional phrase before the paragraph that I think demonstrates the same thing.

Again, "s" -> "z" 🇺🇸 :p

the dataset and the stability of the model. For more quantitative detail of
potential bias when tuning parameters, see this paper. [1]_

Each iteration of the inner CV loop will provide the best estimator for the
Member

Drop this sentence, perhaps this paragraph.

Contributor Author

Dropped the paragraph. It was a point of confusion for me when first starting, so I thought I would point it out explicitly, but I'm not at all attached to it.

Member

It's true of CV generally. If you feel that's not clear enough in the narrative docs, propose a change there? I'd rather the example text be to the point.

@mlliou112
Contributor Author

@jnothman Thanks for the review. I took most of your suggestions, with some minor alterations (see above). Let me know what you think.

@@ -79,6 +79,10 @@ evaluated and the best combination is retained.
classifier (here a linear SVM trained with SGD with either elastic
net or L2 penalty) using a :class:`pipeline.Pipeline` instance.

- See :ref:`example_model_selection_plot_nested_cross_validation_iris.py`
Member

Add "This is best practice for evaluating the performance of a model with grid search."

[1]_ for an analysis of these issues.

To avoid this problem, nested CV effectively uses a series of
train/validation/test set splits. In the inner loop, score is approximately
Member

"score" -> "the score"


# Choose cross-validation techniques for the inner and outer loops,
# independently of the dataset.
# E.g "LabelKFold", "LeaveOneOut","LeaveOneLabelOut", etc.
Member

space after comma.

@jnothman
Member

otherwise LGTM

# Choose cross-validation techniques for the inner and outer loops,
# independently of the dataset.
# E.g "LabelKFold", "LeaveOneOut", "LeaveOneLabelOut", etc.
inner_cv = KFold(n_folds=4, shuffle=True, random_state=i)
@TomDLT (Member) Sep 22, 2016

n_splits instead of n_folds
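For context, KFold lives in sklearn.model_selection in 0.18 and its argument is n_splits rather than the old n_folds; assuming i is the trial index from the example's loop, the corrected lines would look like:

from sklearn.model_selection import KFold

i = 0  # trial index from the example's loop over random states
inner_cv = KFold(n_splits=4, shuffle=True, random_state=i)
outer_cv = KFold(n_splits=4, shuffle=True, random_state=i)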

@mlliou112
Contributor Author

Aye, I think I messed up this branch/PR. There were new commits to cross_validation.rst and to the KFold(...) signature that affected this example, so I tried to rebase onto the master commits.

How do I fix this? (And what would have been the better way to integrate those changes?) 🙏 Should I just make a new PR?

@TomDLT
Member

TomDLT commented Sep 23, 2016

To fix the mess:

# go to your branch
git checkout your_branch
# always do a backup
git checkout -b your_branch_back_up
# squash your 9 commits into one
git rebase -i HEAD~9
# Now you have only one commit. Note its SHA (40 hexadecimal characters, or a unique prefix)
git log -n 1
# delete master
git branch -D master
# and download a fresh new version of master
git fetch upstream master:master
# go to master
git checkout master
# delete your branch (you have a backup)
git branch -D your_branch
# recreate your branch from fresh master
git checkout -b your_branch
# apply only your last commit with its SHA
git cherry-pick SHA
# force push
git push -f origin your_branch

In the future, when you need to rebase:

# go to master
git checkout master
# update master
git pull --rebase upstream master
# go to your branch (never work on master)
git checkout your_branch
# rebase your branch on master
git rebase master
# solve your conflicts and force push
git push -f origin your_branch

Highly recommended reading: http://docs.scipy.org/doc/numpy/dev/gitwash/development_workflow.html

@mlliou112 mlliou112 force-pushed the nested_cross_val_example branch from ec77e9f to dd8fa72 Compare September 23, 2016 15:15
@mlliou112
Contributor Author

@TomDLT Thanks very much! The development workflow article is much more helpful than anything that I found.

@TomDLT TomDLT changed the title [MRG] DOC: Added Nested Cross Validation Example [MRG+1] DOC: Added Nested Cross Validation Example Sep 23, 2016
@jnothman
Member

LGTM, thanks @mlliou112!

Merging. @amueller, please backport.

@jnothman jnothman merged commit 5fcc7e5 into scikit-learn:master Sep 25, 2016
@raghavrv
Member

Thanks @mlliou112 :)

@amueller
Member

@TomDLT can you add the git stuff as a link to the contributing docs?

TomDLT pushed a commit to TomDLT/scikit-learn that referenced this pull request Oct 3, 2016
yarikoptic added a commit to yarikoptic/scikit-learn that referenced this pull request Nov 10, 2016
yarikoptic added a commit to yarikoptic/scikit-learn that referenced this pull request Nov 10, 2016
yarikoptic added a commit to yarikoptic/scikit-learn that referenced this pull request Nov 10, 2016
@johnny5550822

@amueller @mlliou112 I read the nested cross-validation material (http://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html) and it is really good! Do you know if I can get the best_params_ out of each fold of the outer cross-validation loop during nested cross-validation? I want to use nested cross-validation to identify the best hyperparameters.

@raghavrv
Member

raghavrv commented Jan 4, 2017

@johnny5550822 The outer loop of the nested cross-validation is done only to evaluate the best models chosen by the inner loop. Refer to this Stack Overflow answer.

@johnny5550822

@raghavrv ya, Hmm......so what should I do if I want to get the best hyperparameters?

@raghavrv
Member

raghavrv commented Jan 4, 2017

To select the best hyperparameters, you simply do one cross-validation. To evaluate this 'selection', you do the outer cross-validation and infer whether your selection can be trusted. Read the lower part of the above SO answer for what to look for in the outer CV.

If from those inferences you conclude that your selection can't be trusted, you tweak a different set of parameters / choose a different model / do more feature engineering.

Maybe @GaelVaroquaux @jnothman @amueller would be able to answer you in more detail. But unless I am mistaken, that is the crux of nested CV: do the selection in the inner CV, and use the outer CV to see whether your selection can be trusted.

@johnny5550822

@raghavrv "To select the best hyper params, you simply do one cross-validation. To evaluate this 'selection', you do the outer cross-validation and infer if your selection can be trusted" Are you suggesting I use the inner-loop cross validation to identify the best hyperparameters? @GaelVaroquaux @jnothman @amueller

@raghavrv
Member

raghavrv commented Jan 4, 2017

Are you suggesting I use the inner-loop cross validation to identify the best hyperparameters?

Yes...

@johnny5550822

@raghavrv But how can I do this in scikit-learn? In the nested CV example, it seems like GridSearchCV is wrapped by cross_val_score. I don't know how to obtain the hyperparameters.

clf = GridSearchCV(estimator=svr, param_grid=p_grid, cv=inner_cv)
nested_score = cross_val_score(clf, scoring='roc_auc', X=features, y=labels, cv=outer_cv)
nested_scores[i] = nested_score.mean()

@raghavrv
Member

raghavrv commented Jan 4, 2017

Just do clf.fit(features, labels).best_params_

(But please do wait for replies from others to gain more clarity...)
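As a small self-contained sketch of that one-liner (iris stands in for features/labels; the grid is illustrative): GridSearchCV.fit returns the fitted search object, so the attribute access chains.

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.svm import SVC

iris = load_iris()
p_grid = {"C": [1, 10, 100], "gamma": [0.01, 0.1]}
inner_cv = KFold(n_splits=4, shuffle=True, random_state=0)

# fit() returns the fitted GridSearchCV, so best_params_ can be read directly.
clf = GridSearchCV(estimator=SVC(kernel="rbf"), param_grid=p_grid, cv=inner_cv)
print(clf.fit(iris.data, iris.target).best_params_)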

@jnothman
Member

jnothman commented Jan 5, 2017 via email

@johnny5550822

johnny5550822 commented Jan 5, 2017

@jnothman To be clear, that is to say nested cross-validation is not really the way to do hyperparameter selection (instead, we use repeated simple cross-validation). Rather, nested cross-validation not only provides an estimated performance of the model, it also tells you roughly "what is the simple cross-validation performance of the hyperparameter selection?" Thus, the nested cross-validation will give something like xx ± x%. Am I right?

Also, when we construct the nested CV, we pass clf into cross_val_score (see below); there is no repeated trial inside the inner CV loop, right (i.e., just one CV run for each possible combination of parameters inside the GridSearchCV)?

clf = GridSearchCV(estimator=svr, param_grid=p_grid, cv=inner_cv)
nested_score = cross_val_score(clf, X=X_iris, y=y_iris, cv=outer_cv)

@mlliou112
Contributor Author

@johnny5550822 I think what you're saying is correct. I'd caution against thinking of nested and simple CV as different methods of cross-validation. Nested CV is really just simple CV done twice, for two things at once: estimating hyperparameters and evaluating the parameter-selection strategy. (I'll re-emphasize that the "optimal parameters" given by the inner CV may not be the same for all outer CV loops.)

Yes, the 3-fold CV is done just once per combination of parameters, and reports the average performance over those 3 folds.
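To make that concrete, here is a hedged sketch (not part of the merged example) that unrolls the outer loop by hand so the per-fold winning parameters can be inspected:

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.svm import SVC

iris = load_iris()
X, y = iris.data, iris.target
p_grid = {"C": [1, 10, 100], "gamma": [0.01, 0.1]}

inner_cv = KFold(n_splits=4, shuffle=True, random_state=0)
outer_cv = KFold(n_splits=4, shuffle=True, random_state=1)

for fold, (train, test) in enumerate(outer_cv.split(X, y)):
    gs = GridSearchCV(SVC(kernel="rbf"), param_grid=p_grid, cv=inner_cv)
    gs.fit(X[train], y[train])
    # The winning parameters can differ from one outer fold to the next;
    # the outer test score evaluates the selection procedure, not one model.
    print(fold, gs.best_params_, gs.score(X[test], y[test]))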

@johnny5550822

@mlliou112 Great. I wanted to make sure the inner loop is done just once, because I originally thought it was repeated. The nested CV suggested in this paper is a repeated nested CV, i.e. the inner loop is repeated several times before identifying the best score (https://jcheminf.springeropen.com/articles/10.1186/1758-2946-6-10). Maybe this could be part of future work, allowing the user to choose whether or not to repeat.

@anselal

anselal commented Jun 12, 2017

@mlliou112

(I'll reemphasize that the "optimal parameters" given by the inner CV may not be the same for all outer CV loops.)

You are absolutely right, and that is exactly the problem. You cannot get the best parameters of a model while using nested cross-validation, since the selected models could be different in each outer loop. A good way to visualize this is with this image I found: https://mlr-org.github.io/mlr-tutorial/release/html/img/nested_resampling.png

No one guarantees that the split of the inner CV will be the same in every outer CV loop, and thus neither will the hyperparameters nor the model itself.

So I think there are two solutions:

  1. you just do a GridSearchCV with cross-validation and you are done. You are left with the best model, the best parameters, every development set, etc.
  2. you do a nested cross-validation and just get the mean scores.

Have I understood it correctly ???

Sundrique pushed a commit to Sundrique/scikit-learn that referenced this pull request Jun 14, 2017
paulha pushed a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017
Development

Successfully merging this pull request may close these issues.

Example of nested cross-validation
8 participants