DOC add example to show how to deal with cross-validation #18821
Conversation
I found that our documentation was missing an example showing how to inspect model parameters obtained through cross-validation and what to watch out for while inspecting them.
We have a similar analysis in the example on interpreting the coefficients of linear models.
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Lucy Liu <jliu176@gmail.com>
# regressor is also optimized via another internal cross-validation. This
# process is called a nested cross-validation and should always be implemented
# whenever model's parameters need to be optimized.
Nested cross-validation takes much longer to train, and I do not know that it is always recommended.
In this example, the nested cross-validation only searches through one parameter, and that parameter does not vary across folds, so the model selection process settles on alpha=40.
In general, one can search through many parameter combinations, which would lead to different models, i.e., one model for each loop of the outer fold. I do not think there is a way to decide which hyper-parameter combination one should use in this case (unless you are ensembling them).
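For concreteness, here is a minimal sketch of the nested cross-validation pattern under discussion: an inner `GridSearchCV` tunes `alpha` on each training fold while an outer `cross_validate` evaluates the whole tuning procedure. The dataset, estimator, and alpha grid below are illustrative placeholders, not the ones used in this PR.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold, cross_validate

X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

inner_cv = KFold(n_splits=5, shuffle=True, random_state=0)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)

# The inner search tunes `alpha`; the outer loop scores the tuning procedure.
search = GridSearchCV(
    Ridge(), param_grid={"alpha": np.logspace(-2, 2, num=20)}, cv=inner_cv
)
cv_results = cross_validate(search, X, y, cv=outer_cv, return_estimator=True)

# One tuned alpha per outer fold; their spread tells us how stable the
# model selection is across folds.
print([est.best_params_["alpha"] for est in cv_results["estimator"]])
```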
> In general, one can search through many parameter combinations, which would lead to different models, i.e., one model for each loop of the outer fold. I do not think there is a way to decide which hyper-parameter combination one should use in this case (unless you are ensembling them).

I see your point here.
I still think it is beneficial to do the nested cross-validation even with many hyper-parameters: the variability of the hyper-parameters would be informative, and I am not aware of a less costly strategy for this analysis. However, it is true that it would not necessarily help you choose a specific configuration.
Do you have a proposal to soften the statement?
Looking at this again, I think I am okay with the wording as is.
For me, the wording is a bit absolute. One could state that it is recommended, but certainly not "should always be implemented".
I think the techniques in this example are very specific to linear models and take advantage of RidgeCV's ability to do "CV for free".
The first part assesses how stable the algorithm is by running repeated k-fold cross-validation and looking at the scores and coefficients. I think this makes sense for seeing how the linear model behaves with this dataset.
When it gets to "Putting the model in production", I think this example suggests that stable coefficients and similar alpha values imply that we can move forward with putting the model into production. I do not think this is always the case: imagine the scores followed a bimodal distribution; in that case, one would select the hyper-parameter configuration that maximizes the score.
From my understanding, connecting the stability of the coefficients to "good for production" assumes that the application requires model inspection. I do not think this is true in general.
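For reference, a minimal sketch of the pattern described above: `RidgeCV` tunes `alpha` internally ("CV for free"), while an outer repeated k-fold gives a distribution of scores and of fitted coefficients. The data and the alpha grid are placeholders, not the example's actual setup.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import RepeatedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=300, n_features=15, noise=20.0, random_state=0)

# Standardize so the penalty acts homogeneously, then tune alpha internally.
model = make_pipeline(StandardScaler(), RidgeCV(alphas=np.logspace(-2, 3, num=30)))
cv = RepeatedKFold(n_splits=5, n_repeats=10, random_state=0)
cv_results = cross_validate(model, X, y, cv=cv, return_estimator=True)

# Distribution of test scores and of coefficients across the repeated folds.
scores = cv_results["test_score"]
coefs = np.array([est[-1].coef_ for est in cv_results["estimator"]])
print(scores.mean(), scores.std(), coefs.std(axis=0))
```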
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Minor comment, otherwise LGTM.
Some more comments, otherwise LGTM!
# To conclude, cross-validation allows us to answer to two questions: are the
# results reliable and, if it is, how good is the predictive model.
#
# Model inspection
This whole section is very specific to linear models. It would be nice to use some model-agnostic methods, too.
My initial thought when building the example was to show how to access a fitted attribute of one of the models fitted during cross-validation, so this is not really specific to linear models in that regard.
In terms of "model-agnostic" approaches, what message are you thinking of conveying?
How would you inspect gradient boosted models?
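One possible model-agnostic answer to that question is to inspect each model fitted during cross-validation with permutation importance rather than with coefficients. The sketch below is illustrative only (the gradient boosting estimator and the data are placeholders, not what this PR's example uses).

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import KFold, cross_validate

X, y = make_regression(n_samples=300, n_features=8, noise=15.0, random_state=0)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
cv_results = cross_validate(
    HistGradientBoostingRegressor(random_state=0),
    X,
    y,
    cv=cv,
    return_estimator=True,
)

# The splitter is deterministic, so its splits line up with the returned
# estimators; compute each fold's importances on its own test split.
for (train_idx, test_idx), est in zip(cv.split(X, y), cv_results["estimator"]):
    importances = permutation_importance(est, X[test_idx], y[test_idx], random_state=0)
    print(importances.importances_mean.round(2))
```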
# We see that the regularization parameter, `alpha`, values are centered and
# condensed around 40. This is a good sign and means that most of the models
I do not find the plot that convincing, as there are several values of alpha far beyond 40.
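As a side note, one way to check how condensed the tuned `alpha` values really are is to plot their distribution. This assumes a `cv_results` object from a `cross_validate` call with `return_estimator=True` on a `StandardScaler` + `RidgeCV` pipeline, as sketched earlier in this thread.

```python
import matplotlib.pyplot as plt
import numpy as np

# Collect the alpha selected by RidgeCV (last pipeline step) in every fold.
alphas = np.array([est[-1].alpha_ for est in cv_results["estimator"]])

plt.hist(alphas, bins=20)
plt.xlabel("tuned alpha per fold")
plt.ylabel("count")
plt.title(f"median alpha = {np.median(alphas):.1f}")
plt.show()
```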
# information about the variance of the model. It should never be used to
# evaluate the model itself.
What exactly is meant by that?
I think here I again wanted to point out that you get a single point and not a score distribution.
I'm still confused. Let's say you evaluate the MSE on the test set. Then one contribution to your MSE score comes from the variance of your model (the variance of the model's prediction, to be precise), and the others come from the bias term and the variance of the target.
If you mean the variance of the algorithm, that is something slightly different. Additionally, the variance of the coefficients could be inferred in-sample (not very unbiased with the cross-validated alpha, but still); for some reason, this is not done in scikit-learn.
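For reference, the decomposition this comment alludes to, assuming $y = f(x) + \varepsilon$ with $\operatorname{Var}(\varepsilon) = \sigma^2$ and expectation taken over training sets and noise:

$$
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
= \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
+ \underbrace{\operatorname{Var}\big(\hat{f}(x)\big)}_{\text{variance of the model's prediction}}
+ \underbrace{\sigma^2}_{\text{variance of the target}}
$$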
Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Hi @glemaitre, are you still interested in working on this PR? If so, here is a batch of comments.
Note that you will also have to merge main into this branch, as it is a bit outdated.
Co-authored-by: Arturo Amor <86408019+ArturoAmorQ@users.noreply.github.com>
Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
@glemaitre Do you intend to finish this one?
# on the coefficients. Thus, the penalty parameter `alpha` has to be tuned.
# More importantly, this parameter needs to be tuned for our specific problem:
# tuning on another dataset does not ensure an optimal parameter value for the
# current dataset.
Could this be phrased more clearly and elegantly?
#
# Here, we define a machine learning pipeline `model` made of a preprocessing
# stage to :ref:`standardize <preprocessing_scaler>` the data such that the
# regularization strength is applied homogeneously on each coefficient; followed
- # regularization strength is applied homogeneously on each coefficient; followed
+ # regularization strength `alpha` is applied homogeneously on each coefficient; followed
# production, it is a good practice and generally advisable to perform an
# unbiased evaluation of the performance of the model.
#
# Cross-validation should be used to make this analysis. First, it allows us to
"Fist, it allows..." What is the second point?
We should better state why and when to use cross-validation. For instance, if I have a huge amount of very identically distributed data, then I don't need CV. Also, it is still best practice to keep one test set away from CV to evaluate the final model (possibly retrained on the whole train-validation set) - as is done in this example.
Also, CV assesses more a training algorithm and not a single model.
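A minimal sketch of the workflow this comment recommends: keep a final test set aside, run cross-validation only on the remaining data, then refit on the whole train/validation portion and evaluate once on the held-out test set. The dataset and model are illustrative placeholders.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=400, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = make_pipeline(StandardScaler(), RidgeCV())

# Cross-validation assesses the training procedure on the train/validation data.
cv_scores = cross_val_score(model, X_train, y_train, cv=5)

# The final model is refit on all of it and evaluated once on the test set.
final_score = model.fit(X_train, y_train).score(X_test, y_test)
print(cv_scores.mean(), final_score)
```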
# interpret findings built on internal model's parameters. One possible cause
# of large variations are small sample sizes.
With perfectly i.i.d. data, I can trust the variance / confidence interval of a score, which is usually just a mean over the data.
A more important reason, IMO, is a violation of the i.i.d. assumption.
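As an aside, when the i.i.d. assumption is violated because samples come in groups (e.g. repeated measurements of the same subject), a group-aware splitter gives a more honest score distribution. The data and groups below are illustrative placeholders.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GroupKFold, cross_val_score

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
groups = np.repeat(np.arange(20), 10)  # 20 groups of 10 correlated samples

# Each fold keeps whole groups together instead of splitting them.
scores = cross_val_score(Ridge(), X, y, groups=groups, cv=GroupKFold(n_splits=5))
print(scores)
```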
# regressor. We can therefore use this predictive model as a baseline against
# more advanced machine learning pipelines.
#
# To conclude, cross-validation allows us to answer to two questions: are the
- # To conclude, cross-validation allows us to answer to two questions: are the
+ # To conclude, cross-validation allows us to answer two questions: are the
# more advanced machine learning pipelines.
#
# To conclude, cross-validation allows us to answer to two questions: are the
# results reliable and, if it is, how good is the algorithm used to create
- # results reliable and, if it is, how good is the algorithm used to create
+ # results reliable and, if they are, how good is the algorithm used to create
cv_pipelines = cv_results["estimator"]

# %%
# While the cross-validation allows us to know if our models are reliable, we
"models are reliable" sound more like reliability diagrams so I could rephrase a bit.
# However, you should be aware that this latest step does not give any
# information about the variance of our final model. Thus, if we want to
# evaluate our final model, we should get a distribution of scores if we would
# like to get this information.
What is "variance of our final model". Do you have the bias variance decomposition in mind?
How would we get the "distribution of scores"?
I'm inclined to remove this remark.
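One possible way to get a score distribution for a single, already-fitted model (the question raised above) is to bootstrap the held-out test set. This sketch assumes `model`, `X_test`, and `y_test` from the split sketched earlier in this thread; the number of bootstrap rounds is arbitrary.

```python
import numpy as np
from sklearn.metrics import r2_score

rng = np.random.RandomState(0)
y_pred = model.predict(X_test)

boot_scores = []
for _ in range(200):
    # Resample the test set with replacement and rescore the fixed predictions.
    idx = rng.randint(0, len(y_test), size=len(y_test))
    boot_scores.append(r2_score(y_test[idx], y_pred[idx]))

# Rough percentile interval for the final model's test score.
print(np.percentile(boot_scores, [2.5, 50, 97.5]))
```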
This new example shows which questions cross-validation can answer.
It also shows how to inspect model parameters and hyper-parameters and, more importantly, their variances.