
DOC add example to show how to deal with cross-validation #18821


Open · wants to merge 26 commits into main

Conversation

glemaitre
Member

This new example shows which questions cross-validation can answer.
It is also used to show how to inspect model parameters and hyperparameters and, more importantly, their variance.
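For readers skimming this thread, here is a minimal sketch of the kind of workflow the example describes; the dataset, alpha grid and CV settings below are placeholders, not necessarily the ones used in the example.

```python
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import RepeatedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True, as_frame=True)
model = make_pipeline(StandardScaler(), RidgeCV(alphas=[0.1, 1.0, 10.0, 100.0]))

# Repeated k-fold gives a distribution of scores and of fitted models,
# not a single point estimate.
cv = RepeatedKFold(n_splits=5, n_repeats=10, random_state=0)
cv_results = cross_validate(model, X, y, cv=cv, return_estimator=True)

scores = pd.Series(cv_results["test_score"])
alphas = pd.Series([est[-1].alpha_ for est in cv_results["estimator"]])
print(scores.describe())  # variability of the generalization score
print(alphas.describe())  # variability of the tuned hyperparameter
```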

@glemaitre
Member Author

I found that maybe our documentation was missing such an example regarding the way to inspect parameters and what you should do while inspecting them.

@glemaitre
Member Author

We have a similar analysis in the example on the interpretation of coefficients of linear models.
However, I think that the point here is to give a recipe for how to extract the information rather than to make an interpretation.
In this regard, I did not want to modify any other example so as not to change their take-home message.

glemaitre and others added 2 commits November 12, 2020 18:20
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Member

@lucyleeow lucyleeow left a comment

This is really interesting and useful. A lot of nitpicks - feel free to ignore the more opinionated ones!

Also, for some reason the seaborn boxplot is giving some warnings (screenshot attached).

Co-authored-by: Lucy Liu <jliu176@gmail.com>
Comment on lines +91 to +93
# regressor is also optimized via another internal cross-validation. This
# process is called a nested cross-validation and should always be implemented
# whenever model's parameters need to be optimized.
Member

Nested cross-validation takes much longer to train and I do not know if it is always recommended.

In this example, the nested cross-validation only searches through one parameter and that parameter does not vary across folds. This leads to a model selection process that settles on alpha=40.

In general, one can search through many parameter combinations, which would lead to different models, i.e., there would be one model for each loop of the outer fold. I do not think there is a way to decide which hyperparameter combination one should use in this case (unless you are ensembling them).
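To make the "one model per outer fold" point concrete, here is a rough sketch of such a nested cross-validation; the dataset, grid and fold counts are illustrative only, not taken from the PR.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)

# Inner loop: hyperparameter search; outer loop: evaluation of the whole
# search procedure. Each outer fold produces its own fitted search object.
inner_cv = KFold(n_splits=5, shuffle=True, random_state=0)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)
search = GridSearchCV(
    make_pipeline(StandardScaler(), Ridge()),
    param_grid={"ridge__alpha": [0.1, 1.0, 10.0, 100.0]},
    cv=inner_cv,
)
results = cross_validate(search, X, y, cv=outer_cv, return_estimator=True)

# One best_params_ per outer fold; they do not have to agree.
print([est.best_params_ for est in results["estimator"]])
```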

Member Author

> In general, one can search through many parameter combinations, which would lead to different models, i.e., there would be one model for each loop of the outer fold. I do not think there is a way to decide which hyperparameter combination one should use in this case (unless you are ensembling them).

I see your point here.

I still think it is beneficial to do the nested cross-validation even with many hyperparameters. The variability of the hyperparameters would be informative, and I am not aware of any other way to make this analysis with a less costly strategy. However, it is true that it would not necessarily help you in choosing a specific configuration.

Do you have a proposal to mitigate the statement?

Member

Looking at this again, I think I am okay with the wording as is.

Member

For me, the wording is a bit absolute. One could state that it is recommended, but certainly not "should always be implemented".

Member

@thomasjpfan thomasjpfan left a comment

I think the techniques in this example are very specific to linear models and take advantage of RidgeCV's ability to do "cv for free".

The first part shows how stable the algorithm is by using repeated k-fold cross-validation and looking at the scores and coefficients. I think this makes sense for seeing how the linear model behaves with this dataset.

When it gets to "Putting the model in production", I think this example suggests that stable coefficients and similar alpha values imply that we can move forward with putting the model into production. I do not think this is always the case. Imagine the distribution of scores were bimodal: in that case, one would select the hyperparameter config that maximizes the score.

From my understanding, connecting the stability of the coefficients to "good for production" means that the application requires model inspection. I do not think this is true in general?
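For context, the coefficient-stability check being discussed could look roughly like the snippet below, reusing the `cv_results` and `X` objects from the first sketch above (so a `StandardScaler` + `RidgeCV` pipeline is assumed).

```python
import pandas as pd

# `cv_results` as returned by cross_validate(..., return_estimator=True)
# on a StandardScaler + RidgeCV pipeline, as sketched earlier.
coefs = pd.DataFrame(
    [est[-1].coef_ for est in cv_results["estimator"]],
    columns=X.columns,
)
# A small spread across folds suggests stable coefficients; a wide or
# multimodal spread would be a warning sign for interpretation.
print(coefs.describe().loc[["mean", "std"]])
```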

Comment on lines +91 to +93
# regressor is also optimized via another internal cross-validation. This
# process is called a nested cross-validation and should always be implemented
# whenever model's parameters need to be optimized.
Member

Looking at this again, I think I am okay with the wording as is.

glemaitre and others added 3 commits January 6, 2021 17:54
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Base automatically changed from master to main January 22, 2021 10:53
Member

@thomasjpfan thomasjpfan left a comment

Minor comment, otherwise LGTM.

Member

@ogrisel ogrisel left a comment

Some more comments, otherwise LGTM!

# To conclude, cross-validation allows us to answer to two questions: are the
# results reliable and, if it is, how good is the predictive model.
#
# Model inspection
Member

This whole section is very specific to linear models. It would be nice to use some model-agnostic methods, too.

Member Author

@glemaitre glemaitre Aug 30, 2021

My initial thought when building the example was to show how to access a fitted attribute of one of the models fitted during cross-validation. So this is not really specific to linear models in this regard.

In terms of "model-agnostic" approaches, what message are you thinking of conveying?

Member

How would you inspect gradient boosted models?
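One model-agnostic option, not something the PR currently does, would be permutation importance computed per cross-validation fold; a hypothetical sketch:

```python
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import KFold

X, y = load_diabetes(return_X_y=True, as_frame=True)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

importances = []
for train_idx, test_idx in cv.split(X):
    model = HistGradientBoostingRegressor(random_state=0).fit(
        X.iloc[train_idx], y.iloc[train_idx]
    )
    result = permutation_importance(
        model, X.iloc[test_idx], y.iloc[test_idx], n_repeats=10, random_state=0
    )
    importances.append(result.importances_mean)

# The spread of importances across folds plays the same role as the spread
# of linear-model coefficients in the example.
print(pd.DataFrame(importances, columns=X.columns).describe())
```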

Comment on lines 158 to 159
# We see that the regularization parameter, `alpha`, values are centered and
# condensed around 40. This is a good sign and means that most of the models
Member

I find the plot not that convincing, as there are several values of alpha far beyond 40.

Comment on lines 216 to 217
# information about the variance of the model. It should never be used to
# evaluate the model itself.
Member

What exactly is meant by that?

Member Author

I think here I wanted again to point out that you get a single point and not a score distribution.

Member

I'm still confused. Let's say you evaluate the MSE on the test set. Then, one contribution to your MSE score comes from the variance of your model (the variance of the model's predictions, to be precise), the other from the bias term and the variance of the target.
If you mean the variance of the algorithm, that is something slightly different. Additionally, the variance of the coefficients could be inferred in-sample (not exactly unbiased with the cross-validated alpha, but still). But for some reason, this is not done in scikit-learn.

glemaitre and others added 2 commits August 30, 2021 21:05
Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
@cmarmo cmarmo added the "Waiting for Second Reviewer" label and removed the "Waiting for Reviewer" label on Oct 20, 2022
Member

@ArturoAmorQ ArturoAmorQ left a comment

Hi @glemaitre, are you still interested in working on this PR? If so, here is a batch of comments.

Note that you will also have to merge main into this branch, as it is a bit outdated.

glemaitre and others added 2 commits November 15, 2022 20:16
Co-authored-by: Arturo Amor <86408019+ArturoAmorQ@users.noreply.github.com>
Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
Member

@lorentzenchr lorentzenchr left a comment

@glemaitre Do you intend to finish this one?

Comment on lines +39 to +42
# on the coefficients. Thus, the penalty parameter `alpha` has to be tuned.
# More importantly, this parameter needs to be tuned for our specific problem:
# tuning on another dataset does not ensure an optimal parameter value for the
# current dataset.
Member

Could this be phrased more clearly and more elegantly?

#
# Here, we define a machine learning pipeline `model` made of a preprocessing
# stage to :ref:`standardize <preprocessing_scaler>` the data such that the
# regularization strength is applied homogeneously on each coefficient; followed
Member

Suggested change
# regularization strength is applied homogeneously on each coefficient; followed
# regularization strength `alpha` is applied homogeneously on each coefficient; followed

# production, it is a good practice and generally advisable to perform an
# unbiased evaluation of the performance of the model.
#
# Cross-validation should be used to make this analysis. First, it allows us to
Member

"Fist, it allows..." What is the second point?

We should better state why and when to use cross-validation. For instance, if I have a huge amount of very identically distributed data, then I don't need CV. Also, it is still best practice to keep one test set away from CV to evaluate the final model (possibly retrained on the whole train-validation set) - as is done in this example.

Also, CV assesses more a training algorithm and not a single model.
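A rough sketch of the suggested workflow, with a test set kept outside of the cross-validation loop (the dataset and estimator are placeholders):

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)

# Keep a test set completely outside of the cross-validation loop.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = make_pipeline(StandardScaler(), RidgeCV())

# Cross-validation on the train/validation part assesses the procedure...
cv_scores = cross_val_score(model, X_train, y_train, cv=5)
print(cv_scores.mean(), cv_scores.std())

# ...then the final model is refit on all of it and evaluated once on the
# held-out test set.
final_model = model.fit(X_train, y_train)
print(final_model.score(X_test, y_test))
```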

Comment on lines +74 to +75
# interpret findings built on internal model's parameters. One possible cause
# of large variations are small sample sizes.
Member

With perfectly i.i.d. data, I can trust the variance / CI of a score, which is usually just a mean over the data.
A more important reason, IMO, is a violation of the i.i.d. assumption.
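For instance, with grouped (hence non-i.i.d.) data one could swap in a group-aware splitter; the group labels below are synthetic, purely for illustration:

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import GroupKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)

# Hypothetical group labels (e.g. one group per patient); with real data
# they would come from the dataset itself.
rng = np.random.RandomState(0)
groups = rng.randint(0, 20, size=len(y))

model = make_pipeline(StandardScaler(), RidgeCV())

# GroupKFold keeps all samples of a group in the same fold, so the score
# is not inflated by leakage between correlated samples.
scores = cross_val_score(model, X, y, groups=groups, cv=GroupKFold(n_splits=5))
print(scores)
```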

Comment on lines +91 to +93
# regressor is also optimized via another internal cross-validation. This
# process is called a nested cross-validation and should always be implemented
# whenever model's parameters need to be optimized.
Member

For me, the wording is a bit absolute. One could state that it is recommended, but certainly not "should always be implemented".

# regressor. We can therefore use this predictive model as a baseline against
# more advanced machine learning pipelines.
#
# To conclude, cross-validation allows us to answer to two questions: are the
Member

Suggested change
# To conclude, cross-validation allows us to answer to two questions: are the
# To conclude, cross-validation allows us to answer two questions: are the

# more advanced machine learning pipelines.
#
# To conclude, cross-validation allows us to answer to two questions: are the
# results reliable and, if it is, how good is the algorithm used to create
Member

Suggested change
# results reliable and, if it is, how good is the algorithm used to create
# results reliable and, if they are, how good is the algorithm used to create

cv_pipelines = cv_results["estimator"]

# %%
# While the cross-validation allows us to know if our models are reliable, we
Member

"models are reliable" sound more like reliability diagrams so I could rephrase a bit.

Comment on lines +221 to +224
# However, you should be aware that this latest step does not give any
# information about the variance of our final model. Thus, if we want to
# evaluate our final model, we should get a distribution of scores if we would
# like to get this information.
Member

What is "variance of our final model". Do you have the bias variance decomposition in mind?
How would we get the "distribution of scores"?
I'm inclined to remove this remark.

@Micky774 Micky774 removed the "Waiting for Second Reviewer" label on Jul 27, 2023
8 participants