ENH add from_cv_results in RocCurveDisplay #25939

Closed

Conversation

@glemaitre glemaitre commented Mar 22, 2023

Add a method from_cv_results in RocCurveDisplay to plot several ROC curves at once obtained from the cross-validation.

Discussions

In terms of design, I created a class that handles multiple single-curve displays. An alternative would be to alter RocCurveDisplay to accept a list of ndarrays for fpr, tpr and thresholds. However, that requires a relatively complex change to the plot method as well as adding new attributes such as lines_ alongside line_.

The code would become difficult to read, as in PartialDependenceDisplay, because the same object has to handle both multiple and single curves.

The approach here is to expose from_cv_results directly on RocCurveDisplay but return a MultiRocCurveDisplay that uses the information stored in each underlying display to do the plotting.

It is the best design that @ogrisel and I have come up with so far.
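
Roughly, the intended usage would look like the sketch below. from_cv_results and MultiRocCurveDisplay are the names proposed in this PR (so not part of a released scikit-learn yet), and the sketch assumes cross_validate is run with return_estimator=True and return_indices=True so that the per-fold predictions can be recomputed:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import RocCurveDisplay
from sklearn.model_selection import cross_validate

X, y = make_classification(n_samples=1_000, random_state=0)

# Keep the fitted estimators and the test indices of each fold so that the
# display can recompute the per-fold predictions.
cv_results = cross_validate(
    LogisticRegression(), X, y, cv=5, return_estimator=True, return_indices=True
)

# Proposed here: build one RocCurveDisplay per fold internally and return a
# MultiRocCurveDisplay that plots all the curves on the same axes.
display = RocCurveDisplay.from_cv_results(cv_results, X, y)
```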

@glemaitre glemaitre marked this pull request as draft March 22, 2023 10:44
@glemaitre glemaitre changed the title from "ENH add from_cv_results in RocCurveDisplays" to "ENH add from_cv_results in RocCurveDisplay" Mar 22, 2023
@glemaitre glemaitre marked this pull request as ready for review March 23, 2023 09:22
@jeremiedbb jeremiedbb left a comment

A few comments. My main concern is about the API.

Comment on lines 556 to 566
for fold_id, (estimator, test_indices, name) in enumerate(
zip(cv_results["estimator"], cv_results["indices"]["test"], fold_name_)
):
y_true = _safe_indexing(y, test_indices)
y_pred = _get_response(
_safe_indexing(X, test_indices),
estimator,
response_method=response_method,
pos_label=pos_label,
)[0]

Member

This makes me wonder if it's the right approach. The user has to provide exactly the same X as the one used to compute the cross-validation, and we have no way to ensure that. For instance, if the dataset is shuffled in between, the results will be overly optimistic due to leakage.

Would it make sense to not accept cv_results, but instead compute the cross-validation as part of this method?
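
To make the failure mode concrete, here is a hypothetical sketch (same kind of setup as in this PR; from_cv_results is the proposed method, not an existing one):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

X, y = make_classification(n_samples=1_000, random_state=0)
cv_results = cross_validate(
    LogisticRegression(), X, y, cv=5, return_estimator=True, return_indices=True
)

# If the user reorders the data afterwards, the stored test indices no longer
# point at the rows that were held out during cross-validation...
rng = np.random.default_rng(0)
permutation = rng.permutation(len(y))
X_shuffled, y_shuffled = X[permutation], y[permutation]

# ...so passing X_shuffled / y_shuffled to the proposed from_cv_results would
# silently mix training and test rows and yield overly optimistic ROC curves,
# and nothing stored in cv_results lets us detect it.
```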

Member Author

You mean to add a return_predictions option to cross_validate and directly get the predictions instead of the indices?

Member

No, I meant running the cross-validation inside the display, but giving it more thought, it's probably not a good idea: we don't want the display to fit the estimators.

I remember that we talked about returning the predictions from cross_validate, but it was not straightforward. I can't remember why, can you remind me?

Member Author

I remember that we talked about returning the predictions from cross_validate, but it was not straightforward. I can't remember why, can you remind me?

It is just that this is not implemented yet; we could make cross_validate return the predictions. However, we would also need to know which method produced the predictions.

Comment on lines +593 to +597
It is recommended to use
:meth:`~sklearn.metrics.RocCurveDisplay.from_cv_results` to create a
:class:`~sklearn.metrics.MultiRocCurveDisplay`. All parameters are stored as
attributes.

Member

If we don't want users to use it directly, we can make it private instead, no?

Member Author

We make the same mention in the other displays. I am not against making the class private; however, I think it could be a handy display for plotting several ROC curves even when they don't come from cross-validation.

Member

For the other displays we discourage calling plot directly. Here it's the whole class. If we find it useful on its own, maybe the message should be something like "if you want to plot multiple roc curves from cv results, you should use ... instead".

I'm guessing that if we want to add from_cv_results to other displays, we'll have to make a Multi...Display for each. Do you think we could at least factor some code into a Multi...Mixin?

Member Author

For the other displays we discourage calling plot directly. Here it's the whole class. If we find it useful on its own, maybe the message should be something like "if you want to plot multiple roc curves from cv results, you should use ... instead".

I see. I can alter the code.

I'm guessing that if we want to add from_cv_results to other displays, we'll have to make a Multi...Display for each. Do you think we could at least factor some code into a Multi...Mixin?

Indeed. I would postpone introducing that until the next display, to see how much the displays will have in common.

@jeremiedbb
Member

I also think that we can improve the visualization. For now, the curves for each fold have a different color, the mean is in black and the fill-between is grey. My proposal is to use a single color (probably blue) for all curves and the ±1 std fill-between, with different line styles and alphas. The drawback is that we can't distinguish the different folds in the legend, but is it bad?
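
Something along these lines (a rough matplotlib sketch, reusing the vertical averaging from the existing cross-validation example rather than the actual display code; the exact colors and alphas are just placeholders):

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=500, random_state=0)
clf = LogisticRegression(max_iter=1_000)

fpr_grid = np.linspace(0, 1, 101)
interp_tprs = []
fig, ax = plt.subplots()
for train, test in StratifiedKFold(n_splits=5).split(X, y):
    y_score = clf.fit(X[train], y[train]).predict_proba(X[test])[:, 1]
    fpr, tpr, _ = roc_curve(y[test], y_score)
    # Every fold uses the same color, thin and transparent.
    ax.plot(fpr, tpr, color="tab:blue", alpha=0.3, linewidth=1)
    interp_tprs.append(np.interp(fpr_grid, fpr, tpr))

mean_tpr = np.mean(interp_tprs, axis=0)
std_tpr = np.std(interp_tprs, axis=0)
# Mean curve and the +/- 1 std band share the same color as the folds.
ax.plot(fpr_grid, mean_tpr, color="tab:blue", linewidth=2, label="Mean ROC")
ax.fill_between(
    fpr_grid, mean_tpr - std_tpr, mean_tpr + std_tpr,
    color="tab:blue", alpha=0.15, label=r"$\pm$ 1 std",
)
ax.legend(loc="lower right")
plt.show()
```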

@glemaitre
Member Author

The drawback is that we can't distinguish the different folds in the legend, but is it bad?

I assume there are cases where you want to inspect the folds later, but that might not be the main use case. So having the style you mention as the default could be fine, while providing a way (which is already the case) to set an individual style per CV fold. I don't know whether a string option for the kwargs would be useful.

@glemaitre
Member Author

I also need to make the improvements proposed here: #25929

@ogrisel
Member

ogrisel commented Jun 15, 2023

There is a closely related paper recently published at TMLR:

I have not read it yet but I think we should before considering adding aggregation methods to this PR.

@jeremiedbb jeremiedbb added this to the 1.6 milestone Mar 6, 2024
@lorentzenchr
Member

I haven't read this PR and only skimmed through the paper. My first reaction is to not pursue averaging of ROC curves. I also think that a ROC curve is a decision instrument and not a model selection instrument.

@glemaitre glemaitre removed this from the 1.6 milestone Oct 29, 2024
@glemaitre glemaitre added this to the 1.7 milestone Oct 29, 2024
@lucyleeow
Member

I have had a look at the paper and done a bit of digging into what R does.

R

I picked these three popular packages mostly based on this blog post (which summarised a few ROC curve plotting packages after narrowing the search results by orphaned status, relevance and downloads).

  • ROCR - mentioned in the paper. Offers 3 averaging options:
    • threshold
    • horizontal
    • vertical
  • pROC - plots confidence intervals via bootstrap replicates. (Equivalent) methods offered:
    • threshold
    • horizontal
    • vertical
  • PRROC - no averaging but they do explain how to draw more than one curve on a plot

scikit-learn

We have an example, Receiver Operating Characteristic (ROC) with cross validation, which uses vertical averaging.

Hogan & Adams

The paper does not give any warnings against averaging ROC curves, but compares the different methods and advises on how to select the most appropriate one.

The paper introduces 3 'main' methods, each of which is supported by a paper:

  • Pooling - pool the scores assigned to all instances across all datasets; the ROC curve is constructed as if all instances came from the same dataset. Can be considered a weighted average, where the weight is the size of each dataset.
  • Threshold - sample threshold values and average the TPR and FPR for all curves.
    • compared with pooling, treats each dataset as equal, regardless of size
    • equal to pooling if number of positive and negative labels in each dataset are the same
    • interpreted as "average performance achieved amongst datasets if the threshold is fixed across all datasets"
  • Vertical averaging (ROC space) - sample FPRs and average the corresponding TPRs for each curve, interpolating between points where necessary (see the sketch after this list). Standard errors can also be calculated via this method.
    • Gives performance wrt fixed FPRs
    • Note that the threshold that yields a given FPR on each ROC curve will usually be different.
    • Variants (AFAICT no papers dedicated to these methods, but they are used in some papers):
      • Horizontal - sample TPRs and average the corresponding FPRs. Gives performance wrt fixed TPRs
      • Diagonal - sample error rates, average both TPR and FPR. Gives performance wrt fixed error rates
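
For illustration, here is a minimal numpy sketch of the ROC-space averaging described above, on toy per-fold curves (illustrative only, not the paper's code):

```python
import numpy as np

# Toy per-fold ROC curves, i.e. the (fpr, tpr) arrays that roc_curve would
# return for each cross-validation fold.
fold_curves = [
    (np.array([0.0, 0.1, 0.4, 1.0]), np.array([0.0, 0.5, 0.8, 1.0])),
    (np.array([0.0, 0.2, 0.5, 1.0]), np.array([0.0, 0.4, 0.9, 1.0])),
    (np.array([0.0, 0.3, 0.6, 1.0]), np.array([0.0, 0.6, 0.85, 1.0])),
]

# Vertical averaging: fix a common grid of FPR values, interpolate each
# fold's TPR at those points, then average across folds.
fpr_grid = np.linspace(0.0, 1.0, 101)
interp_tprs = np.array([np.interp(fpr_grid, fpr, tpr) for fpr, tpr in fold_curves])
mean_tpr = interp_tprs.mean(axis=0)
std_tpr = interp_tprs.std(axis=0)  # usable for a +/- 1 std band

# Horizontal averaging: same idea with the roles swapped, i.e. fix a grid of
# TPR values and average the interpolated FPRs.
tpr_grid = np.linspace(0.0, 1.0, 101)
interp_fprs = np.array([np.interp(tpr_grid, tpr, fpr) for fpr, tpr in fold_curves])
mean_fpr = interp_fprs.mean(axis=0)
```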

Pooling and threshold averaging give averages for a fixed threshold but assume that the "classifier outputs across folds of cross-validation are comparable and thus can be globally ordered", which has been shown to be generally not valid and can lead to large pessimistic biases. Averaging in ROC space does not suffer from such problems.

Ultimately, they recommend using threshold averaging if the classifier is intended to operate on all datasets with a common threshold. Otherwise, averaging should be done in ROC space, and the user should decide the direction depending on the use case.

Going forward

I think that if we were to have an averaging method, we should offer more than one, as no single method is suitable for every use case (and it's what R does, and I just assume the R community knows their stats 😅).

I also think that a ROC curve is a decision instrument and not a model selection instrument.

@lorentzenchr do you mean decision instrument, as in deciding on the best threshold? Can you expand on why it shouldn't be used as a model selection instrument, or point me to a paper to read? I know (IRL and in papers) that it is used as a model selection instrument...
