Describe the issue linked to the documentation
Location of the issue:
The example titled "Receiver Operating Characteristic (ROC) with cross validation" (link) can lead to misunderstanding regarding decision threshold selection.
🔍 Description of the problem
The example uses RocCurveDisplay.from_estimator() to plot ROC curves for each test fold in cross-validation:
fig, ax = plt.subplots(figsize=(6, 6))
for fold, (train, test) in enumerate(cv.split(X, y)):
    classifier.fit(X[train], y[train])
    viz = RocCurveDisplay.from_estimator(
        classifier,
        X[test],  # the test set is used here instead of the train set
        y[test],
        name=f"ROC fold {fold}",
        alpha=0.3,
        lw=1,
        ax=ax,
        plot_chance_level=(fold == n_splits - 1),
    )
    interp_tpr = np.interp(mean_fpr, viz.fpr, viz.tpr)
    interp_tpr[0] = 0.0
    tprs.append(interp_tpr)
    aucs.append(viz.roc_auc)
However, there is no warning or clarification that:
1) Users should not select thresholds based on predictions from these test folds.
2) Even for ROC visualization, using predictions from the training folds (via cross_val_predict) avoids potential bias and better simulates threshold-tuning workflows.
Without this guidance, users may mistakenly tune thresholds by inspecting ROC curves on the test sets, leading to data leakage and over-optimistic results.
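For illustration only (not part of the linked example), here is a minimal sketch of how cross_val_predict could be used to collect probability estimates from inside the cross-validation loop and plot one pooled ROC curve; the dataset and classifier below are placeholder assumptions, not the example's own setup:

import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.metrics import RocCurveDisplay
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.svm import SVC

# Placeholder data and model (the real example uses its own dataset and classifier).
X, y = make_classification(n_samples=500, random_state=0)
cv = StratifiedKFold(n_splits=5)
classifier = SVC(kernel="linear", probability=True, random_state=0)

# cross_val_predict scores every sample with a model fitted on the other folds,
# so every prediction is produced inside the cross-validation procedure.
y_scores = cross_val_predict(classifier, X, y, cv=cv, method="predict_proba")[:, 1]

RocCurveDisplay.from_predictions(y, y_scores, name="Cross-validated ROC")
plt.show()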
✅ Proposed solution
Replace the test set with the train set in this code:
fig, ax = plt.subplots(figsize=(6, 6))
for fold, (train, test) in enumerate(cv.split(X, y)):
    classifier.fit(X[train], y[train])
    viz = RocCurveDisplay.from_estimator(
        classifier,
        X[train],  # the train set is used here instead of the test set
        y[train],
        name=f"ROC fold {fold}",
        alpha=0.3,
        lw=1,
        ax=ax,
        plot_chance_level=(fold == n_splits - 1),
    )
    interp_tpr = np.interp(mean_fpr, viz.fpr, viz.tpr)
    interp_tpr[0] = 0.0
    tprs.append(interp_tpr)
    aucs.append(viz.roc_auc)
Add a point to the example:
Use predictions from the training folds (e.g., via cross_val_predict) or apply nested cross-validation (see the sketch below). This prevents data leakage and ensures realistic model evaluation.
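For context, a minimal sketch of what nested cross-validation could look like in this setting, with hyperparameters tuned on inner folds and performance scored on outer folds that the tuning never sees; the estimator and parameter grid are illustrative assumptions:

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Inner loop: hyperparameter search restricted to the training folds.
tuned_clf = GridSearchCV(
    SVC(probability=True, random_state=0),
    param_grid={"C": [0.1, 1, 10]},
    scoring="roc_auc",
    cv=inner_cv,
)

# Outer loop: the outer test folds are used only for scoring, never for tuning.
scores = cross_val_score(tuned_clf, X, y, cv=outer_cv, scoring="roc_auc")
print(f"nested CV ROC AUC: {scores.mean():.3f} +/- {scores.std():.3f}")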
Suggest a potential alternative/fix
Make the following changes in this example (link):
Use predictions from the training folds (e.g., via cross_val_predict) or apply nested cross-validation. This prevents data leakage and ensures realistic model evaluation.
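As a related option beyond what the issue proposes (an assumption on my part, requiring scikit-learn 1.5 or newer), TunedThresholdClassifierCV tunes the decision threshold on internal validation splits of the training data, so held-out test data is never used for threshold selection; a minimal sketch with a placeholder dataset and estimator:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import TunedThresholdClassifierCV, train_test_split

X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# The threshold is chosen on internal CV splits of the training data only.
tuned = TunedThresholdClassifierCV(
    LogisticRegression(), scoring="balanced_accuracy", cv=5
).fit(X_train, y_train)

print(f"chosen threshold: {tuned.best_threshold_:.3f}")
print(f"held-out balanced accuracy: "
      f"{balanced_accuracy_score(y_test, tuned.predict(X_test)):.3f}")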
Also add a short note in the docs that, when choosing an operating threshold from a ROC curve, the training/validation data should be used rather than the test set; this avoids an over-optimistic view of the model and does not introduce data leakage. Here is the revised code:
fig, ax = plt.subplots(figsize=(6, 6))
for fold, (train, test) in enumerate(cv.split(X, y)):
    classifier.fit(X[train], y[train])
    viz = RocCurveDisplay.from_estimator(
        classifier,
        X[train],
        y[train],
        name=f"ROC fold {fold}",
        alpha=0.3,
        lw=1,
        ax=ax,
        plot_chance_level=(fold == n_splits - 1),
    )
    interp_tpr = np.interp(mean_fpr, viz.fpr, viz.tpr)
    interp_tpr[0] = 0.0
    tprs.append(interp_tpr)
    aucs.append(viz.roc_auc)
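To make the intended threshold-tuning workflow concrete, here is a minimal sketch (not part of the issue) in which a threshold is picked from each training fold's ROC curve and then applied, unchanged, to the corresponding test fold; the dataset, classifier, and the Youden's J selection rule are illustrative assumptions:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import balanced_accuracy_score, roc_curve
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)
cv = StratifiedKFold(n_splits=5)
classifier = SVC(kernel="linear", probability=True, random_state=0)

for fold, (train, test) in enumerate(cv.split(X, y)):
    classifier.fit(X[train], y[train])

    # ROC curve computed on the training fold only; the test fold is untouched here.
    fpr, tpr, thresholds = roc_curve(
        y[train], classifier.predict_proba(X[train])[:, 1]
    )
    best_threshold = thresholds[np.argmax(tpr - fpr)]  # Youden's J (one possible rule)

    # The test fold is used once, with the threshold already fixed.
    y_pred = (classifier.predict_proba(X[test])[:, 1] >= best_threshold).astype(int)
    print(
        f"fold {fold}: threshold={best_threshold:.3f}, "
        f"balanced accuracy={balanced_accuracy_score(y[test], y_pred):.3f}"
    )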