FEA Add strategy isotonic to calibration curve #23824


Open · wants to merge 7 commits into main

Conversation

lorentzenchr (Member)

Reference Issues/PRs

Fixes #23132.

What does this implement/fix? Explain your changes.

This PR adds strategy="isotonic" to calibration_curve and CalibrationDisplay.

Any other comments?

Reliability diagrams with isotonic regression (via the PAV algorithm) are the CORP approach of https://doi.org/10.1073/pnas.2016191118.

@lorentzenchr (Member Author)

Results

From the example of CalibrationDisplay:
[Figure: calibration curves for LogisticRegression with the uniform and isotonic strategies]

import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.calibration import CalibrationDisplay


X, y = make_classification(random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression(random_state=0)
clf.fit(X_train, y_train)

fig, ax = plt.subplots()
# Default uniform binning vs. the new isotonic (PAV) strategy.
CalibrationDisplay.from_estimator(clf, X_test, y_test, ax=ax)
CalibrationDisplay.from_estimator(clf, X_test, y_test, ax=ax, strategy="isotonic")
ax.get_legend().get_texts()[1].set_text("LogisticRegression uniform")
ax.get_legend().get_texts()[2].set_text("LogisticRegression isotonic")

From https://scikit-learn.org/stable/auto_examples/calibration/plot_compare_calibration.html#calibration-curves.
[Figure: calibration plots for Logistic, Naive Bayes, SVC, and Random forest with strategy="isotonic"]

import matplotlib.pyplot as plt
from matplotlib.gridspec import GridSpec
import numpy as np
from sklearn.calibration import CalibrationDisplay
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import LinearSVC


X, y = make_classification(
    n_samples=100_000, n_features=20, n_informative=2, n_redundant=2, random_state=42
)

train_samples = 100  # Samples used for training the models
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    shuffle=False,
    test_size=100_000 - train_samples,
)


class NaivelyCalibratedLinearSVC(LinearSVC):
    """LinearSVC with `predict_proba` method that naively scales
    `decision_function` output."""

    def fit(self, X, y):
        super().fit(X, y)
        df = self.decision_function(X)
        self.df_min_ = df.min()
        self.df_max_ = df.max()
        return self

    def predict_proba(self, X):
        """Min-max scale output of `decision_function` to [0,1]."""
        df = self.decision_function(X)
        calibrated_df = (df - self.df_min_) / (self.df_max_ - self.df_min_)
        proba_pos_class = np.clip(calibrated_df, 0, 1)
        proba_neg_class = 1 - proba_pos_class
        proba = np.c_[proba_neg_class, proba_pos_class]
        return proba


# Create classifiers
lr = LogisticRegression()
gnb = GaussianNB()
svc = NaivelyCalibratedLinearSVC(C=1.0)
rfc = RandomForestClassifier()

clf_list = [
    (lr, "Logistic"),
    (gnb, "Naive Bayes"),
    (svc, "SVC"),
    (rfc, "Random forest"),
]


fig = plt.figure(figsize=(10, 10))
gs = GridSpec(4, 2)
colors = plt.cm.get_cmap("Dark2")

ax_calibration_curve = fig.add_subplot(gs[:2, :2])
calibration_displays = {}
for i, (clf, name) in enumerate(clf_list):
    clf.fit(X_train, y_train)
    display = CalibrationDisplay.from_estimator(
        clf,
        X_test,
        y_test,
        n_bins=10,
        strategy="isotonic",
        name=name,
        ax=ax_calibration_curve,
        color=colors(i),
    )
    calibration_displays[name] = display

ax_calibration_curve.grid()
ax_calibration_curve.set_title("Calibration plots")
plt.show()

@lorentzenchr (Member Author)

@ogrisel @glemaitre You might be interested.

@thomasjpfan (Member) left a comment

As you noted in #23132 (comment), the CORP paper does not meet our inclusion criterion: according to Google Scholar it has been cited 13 times.

If we cannot include the method based on the inclusion criterion, then an alternative is to accept a callable here so that it is simple to implement CORP:

def calibration_curve(...):
    ...
    elif callable(strategy):
        # n_bins to be flexible
        return strategy(y_prob, y_true, n_bins)

and strategy is:

def strategy(y_prob, y_true, n_bins):
    iso = IsotonicRegression(y_min=0, y_max=1).fit(y_prob, y_true)
    prob_true = iso.y_thresholds_
    prob_pred = iso.X_thresholds_
    return prob_true, prob_pred

Then we can update a calibration example to showcase passing a callable and using the CORP strategy.
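A standalone sketch of the proposed callable, runnable on released scikit-learn (the name `corp_strategy` and the toy data are illustrative, not from the PR):

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression


def corp_strategy(y_prob, y_true, n_bins=None):
    """CORP-style reliability curve via isotonic regression (PAV).

    n_bins is accepted for interface compatibility but unused:
    PAV chooses the number and location of the bins itself.
    """
    iso = IsotonicRegression(y_min=0, y_max=1).fit(y_prob, y_true)
    # The fitted thresholds are the vertices of the reliability curve.
    return iso.y_thresholds_, iso.X_thresholds_


rng = np.random.default_rng(0)
y_prob = rng.uniform(size=200)
# Labels drawn so that y_prob is well calibrated by construction.
y_true = (rng.uniform(size=200) < y_prob).astype(int)

prob_true, prob_pred = corp_strategy(y_prob, y_true)
```

For a well-calibrated classifier the resulting `(prob_pred, prob_true)` pairs lie close to the diagonal, and by construction `prob_true` is nondecreasing in `prob_pred`.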

@lorentzenchr (Member Author)

@thomasjpfan The point is that isotonic regression is already included in scikit-learn, so why not use it? In particular, CalibratedClassifierCV uses it and relates to the same topic.
Another way of putting it: plot CalibratedClassifierCV(clf, method="isotonic", cv="prefit").fit(X, y).predict(X) versus clf.predict(X).
I see the paper more as a theoretical foundation for why isotonic regression is a good choice for reliability diagrams.
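A minimal sketch of that relationship, using IsotonicRegression directly as a stand-in for what CalibratedClassifierCV(method="isotonic", cv="prefit") fits internally (the synthetic data and variable names are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression

X, y = make_classification(random_state=0)
clf = LogisticRegression(random_state=0).fit(X, y)
y_prob = clf.predict_proba(X)[:, 1]

# Isotonic recalibration of clf's probabilities.
iso = IsotonicRegression(y_min=0, y_max=1, out_of_bounds="clip").fit(y_prob, y)
recalibrated = iso.predict(y_prob)

# The CORP reliability diagram is exactly this recalibrated-vs-raw plot:
# its vertices are the thresholds of the isotonic fit, and the fitted
# model linearly interpolates between them.
prob_pred, prob_true = iso.X_thresholds_, iso.y_thresholds_
np.testing.assert_allclose(np.interp(y_prob, prob_pred, prob_true), recalibrated)
```

So plotting the isotonic reliability curve and plotting calibrated-vs-raw probabilities are two views of the same fitted function.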

@lorentzenchr (Member Author) commented Sep 5, 2022

To give it more citation counts:

Successfully merging this pull request may close these issues:

Add PAV algorithm for calibration_curve/reliability diagrams