DOC Expand on sigmoid and isotonic in calibration.rst #17725


Merged
merged 13 commits into scikit-learn:master from doc_calb on Jul 22, 2020

Conversation


@lucyleeow lucyleeow commented Jun 25, 2020

Reference Issues/PRs

Addresses: #16321 (comment)

What does this implement/fix? Explain your changes.

  • Expands on when to use sigmoid vs isotonic for calibration

(I also added the points below in this PR, but I am happy to remove/change them if they are not appropriate here.)

  • Expands on why the data used for fitting the classifier should be different from the data used for fitting the calibrator
  • Moves some sections, as I thought the section about using cv='prefit' belongs next to the section about cross-validation
  • Expands on how CalibratedClassifierCV uses one-vs-the-rest to extend to multiclass
  • Adds internal doc links
  • Adds links to papers referenced

Any other comments?

ping @NicolasHug


lucyleeow commented Jun 25, 2020

I think adding some code examples would also be useful in calibration.rst - happy to do it here if appropriate.
Edit: I see examples have been added to the CalibratedClassifierCV docstring, so code examples here are less important.
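
A minimal sketch of the kind of example meant here (not taken from the PR; the dataset, classifier and settings are arbitrary choices):

    from sklearn.calibration import CalibratedClassifierCV
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB

    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # 3-fold internal cross-validation keeps the data used to fit each
    # classifier separate from the data used to fit its calibrator.
    clf = CalibratedClassifierCV(GaussianNB(), method="sigmoid", cv=3)
    clf.fit(X_train, y_train)
    proba = clf.predict_proba(X_test)  # calibrated probabilities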

.. math::
\sum_i (y_i - f_i)^2

subject to :math:`\f_i \le f_j`. This method is more general when compared to
Member

The constraint is y_i < y_j whenever f_i < f_j

Though we should not use y_i: y_i is used before and corresponds to the true target (0 or 1).

Member Author

I was confused at this part but I think y_i here does mean the true target.

[screenshot of the formula from the paper]

from: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4180410/
(sorry for funny screenshot)

Member

"subject to :math:\f_i \le f_j" is ambiguous as j is not defined. I think you should reproduce the full formula you quoted above only using f_i / f_{i+1} instead of the \tilde{s}_i / \tilde{s}_{i+1} notation.

The sigmoid method is biased in that it assumes the :ref:`calibration curve
<calibration_curve>` of the un-calibrated model has a sigmoid shape [1]_. It
is thus most effective when the un-calibrated model is over-confident.

Member

Maybe we could also mention the fact that Platt scaling assumes symmetric calibration errors, that is, it assumes that the over-confidence errors for low values of f_i have the same magnitude as for high values of f_i. This is not necessarily the case for highly imbalanced classification problems where the un-calibrated classifier can have asymmetric calibration errors.

This is just an intuition (I have not run experiments to confirm this happens in practice) though.

@ogrisel ogrisel Jun 26, 2020

Thanks for the reference, Beta calibration looks very nice, I did not know about it. Unfortunately it does not meet the criterion for inclusion in scikit-learn in terms of citations, but honestly I wouldn't mind considering a PR to add it as a third option to CalibratedClassifierCV.

\sum_{i=1}^{n} (y_i - f_i)^2 : f_i \leq f_{i+1} \quad \forall i \in \{1,..., n-1\}

where :math:`y_i` is the true label of sample :math:`i` and :math:`f_i`
is the output of the classifier for sample :math:`i`. This method is more
Member

Suggested change
is the output of the classifier for sample :math:`i`. This method is more
is the calibrated output of the classifier for sample :math:`i`. This method is more

Member Author

...are you sure? I am confused now

Member Author

this is the function that is minimized to find the isotonic function, so it should be the output of the classifier...?

Member

We're already using f_i above to define the output of the un-calibrated classifier.

The formula should be

\sum_{i=1}^{n} (y_i - \hat{f}_i)^2

where \hat{f} is as @ogrisel suggested (the calibrated probability)

And the constraint is that \hat{f}_i >= \hat{f}_j whenever f_i >= f_j
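
As an illustration of this constraint, a minimal sketch (not from the PR; toy values) using scikit-learn's IsotonicRegression, which enforces exactly this monotonicity:

    import numpy as np
    from sklearn.isotonic import IsotonicRegression

    # Un-calibrated classifier outputs f_i and true binary labels y_i (toy values).
    f = np.array([0.1, 0.35, 0.4, 0.8, 0.9])
    y = np.array([0, 1, 0, 1, 1])

    # Minimizes sum_i (y_i - f_hat_i)^2 subject to f_hat being non-decreasing
    # in f, i.e. f_hat_i >= f_hat_j whenever f_i >= f_j.
    iso = IsotonicRegression(y_min=0, y_max=1, out_of_bounds="clip")
    f_hat = iso.fit_transform(f, y)  # the calibrated probabilities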

Member

Indeed sorry for the confusion.

.. math::
\sum_{i=1}^{n} (y_i - \hat{f}_i)^2

subject to \hat{f}_i >= \hat{f}_j whenever f_i >= f_j. :math:`y_i` is the true
Member

The math formatting is missing here: https://109917-843222-gh.circle-artifacts.com/0/doc/modules/calibration.html#isotonic

Suggested change
subject to \hat{f}_i >= \hat{f}_j whenever f_i >= f_j. :math:`y_i` is the true
subject to :math:`\hat{f}_i >= \hat{f}_j` whenever :math:`f_i >= f_j`. :math:`y_i` is the true

Member

This paragraph will probably need to be wrapped to avoid going beyond 80 chars.

Member Author

whoops

Member

Under vscode, I use https://marketplace.visualstudio.com/items?itemName=stkb.rewrap with the alt+q keyboard shortcut for this.

@ogrisel ogrisel left a comment

LGTM (assuming the circle ci output will be good after the latest commit). :)


lucyleeow commented Jun 26, 2020

@ogrisel do you have any idea why my class linking (e.g., :class:`SGDClassifier`) is not appearing (as a link) in the built documentation?

[screenshot of the built documentation showing the class reference rendered as plain text]

@@ -11,16 +11,21 @@ When performing classification you often want not only to predict the class
label, but also obtain a probability of the respective label. This probability
gives you some kind of confidence on the prediction. Some models can give you
poor estimates of the class probabilities and some even do not support
probability prediction. The calibration module allows you to better calibrate
probability prediction (e.g., :class:`SGDClassifier`). The calibration
Member

you need the whole path unless there's a previous sphinx directive indicating the current module (which wouldn't be sklearn.linear_model anyway)

Suggested change
probability prediction (e.g., :class:`SGDClassifier`). The calibration
probability prediction (e.g., :class:`~sklearn.linear_model.SGDClassifier`). The calibration

Member Author

ooohhhh thanks, that is good to know.

@NicolasHug NicolasHug left a comment

thanks @lucyleeow this will be a nice addition to the UG!

some last comments from me

@@ -11,16 +11,21 @@ When performing classification you often want not only to predict the class
label, but also obtain a probability of the respective label. This probability
gives you some kind of confidence on the prediction. Some models can give you
poor estimates of the class probabilities and some even do not support
probability prediction. The calibration module allows you to better calibrate
probability prediction (e.g., :class:`~sklearn.linear_model.SGDClassifier`).
Member

Suggested change
probability prediction (e.g., :class:`~sklearn.linear_model.SGDClassifier`).
probability prediction (e.g., some instances of :class:`~sklearn.linear_model.SGDClassifier`).

@@ -85,22 +90,29 @@ Calibrating a classifier

Calibrating a classifier consists in fitting a regressor (called a
Member

consists of

(pretty sure that was from me :s)

is calibrated first for each class separately in a one-vs-rest fashion [4]_.
When predicting probabilities, the calibrated probabilities for each class
:class:`CalibratedClassifierCV` supports the use of two 'calibration'
regressors: 'sigmoid' and 'isotonic'. Both these regressors only
Member

I would put the section about multiclass (from "Both these regressors" to "normalize them") into another subsection, e.g. "Multiclass support", once the isotonic and sigmoid calibrators have been described.

Otherwise the 2 subsections detailing isotonic and sigmoid are a bit abrupt. It would be more natural if they directly followed from ":class:CalibratedClassifierCV supports the use of two 'calibration' regressors: 'sigmoid' and 'isotonic'."

We can move the sentence about brier_score just before
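
For illustration, a minimal sketch (not from the PR) of the multiclass behaviour referred to above: each class is calibrated separately in a one-vs-rest fashion and the per-class probabilities are then normalized so that each row sums to one:

    from sklearn.calibration import CalibratedClassifierCV
    from sklearn.datasets import make_classification
    from sklearn.svm import LinearSVC

    X, y = make_classification(n_samples=600, n_features=20, n_informative=6,
                               n_classes=3, random_state=0)

    # Each of the 3 classes is calibrated separately (one-vs-rest), then the
    # per-class probabilities are normalized so each row sums to 1.
    clf = CalibratedClassifierCV(LinearSVC(), method="sigmoid", cv=3)
    clf.fit(X, y)
    proba = clf.predict_proba(X)
    print(proba.shape)            # (600, 3)
    print(proba.sum(axis=1)[:5])  # ~[1. 1. 1. 1. 1.]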

Member Author

Thanks for the suggestion, it works much better. I thought it didn't flow well but I wasn't sure how best to change it.

Comment on lines 162 to 163
symmetrical [1]_. It is thus most effective when the un-calibrated model is
under-confident and has similar errors for both high and low
Member

my understanding is that we assume over-confidence for low probabilities and under-confidence for high probabilities (which are then compensated by the shape of the logistic function)?

also by "similar errors" do we mean errors with similar absolute values / magnitude?

Member Author

Oh I am confused. I thought that a 'sigmoid' shape of calibration curve meant under-confident classifier and 'transposed-sigmoid' shape meant over-confident model. This is just from the example: https://scikit-learn.org/stable/auto_examples/calibration/plot_calibration_curve.html#sphx-glr-auto-examples-calibration-plot-calibration-curve-py

I think "similar errors" is in terms of the shape of the calibration curve - the sigmoid is symmetrical in shape. Since we are dealing with the difference between predicted probability and frequency of true positives per bin, I would say similar absolute difference?

Member

my understanding is that we assume over-confidence for low probabilities and under-confidence for high probabilities (which are then compensated by the shape of the logistic function)?

Ok I see how I was wrong here

Oh I am confused. I thought that a 'sigmoid' shape of calibration curve meant under-confident classifier and 'transposed-sigmoid' shape meant over-confident model. This is just from the example: scikit-learn.org/stable/auto_examples/calibration/plot_calibration_curve.html#sphx-glr-auto-examples-calibration-plot-calibration-curve-py

Now I'm confused too: from the example, it seems to me that NB (with a transposed sigmoid shape) is
under confident, while the LinearSVC (with a sigmoid shape) is overconfident: for example for the bin at 0.8, its predictions are close to 1, so I interpret this as being over-confident about the positive class. What am I getting wrong? Maybe @ogrisel can chime in?

Member Author

From the example, it seems to me that NB (with a transposed sigmoid shape) is
under-confident, while the LinearSVC (with a sigmoid shape) is overconfident:

Yes you are right! I should have thought about this more. (I think @ogrisel will be on holiday from next week though...)

Member

What am I getting wrong?

Ok so what I was getting wrong is that I was inverting the axes: the x axis is what the classifier predicts and the y axis is the actual proportions. So indeed the sigmoid curve describes under-confidence and the example is correct.
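
One way to check this interpretation (a sketch, not from the PR): compute the reliability diagram with calibration_curve, where prob_pred (the x axis) is the mean predicted probability per bin and prob_true (the y axis) is the observed fraction of positives per bin:

    from sklearn.calibration import calibration_curve
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB

    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    proba = GaussianNB().fit(X_train, y_train).predict_proba(X_test)[:, 1]

    # prob_pred: mean predicted probability per bin (x axis)
    # prob_true: observed fraction of positives per bin (y axis)
    prob_true, prob_pred = calibration_curve(y_test, proba, n_bins=10)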

Member Author

Yes, okay. It is tricky to interpret. I guess you have to think about it from 0.5 and go up/down.

subject to :math:`\hat{f}_i >= \hat{f}_j` whenever
:math:`f_i >= f_j`. :math:`y_i` is the true
label of sample :math:`i` and :math:`\hat{f}_i` is the output of the
calibrated classifier for sample :math:`i`. This method
Member

Suggested change
calibrated classifier for sample :math:`i`. This method
calibrated classifier for sample :math:`i`, i.e. the calibrated probability. This method

@lucyleeow
Member Author

Thanks for the review @NicolasHug. I expanded on the Brier score as well, adding the definitions from #10969

I think once #11096 is done, they will amend the doc to talk about calibration loss instead of the Brier score, but for now I thought it would be useful to expand on the Brier score.
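
For context, the Brier score mentioned here can be computed with sklearn.metrics.brier_score_loss; a minimal sketch with made-up values:

    import numpy as np
    from sklearn.metrics import brier_score_loss

    y_true = np.array([0, 1, 1, 0, 1])
    y_prob = np.array([0.1, 0.9, 0.8, 0.3, 0.6])

    # Mean squared difference between predicted probabilities and outcomes;
    # lower is better, 0 is perfect.
    score = brier_score_loss(y_true, y_prob)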

Comment on lines 160 to 162
The sigmoid method is biased in that it assumes the :ref:`calibration curve
<calibration_curve>` of the un-calibrated model has a sigmoid shape and is
symmetrical [1]_. It is thus most effective when the un-calibrated model is
Member

I don't want to delay merging further if you're sure about this but there are a few things that aren't clear for me here:

  • why does sigmoid calibration assume a sigmoid calibration curve?
  • is this really discussed in [1]_, or rather in projecteuclid.org/download/pdfview_1/euclid.ejs/1513306867?

In projecteuclid.org/download/pdfview_1/euclid.ejs/1513306867 it is said that

the parametric assumption made by logistic calibration is exactly the right one if the scores output by a classifier are normally distributed within each class around class means s+ and s− with the same variance σ²

though I'm not sure yet how that relates to the comment made above
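
One way the two relate (a sketch of the standard argument, not from the PR): if the scores are normally distributed within each class with equal variance, Bayes' rule gives exactly a sigmoid posterior, which is the parametric form that Platt scaling fits:

.. math::
    p(f \mid y=1) = \mathcal{N}(f; s_+, \sigma^2), \qquad
    p(f \mid y=0) = \mathcal{N}(f; s_-, \sigma^2)

.. math::
    P(y=1 \mid f)
    = \frac{\pi_1\, p(f \mid y=1)}{\pi_1\, p(f \mid y=1) + \pi_0\, p(f \mid y=0)}
    = \frac{1}{1 + \exp(A f + B)}

with :math:`A = -(s_+ - s_-)/\sigma^2` and :math:`B = (s_+^2 - s_-^2)/(2\sigma^2) - \log(\pi_1/\pi_0)`, i.e. the same 1 / (1 + exp(A f + B)) form that sigmoid calibration fits by maximum likelihood.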

Member Author

The symmetry assumption is discussed in projecteuclid.org/download/pdfview_1/euclid.ejs/1513306867 but not in '[1]'. I can add it in.

* why does sigmoid calibration assume a sigmoid calibration curve?

I am not clear on the maths but I think this is explained better in the original Platt paper: https://www.researchgate.net/publication/2594015_Probabilistic_Outputs_for_Support_Vector_Machines_and_Comparisons_to_Regularized_Likelihood_Methods
section 2.1 (which I can reference instead)

Member

I am not clear on the maths but I think this is explained better in the original Platt paper

I don't see where this paper says such a thing. What I read is:

  • the sigmoid model is equivalent to assuming that the output of the SVM is proportional to the log odds of a positive example
  • the class-conditional densities between the margins are apparently exponential. Bayes rules on 2 exponentials suggests using a parametric form of a sigmoid

I don't understand how these two are equivalent to "using a sigmoid calibration assumes that the calibration curve has a sigmoid shape"

@lucyleeow lucyleeow Jul 9, 2020

Hmm yes that is true. (Is it fair to say that:) It was designed to calibrate the output of the SVM - which has a sigmoid shape (is this always true)?

@ogrisel ogrisel Jul 22, 2020

Maybe we can just say:

  • using the sigmoid calibration method assumes that the calibration curve can be corrected by applying a sigmoid function to the raw predictions. This assumption has been empirically justified in the case of support vector machines with common kernel functions on various benchmark datasets in section 2.1 of Platt 2000 [1] but does not necessarily hold in general.
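
As a minimal illustration of "applying a sigmoid function to the raw predictions" (a sketch; the helper name and parameter values are hypothetical, and in practice A and B are fitted by maximum likelihood on held-out data):

    import numpy as np

    def platt_sigmoid(raw_scores, a, b):
        # Hypothetical helper: Platt's parametric form p = 1 / (1 + exp(a*f + b)),
        # mapping raw decision scores f to probabilities in (0, 1).
        return 1.0 / (1.0 + np.exp(a * raw_scores + b))

    scores = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])  # e.g. SVM decision values
    print(platt_sigmoid(scores, a=-1.0, b=0.0))     # monotone map into (0, 1)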

@lucyleeow

Thanks for your help @ogrisel and @NicolasHug, I've made some changes and hopefully it is correct now...

@NicolasHug NicolasHug left a comment

thanks a lot @lucyleeow , will merge when green

@NicolasHug NicolasHug merged commit effc436 into scikit-learn:master Jul 22, 2020
@lucyleeow lucyleeow deleted the doc_calb branch July 30, 2020 17:35
jayzed82 pushed a commit to jayzed82/scikit-learn that referenced this pull request Oct 22, 2020