DOC Expand on sigmoid and isotonic in calibration.rst #17725
Conversation
I think adding some code examples would also be useful in
doc/modules/calibration.rst
Outdated
.. math::
    \sum_i (y_i - f_i)^2

subject to :math:`\f_i \le f_j`. This method is more general when compared to
The constraint is y_i < y_j whenever f_i < f_j
Though we should not use y_i: y_i is used before and corresponds to the true target (0 or 1).
I was confused at this part but I think y_i here does mean the true target.
from: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4180410/
(sorry for funny screenshot)
"subject to :math:\f_i \le f_j
" is ambiguous as j
is not defined. I think you should reproduce the full formula you quoted above only using f_i
/ f_{i+1}
instead of the \tilde{s}_i
/ \tilde{s}_{i+1}
notation.
The sigmoid method is biased in that it assumes the :ref:`calibration curve
<calibration_curve>` of the un-calibrated model has a sigmoid shape [1]_. It
is thus most effective when the un-calibrated model is over-confident.
Maybe we could also mention the fact that Platt scaling assumes symmetric calibration errors, that is, it assumes that the over-confidence errors for low values of f_i have the same magnitude as for high values of f_i. This is not necessarily the case for highly imbalanced classification problems, where the un-calibrated classifier can have asymmetric calibration errors.
This is just an intuition (I have not run experiments to confirm this happens in practice) though.
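As a minimal sketch of how one could probe this intuition (not from this PR; the synthetic imbalanced dataset and the LinearSVC base model are arbitrary choices for illustration):

# Compare sigmoid and isotonic calibration on a highly imbalanced problem
# and inspect the resulting calibration curves.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV, calibration_curve

X, y = make_classification(n_samples=20000, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for method in ("sigmoid", "isotonic"):
    clf = CalibratedClassifierCV(LinearSVC(dual=False), method=method, cv=3)
    clf.fit(X_train, y_train)
    prob_pos = clf.predict_proba(X_test)[:, 1]
    # prob_true = observed frequency of positives per bin,
    # prob_pred = mean predicted probability per bin
    prob_true, prob_pred = calibration_curve(y_test, prob_pos, n_bins=10)
    print(method, list(zip(prob_pred.round(2), prob_true.round(2))))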
You're right it's discussed here: https://projecteuclid.org/download/pdfview_1/euclid.ejs/1513306867
Thanks for the reference, Beta calibration looks very nice, I did not know about it. Unfortunately it does not meet the criterion for inclusion in scikit-learn in terms of citations, but honestly I wouldn't mind considering a PR to add it as a third option to CalibratedClassifierCV.
doc/modules/calibration.rst
Outdated
\sum_{i=1}^{n} (y_i - f_i)^2 : f_i \leq f_{i+1} \quad \forall i \in \{1, ..., n-1\}

where :math:`y_i` is the true label of sample :math:`i` and :math:`f_i`
is the output of the classifier for sample :math:`i`. This method is more
- is the output of the classifier for sample :math:`i`. This method is more
+ is the calibrated output of the classifier for sample :math:`i`. This method is more
...are you sure? I am confused now
this is the function that is minimized to find the isotonic function, so should be the output of the classifier..?
We're already using f_i above to define the output of the un-calibrated classifier. The formula should be

\sum_{i=1}^{n} (y_i - \hat{f}_i)^2

where \hat{f} is as @ogrisel suggested (the calibrated probability), and the constraint is that \hat{f}_i >= \hat{f}_j whenever f_i >= f_j.
Indeed sorry for the confusion.
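As an aside (a tiny sketch with made-up numbers, not part of the thread): IsotonicRegression computes exactly this constrained least-squares fit.

import numpy as np
from sklearn.isotonic import IsotonicRegression

f = np.array([0.1, 0.3, 0.35, 0.6, 0.9])  # un-calibrated classifier outputs f_i
y = np.array([0, 1, 0, 1, 1])             # true labels y_i

# fit_transform returns the calibrated outputs \hat{f}_i: the non-decreasing
# sequence minimizing sum((y_i - \hat{f}_i)^2)
f_hat = IsotonicRegression().fit_transform(f, y)
print(f_hat)  # [0.  0.5 0.5 1.  1. ] -- monotone in the input scores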
doc/modules/calibration.rst
Outdated
.. math::
    \sum_{i=1}^{n} (y_i - \hat{f}_i)^2

subject to \hat{f}_i >= \hat{f}_j whenever f_i >= f_j. :math:`y_i` is the true
The math formatting is missing here: https://109917-843222-gh.circle-artifacts.com/0/doc/modules/calibration.html#isotonic
- subject to \hat{f}_i >= \hat{f}_j whenever f_i >= f_j. :math:`y_i` is the true
+ subject to :math:`\hat{f}_i >= \hat{f}_j` whenever :math:`f_i >= f_j`. :math:`y_i` is the true
This paragraph will probably need to be wrapped to avoid going beyond 80 chars.
whoops
Under vscode, I use https://marketplace.visualstudio.com/items?itemName=stkb.rewrap with the alt+q keyboard shortcut for this.
LGTM (assuming the circle ci output will be good after the latest commit). :)
@ogrisel do you have any idea why my class linking (e.g., :class:`SGDClassifier`) is not appearing (as a link) in the built documentation?
doc/modules/calibration.rst
Outdated
@@ -11,16 +11,21 @@ When performing classification you often want not only to predict the class
 label, but also obtain a probability of the respective label. This probability
 gives you some kind of confidence on the prediction. Some models can give you
 poor estimates of the class probabilities and some even do not support
-probability prediction. The calibration module allows you to better calibrate
+probability prediction (e.g., :class:`SGDClassifier`). The calibration
you need the whole path unless there's a previous sphinx directive indicating the current module (which wouldn't be sklearn.linear_model anyway)
- probability prediction (e.g., :class:`SGDClassifier`). The calibration
+ probability prediction (e.g., :class:`~sklearn.linear_model.SGDClassifier`). The calibration
ooohhhh thanks, that is good to know.
thanks @lucyleeow this will be a nice addition to the UG!
some last comments from me
doc/modules/calibration.rst
Outdated
@@ -11,16 +11,21 @@ When performing classification you often want not only to predict the class
 label, but also obtain a probability of the respective label. This probability
 gives you some kind of confidence on the prediction. Some models can give you
 poor estimates of the class probabilities and some even do not support
-probability prediction. The calibration module allows you to better calibrate
+probability prediction (e.g., :class:`~sklearn.linear_model.SGDClassifier`).
- probability prediction (e.g., :class:`~sklearn.linear_model.SGDClassifier`).
+ probability prediction (e.g., some instances of :class:`~sklearn.linear_model.SGDClassifier`).
doc/modules/calibration.rst
Outdated
@@ -85,22 +90,29 @@ Calibrating a classifier

Calibrating a classifier consists in fitting a regressor (called a
consists of
(pretty sure that was from me :s)
doc/modules/calibration.rst
Outdated
is calibrated first for each class separately in a one-vs-rest fashion [4]_.
When predicting probabilities, the calibrated probabilities for each class
:class:`CalibratedClassifierCV` supports the use of two 'calibration'
regressors: 'sigmoid' and 'isotonic'. Both these regressors only
I would put the section about multiclass (from "Both these regressors" to "normalize them") into another subsection, e.g. "Multiclass support", once the isotonic and sigmoid calibrators have been described.
Otherwise the 2 subsections detailing isotonic and sigmoid are a bit abrupt. It would be more natural if they directly followed from ":class:`CalibratedClassifierCV` supports the use of two 'calibration' regressors: 'sigmoid' and 'isotonic'."
We can move the sentence about brier_score just before.
Thanks for the suggestion, it works much better. I thought it didn't flow well but I wasn't sure how best to change it.
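For what it's worth, a small sketch of the multiclass behaviour described above (not from the PR; iris and SGDClassifier are arbitrary choices): each class is calibrated one-vs-rest and the per-class probabilities are normalized to sum to one.

from sklearn.datasets import load_iris
from sklearn.linear_model import SGDClassifier
from sklearn.calibration import CalibratedClassifierCV

X, y = load_iris(return_X_y=True)
# Each of the 3 classes is calibrated separately in a one-vs-rest fashion
clf = CalibratedClassifierCV(SGDClassifier(random_state=0), cv=3)
clf.fit(X, y)
proba = clf.predict_proba(X[:3])
print(proba.sum(axis=1))  # rows sum to ~1 after normalization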
doc/modules/calibration.rst
Outdated
symmetrical [1]_. It is thus most effective when the un-calibrated model is
under-confident and has similar errors for both high and low
my understanding is that we assume over-confidence for low probabilities and under-confidence for high probabilities (which are then compensated by the shape of the logistic function)?
also by "similar errors" do we mean errors with similar absolute values / magnitude?
Oh I am confused. I thought that a 'sigmoid' shape of calibration curve meant an under-confident classifier and a 'transposed-sigmoid' shape meant an over-confident model. This is just from the example: https://scikit-learn.org/stable/auto_examples/calibration/plot_calibration_curve.html#sphx-glr-auto-examples-calibration-plot-calibration-curve-py
I think "similar errors" is in terms of the shape of the calibration curve - the sigmoid is symmetrical in shape. Since we are dealing with the difference between predicted probability and frequency of true positives per bin, I would say similar absolute difference?
> my understanding is that we assume over-confidence for low probabilities and under-confidence for high probabilities (which are then compensated by the shape of the logistic function)?

Ok I see how I was wrong here

> Oh I am confused. I thought that a 'sigmoid' shape of calibration curve meant under-confident classifier and 'transposed-sigmoid' shape meant over-confident model. This is just from the example: scikit-learn.org/stable/auto_examples/calibration/plot_calibration_curve.html#sphx-glr-auto-examples-calibration-plot-calibration-curve-py

Now I'm confused too: from the example, it seems to me that NB (with a transposed sigmoid shape) is under-confident, while the LinearSVC (with a sigmoid shape) is over-confident: for example for the bin at 0.8, its predictions are close to 1, so I interpret this as being over-confident about the positive class. What am I getting wrong? Maybe @ogrisel can chime in?
> from the example, it seems to me that NB (with a transposed sigmoid shape) is under-confident, while the LinearSVC (with a sigmoid shape) is over-confident

Yes you are right! I should have thought about this more. (I think @ogrisel will be on holiday from next week though...)
> What am I getting wrong?

Ok so what I was getting wrong is that I was inverting the axes: the x axis is what the classifier predicts and the y axis are the actual proportions. So indeed the sigmoid curve describes under-confidence and the example is correct.
Yes, okay. It is tricky to interpret. I guess you have to think about it from 0.5 and go up/down.
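To make the axis convention concrete, a toy sketch (invented values, not from the thread): calibration_curve puts the mean predicted probability on the x axis and the observed fraction of positives on the y axis.

import numpy as np
from sklearn.calibration import calibration_curve

y_true = np.array([0, 0, 0, 1, 1, 1, 1, 1])
y_prob = np.array([0.1, 0.2, 0.3, 0.4, 0.65, 0.7, 0.8, 0.9])

prob_true, prob_pred = calibration_curve(y_true, y_prob, n_bins=2)
print(prob_pred)  # x axis: mean predicted probability per bin -> [0.25   0.7625]
print(prob_true)  # y axis: actual fraction of positives per bin -> [0.25 1.  ]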
doc/modules/calibration.rst
Outdated
subject to :math:`\hat{f}_i >= \hat{f}_j` whenever
:math:`f_i >= f_j`. :math:`y_i` is the true
label of sample :math:`i` and :math:`\hat{f}_i` is the output of the
calibrated classifier for sample :math:`i`. This method
- calibrated classifier for sample :math:`i`. This method
+ calibrated classifier for sample :math:`i`, i.e. the calibrated probability. This method

Thanks for the review @NicolasHug. I expanded on brier score as well, adding the definitions from #10969. I think once #11096 is done, they will amend the doc to talk about calibration loss instead of brier score, but for now I thought it would be useful to expand on Brier score.
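For context (a quick sketch with invented numbers, not from the PR): brier_score_loss computes the mean squared difference between the predicted probabilities and the binary outcomes.

import numpy as np
from sklearn.metrics import brier_score_loss

y_true = np.array([0, 1, 1, 0])
y_prob = np.array([0.1, 0.9, 0.8, 0.3])
print(brier_score_loss(y_true, y_prob))  # 0.0375
print(np.mean((y_prob - y_true) ** 2))   # same value, by definition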
doc/modules/calibration.rst
Outdated
The sigmoid method is biased in that it assumes the :ref:`calibration curve
<calibration_curve>` of the un-calibrated model has a sigmoid shape and is
symmetrical [1]_. It is thus most effective when the un-calibrated model is
I don't want to delay merging further if you're sure about this, but there are a few things that aren't clear for me here:
- why does sigmoid calibration assume a sigmoid calibration curve?
- is this really discussed in [1]_, or rather in projecteuclid.org/download/pdfview_1/euclid.ejs/1513306867?

In projecteuclid.org/download/pdfview_1/euclid.ejs/1513306867 it is said that

> the parametric assumption made by logistic calibration is exactly the right one if the scores output by a classifier are normally distributed within each class around class means s+ and s− with the same variance σ2

though I'm not sure yet how that relates to the comment made above.
The symmetry assumption is discussed in projecteuclid.org/download/pdfview_1/euclid.ejs/1513306867 but not in [1]_. I can add it in.

> why does sigmoid calibration assume a sigmoid calibration curve?

I am not clear on the maths but I think this is explained better in the original Platt paper: https://www.researchgate.net/publication/2594015_Probabilistic_Outputs_for_Support_Vector_Machines_and_Comparisons_to_Regularized_Likelihood_Methods, section 2.1 (which I can reference instead).
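For reference (paraphrasing section 2.1 of the Platt paper, not text from this thread), the parametric form fitted there is the sigmoid

.. math::
    P(y = 1 | f) = \frac{1}{1 + \exp(A f + B)}

where the scalars :math:`A` and :math:`B` are fitted by maximum likelihood on a calibration set.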
> I am not clear on the maths but I think this is explained better in the original Platt paper

I don't see where this paper says such a thing. What I read is:
- the sigmoid model is equivalent to assuming that the output of the SVM is proportional to the log odds of a positive example
- the class-conditional densities between the margins are apparently exponential. Bayes' rule on 2 exponentials suggests using a parametric form of a sigmoid

I don't understand how these two are equivalent to "using a sigmoid calibration assumes that the calibration curve has a sigmoid shape".
Hmm yes, that is true. (Is it fair to say that) it was designed to calibrate the output of the SVM, which has a sigmoid shape (is this always true)?
Maybe we can just say:
- using the sigmoid calibration method assumes that the calibration curve can be corrected by applying a sigmoid function to the raw predictions. This assumption has been empirically justified in the case of support vector machines with common kernel functions on various benchmark datasets in section 2.1 of Platt 2000 [1] but does not necessarily hold in general.
Thanks for your help @ogrisel and @NicolasHug, I've made some changes and hopefully it is correct now...
thanks a lot @lucyleeow, this will merge when green
Reference Issues/PRs

Addresses: #16321 (comment)

What does this implement/fix? Explain your changes.

(I added these points below in this PR as well but I am happy to remove/change if not appropriate in this PR)
- cv='prefit' belongs next to the section about cross-validation (see the sketch after this list).
- CalibratedClassifierCV uses one-vs-the-rest to extend to multiclass.

Any other comments?

ping @NicolasHug
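A sketch of the cv='prefit' point from the list above (not part of the PR itself; the dataset and base model are arbitrary choices):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV

X, y = make_classification(random_state=0)
X_fit, X_cal, y_fit, y_cal = train_test_split(X, y, random_state=0)

base = LinearSVC(dual=False).fit(X_fit, y_fit)          # base estimator fitted beforehand
calibrated = CalibratedClassifierCV(base, cv="prefit")  # only the calibrator is fitted,
calibrated.fit(X_cal, y_cal)                            # on separate held-out data
print(calibrated.predict_proba(X_cal[:2]))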