ENH Add libsvm-like calibration procedure to CalibratedClassifierCV #17856
Conversation
@NicolasHug, this is probably not going to be a problem but I thought I would mention it. We use …
Thanks @lucyleeow, made a first pass.
I think the refactoring makes sense, WDYT?
I haven't looked in detail at the changes related to the label encoders though. Are these needed here or can they wait for another PR?
sklearn/calibration.py (Outdated)
# When binary, proba of clf_fitted.classes_[1]
# output but `pos_class_indices` = 0
it seems that the grammar is incorrect?
I amended to:
# When binary, df consists of predictions for
# clf_fitted.classes_[1] but `pos_class_indices` = 0
not sure if this is clearer
sklearn/calibration.py (Outdated)
if base_estimator_method == "decision_function":
    if df.ndim == 1:
        df = df[:, np.newaxis]
else:
    if len(self.label_encoder_.classes_) == 2:
        df = df[:, 1:]
The duplication is a bit unfortunate; there should be a way to merge this logic with that of `_get_predictions`. Maybe by extracting this into a `_reshape_df` utility?
Or maybe we can even let `_get_predictions` call `cross_val_predict` somehow, so that we can also merge the method checks above.
Also, is `df = df[:, 1:]` just equivalent to `df = df[:, 1]`?
> Or maybe we can even let `_get_predictions` call `cross_val_predict` somehow, so that we can also merge the method checks above.

This is what I was trying to do before - but `_get_predictions` is used within `_fit_calibrator`, which is within the cv for loop. This is why I wanted to put the cv for loop within `_get_predictions`, but it got complicated.
I can add a `_reshape_df` function but I have a little dislike for the `df` naming as it is not always `decision_function`. (I also first think df = dataframe, but maybe that is just me.) WDYT?
(Of course, if this naming is used elsewhere, happy to keep it.)
Also I can't think of an elegant way to add the `_reshape_df` function, as we get the predictions of the classifier in `_get_predictions` but we use `cross_val_predict` to get the predictions in the case above...
Yes, I was confused by `df` too. We can rename it as `predictions` or just `preds`.
> Also I can't think of an elegant way to add the `_reshape_df`

Would it help to also add a `_get_method` helper, like this?

def _get_method(clf):
    # return decision_function (the actual method) or predict_proba, or raise an error
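For concreteness, a minimal sketch of what such a helper could look like (illustrative only; the name `_get_method` and the exception type are placeholders rather than the final API):

def _get_method(clf):
    # Return the method to calibrate on: prefer decision_function,
    # fall back to predict_proba, and raise if neither is available.
    if hasattr(clf, "decision_function"):
        return clf.decision_function
    if hasattr(clf, "predict_proba"):
        return clf.predict_proba
    raise RuntimeError(
        "classifier has no decision_function or predict_proba method"
    )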
Looking into it, `_fit_calibrator` returns a `_CalibratedClassifierPipeline` instance for each cv fold. For `ensemble=False`, this won't work (for how `_fit_calibrator` works atm) - we want to fit once with all the predictions. We would need to take `_fit_calibrator` out of the cv for loop.
> Looking into it, _fit_calibrator returns a _CalibratedClassifierPipeline instance for each cv fold

Only when ensemble=True, right? There's no CV loop when `ensemble=False`; the "loop" is in `cross_val_predict` and we only need to call `_fit_calibrator` once. Maybe I misunderstood?
This is difficult over text! I may be confused too.
At the moment we have:

class CalibratedClassifierCV():
    def fit():
        if self.cv == "prefit":
            # `_fit_calibrator` using all X and y
            # append one `_CalibratedClassifierPipeline` instance to `self.calibrated_classifiers_`
        else:
            if ensemble:
                for train, test in cv.split(X, y):
                    # fit classifier
                    # call `_fit_calibrator`, producing a `_CalibratedClassifierPipeline` instance
                    # append `_CalibratedClassifierPipeline` instance to `self.calibrated_classifiers_`,
                    # ending up with `n_cv` `_CalibratedClassifierPipeline` instances
            else:
                # Use `cross_val_predict` to get all `preds`
                # call `_fit_calibrator` using `preds` and y
                # append one `_CalibratedClassifierPipeline` instance to `self.calibrated_classifiers_`

We want to get rid of the last `else` as there is duplication of code there. With the new `_get_prediction_method` we have:

class CalibratedClassifierCV():
    def fit():
        if self.cv == "prefit":
            # `_fit_calibrator` using all X and y
            # append one `_CalibratedClassifierPipeline` instance to `self.calibrated_classifiers_`
        else:
            for train, test in cv.split(X, y):
                # fit classifier
                # call `_fit_calibrator`, which first calls `_get_predictions`.
                # In case of `ensemble=False`, we use `partial(cross_val_predict, ...)` (this works fine)
                # `_fit_calibrator` also fits calibrators and produces a `_CalibratedClassifierPipeline` instance
                # Only one `_CalibratedClassifierPipeline` is required when `ensemble=False`
                # (i.e. shouldn't be in the for loop)
                # when `ensemble=True`, `n_cv` `_CalibratedClassifierPipeline` instances are needed
                # (i.e., should be in the for loop)
Sorry if I wasn't clear! I don't think we can move away from the first version (with the last `else`). The duplication of code that I wanted to avoid was:

- getting the right method

if hasattr(base_estimator, "decision_function"):
    base_estimator_method = "decision_function"
elif hasattr(base_estimator, "predict_proba"):
    base_estimator_method = "predict_proba"

- reshaping the predictions

if base_estimator_method == "decision_function":
    if preds.ndim == 1:
        preds = preds[:, np.newaxis]
else:
    if len(self.label_encoder_.classes_) == 2:
        preds = preds[:, 1]

Right now, these 2 things happen both in `CalibratedClassifierCV.fit` (for ensemble=False) and in `_get_predictions` (for ensemble=True).
So to expand on my previous suggestion:

def get_prediction_method(clf, return_string=False):
    # return clf.decision_function or clf.predict_proba, or their corresponding string
    # we need strings when using cross_val_predict

def _get_predictions(X, pred_func):
    # pred_func is either a method of the clf, or a partial cross_val_predict
    preds = pred_func(X)
    # reshape preds here
    return preds

class CalibratedClassifierCV():
    def fit():
        if self.cv == "prefit":
            # `_fit_calibrator` using all X and y
            # append one `_CalibratedClassifierPipeline` instance to `self.calibrated_classifiers_`
        else:
            if ensemble:
                for train, test in cv.split(X, y):
                    # fit classifier on train
                    preds = _get_predictions(test, get_prediction_method(clf, return_string=False))
                    # call `_fit_calibrator`, producing a `_CalibratedClassifierPipeline` instance
                    # append `_CalibratedClassifierPipeline` instance to `self.calibrated_classifiers_`,
                    # ending up with `n_cv` `_CalibratedClassifierPipeline` instances
            else:
                preds = _get_predictions(X, partial(cross_val_predict, ..., method=get_prediction_method(clf, return_string=True)))
                # call `_fit_calibrator` using `preds` and y
                # append one `_CalibratedClassifierPipeline` instance to `self.calibrated_classifiers_`

I think we might need to modify `_fit_calibrator` to only accept `preds` and not call `_get_predictions`.
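To make the proposed wiring concrete, here is a minimal self-contained sketch under the assumptions above (the names `get_prediction_method` and `_get_predictions` follow the pseudo-code and are illustrative, not the final API):

from functools import partial

import numpy as np
from sklearn.model_selection import cross_val_predict


def get_prediction_method(clf, return_string=False):
    # Prefer decision_function, fall back to predict_proba; the string form
    # is what cross_val_predict(..., method=...) expects.
    for name in ("decision_function", "predict_proba"):
        if hasattr(clf, name):
            return name if return_string else getattr(clf, name)
    raise RuntimeError("classifier has no decision_function or predict_proba")


def _get_predictions(X, pred_func, method_name, n_classes):
    # pred_func is either a bound method of a fitted classifier
    # (ensemble=True) or a partial cross_val_predict (ensemble=False).
    preds = pred_func(X)
    if method_name == "decision_function":
        if preds.ndim == 1:
            preds = preds[:, np.newaxis]
    elif n_classes == 2:
        preds = preds[:, 1:]  # keep 2D: only the positive-class column
    return preds


# ensemble=False: out-of-fold predictions for all samples
#     method_name = get_prediction_method(clf, return_string=True)
#     pred_func = partial(cross_val_predict, clf, y=y, cv=cv, method=method_name)
#     preds = _get_predictions(X, pred_func, method_name, n_classes)
# ensemble=True: predictions of the classifier fitted on this fold
#     pred_func = get_prediction_method(clf_fitted)
#     preds = _get_predictions(X[test], pred_func,
#                              get_prediction_method(clf_fitted, return_string=True),
#                              n_classes)

The partial binds everything except X, so `_get_predictions` can call `pred_func(X)` the same way in both modes.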
FYI:
> Also is df = df[:, 1:] just equivalent to df = df[:, 1]?

It seems that it is not the same; you get 2D output with `df[:, 1:]`.
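A quick standalone check of the difference (not from the PR):

import numpy as np

proba = np.array([[0.2, 0.8],
                  [0.6, 0.4]])
print(proba[:, 1].shape)   # (2,)   -> 1D
print(proba[:, 1:].shape)  # (2, 1) -> 2D, keeps the column axis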
There wasn't a big change to this. Previously we passed the classes to [sklearn/calibration.py, lines 180 to 181 in fd23727] and then instantiated a new [sklearn/calibration.py, lines 325 to 329 in fd23727]. (Note: I just changed it so we do not pass the classes but the fitted …)
I've made the changes. I do think that it is a bit of duplication to have both `_check_classifier_response_method` and `get_prediction_method`; a unified utility would do the trick. But maybe that is a separate issue.
Thanks @lucyleeow, made another pass.
I think the use of `partial` is appropriate and the refactoring looks good to me. We'll need tests for the new `ensemble` parameter ;)

> I do think that it is a bit of duplication to have both _check_classifier_response_method and get_prediction_method [...]

Yes, we can consider having a unified utility, but indeed this would be better done in a separate PR.
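For illustration, a rough sketch of what a test for the new parameter could look like (the dataset, base classifier, and assertions here are placeholders, not the tests actually added in this PR):

import numpy as np
import pytest
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression


@pytest.mark.parametrize("method", ["sigmoid", "isotonic"])
@pytest.mark.parametrize("ensemble", [True, False])
def test_calibrated_classifier_cv_ensemble(method, ensemble):
    # Smoke test: both ensemble modes fit and return valid probabilities.
    X, y = make_classification(n_samples=200, random_state=42)
    cal_clf = CalibratedClassifierCV(
        LogisticRegression(), method=method, cv=5, ensemble=ensemble
    )
    cal_clf.fit(X, y)
    proba = cal_clf.predict_proba(X)
    assert proba.shape == (200, 2)
    np.testing.assert_allclose(proba.sum(axis=1), 1.0)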
ping @NicolasHug @ogrisel
This looks very good to me. Thank you very much @lucyleeow for your patience.
Here are some fixes / small suggestions for further improvements but otherwise it seems ready for second review and merge.
Please add a new entry to `doc/whats_new/0.24.rst`.
Thanks for the review @ogrisel, suggestions added!
It looks good. Just a couple of nitpicks.
Merged! Thank you very much @lucyleeow!
@@ -66,7 +66,7 @@ def test_calibration(data, method, ensemble):
             (sparse.csr_matrix(X_train),
              sparse.csr_matrix(X_test))]:
         cal_clf = CalibratedClassifierCV(
-            clf, method=method, cv=2, ensemble=ensemble
+            clf, method=method, cv=5, ensemble=ensemble
I used `cv=5` whenever we assess the quality of the calibration because `cv=2` trains models on a much smaller dataset and can degrade the base model performance, hence making the test quite brittle.
Reference Issues/PRs
closes #16145
Follows from PR #16167
The first step was to refactor `CalibratedClassifierCV`, which was done in #17803. This is a branch off of #17803. (will close #17803)

What does this implement/fix? Explain your changes.
Add an `ensemble` option to `CalibratedClassifierCV`. `ensemble=True` would be the default and would be the current implementation. `ensemble=False` would be the same as the libsvm implementation (used when `probability=True`), when `cv` is not `'prefit'`.
Note: `ensemble=False` is the procedure described in the original Platt paper (see #16145 (comment)).

Any other comments?