Refactor CalibratedClassifierCV #17803
Conversation
Thanks @lucyleeow, haven't reviewed in detail yet.
Do you think it would make sense to open another PR that branches off this one and that implements the new calibration procedure? It might make it easier to judge whether the refactoring that I proposed is appropriate.
sklearn/calibration.py
Outdated
return df, idx_pos_class
Output of the `decision_function` method of the `clf_fitted` is used for
calibration. If this method does not exist, `predict_proba` method used.
- calibration. If this method does not exist, `predict_proba` method used.
+ calibration. If this method does not exist, `predict_proba` method is used.
sklearn/calibration.py
Outdated
if method == 'isotonic':
    calibrator = IsotonicRegression(out_of_bounds='clip')
elif method == 'sigmoid':
    calibrator = _SigmoidCalibration()
shouldn't this block be removed? It looks like it's also inside of the loop below.
Thanks, whoops!
elif self.method == 'sigmoid':
    calibrator = _SigmoidCalibration()
else:
    raise ValueError('method should be "sigmoid" or '
I think I prefer validating the input here instead of having an if/elif
with no else clause. In general it's better to validate the input right before it is used.
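For illustration, a sketch of what that could look like with the construction pulled into a hypothetical `_get_calibrator` helper (the helper name and error message are illustrative, not from this PR):

from sklearn.isotonic import IsotonicRegression
from sklearn.calibration import _SigmoidCalibration  # private helper used by this module

def _get_calibrator(method):
    # Validate `method` right where it is used, with an explicit else clause.
    if method == 'isotonic':
        return IsotonicRegression(out_of_bounds='clip')
    elif method == 'sigmoid':
        return _SigmoidCalibration()
    else:
        raise ValueError(
            "method should be 'sigmoid' or 'isotonic'. Got %r." % method
        )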
sklearn/calibration.py
Outdated
class _CalibratedClassiferPipeline:
    """Pipeline chaining a fitted classifier and it's fitted calibrators.
"""Pipeline chaining a fitted classifier and it's fitted calibrators. | |
"""Pipeline-like chaining a fitted classifier and its fitted calibrators. |
sklearn/calibration.py
Outdated
'_SigmoidCalibration'). Number of calibrators equals the number of
classes. However, if there are 2 classes, list contains only one
- '_SigmoidCalibration'). Number of calibrators equals the number of
- classes. However, if there are 2 classes, list contains only one
+ '_SigmoidCalibration'). The number of calibrators equals the number of
+ classes. However, if there are 2 classes, the list contains only one
sklearn/calibration.py
Outdated
def predict_proba(self, X):
    """Posterior probabilities of classification
    """Calculate posterior (calibrated) probabilities.
I'm wondering whether "posterior" is a term that makes sense here. Should we just talk about calibrated probabilities instead?
Yes, sounds good to me! I was just keeping the old term.
sklearn/calibration.py
Outdated
calibrator.fit(this_df, Y[:, k], sample_weight)
self.calibrators_.append(calibrator)
calibrated_classifiers = []
for idx, this_df in zip(idx_pos_class, df.T):
I'd suggest using `class_idx` instead of just `idx`. (`k` is commonly used to define a class so it'd be fine too, though maybe more cryptic.)
Ah I didn't realise the `k` terminology, otherwise I would have kept it. I will use `class_idx` for others like me who aren't familiar with the convention.
sklearn/calibration.py
Outdated
----------
.. [1] Obtaining calibrated probability estimates from decision trees
       and naive Bayesian classifiers, B. Zadrozny & C. Elkan, ICML 2001
idx_pos_class : array-like, shape (n_classes,)
- idx_pos_class : array-like, shape (n_classes,)
+ pos_class_indices : array-like, shape (n_classes,)
Otherwise the name `idx` suggests it's a single value.
Though it seems that in the binary case this isn't the `pos_class` index?
Yes, in the binary case we only output the probability of `clf_fitted.classes_[1]`:
elif hasattr(clf_fitted, "predict_proba"):
    df = clf_fitted.predict_proba(X)
    if len(label_encoder_.classes_) == 2:
        df = df[:, 1:]
though this can be considered the 'positive class'? (the other prob is calculated later as 1 - prob)
`idx_pos_class` indicates which classes are 'present' in a particular train/test subset.
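As a self-contained toy illustration of that binary-case handling (this is not the PR's code path; `LogisticRegression` is only used here as a stand-in classifier):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(random_state=0)
clf_fitted = LogisticRegression().fit(X, y)

df = clf_fitted.predict_proba(X)
if len(clf_fitted.classes_) == 2:
    df = df[:, 1:]  # keep only the column for classes_[1]

# ... `df` would go through the fitted calibrator here ...
proba = np.hstack([1.0 - df, df])  # the other column is recovered as 1 - prob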
(Thanks for the review, you beat me to it while I was writing this long text.) @NicolasHug suggested the following refactoring (copied over), with suggested pseudocode: #16167 (comment)
Details of implementation in this PR:
Classes
This is the only place that
(Edited for succinctness)
Happy to do this but could you describe the steps? Locally, I know how to make a branch (e.g., 'add_ensemble') that branches off this branch (e.g., 'refactor'). Would I then just push 'add_ensemble' to make a new PR? Is the purpose such that you can look at the diff between 'add_ensemble' and 'refactor'?
Yes, the new branch would be just like any other branch. It would not branch from master, unlike most branches we use, but other than that it's really just a branch.
Honestly I'm not sure my suggestion of having 2 PRs makes sense. The thing is, right now it's hard to review the refactoring because we cannot really tell whether it will help implementing the new `ensemble` option.
No problem @NicolasHug, happy to make a branch from this branch. I can practice my git!
I think you were right here. The refactoring needs to be re-jigged for it to work/help the implementation (so it doesn't use 'predictions_in_X' - from #16167). However, even with re-jigging, I can't think of a good way. This is what I got, but I think it uses more memory than the current implementation (by saving the outputs from each cv fold) and I don't think it's very elegant:

def _get_predictions(base_estimator, X, cv, ensemble,
                     base_estimator_supports_sw, label_encoder_):
    # Returns a list of prediction tuples of the form (`df`, `pos_class_indices`, `test_indices`).
    # For `ensemble=True` the list would contain 1 tuple; for `ensemble=False` the
    # list would contain a tuple for each cv split.
    # Note the cv loop is done here.
    # The `test_indices` are required, since we need `sample_weight[test]` to fit the
    # calibration and the cv loop is inside here.

class CalibratedClassifierCV(BaseEstimator, ClassifierMixin,
                             MetaEstimatorMixin):
    def fit(self, X, y, sample_weight=None):
        # Checking
        # `_get_predictions`. Loop through the list of prediction tuples and perform
        # `_fit_calibrator`. Return a list of `_CalibratedClassiferPipeline` instances.

Any suggestions?
Sorry I don't understand why
Yes I guess they don't. I just wanted to avoid repetition but it is not necessary. We can do, as with the original PR:
I guess my question is: did you just want an implementation without the use of
I think this is what I had in mind, yes. From what I can tell, what we need to change is mostly
yes
My suggestion (pasted below) doesn't work? My "vision" is that when

def _fit_calibrator(fitted_classifier, method, y, X=None, df=None):
    # Take a fitted classifier as input, fit a calibrator, and return the
    # corresponding pipeline.
    # df is mutually exclusive with X
    # y should always be passed
    # This should handle the multiclass case too
    if df is None:
        df = fitted_classifier.decision_function(X)  # or predict_proba
    calibrator = _get_regressor(method)
    calibrator.fit(df, y)
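For what it's worth, a rough, self-contained version of that sketch could look like the following; it is only an illustration (no sample weights, no `predict_proba` fallback, and it returns a plain tuple rather than constructing `_CalibratedClassiferPipeline`):

from sklearn.calibration import _SigmoidCalibration  # private helper
from sklearn.isotonic import IsotonicRegression
from sklearn.preprocessing import label_binarize

def _fit_calibrator(fitted_classifier, method, y, X=None, df=None):
    # df is mutually exclusive with X; compute predictions if not given
    if df is None:
        df = fitted_classifier.decision_function(X)
    if df.ndim == 1:
        df = df.reshape(-1, 1)  # binary case: a single column of decision values
    # one indicator column per class (a single column when there are 2 classes)
    Y = label_binarize(y, classes=fitted_classifier.classes_)
    calibrators = []
    for class_idx in range(df.shape[1]):
        if method == 'isotonic':
            calibrator = IsotonicRegression(out_of_bounds='clip')
        elif method == 'sigmoid':
            calibrator = _SigmoidCalibration()
        else:
            raise ValueError("method should be 'sigmoid' or 'isotonic'")
        calibrator.fit(df[:, class_idx], Y[:, class_idx])
        calibrators.append(calibrator)
    # the real helper would wrap these in a _CalibratedClassiferPipeline
    return fitted_classifier, calibrators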
Thanks, I was making it more difficult than it needs to be. First attempt here: #17856. I will amend documentation and add tests once the implementation is okay.
Overall I like the refactoring but I think we could do a better job at preserving previous attribute names for the calibrated classifiers as it could be considered semi-public API (from a duck-typing point of view) and would not cost us much from a maintenance point of view. See the details in the review comments below:
@@ -311,145 +326,150 @@ def _more_tags(self):
        }

class _CalibratedClassifier:
    """Probability calibration with isotonic regression or sigmoid.
def _get_predictions(clf_fitted, X, label_encoder_):
nitpick renaming suggestion: `_compute_predictions` to emphasize that this is actually calling the `pred_method` (rather than just retrieving some precomputed predictions).
Changing this in #17856 as it's easier.
classes. However, if there are 2 classes, the list contains only one
fitted calibrator.
"""
def __init__(self, clf_fitted, calibrators_fitted, label_encoder_):
I think I would prefer to use the names:
- `clf_fitted` => `base_estimator`
- `calibrators_fitted` => `calibrators`
- `label_encoder_` => `label_encoder`
and just mention in the docstring that all of them are expected to be fitted.
Also we could add a property named `calibrators_` that would just return `calibrators`, and `label_encoder_` pointing to `label_encoder`, to ensure backward compat with the previous attribute names, maybe with a deprecation warning.
While this is a private class, calibrated classifier instances are stored in a public attribute of the public `CalibratedClassifierCV` class, therefore could be considered semi-public API, at least from a duck-typing point of view.
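One way the backward-compatible aliases could look (a sketch only; the exact warning category and messages are illustrative):

import warnings

class _CalibratedClassiferPipeline:
    """Sketch of the renaming plus backward-compatible properties."""

    def __init__(self, base_estimator, calibrators, label_encoder):
        # all three are expected to be fitted already
        self.base_estimator = base_estimator
        self.calibrators = calibrators
        self.label_encoder = label_encoder

    @property
    def calibrators_(self):
        # old attribute name kept as an alias for backward compatibility
        warnings.warn("`calibrators_` is deprecated; use `calibrators`.",
                      FutureWarning)
        return self.calibrators

    @property
    def label_encoder_(self):
        warnings.warn("`label_encoder_` is deprecated; use `label_encoder`.",
                      FutureWarning)
        return self.label_encoder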
Thanks, @NicolasHug reminded me of keeping the attributes constant and it's updated in #17856. I've also removed the attribute `label_encoder`.

Also we could add a property named calibrators_ that would just return calibrators

This may be tricky/confusing because for multiclass, there is more than one calibrator per classifier, e.g., for 3 classes each pair would be (1 classifier, 3 calibrators). I guess we could return a list of tuples, each tuple containing the calibrators for a 'pair'?
Actually I realised that `_CalibratedClassifier` has a `calibrators` attribute, which can be accessed via `calibrated_classifiers_.calibrators`.
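For reference, a small end-to-end illustration of that access path, assuming the attribute names from #17856 land as discussed (indexing is needed since `calibrated_classifiers_` is a list):

from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

X, y = make_classification(n_classes=3, n_informative=6, random_state=0)
calib = CalibratedClassifierCV(LinearSVC(), method='sigmoid', cv=3).fit(X, y)

pair = calib.calibrated_classifiers_[0]   # one entry per cv split
print(len(pair.calibrators))              # one calibrator per class, so 3 here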
Reference Issues/PRs
First step to implement #16145.
Follows from #16167, implements the refactoring suggested.
What does this implement/fix? Explain your changes.
Add `ensemble` option to `CalibratedClassifierCV`. `ensemble=True` would be the default and would be the current implementation. `ensemble=False` would be the same as the libsvm implementation (used when `probability=True`). When `cv` is not `'prefit'`:
- Concatenate predictions from each cv split; use all predictions to train the calibrator.
- Classifiers fit on each cv split would be discarded.
- Use a classifier fit on all the data at prediction time.
Note: `ensemble=False` is the procedure described in the original Platt paper (see #16145 (comment)).
Performs the refactoring suggested by @NicolasHug (#16167 (comment)) before implementing the new option, for easier reviewing. Details in next comment.
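For context, a sketch of how the proposed option would be used once implemented (hypothetical until this work is merged; the parameter name and default follow the description above):

from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

X, y = make_classification(random_state=0)

# ensemble=True (proposed default): current behaviour, one (classifier, calibrator)
# pair per cv split, with predictions averaged at predict time
ens = CalibratedClassifierCV(LinearSVC(), cv=5, ensemble=True).fit(X, y)

# ensemble=False: calibrator trained on the concatenated out-of-fold predictions,
# and a single classifier refit on all the data (libsvm-style)
no_ens = CalibratedClassifierCV(LinearSVC(), cv=5, ensemble=False).fit(X, y)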
Any other comments?
cc @ogrisel @agramfort