
ENH Add libsvm-like calibration procedure to CalibratedClassifierCV #17856


Merged: 52 commits into scikit-learn:master on Oct 29, 2020

Conversation

@lucyleeow (Member) commented Jul 7, 2020:

Reference Issues/PRs

closes #16145
Follows from PR #16167
The first step was to refactor CalibratedClassifierCV, which was done in #17803. This is a branch off of #17803. (will close #17803)

What does this implement/fix? Explain your changes.

Add an ensemble option to CalibratedClassifierCV.
ensemble=True would be the default and would keep the current behaviour.
ensemble=False would match the libsvm implementation (used when probability=True):

When cv is not 'prefit':

  • Concatenate the predictions from each CV split and use all of them to train the calibrator.
  • The classifiers fit on each CV split would be discarded.
  • A classifier fit on all the data would be used at prediction time.

Note that ensemble=False is the procedure described in the original Platt paper (see #16145 (comment)).
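
For context, a minimal usage sketch of the two modes (written against the API as eventually merged; the dataset and base estimator are only illustrative):

from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=1000, random_state=0)

# ensemble=True (default): one (classifier, calibrator) pair per CV split;
# their predictions are averaged at predict time.
ens = CalibratedClassifierCV(LinearSVC(), method="sigmoid", cv=5, ensemble=True)
ens.fit(X, y)

# ensemble=False: out-of-fold predictions from cross_val_predict are pooled to fit a
# single calibrator, and a classifier refit on all the data is used at predict time.
pooled = CalibratedClassifierCV(LinearSVC(), method="sigmoid", cv=5, ensemble=False)
pooled.fit(X, y)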

Any other comments?

@lucyleeow (Member, Author) commented:

@NicolasHug, this is probably not going to be a problem, but I thought I would mention it.

We use cross_val_predict to get the predictions. There, if a training subset is missing one or more classes, the missing predictions are filled with 0 for predict_proba, and with 'the minimum finite float value for the dtype' ('we approximate negative infinity') for decision_function.
CalibratedClassifierCV uses 0 for both predict_proba and decision_function.
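
For concreteness, these are the two fill values being compared (just evaluating what the quoted documentation describes, not code from this PR):

import numpy as np

fill_predict_proba = 0.0                           # cross_val_predict and CalibratedClassifierCV
fill_decision_function = np.finfo(np.float64).min  # cross_val_predict's stand-in for -inf
print(fill_decision_function)                      # -1.7976931348623157e+308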

@NicolasHug (Member) left a comment:

Thanks @lucyleeow , made a first pass.

I think the refactoring makes sense, WDYT?

I haven't looked in detail at the changes related to the label encoders though. Are these needed here or can they wait for another PR?

Comment on lines 527 to 528
# When binary, proba of clf_fitted.classes_[1]
# output but `pos_class_indices` = 0
Member:

it seems that the grammar is incorrect?

@lucyleeow (Member, Author):

I amended to:

                # When binary, df consists of predictions for
                # clf_fitted.classes_[1] but `pos_class_indices` = 0

not sure if this is clearer

Comment on lines 302 to 307
if base_estimator_method == "decision_function":
    if df.ndim == 1:
        df = df[:, np.newaxis]
else:
    if len(self.label_encoder_.classes_) == 2:
        df = df[:, 1:]
Member:

The duplication is a bit unfortunate; there should be a way to merge this logic with that of _get_predictions.

Maybe by extracting this into a _reshape_df utility?
Or maybe we can even let _get_predictions call cross_val_predict somehow, so that we can also merge the method checks above.
Also, is df = df[:, 1:] just equivalent to df = df[:, 1]?

@lucyleeow (Member, Author) replied Jul 8, 2020:

Or maybe we can even let _get_predictions call cross_val_predict somehow, so that we can also merge the method checks above.

This is what I was trying to do before, but _get_predictions is used within _fit_calibrator, which is within the cv for loop. This is why I wanted to put the cv for loop within _get_predictions, but it got complicated.

I can add a _reshape_df function, but I have a slight dislike for the df naming as it is not always `decision_function`. (I also first read df as dataframe, but maybe that is just me.) WDYT?
(Of course, if this naming is used elsewhere, I'm happy to keep it.)

@lucyleeow (Member, Author):

Also, I can't think of an elegant way to add the _reshape_df function, as we get the predictions from the classifier in _get_predictions but we use cross_val_predict to get the predictions in the case above...

Member:

Yes I was confused by df too. We can rename it as predictions or just preds.

Member:

Also I can't think of an elegant way to add the _reshape_df

Would it help to also add a _get_method helper, like

def _get_method(clf):
    # return decision_function (the actual method) or predict_proba or raise error

?

@lucyleeow (Member, Author):

Looking into it, _fit_calibrator returns a _CalibratedClassifierPipeline instance for each cv fold. For ensemble=False, this won't work (given how _fit_calibrator works at the moment); we want to fit once with all the predictions. We would need to take _fit_calibrator out of the cv for loop.

Member:

Looking into it, _fit_calibrator returns a _CalibratedClassifierPipeline instance for each cv fold

Only when ensemble=True, right?

There's no CV loop when ensemble=False, the "loop" is in cross_val_predict and we only need to call _fit_calibrator once. Maybe I misunderstood?

@lucyleeow (Member, Author) replied Jul 9, 2020:

This is difficult over text! I may be confused too.

At the moment we have:

class CalibratedClassifierCV():

    def fit():
        if self.cv == "prefit":
            # `_fit_calibrator` using all X and y
            # append one `_CalibratedClassifierPipeline` instance to `self.calibrated_classifiers_`
        else:
            if ensemble:
                for train, test in cv.split(X, y):
                    # fit classifier
                    # call `_fit_calibrator`, producing a `_CalibratedClassifierPipeline` instance
                    # append the `_CalibratedClassifierPipeline` instance to `self.calibrated_classifiers_`,
                    # ending up with `n_cv` `_CalibratedClassifierPipeline` instances
            else:
                # Use `cross_val_predict` to get all `preds`
                # call `_fit_calibrator` using `preds` and y
                # append one `_CalibratedClassifierPipeline` instance to `self.calibrated_classifiers_`

We want to get rid of the last else as there is duplication of code there. With the new _get_prediction_method we have:

class CalibratedClassifierCV():

    def fit():
        if self.cv == "prefit":
            # `_fit_calibrator` using all X and y
            # append one `_CalibratedClassifierPipeline` instance to `self.calibrated_classifiers_`
        else:
            for train, test in cv.split(X, y):
                # fit classifier
                # call `_fit_calibrator`, which first calls `_get_predictions`.
                # In the case of `ensemble=False`, we use `partial(cross_val_predict, ...)` (this works fine)
                # `_fit_calibrator` also fits the calibrators and produces a `_CalibratedClassifierPipeline` instance.
                # Only one `_CalibratedClassifierPipeline` is required when `ensemble=False`
                # (i.e., it shouldn't be in the for loop)
                # When `ensemble=True`, `n_cv` `_CalibratedClassifierPipeline` instances are needed
                # (i.e., they should be in the for loop)

Member:

Sorry if I wasn't clear! I don't think we can move away from the first version (with the last else). The duplication of code that I wanted to avoid was:

  1. getting the right method

                if hasattr(base_estimator, "decision_function"):
                    base_estimator_method = "decision_function"
                elif hasattr(base_estimator, "predict_proba"):
                    base_estimator_method = "predict_proba"

  2. reshaping the predictions

                if base_estimator_method == "decision_function":
                    if preds.ndim == 1:
                        preds = preds[:, np.newaxis]
                else:
                    if len(self.label_encoder_.classes_) == 2:
                        preds = preds[:, 1]

Right now, these two things happen both in CalibratedClassifierCV.fit (for ensemble=False) and in _get_predictions (for ensemble=True).

So to expand on my previous suggestion:

def get_prediction_method(clf, return_string=False):
    # return clf.decision_function or clf.predict_proba or their corresponding string
    # we need strings when using cross_val_predict

def _get_predictions(X, pred_func):
    # pred_func is either a method of the clf, or a partial cross_val_predict
    preds = pred_func(X)
    # reshape preds here
    return preds


class CalibratedClassifierCV():

    def fit():
        if self.cv == "prefit":
            # `_fit_calibrator` using all X and y
            # append one `_CalibratedClassifierPipeline` instance to `self.calibrated_classifiers_`
        else:
            if ensemble:
                for train, test in cv.split(X, y):
                    # fit classifier on train
                    preds = _get_predictions(test, get_prediction_method(clf, return_string=False))
                    # call `_fit_calibrator`, producing a `_CalibratedClassifierPipeline` instance
                    # append the `_CalibratedClassifierPipeline` instance to `self.calibrated_classifiers_`,
                    # ending up with `n_cv` `_CalibratedClassifierPipeline` instances
            else:
                preds = _get_predictions(X, partial(cross_val_predict, ..., method=get_prediction_method(clf, return_string=True)))
                # call `_fit_calibrator` using `preds` and y
                # append one `_CalibratedClassifierPipeline` instance to `self.calibrated_classifiers_`

I think we might need to modify _fit_calibrator to only accept preds and not call _get_predictions.
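
For illustration, here is one way those two helpers could be fleshed out. This is only a sketch of the idea under discussion, not the merged code; the exact names, signatures, and the n_classes argument are assumptions, and it returns both the callable and its string name instead of using a return_string flag:

from functools import partial

import numpy as np
from sklearn.model_selection import cross_val_predict


def get_prediction_method(clf):
    # Return (callable, method_name); the string form is what cross_val_predict needs.
    if hasattr(clf, "decision_function"):
        return clf.decision_function, "decision_function"
    if hasattr(clf, "predict_proba"):
        return clf.predict_proba, "predict_proba"
    raise RuntimeError("classifier has no decision_function or predict_proba method")


def _get_predictions(pred_func, X, n_classes, method_name):
    # pred_func is either a bound method of the classifier
    # or a partial of cross_val_predict (the ensemble=False case).
    preds = pred_func(X=X)
    if method_name == "decision_function":
        if preds.ndim == 1:
            preds = preds[:, np.newaxis]  # ensure 2D
    else:  # predict_proba
        if n_classes == 2:
            preds = preds[:, 1:]  # keep only the positive-class column, still 2D
    return preds


# The ensemble=False branch would then look roughly like:
#   pred_method, name = get_prediction_method(clf)
#   pred_func = partial(cross_val_predict, estimator=clf, y=y, cv=cv, method=name)
#   preds = _get_predictions(pred_func, X, n_classes, name)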

@lucyleeow (Member, Author):

FYI:

Also is df = df[:, 1:] just equivalent to df = df[:, 1]?

It seems that it is not the same: you get 2D output with df[:, 1:].
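
A quick NumPy check (outside the PR code) shows the difference:

import numpy as np

proba = np.array([[0.2, 0.8],
                  [0.6, 0.4]])
proba[:, 1].shape   # (2,)   -> 1D
proba[:, 1:].shape  # (2, 1) -> 2D column with the same values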

@lucyleeow (Member, Author) commented:

I haven't looked in detail at the changes related to the label encoders though. Are these needed here or can they wait for another PR?

There wasn't a big change to this. Previously we passed the classes to _CalibratedClassifier:

calibrated_classifier = _CalibratedClassifier(
    this_estimator, method=self.method, classes=self.classes_)

and then instantiated a new LabelEncoder in _CalibratedClassifier.fit:

self.label_encoder_ = LabelEncoder()
if self.classes is None:
    self.label_encoder_.fit(y)
else:
    self.label_encoder_.fit(self.classes)

(note: self.classes is never None when using CalibratedClassifierCV)

I just changed it so we pass the fitted LabelEncoder instance instead of the classes. I don't think it is a big change, but I can change it back.

@lucyleeow (Member, Author) commented Jul 9, 2020:

I've made the changes. I like that _get_predictions and _fit_calibrator are separate now. I'm not sure it's overall better (the ensemble=False section is a bit messy); I will sleep on it. Also, is it appropriate to use method.__name__ to get the str of the method?

I do think that it is a bit of duplication to have both _check_classifier_response_method and get_prediction_method. I think a single utility function like:

def get_prediction_method(estimator, method, allow_alt=False):
    """Return prediction method of `estimator`.

    Parameters
    ----------
    estimator : Estimator instance

    method : {'decision_function', 'predict_proba'}

    allow_alt : bool, default=False
        Whether to return the alternative method, if `method` does not exist
        for `estimator`.
    """
would do the trick. But maybe that is a separate issue.
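
A minimal sketch of what such a unified helper might look like, assuming the signature above (purely illustrative, not part of this PR):

def get_prediction_method(estimator, method, allow_alt=False):
    """Return the `method` prediction function of `estimator`.

    If `allow_alt` is True and `method` is not available, fall back to the
    other of {'decision_function', 'predict_proba'}.
    """
    alternative = {"decision_function": "predict_proba",
                   "predict_proba": "decision_function"}
    if hasattr(estimator, method):
        return getattr(estimator, method)
    if allow_alt and hasattr(estimator, alternative[method]):
        return getattr(estimator, alternative[method])
    raise ValueError(f"{type(estimator).__name__} has no {method} method")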

@NicolasHug (Member) left a comment:

Thanks @lucyleeow , made another pass.

I think the use of partial is appropriate and the refactoring looks good to me. We'll need tests for the new ensemble parameter ;)

I do think that it is a bit of duplication to have both _check_classifier_response_method and get_prediction_method [...]

Yes we can consider having a unified utility, but indeed this would be better done in a separate PR.

@lucyleeow (Member, Author) commented Oct 8, 2020:

ping @NicolasHug @ogrisel
It would be great to get this in. Because this PR changes calibration.py so extensively, merge conflicts (e.g., those caused by #18332) are confusing to fix.

@ogrisel ogrisel self-requested a review October 15, 2020 05:44
@ogrisel (Member) left a comment:

This looks very good to me. Thank you very much @lucyleeow for your patience.

Here are some fixes and small suggestions for further improvement, but otherwise it seems ready for a second review and merge.

Please add a new entry to doc/whats_new/0.24.rst.

@ogrisel ogrisel added this to the 0.24 milestone Oct 15, 2020
@lucyleeow (Member, Author) commented:

Thanks for the review @ogrisel, suggestions added!

@glemaitre (Member) left a comment:

It looks good. Just a couple of nitpicks.

@glemaitre glemaitre self-assigned this Oct 29, 2020
@ogrisel ogrisel merged commit 8471c83 into scikit-learn:master Oct 29, 2020
@ogrisel (Member) commented Oct 29, 2020:

Merged! Thank you very much @lucyleeow!

@@ -66,7 +66,7 @@ def test_calibration(data, method, ensemble):
 (sparse.csr_matrix(X_train),
  sparse.csr_matrix(X_test))]:
     cal_clf = CalibratedClassifierCV(
-        clf, method=method, cv=2, ensemble=ensemble
+        clf, method=method, cv=5, ensemble=ensemble
Member:

I used cv=5 whenever we assess the quality of the calibration because cv=2 trains models on a much smaller dataset, which can degrade the base model's performance and hence make the test quite brittle.

Successfully merging this pull request may close these issues.

CalibratedClassifierCV should return an estimator calibrated via CV, not an ensemble of calibrated estimators