Description
The current cross-validation procedure adopted in CalibratedClassifierCV
does not follow the cross-validation procedure described in the original Platt paper:
[Platt99] J. Platt, "Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods", 1999
I also checked the other papers cited in the references of the CalibratedClassifierCV
class, and none of them describes the cross-validation process it implements.
CalibratedClassifierCV
currently fits and calibrates an estimator for each fold (calibration is performed on the test part of the fold).
All the estimators fit on the folds are kept in a list.
At prediction time, every estimator makes a prediction and the average of the returned values is the final prediction.
The estimator produced by CalibratedClassifierCV
is thus an ensemble, not a single estimator calibrated on the whole training set via CV.
When cross-validation is used, the original base_estimator
is never used to make predictions.
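For illustration, here is a minimal sketch of the current behaviour (not the actual library source). It assumes a binary problem and an estimator exposing decision_function, approximates the sigmoid calibration with a plain 1-D logistic regression, and uses the hypothetical names fit_ensemble and predict_proba_ensemble:

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

def fit_sigmoid(scores, y):
    # Sigmoid calibration approximated by a 1-D logistic regression on the
    # decision scores (Platt's target smoothing is omitted for brevity).
    lr = LogisticRegression()
    lr.fit(scores.reshape(-1, 1), y)
    return lambda s: lr.predict_proba(s.reshape(-1, 1))[:, 1]

def fit_ensemble(base_estimator, X, y, n_splits=3):
    # One (estimator, calibrator) pair is kept per fold.
    pairs = []
    for train_idx, test_idx in KFold(n_splits=n_splits).split(X):
        est = clone(base_estimator).fit(X[train_idx], y[train_idx])
        # Calibration is performed on the test part of the fold.
        cal = fit_sigmoid(est.decision_function(X[test_idx]), y[test_idx])
        pairs.append((est, cal))
    return pairs

def predict_proba_ensemble(pairs, X):
    # Every per-fold estimator predicts; the final probability is the mean.
    return np.mean([cal(est.decision_function(X)) for est, cal in pairs], axis=0)
```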
Platt99 describes a cross-validation procedure that fits an estimator on each fold and saves the predictions for the test fold.
The predictions from all the folds are then concatenated into a single list, and the calibration parameters for the base_estimator
are determined from that list.
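As a hedged sketch of this procedure (the names fit_platt and predict_proba_platt are mine, the same simplifications as above apply, and cross_val_predict is used to collect the out-of-fold scores):

```python
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

def fit_platt(base_estimator, X, y, cv=3):
    # Out-of-fold decision scores for every training sample, i.e. the
    # concatenation of the per-fold test predictions.
    scores = cross_val_predict(clone(base_estimator), X, y, cv=cv,
                               method="decision_function")
    # A single sigmoid is fit on those scores ...
    sigmoid = LogisticRegression().fit(scores.reshape(-1, 1), y)
    # ... and a single estimator is fit on the whole training set.
    est = clone(base_estimator).fit(X, y)
    return est, sigmoid

def predict_proba_platt(est, sigmoid, X):
    # A single predict: score with the one estimator, map through the sigmoid.
    s = est.decision_function(X).reshape(-1, 1)
    return sigmoid.predict_proba(s)[:, 1]
```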
Cross-validation should only be a means to calibrate the base_estimator
on the same data it has been fit on, not a way to fit a different estimator.
The procedure described in Platt99 is what one would expect from a proper application of cross-validation: the cross-validation only determines the parameters of the calibration and does not fit the estimator.
It is also more efficient, as it does not store an estimator for each fold and requires a single call to predict at prediction time.