Description
Describe the issue linked to the documentation
I recently trained a binary gradient boosting model using the GradientBoostingClassifier. While evaluating the results, I wanted to plot the intermediate loss at each iteration on the test set and stumbled upon the train_score_ attribute, which is described as: "The i-th score train_score_[i] is the deviance (= loss) of the model at iteration i on the in-bag sample."
However, when I plotted my test losses (computed with the regular log_loss on probabilities from the staged_predict_proba(X) function), the training scores were not in the expected [0, 1) interval and the resulting plot did not make a lot of sense (I would love to have a test-set loss curve :D). The plot in question:
After some digging, I found out that, at least for the binomial log loss, the deviance is the loss scaled by a factor of 2, computed with the following formula:
-2 * np.mean((y * raw_predictions) - np.logaddexp(0, raw_predictions))
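As a quick sanity check (a minimal sketch with synthetic labels and raw log-odds, not code taken from scikit-learn itself), the formula above evaluates to exactly twice sklearn.metrics.log_loss:

```python
# Minimal sketch: synthetic labels and raw log-odds, only to illustrate the factor of 2.
import numpy as np
from scipy.special import expit
from sklearn.metrics import log_loss

rng = np.random.RandomState(0)
y = rng.randint(0, 2, size=100)          # binary labels
raw_predictions = rng.normal(size=100)   # raw log-odds, as used internally by the booster

deviance = -2 * np.mean((y * raw_predictions) - np.logaddexp(0, raw_predictions))
print(deviance, 2 * log_loss(y, expit(raw_predictions)))  # prints the same value twice
```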
Knowing this, the fix for the problem is very easy, but it took me quite some time to find the scaling factor. For completeness, the resulting plot makes much more sense if the test loss is scaled by the same factor of 2 (see the plot and the sketch below).
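For illustration, here is a minimal end-to-end sketch of the comparison I have in mind (using a synthetic dataset and hypothetical variable names, not my actual data):

```python
# Compare the in-bag deviance from train_score_ with 2 * log_loss on a held-out
# set, evaluated stage by stage via staged_predict_proba.
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# Scaling by 2 puts the test curve on the same (deviance) scale as train_score_.
test_deviance = [
    2 * log_loss(y_test, proba[:, 1])
    for proba in clf.staged_predict_proba(X_test)
]

plt.plot(clf.train_score_, label="train_score_ (in-bag deviance)")
plt.plot(test_deviance, label="2 * log_loss on the test set")
plt.xlabel("boosting iteration")
plt.ylabel("deviance")
plt.legend()
plt.show()
```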
Suggest a potential alternative/fix
I would suggest adding a hint to the train_score_ attribute documentation that the deviance is not necessarily the plain loss but may be a scaled version of it, depending on the loss function employed. In addition, I would rephrase the "deviance (= loss)" part of the current documentation, as the equality sign leaves room for interpretation.
I did not open a pull request with a proposed change, since my problem is quite specific to the binomial log loss/deviance and might not be important enough for a dedicated hint.