Unclear train_score_ attribute description for GradientBoostingClassifier #25206


Open · Grimaldus12 opened this issue Dec 18, 2022 · 3 comments

@Grimaldus12

Describe the issue linked to the documentation

I recently trained a binary gradient boosting model using the GradientBoostingClassifier. While evaluating my results, I wanted to plot the intermediate losses at each iteration on the test set and stumbled upon the train_score_ attribute, which is described as: "The i-th score train_score_[i] is the deviance (= loss) of the model at iteration i on the in-bag sample."
However, when plotting my test losses (computed with the regular log_loss on probabilities from the staged_predict_proba(X) function), the training scores were not in the expected [0,1) interval and the resulting plot did not make a lot of sense (I would love a test-set loss curve to go with it :D). Here is the plot:

[Plot: training deviance vs. unscaled test loss]
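A minimal, self-contained sketch of the setup (synthetic data and hyperparameters are purely illustrative, not the ones from my experiment):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Illustrative binary classification problem, not the original data.
X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = GradientBoostingClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# Per-iteration test loss from the staged probability predictions.
test_loss = [
    log_loss(y_test, proba) for proba in clf.staged_predict_proba(X_test)
]

# train_score_ is roughly twice the corresponding log loss, so the two
# curves end up on different scales when plotted together.
print(clf.train_score_[:3])
print(test_loss[:3])
```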

After some digging, I found out that, at least for the binomial case, the deviance is the log loss scaled by a factor of 2, computed with the following formula:

-2 * np.mean((y * raw_predictions) - np.logaddexp(0, raw_predictions))
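Continuing the sketch above, this factor-of-2 relationship can be checked numerically (in the binary case, decision_function returns the raw log-odds used in the formula):

```python
import numpy as np

# Raw (log-odds) predictions of the fitted model on the test set.
raw_predictions = clf.decision_function(X_test)

# Binomial deviance as defined above...
deviance = -2 * np.mean(
    y_test * raw_predictions - np.logaddexp(0, raw_predictions)
)

# ...comes out as twice the regular log loss.
print(deviance, 2 * log_loss(y_test, clf.predict_proba(X_test)))
```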

Knowing this, the fix for the problem is very easy, but at least for me it took quite some time to find the scaling factor. For completeness, the resulting plot makes much more sense if we scale the test loss by the same factor of 2:

[Plot: training deviance vs. test loss scaled by 2]
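In terms of the sketch above, the fix is a one-liner (the matplotlib plotting is illustrative):

```python
import matplotlib.pyplot as plt

# Scaling the test log loss by the same factor of 2 puts both curves
# on the deviance scale used by train_score_.
scaled_test_loss = [2 * loss for loss in test_loss]

plt.plot(clf.train_score_, label="train deviance (train_score_)")
plt.plot(scaled_test_loss, label="2 * test log loss")
plt.xlabel("boosting iteration")
plt.ylabel("deviance")
plt.legend()
plt.show()
```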

Suggest a potential alternative/fix

I would suggest adding a hint to the train_score_ attribute's documentation that the deviance is not necessarily the loss itself but may be a scaled version of it, depending on the loss function employed. In addition, I would rephrase the "deviance (= loss)" part of the current documentation, as the equality sign leaves room for interpretation.

I did not open a pull request with a proposed change since my problem is quite specific to the binomial log loss/deviance and might not be important enough for a dedicated hint.

@Grimaldus12 Grimaldus12 added Documentation Needs Triage Issue requires triage labels Dec 18, 2022
@glemaitre (Member)

We have a similar example here: https://scikit-learn.org/stable/auto_examples/ensemble/plot_gradient_boosting_oob.html#sphx-glr-auto-examples-ensemble-plot-gradient-boosting-oob-py

The issue is rather that, conventionally, the factor of 2 in such losses is dropped and only the half-loss is used. I would rather try to update the documentation of the loss parameter to be more explicit, since this is related to the loss itself and not to what we are logging.

@glemaitre glemaitre removed the Needs Triage Issue requires triage label Dec 30, 2022
@Nebula1230

Hello, I'll take care of this if possible (:

@adrinjalali (Member)

Future contributors should have a look at the discussions here: #25383
